
feat(runtimes): Add KAI Scheduler plugin for gang-scheduling support #3186

Open
Raakshass wants to merge 1 commit into kubeflow:master from Raakshass:kai-scheduler-2628


@Raakshass

Description

This PR implements support for NVIDIA's KAI Scheduler as a new gang-scheduling backend in Kubeflow Trainer, addressing the need for advanced GPU scheduling in AI/ML workloads.

What is KAI Scheduler?

KAI Scheduler is NVIDIA's Kubernetes AI Scheduler that provides:

  • Gang scheduling for distributed training jobs
  • GPU-aware bin-packing for optimal resource utilization
  • Topology-aware placement (NVLink, NVSwitch)
  • Queue-based multi-tenant scheduling
  • Native integration with NVIDIA GPU Operator

Changes Made

API Types (pkg/apis/trainer/v1alpha1/trainingruntime_types.go)

  • Added KAI field to PodGroupPolicySource struct
  • Defined KAIPodGroupPolicySource struct with:
    • QueueName: Optional queue for multi-tenant scheduling
    • ScheduleTimeoutSeconds: Timeout before failing unschedulable PodGroups
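As a rough illustration of the API shape described above, the new types might look like the sketch below. The field types (pointers for optional values), the JSON tags, and the `ptr` and `encodePolicy` helpers are assumptions made for this example; only the struct and field names come from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KAIPodGroupPolicySource sketches the KAI policy from the description.
// Field types and JSON tags are assumptions, not the PR's actual code.
type KAIPodGroupPolicySource struct {
	// QueueName optionally names the KAI queue used for multi-tenant scheduling.
	QueueName *string `json:"queueName,omitempty"`
	// ScheduleTimeoutSeconds is the timeout before an unschedulable PodGroup fails.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}

// PodGroupPolicySource is reduced here to the newly added field.
type PodGroupPolicySource struct {
	KAI *KAIPodGroupPolicySource `json:"kai,omitempty"`
}

// ptr is a hypothetical helper that returns a pointer to any value.
func ptr[T any](v T) *T { return &v }

// encodePolicy is a hypothetical helper that renders the policy as JSON.
func encodePolicy(p PodGroupPolicySource) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	p := PodGroupPolicySource{KAI: &KAIPodGroupPolicySource{
		QueueName:              ptr("high-priority"),
		ScheduleTimeoutSeconds: ptr(int32(120)),
	}}
	fmt.Println(encodePolicy(p))
}
```

Using pointers keeps both fields optional, so an unset queue or timeout is omitted from the serialized object rather than written as a zero value.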

Plugin Implementation (pkg/runtime/framework/plugins/kai/kai.go)

  • EnforcePodGroupPolicy: Sets KAI-specific pod labels
    • scheduling.kai.io/pod-group: Associates pods with their PodGroup
    • scheduling.kai.io/queue: Assigns pods to KAI queue
  • Build: Creates scheduler-plugins PodGroup resources (compatible with KAI)
    • Aggregates MinMember from all PodSets
    • Calculates total MinResources from pod requests
    • Sets ScheduleTimeoutSeconds (default: 60s)
    • Configures proper owner references
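The labeling and aggregation steps listed above can be sketched without any Kubernetes dependencies. This is a simplified illustration, not the plugin's code: `podSet`, `kaiLabels`, `aggregate`, and `timeoutOrDefault` are hypothetical names, resource quantities are modeled as plain integers, and the PodGroup is assumed to share the TrainJob's name.

```go
package main

import "fmt"

// Label keys and the default timeout as listed in the PR description.
const (
	podGroupLabel                       = "scheduling.kai.io/pod-group"
	queueLabel                          = "scheduling.kai.io/queue"
	defaultScheduleTimeoutSeconds int32 = 60
)

// podSet is a hypothetical stand-in for one group of identical pods.
type podSet struct {
	Name     string
	Count    int32
	Requests map[string]int64 // per-pod requests, e.g. {"nvidia.com/gpu": 4}
}

// kaiLabels returns the pod labels for a TrainJob. The PodGroup name is
// assumed to equal the TrainJob name; the queue label is set only if given.
func kaiLabels(trainJobName, queueName string) map[string]string {
	labels := map[string]string{podGroupLabel: trainJobName}
	if queueName != "" {
		labels[queueLabel] = queueName
	}
	return labels
}

// aggregate sums pod counts into MinMember and per-pod requests,
// multiplied by each pod count, into MinResources.
func aggregate(podSets []podSet) (int32, map[string]int64) {
	var minMember int32
	minResources := map[string]int64{}
	for _, ps := range podSets {
		minMember += ps.Count
		for name, qty := range ps.Requests {
			minResources[name] += qty * int64(ps.Count)
		}
	}
	return minMember, minResources
}

// timeoutOrDefault falls back to the 60s default when no timeout is set.
func timeoutOrDefault(t *int32) int32 {
	if t == nil {
		return defaultScheduleTimeoutSeconds
	}
	return *t
}

func main() {
	sets := []podSet{
		{Name: "launcher", Count: 1, Requests: map[string]int64{"cpu": 2}},
		{Name: "node", Count: 4, Requests: map[string]int64{"cpu": 8, "nvidia.com/gpu": 4}},
	}
	minMember, minResources := aggregate(sets)
	fmt.Println(minMember, minResources, timeoutOrDefault(nil))
	fmt.Println(kaiLabels("my-trainjob", "high-priority"))
}
```

Summing across every PodSet means the gang is only admitted when all roles can start together, which is the point of gang scheduling.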

Registry (pkg/runtime/framework/plugins/registry.go)

  • Registered kai.New in the plugin registry
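Registration of this kind is typically a map from plugin name to constructor. The sketch below shows the idea with hypothetical types: the `plugin` interface, the `factory` signature, the `"KAI"` key, and `newKAI` (standing in for `kai.New`) are all assumptions, and the real constructor likely takes additional arguments.

```go
package main

import "fmt"

// plugin is a hypothetical, minimal plugin interface.
type plugin interface{ Name() string }

// factory constructs a plugin; the real signature is assumed to differ.
type factory func() (plugin, error)

type kaiPlugin struct{}

func (kaiPlugin) Name() string { return "KAI" }

// newKAI stands in for kai.New in this sketch.
func newKAI() (plugin, error) { return kaiPlugin{}, nil }

// newRegistry maps plugin names to their factories. Adding a backend
// means adding one entry here.
func newRegistry() map[string]factory {
	return map[string]factory{
		"KAI": newKAI,
	}
}

func main() {
	for name, newPlugin := range newRegistry() {
		p, err := newPlugin()
		if err != nil {
			fmt.Println("failed to initialize", name, err)
			continue
		}
		fmt.Println("registered", p.Name())
	}
}
```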

Tests (pkg/runtime/framework/plugins/kai/kai_test.go)

  • 8 comprehensive test cases covering:
    • Nil info/trainJob handling
    • PodGroup creation with proper MinMember/MinResources
    • Queue name label assignment
    • API error handling
    • Existing PodGroup skip logic
    • Default timeout values
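Several of the cases above (nil inputs, API errors, skipping an existing PodGroup) describe guard logic at the top of a build step. A minimal sketch of that control flow follows, under the assumption that nil inputs and an already existing PodGroup both result in nothing being built. All types here, and the `getter` lookup that returns `(nil, nil)` when nothing is found, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, reduced stand-ins for the runtime types.
type trainJob struct{ Name string }
type runtimeInfo struct{ KAIEnabled bool }
type podGroup struct{ Name string }

// getter looks up a PodGroup by name and returns (nil, nil) if absent.
type getter func(name string) (*podGroup, error)

// build returns a new PodGroup, or nil when there is nothing to do.
func build(get getter, info *runtimeInfo, job *trainJob) (*podGroup, error) {
	// Nil inputs or a runtime without a KAI policy: nothing to build.
	if info == nil || job == nil || !info.KAIEnabled {
		return nil, nil
	}
	existing, err := get(job.Name)
	if err != nil {
		// Surface API errors to the caller with context.
		return nil, fmt.Errorf("getting PodGroup %q: %w", job.Name, err)
	}
	if existing != nil {
		// The PodGroup already exists: skip re-creating it.
		return nil, nil
	}
	return &podGroup{Name: job.Name}, nil
}

// errAPI is a hypothetical error used to simulate an API failure.
var errAPI = errors.New("api unavailable")

func main() {
	notFound := func(string) (*podGroup, error) { return nil, nil }
	pg, err := build(notFound, &runtimeInfo{KAIEnabled: true}, &trainJob{Name: "demo"})
	fmt.Println(pg != nil, err)
}
```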

Usage Example

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: kai-distributed-training
spec:
  mlPolicy:
    torch:
      numProcPerNode: "4"
  podGroupPolicy:
    kai:
      queueName: "high-priority"
      scheduleTimeoutSeconds: 120
  template:
    spec:
      containers:
      - name: trainer
        resources:
          limits:
            nvidia.com/gpu: 4

Testing

  • All 8 unit tests pass
  • go build ./... succeeds
  • controller-gen manifests generated

Related Issues

Fixes #2628

Checklist

  • Code follows existing patterns (similar to coscheduling plugin)
  • Unit tests added with comprehensive coverage
  • API types include proper JSON tags and documentation
  • Generated deepcopy functions updated
  • CRD manifests regenerated

Base Branch

master

Files Changed

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go - Added KAI API types
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go - Generated deepcopy
  • pkg/runtime/framework/plugins/kai/kai.go - KAI plugin implementation
  • pkg/runtime/framework/plugins/kai/kai_test.go - Comprehensive tests
  • pkg/runtime/framework/plugins/registry.go - Plugin registration
  • manifests/base/crds/*.yaml - Updated CRDs

This PR implements support for NVIDIA's KAI Scheduler as a new
gang-scheduling backend in Kubeflow Trainer.

Changes:
- Added KAI field to PodGroupPolicySource struct
- Defined KAIPodGroupPolicySource with QueueName and ScheduleTimeoutSeconds
- Implemented kai plugin with EnforcePodGroupPolicy and Build methods
- Added 8 comprehensive unit tests
- Registered kai.New in the plugin registry

Fixes kubeflow#2628

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions bot commented Feb 8, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@jaiakash
Member

jaiakash commented Feb 9, 2026

/ok-to-test

Development

Successfully merging this pull request may close these issues.

Support KAI Scheduler in Kubeflow Trainer
