feat(runtimes): Support Elastic PyTorch in TrainJob #3099

Open
SoumyaRaikwar wants to merge 1 commit into kubeflow:master from SoumyaRaikwar:feature/elastic-pytorch-2903
@SoumyaRaikwar SoumyaRaikwar commented Jan 17, 2026

What this PR does / why we need it:
This PR implements full support for Elastic PyTorch in TrainJob by handling the ElasticPolicy in the Torch runtime plugin. Specifically, it:

  • Sets the PET_NNODES environment variable to the "minNodes:maxNodes" format required for elastic training when ElasticPolicy is defined.
  • Updates the initial PodSet replica count to MinNodes to ensure the job starts with the minimum required nodes.
  • Implements the ComponentBuilderPlugin interface in the Torch plugin to generate a HorizontalPodAutoscaler (HPA) resource.
  • Maps ElasticPolicy.Metrics to HPA metrics, enabling dynamic scaling based on resource usage or custom metrics.

This feature allows users to run fault-tolerant and elastic PyTorch jobs that can auto-scale within the specified minNodes and maxNodes range, fully utilizing the ElasticPolicy API.
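The first two points above can be sketched in isolation. This is only an illustration, not the PR's code: `elasticPolicy`, `deref`, `elasticSettings`, and `int32Ptr` are hypothetical stand-ins for the real API types and helpers, and defaulting `maxNodes` to `minNodes` when it is unset is an assumption.

```go
package main

import "fmt"

// elasticPolicy is a hypothetical stand-in for the ElasticPolicy API type;
// only the two fields discussed in this PR are modeled.
type elasticPolicy struct {
	MinNodes *int32
	MaxNodes *int32
}

// deref returns *p, or def when p is nil
// (the same idea as ptr.Deref from k8s.io/utils/ptr).
func deref(p *int32, def int32) int32 {
	if p == nil {
		return def
	}
	return *p
}

// elasticSettings returns the PET_NNODES value and the initial replica count.
// Without an elastic policy it falls back to the fixed node count.
func elasticSettings(policy *elasticPolicy, numNodes int32) (string, int32) {
	if policy == nil {
		return fmt.Sprintf("%d", numNodes), numNodes
	}
	minNodes := deref(policy.MinNodes, 1)
	// Assumption: an unset maxNodes defaults to minNodes.
	maxNodes := deref(policy.MaxNodes, minNodes)
	// Elastic jobs use the "minNodes:maxNodes" format and start at minNodes.
	return fmt.Sprintf("%d:%d", minNodes, maxNodes), minNodes
}

// int32Ptr is a small helper for building pointer fields in examples.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	policy := &elasticPolicy{MinNodes: int32Ptr(2), MaxNodes: int32Ptr(5)}
	nnodes, replicas := elasticSettings(policy, 1)
	fmt.Println("PET_NNODES:", nnodes, "initial replicas:", replicas)
}
```

With `minNodes: 2` and `maxNodes: 5`, the sketch yields `PET_NNODES=2:5` and starts the PodSet at 2 replicas.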

Which issue(s) this PR fixes:
Fixes #2903

@github-actions
🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Implements ElasticPolicy support in Torch plugin, including HPA generation and setting PET_NNODES.

Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>
@SoumyaRaikwar SoumyaRaikwar force-pushed the feature/elastic-pytorch-2903 branch from 770b641 to dff5532 Compare January 17, 2026 12:13
@SoumyaRaikwar SoumyaRaikwar changed the title feat(torch): Support Elastic PyTorch in TrainJob eat(runtimes): Support Elastic PyTorch in TrainJob Jan 17, 2026
@SoumyaRaikwar SoumyaRaikwar changed the title eat(runtimes): Support Elastic PyTorch in TrainJob feat(runtimes): Support Elastic PyTorch in TrainJob Jan 17, 2026
@SoumyaRaikwar
Author

@andreyvelich This PR implements full support for Elastic PyTorch in TrainJob. Could you please take a look?

numNodes := "1"
if info.RuntimePolicy.MLPolicySource.Torch.ElasticPolicy != nil {
	elasticPolicy := info.RuntimePolicy.MLPolicySource.Torch.ElasticPolicy
	minNodes := ptr.Deref(elasticPolicy.MinNodes, 1)
Contributor

@akshaychitneni akshaychitneni Jan 19, 2026

Should you run additional validations here when elasticPolicy is configured (https://github.com/kubeflow/trainer/pull/3099/files#diff-11623baf504a9103729b32afb9576aa348c619218a5c49ec92072312afe13d33R228), to make sure that minNodes and maxNodes are both set and that maxNodes >= minNodes?

Contributor

Should we also validate that len(ElasticPolicy.Metrics) > 0 in the policy?
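A minimal sketch of the checks raised in the two review comments above, assuming they were all adopted. It is not code from the PR: `elasticPolicy` is a hypothetical stand-in whose `Metrics` field is reduced to a list of names, `validateElasticPolicy` is an invented helper, and whether an empty metrics list should be rejected at all is still an open question in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// elasticPolicy is a hypothetical stand-in for the ElasticPolicy API type.
// Metrics is simplified to metric names for illustration only.
type elasticPolicy struct {
	MinNodes *int32
	MaxNodes *int32
	Metrics  []string
}

// validateElasticPolicy collects every violation instead of stopping at the first.
// A nil policy means elasticity is not configured, so there is nothing to check.
func validateElasticPolicy(p *elasticPolicy) error {
	if p == nil {
		return nil
	}
	var errs []error
	if p.MinNodes == nil || p.MaxNodes == nil {
		errs = append(errs, errors.New("minNodes and maxNodes must both be set"))
	} else if *p.MaxNodes < *p.MinNodes {
		errs = append(errs, fmt.Errorf("maxNodes (%d) must be >= minNodes (%d)",
			*p.MaxNodes, *p.MinNodes))
	}
	// Assumption: at least one metric is required, as the second comment suggests.
	if len(p.Metrics) == 0 {
		errs = append(errs, errors.New("metrics must contain at least one entry"))
	}
	// errors.Join returns nil when no errors were collected.
	return errors.Join(errs...)
}

func main() {
	two, five := int32(2), int32(5)
	valid := &elasticPolicy{MinNodes: &two, MaxNodes: &five, Metrics: []string{"cpu"}}
	inverted := &elasticPolicy{MinNodes: &five, MaxNodes: &two}
	fmt.Println("valid policy error:", validateElasticPolicy(valid))
	fmt.Println("inverted policy error:", validateElasticPolicy(inverted))
}
```

Collecting all violations with `errors.Join` (Go 1.20+) lets a user fix every problem in one pass. In the actual plugin, these checks would more likely be reported through the existing validation path linked in the first comment.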

Member

@andreyvelich andreyvelich left a comment

Thank you for this PR @SoumyaRaikwar!
We might have a blocker on allowing mutation of the parallelism, completions, and replica APIs in JobSet: kubernetes-sigs/jobset#463 (comment)
@aniket2405 is working on the KEP to support it in JobSet.

After that, we can introduce elasticity to the TrainJob APIs.

/hold

cc @kannon92
