feat(runtimes): Support Elastic PyTorch in TrainJob by SoumyaRaikwar · Pull Request #3099 · kubeflow/trainer

SoumyaRaikwar · 2026-01-17T12:11:44Z

What this PR does / why we need it:
This PR implements full support for Elastic PyTorch in TrainJob by handling the ElasticPolicy in the Torch runtime plugin. Specifically, it:

Sets the PET_NNODES environment variable to the "minNodes:maxNodes" format required for elastic training when ElasticPolicy is defined.
Updates the initial PodSet replica count to MinNodes to ensure the job starts with the minimum required nodes.
Implements the ComponentBuilderPlugin interface in the Torch plugin to generate a HorizontalPodAutoscaler (HPA) resource.
Maps ElasticPolicy.Metrics to HPA metrics, enabling dynamic scaling based on resource usage or custom metrics.
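
For readers following the first two items, here is a minimal sketch of how the PET_NNODES value and the initial replica count could be derived. It is not the code in this PR: the ElasticPolicy struct is a simplified stand-in for the Trainer API type, deref and nnodesAndReplicas are hypothetical helpers, and defaulting maxNodes to minNodes when it is unset is an assumption. The default of 1 for minNodes mirrors the ptr.Deref call in the diff.

```go
package main

import "fmt"

// ElasticPolicy is a simplified, hypothetical stand-in for the Torch
// ElasticPolicy API type; only the fields used in this sketch are modeled.
type ElasticPolicy struct {
	MinNodes *int32
	MaxNodes *int32
}

// deref returns *p, or def when p is nil (same idea as ptr.Deref).
func deref(p *int32, def int32) int32 {
	if p == nil {
		return def
	}
	return *p
}

// nnodesAndReplicas returns the PET_NNODES value and the initial replica
// count. Without an elastic policy, both reflect the fixed node count.
func nnodesAndReplicas(policy *ElasticPolicy, fixedNodes int32) (string, int32) {
	if policy == nil {
		return fmt.Sprintf("%d", fixedNodes), fixedNodes
	}
	minNodes := deref(policy.MinNodes, 1)
	// Assumption: an unset maxNodes falls back to minNodes.
	maxNodes := deref(policy.MaxNodes, minNodes)
	// Elastic jobs use the "minNodes:maxNodes" format and start at minNodes.
	return fmt.Sprintf("%d:%d", minNodes, maxNodes), minNodes
}
```

With minNodes=2 and maxNodes=5, this sketch yields a PET_NNODES value of "2:5" and 2 initial replicas.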

This feature allows users to run fault-tolerant and elastic PyTorch jobs that can auto-scale within the specified minNodes and maxNodes range, fully utilizing the ElasticPolicy API.

Which issue(s) this PR fixes:
Fixes #2903

github-actions · 2026-01-17T12:11:54Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

google-oss-prow · 2026-01-17T12:12:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Implements ElasticPolicy support in Torch plugin, including HPA generation and setting PET_NNODES. Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>

SoumyaRaikwar · 2026-01-18T12:22:26Z

@andreyvelich This PR implements full support for Elastic PyTorch in TrainJob. Could you please take a look?

akshaychitneni · 2026-01-19T04:22:45Z

pkg/runtime/framework/plugins/torch/torch.go

+	numNodes := "1"
+	if info.RuntimePolicy.MLPolicySource.Torch.ElasticPolicy != nil {
+		elasticPolicy := info.RuntimePolicy.MLPolicySource.Torch.ElasticPolicy
+		minNodes := ptr.Deref(elasticPolicy.MinNodes, 1)


Should you run additional validations here when elasticPolicy is configured (https://github.com/kubeflow/trainer/pull/3099/files#diff-11623baf504a9103729b32afb9576aa348c619218a5c49ec92072312afe13d33R228), to make sure that minNodes and maxNodes are both configured and that maxNodes >= minNodes?

Should we also validate that ElasticPolicy.Metrics has more than 0 entries in the policy?
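
A minimal sketch of the checks suggested in this review comment, under stated assumptions: the ElasticPolicy struct is a simplified stand-in for the Trainer API type, Metrics is modeled as a plain string slice rather than the real metric spec type, and validateElasticPolicy is a hypothetical helper, not a function from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// ElasticPolicy is a simplified, hypothetical stand-in for the Torch
// ElasticPolicy API type. Metrics is a placeholder for the metric specs.
type ElasticPolicy struct {
	MinNodes *int32
	MaxNodes *int32
	Metrics  []string
}

// validateElasticPolicy applies the three checks raised in the review:
// both bounds are set, maxNodes >= minNodes, and at least one metric exists.
func validateElasticPolicy(p *ElasticPolicy) error {
	if p == nil {
		// No elastic policy configured, so there is nothing to validate.
		return nil
	}
	if p.MinNodes == nil || p.MaxNodes == nil {
		return errors.New("elasticPolicy: minNodes and maxNodes must both be set")
	}
	if *p.MaxNodes < *p.MinNodes {
		return fmt.Errorf("elasticPolicy: maxNodes (%d) must be >= minNodes (%d)",
			*p.MaxNodes, *p.MinNodes)
	}
	if len(p.Metrics) == 0 {
		return errors.New("elasticPolicy: at least one metric must be set")
	}
	return nil
}
```

Whether an empty Metrics list should be rejected or simply skip HPA generation is a design question for the PR; the sketch rejects it only to illustrate the check.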

andreyvelich

Thank you for this PR @SoumyaRaikwar!
We might have a blocker for allowing mutation of the parallelism, completions, and replicas APIs in JobSet: kubernetes-sigs/jobset#463 (comment)
@aniket2405 is working on the KEP to support it in JobSet.

After that, we can introduce elasticity to the TrainJob APIs.

/hold

cc @kannon92

google-oss-prow bot added the size/L label Jan 17, 2026

google-oss-prow bot requested review from jinchihe and kuizhiqing January 17, 2026 12:12

feat(runtimes): Support Elastic PyTorch in TrainJob

dff5532


SoumyaRaikwar force-pushed the feature/elastic-pytorch-2903 branch from 770b641 to dff5532 January 17, 2026 12:13

SoumyaRaikwar changed the title ~~feat(torch): Support Elastic PyTorch in TrainJob~~ eat(runtimes): Support Elastic PyTorch in TrainJob Jan 17, 2026

SoumyaRaikwar changed the title ~~eat(runtimes): Support Elastic PyTorch in TrainJob~~ feat(runtimes): Support Elastic PyTorch in TrainJob Jan 17, 2026

akshaychitneni reviewed Jan 19, 2026

andreyvelich reviewed Jan 19, 2026

google-oss-prow bot added the do-not-merge/hold label Jan 19, 2026

SoumyaRaikwar wants to merge 1 commit into kubeflow:master from SoumyaRaikwar:feature/elastic-pytorch-2903

3 participants
