feat(runtimes): Support Elastic PyTorch in TrainJob #3099
SoumyaRaikwar wants to merge 1 commit into kubeflow:master
Conversation
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Implements ElasticPolicy support in Torch plugin, including HPA generation and setting PET_NNODES. Signed-off-by: SoumyaRaikwar <somuraik@gmail.com>
SoumyaRaikwar force-pushed from 770b641 to dff5532
@andreyvelich this PR implements full support for Elastic PyTorch in TrainJob, could you please take a look?
| numNodes := "1" | ||
| if info.RuntimePolicy.MLPolicySource.Torch.ElasticPolicy != nil { | ||
| elasticPolicy := info.RuntimePolicy.MLPolicySource.Torch.ElasticPolicy | ||
| minNodes := ptr.Deref(elasticPolicy.MinNodes, 1) |
Should you run additional validations if elasticPolicy is configured here (https://github.com/kubeflow/trainer/pull/3099/files#diff-11623baf504a9103729b32afb9576aa348c619218a5c49ec92072312afe13d33R228), to make sure that minNodes and maxNodes are configured and that maxNodes >= minNodes?
Should we also validate that ElasticPolicy.Metrics has > 0 entries in the policy?
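A minimal sketch of the validation these comments suggest. The type and function names here are hypothetical and may not match the plugin's actual validation hook or the trainer API types:

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/utils/ptr"
)

// elasticPolicy mirrors the fields discussed in this thread; the real
// TorchElasticPolicy type in the trainer API may differ.
type elasticPolicy struct {
	MinNodes *int32
	MaxNodes *int32
	Metrics  []autoscalingv2.MetricSpec
}

// validateElasticPolicy illustrates the suggested checks: both bounds set,
// maxNodes >= minNodes, and at least one autoscaling metric.
func validateElasticPolicy(p *elasticPolicy) error {
	if p == nil {
		return nil // elasticity not requested, nothing to validate
	}
	if p.MinNodes == nil || p.MaxNodes == nil {
		return fmt.Errorf("elasticPolicy requires both minNodes and maxNodes")
	}
	if *p.MaxNodes < *p.MinNodes {
		return fmt.Errorf("elasticPolicy maxNodes (%d) must be >= minNodes (%d)", *p.MaxNodes, *p.MinNodes)
	}
	if len(p.Metrics) == 0 {
		return fmt.Errorf("elasticPolicy metrics must contain at least one entry")
	}
	return nil
}

func main() {
	p := &elasticPolicy{MinNodes: ptr.To[int32](3), MaxNodes: ptr.To[int32](2)}
	fmt.Println(validateElasticPolicy(p)) // elasticPolicy maxNodes (2) must be >= minNodes (3)
}
```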
andreyvelich left a comment
Thank you for this PR @SoumyaRaikwar!
We might have a blocker to allow mutation of the parallelism, completions, and replicas APIs in the JobSet: kubernetes-sigs/jobset#463 (comment)
@aniket2405 is working on the KEP to support it in JobSet.
After that, we can introduce elasticity to the TrainJob APIs.
/hold
cc @kannon92
What this PR does / why we need it:
This PR implements full support for Elastic PyTorch in TrainJob by handling the `ElasticPolicy` in the Torch runtime plugin. Specifically, it:
- Sets the `PET_NNODES` environment variable to the `"minNodes:maxNodes"` format required for elastic training when `ElasticPolicy` is defined.
- Uses `MinNodes` as the initial node count so the job starts with the minimum required nodes.
- Generates a `HorizontalPodAutoscaler` (HPA) resource for the job.
- Maps `ElasticPolicy.Metrics` to HPA metrics, enabling dynamic scaling based on resource usage or custom metrics.

This feature allows users to run fault-tolerant and elastic PyTorch jobs that can auto-scale within the specified `minNodes` and `maxNodes` range, fully utilizing the `ElasticPolicy` API (a rough sketch of these pieces follows below).
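For illustration only, a minimal sketch of how the two pieces described above could be composed: formatting `PET_NNODES` as `"minNodes:maxNodes"` and building an HPA whose metrics come from `ElasticPolicy.Metrics`. The helper names (`buildPETNNodes`, `buildHPA`) and the HPA scale target are placeholders and do not necessarily match this PR's code:

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// buildPETNNodes returns the value for the PET_NNODES env var: a fixed node
// count when no ElasticPolicy is set, or "minNodes:maxNodes" for elastic runs.
func buildPETNNodes(minNodes, maxNodes *int32, defaultNodes int32) string {
	if minNodes == nil || maxNodes == nil {
		return fmt.Sprintf("%d", defaultNodes)
	}
	return fmt.Sprintf("%d:%d", *minNodes, *maxNodes)
}

// buildHPA sketches an HPA with min/max replicas taken from the ElasticPolicy
// and its metrics passed through unchanged. The scale target kind/name are
// placeholders; the real target would be the trainer node workload.
func buildHPA(trainJobName, namespace string, minNodes, maxNodes *int32, metrics []autoscalingv2.MetricSpec) *autoscalingv2.HorizontalPodAutoscaler {
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{
			Name:      trainJobName,
			Namespace: namespace,
		},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "batch/v1", // placeholder
				Kind:       "Job",      // placeholder
				Name:       trainJobName,
			},
			MinReplicas: minNodes,
			MaxReplicas: ptr.Deref(maxNodes, 1),
			Metrics:     metrics,
		},
	}
}

func main() {
	min, max := ptr.To[int32](2), ptr.To[int32](5)
	fmt.Println(buildPETNNodes(min, max, 1)) // prints "2:5"
	_ = buildHPA("pytorch-elastic", "default", min, max, nil)
}
```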
Which issue(s) this PR fixes:
Fixes #2903