feat(runtimes): add AMD ROCm torch distributed runtime (ref #2335, #3097)
JEETDESAI25 wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Force-pushed: 62b0c84 to b0cfb72
/assign @andreyvelich
What this PR does / why we need it:
Adds AMD ROCm PyTorch distributed training runtime, enabling distributed training on AMD GPUs (e.g., MI300X accelerators).
This follows the same pattern as the existing NVIDIA torch runtime added in #2328, using the official `rocm/pytorch` image with a pinned version matching the PyTorch version used in the NVIDIA runtime (2.7.1).

Which issue(s) this PR fixes:
#2335
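
For illustration only, a ROCm runtime in the spirit of this PR might look like the sketch below, modeled on the existing torch-distributed runtime pattern. The runtime name, image tag, node count, and the `amd.com/gpu` resource key are assumptions for illustration, not the manifest this PR actually adds:

```yaml
# Illustrative sketch only — not the manifest added by this PR.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-rocm        # hypothetical name
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      # Pinned rocm/pytorch image; the exact tag used in
                      # the PR is not shown here, so this is a placeholder.
                      image: rocm/pytorch:<pinned-pytorch-2.7.1-tag>
                      resources:
                        limits:
                          amd.com/gpu: 1  # AMD GPU device-plugin resource
```

A TrainJob would then select this runtime by name via its `runtimeRef`, the same way jobs target the NVIDIA `torch-distributed` runtime today.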
Checklist: