feat(runtimes): add AMD ROCm torch distributed runtime ref #2335 #3097

Open
JEETDESAI25 wants to merge 1 commit into kubeflow:master from JEETDESAI25:feat-rocm-torch-runtime-2335

Conversation

@JEETDESAI25

What this PR does / why we need it:

Adds an AMD ROCm PyTorch distributed training runtime, enabling distributed training on AMD GPUs (e.g., MI300X accelerators).
This follows the same pattern as the existing NVIDIA torch runtime added in #2328, using the official rocm/pytorch image pinned to the same PyTorch version as the NVIDIA runtime (2.7.1).

Which issue(s) this PR fixes:

#2335

Checklist:

  • [ ] Docs included if any changes are user-facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏



Signed-off-by: Jeet <113221510+JEETDESAI25@users.noreply.github.com>
JEETDESAI25 force-pushed the feat-rocm-torch-runtime-2335 branch from 62b0c84 to b0cfb72 on January 16, 2026, 18:18
@JEETDESAI25
Author

/assign @andreyvelich
