feat(runtimes): add AMD ROCm torch distributed runtime (ref #2335, #3097)
JEETDESAI25 wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Force-pushed: 62b0c84 to b0cfb72
/assign @andreyvelich
What this PR does / why we need it:
Adds AMD ROCm PyTorch distributed training runtime, enabling distributed training on AMD GPUs (e.g., MI300X accelerators).
This follows the same pattern as the existing NVIDIA torch runtime added in #2328, using the official `rocm/pytorch` image with a pinned version matching the PyTorch version used in the NVIDIA runtime (2.7.1).

Which issue(s) this PR fixes:
#2335
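
For illustration only, a ROCm runtime in the spirit of this PR might look like the sketch below, modeled on the existing torch-distributed runtime pattern. The runtime name, image tag, node count, and the `amd.com/gpu` resource key are assumptions for illustration, not the manifest this PR actually adds:

```yaml
# Illustrative sketch only — not the manifest added by this PR.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-rocm        # hypothetical name
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      # Pinned rocm/pytorch image; the exact tag used in
                      # the PR is not shown here, so this is a placeholder.
                      image: rocm/pytorch:<pinned-pytorch-2.7.1-tag>
                      resources:
                        limits:
                          amd.com/gpu: 1  # AMD GPU device-plugin resource
```

A TrainJob would then select this runtime by name via its `runtimeRef`, the same way jobs target the NVIDIA `torch-distributed` runtime today.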
Checklist: