
feat: add Megatron-Core GPT Tensor Parallelism example notebook #3201

Open
XploY04 wants to merge 1 commit into kubeflow:master from XploY04:megatron

Conversation


@XploY04 XploY04 commented Feb 13, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3178

References used:

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Copilot AI review requested due to automatic review settings February 13, 2026 17:31
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏


Copilot AI left a comment


Pull request overview

This PR adds a Jupyter notebook example demonstrating how to train a GPT model using Megatron-Core with Tensor Parallelism on Kubeflow Trainer. The example showcases distributed training across 2 GPUs using the torch-distributed runtime.

Changes:

  • Adds a complete end-to-end example for Megatron-Core GPT Tensor Parallelism training
  • Demonstrates the integration of Megatron-LM with Kubeflow Trainer's torch-distributed runtime
  • Shows checkpoint saving/loading with Megatron's distributed checkpointing
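The core idea the notebook demonstrates, column-wise tensor parallelism of a linear layer, can be illustrated without Megatron or GPUs. This is a hedged, dependency-free sketch (not Megatron code): the weight matrix's columns are split across "ranks", each rank computes its output slice independently, and the slices are concatenated (an all-gather in a real distributed run).

```python
# Illustrative sketch of column-parallel matrix multiplication, the
# building block of tensor parallelism. Pure Python, no distributed setup.

def matmul(x, w):
    # x: (m, k), w: (k, n), both as nested lists of floats.
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

x = [[1.0, 2.0]]                  # one input row, hidden size 2
w = [[1.0, 2.0, 3.0, 4.0],        # full weight matrix: (2, 4)
     [5.0, 6.0, 7.0, 8.0]]

# Split W column-wise into 2 shards, one per tensor-parallel "rank".
w_rank0 = [row[:2] for row in w]
w_rank1 = [row[2:] for row in w]

# Each rank multiplies against only its own shard.
y0 = matmul(x, w_rank0)
y1 = matmul(x, w_rank1)

# "All-gather": concatenate the partial outputs along the column axis.
y = [r0 + r1 for r0, r1 in zip(y0, y1)]

print(y == matmul(x, w))  # True: sharded result matches the full product
```

In Megatron-Core the same split is handled by its parallel linear layers and NCCL collectives; the sketch only shows why splitting columns preserves the result.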

Comment on lines +237 to +238
" ckpt_path = os.getcwd() + \"/ckpt\"\n",
" Path(ckpt_path).mkdir(exist_ok=True)\n",

Copilot AI Feb 13, 2026


String concatenation with "/" is used for path construction, which is not portable across operating systems. Since Path from pathlib is already imported and used on line 238, it would be more consistent and portable to use Path(os.getcwd()) / "ckpt" instead of os.getcwd() + "/ckpt".

Suggested change
" ckpt_path = os.getcwd() + \"/ckpt\"\n",
" Path(ckpt_path).mkdir(exist_ok=True)\n",
" ckpt_path = Path(os.getcwd()) / \"ckpt\"\n",
" ckpt_path.mkdir(exist_ok=True)\n",

Copilot uses AI. Check for mistakes.
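The suggested pathlib form can be exercised in isolation. The sketch below uses a temporary directory instead of os.getcwd() (an assumption made for the example, so it has no side effects in the working tree):

```python
import tempfile
from pathlib import Path

# Portable checkpoint-directory creation with pathlib, mirroring the
# review suggestion. A temp dir stands in for os.getcwd() here.
workdir = Path(tempfile.mkdtemp())

# pathlib's "/" operator picks the correct separator for the OS,
# unlike string concatenation with a hard-coded "/".
ckpt_path = workdir / "ckpt"
ckpt_path.mkdir(exist_ok=True)   # creates the directory
ckpt_path.mkdir(exist_ok=True)   # second call is a no-op, does not raise

print(ckpt_path.is_dir())  # True
```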
"source": [
"## Define the Training Function\n",
"\n",
"The first step is to create function to train GPT model using Megatron-Core with Tensor Parallelism.\n",

Copilot AI Feb 13, 2026


Missing article "a" before "function". Should be "create a function to train".

Suggested change
"The first step is to create function to train GPT model using Megatron-Core with Tensor Parallelism.\n",
"The first step is to create a function to train GPT model using Megatron-Core with Tensor Parallelism.\n",

Comment on lines +280 to +283
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"torch-distributed\":\n",
" torch_runtime = runtime"

Copilot AI Feb 13, 2026


The variable torch_runtime is assigned only if a runtime with name "torch-distributed" is found in the loop, but there's no check to ensure it was actually found before it's used on line 313. If the runtime is not found, this will cause a NameError. Consider adding a check like "if not torch_runtime: raise ValueError('torch-distributed runtime not found')" after the loop, or use a break statement and initialize torch_runtime = None before the loop (similar to the pattern in examples/pytorch/audio-classification/audio-classification.ipynb lines 403-410).

Suggested change
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"torch-distributed\":\n",
" torch_runtime = runtime"
"torch_runtime = None\n",
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"torch-distributed\":\n",
" torch_runtime = runtime\n",
" break\n",
"\n",
"if torch_runtime is None:\n",
" raise ValueError(\"torch-distributed runtime not found\")"

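The sentinel-plus-check pattern the review suggests can be shown with plain dicts standing in for Kubeflow Runtime objects (the runtime names here are illustrative, not taken from a live cluster):

```python
# Sketch of the suggested pattern: initialize the result to None before
# the loop, break on the first match, and fail fast if nothing matched.
runtimes = [{"name": "torchtune"}, {"name": "torch-distributed"}]

torch_runtime = None                  # sentinel set before the loop
for runtime in runtimes:
    if runtime["name"] == "torch-distributed":
        torch_runtime = runtime
        break                         # stop scanning at the first match

if torch_runtime is None:             # clear error instead of a later NameError
    raise ValueError("torch-distributed runtime not found")

print(torch_runtime["name"])  # torch-distributed
```

Without the sentinel, a cluster missing the torch-distributed runtime would surface only as a NameError at the later call site, which is harder to diagnose than the explicit ValueError.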

Successfully merging this pull request may close these issues.

Distributed Tensor Parallelism with Megatron-LM and TrainJob
