feat: add Megatron-Core GPT Tensor Parallelism example notebook #3201
XploY04 wants to merge 1 commit into kubeflow:master from
Conversation
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR adds a Jupyter notebook example demonstrating how to train a GPT model using Megatron-Core with Tensor Parallelism on Kubeflow Trainer. The example showcases distributed training across 2 GPUs using the torch-distributed runtime.
Changes:
- Adds a complete end-to-end example for Megatron-Core GPT Tensor Parallelism training
- Demonstrates the integration of Megatron-LM with Kubeflow Trainer's torch-distributed runtime (a rough submission sketch follows this list)
- Shows checkpoint saving/loading with Megatron's distributed checkpointing
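For reviewers who haven't opened the notebook, the submission step looks roughly like the sketch below. This is a minimal reconstruction based on the PR description and the Kubeflow Trainer SDK's TrainerClient/CustomTrainer API, not the notebook's exact code; the train_gpt function and the node/GPU counts are assumptions.

```python
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_gpt():
    # Placeholder for the notebook's Megatron-Core training function
    # (a rough sketch of its body appears further down in this thread).
    ...


client = TrainerClient()

# Look up the torch-distributed runtime the example targets.
torch_runtime = next(
    r for r in client.list_runtimes() if r.name == "torch-distributed"
)

# Submit the function as a TrainJob spanning 2 GPUs on a single node.
job_name = client.train(
    trainer=CustomTrainer(
        func=train_gpt,
        num_nodes=1,
        resources_per_node={"gpu": 2},
    ),
    runtime=torch_runtime,
)
```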
```python
ckpt_path = os.getcwd() + "/ckpt"
Path(ckpt_path).mkdir(exist_ok=True)
```
String concatenation with "/" is used for path construction, which is not portable across operating systems. Since Path from pathlib is already imported and used on line 238, it would be more consistent and portable to use Path(os.getcwd()) / "ckpt" instead of os.getcwd() + "/ckpt".
Suggested change:

```diff
- ckpt_path = os.getcwd() + "/ckpt"
- Path(ckpt_path).mkdir(exist_ok=True)
+ ckpt_path = Path(os.getcwd()) / "ckpt"
+ ckpt_path.mkdir(exist_ok=True)
```
```markdown
## Define the Training Function

The first step is to create function to train GPT model using Megatron-Core with Tensor Parallelism.
```
Missing article "a" before "function". Should be "create a function to train".
Suggested change:

```diff
- The first step is to create function to train GPT model using Megatron-Core with Tensor Parallelism.
+ The first step is to create a function to train GPT model using Megatron-Core with Tensor Parallelism.
```
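For context on what the training function in this cell sets up, here is a minimal sketch of the Megatron-Core tensor-parallel initialization, assuming megatron.core's parallel_state API and the RANK/WORLD_SIZE/LOCAL_RANK environment variables that the torch-distributed runtime injects. It is illustrative only; the notebook's actual function builds the full GPT model, optimizer, and data pipeline.

```python
import os

import torch
from megatron.core import parallel_state


def train_gpt():
    # The torch-distributed runtime provides RANK, WORLD_SIZE, and
    # MASTER_ADDR/MASTER_PORT, so no explicit init args are needed.
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Shard each transformer layer's weight matrices across 2 GPUs.
    parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)

    # ... build the GPT model, dataset, and optimizer; run the training
    # loop; save checkpoints with megatron.core.dist_checkpointing ...

    parallel_state.destroy_model_parallel()
    torch.distributed.destroy_process_group()
```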
```python
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "torch-distributed":
        torch_runtime = runtime
```
The variable torch_runtime is assigned only if a runtime with name "torch-distributed" is found in the loop, but there's no check to ensure it was actually found before it's used on line 313. If the runtime is not found, this will cause a NameError. Consider adding a check like "if not torch_runtime: raise ValueError('torch-distributed runtime not found')" after the loop, or use a break statement and initialize torch_runtime = None before the loop (similar to the pattern in examples/pytorch/audio-classification/audio-classification.ipynb lines 403-410).
Suggested change:

```diff
- for runtime in client.list_runtimes():
-     print(runtime)
-     if runtime.name == "torch-distributed":
-         torch_runtime = runtime
+ torch_runtime = None
+ for runtime in client.list_runtimes():
+     print(runtime)
+     if runtime.name == "torch-distributed":
+         torch_runtime = runtime
+         break
+
+ if torch_runtime is None:
+     raise ValueError("torch-distributed runtime not found")
```
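Initializing torch_runtime to None and failing fast also yields a clearer error message than the NameError the current code would raise at the later point of use (line 313).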
What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3178

References used:

Checklist: