feat: add Megatron-Core GPT Tensor Parallelism example notebook #3201
XploY04 wants to merge 1 commit into kubeflow:master from
Conversation
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR adds a Jupyter notebook example demonstrating how to train a GPT model using Megatron-Core with Tensor Parallelism on Kubeflow Trainer. The example showcases distributed training across 2 GPUs using the torch-distributed runtime.
Changes:
- Adds a complete end-to-end example for Megatron-Core GPT Tensor Parallelism training
- Demonstrates the integration of Megatron-LM with Kubeflow Trainer's torch-distributed runtime (a rough submission sketch follows this list)
- Shows checkpoint saving/loading with Megatron's distributed checkpointing
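For reviewers who haven't opened the notebook, the submission step looks roughly like the sketch below. This is a minimal reconstruction based on the PR description and the Kubeflow Trainer SDK's TrainerClient/CustomTrainer API, not the notebook's exact code; the train_gpt function and the node/GPU counts are assumptions.

```python
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_gpt():
    # Placeholder for the notebook's Megatron-Core training function
    # (a rough sketch of its body appears further down in this thread).
    ...


client = TrainerClient()

# Look up the torch-distributed runtime the example targets.
torch_runtime = next(
    r for r in client.list_runtimes() if r.name == "torch-distributed"
)

# Submit the function as a TrainJob spanning 2 GPUs on a single node.
job_name = client.train(
    trainer=CustomTrainer(
        func=train_gpt,
        num_nodes=1,
        resources_per_node={"gpu": 2},
    ),
    runtime=torch_runtime,
)
```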
```python
ckpt_path = os.getcwd() + "/ckpt"
Path(ckpt_path).mkdir(exist_ok=True)
```
String concatenation with "/" is used for path construction, which is not portable across operating systems. Since Path from pathlib is already imported and used on line 238, it would be more consistent and portable to use Path(os.getcwd()) / "ckpt" instead of os.getcwd() + "/ckpt".
Suggested change:

```diff
- ckpt_path = os.getcwd() + "/ckpt"
- Path(ckpt_path).mkdir(exist_ok=True)
+ ckpt_path = Path(os.getcwd()) / "ckpt"
+ ckpt_path.mkdir(exist_ok=True)
```
```markdown
## Define the Training Function

The first step is to create function to train GPT model using Megatron-Core with Tensor Parallelism.
```
Missing article "a" before "function". Should be "create a function to train".
Suggested change:

```diff
- The first step is to create function to train GPT model using Megatron-Core with Tensor Parallelism.
+ The first step is to create a function to train GPT model using Megatron-Core with Tensor Parallelism.
```
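For context on what the training function in this cell sets up, here is a minimal sketch of the Megatron-Core tensor-parallel initialization, assuming megatron.core's parallel_state API and the RANK/WORLD_SIZE/LOCAL_RANK environment variables that the torch-distributed runtime injects. It is illustrative only; the notebook's actual function builds the full GPT model, optimizer, and data pipeline.

```python
import os

import torch
from megatron.core import parallel_state


def train_gpt():
    # The torch-distributed runtime provides RANK, WORLD_SIZE, and
    # MASTER_ADDR/MASTER_PORT, so no explicit init args are needed.
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Shard each transformer layer's weight matrices across 2 GPUs.
    parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)

    # ... build the GPT model, dataset, and optimizer; run the training
    # loop; save checkpoints with megatron.core.dist_checkpointing ...

    parallel_state.destroy_model_parallel()
    torch.distributed.destroy_process_group()
```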
```python
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "torch-distributed":
        torch_runtime = runtime
```
The variable torch_runtime is assigned only if a runtime with name "torch-distributed" is found in the loop, but there's no check to ensure it was actually found before it's used on line 313. If the runtime is not found, this will cause a NameError. Consider adding a check like "if not torch_runtime: raise ValueError('torch-distributed runtime not found')" after the loop, or use a break statement and initialize torch_runtime = None before the loop (similar to the pattern in examples/pytorch/audio-classification/audio-classification.ipynb lines 403-410).
Suggested change:

```diff
- for runtime in client.list_runtimes():
-     print(runtime)
-     if runtime.name == "torch-distributed":
-         torch_runtime = runtime
+ torch_runtime = None
+ for runtime in client.list_runtimes():
+     print(runtime)
+     if runtime.name == "torch-distributed":
+         torch_runtime = runtime
+         break
+
+ if torch_runtime is None:
+     raise ValueError("torch-distributed runtime not found")
```
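Initializing torch_runtime to None and failing fast also yields a clearer error message than the NameError the current code would raise at the later point of use (line 313).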
What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3178

References used:

Checklist: