[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199
[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199andreyvelich wants to merge 6 commits intokubeflow:masterfrom
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR introduces a breaking API change to replace PodTemplateOverrides with manager-scoped TemplateOverrides, aiming to group override ownership boundaries more clearly across controllers/users.
Changes:
- Replaces
TrainJobSpec.PodTemplateOverrideswithTrainJobSpec.TemplateOverrideskeyed bymanager. - Introduces new API types for
TemplateOverride, including job-level and pod-level override histories. - Updates the v2 proposal/KEP documentation to describe the new API shape and examples.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Updates TrainJob API types to add manager-keyed TemplateOverrides and new override structs. |
| docs/proposals/2170-kubeflow-trainer-v2/README.md | Updates the proposal to document TemplateOverrides, including rationale and YAML examples. |
| // JobTemplateOverride represents a custom override that will be applied to the JobTemplateSpec | ||
| type JobTemplateOverride struct { | ||
| // Time is the timestamp of when the JobTemplateOverride entry was added. | ||
| // +required | ||
| Time metav1.Time `json:"time,omitempty"` | ||
|
|
There was a problem hiding this comment.
In the docs code snippet, Time is marked as +required and uses a non-pointer metav1.Time, which doesn’t match the actual API types in pkg/apis/trainer/v1alpha1/trainjob_types.go where Time is optional (*metav1.Time).
| // +listType=map | ||
| // +listMapKey=time | ||
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | ||
|
|
||
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" | ||
| // +listType=map | ||
| // +listMapKey=time |
There was a problem hiding this comment.
JobTemplateOverrides/PodTemplateOverrides are declared as +listType=map with +listMapKey=time, but the keyed field Time is optional (*metav1.Time), which makes the map key potentially unset and breaks map-list semantics (unique/stable keys for merge/validation).
| // +listType=map | |
| // +listMapKey=time | |
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | |
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" | |
| // +listType=map | |
| // +listMapKey=time | |
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | |
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" |
| TargetJobs []TemplateOverrideTargetJob `json:"targetJobs,omitempty"` | ||
|
|
||
| // metadata overrides the Job template metadata or JobSet metadata. | ||
| // If targetJobs is specified, these values are merged with the specific ReplicatedJob's Job template metadata. | ||
| // If targetJobs is empty, these values are merged with the JobSet object metadata. | ||
| // +optional | ||
| Metadata *metav1.ObjectMeta `json:"metadata,omitempty"` |
There was a problem hiding this comment.
I’ve been thinking that we could start using JobTemplateOverride instead of the dedicated Labels and Annotations fields we currently expose in the TrainJob.spec API.
The idea would be:
- If targetJob is omitted, the override is applied to the JobSet
- If targetJob is set, the override is applied to the specific Job
One concern is that once we introduce JobTemplateSpecOverride, it could potentially contain fields relevant to both Job and JobSet, which may introduce ambiguity. I’m not entirely sure what the better way to handle that would be, though I also don’t see a clearly better alternative at the moment.
@tenzen-y @kaisoz @mimowo @astefanutti @kannon92 , I’d really appreciate your thoughts on this approach.
| // Time is the timestamp of when the JobTemplateOverride entry was added. If value is omitted, | ||
| // controller defaults this value to the current timestamp. | ||
| // +optional | ||
| Time *metav1.Time `json:"time,omitempty"` |
There was a problem hiding this comment.
Time will be set server-side by Trainer admission mutating webhook when TrainJob is created/updated.
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | ||
|
|
||
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" | ||
| // +listType=map | ||
| // +listMapKey=time | ||
| PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"` |
There was a problem hiding this comment.
Do you prefer pod and job or podTemplateOverrides and jobTemplateOverrides?
|
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @akshaychitneni |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
This BREAKING CHANGE will replace
PodTemplateOverridewithTemplateOverridesAPI.We would like to group overrides by manager for clear ownership boundaries.
This PR updates KEP, APIs, and implementation.
TemplateOverrides will have:
Related: #3020