Skip to content

[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199

Open
andreyvelich wants to merge 6 commits intokubeflow:masterfrom
andreyvelich:template-override-api
Open

[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199
andreyvelich wants to merge 6 commits intokubeflow:masterfrom
andreyvelich:template-override-api

Conversation

@andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Feb 10, 2026

This BREAKING CHANGE will replace PodTemplateOverride with TemplateOverrides API.

We would like to group overrides by manager for clear ownership boundaries.

This PR updates KEP, APIs, and implementation.

TemplateOverrides will have:

  • PodTemplateOverrides - for Pod overrides
  • JobTemplateOverrides - for JobSet/Job overrides

Related: #3020

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copilot AI review requested due to automatic review settings February 10, 2026 13:58
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a breaking API change to replace PodTemplateOverrides with manager-scoped TemplateOverrides, aiming to group override ownership boundaries more clearly across controllers/users.

Changes:

  • Replaces TrainJobSpec.PodTemplateOverrides with TrainJobSpec.TemplateOverrides keyed by manager.
  • Introduces new API types for TemplateOverride, including job-level and pod-level override histories.
  • Updates the v2 proposal/KEP documentation to describe the new API shape and examples.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
pkg/apis/trainer/v1alpha1/trainjob_types.go Updates TrainJob API types to add manager-keyed TemplateOverrides and new override structs.
docs/proposals/2170-kubeflow-trainer-v2/README.md Updates the proposal to document TemplateOverrides, including rationale and YAML examples.

Comment on lines +799 to +804
// JobTemplateOverride represents a custom override that will be applied to the JobTemplateSpec
type JobTemplateOverride struct {
// Time is the timestamp of when the JobTemplateOverride entry was added.
// +required
Time metav1.Time `json:"time,omitempty"`

Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the docs code snippet, Time is marked as +required and uses a non-pointer metav1.Time, which doesn’t match the actual API types in pkg/apis/trainer/v1alpha1/trainjob_types.go where Time is optional (*metav1.Time).

Copilot uses AI. Check for mistakes.
Comment on lines +268 to +276
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JobTemplateOverrides/PodTemplateOverrides are declared as +listType=map with +listMapKey=time, but the keyed field Time is optional (*metav1.Time), which makes the map key potentially unset and breaks map-list semantics (unique/stable keys for merge/validation).

Suggested change
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"

Copilot uses AI. Check for mistakes.
Comment on lines +299 to +305
TargetJobs []TemplateOverrideTargetJob `json:"targetJobs,omitempty"`

// metadata overrides the Job template metadata or JobSet metadata.
// If targetJobs is specified, these values are merged with the specific ReplicatedJob's Job template metadata.
// If targetJobs is empty, these values are merged with the JobSet object metadata.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`
Copy link
Member Author

@andreyvelich andreyvelich Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve been thinking that we could start using JobTemplateOverride instead of the dedicated Labels and Annotations fields we currently expose in the TrainJob.spec API.

The idea would be:

  • If targetJob is omitted, the override is applied to the JobSet
  • If targetJob is set, the override is applied to the specific Job

One concern is that once we introduce JobTemplateSpecOverride, it could potentially contain fields relevant to both Job and JobSet, which may introduce ambiguity. I’m not entirely sure what the better way to handle that would be, though I also don’t see a clearly better alternative at the moment.

@tenzen-y @kaisoz @mimowo @astefanutti @kannon92 , I’d really appreciate your thoughts on this approach.

Comment on lines +310 to +313
// Time is the timestamp of when the JobTemplateOverride entry was added. If value is omitted,
// controller defaults this value to the current timestamp.
// +optional
Time *metav1.Time `json:"time,omitempty"`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time will be set server-side by Trainer admission mutating webhook when TrainJob is created/updated.

Comment on lines +270 to +277
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you prefer pod and job or podTemplateOverrides and jobTemplateOverrides?

@andreyvelich
Copy link
Member Author

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @akshaychitneni

andreyvelich and others added 5 commits February 10, 2026 15:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant