
Initial TaskRun Reconcile Refactor #1866


Open · adambkaplan wants to merge 2 commits into main from taskrun-reconcile-refactor

Conversation

adambkaplan (Member) commented Apr 16, 2025

Changes

This is a refactor of how we reconcile TaskRun objects into BuildRun objects. The intent is to create useful abstractions of the TaskRun's lifecycle so that we can extend these principles to other Kubernetes object types.
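For context, here is a minimal sketch of the kind of runner abstraction this refactor works toward. The method set is illustrative, drawn from the lifecycle operations discussed in this PR (create, cancel, status sync, creation timestamp for metrics), and is not this PR's exact interface:

```go
package runner

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	buildv1beta1 "github.com/shipwright-io/build/pkg/apis/build/v1beta1"
)

// BuildRunner abstracts the Kubernetes object that executes a BuildRun.
// Today that object is a Tekton TaskRun; other implementations could wrap
// PipelineRuns, Jobs, or Argo Workflows.
type BuildRunner interface {
	// Create submits the underlying execution object for a BuildRun.
	Create(ctx context.Context, buildRun *buildv1beta1.BuildRun) error

	// Cancel requests cancellation of the underlying execution object.
	Cancel(ctx context.Context, buildRun *buildv1beta1.BuildRun) error

	// SyncStatus copies conditions, results, and failure details from the
	// execution object into the BuildRun's status.
	SyncStatus(ctx context.Context, buildRun *buildv1beta1.BuildRun) error

	// GetCreationTimestamp returns the creation timestamp of the object
	// executing the build (used for ramp-up metrics).
	GetCreationTimestamp() metav1.Time
}
```

A TaskRun-backed implementation would satisfy this interface today; a PipelineRun- or Job-backed runner could later be added without touching the BuildRun reconciler.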

/kind cleanup

Submitter Checklist

  • Includes tests if functionality changed/was added
  • Includes docs if changes are user-facing
  • Set a kind label on this PR
  • Release notes block has been filled in, or marked NONE

See the contributor guide for details on coding conventions, GitHub and Prow interactions, and the code review process.

Release Notes

NONE

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 16, 2025
pull-request-size bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 16, 2025
openshift-ci bot added the release-note label (for when a PR has specified a release note) on Apr 16, 2025
pull-request-size bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label on Apr 17, 2025
adambkaplan force-pushed the taskrun-reconcile-refactor branch 2 times, most recently from 8e1f2c5 to eca85d0, on Apr 21, 2025
openshift-ci bot added the release-note-none label (for when a PR does not need a release note) and removed the release-note label on Apr 21, 2025
adambkaplan changed the title from "WIP - Taskrun Reconcile Refactor" to "Initial TaskRun Reconcile Refactor" on Apr 21, 2025
openshift-ci bot removed the do-not-merge/work-in-progress label on Apr 21, 2025
adambkaplan added the kind/cleanup label (categorizes issue or PR as related to cleaning up code, process, or technical debt) on Apr 21, 2025
Comment on lines +486 to +487
// taskrun pod ramp-up (time between pod creation and last init container completion)
buildmetrics.TaskRunPodRampUpDurationObserve(
adambkaplan (Member, Author):
Is this something we can advocate for in the Tekton community? This exposes metrics that are specific to an underlying implementation, making it difficult to use a different API for reconciling BuildRuns.

Reviewer (Member):

You are changing the behavior here. So far we measured the duration from the time we created the object that does the work (the TaskRun) up to the point we knew it would start performing the strategy steps (the end of the TaskRun's init containers).

Now you measure from pod creation, which entirely removes the Tekton controller and the time it takes to create a Pod from a TaskRun.

If we decide to use different objects to run the BuildRun and want to continue providing such a metric, then we'd certainly have to rename it. But if we want to keep something like this, we should not change it. You could add a CalculateRampUpDuration function on the BuildRunner.
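A minimal sketch of that suggestion, preserving the original semantics (TaskRun creation until the last init container finishes). The struct and method are hypothetical; only the Tekton and core v1 field access is as those APIs define it:

```go
package runner

import (
	"time"

	pipelineapi "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	corev1 "k8s.io/api/core/v1"
)

// taskRunBuildRunner is a hypothetical TaskRun-backed BuildRunner.
type taskRunBuildRunner struct {
	taskRun *pipelineapi.TaskRun
}

// CalculateRampUpDuration returns the time from TaskRun creation until the
// last init container of the TaskRun's pod terminated. It returns false if
// the init containers have not all finished yet.
func (r *taskRunBuildRunner) CalculateRampUpDuration(pod *corev1.Pod) (time.Duration, bool) {
	var lastInitDone time.Time
	for _, status := range pod.Status.InitContainerStatuses {
		if status.State.Terminated == nil {
			// An init container is still waiting or running.
			return 0, false
		}
		if finished := status.State.Terminated.FinishedAt.Time; finished.After(lastInitDone) {
			lastInitDone = finished
		}
	}
	if lastInitDone.IsZero() {
		return 0, false
	}
	return lastInitDone.Sub(r.taskRun.CreationTimestamp.Time), true
}
```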

adambkaplan (Member, Author) replied on May 14, 2025:

I tried not to change the behavior (too much); perhaps with the "BuildRunner" abstraction, what is being reported is less clear.

The BuildRunner's GetCreationTimestamp() method returns the creation timestamp of the object executing the build, in this case the Tekton TaskRun. So TaskRunRampUpObserve records the duration between when we created the TaskRun and when Tekton created the underlying pod. I don't think I am modifying the metric that is currently reported.

It's a similar story for the TaskRunPodRampUp metric: it records the time between when the pod is created and when the pod's last init container terminates. Side note: this might break if Tekton changes its sidecar behavior to use native Kubernetes sidecar containers.

The main behavior change (which required me to modify a unit test) is that computing these timestamps now makes two kube-apiserver calls instead of one.
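To make the split concrete, here is a sketch of the two durations as described above. The function and client wiring are illustrative; the field access follows the Tekton and core v1 APIs, and fetching the pod is the second apiserver call mentioned:

```go
import (
	"context"
	"time"

	pipelineapi "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func observeRampUps(ctx context.Context, c client.Client, taskRun *pipelineapi.TaskRun) error {
	pod := &corev1.Pod{}
	key := client.ObjectKey{Namespace: taskRun.Namespace, Name: taskRun.Status.PodName}
	if err := c.Get(ctx, key, pod); err != nil {
		return err
	}

	// TaskRunRampUp: TaskRun creation -> pod creation (Tekton controller latency).
	rampUp := pod.CreationTimestamp.Sub(taskRun.CreationTimestamp.Time)

	// TaskRunPodRampUp: pod creation -> last init container termination.
	var lastInit time.Time
	for _, s := range pod.Status.InitContainerStatuses {
		if t := s.State.Terminated; t != nil && t.FinishedAt.After(lastInit) {
			lastInit = t.FinishedAt.Time
		}
	}
	podRampUp := lastInit.Sub(pod.CreationTimestamp.Time)

	_ = rampUp    // reported via buildmetrics.TaskRunRampUpDurationObserve
	_ = podRampUp // reported via buildmetrics.TaskRunPodRampUpDurationObserve
	return nil
}
```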

Comment on lines +476 to +477
// taskrun ramp-up duration (time between taskrun creation and taskrun pod creation)
buildmetrics.TaskRunRampUpDurationObserve(
adambkaplan (Member, Author):

Similar issue: this metric is tied to an underlying implementation. In theory, a runner's API does not need to create Kubernetes pods; it just needs to represent the execution of a workload. For example, a KubeVirt VirtualMachineInstance: https://kubevirt.io/user-guide/user_workloads/lifecycle/

generatedTaskRun *pipelineapi.TaskRun
)

generatedTaskRun, err := resources.GenerateTaskRun(t.config, build, buildRun, serviceAccount.Name, strategy)
adambkaplan (Member, Author):

For now, I'm leaving the existing logic for creating TaskRuns in place. I plan to break this logic up and move it to the TaskRunBuildRunner struct (or something similar). This PR is big enough as it is.

Comment on lines +473 to +486
// Set OwnerReference for BuildRun and TaskRun
if err := controllerutil.SetOwnerReference(buildRun, generatedTaskRun, t.scheme); err != nil {
return nil, err
}
adambkaplan (Member, Author):

Leaving this here for now, but I may move the setting of owner references to the main Reconcile function in a subsequent PR.
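A sketch of that follow-up, setting the owner reference from the main Reconcile function on whatever object the runner produced. The variable names are illustrative; the controllerutil call matches the one shown above:

```go
// In the BuildRun controller's Reconcile, after the runner generates its
// execution object (runnerObject is a hypothetical name for it):
if err := controllerutil.SetOwnerReference(buildRun, runnerObject, r.scheme); err != nil {
	return reconcile.Result{}, err
}
```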

adambkaplan (Member, Author) commented:

/assign @SaschaSchwarze0

/cc @HeavyWombat @karanibm6

return e.Message
}

func (e *BuildRunnerValidationError) IsTerminal() bool {
Reviewer (Member):

Godoc, please. I'm doing a first pass and cannot tell what this is for, which is a good indicator that it needs docs.
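A possible shape for those docs, inferred from the type name and the two methods visible in the diff; the Terminal field is an assumption about how IsTerminal is backed:

```go
// BuildRunnerValidationError reports that a BuildRun failed validation
// before its execution object could be created.
type BuildRunnerValidationError struct {
	// Message is a human-readable description of the validation failure.
	Message string
	// Terminal indicates the failure cannot be resolved by retrying.
	Terminal bool
}

// Error implements the error interface.
func (e *BuildRunnerValidationError) Error() string {
	return e.Message
}

// IsTerminal reports whether the error is permanent. A terminal validation
// error should mark the BuildRun failed instead of being requeued.
func (e *BuildRunnerValidationError) IsTerminal() bool {
	return e.Terminal
}
```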

@@ -0,0 +1,491 @@
package runner
Reviewer (Member):

I wonder how the validate check is green, given that this file lacks a copyright header.
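For reference, the header would presumably follow the SPDX form used elsewhere in the repository (this exact wording is an assumption):

```go
// Copyright The Shipwright Contributors
//
// SPDX-License-Identifier: Apache-2.0

package runner
```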

Comment on lines +177 to +186
buildSpec := buildrun.Status.BuildSpec

if buildSpec.Source == nil {
return
}
Reviewer (Member):

This is not accurate. status.buildSpec is a copy of the Build spec, but the real source may be the BuildRun's source. For example, in the case of a local source, this causes an overwrite.
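A sketch of the implied fix: prefer a source defined on the BuildRun itself over the Build spec copied into status.buildSpec. The helper is hypothetical; the field names follow the v1beta1 API:

```go
buildSpec := buildrun.Status.BuildSpec

// A BuildRun can carry its own source (e.g. a local source), which must
// take precedence over the Build's source mirrored into status.buildSpec.
if buildrun.Spec.Source != nil {
	syncBuildRunSource(buildrun.Spec.Source) // hypothetical helper
	return
}
if buildSpec.Source == nil {
	return
}
```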


buildRun.Status.Output.Digest = result.Value.StringVal

case generateOutputResultName(imageSizeResult):
// TODO: Throw error that should not be retried?
Reviewer (Member):

I suggest logging a warning if the conversion here fails. It usually means the build strategy is incorrect in the case of a strategy-managed push.
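A sketch of the suggested warning inside the existing result switch. The logger wiring (controller-runtime's context logger) and the Size assignment are assumptions; strconv mirrors what an integer image-size result needs:

```go
case generateOutputResultName(imageSizeResult):
	size, err := strconv.ParseInt(result.Value.StringVal, 10, 64)
	if err != nil {
		// Warn instead of failing: a bad value usually means the build
		// strategy reports the size incorrectly for strategy-managed push.
		log.FromContext(ctx).Info("ignoring non-integer image size result",
			"value", result.Value.StringVal, "error", err.Error())
		continue
	}
	buildRun.Status.Output.Size = size
```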

}

case pipelineapi.TaskRunReasonTimedOut:
reason = "BuildRunTimeout"
Reviewer (Member):

We have no constant for this?

adambkaplan (Member, Author) replied:

Apparently not: my search for BuildRunTimeout only returns references to the string literal, not an assignment to a constant. Promoting this to the API package is a good idea.
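A sketch of that promotion; the constant name and its placement alongside other BuildRun reasons in the API package are assumptions:

```go
const (
	// BuildRunStateTimeout indicates the BuildRun exceeded its timeout.
	BuildRunStateTimeout = "BuildRunTimeout"
)
```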

adambkaplan force-pushed the taskrun-reconcile-refactor branch from eca85d0 to f1ec968 on May 14, 2025
openshift-ci bot commented May 14, 2025:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from saschaschwarze0. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adambkaplan (Member, Author) commented:

I think I fixed all the issues identified in the review (or addressed them in comments). PTAL.

Use standard controller-runtime fake client for unit tests.

Signed-off-by: Adam Kaplan <[email protected]>
adambkaplan force-pushed the taskrun-reconcile-refactor branch from f1ec968 to 0b98316 on May 15, 2025
Initial refactor to use a factory pattern when generating the Tekton
TaskRun for a Shipwright BuildRun. This starts us down the path of
using other k8s objects besides TaskRuns to execute the build process.
Other possibilities include PipelineRuns, Argo Workflows, and even
standard k8s objects (Pod, Job).

Key logic moved out of the main BuildRun reconciler and `resources`
package:

- Cancellation of the underlying runner (TaskRun)
- Syncing of status to the BuildRun object
- Syncing of status from TaskRun results
- Syncing of status from TaskRun conditions
- Syncing of status from failure messages recorded in termination logs
- Reporting of timestamps for started, completed, and other lifecycle
  events:
  - Pod creation time
  - Pod init containers finished
  - Execution start time
- Migration of unit tests to the controller-runtime fake client
- Moving status condition update calls back to the main reconciler

Further abstractions are likely needed to ensure we properly generate
the execution objects for other APIs.

Signed-off-by: Adam Kaplan <[email protected]>
Labels: kind/cleanup · release-note-none · size/XXL