fix(core): Prevent worker from recovering finished executions #16094
Conversation
cubic found 1 issue across 3 files. Review it in cubic.dev
Out of curiosity, what would happen if we were to include
  { status: 'error', data: stringify({ runData: { foo: 'bar' } }) },
  workflow,
);
const messages = setupMessages(execution.id, 'Some workflow');
This test is not creating 100% the same scenario you're covering in the PR description: it contains an additional n8n.workflow.started message.
Exactly. What the test verifies is that crash recovery bails out if the execution is already in a terminal error state, no matter what messages we may have about it.
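For readers skimming the thread, here is a minimal, self-contained sketch of the guard being tested; the type, status values, and function name are illustrative stand-ins, not n8n's actual implementation:

```ts
// Illustrative sketch only: recovery should bail out when the execution
// already has a terminal status, regardless of which event-log messages
// exist for it. Names and status values here are hypothetical.
type ExecutionStatus = 'new' | 'running' | 'success' | 'error' | 'canceled';

interface ExecutionLike {
  id: string;
  status: ExecutionStatus;
}

const TERMINAL_STATUSES: ExecutionStatus[] = ['success', 'error', 'canceled'];

function shouldSkipRecovery(execution: ExecutionLike): boolean {
  // A finished execution (even one that finished with an error) is not a crash candidate.
  return TERMINAL_STATUSES.includes(execution.status);
}

// Mirrors the test above: an execution persisted with status 'error' must be
// skipped, no matter what messages (e.g. a workflow-started event) were logged.
console.log(shouldSkipRecovery({ id: '1', status: 'error' })); // true
console.log(shouldSkipRecovery({ id: '2', status: 'running' })); // false
```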
✅ All Cypress E2E specs passed
I did try this, and the worker does get the full logs, but I hesitate to change this because historically these hooks have always lived with main, and I'm not confident I can foresee all the downstream consequences of moving them. For example, for telemetry purposes the execution starts when the trigger is received, not when a worker becomes available. Let's tackle this separately.
Got released with
Summary
Crash recovery has plenty of known issues, but I found a fundamental one. Consider this scenario: in queue (scaling) mode, a worker picks up an execution and the execution fails, so it finishes with an error status. In this scenario we end up with inconsistent event logs, because the events for that execution are split between the main instance's event log and the worker's event log.
This is because of our lifecycle hooks setup (see the sketch after this list):
- `getLifecycleHooksForScalingMain`, which includes `hookFunctionsWorkflowEvents`
- `getLifecycleHooksForScalingWorker`, which excludes `hookFunctionsWorkflowEvents`
- `getLifecycleHooksForSubExecutions`, which includes `hookFunctionsWorkflowEvents`
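To make the asymmetry concrete, here is a minimal sketch; the three factory names match the ones listed above, but the types, handler wiring, and log lines are simplified stand-ins rather than n8n's real code:

```ts
// Illustrative model of the hooks asymmetry described above.
type Handler = (executionId: string) => void;

interface LifecycleHooks {
  workflowExecuteBefore: Handler[];
  workflowExecuteAfter: Handler[];
}

// Stand-in for hookFunctionsWorkflowEvents: whichever instance registers these
// handlers is the instance whose event log ends up with the workflow-level
// start/finish events. (Real n8n event names and payloads omitted.)
function hookFunctionsWorkflowEvents(): LifecycleHooks {
  return {
    workflowExecuteBefore: [(id) => console.log(`workflow started: ${id}`)],
    workflowExecuteAfter: [(id) => console.log(`workflow finished: ${id}`)],
  };
}

const emptyHooks = (): LifecycleHooks => ({
  workflowExecuteBefore: [],
  workflowExecuteAfter: [],
});

// Main in scaling mode registers the workflow-event hooks...
export const getLifecycleHooksForScalingMain = (): LifecycleHooks => hookFunctionsWorkflowEvents();
// ...the worker does not...
export const getLifecycleHooksForScalingWorker = (): LifecycleHooks => emptyHooks();
// ...and sub-executions do again.
export const getLifecycleHooksForSubExecutions = (): LifecycleHooks => hookFunctionsWorkflowEvents();
```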
This event log inconsistency breaks crash recovery, which assumes that all events for an execution are in the event log of a single instance. As a result, crash recovery wrongly identifies this finished execution as unfinished (instead of finished with an error), so it wrongly amends the details of a finished execution (instead of skipping it). And because those amended details are incomplete, the same execution triggers another crash recovery on the next restart.
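As a rough illustration of why that assumption breaks (names are hypothetical, not the actual recovery code):

```ts
// Hypothetical sketch: recovery decides from a single instance's event log
// whether an execution finished. If the finish event was written to another
// instance's log, a finished execution looks unfinished here.
interface EventLogEntry {
  executionId: string;
  eventName: string; // illustrative event names, not n8n's real ones
}

function looksUnfinished(executionId: string, localEventLog: EventLogEntry[]): boolean {
  const events = localEventLog.filter((e) => e.executionId === executionId);
  const hasFinishEvent = events.some((e) => e.eventName === 'workflow-finished');
  return events.length > 0 && !hasFinishEvent;
}

// A log that received only some of the execution's events, because the
// workflow-level hooks ran on a different instance:
const partialLog: EventLogEntry[] = [
  { executionId: '1', eventName: 'node-started' },
  { executionId: '1', eventName: 'node-finished' },
];

// Execution 1 actually finished with an error, but from this log alone it
// looks unfinished, so recovery wrongly amends it.
console.log(looksUnfinished('1', partialLog)); // true
```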
In short, every worker tries to recover its own finished failed executions (instead of only unfinished crashed ones), even though the worker does not have the data for them, and the cycle repeats on every restart. The root solution would be to straighten out our hooks setup so that workers' event logs remain with workers, but lifecycle hooks are so central and so legacy that this would be a big dedicated effort.
Hence, for now, this fix filters out finished executions downstream and defers crash-recovery logging until we know which execution IDs were actually recovered.
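A rough sketch of the shape of this fix, with hypothetical names (see the diff for the real change): drop already-finished executions up front, then log only the IDs that were actually recovered.

```ts
// Hypothetical sketch of the fix's shape, not the actual implementation.
interface ExecutionSummary {
  id: string;
  status: 'new' | 'running' | 'success' | 'error' | 'canceled';
}

const FINISHED_STATUSES: ExecutionSummary['status'][] = ['success', 'error', 'canceled'];

function recoverCrashedExecutions(
  candidates: ExecutionSummary[],
  amend: (id: string) => boolean, // returns true only if details were actually amended
): string[] {
  // 1. Filter out executions that already reached a terminal status.
  const unfinished = candidates.filter((e) => !FINISHED_STATUSES.includes(e.status));

  // 2. Attempt recovery and collect the IDs that were actually recovered.
  const recoveredIds = unfinished.filter((e) => amend(e.id)).map((e) => e.id);

  // 3. Defer logging until the real outcome is known.
  if (recoveredIds.length > 0) {
    console.log(`Recovered executions: ${recoveredIds.join(', ')}`);
  }
  return recoveredIds;
}

// Example: the finished 'error' execution is skipped; only the running one is recovered.
recoverCrashedExecutions(
  [
    { id: '1', status: 'error' },
    { id: '2', status: 'running' },
  ],
  () => true,
);
```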
Related Linear tickets, GitHub issues, and Community forum posts
n/a
Review / Merge checklist
- release/backport (if the PR is an urgent fix that needs to be backported)