
fix(core): Prevent worker from recovering finished executions #16094


Merged
1 commit merged into master on Jun 10, 2025

Conversation

@ivov (Contributor) commented on Jun 6, 2025

Summary

Crash recovery has plenty of known issues, but I found a fundamental one.

Consider this scenario:

  • Main enqueues a job for execution 1, whose workflow has an error-handling workflow
  • Worker A picks up the job and runs execution 1, which throws, so the execution finishes with an error status
  • Worker A enqueues a job for execution 2, which runs the error-handling workflow
  • Worker B picks up the job and runs execution 2 successfully

In this scenario, we end up with inconsistent event logs:

n8nEventLog.log (main)
- n8n.workflow.started for execution 1
- n8n.workflow.failed for execution 1

n8nEventLog-worker.log (worker A)
- n8n.node.{started|finished} events for execution 1

n8nEventLog-worker.log (worker B)
- n8n.workflow.started for execution 2
- n8n.node.{started|finished} events for execution 2
- n8n.workflow.success for execution 2

This is because of our lifecycle hooks setup (see the sketch after this list):

  • Main uses getLifecycleHooksForScalingMain which includes hookFunctionsWorkflowEvents
  • Worker uses getLifecycleHooksForScalingWorker which excludes hookFunctionsWorkflowEvents
  • Subexecutions use getLifecycleHooksForSubExecutions which includes hookFunctionsWorkflowEvents
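
For illustration only, here is a minimal sketch of how these hook bundles might be composed. The factory names come from the description above, but the signatures, return types, and the hookFunctionsNodeEvents helper are simplified assumptions and do not match the real implementation:

```ts
// Hypothetical, simplified sketch: not n8n's actual hook code.
type HookBundle = { name: string };

// Emits n8n.workflow.* events (started/success/failed).
const hookFunctionsWorkflowEvents = (): HookBundle => ({ name: 'workflow-events' });
// Emits n8n.node.* events (started/finished); name assumed for illustration.
const hookFunctionsNodeEvents = (): HookBundle => ({ name: 'node-events' });

// Main includes workflow events, so execution 1's started/failed land in main's event log.
export const getLifecycleHooksForScalingMain = (): HookBundle[] => [hookFunctionsWorkflowEvents()];

// Worker excludes workflow events, so its event log only has node-level events for execution 1.
export const getLifecycleHooksForScalingWorker = (): HookBundle[] => [hookFunctionsNodeEvents()];

// Sub-executions (e.g. the error workflow) include workflow events again, which is why
// worker B's event log contains workflow.started and workflow.success for execution 2.
export const getLifecycleHooksForSubExecutions = (): HookBundle[] => [
	hookFunctionsWorkflowEvents(),
	hookFunctionsNodeEvents(),
];
```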

This event log inconsistency breaks crash recovery, which assumes that all events for an execution are present in a given instance's event log. As a result, crash recovery wrongly identifies execution 1 as unfinished (instead of finished with an error) and wrongly amends the details of a finished execution (instead of skipping it). Because the amended details are incomplete, the next restart triggers crash recovery yet again.

In short, every worker tries to recover its own finished failed executions (instead of only unfinished crashed ones), even though the worker does not have the data for them, and the cycle repeats on every restart. The root solution would be to straighten out our hooks setup so that event logs for workers remain with workers, but lifecycle hooks are so central and so legacy that this would be a big dedicated effort.

Hence, for now, this fix filters out finished executions downstream and defers crash recovery logging until we know which execution IDs were actually recovered.
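
A minimal sketch of the idea, with hypothetical names and signatures (the real change lives in message-event-bus.ts; the status set and helpers below are assumptions for illustration, not the actual diff):

```ts
// Hypothetical sketch of the approach, not the actual diff.
type ExecutionSummary = { id: string; status: string };

const FINISHED_STATUSES = new Set(['success', 'error', 'canceled']); // assumed set of terminal statuses

export async function recoverAbandonedExecutions(
	candidateIds: string[],
	findExecutions: (ids: string[]) => Promise<ExecutionSummary[]>,
	recoverExecution: (id: string) => Promise<boolean>, // true if details were actually amended
	log: (msg: string, meta: Record<string, unknown>) => void,
) {
	const executions = await findExecutions(candidateIds);

	// Filter out finished executions: a worker's event log may contain messages for them,
	// but they already reached a terminal state and must not be "recovered".
	const unfinished = executions.filter((e) => !FINISHED_STATUSES.has(e.status));

	// Defer logging until we know which executions were actually recovered.
	const recoveredIds: string[] = [];
	for (const { id } of unfinished) {
		if (await recoverExecution(id)) recoveredIds.push(id);
	}

	if (recoveredIds.length > 0) {
		log('Recovered executions after unclean shutdown', { executionIds: recoveredIds });
	}
}
```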

Related Linear tickets, GitHub issues, and Community forum posts

n/a

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR labeled with release/backport (if the PR is an urgent fix that needs to be backported)

@cubic-dev-ai (bot) left a comment

cubic found 1 issue across 3 files. Review it in cubic.dev


codecov bot commented Jun 6, 2025

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines                                 Patch %   Lines
...rc/eventbus/message-event-bus/message-event-bus.ts    0.00%     6 Missing ⚠️


@ivov requested a review from despairblue on Jun 6, 2025 12:54
@n8n-assistant (bot) added the labels core (Enhancement outside /nodes-base and /editor-ui) and n8n team (Authored by the n8n team) on Jun 6, 2025
@despairblue (Contributor) commented:

Out of curiosity, what would happen if we were to include hookFunctionsWorkflowEvents in getLifecycleHooksForScalingWorker or at least the hook that logs?

An inline review comment was left on this excerpt of the new test:

{ status: 'error', data: stringify({ runData: { foo: 'bar' } }) },
workflow,
);
const messages = setupMessages(execution.id, 'Some workflow');

Reviewer (Contributor):

This test is not creating 100% the same scenario you're covering in the PR description. It does contain an additional n8n.workflow.started message.

@ivov (Contributor, Author) replied on Jun 10, 2025:

Exactly. What the test verifies is that crash recovery bails if the execution has a terminal error state, no matter what messages we may have about it.
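
For context, here is a self-contained, Jest-style sketch of that property, with hypothetical names and a stubbed recovery function (not the actual test in this PR):

```ts
// Hypothetical sketch of the property under test; names and statuses are assumed.
type Execution = { id: string; status: string };
type EventMessage = { eventName: string; executionId: string };

const TERMINAL_STATUSES = new Set(['success', 'error', 'canceled']);

// Stub standing in for crash recovery: bail (return null) on terminal states.
function recoverIfUnfinished(execution: Execution, messages: EventMessage[]): EventMessage[] | null {
	if (TERMINAL_STATUSES.has(execution.status)) return null;
	return messages.filter((m) => m.executionId === execution.id);
}

test('bails on an execution that already finished with an error', () => {
	const execution: Execution = { id: '1', status: 'error' };
	const messages: EventMessage[] = [
		{ eventName: 'n8n.workflow.started', executionId: '1' },
		{ eventName: 'n8n.node.started', executionId: '1' },
	];
	// Even though messages exist for this execution, recovery must not amend it.
	expect(recoverIfUnfinished(execution, messages)).toBeNull();
});
```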

✅ All Cypress E2E specs passed

@ivov (Contributor, Author) commented on Jun 10, 2025:

> Out of curiosity, what would happen if we were to include hookFunctionsWorkflowEvents in getLifecycleHooksForScalingWorker or at least the hook that logs?

I did try this and the worker gets the full logs, but I hesitate to change this because these hooks have historically always lived with main, and I'm not confident I can foresee all the downstream consequences of moving them; e.g. for telemetry purposes the execution starts when the trigger is received, not when a worker becomes available. Let's tackle this separately.

@ivov merged commit 53b6812 into master on Jun 10, 2025
55 checks passed
@ivov deleted the prevent-worker-from-recovering-finished-executions branch on June 10, 2025 09:12
@janober (Member) commented on Jun 11, 2025:

Got released with [email protected]
