
fix(core): Prevent worker from recovering finished executions #16094


Merged
1 commit merged into master on Jun 10, 2025

Conversation

@ivov (Contributor) commented on Jun 6, 2025

Summary

Crash recovery has plenty of known issues, but I found a fundamental one.

Consider this scenario:

  • Main enqueues a job for execution 1, whose workflow has an error-handling workflow
  • Worker A picks up the job and runs execution 1, which throws, so the execution finishes with an error status
  • Worker A enqueues a job for execution 2, which runs the error-handling workflow
  • Worker B picks up the job and runs execution 2 successfully

In this scenario, we end up with inconsistent event logs:

n8nEventLog.log (main)
- n8n.workflow.started for execution 1
- n8n.workflow.failed for execution 1

n8nEventLog-worker.log (worker A)
- n8n.node.{started|finished} events for execution 1

n8nEventLog-worker.log (worker B)
- n8n.workflow.started for execution 2
- n8n.node.{started|finished} events for execution 2
- n8n.workflow.success for execution 2

This is because of our lifecycle hooks setup (see the sketch after this list):

  • Main uses getLifecycleHooksForScalingMain which includes hookFunctionsWorkflowEvents
  • Worker uses getLifecycleHooksForScalingWorker which excludes hookFunctionsWorkflowEvents
  • Subexecutions use getLifecycleHooksForSubExecutions which includes hookFunctionsWorkflowEvents
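
For illustration only, here is a minimal sketch of how these hook bundles might be composed. The factory names come from the description above, but the signatures, return types, and the hookFunctionsNodeEvents helper are simplified assumptions and do not match the real implementation:

```ts
// Hypothetical, simplified sketch: not n8n's actual hook code.
type HookBundle = { name: string };

// Emits n8n.workflow.* events (started/success/failed).
const hookFunctionsWorkflowEvents = (): HookBundle => ({ name: 'workflow-events' });
// Emits n8n.node.* events (started/finished); name assumed for illustration.
const hookFunctionsNodeEvents = (): HookBundle => ({ name: 'node-events' });

// Main includes workflow events, so execution 1's started/failed land in main's event log.
export const getLifecycleHooksForScalingMain = (): HookBundle[] => [hookFunctionsWorkflowEvents()];

// Worker excludes workflow events, so its event log only has node-level events for execution 1.
export const getLifecycleHooksForScalingWorker = (): HookBundle[] => [hookFunctionsNodeEvents()];

// Sub-executions (e.g. the error workflow) include workflow events again, which is why
// worker B's event log contains workflow.started and workflow.success for execution 2.
export const getLifecycleHooksForSubExecutions = (): HookBundle[] => [
	hookFunctionsWorkflowEvents(),
	hookFunctionsNodeEvents(),
];
```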

This event log inconsistency breaks crash recovery, which assumes that all events for an execution are present in a given instance's event log. As a result, crash recovery wrongly identifies execution 1 as unfinished (instead of finished with an error) and wrongly amends the details of a finished execution (instead of skipping it). Because the amended details are incomplete, the next restart triggers crash recovery yet again.

In short, every worker tries to recover its own finished failed executions (instead of only unfinished crashed ones), even though the worker does not have the data for them, and the cycle repeats on every restart. The root solution would be to straighten out our hooks setup so that event logs for workers remain with workers, but lifecycle hooks are so central and so legacy that this would be a big dedicated effort.

Hence, for now, this fix filters out finished executions downstream and defers crash recovery logging until we know which execution IDs were actually recovered.
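
A minimal sketch of the idea, with hypothetical names and signatures (the real change lives in message-event-bus.ts; the status set and helpers below are assumptions for illustration, not the actual diff):

```ts
// Hypothetical sketch of the approach, not the actual diff.
type ExecutionSummary = { id: string; status: string };

const FINISHED_STATUSES = new Set(['success', 'error', 'canceled']); // assumed set of terminal statuses

export async function recoverAbandonedExecutions(
	candidateIds: string[],
	findExecutions: (ids: string[]) => Promise<ExecutionSummary[]>,
	recoverExecution: (id: string) => Promise<boolean>, // true if details were actually amended
	log: (msg: string, meta: Record<string, unknown>) => void,
) {
	const executions = await findExecutions(candidateIds);

	// Filter out finished executions: a worker's event log may contain messages for them,
	// but they already reached a terminal state and must not be "recovered".
	const unfinished = executions.filter((e) => !FINISHED_STATUSES.has(e.status));

	// Defer logging until we know which executions were actually recovered.
	const recoveredIds: string[] = [];
	for (const { id } of unfinished) {
		if (await recoverExecution(id)) recoveredIds.push(id);
	}

	if (recoveredIds.length > 0) {
		log('Recovered executions after unclean shutdown', { executionIds: recoveredIds });
	}
}
```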

Related Linear tickets, GitHub issues, and Community forum posts

n/a

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR labeled with release/backport (if the PR is an urgent fix that needs to be backported)

@cubic-dev-ai (bot) left a comment

cubic found 1 issue across 3 files. Review it in cubic.dev


codecov bot commented Jun 6, 2025

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines                                 Patch %   Lines
...rc/eventbus/message-event-bus/message-event-bus.ts    0.00%     6 Missing ⚠️


@ivov requested a review from despairblue on Jun 6, 2025 12:54
@n8n-assistant (bot) added the labels core (Enhancement outside /nodes-base and /editor-ui) and n8n team (Authored by the n8n team) on Jun 6, 2025
@despairblue (Contributor) commented:

Out of curiosity, what would happen if we were to include hookFunctionsWorkflowEvents in getLifecycleHooksForScalingWorker or at least the hook that logs?

An inline review comment was left on this excerpt of the new test:

{ status: 'error', data: stringify({ runData: { foo: 'bar' } }) },
workflow,
);
const messages = setupMessages(execution.id, 'Some workflow');

Reviewer (Contributor):

This test is not creating 100% the same scenario you're covering in the PR description. It does contain an additional n8n.workflow.started message.

@ivov (Contributor, Author) replied on Jun 10, 2025:

Exactly. What the test verifies is that crash recovery bails if the execution has a terminal error state, no matter what messages we may have about it.
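
For context, here is a self-contained, Jest-style sketch of that property, with hypothetical names and a stubbed recovery function (not the actual test in this PR):

```ts
// Hypothetical sketch of the property under test; names and statuses are assumed.
type Execution = { id: string; status: string };
type EventMessage = { eventName: string; executionId: string };

const TERMINAL_STATUSES = new Set(['success', 'error', 'canceled']);

// Stub standing in for crash recovery: bail (return null) on terminal states.
function recoverIfUnfinished(execution: Execution, messages: EventMessage[]): EventMessage[] | null {
	if (TERMINAL_STATUSES.has(execution.status)) return null;
	return messages.filter((m) => m.executionId === execution.id);
}

test('bails on an execution that already finished with an error', () => {
	const execution: Execution = { id: '1', status: 'error' };
	const messages: EventMessage[] = [
		{ eventName: 'n8n.workflow.started', executionId: '1' },
		{ eventName: 'n8n.node.started', executionId: '1' },
	];
	// Even though messages exist for this execution, recovery must not amend it.
	expect(recoverIfUnfinished(execution, messages)).toBeNull();
});
```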

✅ All Cypress E2E specs passed

@ivov (Contributor, Author) commented on Jun 10, 2025:

> Out of curiosity, what would happen if we were to include hookFunctionsWorkflowEvents in getLifecycleHooksForScalingWorker or at least the hook that logs?

I did try this and the worker gets the full logs, but I hesitate to change this because these hooks have historically always lived with main, and I'm not confident I can foresee all the downstream consequences of moving them; e.g. for telemetry purposes the execution starts when the trigger is received, not when a worker becomes available. Let's tackle this separately.

@ivov merged commit 53b6812 into master on Jun 10, 2025
55 checks passed
@ivov deleted the prevent-worker-from-recovering-finished-executions branch on June 10, 2025 09:12
@janober (Member) commented on Jun 11, 2025:

Got released with [email protected]
