
fix(metric-engine): properly propagate errors during batch open operation #6325


Merged
2 commits merged into GreptimeTeam:main from fix/metric-batch-open
Jun 19, 2025

Conversation

WenyXu
Member

@WenyXu WenyXu commented Jun 16, 2025

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

This PR fixes error propagation in the metric engine's batch open operation. Previously, an error raised by the underlying engine during batch open was not propagated to the caller. The caller saw only a generic "Physical region not found" error, which hid the root cause and made startup failures hard to diagnose.

After:

{"timestamp":"2025-06-16T12:36:05.412304Z","level":"INFO","fields":{"message":"going to open 1 region(s)"},"target":"datanode::datanode"}
{"timestamp":"2025-06-16T12:36:05.413249Z","level":"INFO","fields":{"message":"Try to open region 4509732438016(1050, 16777216), worker: 0"},"target":"mito2::worker::handle_open"}
{"timestamp":"2025-06-16T12:36:05.413674Z","level":"INFO","fields":{"message":"Try to open region 4509715660800(1050, 0), worker: 0"},"target":"mito2::worker::handle_open"}
{"timestamp":"2025-06-16T12:36:05.414027Z","level":"INFO","fields":{"message":"Checkpoint not found in data/greptime/public/1050/1050_0000000000/metadata/manifest/, build manifest from scratch"},"target":"mito2::manifest::manager"}
{"timestamp":"2025-06-16T12:36:05.414202Z","level":"INFO","fields":{"message":"Checkpoint not found in data/greptime/public/1050/1050_0000000000/data/manifest/, build manifest from scratch"},"target":"mito2::manifest::manager"}
{"timestamp":"2025-06-16T12:36:05.415271Z","level":"ERROR","fields":{"message":"Failed to open region: 4509715660800(1050, 0)","err":"0: Failed to open mito region, region type: metadata, at src/metric-engine/src/engine/open.rs:106:14\n1: Failed to open region, at /home/weny/Projects/greptimedb/src/mito2/src/worker/handle_open.rs:140:54\n2: Incompatible WAL provider change. This is typically caused by changing WAL provider in database config file without completely cleaning existing files. Global provider: `raft_engine`, region provider: `kafka`"},"target":"datanode::region_server","span":{"name":"handle_batch_open_requests"},"spans":[{"name":"handle_batch_open_requests"}]}
Error: 0: Failed to start datanode, at /home/weny/Projects/greptimedb/src/cmd/src/datanode/builder.rs:128:14
1: Unexpected, violated: Failed to open batch regions: 0: Failed to open mito region, region type: metadata, at src/metric-engine/src/engine/open.rs:106:14
1: Failed to open region, at /home/weny/Projects/greptimedb/src/mito2/src/worker/handle_open.rs:140:54
2: Incompatible WAL provider change. This is typically caused by changing WAL provider in database config file without completely cleaning existing files. Global provider: `raft_engine`, region provider: `kafka`, at /home/weny/Projects/greptimedb/src/datanode/src/region_server.rs:802:14

Before:

{"timestamp":"2025-06-16T08:55:11.003324Z","level":"INFO","fields":{"message":"Try to open region 4449602895872(1036, 16777216), worker: 0"},"target":"mito2::worker::handle_open"}
{"timestamp":"2025-06-16T08:55:11.003817Z","level":"INFO","fields":{"message":"Try to open region 4449586118656(1036, 0), worker: 0"},"target":"mito2::worker::handle_open"}
{"timestamp":"2025-06-16T08:55:11.004311Z","level":"INFO","fields":{"message":"Checkpoint not found in data/greptime/public/1036/1036_0000000000/metadata/manifest/, build manifest from scratch"},"target":"mito2::manifest::manager"}
{"timestamp":"2025-06-16T08:55:11.004408Z","level":"INFO","fields":{"message":"Checkpoint not found in data/greptime/public/1036/1036_0000000000/data/manifest/, build manifest from scratch"},"target":"mito2::manifest::manager"}
{"timestamp":"2025-06-16T08:55:11.005561Z","level":"ERROR","fields":{"message":"Failed to open batch regions","err":"0: Failed to open batch regions, at /home/weny/Projects/greptimedb/src/datanode/src/region_server.rs:763:14\n1: Physical region 4449586118656(1036, 0) not found, at src/metric-engine/src/engine/open.rs:78:18"},"target":"datanode::region_server","span":{"name":"handle_batch_open_requests"},"spans":[{"name":"handle_batch_open_requests"}]}
Error: 0: Failed to start datanode, at /home/weny/Projects/greptimedb/src/cmd/src/datanode/builder.rs:128:14
1: Unexpected, violated: Failed to open batch regions: 0: Failed to open batch regions, at /home/weny/Projects/greptimedb/src/datanode/src/region_server.rs:763:14
1: Physical region 4449586118656(1036, 0) not found, at src/metric-engine/src/engine/open.rs:78:18, at /home/weny/Projects/greptimedb/src/datanode/src/region_server.rs:802:14
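
The numbered "0:", "1:", "2:" lines in the logs are an error chain: each layer adds context and keeps the layer below reachable as its source. The sketch below shows that general pattern with the standard library only. It is not the PR's code (GreptimeDB builds its errors with snafu), and the type names `IncompatibleWalProvider` and `OpenRegionFailed` and the helper `render_chain` are hypothetical, chosen to mirror the messages in the "After" log.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical root cause, standing in for the WAL provider mismatch in the log.
#[derive(Debug)]
struct IncompatibleWalProvider {
    global: String,
    region: String,
}

impl fmt::Display for IncompatibleWalProvider {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Incompatible WAL provider change. Global provider: `{}`, region provider: `{}`",
            self.global, self.region
        )
    }
}

impl Error for IncompatibleWalProvider {}

/// Hypothetical wrapper: adds context but keeps the cause reachable via `source()`.
#[derive(Debug)]
struct OpenRegionFailed {
    region_type: &'static str,
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for OpenRegionFailed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Failed to open mito region, region type: {}", self.region_type)
    }
}

impl Error for OpenRegionFailed {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Render an error and all of its sources as numbered lines, outermost first.
fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut lines = Vec::new();
    let mut current = Some(err);
    let mut depth = 0;
    while let Some(e) = current {
        lines.push(format!("{}: {}", depth, e));
        current = e.source();
        depth += 1;
    }
    lines.join("\n")
}
```

If a layer drops its source and substitutes an unrelated message, the chain ends early and the root cause never reaches the caller, which is the kind of output the "Before" log shows.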

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@WenyXu WenyXu requested a review from waynexia as a code owner June 16, 2025 12:23
@WenyXu WenyXu requested a review from fengjiachun June 16, 2025 12:23
@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Jun 16, 2025
@WenyXu WenyXu marked this pull request as draft June 17, 2025 05:00
@WenyXu WenyXu marked this pull request as ready for review June 17, 2025 06:23
@waynexia waynexia requested a review from Copilot June 17, 2025 06:25
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes error propagation issues in the metric engine’s batch open operation by adjusting error handling and consolidating state recovery.

  • Introduces a new error variant (NoOpenRegionResult) for missing open region results.
  • Refactors error propagation and state recovery flows in the engine’s sync, open, and catchup modules.
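
One way to read the new variant: a batch open can fail for a region in two different ways, either no result comes back for it at all, or a result comes back and it is an error. The sketch below separates the two cases using only the standard library. The enum `OpenError`, its fields, its messages, and the helper `require_result` are assumptions for illustration; the actual `NoOpenRegionResult` variant is defined with snafu in src/metric-engine/src/error.rs and may carry different fields.

```rust
use std::fmt;

/// Hypothetical error type; not the definition used in the PR.
#[derive(Debug, PartialEq)]
enum OpenError {
    /// The batch open returned no entry at all for this region.
    NoOpenRegionResult { region_id: u64 },
    /// The batch open returned an entry, and that entry is a failure.
    OpenMitoRegion { region_type: &'static str, cause: String },
}

impl fmt::Display for OpenError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OpenError::NoOpenRegionResult { region_id } => {
                write!(f, "No open region result for region {}", region_id)
            }
            OpenError::OpenMitoRegion { region_type, cause } => {
                write!(f, "Failed to open mito region, region type: {}: {}", region_type, cause)
            }
        }
    }
}

impl std::error::Error for OpenError {}

/// Turn an optional per-region result into a definite outcome,
/// keeping the underlying cause when the open itself failed.
fn require_result<T>(
    region_id: u64,
    region_type: &'static str,
    result: Option<Result<T, String>>,
) -> Result<T, OpenError> {
    match result {
        None => Err(OpenError::NoOpenRegionResult { region_id }),
        Some(Err(cause)) => Err(OpenError::OpenMitoRegion { region_type, cause }),
        Some(Ok(value)) => Ok(value),
    }
}
```

Keeping the two cases apart means a missing result is reported as such, while a real failure still carries its original cause.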

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/metric-engine/src/error.rs | Adds new error variant and updates error status mapping. |
| src/metric-engine/src/engine/sync.rs | Removes redundant primary key encoding retrieval in state recovery. |
| src/metric-engine/src/engine/open.rs | Refactors open logic; converts physical region IDs to a HashMap and adjusts error propagation. |
| src/metric-engine/src/engine/catchup.rs | Updates state recovery to align with new error propagation design. |
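
The HashMap-based lookup mentioned for open.rs can be pictured roughly as follows. This is a simplified sketch, not the PR's implementation: `RegionId` is reduced to a plain `u64`, errors are plain strings, and the function names `index_results` and `open_physical_region` are hypothetical. The idea is to index the batch results by region ID, then have each physical region take ownership of the results for its metadata and data regions and surface the first failure unchanged.

```rust
use std::collections::HashMap;

// Simplified stand-ins; the real types live in GreptimeDB and differ from these.
type RegionId = u64;
type OpenResult = Result<(), String>;

/// Index the batch results by region id so each physical region can claim its own.
fn index_results(results: Vec<(RegionId, OpenResult)>) -> HashMap<RegionId, OpenResult> {
    results.into_iter().collect()
}

/// Check one claimed result, labelling any failure with the region type.
fn check(region_type: &str, result: Option<OpenResult>) -> Result<(), String> {
    match result {
        None => Err(format!("no open result for {} region", region_type)),
        Some(Err(cause)) => Err(format!("failed to open {} region: {}", region_type, cause)),
        Some(Ok(())) => Ok(()),
    }
}

/// Take the metadata and data results out of the map and return the first failure.
fn open_physical_region(
    results: &mut HashMap<RegionId, OpenResult>,
    metadata_region_id: RegionId,
    data_region_id: RegionId,
) -> Result<(), String> {
    // `remove` hands over ownership, so the error value can be moved out without cloning.
    let metadata = results.remove(&metadata_region_id);
    let data = results.remove(&data_region_id);
    check("metadata", metadata)?;
    check("data", data)?;
    Ok(())
}
```

Removing both entries before checking either one leaves the map free of stale results for that physical region, whichever of the two failed.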
Comments suppressed due to low confidence (2)

src/metric-engine/src/engine/open.rs:103

  • [nitpick] Consider logging additional context or including a comment explaining the error wrapping with BoxedError to help future debugging of failure cases in open_physical_region_with_results.
let _ = metadata_region_result.context(NoOpenRegionResultSnafu {

src/metric-engine/src/engine/catchup.rs:82

  • [nitpick] Since the primary key encoding is now retrieved internally within recover_states, please add an inline comment or update the function’s docstring to document this design change for clarity.
self.recover_states(region_id, physical_region_options)

Collaborator

@fengjiachun fengjiachun left a comment


LGTM

@WenyXu WenyXu mentioned this pull request Jun 19, 2025
18 tasks
Signed-off-by: WenyXu <[email protected]>
@WenyXu WenyXu enabled auto-merge June 19, 2025 06:29
@WenyXu WenyXu added this pull request to the merge queue Jun 19, 2025
Merged via the queue into GreptimeTeam:main with commit 5231505 Jun 19, 2025
41 checks passed
@WenyXu WenyXu deleted the fix/metric-batch-open branch June 19, 2025 07:03
Labels
docs-not-required This change does not impact docs. size/S

3 participants