fix(metric-engine): handle stale metadata region recovery failures #6395

Open: wants to merge 2 commits into base: main


WenyXu
Member

@WenyXu WenyXu commented Jun 25, 2025

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

fixes #6273

What's changed and what's your intention?

This PR addresses an issue in the metric engine where a stale metadata region could cause recovery to fail and leave orphaned regions behind.

  • Added proper error handling and cleanup mechanisms when region recovery fails
  • Introduced close_physical_region_on_recovery_failure method to handle cleanup on recovery failures

Problem

When the metadata region is opened with a stale manifest, the metric engine may fail to recover logical tables from the metadata region. This happens because the manifest can reference files that have already been deleted by compaction on the region leader. In such cases, the recovery process fails but the region (metadata/data region) remains open. When the upper layer retries the operation, it still fails because the metric engine never closed the stale metadata region.
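
To make the failure mode concrete, here is a minimal, self-contained sketch of the retry behavior described above. It is a toy model, not the actual metric engine code: `Storage`, `ToyMetricEngine`, `simulate_retry`, the `close_on_failure` flag, and the file names are all hypothetical. Only the method name `close_physical_region_on_recovery_failure` is taken from this PR, and its body here is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical region identifier (not the GreptimeDB type).
type RegionId = u64;

/// Hypothetical stand-in for object storage plus the manifest visible to a region.
#[derive(Default)]
struct Storage {
    /// Files that still exist in object storage.
    files: HashSet<String>,
    /// Manifest per region: the files the region expects to read.
    manifests: HashMap<RegionId, Vec<String>>,
}

/// Toy engine: each open region keeps the manifest snapshot taken at open time.
struct ToyMetricEngine {
    open_regions: HashMap<RegionId, Vec<String>>,
    /// Assumption for illustration: toggles the cleanup this PR introduces.
    close_on_failure: bool,
}

impl ToyMetricEngine {
    /// Opens the region if needed, then tries to recover logical tables.
    fn open_region(&mut self, storage: &Storage, id: RegionId) -> Result<(), String> {
        // An already-open region keeps its old snapshot; this is the stale state.
        if !self.open_regions.contains_key(&id) {
            let manifest = storage
                .manifests
                .get(&id)
                .cloned()
                .ok_or_else(|| format!("region {id} has no manifest"))?;
            self.open_regions.insert(id, manifest);
        }
        if let Err(e) = self.recover_logical_tables(storage, id) {
            if self.close_on_failure {
                self.close_physical_region_on_recovery_failure(id);
            }
            return Err(e);
        }
        Ok(())
    }

    /// Recovery fails if the snapshot references a file that no longer exists.
    fn recover_logical_tables(&self, storage: &Storage, id: RegionId) -> Result<(), String> {
        let manifest = self
            .open_regions
            .get(&id)
            .ok_or_else(|| format!("region {id} is not open"))?;
        for file in manifest {
            if !storage.files.contains(file) {
                return Err(format!("region {id}: manifest references missing file {file}"));
            }
        }
        Ok(())
    }

    /// Drops the open region so the next attempt re-reads the manifest.
    fn close_physical_region_on_recovery_failure(&mut self, id: RegionId) {
        self.open_regions.remove(&id);
    }
}

/// Returns (first_attempt_ok, retry_ok) for a region first opened with a stale manifest.
fn simulate_retry(close_on_failure: bool) -> (bool, bool) {
    let mut storage = Storage::default();
    // The leader compacted a.parquet into b.parquet and deleted the input file.
    storage.files.insert("b.parquet".to_string());
    // The manifest seen on the first open is stale and still lists a.parquet.
    storage.manifests.insert(1, vec!["a.parquet".to_string()]);

    let mut engine = ToyMetricEngine { open_regions: HashMap::new(), close_on_failure };
    let first = engine.open_region(&storage, 1).is_ok();

    // An up-to-date manifest becomes visible before the retry.
    storage.manifests.insert(1, vec!["b.parquet".to_string()]);
    let retry = engine.open_region(&storage, 1).is_ok();
    (first, retry)
}
```

In this model, `simulate_retry(false)` fails on both attempts because the region stays open with its stale snapshot, while `simulate_retry(true)` fails once and then succeeds on retry because the region was closed and reopened against the current manifest.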

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@github-actions github-actions bot added size/XS docs-not-required This change does not impact docs. labels Jun 25, 2025
@WenyXu WenyXu changed the title fix(metric-engine): handle metadata region recover logical tables failures fix(metric-engine): handle stale metadata region recovery failures Jun 25, 2025
@WenyXu WenyXu force-pushed the fix/stale-metadata-region branch from a72752a to 2fb96dd Compare June 25, 2025 07:39
@github-actions github-actions bot added size/S and removed size/XS labels Jun 25, 2025
@WenyXu WenyXu force-pushed the fix/stale-metadata-region branch from 60bff85 to cfcad14 Compare June 25, 2025 12:59
@WenyXu WenyXu marked this pull request as ready for review June 25, 2025 12:59
@WenyXu WenyXu requested a review from waynexia as a code owner June 25, 2025 12:59
@WenyXu WenyXu requested review from fengjiachun and evenyag June 25, 2025 12:59
Signed-off-by: WenyXu <[email protected]>
@WenyXu WenyXu force-pushed the fix/stale-metadata-region branch from cfcad14 to c11b7b8 Compare June 25, 2025 13:15
Collaborator

@fengjiachun fengjiachun left a comment

LGTM

// If the metadata region is opened with a stale manifest,
// the metric engine may fail to recover logical tables from the metadata region,
// as the manifest could reference files that have already been deleted
// due to compaction operations performed by the region leader.
Contributor

I'm not sure we can repair the metadata in this case; it appears to be expected and may occur.

Member Author

After the region leader completes compaction, the files referenced by the old manifest (i.e., the compaction inputs) are deleted from object storage. In this case, if a region still holds a stale manifest, we need to close it first. Otherwise, during migration retries, it will keep reopening the same stale state and fail again.

Labels
docs-not-required This change does not impact docs. size/S
Development

Successfully merging this pull request may close these issues.

Failed to migrate metric engine if metadata region uses a stale manifest
3 participants