fix(metric-engine): handle stale metadata region recovery failures #6395
Signed-off-by: WenyXu <[email protected]>
LGTM
// If the metadata region is opened with a stale manifest,
// the metric engine may fail to recover logical tables from the metadata region,
// as the manifest could reference files that have already been deleted
// due to compaction operations performed by the region leader.
I'm not sure we can repair the metadata in this case; it appears to be expected and may occur.
After the region leader completes compaction, the files referenced by the old manifest (i.e., the compaction inputs) are deleted from object storage. If a region still holds a stale manifest in this case, we need to close it first. Otherwise, during migration retries, it will keep reopening the same stale state and fail again.
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
fixes #6273
What's changed and what's your intention?
This PR addresses an issue in the metric engine where stale metadata regions could cause recovery failures, leading to orphaned regions.
Adds a close_physical_region_on_recovery_failure method to handle cleanup on recovery failures.
Problem
When the metadata region is opened with a stale manifest, the metric engine may fail to recover logical tables from the metadata region. This happens because the manifest could reference files that have already been deleted due to compaction operations performed by the region leader. In such cases, the recovery process fails but the regions (metadata and data) remain open. When the upper layer retries the operation, it fails again because the metric engine has not closed the stale metadata region.
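The close-on-failure pattern described above can be sketched roughly as follows. This is a toy model, not the actual metric engine code: ToyEngine, EngineError, RegionId, the stale_regions set, and the file name in the error are all hypothetical, and only the method name close_physical_region_on_recovery_failure comes from this PR (its signature here is an assumption).

```rust
use std::collections::HashSet;

/// Hypothetical region identifier; the real engine has its own type.
type RegionId = u64;

/// Hypothetical stand-in for the engine's error type.
#[derive(Debug, PartialEq)]
enum EngineError {
    /// The manifest references a file that no longer exists in object storage.
    MissingFile(String),
}

/// Toy engine that only tracks which physical regions are open.
#[derive(Default)]
struct ToyEngine {
    open_regions: HashSet<RegionId>,
    /// Regions whose manifest is stale (assumption: set by the caller to simulate the bug).
    stale_regions: HashSet<RegionId>,
}

impl ToyEngine {
    /// Opens the physical region (metadata and data) and marks it as open.
    fn open_physical_region(&mut self, region_id: RegionId) {
        self.open_regions.insert(region_id);
    }

    /// Simulates recovering logical tables from the metadata region.
    fn recover_states(&self, region_id: RegionId) -> Result<(), EngineError> {
        if self.stale_regions.contains(&region_id) {
            // Hypothetical file name standing in for a deleted compaction input.
            return Err(EngineError::MissingFile("compaction-input.parquet".to_string()));
        }
        Ok(())
    }

    /// Closes the region after a failed recovery so a retry does not reuse stale state.
    fn close_physical_region_on_recovery_failure(&mut self, region_id: RegionId) {
        self.open_regions.remove(&region_id);
    }

    /// Opens, then recovers; on failure, closes the region before returning the error.
    fn open_region(&mut self, region_id: RegionId) -> Result<(), EngineError> {
        self.open_physical_region(region_id);
        if let Err(err) = self.recover_states(region_id) {
            self.close_physical_region_on_recovery_failure(region_id);
            return Err(err);
        }
        Ok(())
    }
}
```

In this sketch a failed open_region leaves nothing in open_regions, so a later retry reopens the region from scratch instead of hitting the same stale state.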
PR Checklist
Please convert it to a draft if some of the following conditions are not met.