Skip to content

feat: CRC for PM replay #982

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

zachschuermann
Copy link
Collaborator

@zachschuermann zachschuermann commented May 27, 2025

What changes are proposed in this pull request?

Leverage CRC files for P+M replay. We do this by modifying our PM log replay to:

  1. If we have CRC at this version, we can just read the Protocol and Metadata from it.
  2. Else, create a 'pruned' (CRC-rooted) log segment that is only the commits after the latest CRC (if there is no CRC, we just use the whole log segment)
  3. Do PM replay on the pruned log segment to find the latest Protocol/Metadata actions.
  4. If no Protocol/Metadata was found, return Protocol/Metadata from CRC (if CRC doesn't exist, we are guaranteed to get a Protocol/Metadata from checkpoint in (3)).

How was this change tested?

new UTs

Copy link

codecov bot commented May 27, 2025

Codecov Report

Attention: Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.

Project coverage is 85.07%. Comparing base (f139209) to head (76c99c4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/actions/crc.rs 78.57% 0 Missing and 3 partials ⚠️
kernel/src/log_segment.rs 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #982      +/-   ##
==========================================
- Coverage   85.07%   85.07%   -0.01%     
==========================================
  Files          90       90              
  Lines       23387    23410      +23     
  Branches    23387    23410      +23     
==========================================
+ Hits        19896    19915      +19     
  Misses       2479     2479              
- Partials     1012     1016       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zachschuermann zachschuermann marked this pull request as ready for review June 12, 2025 20:16
@zachschuermann zachschuermann changed the title [wip] feat: CRC for PM replay feat: CRC for PM replay Jun 12, 2025
Comment on lines +134 to +135
pub(crate) protocol: Option<Protocol>,
pub(crate) metadata: Option<Metadata>,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fixing an incorrect usage of Default --> we want Default to just set up 'empty' P/M for us to fill in after visiting - not to generate some default P/M itself.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Jun 12, 2025
});

let json = crc_json.to_string().into_bytes();
// let (engine, store) = DefaultEngine::new_memory();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyone know why I can't seem to get the extension trait working here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know now and can remove the commented out line (sadly)

@@ -94,14 +93,46 @@ pub(crate) struct DeletedRecordCountsHistogram {
pub(crate) deleted_record_counts: Vec<i64>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub(crate) struct ProtocolMetadata {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we're just using this here to return a protocol and metadata? We have lots of other places where we pair them together as either arguments, or a (protocol, metadata). I'd say here can we just do:

fn try_protocol_and_metadata_from_crc(path: FileMeta, engine: &dyn Engine) -> DeltaResult<(Protocol, Metadata)> {}

We could pass through everything and unify on a single struct in the future, but for now this is just confusing

/// protocol/metadata cannot be found in the CRC file (specification requires both to be
/// present).
pub(crate) fn try_from_crc(path: FileMeta, engine: &dyn Engine) -> DeltaResult<Self> {
let json = engine.json_handler();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro_nit:

Suggested change
let json = engine.json_handler();
let json_handler = engine.json_handler();

});

let json = crc_json.to_string().into_bytes();
// let (engine, store) = DefaultEngine::new_memory();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know now and can remove the commented out line (sadly)

let protocol_metadata_log_segment: Cow<'_, Self> =
if let Some(latest_crc_file) = &self.latest_crc_file {
let mut log_segment = self.clone();
log_segment.checkpoint_version = None;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we pruned here so we know that any crc has to be after the checkpoint?

// 2. construct a "protocol-metadata log segment"
// a) if we have a CRC file, we only need commits (CRC_version, target_version]
// b) if we don't have a CRC, we proceed as normal
let protocol_metadata_log_segment: Cow<'_, Self> =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to add a start_version: Option<Version> to replay_for_metadata. In that case, find_commit_cover could quite easily just drop any versions less than start_version, and we wouldn't need this code really at all (and could avoid the clone)

Comment on lines +498 to 505
if let Some(latest_crc_file) = &self.latest_crc_file {
let path = &latest_crc_file.location;
let pm = ProtocolMetadata::try_from_crc(path.clone(), engine)?;
metadata_opt.get_or_insert(pm.metadata);
protocol_opt.get_or_insert(pm.protocol);
}

Ok((metadata_opt, protocol_opt))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if let Some(latest_crc_file) = &self.latest_crc_file {
let path = &latest_crc_file.location;
let pm = ProtocolMetadata::try_from_crc(path.clone(), engine)?;
metadata_opt.get_or_insert(pm.metadata);
protocol_opt.get_or_insert(pm.protocol);
}
Ok((metadata_opt, protocol_opt))
if let Some(latest_crc_file) = &self.latest_crc_file {
let path = &latest_crc_file.location;
let pm = ProtocolMetadata::try_from_crc(path.clone(), engine)?;
Ok((metadata_opt.or(Some(pm.metadata)), protocol_opt.or(Some(pm.protocol))))
} else {
Ok((metadata_opt, protocol_opt))
}

@@ -164,4 +166,15 @@ impl DefaultEngineExtension for DefaultEngine<TokioBackgroundExecutor> {
TokioBackgroundExecutor::new().into(),
))
}

fn new_memory() -> (Arc<DefaultEngine<TokioBackgroundExecutor>>, Arc<InMemory>) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably can remove this now right, or move it to the kernel crate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that require a major version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants