refactor!: txn-specific write_metadata_schema #1021


Open
wants to merge 2 commits into base: main

zachschuermann
Collaborator

What changes are proposed in this pull request?

On the way to supporting stats in writes, we must allow write_metadata_schema to be specified per transaction/table instead of globally, because that schema will eventually include a stats column whose shape is a function of the table schema.

Concretely, this PR

  1. removes the global transaction::get_write_metadata_schema()
  2. adds write_metadata_schema() to WriteContext (and plumbs the schema through)

Note that the actual naming will change in #1019

This PR affects the following public APIs

The (static) transaction::get_write_metadata_schema() is now a method, WriteContext::write_metadata_schema(), and the WriteContext is derived from a specific Transaction.
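
A rough sketch of the new call shape, using simplified stand-in types rather than the kernel's real ones. Only `Transaction`, `WriteContext`, and `write_metadata_schema()` come from this PR; the `Schema` struct, the `path`/`size` columns, and the `new`/`get_write_context` functions are assumptions made for illustration.

```rust
use std::sync::Arc;

/// Stand-in for the kernel's schema type (assumption: just a list of column names).
#[derive(Debug, PartialEq)]
struct Schema {
    columns: Vec<String>,
}

/// Stand-in for the per-transaction write context.
struct WriteContext {
    write_metadata_schema: Arc<Schema>,
}

impl WriteContext {
    /// The schema is read from a context tied to one transaction,
    /// instead of from a free function shared by all tables.
    fn write_metadata_schema(&self) -> &Arc<Schema> {
        &self.write_metadata_schema
    }
}

/// Stand-in for a transaction against one table.
struct Transaction {
    write_metadata_schema: Arc<Schema>,
}

impl Transaction {
    /// Hypothetical constructor; the columns are illustrative, not the real layout.
    fn new() -> Self {
        let columns = vec!["path".to_string(), "size".to_string()];
        Transaction {
            write_metadata_schema: Arc::new(Schema { columns }),
        }
    }

    /// Hypothetical accessor name: derive the write context from this transaction.
    fn get_write_context(&self) -> WriteContext {
        WriteContext {
            write_metadata_schema: Arc::clone(&self.write_metadata_schema),
        }
    }
}
```

Under these assumptions, a caller that previously used the free function would instead write `txn.get_write_context().write_metadata_schema()`, so the schema it sees can differ from table to table.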

How was this change tested?

Minor modifications to existing unit tests.

engine: &dyn Engine,
write_metadata: impl Iterator<Item = &'a dyn EngineData> + Send + 'a,
) -> impl Iterator<Item = DeltaResult<Box<dyn EngineData>>> + Send + 'a {
let evaluation_handler = engine.evaluation_handler();
Collaborator Author

@zachschuermann zachschuermann Jun 16, 2025

review with whitespace hidden

codecov bot commented Jun 16, 2025

Codecov Report

Attention: Patch coverage is 96.29630% with 3 lines in your changes missing coverage. Please review.

Project coverage is 84.71%. Comparing base (bbca626) to head (9d087f7).
Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
kernel/src/transaction.rs | 95.08% | 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1021      +/-   ##
==========================================
+ Coverage   84.70%   84.71%   +0.01%     
==========================================
  Files          92       92              
  Lines       23067    23266     +199     
  Branches    23067    23266     +199     
==========================================
+ Hits        19538    19710     +172     
- Misses       2568     2578      +10     
- Partials      961      978      +17     


@github-actions github-actions bot added the breaking-change Change that require a major version bump label Jun 16, 2025
Collaborator

@scovich scovich left a comment

I'm not understanding how responsibility is divided between kernel and engine here? It seems like the engine provides the write schema? I guess that makes sense (e.g. engine can choose not to write columns that have default values), but we don't validate that it's a subset of the table schema?

Comment on lines +240 to +244
fn generate_adds<'a>(
&'a self,
engine: &dyn Engine,
write_metadata: impl Iterator<Item = &'a dyn EngineData> + Send + 'a,
) -> impl Iterator<Item = DeltaResult<Box<dyn EngineData>>> + Send + 'a {
Collaborator

Now that we take &'a self, I think we can remove the named lifetimes?
At worst we might need + '_ for the iterators?

Collaborator Author

Hm, the compiler is yelling that anonymous lifetimes are unstable in `impl Trait`.
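
For reference, a small self-contained sketch of the signature shape under discussion, with `&str` and `String` standing in for `&dyn EngineData` and `Box<dyn EngineData>`. The `Txn` type and its `prefix` field are hypothetical. It keeps the named `'a`, which ties the borrow of `self`, the input items, and the returned iterator to one lifetime:

```rust
/// Hypothetical stand-in for the transaction type.
struct Txn {
    prefix: String,
}

impl Txn {
    /// Same shape as `generate_adds` in the diff: the returned iterator borrows
    /// both `self` and the input iterator, so all three share the lifetime `'a`.
    fn generate_adds<'a>(
        &'a self,
        write_metadata: impl Iterator<Item = &'a str> + Send + 'a,
    ) -> impl Iterator<Item = Result<String, String>> + Send + 'a {
        // `move` copies the `&'a Txn` reference into the closure.
        write_metadata.map(move |path| {
            if path.is_empty() {
                Err("empty path".to_string())
            } else {
                Ok(format!("{}/{}", self.prefix, path))
            }
        })
    }
}
```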

@zachschuermann
Collaborator Author

> I'm not understanding how responsibility is divided between kernel and engine here? It seems like the engine provides the write schema? I guess that makes sense (e.g. engine can choose not to write columns that have default values), but we don't validate that it's a subset of the table schema?

I was attempting to 'push down' the write metadata schema (the schema which we agree is how the engine communicates the metadata about files it wrote during appends) from being global to now per-table. The engine/kernel interaction is intended to be:

  1. kernel provides the WriteContext from a Transaction, which includes the physical schema and transforms, so that the engine can apply the schema + transform to its logical data and write valid parquet files; then
  2. engine hands back some Box<dyn EngineData> that abides by our write_metadata_schema and carries metadata about the writes it performed. Currently this is basically just a bunch of paths, etc., but it will soon include a stats column whose schema is a function of the table's (physical) schema.
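
The two steps above can be sketched roughly as follows. This is a toy model, not the kernel's implementation: `Schema`, `Row`, `write_context`, and `check_write_metadata` are hypothetical names, the `path`/`size` columns are illustrative, and the `stats.<column>` entries only model the planned (not yet existing) stats column as something derived from the physical schema.

```rust
/// Simplified stand-ins (assumptions): a schema is an ordered list of column
/// names, and one row of engine data is a list of (column, value) pairs.
#[derive(Debug, Clone, PartialEq)]
struct Schema(Vec<String>);

type Row = Vec<(String, String)>;

struct WriteContext {
    physical_schema: Schema,
    write_metadata_schema: Schema,
}

/// Step 1 (kernel): build the context for one table. The write metadata
/// schema depends on the physical schema through the hypothetical stats entries.
fn write_context(physical_schema: Schema) -> WriteContext {
    let mut cols = vec!["path".to_string(), "size".to_string()];
    cols.extend(physical_schema.0.iter().map(|c| format!("stats.{c}")));
    WriteContext {
        physical_schema,
        write_metadata_schema: Schema(cols),
    }
}

/// Step 2 (kernel receiving engine output): accept a row only if its columns
/// match the write metadata schema exactly, in order.
fn check_write_metadata(ctx: &WriteContext, row: &Row) -> Result<(), String> {
    let got: Vec<&str> = row.iter().map(|(name, _)| name.as_str()).collect();
    let want: Vec<&str> = ctx.write_metadata_schema.0.iter().map(String::as_str).collect();
    if got == want {
        Ok(())
    } else {
        Err(format!("expected columns {want:?}, got {got:?}"))
    }
}
```

In this model two tables with different physical schemas get different write metadata schemas, which is the situation a single global schema could not express.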

To be clear this is an unnecessary code change in isolation (the schema is constant for all transactions currently) but this paves the way for us to add the stats column which depends on the physical schema. I suppose one could argue that we could just allow engines to add their own stats and not need to specify it in the write_metadata_schema, but given we use that as the API for adding files to a table, it seemed to make sense to add it.

@zachschuermann zachschuermann requested a review from scovich June 17, 2025 15:01
@scovich
Collaborator

scovich commented Jun 17, 2025

Who provides "our write_metadata_schema" tho? And how do we ensure it's valid/compatible with the table?

@zachschuermann
Collaborator Author

> Who provides "our write_metadata_schema" tho? And how do we ensure it's valid/compatible with the table?

We (kernel) do! This is Transaction::write_metadata_schema; when we add stats in the future, it will be generated from the table schema.
