-[How to URL encode keys and string values](#how-to-url-encode-keys-and-string-values)
 <!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -253,6 +254,54 @@ as of now. The add and remove file actions are stored as their individual column
 These files reside in the `_delta_log/_sidecars` directory.

+### Log Compaction Files
+
+Log compaction files reside in the `_delta_log` directory. A log compaction file from a start version `x` to an end version `y` will have the following name:
+`<x>.<y>.compact.json`. This contains the aggregated
+actions for commit range `[x, y]`. Similar to commits, each row in the log
+compaction file represents an [action](#actions).
+The log compaction file for a given range is created by doing [Action Reconciliation](#action-reconciliation)
+of the corresponding commits.
+Instead of reading the individual commit files in range `[x, y]`, an implementation could choose to read
+the log compaction file `<x>.<y>.compact.json` to speed up the snapshot construction.
+- Can optionally produce log compactions for any given commit range
+
+Readers:
+- Can optionally consume log compactions, if available
+- The compaction replaces the corresponding commits during action reconciliation
+
 ### Last Checkpoint File
 The Delta transaction log will often contain many (e.g. 10,000+) files.
 Listing such a large directory can be prohibitively expensive.
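The log compaction naming and the reader rule added in the hunk above (a compaction for `[x, y]` replaces the commits in that range) can be illustrated with a minimal Python sketch. The function names `parse_compaction_name` and `plan_reads` are hypothetical and not part of the spec, and the greedy choice of the longest compaction starting at each version is only one possible strategy, not something the spec mandates.

```python
import re
from typing import Optional

# Matches the `<x>.<y>.compact.json` pattern quoted in the diff. Versions are
# matched as plain digit runs; any zero-padding width is left open here.
_COMPACTION_RE = re.compile(r"^(\d+)\.(\d+)\.compact\.json$")


def parse_compaction_name(name: str) -> Optional[tuple[int, int]]:
    """Return (start, end) for a log compaction file name, else None."""
    m = _COMPACTION_RE.match(name)
    if m is None:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    return (start, end) if start <= end else None


def plan_reads(first: int, last: int,
               compactions: list[tuple[int, int]]) -> list[tuple[str, int, int]]:
    """Cover commits first..last, letting a compaction stand in for its range.

    Hypothetical strategy: at each version, use the longest compaction that
    starts there and fits inside [first, last]; otherwise read the commit.
    """
    longest_from = {}
    for x, y in compactions:
        if first <= x and y <= last:
            longest_from[x] = max(y, longest_from.get(x, y))

    plan = []
    v = first
    while v <= last:
        if v in longest_from:
            end = longest_from[v]
            plan.append(("compaction", v, end))
        else:
            end = v
            plan.append(("commit", v, v))
        v = end + 1
    return plan
```

A compaction that extends past the requested range is skipped in this sketch, because consuming it would pull in actions from versions the reader did not ask for.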
@@ -261,26 +310,6 @@ The last checkpoint file can help reduce the cost of constructing the latest sna
 Rather than list the entire directory, readers can locate a recent checkpoint by looking at the `_delta_log/_last_checkpoint` file.
 Due to the zero-padded encoding of the files in the log, the version id of this recent checkpoint can be used on storage systems that support lexicographically-sorted, paginated directory listing to enumerate any delta files or newer checkpoints that comprise more recent versions of the table.

-#### Last Checkpoint File Schema
-
-This last checkpoint file is encoded as JSON and contains the following information:
-
-Field | Description
--|-
-version | The version of the table when the last checkpoint was made.
-size | The number of actions that are stored in the checkpoint.
-parts | The number of fragments if the last checkpoint was written in multiple parts. This field is optional.
-sizeInBytes | The number of bytes of the checkpoint. This field is optional.
-numOfAddFiles | The number of AddFile actions in the checkpoint. This field is optional.
-checkpointSchema | The schema of the checkpoint file. This field is optional.
-tags | String-string map containing any additional metadata about the last checkpoint. This field is optional.
-checksum | The checksum of the last checkpoint JSON. This field is optional.
-
-The checksum field is an optional field which contains the MD5 checksum for fields of the last checkpoint json file.
-Last checkpoint file readers are encouraged to validate the checksum, if present, and writers are encouraged to write the checksum
-while overwriting the file. Refer to [this section](#json-checksum) for rules around calculating the checksum field
-for the last checkpoint JSON.
-
 ## Actions
 Actions modify the state of the table and they are stored both in delta files and in checkpoints.
 This section lists the space of available actions as well as their schema.
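Returning to the `_last_checkpoint` lookup in the context lines of the hunk above: the following Python sketch shows how the checkpoint's `version` field could be turned into a lexicographic lower bound for a sorted listing. The 20-digit padding width and the helper name `newer_log_files` are assumptions made for illustration; the text above only says the file names are zero-padded.

```python
import json

PAD = 20  # assumed width of the zero-padded version prefix


def newer_log_files(last_checkpoint_json: str, listing: list[str]) -> list[str]:
    """Return log file names at or after the version in `_last_checkpoint`.

    `listing` stands in for a directory listing of `_delta_log`; on a store
    with sorted, paginated listing the same prefix could be used as the
    start key instead of filtering in memory.
    """
    info = json.loads(last_checkpoint_json)
    start_key = f"{info['version']:0{PAD}d}"
    return sorted(
        name for name in listing
        # Only versioned files take part; this skips `_last_checkpoint` itself.
        if name[:PAD].isdigit() and name >= start_key
    )
```

Because the version prefix has a fixed width, plain string comparison orders the names the same way as numeric comparison of their versions, which is what makes the listing shortcut possible.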
@@ -1232,7 +1261,9 @@ the `cutoffCommit`, because a commit exactly at midnight is an acceptable cutoff
 2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
 We need to preserve the `cutOffCheckpoint` and all commits after it, because we need them to enable
 time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the `cutOffCheckpoint` checkpoint.
+3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having
+startVersion <= `cutOffCheckpoint`'s version.
 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
 the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
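Step 3 of the hunk above, as revised, can be sketched in Python as a pure selection function. This is an illustration under assumptions: `files_to_delete` is a hypothetical name, versions are plain integers, and "before the `cutOffCheckpoint`" is read as a strictly lower version, which matches the requirement to preserve the `cutOffCheckpoint` and all commits after it.

```python
def files_to_delete(cutoff_checkpoint_version: int,
                    commits: list[int],
                    checkpoints: list[int],
                    compactions: list[tuple[int, int]]) -> dict:
    """Select log files that step 3 would remove.

    commits and checkpoints are version numbers; compactions are
    (startVersion, endVersion) pairs.
    """
    cutoff = cutoff_checkpoint_version
    return {
        # Delta log entries and checkpoints strictly before the cutoff.
        "commits": sorted(v for v in commits if v < cutoff),
        "checkpoints": sorted(v for v in checkpoints if v < cutoff),
        # Log compactions whose startVersion <= the cutoff version.
        "compactions": sorted(c for c in compactions if c[0] <= cutoff),
    }
```

Note the asymmetry in the revised step: a compaction starting exactly at the cutoff version is deleted, while a commit or checkpoint at that version is kept under this reading.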
@@ -1795,7 +1826,27 @@ Checkpoint schema (just the `add` column):
 | | | |-- col-04ee4877-ee53-4cb9-b1fb-1a4eb74b508c: long
 ```

-## JSON checksum
+## Last Checkpoint File Schema
+
+This last checkpoint file is encoded as JSON and contains the following information:
+
+Field | Description
+-|-
+version | The version of the table when the last checkpoint was made.
+size | The number of actions that are stored in the checkpoint.
+parts | The number of fragments if the last checkpoint was written in multiple parts. This field is optional.
+sizeInBytes | The number of bytes of the checkpoint. This field is optional.
+numOfAddFiles | The number of AddFile actions in the checkpoint. This field is optional.
+checkpointSchema | The schema of the checkpoint file. This field is optional.
+tags | String-string map containing any additional metadata about the last checkpoint. This field is optional.
+checksum | The checksum of the last checkpoint JSON. This field is optional.
+
+The checksum field is an optional field which contains the MD5 checksum for fields of the last checkpoint json file.
+Last checkpoint file readers are encouraged to validate the checksum, if present, and writers are encouraged to write the checksum
+while overwriting the file. Refer to [this section](#json-checksum) for rules around calculating the checksum field
+for the last checkpoint JSON.
+
+### JSON checksum
 To generate the checksum for the last checkpoint JSON, firstly, the checksum JSON is canonicalized and converted to a string. Then
 the 32 character MD5 digest is calculated on the resultant string to get the checksum. Rules for [JSON](https://datatracker.ietf.org/doc/html/rfc8259) canonicalization are: