[Kernel] [Pagination] New Page Token Class #4848

mmmyr · 2025-06-27T19:29:55Z

Which Delta project/connector is this regarding?

Kernel

Description

Introduce a new page token class (for pagination).

How was this patch tested?

PageTokenSuite.scala

Does this PR introduce any user-facing changes?

No.

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

huan233usc

LGTM with some minor comments.

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

scottsand-db

Good tests! Left comments and questions on the semantics of PageToken

scottsand-db · 2025-07-01T15:51:18Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+              + "Expected: "
+              + PAGE_TOKEN_SCHEMA
+              + ", Got: "
+              + row.getSchema());


I think all of this could fit on one line.
You could also use String.format here.
Did you want the different schemas printed on different lines? If so, are you missing some \n here?
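A minimal sketch of what the String.format suggestion could look like, with each schema on its own line. The class and method names here are hypothetical, and the schemas are passed as plain objects so the snippet runs without the Kernel types.

```java
final class SchemaMismatchMessage {
  // Hypothetical helper (not from the PR): builds the mismatch message in a single
  // String.format call, with "\n" placing each schema on its own line.
  static String build(Object expectedSchema, Object actualSchema) {
    return String.format(
        "Invalid Page Token: input row schema does not match expected PageToken schema."
            + "\nExpected: %s\nGot: %s",
        expectedSchema, actualSchema);
  }

  public static void main(String[] args) {
    // The schema strings below are placeholders for illustration only.
    System.out.println(build("struct(a: long)", "struct(a: string)"));
  }
}
```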

scottsand-db · 2025-07-01T16:02:46Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+
+  // ===== Variables to mark where the last page ended (and the current page starts) =====
+  /**
+   * The name of the log file where the current page starts. This is the same as the last log file


will this always be the case?
what if we are able to detect that this columnar batch is the last columnar batch in the file?
will we then set this startingLogFileName to the next file?

Right now it's always the case. You are right, probably it's better to say: the last log file read in the previous page.

scottsand-db · 2025-07-01T16:04:53Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+   * page. This row index is relative to the file. The current page should begin from the row
+   * immediately after this row index.
+   */
+  private final long lastReturnedRowIndex;


why is it the last row index and not the next row index?

if I read 1000 rows in json file 007.json, which is rows 0 to 999 inclusive, do we return 999 or 1000? I would think we should return 1000?

In my prototype, I returned 999, and I believe this makes more sense than returning the next row index. For example, suppose 007.json contains 1,000 rows (indexed 0 to 999), and the last page returns rows 0 to 999. If we return "007.json" + row index 1000, it's a bit odd—row index 1000 doesn't actually exist in that file.

Conceptually, I feel like it's more accurate to return the last row index read, along with the last file name read. This way, we're recording the precise position where the last page ended, and we can unambiguously start the next page from the row that comes immediately after. I'll change the variable name "startingLogFileName" into "lastReadLogFileName", and also change "startingSidecarFileIdx" to "lastReadSidecarFileIdx".

We can record either (1) the last file and last row index of the previous page, or (2) the starting file and starting row index of the current page.

But we won't know the real file name of the next batch to read until the caller tries to get the next batch. If we want to return the start position of the next page, we need an accurate starting file name and starting row index.

I'll change the variable name "startingLogFileName" into "lastReadLogFileName", and also change "startingSidecarFileIdx" to "lastReadSidecarFileIdx".

SGTM. Thanks for being very clear on this detail and semantic!
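To make the agreed semantics concrete, here is a rough sketch of how a reader could resume from a token that stores the last file read and the last returned row index. Everything in it is an assumption for illustration: the LogFile stand-in, the nextPosition helper, and the idea that files are given as a simple list in read order are not taken from the PR.

```java
import java.util.Arrays;
import java.util.List;

final class ResumeFromToken {
  /** Hypothetical stand-in for a log file: a name plus the number of rows it holds. */
  static final class LogFile {
    final String name;
    final long numRows;

    LogFile(String name, long numRows) {
      this.name = name;
      this.numRows = numRows;
    }
  }

  /**
   * Returns {fileIndex, rowIndex} of the first row of the next page, or null when nothing is
   * left to read. The files list is assumed to be in the order the reader visits them.
   */
  static long[] nextPosition(
      List<LogFile> files, String lastReadLogFileName, long lastReturnedRowIndex) {
    for (int i = 0; i < files.size(); i++) {
      if (!files.get(i).name.equals(lastReadLogFileName)) {
        continue;
      }
      long next = lastReturnedRowIndex + 1;
      if (next < files.get(i).numRows) {
        return new long[] {i, next}; // resume inside the same file
      }
      // The last file was fully consumed: only now is the next file resolved.
      for (int j = i + 1; j < files.size(); j++) {
        if (files.get(j).numRows > 0) {
          return new long[] {j, 0};
        }
      }
      return null;
    }
    throw new IllegalArgumentException("Unknown log file: " + lastReadLogFileName);
  }

  public static void main(String[] args) {
    List<LogFile> files = Arrays.asList(new LogFile("007.json", 1000), new LogFile("006.json", 500));
    System.out.println(Arrays.toString(nextPosition(files, "007.json", 999)));
  }
}
```

With this convention the token never has to name a row that does not exist (such as row 1000 of a 1,000-row file); the reader decides whether to stay in the file or move on at the moment it resumes.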

scottsand-db

Looks great! Left some comments

scottsand-db · 2025-07-01T18:51:55Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+        row.getLong(7)); // logSegmentHash
+  }
+
+  /** Schema for PageToken Row representation */


nit: superfluous comment :) you can delete this

scottsand-db · 2025-07-01T18:53:27Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+  /** Schema for PageToken Row representation */
+  public static final StructType PAGE_TOKEN_SCHEMA =
+      new StructType()
+          .add("logFileName", StringType.STRING)


can you please explicitly add the nullability here, e.g. a trailing ", false /* nullable */)" argument, etc?

and set the nullability for the sidecarIndex to true?

scottsand-db · 2025-07-01T18:54:01Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+            "Invalid Page Token: input row schema does not match expected PageToken schema.\nExpected: %s\nGot: %s",
+            PAGE_TOKEN_SCHEMA, row.getSchema()));
+
+    for (int i = 0; i < PAGE_TOKEN_SCHEMA.length(); i++) {


a simple comment here would help:

above: // Check #1: Correct schema
here: // Check #2: All required fields are present
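A self-contained sketch of the two checks with the suggested comments in place. It does not use the Kernel StructType or Row classes: the Field stand-in, the validate helper, and the three-field schema are hypothetical simplifications, and values is assumed to have one entry per schema field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class PageTokenRowChecks {
  /** Hypothetical stand-in for a schema field: name, type name, nullability. */
  static final class Field {
    final String name;
    final String type;
    final boolean nullable;

    Field(String name, String type, boolean nullable) {
      this.name = name;
      this.type = type;
      this.nullable = nullable;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Field)) return false;
      Field f = (Field) o;
      return name.equals(f.name) && type.equals(f.type) && nullable == f.nullable;
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, type, nullable);
    }

    @Override
    public String toString() {
      return name + ": " + type + (nullable ? "" : " not null");
    }
  }

  // Illustrative subset of fields; nullability is stated explicitly for each one.
  static final List<Field> EXPECTED_SCHEMA =
      Arrays.asList(
          new Field("logFileName", "string", false /* nullable */),
          new Field("sidecarIndex", "long", true /* nullable */),
          new Field("logSegmentHash", "long", false /* nullable */));

  static void validate(List<Field> rowSchema, List<Object> values) {
    // Check #1: Correct schema
    if (!EXPECTED_SCHEMA.equals(rowSchema)) {
      throw new IllegalArgumentException(
          String.format(
              "Invalid Page Token: schema mismatch.\nExpected: %s\nGot: %s",
              EXPECTED_SCHEMA, rowSchema));
    }
    // Check #2: All required fields are present
    for (int i = 0; i < EXPECTED_SCHEMA.size(); i++) {
      Field field = EXPECTED_SCHEMA.get(i);
      if (!field.nullable && values.get(i) == null) {
        throw new IllegalArgumentException(
            "Invalid Page Token: required field '" + field.name + "' is null");
      }
    }
  }
}
```

A row with a null sidecarIndex passes both checks here, while a null logFileName fails Check #2.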

scottsand-db · 2025-07-01T18:54:20Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+          .add("logSegmentHash", LongType.LONG);
+
+  // ===== Variables to mark where the last page ended (and the current page starts) =====
+  /** The last log file read in the previous page. */


nit: newline between the header block and here

scottsand-db · 2025-07-01T18:55:01Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+   * page. This row index is relative to the file. The current page should begin from the row
+   * immediately after this row index.
+   */
+  private final long lastReturnedRowIndex;


I'll change the variable name "startingLogFileName" into "lastReadLogFileName", and also change "startingSidecarFileIdx" to "lastReadSidecarFileIdx".

SGTM. Thanks for being very clear on this detail and semantic!

scottsand-db · 2025-07-01T18:57:06Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/PageToken.java

+      long predicateHash,
+      long logSegmentHash) {
+    this.lastReadLogFileName =
+        requireNonNull(lastReadLogFileName, "lastReadLogFileName must not be null");


instead of "must not be null" you can just say "x is null" --> will this make our lines shorter and let this fit onto one line? much cleaner :)
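For reference, a small sketch of the shorter message using java.util.Objects.requireNonNull. The wrapper class is hypothetical; only the field name comes from the diff above.

```java
import java.util.Objects;

final class NullCheckMessage {
  final String lastReadLogFileName;

  NullCheckMessage(String lastReadLogFileName) {
    // The shorter "x is null" wording; whether the assignment then fits on one line
    // depends on the formatter's column limit.
    this.lastReadLogFileName =
        Objects.requireNonNull(lastReadLogFileName, "lastReadLogFileName is null");
  }
}
```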

scottsand-db · 2025-07-01T18:57:44Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/PageTokenSuite.scala

+
+  val expectedRow = new GenericRow(PageToken.PAGE_TOKEN_SCHEMA, rowData)
+
+  test("Test PageToken.fromRow with valid data") {


nit: you don't need to say test("Test --> we know this is a test :)

scottsand-db · 2025-07-01T18:58:27Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/PageTokenSuite.scala

+  private val rowData: Map[Integer, Object] = new HashMap()
+  rowData.put(0, TEST_FILE_NAME)
+  rowData.put(1, TEST_ROW_INDEX.asInstanceOf[Object])
+  rowData.put(2, TEST_SIDECAR_INDEX.orElse(null))


let's not have business logic orElse(null) here.

is it null? make it null
is it not null? then just use that value

scottsand-db · 2025-07-01T18:58:52Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/PageTokenSuite.scala

+
+    assert(row.getString(0) == TEST_FILE_NAME)
+    assert(row.getLong(1) == TEST_ROW_INDEX)
+    assert(Optional.of(if (row.isNullAt(2)) null else row.getLong(2)) == TEST_SIDECAR_INDEX)


let's not have business logic (isNullAt) here

if it is null, require and assert that it is null

if not, require and assert that it is the value we are expecting
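One way to read this suggestion is to give each case its own fixture and its own unconditional assertion. The sketch below is in Java rather than the suite's Scala, and the list-based rows and their values are made-up stand-ins, not the PR's fixtures. It also shows a side effect of the branching version: java.util.Optional.of rejects null, so wrapping a possibly-null value in Optional.of throws a NullPointerException (Optional.ofNullable is the null-tolerant variant).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class ExplicitSidecarAssertions {
  // Hypothetical stand-ins for rows: position 2 holds the nullable sidecar index.
  static final List<Object> ROW_WITHOUT_SIDECAR = Arrays.<Object>asList("file.json", 42L, null);
  static final List<Object> ROW_WITH_SIDECAR = Arrays.<Object>asList("file.json", 42L, 5L);

  // Null case: assert directly that the value is null.
  static void checkNullCase() {
    if (ROW_WITHOUT_SIDECAR.get(2) != null) {
      throw new AssertionError("expected a null sidecar index");
    }
  }

  // Non-null case: assert directly that the value is the one we expect.
  static void checkValueCase() {
    if (!Long.valueOf(5L).equals(ROW_WITH_SIDECAR.get(2))) {
      throw new AssertionError("expected sidecar index 5");
    }
  }

  // Demonstrates that Optional.of(null) throws instead of producing an empty Optional.
  static boolean optionalOfNullThrows() {
    try {
      Optional.of((Long) null);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }
}
```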

scottsand-db · 2025-07-01T19:00:37Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/PageTokenSuite.scala

+    assert(reconstructedPageToken.equals(expectedPageToken))
+  }
+
+  test("PageToken.fromRow throws exception when input row has invalid schema") {


can we test a wrong field name and a wrong data type?

also explicitly test the sidecar-can-be-null case

page token

deb7dc0

mmmyr mentioned this pull request Jun 27, 2025

[Kernel] Implement Pagination Support in Kernel #4842

Open

mmmyr self-assigned this Jun 27, 2025

mmmyr added 2 commits June 27, 2025 22:28

page token test

a1d6361

clean up

161033d

mmmyr marked this pull request as ready for review June 27, 2025 23:27

mmmyr requested review from scottsand-db and huan233usc June 27, 2025 23:28

improve

5a19be5

huan233usc reviewed Jun 27, 2025

huan233usc approved these changes Jun 27, 2025

mmmyr requested review from nicklan and raveeram-db June 28, 2025 00:04

optional sidecar idx

5c9c555

scottsand-db reviewed Jun 30, 2025
