Support retrying non-finished async tasks on startup and periodically #1585


Open
danielhumanmod wants to merge 8 commits into main

Conversation

@danielhumanmod (Contributor) commented on May 14, 2025

Fix #774

Context

Polaris uses async tasks to perform operations such as table and manifest file cleanup. These tasks are executed asynchronously in a separate thread within the same JVM, and retries are handled inline within the task execution. However, this mechanism does not guarantee eventual execution in the following cases:

  • The task fails repeatedly and hits the maximum retry limit.
  • The service crashes or shuts down before retrying.

Implementation Plan

Stage 1: Potential improvement - #1523
Introduce per-task transactional leasing in the metastore layer via loadTasks(...)

Stage 2 (Current PR):
Persist failed tasks and introduce a retry mechanism triggered during Polaris startup and via periodic background checks. Changes include:

  1. Metastore Layer:
    • Exposes a new API, getMetaStoreManagerMap
    • Ensures LAST_ATTEMPT_START_TIME is set whenever a task entity is created; this property drives the time-out filtering in loadTasks() so that multiple executors cannot pick up the same task (see the sketch after this list)
  2. TaskRecoveryManager: New class responsible for task recovery logic, including:
    • Constructing the PolarisCallContext used for execution
    • Loading tasks from the metastore
    • Triggering task execution
  3. QuarkusTaskExecutorImpl: Hooks into the application lifecycle to initiate task recovery.
  4. Task Retry Strategy: Failed tasks remain persisted in the metastore and are retried by the recovery manager.
  5. Tests: Adjusted existing tests and added new coverage for recovery behavior.
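
To make the moving parts above concrete, here is a minimal, self-contained sketch of the recovery loop. The interfaces and names below (RealmMetaStore, TaskSubmitter, loadStaleTaskIds, resubmit, the 20-task page size) are simplified stand-ins for illustration only, not the actual Polaris types or signatures changed in this PR.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the metastore and executor types; the real Polaris
// interfaces (PolarisMetaStoreManager, TaskExecutor, ...) differ in detail.
interface RealmMetaStore {
  // Returns up to `limit` task entities whose LAST_ATTEMPT_START_TIME is older than
  // the lease timeout, stamping them with the caller's executorId as it does so.
  List<Long> loadStaleTaskIds(String executorId, int limit);
}

interface TaskSubmitter {
  void resubmit(String realmId, long taskEntityId);
}

// Sketch of one recovery pass, run at startup and again from a periodic trigger.
final class TaskRecoverySketch {
  private final Map<String, RealmMetaStore> metaStoresByRealm;
  private final TaskSubmitter submitter;
  private final String executorId;

  TaskRecoverySketch(Map<String, RealmMetaStore> metaStoresByRealm,
                     TaskSubmitter submitter,
                     String executorId) {
    this.metaStoresByRealm = metaStoresByRealm;
    this.submitter = submitter;
    this.executorId = executorId;
  }

  void recoverOnce() {
    for (Map.Entry<String, RealmMetaStore> entry : metaStoresByRealm.entrySet()) {
      // Only tasks whose lease has expired are returned, so a task currently being
      // executed by another live executor is left alone.
      for (long taskId : entry.getValue().loadStaleTaskIds(executorId, 20)) {
        submitter.resubmit(entry.getKey(), taskId);
      }
    }
  }
}
```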

Recommended Review Order

  1. Metastore Layer related code
  2. TaskRecoveryManager
  3. QuarkusTaskExecutorImpl and TaskExecutorImpl
  4. Task cleanup handlers
  5. Tests

@@ -152,6 +156,7 @@ public void testTableCleanup() throws IOException {

handler.handleTask(task, callContext);

timeSource.add(Duration.ofMinutes(10));
Contributor Author

Previously a task entity could be created without the LAST_ATTEMPT_START_TIME property, so loadTasks could return it without any time-out having elapsed. Now that every task entity carries this property, the test has to advance the clock past the time-out before loadTasks will return the task again.
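
Put differently, loadTasks only hands out tasks whose last attempt is older than the lease timeout. A tiny illustration of that eligibility check (the 5-minute timeout here is an assumed value, not the one Polaris actually uses):

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: a task stamped with LAST_ATTEMPT_START_TIME at creation stays
// invisible to loadTasks() until the lease timeout elapses, which is why the test
// advances its clock before expecting the task to be picked up again.
public class LeaseTimeoutExample {
  static final Duration LEASE_TIMEOUT = Duration.ofMinutes(5); // assumed value

  static boolean eligibleForPickup(Instant lastAttemptStart, Instant now) {
    return now.isAfter(lastAttemptStart.plus(LEASE_TIMEOUT));
  }

  public static void main(String[] args) {
    Instant created = Instant.parse("2025-05-14T00:00:00Z");
    System.out.println(eligibleForPickup(created, created));                              // false: still leased
    System.out.println(eligibleForPickup(created, created.plus(Duration.ofMinutes(10)))); // true: lease expired
  }
}
```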

Collaborator

Can you explain this further - I'm not sure why the tests need this 10m jump? Is it so that tasks are "recovered" by the Quarkus Scheduled method?

@danielhumanmod changed the title from "Support more reliable async task retry to guarantee eventual execution (2/2) – Task Executor" to "Support retrying non-finished async tasks on startup and periodically" on May 18, 2025
@@ -172,6 +172,11 @@ public Map<String, BaseResult> purgeRealms(Iterable<String> realms) {
return Map.copyOf(results);
}

@Override
public Map<String, PolarisMetaStoreManager> getMetaStoreManagerMap() {
Collaborator

To make this a bit more defensively coded, I might recommend making this an iterator of Map.Entry objects, given that this is a public method and we wouldn't want any code path to be able to modify this mapping?
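
For example, either of these shapes would keep callers from mutating the internal map. This is a generic sketch only; the value type stands in for PolarisMetaStoreManager and the class name is invented:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Generic sketch of two defensive options; V stands in for PolarisMetaStoreManager.
final class MetaStoreRegistry<V> {
  private final Map<String, V> byRealm = new HashMap<>();

  // Option A: return an unmodifiable view so no code path can alter the mapping.
  Map<String, V> getMetaStoreManagerMap() {
    return Collections.unmodifiableMap(byRealm);
  }

  // Option B (the reviewer's suggestion): expose only an iterator over the entries;
  // entries from an unmodifiable view reject setValue().
  Iterator<Map.Entry<String, V>> metaStoreManagerEntries() {
    return Collections.unmodifiableMap(byRealm).entrySet().iterator();
  }
}
```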

}

private void addTaskLocation(TaskEntity task) {
Map<String, String> internalPropertiesAsMap = new HashMap<>(task.getInternalPropertiesAsMap());
Collaborator

addInternalProperty

try {
ManifestReader<DataFile> dataFiles = ManifestFiles.read(manifestFile, fileIO);
Collaborator

What's the reason behind this change?

@@ -193,6 +198,9 @@ private Stream<TaskEntity> getManifestTaskStream(
.withData(
new ManifestFileCleanupTaskHandler.ManifestCleanupTask(
tableEntity.getTableIdentifier(), TaskUtils.encodeManifestFile(mf)))
.withLastAttemptExecutorId(executorId)
.withAttemptCount(1)
Collaborator

How can we assume this?

@@ -235,6 +247,9 @@ private Stream<TaskEntity> getMetadataTaskStream(
.withData(
new BatchFileCleanupTaskHandler.BatchFileCleanupTask(
tableEntity.getTableIdentifier(), metadataBatch))
.withLastAttemptExecutorId(executorId)
.withAttemptCount(1)
Collaborator

Ditto as above.

PolarisCallContext polarisCallContext =
new PolarisCallContext(
metastore, new PolarisDefaultDiagServiceImpl(), configurationStore, clock);
EntitiesResult entitiesResult =
Collaborator

I'm not sure I'm understanding the logic here: we are asking for 20 tasks here - but what if there are more than 20 tasks that need recovery?
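
One way to handle more than a single page would be to keep calling the loader until it returns a short page. This is only a sketch around a generic loader function, not the actual loadTasks contract in this PR:

```java
import java.util.List;
import java.util.function.BiFunction;

// Illustrative pagination loop: repeatedly request pages of stale tasks until a
// short (or empty) page signals that nothing is left to recover.
final class RecoveryPagingSketch {
  static <T> int drain(BiFunction<String, Integer, List<T>> loadTasks,
                       String executorId,
                       int pageSize) {
    int recovered = 0;
    while (true) {
      List<T> page = loadTasks.apply(executorId, pageSize);
      recovered += page.size();
      // ... submit each task in `page` for execution here ...
      if (page.size() < pageSize) {
        return recovered;
      }
    }
  }
}
```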


tableCleanupTaskHandler.handleTask(task, callCtx);

// Step 3: Verify that the generated child tasks were registered, ATTEMPT_COUNT = 2
timeSource.add(Duration.ofMinutes(10));
Collaborator

I, personally, found this very hard to follow - even with the comments. I would highly recommend making the comments much more verbose here to allow the full flow of logic (what is happening with which task and why) to be communicated to a reader who may not be an expert at this particular type of task or tasks in general.


Successfully merging this pull request may close these issues: Task handling is incomplete (#774)
2 participants