Support per-task transactional leasing in loadTasks
#1523
I don't understand how this PR enables isolation of task failures. This PR only reads the tasks from the metastore one at a time, so the only failure would be in loading the task. In a transactional database, the ... The PR description sounds like it intends to tackle task execution failure - is that right? If so, loading the tasks from the database isn't going to solve that problem.
I think it could, just very lazily, right @collado-mike? The next time the service restarts, we could retry any orphaned tasks.
Sorry for the confusion: we actually have a second PR for this feature. I split the work into two parts to make review easier :) The PR for the second phase is #1585. Regarding this PR's changes in the metastore, the goal is to allow each task entity to be read and leased individually. This ensures that if an exception occurs while reading or leasing one task, it won't affect the others. This improvement was also noted in the TODO comment of the previous implementation. It's not strictly required, but may be a "nice-to-have" for isolating failures. Update on May 17:
A couple of things here: now with pagination being merged, this PR will require further revision to rebase properly imo.
I'm also in agreement with @collado-mike here - but to your response, I don't agree that this should be the way we solve this overall. I'm not sure I see high value in picking only one task at a time to solve this problem we have with tasks and retrying them. Instead, I'd advocate for leaning more heavily on the definition of `limit`. If, after we query the relevant number of tasks using `ms.listEntitiesInCurrentTxn`, we find that some task has been modified between our querying it and our attempt to commit our properties to it, we should just filter it out of the result set and let the user receive all the other tasks that were not impacted. If no tasks remain after this filtering, that would be the right place to throw the exception. Sure, we will not get the "limit" number of tasks when the function returns - but I don't see a guarantee of needing that.
I know I've probably not researched this as deeply as you so WDYT?
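To make the filtering idea above concrete, here is a minimal sketch. It assumes a hypothetical in-memory store with a per-task version number standing in for the metastore; `LeaseFilterSketch`, `Task`, `listCandidates`, `tryLease` and `leaseAvailable` are illustrative names, not Polaris APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for the metastore (assumption, not Polaris code).
final class LeaseFilterSketch {
  static final class Task {
    final String id;
    final int version;
    final String executorId; // null means "not leased"

    Task(String id, int version, String executorId) {
      this.id = id;
      this.version = version;
      this.executorId = executorId;
    }
  }

  private final Map<String, Task> store = new HashMap<>();

  synchronized void put(Task t) {
    store.put(t.id, t);
  }

  // Snapshot of up to `limit` unleased tasks, ordered by id.
  synchronized List<Task> listCandidates(int limit) {
    return store.values().stream()
        .filter(t -> t.executorId == null)
        .sorted(Comparator.comparing((Task t) -> t.id))
        .limit(limit)
        .collect(Collectors.toList());
  }

  // Succeeds only if the task still has the version we saw when querying.
  synchronized boolean tryLease(Task seen, String executorId) {
    Task current = store.get(seen.id);
    if (current == null || current.version != seen.version) {
      return false; // modified or removed since the snapshot
    }
    store.put(seen.id, new Task(seen.id, seen.version + 1, executorId));
    return true;
  }

  // Lease what we can; tasks that changed underneath us are filtered out.
  List<Task> leaseAvailable(List<Task> candidates, String executorId) {
    List<Task> leased = new ArrayList<>();
    for (Task seen : candidates) {
      if (tryLease(seen, executorId)) {
        leased.add(new Task(seen.id, seen.version + 1, executorId));
      }
    }
    // Only fail when there were candidates and none of them could be leased.
    if (leased.isEmpty() && !candidates.isEmpty()) {
      throw new IllegalStateException("no task could be leased");
    }
    return leased;
  }
}
```

The caller may get fewer than `limit` tasks back, which matches the trade-off described in the comment above.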
loadTasks
Thanks for the reminder — I’ll rebase and update it later!
That’s a great point — I actually considered that approach initially as well. That said, there were a couple of things that led me to explore the per-task leasing direction instead:
I'm not an expert in Polaris's metastore, though; I'm just sharing the context that led me to this approach. I would really appreciate any feedback or additional insight from the community.
I agree with the analysis you've stated too. I think it really comes down to point 1) that you made - and whether someone has context on if they considered this approach before putting down the TODO from point 2) (and if so, why). I, personally, don't think that the semantics between Transactional and Atomic force us to make a different implementation here tbh - but would also like any other insight from the community here :)
      String executorId,
      int limit) {
    List<EntitiesResult> entitySuccessResults = new ArrayList<>();
    final AtomicInteger failedLeaseCount = new AtomicInteger(0);
Do we actually need `AtomicInteger` here if it's only being used within one thread?
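For context on the question above: in Java, a local `int` cannot be mutated from inside a lambda, because captured locals must be effectively final. That is a common reason to reach for `AtomicInteger` even in single-threaded code. Whether that is the reason in this PR is an assumption. A small sketch of both forms, with a hypothetical `tryLease` helper that is not part of the PR:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class CounterSketch {
  // Hypothetical stand-in for a lease attempt: ids starting with "bad" fail.
  static boolean tryLease(String id) {
    return !id.startsWith("bad");
  }

  // Plain loop: an ordinary local int is enough.
  static int countFailuresLoop(List<String> ids) {
    int failed = 0;
    for (String id : ids) {
      if (!tryLease(id)) {
        failed++;
      }
    }
    return failed;
  }

  // Lambda body: `failed++` on a local int would not compile here,
  // so a mutable holder such as AtomicInteger is used instead.
  static int countFailuresLambda(List<String> ids) {
    AtomicInteger failed = new AtomicInteger(0);
    ids.forEach(id -> {
      if (!tryLease(id)) {
        failed.incrementAndGet();
      }
    });
    return failed.get();
  }
}
```

If the counter is only touched in a plain loop, the `int` form is simpler; if it is updated inside a lambda passed to a transaction runner, some mutable holder is needed.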

  @Override
  public @Nonnull EntitiesResult loadTasks(
      @Nonnull PolarisCallContext callCtx, String executorId, int limit, boolean perTaskTxn) {
`txnPerTask` seems to have morphed into `perTaskTxn` here
@@ -1985,11 +1983,60 @@ private PolarisEntityResolver resolveSecurableToRoleGrant(
    return new EntitiesResult(loadedTasks);
  }

  private @Nonnull EntitiesResult loadTasksWithIsolatedTxn(
Do we actually need a single txn per task, or can we try to transactionally grab multiple in one trip to persistence?
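A rough sketch of the trade-off raised here, assuming a hypothetical `FakeStore` whose `runInTransaction` counts one trip to persistence per call. None of these names come from Polaris. One transaction per task isolates failures but costs a trip per task; one transaction for the whole batch is a single trip but all-or-nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class TxnTripSketch {
  // Hypothetical transaction runner that counts trips to persistence.
  static final class FakeStore {
    int trips = 0;

    <T> T runInTransaction(Supplier<T> body) {
      trips++;
      return body.get(); // an exception here aborts the whole transaction
    }

    // Hypothetical lease call: ids starting with "bad" cannot be leased.
    String lease(String id) {
      if (id.startsWith("bad")) {
        throw new IllegalStateException("cannot lease " + id);
      }
      return id;
    }
  }

  // One transaction per task: a failure only drops that task.
  static List<String> leasePerTask(FakeStore store, List<String> ids) {
    List<String> leased = new ArrayList<>();
    for (String id : ids) {
      try {
        leased.add(store.runInTransaction(() -> store.lease(id)));
      } catch (IllegalStateException e) {
        // skip only the task that failed
      }
    }
    return leased;
  }

  // One transaction for the batch: a single trip, but one failure loses everything.
  static List<String> leaseBatch(FakeStore store, List<String> ids) {
    try {
      return store.runInTransaction(() -> {
        List<String> leased = new ArrayList<>();
        for (String id : ids) {
          leased.add(store.lease(id));
        }
        return leased;
      });
    } catch (IllegalStateException e) {
      return new ArrayList<>();
    }
  }
}
```

A middle ground would be a single transaction that skips individual failures inside the batch instead of aborting.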
I guess we can consider closing this PR, since this change is optional in my plan and the community also feels it might be unnecessary.
@danielhumanmod I actually like the idea of making this method transactional, but I'm worried about making too many trips to persistence. We can close this for now if you want though; I'll shift my focus to the other PR.
Potentially as an improvement for Fix #774
Context
Introduce per-task transactional leasing in the metastore layer via `loadTasks(...)`. This allows tasks to be leased and updated one at a time, avoiding the all-or-nothing semantics of bulk operations (which is also mentioned in the TODO).
Motivation
`TODO` in the original implementation, so it's likely something we want to improve?