Skip to content

Implement polling improvements#2652

Merged
zalegrala merged 144 commits intografana:mainfrom
zalegrala:pollingLessWork
Oct 27, 2023
Merged

Implement polling improvements#2652
zalegrala merged 144 commits intografana:mainfrom
zalegrala:pollingLessWork

Conversation

@zalegrala
Copy link
Copy Markdown
Contributor

@zalegrala zalegrala commented Jul 13, 2023

What this PR does:

Currently, each polling cycle, the block list is read from the backend, followed by a read of each meta.json file. If the meta.json returns not-found, then an additional call is made to the backend to fetch the meta.compacted.json. During compaction, the meta.json is replaced with meta.compacted.json, but the details about the block remain unchanged. This means that tempo performs lots of fetching for information that is already known from the previous poll. This can be a significant source of backend calls, and in turn, latency of the polling cycle. In environments with a high number of blocks, this problem is amplified.

Here we extend the backend interfaces to create a ListBlocks() method that can be used to target block information direclty from each backend.

  • list the backend, splitting on an empty delimter to get all objects in the backend
  • parse the response, sorting the results into compacted and not-compacted
  • for each block, use the previous blocklist to match data we have in memory
  • for each new block, we perform the fetch as normal to read the metadata

This results in fewer calls to the backend, and a may decrease in total time spent for each poll cycle.

Additional changes to the structure of the backend are made to improve the duration while keeping the total calls in check.

  • the s3 and gcs ListBlocks() call has implemented a parallel list over the search space

Which issue(s) this PR fixes:
Fixes #2521

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala force-pushed the pollingLessWork branch 2 times, most recently from 7e423c0 to 63f1c85 Compare July 13, 2023 22:25
@zalegrala zalegrala force-pushed the pollingLessWork branch 5 times, most recently from f0417b7 to a47fec5 Compare July 26, 2023 21:04
@zalegrala
Copy link
Copy Markdown
Contributor Author

Blocklist poll duration decreases with an image built from this PR in GCS.
image

@zalegrala zalegrala changed the title Poller updates in progress Implement "quick" polling Jul 27, 2023
@zalegrala
Copy link
Copy Markdown
Contributor Author

Blocklist poll duration increases in s3.
image

The list call itslef is now taking longer in s3, but likely still saves cost on requests. Perhaps there is a different approach to the s3 implementation we can take. I'll do a little research and see if I can come up with something, but feel free to chime in if you have ideas.

@joe-elliott
Copy link
Copy Markdown
Collaborator

As long as the polling takes less time than the polling cycle I think we're ok. Double the polling time might not matter if it's going from 20s -> 40s on 200k blocks or something like that.

@zalegrala
Copy link
Copy Markdown
Contributor Author

I've made a small change to use n as a delimiter for both the s3 and the gcs case. This works because .json files end with n. Use against the actual GCS API was working fine, but against the fake-gcs-server that we use in our CI was not accepting multi-character delimiters. I applied the same change to s3 thinking that perhaps the multi-character delimiter was slowing down the responses, but this didn't seem to pan out when I deployed. I think its worth a review at this point.

@zalegrala zalegrala marked this pull request as ready for review July 28, 2023 16:48
Copy link
Copy Markdown
Collaborator

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass. Have some Qs. Let me know if you want to have a sync convo.

Comment thread tempodb/backend/azure/azure.go Outdated
Comment thread tempodb/backend/raw.go Outdated
Comment thread tempodb/backend/raw.go Outdated
Comment thread tempodb/blocklist/poller.go Outdated
Comment thread tempodb/blocklist/poller.go Outdated
@zalegrala
Copy link
Copy Markdown
Contributor Author

Thanks for the review @joe-elliott. I'm happy to chat through this whenever.

Comment thread modules/frontend/v1/frontendv1pb/frontend.pb.go Outdated
@zalegrala zalegrala force-pushed the pollingLessWork branch 2 times, most recently from ee8950c to bf5f1c6 Compare August 21, 2023 21:22
Comment thread tempodb/blocklist/poller.go Outdated
Comment thread tempodb/blocklist/poller.go Outdated
} else if cm != nil {
chCompactedMeta <- cm
} else if err != nil {
mtx.Lock()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cost of the lock in this context is basically nothing. lock/defer unlock is a safer pattern that is more difficult to mess up as functionality changes.

you could also rely on the lock in this case to protect the err var. up to you

Comment thread integration/poller/poller_test.go
Comment thread tempodb/backend/gcs/gcs.go
Comment thread tempodb/backend/gcs/gcs.go Outdated
Comment thread tempodb/blocklist/poller.go Outdated
Comment thread tempodb/blocklist/poller.go Outdated
Copy link
Copy Markdown
Collaborator

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of smaller comments, but I'm g2g. Great improvement!

Comment thread .github/workflows/ci.yml
Comment thread tempodb/backend/gcs/gcs.go Outdated
Comment thread tempodb/backend/gcs/gcs.go Outdated
continue
}

// TODO: Review the ability to avoid polling for compacted blocks that we
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the code you've added does this? if so we can clean up this todo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. What I had in mind might not be possible, but the note here is about avoiding pulling the data entirely if the block transitions from meta -> compactedMeta. We have all the detail except the time it was compacted. More generally, I wonder if we already had this data in memory, then we know that it is only one polling cycle out of date, and perhaps we could be fuzzy enough on this that we don't have to know the precise time. More thinking needed probably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve Polling Performance at high block counts

5 participants