[Parquet] Add traceid index to vparquet2 and vparquet3 and use it when finding trace by id#2697
Conversation
|
Before merging should we include this code in vparquet3? |
Yep, added as mandatory feature in vParquet3. Please take a look. |
…ck ended on exactly the configured rowgroup size while flushing
9c37545 to
cee39a8
Compare
|
Since vParquet3 is already being used in some installs, we decided it needs to not be a breaking change after all. Now it is optional like in vParquet2. |
That's a good thought and I don't have a great answer. Since we never determined the root cause or prevalence of that, it is hard to know this feature will be affected or not. I added an environment variable |
Yeah, I'm good. I suppose we could write a script that validates the index (and the min/max block size) after this is in a testing cluster to see if we find any inconsistencies. Let's keep an eye on the vult(ch)! |
What this PR does:
This re-adds an index file to vparquet2 and vparquet3 blocks to help find traces by ID. It contains the max trace ID of each row group and allows us to jump straight to the correct one, instead of the current method which performs i/o to test each group. Added as optional to avoid breaking changes. New blocks will create and use the index, and old blocks will fallback to the prior method.
Benchmarks don't show any improvement unless we simulate some latency for the i/o operations. Numbers below are for a level-0 block with 11 rowgroups. It saves around log2(rgs) i/o operations, so would be even better on larger blocks. Additionally the index file will be cached in practice but not included here.
0 ms:
50ms
100ms
Which issue(s) this PR fixes:
Fixes n/a
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]