Design proposal: Parquet blocks #1407
Conversation
…ls, edit dynamic schemas section Signed-off-by: Annanay <annanayagarwal@gmail.com>
joe-elliott
left a comment
Very exciting work. Let's go!
Only other thought is that perhaps we should link to the dremel paper that parquet is based on. The nested columnar approach is such a nice match to trace data.
| We also had the following requirements of the new format:
| * Roundtrippable with OTLP - A new block format must support full trace read/write, so it must be able to be converted from OTLP and back again.
| * More efficient search - Reduce i/o
| * Faster search - There's no point in having a new format that is slower
| ### Dedicated columns
| Projecting attributes to their own columns has massive benefits for speed and size, and these are too good to pass up. Therefore we are taking an opinionated approach and projecting some common attributes to their own columns. All other attributes are stored in the generic key/value maps and are still searchable, but not as quickly. We chose these attributes based on what we commonly use ourselves (scratching our own itch), but we think they will be useful to most workloads. There is an associated cost for unused attributes, as a column of nulls is still stored, but the cost should be minimal. We can be more generous with unused resource-level attributes because their overhead is the smallest of all; we are finding it to be ~0.05% (0.0005) of the total block size.
| Resource-level:
If we made these configurable would we have to store it in the block meta somewhere? or would it be dynamically detected during querying?
I feel like it would blow up the block meta too much to store these for every block, but I like the idea of having a configurable list of "dedicated columns".
Even if we added a configurable list of dedicated columns, the schema together with all columns would be part of the parquet metadata, and we are looking at ways to cache that to prevent pulling it from the object store on every read. So I think it will be "fine" until it blows up to a million columns, but I'm just speculating.
Making the columns configurable is a good idea and we had the same thought, to be "less opinionated", but it still ends up as a dynamic schema, which is non-trivial and presents several challenges that we were avoiding in this first pass: (1) read/write using reflection, since a static struct is not available; (2) attribute name -> column name translation; (3) type handling (unless name + type is part of the configuration).
Yes the plan for when we get to dynamic schema is to detect the columns at runtime during querying.
| }
| ```
| Pros:
With 14 day retention, object storage is ~30% of our TCO. Above it sounds like we are within ~5% of current block size. If this schema is improving query times, I think it's worth it.
Sorry, not sure if this was clear: the parquet blocks are ~5% smaller, so it is a nice benefit.
| * Many nulls - as each attribute only populates 1 column, the other 5 are guaranteed to be null. This has a non-trivial increase in storage size.
| ### Event Attributes
| Span event attributes are stored as JSON-encoded strings in a generic key/value map. This is by far the most space-efficient encoding and the trade-off of decreased searchability seems worthwhile. Storing event attributes this way reduces the block size ~16% for our dataset, which is huge. There are currently no use cases to search event attributes, but we can revisit this in the future if needed.
I think this is a smart trade off for getting started. As we look at adding queryable events to TraceQL we can revisit.
| ```
| === RUN TestSearchProto
| Traces : 55
I think "Traces" means "Traces Matched"?
Yes. Correct. We added that metric in our local tests to confirm both proto/parquet are returning the same number of hits. We can leave this line out in this doc though.
| * Creating blocks: The ingester will write and flush parquet-formatted blocks. The compactor will only compact like-encoded blocks, i.e. parquet to parquet. This seems better than having the ingester continue to flush proto and the compactor convert it.
| * One tunable is the row group size. Our index pages target something like 256K, but row groups are typically much larger, 50-100MB. Not sure about the best value here, will need to experiment.
I'd say this difference is concerning. The smaller pages allows for heavy parallelization across the same block. My guess is that the significantly reduced i/o will more than make up for it. Agree experimentation is the best path to finding a good value.
While we could reduce the rowgroup size further and get a lot more workers active on a single block, I agree with you that the reduced I/O makes up for the reduced parallelism. A lot of our local benchmarking uses sequential scanning of all rowgroups and the results are pretty good already. We'll continue to experiment.
Also, the columnar format allows adding a new dimension to parallelism -- which is to search on multiple columns simultaneously.
Good thoughts here. Adding a note that we default to 10MB per search shard, which seems like a reasonable lower bound for row group size experimentation. Also our default index downsampling is 1MB not 256KB, will fix.
Co-authored-by: Joe Elliott <joe.elliott@grafana.com>
What this PR does:
This PR adds a design proposal for Parquet blocks. Rendered markdown file can be viewed here
Which issue(s) this PR fixes:
n/a
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]