Design proposal: Parquet blocks #1407
Conversation
…ls, edit dynamic schemas section Signed-off-by: Annanay <annanayagarwal@gmail.com>
joe-elliott
left a comment
Very exciting work. Let's go!
Only other thought is that perhaps we should link to the dremel paper that parquet is based on. The nested columnar approach is such a nice match to trace data.
| We also had the following requirements of the new format:
| * Roundtrippable with OTLP - A new block format must support full trace read/write, so it must be able to be converted from OTLP and back again.
| * More efficient search - Reduce i/o
| * Faster search - There's no point in having a new format that is slower
| ### Dedicated columns
| Projecting attributes to their own columns has massive benefits for speed and size, and these are too good to pass up. Therefore we are taking an opinionated approach and projecting some common attributes to their own columns. All other attributes are stored in the generic key/value maps and are still searchable, but not as quickly. We chose these attributes based on what we commonly use ourselves (scratching our own itch), but we think they will be useful to most workloads. There is an associated cost for unused attributes, as a column of nulls is still stored, but the cost should be minimal. We can be more generous with unused resource-level attributes because their overhead is the smallest of all; we are finding it to be ~0.05% (0.0005) of the total block size.
| Resource-level:
If we made these configurable would we have to store it in the block meta somewhere? or would it be dynamically detected during querying?
I feel like it would blow up the block meta too much to store these for every block, but I like the idea of having a configurable list of "dedicated columns".
Even if we added a configurable list of dedicated columns, the schema together with all columns would be part of the parquet metadata, and we are looking at ways to cache that to prevent pulling it from the object store on every read. So I think it will be "fine" until it blows up to a million columns, but I'm just speculating.
Making the columns configurable is a good idea and we had the same thought, to be "less opinionated", but it still ends up as a dynamic schema, which is non-trivial and presents several challenges that we were avoiding in this first pass: (1) read/write using reflection, since a static struct is not available; (2) attribute name -> column name translation; (3) type handling (unless name + type is part of the configuration).
Yes the plan for when we get to dynamic schema is to detect the columns at runtime during querying.
| }
| ```
| Pros:
With 14 day retention, object storage is ~30% of our TCO. Above it sounds like we are within ~5% of current block size. If this schema is improving query times, I think it's worth it.
Sorry, not sure if this was clear: the parquet blocks are ~5% smaller, so it is a nice benefit.
| * Many nulls - as each attribute only populates 1 column, the other 5 are guaranteed to be null. This has a non-trivial increase in storage size.
| ### Event Attributes
| Span event attributes are stored as JSON-encoded strings in a generic key/value map. This is by far the most space-efficient encoding and the trade-off of decreased searchability seems worthwhile. Storing event attributes this way reduces the block size ~16% for our dataset, which is huge. There are currently no use cases to search event attributes, but we can revisit this in the future if needed.
I think this is a smart trade off for getting started. As we look at adding queryable events to TraceQL we can revisit.
| ```
| === RUN TestSearchProto
| Traces : 55
I think "Traces" means "Traces Matched"?
Yes. Correct. We added that metric in our local tests to confirm both proto/parquet are returning the same number of hits. We can leave this line out in this doc though.
| * Creating blocks: The ingester will write and flush parquet-formatted blocks. The compactor will only compact like-encoded blocks, i.e. parquet to parquet. This seems better than having the ingester continue to flush proto and the compactor convert it.
| * One tunable is the row group size. Our index pages target something like 256K, but row groups are typically much larger, 50-100MB. Not sure about the best value here, will need to experiment.
I'd say this difference is concerning. The smaller pages allows for heavy parallelization across the same block. My guess is that the significantly reduced i/o will more than make up for it. Agree experimentation is the best path to finding a good value.
While we could reduce the rowgroup size further and get a lot more workers active on a single block, I agree with you that the reduced I/O makes up for the reduced parallelism. A lot of our local benchmarking uses sequential scanning of all rowgroups and the results are pretty good already. We'll continue to experiment.
Also, the columnar format allows adding a new dimension to parallelism -- which is to search on multiple columns simultaneously.
Good thoughts here. Adding a note that we default to 10MB per search shard, which seems like a reasonable lower bound for row group size experimentation. Also our default index downsampling is 1MB not 256KB, will fix.
Co-authored-by: Joe Elliott <joe.elliott@grafana.com>
What this PR does:
This PR adds a design proposal for Parquet blocks. Rendered markdown file can be viewed here
Which issue(s) this PR fixes:
n/a
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]