Design proposal: Parquet blocks#1407

Merged
mdisibio merged 7 commits into grafana:main from mdisibio:design-proposal-parquet
Jun 15, 2022

Conversation

@mdisibio (Contributor)

What this PR does:
This PR adds a design proposal for Parquet blocks. Rendered markdown file can be viewed here

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott (Collaborator) left a comment

Very exciting work. Let's go!

Only other thought is that perhaps we should link to the dremel paper that parquet is based on. The nested columnar approach is such a nice match to trace data.

Comment thread docs/design-proposals/2022-04 Parquet.md Outdated
We also had the following requirements of the new format:
* Roundtrippable with OTLP - A new block format must support full trace read/write, so it must be able to be converted from OTLP and back again.
* More efficient search - Reduce I/O
* Faster search - There's no point in having a new format that is slower
Collaborator:

🚀

### Dedicated columns
Projecting attributes to their own columns has massive benefits for speed and size, too good to pass up. Therefore we are taking an opinionated approach and projecting some common attributes to their own columns. All other attributes are stored in the generic key/value maps and are still searchable, but not as quickly. We chose these attributes based on what we commonly use ourselves (scratching our own itch), but we think they will be useful to most workloads. There is an associated cost for an unused attribute, since the block will still store a column of nulls, but it should be minimal. We can be more generous with unused resource-level attributes because their overhead is the smallest of all: we are finding it to be ~0.05% (0.0005) of the total block size.

Resource-level:
Collaborator:

If we made these configurable would we have to store it in the block meta somewhere? or would it be dynamically detected during querying?

I feel like it would blow up the block meta too much to store these for every block, but I like the idea of having a configurable list of "dedicated columns".

Contributor:

Even if we added a configurable list of dedicated columns, the schema together with all columns would be part of the parquet metadata, and we are looking at ways to cache that to prevent pulling it from the object store on every read. So I think it will be "fine" until it blows up to a million columns, but I'm just speculating.

Contributor Author:

Making the columns configurable is a good idea and we had the same thought to be "less opinionated", but it ends up as a dynamic schema still, which is non-trivial and presents several challenges that we were avoiding in this first pass. (1) read/write using reflection since a static struct is not available (2) attribute name -> column name translation (3) type-handling (unless name + type is part of the configuration).

Yes the plan for when we get to dynamic schema is to detect the columns at runtime during querying.

Comment thread docs/design-proposals/2022-04 Parquet.md Outdated
}
```

Pros:
Collaborator:

With 14-day retention, object storage is ~30% of our TCO. Above it sounds like we are within ~5% of the current block size. If this schema improves query times I think it's worth it.

Contributor Author:

Sorry, not sure if this was clear: the parquet blocks are ~5% smaller, so it is a nice benefit.

* Many nulls - as each attribute only populates 1 column, the other 5 are guaranteed to be null. This has a non-trivial increase in storage size.

### Event Attributes
Span event attributes are stored as JSON-encoded strings in a generic key/value map. This is by far the most space-efficient encoding and the trade-off of decreased searchability seems worthwhile. Storing event attributes this way reduces the block size ~16% for our dataset, which is huge. There are currently no use cases to search event attributes, but we can revisit this in the future if needed.
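The trade-off described above can be sketched in a few lines of Go using only the standard library (`encodeEventAttrs` is a hypothetical helper for illustration, not Tempo code): instead of a repeated key/value group per event, all attributes collapse into one compact string column, which compresses well but can only be searched as opaque text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeEventAttrs flattens a span event's attributes into a
// single JSON string, trading searchability for a much smaller
// on-disk footprint.
func encodeEventAttrs(attrs map[string]any) (string, error) {
	b, err := json.Marshal(attrs)
	return string(b), err
}

func main() {
	s, _ := encodeEventAttrs(map[string]any{"exception.type": "IOError"})
	// prints: {"exception.type":"IOError"}
	fmt.Println(s)
}
```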
Collaborator:

I think this is a smart trade off for getting started. As we look at adding queryable events to TraceQL we can revisit.


```
=== RUN TestSearchProto
Traces : 55
```

Collaborator:

I think "Traces" means "Traces Matched"?

Contributor:

Yes. Correct. We added that metric in our local tests to confirm both proto/parquet are returning the same number of hits. We can leave this line out in this doc though.

Comment thread docs/design-proposals/2022-04 Parquet.md Outdated

* Creating blocks: The ingester will write and flush parquet-formatted blocks. The compactor will only compact like-encoded-blocks, i.e. parquet to parquet. This seems better than having the ingester continue to flush proto and the compactor to convert them.

* One tunable is the row group size. Our index pages target something like 256K, but row groups are typically much larger, 50-100MB. Not sure about the best value here, will need to experiment.
Collaborator:

I'd say this difference is concerning. The smaller pages allow for heavy parallelization across the same block. My guess is that the significantly reduced I/O will more than make up for it. Agree that experimentation is the best path to finding a good value.

Contributor:

While we could reduce the rowgroup size further and get a lot more workers active on a single block, I agree with you that the reduced I/O makes up for the reduced parallelism. A lot of our local benchmarking uses sequential scanning of all rowgroups and the results are pretty good already. We'll continue to experiment.

Contributor:

Also, the columnar format allows adding a new dimension to parallelism -- which is to search on multiple columns simultaneously.
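A toy sketch of that extra dimension of parallelism, assuming in-memory string slices rather than real parquet column chunks: each column is scanned in its own goroutine, and matching row indices are collected per column.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// searchColumns scans several columns concurrently, returning the
// matching row indices per column -- a simplified stand-in for
// searching multiple parquet columns simultaneously.
func searchColumns(cols map[string][]string, substr string) map[string][]int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	hits := make(map[string][]int)
	for name, values := range cols {
		wg.Add(1)
		go func(name string, values []string) {
			defer wg.Done()
			var rows []int
			for i, v := range values {
				if strings.Contains(v, substr) {
					rows = append(rows, i)
				}
			}
			mu.Lock()
			hits[name] = rows
			mu.Unlock()
		}(name, values)
	}
	wg.Wait()
	return hits
}

func main() {
	cols := map[string][]string{
		"name":         {"GET /api", "db.query", "GET /health"},
		"service.name": {"frontend", "db", "frontend"},
	}
	// "GET" matches rows 0 and 2 of the "name" column only.
	fmt.Println(searchColumns(cols, "GET"))
}
```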

Contributor Author:

Good thoughts here. Adding a note that we default to 10MB per search shard, which seems like a reasonable lower bound for row group size experimentation. Also our default index downsampling is 1MB not 256KB, will fix.
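Under that 10MB-per-shard default, the shard count for a block is simple ceiling division; a quick sketch (illustrative arithmetic only, not Tempo's actual sharding code):

```go
package main

import "fmt"

// shardCount returns how many fixed-size search shards are needed
// to cover a block, rounding up so the tail bytes get a shard too.
func shardCount(blockBytes, shardBytes int64) int64 {
	return (blockBytes + shardBytes - 1) / shardBytes // ceiling division
}

func main() {
	const MiB = int64(1) << 20
	// A 1 GiB block with 10 MiB shards needs 103 shards.
	fmt.Println(shardCount(1<<30, 10*MiB))
}
```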

Co-authored-by: Joe Elliott <joe.elliott@grafana.com>
@mdisibio mdisibio mentioned this pull request Jun 8, 2022
3 tasks
@annanay25 annanay25 mentioned this pull request Jun 9, 2022
27 tasks
@mdisibio mdisibio merged commit 205787b into grafana:main Jun 15, 2022
@mdisibio mdisibio deleted the design-proposal-parquet branch April 25, 2023 18:50