vParquet5 - Dedicated event attributes and detection/support of blobs #5946
mdisibio merged 32 commits into grafana:main
…amically based on column cardinality
…er/write to account for it
fmt.Printf("%s attributes: ", scope)
for _, a := range attrList {
	fmt.Printf("\"%s\", ", a.name)
	fmt.Printf("\"%s.%s\" ", scope, a.name)
If there's not a strong preference for this, can we go back to not printing the scope? Otherwise we will need to update the dedicated automation, since it expects just a name here.
Good catch, reverted. We also aren't printing the event attributes in the simple summary, so effectively the new blob and event columns aren't usable by that automation, which I think is OK. Adrian and I have been chatting, and we think that rather than trying to update all of the output formats in this command to work with those features (and the new array option), a totally new command like tempo-cli suggest-columns with machine-readable output sounds better. Then we update the automation to use that.
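To make the "machine-readable output" idea concrete, here is a minimal sketch of what such a command might emit. Everything in it is an assumption for illustration: the suggest-columns command does not exist yet, and the `suggestedColumn` type, its field names, and the JSON layout are hypothetical, not taken from the Tempo codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// suggestedColumn is a hypothetical record for one suggested dedicated column.
// The field names and JSON keys are assumptions, not Tempo's actual format.
type suggestedColumn struct {
	Scope string `json:"scope"` // e.g. "span", "resource", "event"
	Name  string `json:"name"`  // attribute name without the scope prefix
	Blob  bool   `json:"blob"`  // true if the attribute was detected as a blob
}

// renderSuggestions serializes the suggestions as a single JSON array so that
// automation can parse it without scraping human-oriented text.
func renderSuggestions(cols []suggestedColumn) (string, error) {
	b, err := json.Marshal(cols)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := renderSuggestions([]suggestedColumn{
		{Scope: "span", Name: "http.url"},
		{Scope: "event", Name: "exception.stacktrace", Blob: true},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Keeping the scope in its own field would let the automation keep consuming a bare attribute name, which is the concern raised in the review comment above.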
What this PR does:
Several new vParquet5 features:
Dedicated attributes at the event level. This is straightforward, and they are now included in tempo-cli analyse block output. Does not affect previous formats.
Blobs: Detection and support for "blob" attributes. These are attributes with high cardinality and/or high length, such as UUIDs or stack traces, where the current dictionary encoding is a hindrance. Now tempo-cli analyse block can detect these and mark the dedicated column mapping accordingly. When reading and writing these columns, dictionary encoding is turned off (and we swap compression algorithms), for better overall performance and much reduced memory pressure, because we aren't encoding/decoding large dictionaries.
Example CLI Output:
The blob annotation is ignored by vParquet4, and since we also didn't change the selection of the top N columns, there will be no effect on vParquet4 or earlier. The dedicated columns analysis and configuration for those tenants will remain optimal.
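The blob detection described above (high cardinality and/or high length) can be illustrated with a small heuristic. This is only a sketch under stated assumptions: the `attrStats` type, the `isBlob` function, and both thresholds are hypothetical and are not the values or code used by tempo-cli.

```go
package main

import "fmt"

// attrStats holds hypothetical per-attribute statistics collected while
// scanning a block. These names are illustrative, not from the Tempo codebase.
type attrStats struct {
	name        string
	totalValues int // number of occurrences of the attribute
	distinct    int // number of distinct values seen
	totalBytes  int // sum of the byte lengths of all values
}

// Assumed thresholds, chosen only for illustration.
const (
	maxCardinalityRatio = 0.5 // distinct/total above this defeats a dictionary
	maxAvgLength        = 256 // average value length in bytes
)

// isBlob reports whether an attribute looks like a poor fit for dictionary
// encoding: either most values are unique, or the values are very long.
func isBlob(s attrStats) bool {
	if s.totalValues == 0 {
		return false
	}
	ratio := float64(s.distinct) / float64(s.totalValues)
	avgLen := float64(s.totalBytes) / float64(s.totalValues)
	return ratio > maxCardinalityRatio || avgLen > maxAvgLength
}

func main() {
	attrs := []attrStats{
		{name: "http.method", totalValues: 1000, distinct: 5, totalBytes: 4000},
		{name: "request.id", totalValues: 1000, distinct: 1000, totalBytes: 36000},
		{name: "exception.stacktrace", totalValues: 100, distinct: 40, totalBytes: 200000},
	}
	for _, a := range attrs {
		fmt.Printf("%s blob=%v\n", a.name, isBlob(a))
	}
}
```

With these assumed thresholds, a UUID-like attribute is flagged because nearly every value is unique, and a stack trace is flagged because of its length, while a short low-cardinality attribute keeps dictionary encoding.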
For example, running parquet-tools schema on a block with only 4 dedicated columns shows the remaining 6 columns dropped entirely:
Notes
Which issue(s) this PR fixes:
Related: #4694
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]