[clickhouse][perf] Restructure ClickHouse FindTraceIDs Query to Improve Performance#8125
Conversation
Signed-off-by: Mahad Zaryab <mahadzaryab1@gmail.com>
Signed-off-by: Mahad Zaryab <mahadzaryab1@gmail.com>
Signed-off-by: Mahad Zaryab <mahadzaryab1@gmail.com>
Signed-off-by: Mahad Zaryab <mahadzaryab1@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8125 +/- ##
==========================================
+ Coverage 95.67% 95.69% +0.02%
==========================================
Files 317 317
Lines 16734 16735 +1
==========================================
+ Hits 16010 16015 +5
+ Misses 571 568 -3
+ Partials 153 152 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Metrics Comparison SummaryTotal changes across all snapshots: 32 Detailed changes per snapshot📊 CassandraFile Name: summary_metrics_snapshot_cassandra
📊 CassandraFile Name: summary_metrics_snapshot_cassandra
📊 CassandraFile Name: summary_metrics_snapshot_cassandra
📊 CassandraFile Name: summary_metrics_snapshot_cassandra
📊 BadgerFile Name: summary_metrics_snapshot_badger
❌ Removed Metrics
View diff sample-jaeger_storage_badger_compaction_current_num_lsm{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_compaction_current_num_lsm{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_get_num_memtable{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_get_num_memtable{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_get_num_user{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_get_num_user{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_get_with_result_num_user{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_get_with_result_num_user{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_iterator_num_user{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_iterator_num_user{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_put_num_user{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_put_num_user{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_read_bytes_lsm{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_read_bytes_lsm{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_read_bytes_vlog{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_read_bytes_vlog{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_read_num_vlog{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_read_num_vlog{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_size_bytes_lsm{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_size_bytes_lsm{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_size_bytes_vlog{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_size_bytes_vlog{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_write_bytes_l0{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_write_bytes_l0{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_write_bytes_user{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_write_bytes_user{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_write_bytes_vlog{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_write_bytes_vlog{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_write_num_vlog{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_write_num_vlog{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}View diff sample-jaeger_storage_badger_write_pending_num_memtable{name="another_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}
-jaeger_storage_badger_write_pending_num_memtable{name="some_store",otel_scope_name="jaeger-v2",otel_scope_schema_url="",otel_scope_version="",role="tracestore"}➡️ View CI artifacts | View Summary Report logs Code CoverageCoverage 96.9% (baseline 46.4%) |
There was a problem hiding this comment.
Pull request overview
This PR improves the performance of the ClickHouse FindTraceIDs and FindTraces queries by restructuring the SQL to apply the result LIMIT before joining with trace_id_timestamps. Previously, every matching row was joined with trace_id_timestamps before limiting, causing unnecessary work at scale.
Changes:
- Splits
SearchTraceIDsSQL intoSearchTraceIDsBase(the inner subquery that filters distinct trace IDs withLIMIT) andSearchTraceIDs(the outer wrapper that joins withtrace_id_timestampsusingfmt.Sprintfformatting). - Updates
buildFindTraceIDsQueryinquery_builder.goto build the inner subquery separately and wrap it using the new template. - Updates all affected snapshot files and test mock keys to reflect the restructured query.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
internal/storage/v2/clickhouse/sql/queries.go |
Splits SearchTraceIDs into SearchTraceIDsBase (inner subquery) and SearchTraceIDs (outer JOIN wrapper template); changes LEFT JOIN to JOIN |
internal/storage/v2/clickhouse/tracestore/query_builder.go |
Builds inner subquery separately and wraps with fmt.Sprintf for the outer JOIN |
internal/storage/v2/clickhouse/tracestore/driver_test.go |
Adds whitespace normalization to the mock driver's substring matching |
internal/storage/v2/clickhouse/tracestore/reader_test.go |
Updates mock query response keys from SearchTraceIDs to SearchTraceIDsBase |
internal/storage/v2/clickhouse/tracestore/snapshots/TestFindTraceIDs_2.sql |
Updated snapshot for the new two-level query structure |
internal/storage/v2/clickhouse/tracestore/snapshots/TestFindTraces_WithFilters_2.sql |
Updated snapshot for the new two-level query structure |
internal/storage/v2/clickhouse/tracestore/snapshots/TestFindTraces_Success/single_span_1.sql |
Updated snapshot for the new two-level query structure |
internal/storage/v2/clickhouse/tracestore/snapshots/TestFindTraces_Success/multiple_spans_1.sql |
Updated snapshot for the new two-level query structure |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
This change seems to go in the opposite direction of what we want. The inner query has no time range so it has to search ALL segments in the table. The whole point of a join was to limit the search to a known time range, which would typically be just a few minutes, one or two segments on disk. |
@yurishkuro The In the approach that you mentioned, are you suggesting that we use the trace_id_timestamps table to narrow down the possible trace_ids based on the time range provided in the query? |
|
Sorry I confused this with GetTrace method. For FindTraceIDs yes you first have to search on the spans table and then join with timestamps table just to return ts range with the trace ID. The copilot is right that you want a LEFT JOIN to avoid skipping found trace IDs even if we can't find their time range (which is an optimization for Get). |
Signed-off-by: Mahad Zaryab <mahadzaryab1@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Which problem is this PR solving?
Description of the changes
How was this change tested?
Checklist
make lint testAI Usage in this PR (choose one)
See AI Usage Policy.