Skip to content

Restore tempo_request_duration_seconds metrics for querier_api_* requests#3403

Merged
yvrhdn merged 2 commits intografana:mainfrom
yvrhdn:kvrhdn/restore-querier-metrics
Feb 16, 2024
Merged

Restore tempo_request_duration_seconds metrics for querier_api_* requests#3403
yvrhdn merged 2 commits intografana:mainfrom
yvrhdn:kvrhdn/restore-querier-metrics

Conversation

@yvrhdn
Copy link
Copy Markdown
Contributor

@yvrhdn yvrhdn commented Feb 16, 2024

What this PR does:

Recent changes (I suspect #3300) broke middleware on our querier requests. We have HTTP middleware that provides tracing and metrics. Requests sent to the querier are routed slightly differently and due to this change we didn't get metrics anymore for the HTTP routes.

Issue
So tempo_request_duration_seconds does not show up for routes like querier_api_search_tags and querier_api_traces_traceid. GRPC routes were still instrumented correctly (/tempopb.Querier/FindTraceByID and /tempopb.Querier/SearchTags).

How it works
Normal middleware flow:

http.Handler.Serve --> middleware (tracing, metrics) --> mux.Router (calls the correct http.Handler for each path)

When we set stream_over_http_enabled: false this will be

http.Handler.Serve --> dskit.Server (with instrumentation middleware) --> Tempo's mux.Router

When we set stream_over_http_enabled: true we switch up the order a bit:

http.Handler.Server --> GRPC interceptor --> instrumentation middleware --> Tempo's Router
                        (1)                                                 (2)

The bug: requests from query-frontend to querier are handled differently, they are pulled by the frontend worker process.
The frontend worker will pull a request from the frontend, unpack the GRPC requests and then push the HTTP request to the regular HTTP handler. This handler was the router (marked (2)), since this handler is after instrumentation we didn't get any metrics for it. By instead passing the start of the middleware chain ((1)) we fix this.

Adding HTTPRouter and HTTPHandler to TempoServer should make this more explicit.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

if err != nil {
return nil, fmt.Errorf("failed to create server: %w", err)
}
s.handler = s.externalServer.HTTPServer.Handler
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joe-elliott we were really close while pairing. We used s.externalServer.HTTP which is the router, HTTPServer.Handler is the handler which contains all middleware as well.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so close!


type TempoServer interface {
HTTP() *mux.Router
HTTPRouter() *mux.Router
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this rename for clarity

@yvrhdn yvrhdn enabled auto-merge (squash) February 16, 2024 18:21
@yvrhdn yvrhdn merged commit c041e08 into grafana:main Feb 16, 2024
yvrhdn pushed a commit to yvrhdn/tempo that referenced this pull request Feb 26, 2024
…ests (grafana#3403)

* Restore tempo_request_duration_seconds metrics for querier_api_* requests

* Update CHANGELOG.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants