Conversation
Return slice of block errors
2db45f0 to
d8a17a4
Compare
d8a17a4 to
1729d48
Compare
There was a problem hiding this comment.
Nice work already! Implementing partial results is pretty tricky it seems and it touches a lot of code 😅
A general concern: passing the amount of blocks that failed seems similar in intention to the SearchMetrics, that is: we want to capture some statistics about what happened in the ingester or querier and pass it all the way up.
Maybe instead of a uint32 blockErrCount we could work with a struct that is specific for custom metrics about the query. In the future we could also add the amount of blocks we scanned and the GBs or whatever.
| var shardMissCount, totalBlockErrCount int | ||
| for _, rr := range rrs { | ||
| if rr.Response.StatusCode == http.StatusOK { | ||
| partialContent := rr.Response.StatusCode == http.StatusPartialContent |
There was a problem hiding this comment.
Nit: renaming this to isPartialContent or gotPartialContent might make it more obvious this is a boolean. For a while I thought this was a variable holding a part of the content.
| var combinedTrace []byte | ||
| var shardMissCount = 0 | ||
| var combinedTrace *tempopb.Trace | ||
| var combinedTraceBytes []byte |
There was a problem hiding this comment.
Nit: we don't use combinedTraceBytes in this for-loop yet, so we can move the declaration a bit more down in this function. This makes the code a bit easier to read as we don't have to worry about this variable yet.
|
|
||
| data, enc, err := job.fn(job.ctx, job.payload) | ||
| if data != nil { | ||
| if data != nil || err != nil { |
There was a problem hiding this comment.
I'm wondering if we still need this if-case? Will there ever be a job that returns no data and no error?
| if req.QueryMode == QueryModeBlocks || req.QueryMode == QueryModeAll { | ||
| span.LogFields(ot_log.String("msg", "searching store")) | ||
| partialTraces, dataEncodings, err := q.store.Find(opentracing.ContextWithSpan(ctx, span), userID, req.TraceID, req.BlockStart, req.BlockEnd) | ||
| partialTraces, dataEncodings, blockErrs, err := q.store.Find(opentracing.ContextWithSpan(ctx, span), userID, req.TraceID, req.BlockStart, req.BlockEnd) |
There was a problem hiding this comment.
We don't use blockErrs in a meaningful way (we only use the count, not the errors itself). Maybe we should only return an int instead of []error?
| // err contains unrecoverable errors | ||
| // errs querying blocks are contained in blockErrs |
There was a problem hiding this comment.
Can you add these comments to Reader.Find itself?
| StatusCode: http.StatusOK, | ||
| Body: ioutil.NopCloser(bytes.NewReader(combinedTrace)), | ||
| StatusCode: statusCode, | ||
| Body: ioutil.NopCloser(bytes.NewReader(combinedTraceBytes)), |
There was a problem hiding this comment.
| Body: ioutil.NopCloser(bytes.NewReader(combinedTraceBytes)), | |
| Body: io.NopCloser(bytes.NewReader(combinedTraceBytes)), |
ioutil.NopCloser has been moved to io.NopCloser, see https://golang.org/doc/go1.16#ioutil
We just removed all occurrences of ioutil here #998 🙂
| span.SetTag("response marshalling format", util.JSONTypeHeaderValue) | ||
| marshaller := &jsonpb.Marshaler{} | ||
| err = marshaller.Marshal(w, resp.Trace) | ||
| err = marshaller.Marshal(w, resp) |
There was a problem hiding this comment.
There is a difference between our responses depending on whether the caller accepts protobuf or not: if the caller accepts protobuf we return HTTP 206 and resp.Trace (line 82).
But if the caller requests something else (aka JSON) we marshal the entire TraceByIDResponse. Which would be something like:
{
"trace": ...,
"blockErrCount": ...
}Why not also return HTTP 206? This change breaks our API I think.
| var resp tempopb.TraceByIDResponse | ||
| err = proto.Unmarshal(body, &resp) |
There was a problem hiding this comment.
The querier only seems to marshal the trace part of TraceByIDResponse (modules/querier/http.go:82):
b, err := proto.Marshal(resp.Trace)
How can we unmarshal a full TraceByIDResponse here?
| body, err := io.ReadAll(rr.Response.Body) | ||
| rr.Response.Body.Close() | ||
| if err != nil { | ||
| return nil, errors.Wrap(err, "error reading response body at query frontend") | ||
| } |
There was a problem hiding this comment.
I'm wondering if we should fail the entire request here. If we can't read the body from one of the requests, isn't this also a partial result?
Same for unmarshaling the body.
| if totalBlockErrCount > maxBlockErrCount { | ||
| return nil, fmt.Errorf("too many block queries failed (max %d)", maxBlockErrCount) | ||
| } |
There was a problem hiding this comment.
I'm wondering if it's useful to fail on maxBlockErrCount. We already did the work: all the queriers returned something. Instead of throwing away the results we can still return whatever we got.
What this PR does:
It supports partial results when block queries fail.
The general idea is to let queriers fail when querying blocks and decide the query frontend if the errors are too many or if the partial result is worth returning.
Partial results are indicated with http 206 (partial content).
Which issue(s) this PR fixes:
Fixes #899
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]