
WIP: Blocks partial results#988

Closed
mapno wants to merge 9 commits into grafana:main from mapno:ingester-partial-results

Conversation

Contributor

@mapno mapno commented Sep 28, 2021

What this PR does:

It supports partial results when block queries fail.

The general idea is to let queriers tolerate failures when querying individual blocks, and to let the query frontend decide whether there were too many errors or whether the partial result is worth returning.

Partial results are indicated with HTTP 206 (Partial Content).
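
A minimal sketch of the decision described above, assuming the frontend already knows how many block queries failed. `statusForBlockErrors` is a hypothetical helper, not code from this PR; only the error message and the `maxBlockErrCount` name are taken from the diff under review.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusForBlockErrors is a hypothetical helper: it maps the number of failed
// block queries to the HTTP status the query frontend would return.
// It returns an error when the failures exceed the allowed maximum.
func statusForBlockErrors(blockErrCount, maxBlockErrCount int) (int, error) {
	switch {
	case blockErrCount == 0:
		// Every block query succeeded: the result is complete.
		return http.StatusOK, nil
	case blockErrCount > maxBlockErrCount:
		// Too many failures: the partial result is not worth returning.
		return 0, fmt.Errorf("too many block queries failed (max %d)", maxBlockErrCount)
	default:
		// Some failures, but within the limit: return a partial result.
		return http.StatusPartialContent, nil
	}
}

func main() {
	for _, failed := range []int{0, 2, 5} {
		code, err := statusForBlockErrors(failed, 3)
		fmt.Println(failed, code, err)
	}
}
```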

Which issue(s) this PR fixes:
Fixes #899

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mapno mapno force-pushed the ingester-partial-results branch from 2db45f0 to d8a17a4 on September 28, 2021 15:22
@mapno mapno force-pushed the ingester-partial-results branch from d8a17a4 to 1729d48 on September 28, 2021 15:31
@mapno mapno marked this pull request as ready for review September 30, 2021 13:54
@mapno mapno changed the title from "Ingester partial results" to "WIP: Ingester partial results" Sep 30, 2021
@mapno mapno changed the title from "WIP: Ingester partial results" to "WIP: Blocks partial results" Sep 30, 2021
Contributor

@yvrhdn yvrhdn left a comment

Nice work already! Implementing partial results seems pretty tricky, and it touches a lot of code 😅

A general concern: passing the number of blocks that failed seems similar in intention to the SearchMetrics, that is, we want to capture some statistics about what happened in the ingester or querier and pass them all the way up.
Maybe instead of a uint32 blockErrCount we could work with a struct that is specific to custom metrics about the query. In the future we could also add the number of blocks we scanned, the GBs, or whatever.
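
To make the suggestion concrete, here is a rough sketch of such a struct. The name `QueryMetrics`, its fields, and the `Add` method are all hypothetical and only illustrate the shape; they are not existing Tempo types.

```go
package main

import "fmt"

// QueryMetrics is a hypothetical container for per-query statistics that
// could replace a bare uint32 blockErrCount.
type QueryMetrics struct {
	FailedBlocks    uint32 // blocks whose query returned an error
	InspectedBlocks uint32 // blocks that were scanned
	InspectedBytes  uint64 // bytes read while scanning
}

// Add folds the metrics reported by one querier into the running total.
func (m *QueryMetrics) Add(other QueryMetrics) {
	m.FailedBlocks += other.FailedBlocks
	m.InspectedBlocks += other.InspectedBlocks
	m.InspectedBytes += other.InspectedBytes
}

func main() {
	// Made-up numbers standing in for what two queriers might report.
	shards := []QueryMetrics{
		{FailedBlocks: 1, InspectedBlocks: 10, InspectedBytes: 4096},
		{FailedBlocks: 0, InspectedBlocks: 12, InspectedBytes: 8192},
	}
	var total QueryMetrics
	for _, s := range shards {
		total.Add(s)
	}
	fmt.Printf("%+v\n", total)
}
```

Adding a new statistic later would then mean adding a field rather than changing every function signature along the way.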

var shardMissCount, totalBlockErrCount int
for _, rr := range rrs {
if rr.Response.StatusCode == http.StatusOK {
partialContent := rr.Response.StatusCode == http.StatusPartialContent

Nit: renaming this to isPartialContent or gotPartialContent might make it more obvious this is a boolean. For a while I thought this was a variable holding a part of the content.

var combinedTrace []byte
var shardMissCount = 0
var combinedTrace *tempopb.Trace
var combinedTraceBytes []byte

Nit: we don't use combinedTraceBytes in this for-loop yet, so we can move the declaration a bit further down in this function. This makes the code a bit easier to read, as we don't have to worry about this variable yet.

Comment thread tempodb/pool/pool.go

  data, enc, err := job.fn(job.ctx, job.payload)
- if data != nil {
+ if data != nil || err != nil {

I'm wondering if we still need this if-case? Will there ever be a job that returns no data and no error?
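
For illustration, a minimal sketch of why the guard might still matter, assuming a job can legitimately return neither data nor an error (for example, when a block simply does not contain the requested ID; that is an assumption, not something confirmed in this thread). `jobFunc`, `runJobs`, and `demoJobs` are hypothetical, simplified stand-ins, not the pool's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// jobFunc is a hypothetical, simplified stand-in for job.fn: it drops the
// context, payload and encoding and returns only data and an error.
type jobFunc func() ([]byte, error)

// runJobs runs each job in turn and sorts the outcome into three buckets:
// data found, error returned, or neither (counted as a miss).
func runJobs(jobs []jobFunc) (found [][]byte, errs []error, misses int) {
	for _, fn := range jobs {
		data, err := fn()
		if data != nil || err != nil {
			if err != nil {
				errs = append(errs, err)
			}
			if data != nil {
				found = append(found, data)
			}
			continue
		}
		// No data and no error: nothing to forward for this job.
		misses++
	}
	return found, errs, misses
}

// demoJobs returns one job per possible outcome: hit, miss and failure.
func demoJobs() []jobFunc {
	return []jobFunc{
		func() ([]byte, error) { return []byte("trace"), nil },
		func() ([]byte, error) { return nil, nil },
		func() ([]byte, error) { return nil, errors.New("block unreadable") },
	}
}

func main() {
	found, errs, misses := runJobs(demoJobs())
	fmt.Println(len(found), len(errs), misses)
}
```

If no job can ever take the third path, the condition is always true and the if-case can go.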

  if req.QueryMode == QueryModeBlocks || req.QueryMode == QueryModeAll {
  span.LogFields(ot_log.String("msg", "searching store"))
- partialTraces, dataEncodings, err := q.store.Find(opentracing.ContextWithSpan(ctx, span), userID, req.TraceID, req.BlockStart, req.BlockEnd)
+ partialTraces, dataEncodings, blockErrs, err := q.store.Find(opentracing.ContextWithSpan(ctx, span), userID, req.TraceID, req.BlockStart, req.BlockEnd)

We don't use blockErrs in a meaningful way (we only use the count, not the errors themselves). Maybe we should return only an int instead of []error?

Comment on lines +192 to +193
// err contains unrecoverable errors
// errs querying blocks are contained in blockErrs

Can you add these comments to Reader.Find itself?

- StatusCode: http.StatusOK,
- Body: ioutil.NopCloser(bytes.NewReader(combinedTrace)),
+ StatusCode: statusCode,
+ Body: ioutil.NopCloser(bytes.NewReader(combinedTraceBytes)),

@yvrhdn yvrhdn Sep 30, 2021

Suggested change
- Body: ioutil.NopCloser(bytes.NewReader(combinedTraceBytes)),
+ Body: io.NopCloser(bytes.NewReader(combinedTraceBytes)),

ioutil.NopCloser has been moved to io.NopCloser as of Go 1.16, see https://golang.org/doc/go1.16#ioutil
We just removed all occurrences of ioutil in #998 🙂
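
A small self-contained sketch of the suggested form, using only the standard library. `newTraceResponse` is a hypothetical helper that mirrors the two struct fields from the diff; it is not a function from this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// newTraceResponse is a hypothetical helper that wraps already-marshalled
// trace bytes in an *http.Response, using io.NopCloser (Go 1.16+) instead of
// the deprecated ioutil.NopCloser.
func newTraceResponse(statusCode int, combinedTraceBytes []byte) *http.Response {
	return &http.Response{
		StatusCode: statusCode,
		Body:       io.NopCloser(bytes.NewReader(combinedTraceBytes)),
	}
}

func main() {
	resp := newTraceResponse(http.StatusPartialContent, []byte("trace"))
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, string(body), err)
}
```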

Comment thread modules/querier/http.go
  span.SetTag("response marshalling format", util.JSONTypeHeaderValue)
  marshaller := &jsonpb.Marshaler{}
- err = marshaller.Marshal(w, resp.Trace)
+ err = marshaller.Marshal(w, resp)

@yvrhdn yvrhdn Sep 30, 2021

There is a difference between our responses depending on whether the caller accepts protobuf or not: if the caller accepts protobuf we return HTTP 206 and resp.Trace (line 82).

But if the caller requests something else (i.e. JSON), we marshal the entire TraceByIDResponse, which would look something like:

{
  "trace": ...,
  "blockErrCount": ...
}

Why not also return HTTP 206? This change breaks our API I think.
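
One possible way to keep the JSON path consistent with the protobuf path is sketched below: the body stays just the trace, and partial results are signalled only through the status code. The `trace` and `traceByIDResponse` types are simplified, hypothetical stand-ins for the tempopb types, and `encoding/json` is used here in place of jsonpb to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// trace and traceByIDResponse are simplified, hypothetical stand-ins for
// tempopb.Trace and tempopb.TraceByIDResponse.
type trace struct {
	Batches []string `json:"batches"`
}

type traceByIDResponse struct {
	Trace         *trace
	BlockErrCount uint32
}

// writeJSONTrace writes only the trace as the JSON body, so the payload shape
// is unchanged, and uses HTTP 206 to flag that some block queries failed.
func writeJSONTrace(w http.ResponseWriter, resp *traceByIDResponse) error {
	w.Header().Set("Content-Type", "application/json")
	if resp.BlockErrCount > 0 {
		w.WriteHeader(http.StatusPartialContent)
	}
	return json.NewEncoder(w).Encode(resp.Trace)
}

// render runs writeJSONTrace against an in-memory recorder and returns the
// resulting status code and body.
func render(resp *traceByIDResponse) (int, string) {
	rec := httptest.NewRecorder()
	if err := writeJSONTrace(rec, resp); err != nil {
		return http.StatusInternalServerError, err.Error()
	}
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := render(&traceByIDResponse{
		Trace:         &trace{Batches: []string{"b1"}},
		BlockErrCount: 1,
	})
	fmt.Println(code, body)
}
```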

Comment on lines +175 to +176
var resp tempopb.TraceByIDResponse
err = proto.Unmarshal(body, &resp)

The querier only seems to marshal the trace part of TraceByIDResponse (modules/querier/http.go:82):

b, err := proto.Marshal(resp.Trace)

How can we unmarshal a full TraceByIDResponse here?

Comment on lines 169 to 173
body, err := io.ReadAll(rr.Response.Body)
rr.Response.Body.Close()
if err != nil {
return nil, errors.Wrap(err, "error reading response body at query frontend")
}

I'm wondering if we should fail the entire request here. If we can't read the body from one of the requests, isn't this also a partial result?
Same for unmarshaling the body.
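
A rough sketch of what that could look like: an unreadable body is counted as a failed shard and the loop continues, instead of returning an error for the whole request. `readShardBodies`, `failingReader`, and `demoBodies` are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// failingReader is a test double whose Read always fails, simulating a
// response body that cannot be read.
type failingReader struct{}

func (failingReader) Read([]byte) (int, error) {
	return 0, errors.New("connection reset")
}

// readShardBodies reads every body and counts the unreadable ones as failed
// shards instead of aborting on the first read error.
func readShardBodies(bodies []io.ReadCloser) (read [][]byte, failed int) {
	for _, b := range bodies {
		data, err := io.ReadAll(b)
		b.Close()
		if err != nil {
			failed++
			continue
		}
		read = append(read, data)
	}
	return read, failed
}

// demoBodies returns two readable bodies and one that fails on read.
func demoBodies() []io.ReadCloser {
	return []io.ReadCloser{
		io.NopCloser(strings.NewReader("shard-0")),
		io.NopCloser(failingReader{}),
		io.NopCloser(strings.NewReader("shard-2")),
	}
}

func main() {
	read, failed := readShardBodies(demoBodies())
	fmt.Println(len(read), failed)
}
```

The failed count could then feed into the same partial-result decision as the block errors.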

Comment on lines +183 to 185
if totalBlockErrCount > maxBlockErrCount {
return nil, fmt.Errorf("too many block queries failed (max %d)", maxBlockErrCount)
}

I'm wondering if it's useful to fail on maxBlockErrCount. We already did the work: all the queriers returned something. Instead of throwing away the results we can still return whatever we got.

@mapno mapno mentioned this pull request Oct 5, 2021
3 tasks
Contributor Author

mapno commented Oct 5, 2021

#1002 created too many conflicts and a lot of changes had to be made anyway, so I moved the PR to #1007.

@mapno mapno closed this Oct 5, 2021
@mapno mapno deleted the ingester-partial-results branch October 5, 2021 11:56

Development

Successfully merging this pull request may close these issues.

Support Returning Partial Results

2 participants