Add partial retry capability to OTEL ES exporter. #2456
pavolloffay merged 1 commit into jaegertracing:master
Codecov Report
@@ Coverage Diff @@
## master #2456 +/- ##
==========================================
+ Coverage 95.50% 95.51% +0.01%
==========================================
Files 208 208
Lines 10756 10758 +2
==========================================
+ Hits 10273 10276 +3
+ Misses 407 406 -1
Partials 76 76
Continue to review full report at Codecov.
@joe-elliott would you like to have a look at this?
So the logic seems generally fine, but I'm concerned about the increased memory/cpu usage under normal operation due to creating the ConvertedData structs.
I'm also concerned about failure scenarios. If a large number of spans start failing for any reason then bulkItemsToTraces will start being called extremely rapidly putting additional memory/GC/cpu pressure on this process. I'm concerned a failing Elasticsearch cluster could then cause a cascading failure into the Jaeger collector/ingester.
So it seems that you are primarily using the ConvertedData struct to rebuild the trace to return in the PartialTracesError as accurately as possible. However, I don't think you need all this data. The data we actually need to store is already in the span (span data + process). Using the dbmodel.Span alone, I think we could rebuild enough of a trace to return in PartialTracesError that the next time it came into the exporter it would be sufficient to accurately save it to ES.
For instance, the InstrumentationLibrarySpan.InstrumentationLibrary data is being kept in order to rebuild the trace, but the Jaeger exporter doesn't do anything with this data.
Basically I'm proposing putting a partial/fake trace back into PartialTracesError derived from dbmodel.Span only. We should be able to rebuild the trace accurately enough that when it comes back into the ES exporter we will save the data correctly.
The
Rebuilding
right now it's not being converted to ES span model, but it's a pending feature that should be implemented. We need all original data (resource, instrumentation and span) to correctly reconstruct ES model span.
I see and agree 👍.
Was about to comment about the
Agree with this as well.
if traces.ResourceSpans().Len() == 0 {
	traces.ResourceSpans().Resize(1)
} else {
	traces.ResourceSpans().Resize(traces.ResourceSpans().Len())
Does this do anything?
Should we just set traces.ResourceSpans().Resize(len(bulkItems)) before the loop and make each bulk item its own ResourceSpans? This would be the simplest way to do it.
Otherwise I think you'd need a map of Resource => ResourceSpans which you could check for every bulkItem to see if a ResourceSpan for this Resource already existed and append to it.
+1 it's better to allocate at the beginning.
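The two options discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: `bulkItem`, `resourceSpans`, `span`, and the string `ResourceKey` are hypothetical stand-ins, not the real pdata or dbmodel types, and a real resource would need a proper comparable key rather than a single string.

```go
package main

import "fmt"

// Hypothetical stand-ins for the exporter's types; not the real pdata/dbmodel API.
type span struct {
	Name string
}

type bulkItem struct {
	ResourceKey string // assumed comparable identity of the resource, e.g. a service name
	Span        span
}

type resourceSpans struct {
	ResourceKey string
	Spans       []span
}

// onePerItem sizes the output once, before the loop, and gives every
// bulk item its own group (the simple approach).
func onePerItem(items []bulkItem) []resourceSpans {
	out := make([]resourceSpans, len(items))
	for i, it := range items {
		out[i] = resourceSpans{ResourceKey: it.ResourceKey, Spans: []span{it.Span}}
	}
	return out
}

// groupByResource keeps a map from resource key to group position and
// appends to an existing group when the resource was already seen.
func groupByResource(items []bulkItem) []resourceSpans {
	index := make(map[string]int) // resource key -> position in out
	var out []resourceSpans
	for _, it := range items {
		pos, ok := index[it.ResourceKey]
		if !ok {
			pos = len(out)
			index[it.ResourceKey] = pos
			out = append(out, resourceSpans{ResourceKey: it.ResourceKey})
		}
		out[pos].Spans = append(out[pos].Spans, it.Span)
	}
	return out
}

func main() {
	items := []bulkItem{
		{"frontend", span{"GET /"}},
		{"backend", span{"query"}},
		{"frontend", span{"render"}},
	}
	// Three groups with the simple approach, two when grouping by resource.
	fmt.Println(len(onePerItem(items)), len(groupByResource(items)))
}
```

The map variant produces fewer groups when spans share a resource, at the cost of a lookup per bulk item; whether that pays off depends on how often spans in a batch actually share a resource.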
if !spanData.Resource.IsNil() {
	spanData.Resource.CopyTo(rss.Resource())
}
rss.InstrumentationLibrarySpans().Resize(1)
Similar to above, you could keep some kind of map of Resource/InstrumentationLibrary => InstrumentationLibrarySpans which you'd then check to see if you had an InstrumentationLibrarySpans to append to or if you should just make a new one.
Might be faster to just make one ResourceSpans/InstrumentationLibrarySpans for each bulkitem and forget trying to reuse them. It's hard to say if spans in a batch will likely come from the same trace and this will be worthwhile or not. Probably depends on which processors are configured in the pipeline.
We cannot expect that spans in a batch share the same resource and instrumentation.
> Might be faster to just make one ResourceSpans/InstrumentationLibrarySpans for each bulkitem and forget trying to reuse them. It's hard to say if spans in a batch will likely come from the same trace and this will be worthwhile or not. Probably depends on which processors are configured in the pipeline.
Keeping the resource/instrumentation -> spans map would be more memory efficient. However, I would go with this simple approach for now. The instrumentation library has two string fields and the resource has a map.
Signed-off-by: Pavol Loffay <ploffay@redhat.com>
abb7a01 to 026ab6e
@joe-elliott could you please have a look again?
bulkItemsToTraces is difficult code to write in a performant way, particularly given the generated shim you're working through to access the proto.
Without going to more extremes (such as the maps already discussed) I think this is as good as can be done.
Looks good to me!
The partial retry, as the name suggests, ensures that only spans that failed to be stored are retried. The returned error encapsulates pdata.Traces that is used by qretry in the consecutive call to the exporter. The pdata.Traces for retry is created from a struct that is created by the ES model translator.
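The idea in the description can be illustrated with a small, self-contained sketch. Everything here is a simplified assumption rather than the exporter's real code: `partialError` stands in for PartialTracesError, `store` for the Elasticsearch bulk write, and `exportWithRetry` for the retry loop that the queued retry processor would drive.

```go
package main

import (
	"errors"
	"fmt"
)

// span is a hypothetical, minimal stand-in for the data being exported.
type span struct{ ID string }

// partialError is a hypothetical stand-in for PartialTracesError:
// it carries only the spans that failed to be stored.
type partialError struct {
	Failed []span
}

func (e *partialError) Error() string {
	return fmt.Sprintf("%d spans failed to be stored", len(e.Failed))
}

// store simulates a bulk write; accept reports whether the backend took a span.
func store(batch []span, accept func(span) bool) error {
	var failed []span
	for _, s := range batch {
		if !accept(s) {
			failed = append(failed, s)
		}
	}
	if len(failed) > 0 {
		return &partialError{Failed: failed}
	}
	return nil
}

// exportWithRetry resends only the spans reported as failed, up to maxAttempts.
// It returns the total number of spans sent across all attempts and the last error.
func exportWithRetry(batch []span, accept func(span) bool, maxAttempts int) (int, error) {
	sent := 0
	var err error
	for attempt := 0; attempt < maxAttempts && len(batch) > 0; attempt++ {
		sent += len(batch)
		err = store(batch, accept)
		var pe *partialError
		if !errors.As(err, &pe) {
			return sent, err // success, or an error that is not a partial failure
		}
		batch = pe.Failed // the next attempt carries only what failed
	}
	return sent, err
}

func main() {
	rejectOnce := map[string]bool{"b": true} // span "b" fails once, then succeeds
	accept := func(s span) bool {
		if rejectOnce[s.ID] {
			rejectOnce[s.ID] = false
			return false
		}
		return true
	}
	sent, err := exportWithRetry([]span{{"a"}, {"b"}, {"c"}}, accept, 3)
	fmt.Println(sent, err) // prints "4 <nil>": 3 spans on the first attempt, 1 on the retry
}
```

Without the partial error, the retry would resend all three spans, so the already stored spans would be written again.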