Add partial retry capability to OTEL ES exporter. by pavolloffay · Pull Request #2456 · jaegertracing/jaeger

pavolloffay · 2020-09-04T11:39:30Z

The partial retry as the name suggests ensures that only spans that failed to be stored were retried. The returned error encapsulates pdata.Traces that is used by qretry on in the consecutive call to the exporter. The pdata.Traces for retry is created from a struct that is created by ES model translator.

codecov · 2020-09-04T14:40:02Z

Codecov Report

Merging #2456 into master will increase coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2456      +/-   ##
==========================================
+ Coverage   95.50%   95.51%   +0.01%     
==========================================
  Files         208      208              
  Lines       10756    10758       +2     
==========================================
+ Hits        10273    10276       +3     
+ Misses        407      406       -1     
  Partials       76       76

Impacted Files	Coverage Δ
plugin/storage/badger/spanstore/reader.go	`96.49% <0.00%> (-0.30%)`	⬇️
...lugin/sampling/strategystore/adaptive/processor.go	`100.00% <0.00%> (+0.79%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37eaced...026ab6e. Read the comment docs.

pavolloffay · 2020-09-04T15:09:16Z

@joe-elliott would you like to have a look at this?

joe-elliott

So the logic seems generally fine, but I'm concerned about the increased memory/cpu usage under normal operation due to creating the ConvertedData structs.

I'm also concerned about failure scenarios. If a large number of spans start failing for any reason then bulkItemsToTraces will start being called extremely rapidly putting additional memory/GC/cpu pressure on this process. I'm concerned a failing Elasticsearch cluster could then cause a cascading failure into the Jaeger collector/ingester.

So it seems that you primarily are using the ConvertedData struct to rebuild the trace to return in the PartialTracesError as accurately as possible. However, I don't think you need all this data. The data we actually need to store is already in the span (span data + process). Using the dbmodel.Span alone I think we could rebuild enough of a trace to return in PartialTracesError that the next time it it came into the exporter it would be sufficient to accurately save it to ES.

For instance, the InstrumentationLibrarySpan.InstrumentationLibrary data is being kept in order to rebuild the trace, but the Jaeger exporter doesn't do anything with this data.

Basically I'm proposing putting a partial/fake trace back into PartialTracesError derived from dbmodel.Span only. We should be able to rebuild the trace accurately enough that when it comes back into the ES exporter we will save the data correctly.

pavolloffay · 2020-09-05T13:33:17Z

So the logic seems generally fine, but I'm concerned about the increased memory/cpu usage under normal operation due to creating the ConvertedData structs.

The ConvertedData struct should be light weight and holds references to already allocated objects (ES model span and structs from pdata).

I'm also concerned about failure scenarios. If a large number of spans start failing for any reason then bulkItemsToTraces will start being called extremely rapidly putting additional memory/GC/cpu pressure on this process. I'm concerned a failing Elasticsearch cluster could then cause a cascading failure into the Jaeger collector/ingester.

So it seems that you primarily are using the ConvertedData struct to rebuild the trace to return in the PartialTracesError as accurately as possible. However, I don't think you need all this data. The data we actually need to store is already in the span (span data + process). Using the dbmodel.Span alone I think we could rebuild enough of a trace to return in PartialTracesError that the next time it it came into the exporter it would be sufficient to accurately save it to ES.

Rebuilding pdata.Traces from esmodel.Span would be more CPU/memory heavy. The idea behind ConvertedData is to keep references to already existing objects so in case of failure the remap is as simple as possible (which reminds me that instead of using CopyTo in bulkItemsToTraces we should use append where possible).

For instance, the InstrumentationLibrarySpan.InstrumentationLibrary data is being kept in order to rebuild the trace, but the Jaeger exporter doesn't do anything with this data.

right now it's not being converted to ES span model, but it's a pending feature that should be implemented. We need all original data (resource, instrumentation and span) to correctly reconstruct ES model span.

joe-elliott · 2020-09-05T15:08:48Z

The ConvertedData struct should be light weight and holds references to already allocated objects (ES model span and structs from pdata).

I see and agree 👍.

Rebuilding pdata.Traces from esmodel.Span would be more CPU/memory heavy. The idea behind ConvertedData is to keep references to already existing objects so in case of failure the remap is as simple as possible (which reminds me that instead of using CopyTo in bulkItemsToTraces we should use append where possible).

Was about to comment about the CopyTo() reviewing the code. It is allocating new objects. Agree on appending.

right now it's not being converted to ES span model, but it's a pending feature that should be implemented. We need all original data (resource, instrumentation and span) to correctly reconstruct ES model span.

Agree with this as well.

joe-elliott · 2020-09-05T15:20:26Z

cmd/opentelemetry/app/exporter/elasticsearchexporter/spanstore.go

+		if traces.ResourceSpans().Len() == 0 {
+			traces.ResourceSpans().Resize(1)
+		} else {
+			traces.ResourceSpans().Resize(traces.ResourceSpans().Len())


Does this do anything?

Should we just set traces.ResourceSpans().Resize(len(bulkItems)) before the loop and make each bulk item its own ResourceSpans? This would be the simplest way to do it.

Otherwise I think you'd need a map of Resource => ResourceSpans which you could check for every bulkItem to see if a ResourceSpan for this Resource already existed and append to it.

+1 it's better to allocate at the beginning.

joe-elliott · 2020-09-05T15:20:50Z

cmd/opentelemetry/app/exporter/elasticsearchexporter/spanstore.go

+		if !spanData.Resource.IsNil() {
+			spanData.Resource.CopyTo(rss.Resource())
+		}
+		rss.InstrumentationLibrarySpans().Resize(1)


Similar to above, you could keep some kind of map of Resource/InstrumentationLibrary => InstrumentationLibrarySpans which you'd then check to see if you had an InstrumentationLibrarySpans to append to or if you should just make a new one.

Might be faster to just make one ResourceSpans/InstrumentationLibrarySpans for each bulkitem and forget trying to reuse them. It's hard to say if spans in a batch will likely come from the same trace and this will be worthwhile or not. Probably depends on which processors are configured in the pipeline.

We cannot expect that spans in a batch share the same resource&instrumentation.

Might be faster to just make one ResourceSpans/InstrumentationLibrarySpans for each bulkitem and forget trying to reuse them. It's hard to say if spans in a batch will likely come from the same trace and this will be worthwhile or not. Probably depends on which processors are configured in the pipeline.

Keeping the resource/instrumentation -> spans would be more memory efficient. However, I would ho with this simple approach for now. The instrumentation library has two string fields and resource has a map.

Signed-off-by: Pavol Loffay <ploffay@redhat.com>

pavolloffay · 2020-09-16T13:46:32Z

@joe-elliott could you please have a look again?

joe-elliott

bulkItemsToTraces is difficult code to write in a performant way particurlarly given the generated shim you're working through to access the proto.

Without going to more extremes (such as the maps already discussed) I think this is as good as can be done.

Looks good to me!

pavolloffay added storage/elasticsearch area/otel labels Sep 4, 2020

pavolloffay requested a review from objectiser September 4, 2020 11:39

pavolloffay marked this pull request as ready for review September 4, 2020 15:08

pavolloffay requested a review from a team as a code owner September 4, 2020 15:08

pavolloffay changed the title ~~WIP: Add partial retry capability to OTEL ES exporter.~~ Add partial retry capability to OTEL ES exporter. Sep 4, 2020

joe-elliott reviewed Sep 4, 2020

View reviewed changes

joe-elliott reviewed Sep 5, 2020

View reviewed changes

Use partial retries in ES spanstore

026ab6e

Signed-off-by: Pavol Loffay <ploffay@redhat.com>

pavolloffay force-pushed the otel-retry branch from abb7a01 to 026ab6e Compare September 16, 2020 13:45

joe-elliott approved these changes Sep 16, 2020

View reviewed changes

pavolloffay merged commit 8604c63 into jaegertracing:master Sep 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add partial retry capability to OTEL ES exporter.#2456

Add partial retry capability to OTEL ES exporter.#2456
pavolloffay merged 1 commit intojaegertracing:masterfrom
pavolloffay:otel-retry

pavolloffay commented Sep 4, 2020

Uh oh!

codecov bot commented Sep 4, 2020 •

edited

Loading

Uh oh!

pavolloffay commented Sep 4, 2020

Uh oh!

joe-elliott left a comment •

edited

Loading

Uh oh!

pavolloffay commented Sep 5, 2020

Uh oh!

joe-elliott commented Sep 5, 2020

Uh oh!

joe-elliott Sep 5, 2020

Uh oh!

pavolloffay Sep 16, 2020

Uh oh!

joe-elliott Sep 5, 2020

Uh oh!

pavolloffay Sep 16, 2020

Uh oh!

joe-elliott Sep 16, 2020

Uh oh!

pavolloffay commented Sep 16, 2020

Uh oh!

joe-elliott left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pavolloffay commented Sep 4, 2020

Uh oh!

codecov bot commented Sep 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pavolloffay commented Sep 4, 2020

Uh oh!

joe-elliott left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pavolloffay commented Sep 5, 2020

Uh oh!

joe-elliott commented Sep 5, 2020

Uh oh!

joe-elliott Sep 5, 2020

Choose a reason for hiding this comment

Uh oh!

pavolloffay Sep 16, 2020

Choose a reason for hiding this comment

Uh oh!

joe-elliott Sep 5, 2020

Choose a reason for hiding this comment

Uh oh!

pavolloffay Sep 16, 2020

Choose a reason for hiding this comment

Uh oh!

joe-elliott Sep 16, 2020

Choose a reason for hiding this comment

Uh oh!

pavolloffay commented Sep 16, 2020

Uh oh!

joe-elliott left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Sep 4, 2020 •

edited

Loading

joe-elliott left a comment •

edited

Loading

joe-elliott left a comment •

edited

Loading