Histogram should ignore NaN observations #1275

Closed
fstab opened this issue May 15, 2023 · 9 comments
Comments

@fstab
Member

fstab commented May 15, 2023

When a NaN value is observed, the current implementation of histogram increases count and sets sum to NaN.

https://github.com/prometheus/client_golang/blob/main/prometheus/histogram_test.go#L541-L544

This violates two parts of the OpenMetrics spec:

  • Sum MUST NOT be NaN.
  • Count must be equal to the +Inf bucket.

This means NaN values must be ignored according to OpenMetrics.

I think ignoring NaN observations is a good idea, because then the count can be calculated from the bucket values. I'm doing this in client_java's new data model, and I like it because the derived count can never become inconsistent.

Anyway, it would be good if all Prometheus client libraries behaved the same, so what do you think of ignoring NaN observations?
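
To make the difference concrete, here is a minimal sketch of the two behaviours being compared. It is a toy model, not client_golang or client_java code; the class name `RunningTotals` and the `ignore_nan` flag are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningTotals:
    """Toy model of a histogram's count and sum; not client library code."""
    ignore_nan: bool = False  # True models the behaviour proposed in this issue
    count: int = 0
    total: float = 0.0

    def observe(self, value: float) -> None:
        if self.ignore_nan and math.isnan(value):
            return  # drop the observation entirely
        self.count += 1
        # NaN + x is NaN, so a single NaN observation poisons the sum for good.
        self.total += value


current = RunningTotals()
proposed = RunningTotals(ignore_nan=True)
for v in (float("nan"), 2.0):
    current.observe(v)
    proposed.observe(v)

print(current.count, current.total)    # count includes the NaN, sum is NaN
print(proposed.count, proposed.total)  # NaN left no trace
```

With `ignore_nan=True` the count can always be re-derived from the buckets, which is the consistency argument made above.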

@beorn7
Member

beorn7 commented May 16, 2023

The current implementation is deliberate. It's mathematically the right thing to do. It has also been propagated throughout the stack. For example, we take into account in PromQL that a mismatch between the count and the number of observations in buckets is due to NaN observations.
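
As a rough illustration of the mismatch mentioned above: if NaN observations are counted but land in no bucket, their number can be read off as the difference between the count and the bucket populations. This is a simplified sketch, not PromQL's actual code; `nan_observations` and its parameters are hypothetical names, and it assumes NaN is the only source of mismatch.

```python
def nan_observations(count: int, bucket_counts: list, zero_count: int = 0) -> int:
    """Number of observations that were counted but are in no bucket.

    bucket_counts are per-bucket (non-cumulative) populations; zero_count is
    the population of a native histogram's zero bucket.
    """
    in_buckets = sum(bucket_counts) + zero_count
    missing = count - in_buckets
    if missing < 0:
        # More observations in buckets than counted: not explainable by NaN.
        raise ValueError("buckets hold more observations than count")
    return missing


# Five observations counted, four of them found in buckets.
print(nan_observations(5, [1, 2], zero_count=1))
```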

OpenMetrics has very little weight in terms of native histograms. It doesn't support native histograms at all so far, and it is inherently inconsistent even when it comes to classic histograms. (I have voiced my concerns about this long ago, but as usual, they got ignored.)

Having said all that, I do think NaN observations don't really make sense. They SHOULD be avoided. Therefore, in practice, what we specify as the precise behavior doesn't really make a huge difference. However, I believe we should just treat that edge case in a mathematically meaningful way rather than forbidding it and thereby creating more failure cases that would have to be dealt with in a disruptive way. (For example, if we specify that the sum MUST NOT be NaN, we have to entirely fail every scrape that contains a single sum that is NaN.)

@fstab
Member Author

fstab commented May 23, 2023

Thanks Björn! Closing the issue because this answers my question :).

@fstab fstab closed this as completed May 23, 2023
@krajorama
Member

I ran a test myself today: client_golang does add NaN to the sum, but the exposed count and +Inf bucket are actually correct. I observed NaN and 2.0:

# HELP golang_manual_histogram This is a histogram with manually selected parameters
# TYPE golang_manual_histogram histogram
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.005"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.01"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.025"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.05"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.1"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.25"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="0.5"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="1.0"} 0
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="2.5"} 1
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="5.0"} 1
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="10.0"} 1
golang_manual_histogram_bucket{address="0.0.0.0",port="5001",le="+Inf"} 2
golang_manual_histogram_sum{address="0.0.0.0",port="5001"} NaN
golang_manual_histogram_count{address="0.0.0.0",port="5001"} 2

client_java 1.3.6 ignores the NaN in the counts as well as in the sum.

# TYPE krajo_hist_seconds histogram
# UNIT krajo_hist_seconds seconds
# HELP krajo_hist_seconds number of seconds since this application was started
krajo_hist_seconds_bucket{le="0.005"} 0
krajo_hist_seconds_bucket{le="0.01"} 0
krajo_hist_seconds_bucket{le="0.025"} 0
krajo_hist_seconds_bucket{le="0.05"} 0
krajo_hist_seconds_bucket{le="0.1"} 0
krajo_hist_seconds_bucket{le="0.25"} 0
krajo_hist_seconds_bucket{le="0.5"} 0
krajo_hist_seconds_bucket{le="1.0"} 0
krajo_hist_seconds_bucket{le="2.5"} 1
krajo_hist_seconds_bucket{le="5.0"} 1
krajo_hist_seconds_bucket{le="10.0"} 1
krajo_hist_seconds_bucket{le="+Inf"} 1
krajo_hist_seconds_count 1
krajo_hist_seconds_sum 2.0
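
The two samples above can be checked against the two OpenMetrics rules quoted at the top of this issue (sum must not be NaN, count must equal the +Inf bucket). The following is only a sketch: the regular expression handles simple lines like the ones pasted here, not the full exposition format, and it assumes the text contains a single histogram series. The function name `check_classic_histogram` is made up.

```python
import math
import re

# name, optional {labels}, value; good enough for the samples in this thread
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)


def check_classic_histogram(text: str) -> list:
    """Return a list of rule violations found in one exposed histogram."""
    inf_bucket = count = total = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/UNIT metadata
        m = LINE.match(line)
        if m is None:
            continue
        name, labels, value = m["name"], m["labels"] or "", float(m["value"])
        if name.endswith("_bucket") and 'le="+Inf"' in labels:
            inf_bucket = value
        elif name.endswith("_sum"):
            total = value
        elif name.endswith("_count"):
            count = value
    problems = []
    if total is not None and math.isnan(total):
        problems.append("sum is NaN")
    if count is not None and inf_bucket is not None and count != inf_bucket:
        problems.append("count != +Inf bucket")
    return problems
```

Fed the client_golang output, this reports only the NaN sum; fed the client_java output, it reports nothing.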

Would be nice to agree on something for OpenMetrics 2.0.

@beorn7
Member

beorn7 commented Apr 24, 2025

Maybe unsurprisingly, I still agree with myself. The NH behavior makes sense, and OMv2 should be specified accordingly.

It would be good to clarify the behavior of classic histograms. Currently, the +Inf bucket is identical to the count by definition. This creates an inconsistency with native histograms where the count can be greater than the sum of all buckets (because count counts NaN observations, which are in no bucket).

Options:

  1. We could forbid NaN observations for classic histograms.
  2. We can define that the +Inf bucket is an exception to the rule above (even in NHCB) and it does count NaN.
  3. We can make the +Inf bucket and the count different after all.

While (3) would be best if we designed things from scratch, it would break an assumption that has been baked into a lot of code. So I would not recommend it. (1) seems very reasonable, but I'm afraid many instrumentation libraries have allowed NaN , so this could be seen as a breaking change. (Although it's one of those things that wouldn't really matter in practice.) (2) seems to be what client_golang is currently doing, and we can easily adjust the NH spec accordingly.
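
The three options can be sketched side by side as follows. This is a toy model of a classic histogram with cumulative buckets, not code from any client library. The names `NaNPolicy` and `expose` are hypothetical, and raising an error for option (1) is an assumption; a library could equally drop the value silently.

```python
import math
from enum import Enum


class NaNPolicy(Enum):
    FORBID = 1            # option (1): NaN observations are not allowed
    INF_COUNTS_NAN = 2    # option (2): the +Inf bucket also counts NaN
    INF_EXCLUDES_NAN = 3  # option (3): +Inf bucket and count may differ


def expose(observations, upper_bounds, policy):
    """Return (cumulative bucket counts with +Inf last, count, sum)."""
    cumulative = [0] * (len(upper_bounds) + 1)
    count, total = 0, 0.0
    for v in observations:
        if math.isnan(v) and policy is NaNPolicy.FORBID:
            raise ValueError("NaN observations are not allowed")
        count += 1
        total += v
        if math.isnan(v):
            # NaN is below no finite bound; only option (2) puts it in +Inf.
            if policy is NaNPolicy.INF_COUNTS_NAN:
                cumulative[-1] += 1
            continue
        for i, bound in enumerate(upper_bounds):
            if v <= bound:
                cumulative[i] += 1
        cumulative[-1] += 1  # every non-NaN value is <= +Inf
    return cumulative, count, total


obs = [float("nan"), 2.0]
print(expose(obs, [1.0, 2.5, 10.0], NaNPolicy.INF_COUNTS_NAN))
print(expose(obs, [1.0, 2.5, 10.0], NaNPolicy.INF_EXCLUDES_NAN))
```

Under option (2) the +Inf bucket and the count stay equal, matching the client_golang output earlier in this thread; under option (3) they differ by the number of NaN observations.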

@beorn7
Member

beorn7 commented Apr 24, 2025

A somewhat wild thought to combine the benefits of (1) and (3):

In OMv2, we could specify (3) (of course still with "you SHOULD NOT observe NaN values") but also note that instrumentation libraries MAY forbid NaN observations, in particular to enable classic histograms that are exposable via OMv1 or classic Prometheus formats.

@krajorama
Member

@fstab has suggested I also take a look at OpenTelemetry:

Observed a NaN in explicit bucket (~classic histogram):

ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope test-meter 
Metric #0
Descriptor:
     -> Name: otel_manual_histogram
     -> Description: test histogram
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
StartTimestamp: 2025-04-25 16:29:26.925304769 +0000 UTC
Timestamp: 2025-04-25 16:30:26.984423405 +0000 UTC
Count: 1
Sum: NaN
Min: NaN
Max: NaN
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 750.000000
ExplicitBounds #10: 1000.000000
ExplicitBounds #11: 2500.000000
ExplicitBounds #12: 5000.000000
ExplicitBounds #13: 7500.000000
ExplicitBounds #14: 10000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 1
	{"otelcol.component.id": "debug", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}

Observed a normal value afterwards:

ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope test-meter 
Metric #0
Descriptor:
     -> Name: otel_manual_histogram
     -> Description: test histogram
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
StartTimestamp: 2025-04-25 16:29:26.925304769 +0000 UTC
Timestamp: 2025-04-25 16:32:26.984406402 +0000 UTC
Count: 2
Sum: NaN
Min: NaN
Max: NaN
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 750.000000
ExplicitBounds #10: 1000.000000
ExplicitBounds #11: 2500.000000
ExplicitBounds #12: 5000.000000
ExplicitBounds #13: 7500.000000
ExplicitBounds #14: 10000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 1
Buckets #10, Count: 0
Buckets #11, Count: 0
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 1
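
One possible explanation for the NaN ending up in the last bucket (#15) is a bucket lookup of the form "first bound with value <= bound, else the overflow bucket": every comparison with NaN is false, so the search falls through to the end. This is an assumption, not something taken from the SDK source. The sketch below reproduces the two dumps under that assumption; the second observed value is not stated above, so 600.0 is a made-up stand-in that falls into bucket #9.

```python
import math

# Explicit bounds as printed by the collector above (15 bounds, 16 buckets).
BOUNDS = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
          1000, 2500, 5000, 7500, 10000]


def bucket_index(value: float, bounds: list) -> int:
    """Index of the first bound with value <= bound, else the overflow bucket."""
    for i, bound in enumerate(bounds):
        if value <= bound:  # always False when value is NaN
            return i
    return len(bounds)


def aggregate(values, bounds):
    """Return (per-bucket counts, sum) for a sequence of observations."""
    counts = [0] * (len(bounds) + 1)
    total = 0.0
    for v in values:
        counts[bucket_index(v, bounds)] += 1
        total += v  # stays NaN once a NaN has been added
    return counts, total


counts, total = aggregate([float("nan"), 600.0], BOUNDS)
print(counts, total)
```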

@krajorama
Member

When I did the same for an OTel exponential histogram:

Observing NaN first: nothing was printed on the collector side.
Observed a value next:

ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope test-meter 
Metric #0
Descriptor:
     -> Name: otel_manual_histogram
     -> Description: test histogram
     -> Unit: ms
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Cumulative
ExponentialHistogramDataPoints #0
StartTimestamp: 2025-04-25 16:39:01.887065846 +0000 UTC
Timestamp: 2025-04-25 16:44:01.945378508 +0000 UTC
Count: 1
Sum: 347.529324
Bucket (347.529214, 347.529444], Count: 1
	{"otelcol.component.id": "debug", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}

Other way around:
Observe value first, then NaN: got the same outcome. But I'm using an old SDK version, so I'm upgrading to something newer.
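
For reference, the bucket printed above can be related to the OTel exponential histogram mapping, where a bucket with index i at a given scale covers (base^i, base^(i+1)] with base = 2^(2^-scale). The sketch below is a simplified, floating-point version of that mapping; the function name is made up, and scale 20 is an assumption based on the printed bucket width. NaN has no logarithm and therefore no bucket index, which is consistent with it leaving no trace in the output, although the thread does not establish why the SDK drops it.

```python
import math


def exponential_bucket(value: float, scale: int):
    """Return (index, lower, upper) for a positive finite value, else None.

    Zero and negative values have their own buckets in the real data model;
    they are left out here to keep the sketch short.
    """
    if math.isnan(value) or math.isinf(value) or value <= 0:
        return None
    factor = 2 ** scale  # number of buckets per power of two
    # Bucket i covers (2**(i/factor), 2**((i+1)/factor)]; values sitting
    # exactly on a boundary may be off by one due to floating-point rounding.
    index = math.ceil(math.log2(value) * factor) - 1
    lower = 2 ** (index / factor)
    upper = 2 ** ((index + 1) / factor)
    return index, lower, upper


print(exponential_bucket(347.529324, 20))
print(exponential_bucket(float("nan"), 20))
```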

@krajorama
Member

Same results with the latest OTel SDK, v1.35.0. ^

@krajorama
Member

As far as I can see, the OTel data model doesn't specify what to do with NaN.
