-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Histogram should ignore NaN observations #1275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The current implementation is deliberate. It's mathematically the right thing to do. It has also been propagated throughout the stack. For example, we take into account in PromQL that a mismatch between the count and the number of observations in buckets is due to NaN observations. OpenMetrics has very little weight in terms of native histograms. It doesn't support native histograms at all so far, and it is inherently inconsistent even when it comes to classic histograms. (I have voiced my concerns about this long ago, but as usual, they got ignored.) Having said all that, I do think NaN observations don't really make sense. They SHOULD be avoided. Therefore, in practice, what we specify as the precise behavior doesn't really make a huge difference. However, I believe we should just treat that edge case in a mathematical meaningful way rather than forbidding it and thereby create more failure cases, that had to be dealt with in a disruptive way. (For example, if we specify that the sum MUST NOT be NaN, we have to fail all scrapes entirely that contain a single sum that is NaN.) |
Thanks Björn! Closing the issue because this answers my question :). |
I've done a test today myself and client golang does add NaN to Sum, but actually the exposed Count and +Inf is correct. I observed, NaN and 2.0.:
client_java 1.3.6 ignores the NaN in counts and sum as well.
Would be nice to agree on something for Openmetrics 2.0. |
Maybe unsurprisingly, I still agree with myself. The NH behavior makes sense, and OMv2 should be specified accordingly. It would be good to clarify the behavior of classic histograms. Currently, the +Inf bucket is identical to the count by definition. This creates an inconsistency with native histograms where the count can be greater than the sum of all buckets (because count counts NaN observations, which are in no bucket). Options:
While (3) would be best if we designed things from scratch, it would break an assumption that has been baked into a lot of code. So I would not recommend it. (1) seems very reasonable, but I'm afraid many instrumentation libraries have allowed NaN , so this could be seen as a breaking change. (Although it's one of those things that wouldn't really matter in practice.) (2) seems to be what client_golang is currently doing, and we can easily adjust the NH spec accordingly. |
A somewhat wild thought to combine the benefits of (1) and (3): In OMv2, we could specify (3) (of course still with "you SHOULD NOT observe NaN values") but also note that instrumentation libraries MAY forbid NaN observation, in particular to enable classic histograms that are exposable via OMv1 or classic Prometheus formats. |
@fstab has suggested I also take a look at OpenTelemetry: Observed a NaN in explicit bucket (~classic histogram):
Observed a normal value afterwards:
|
When I did the same for exponential OTEL histogram: Observing NaN first: nothing was printed on the collector side.
Other way around: |
Same results for otel sdk v1.35.0 latest ^ |
As far as I see the OTEL model doesn't specify what to do with NaN specifically. |
Uh oh!
There was an error while loading. Please reload this page.
When a
NaN
value is observed, the current implementation of histogram increasescount
and setssum
toNaN
.https://github.com/prometheus/client_golang/blob/main/prometheus/histogram_test.go#L541-L544
This violates two parts of the OpenMetrics spec:
This means
NaN
values must be ignored according to OpenMetrics.I think ignoring
NaN
observations is a good idea, because then the count can be calculated from the bucket values. I'm doing this inclient_java
's new data model, and I like it because the derived count can never become inconsistent.Anyway, it would be good if all Prometheus client libraries behaved the same, so what do you think of ignoring
NaN
observations?The text was updated successfully, but these errors were encountered: