Merged
2 changes: 1 addition & 1 deletion .gitignore
@@ -21,7 +21,7 @@
/tempo-vulture
/tempodb/encoding/benchmark_block
private-key.key
integration/e2e/**/e2e_integration_test[0-9]*
integration/**/e2e_integration_test[0-9]*
.tempo.yaml
/tmp
gh-token.txt
15 changes: 2 additions & 13 deletions CHANGELOG.md
@@ -1,20 +1,9 @@
## main / unreleased

* [CHANGE] Allow duplicate dimensions for span metrics and service graphs. This is a valid use case if using different instrumentation libraries, with spans having "deployment.environment" and others "deployment_environment", for example. [#6288](https://github.com/grafana/tempo/pull/6288) (@carles-grafana)
* [CHANGE] Update the default max duration for TraceQL metrics queries to one day [#6285](https://github.com/grafana/tempo/pull/6285) (@javiermolinar)
* [CHANGE] Enable TraceQL metrics query checks by default in Vulture [#6275](https://github.com/grafana/tempo/pull/6275) (@javiermolinar)
* [FEATURE] Add span_multiplier_key to overrides. This allows tenants to specify the attribute key used for span multiplier values to compensate for head-based sampling. [#6260](https://github.com/grafana/tempo/pull/6260) (@carles-grafana)
* [BUGFIX] Correct avg_over_time calculation [#6252](https://github.com/grafana/tempo/pull/6252) (@ruslan-mikhailov)
* [BUGFIX] Correct instant query calculation for rate() [#6205](https://github.com/grafana/tempo/pull/6205) (@ruslan-mikhailov)
* [ENHANCEMENT] Improved the live store readiness check and added `readiness_target_lag` and `readiness_max_wait` config parameters. If `readiness_target_lag` is set, the live store will not report `/ready` until its Kafka lag drops below that value [#6238](https://github.com/grafana/tempo/pull/6238) (@oleg-kozlyuk-grafana)

### 3.0 Cleanup

* [CHANGE] **BREAKING CHANGE** Removed `v2` block encoding and compactor component. [#6273](https://github.com/grafana/tempo/pull/6273) (@joe-elliott)
This includes the removal of the following CLI commands which were `v2` specific: `list block`, `list index`, `view index`, `gen index`, `gen bloom`.
* [CHANGE] **BREAKING CHANGE** Sets the `all` target to be 3.0 compatible and removes the `scalable-single-binary` target [#6283](https://github.com/grafana/tempo/pull/6283) (@joe-elliott)
* [CHANGE] Expose otlp http and grpc ports for Docker examples [#6296](https://github.com/grafana/tempo/pull/6296) (@javiermolinar)

# v2.10.0
# v2.10.0-rc.0

* [CHANGE] **BREAKING CHANGE** Validate tenant ID in frontend and distributor [#5786](https://github.com/grafana/tempo/pull/5786) (@carles-grafana)
* [CHANGE] Remove vParquet2 encoding [#6071](https://github.com/grafana/tempo/pull/6071) (@mdisibio)
9 changes: 9 additions & 0 deletions cmd/tempo/app/app.go
@@ -374,6 +374,15 @@
}
}

// LiveStore has a special check that makes sure it has caught up with Kafka
// before serving queries.
if t.liveStore != nil {
if err := t.liveStore.CheckReady(r.Context()); err != nil {
http.Error(w, "LiveStore not ready: "+err.Error(), http.StatusServiceUnavailable)
return
}

Check notice on line 383 in cmd/tempo/app/app.go (GitHub Actions / Coverage Annotations): Lines 380-383 are not covered by tests.

Contributor Author: Annotations don't catch e2e tests yet
}

http.Error(w, "ready", http.StatusOK)
}
}
2 changes: 2 additions & 0 deletions docs/sources/tempo/configuration/manifest.md
@@ -1416,4 +1416,6 @@ live_store:
name: http.status_code
type: int
options: []
readiness_target_lag: 0s
readiness_max_wait: 30m0s
```
4 changes: 4 additions & 0 deletions example/docker-compose/distributed/tempo.yaml
@@ -95,3 +95,7 @@ block_builder:

usage_report:
reporting_enabled: false

live_store:
readiness_target_lag: 5s
readiness_max_wait: 5m0s
@@ -0,0 +1,3 @@
live_store:
readiness_target_lag: 100ms
readiness_max_wait: 10s
@@ -0,0 +1,3 @@
live_store:
readiness_target_lag: 100ms
readiness_max_wait: 60s
225 changes: 225 additions & 0 deletions integration/operations/livestore_readiness_test.go
@@ -0,0 +1,225 @@
package deployments

import (
"net/http"
"testing"
"time"

"github.com/grafana/e2e"
"github.com/grafana/tempo/integration/util"
tempoUtil "github.com/grafana/tempo/pkg/util"
"github.com/stretchr/testify/require"
)

// TestLiveStoreReadinessDefaultBehavior verifies that with readiness_target_lag=0 (default),
// the LiveStore becomes ready immediately without waiting
func TestLiveStoreReadinessDefaultBehavior(t *testing.T) {
util.RunIntegrationTests(t, util.TestHarnessConfig{
Components: util.ComponentsRecentDataQuerying,
}, func(h *util.TempoHarness) {
liveStoreA := h.Services[util.ServiceLiveStoreZoneA]

// With default config (readiness_target_lag=0), LiveStore should be ready immediately
require.NoError(t, liveStoreA.WaitReady())

// Verify /ready endpoint returns 200
req, err := http.NewRequest("GET", "http://"+liveStoreA.Endpoint(3200)+"/ready", nil)
require.NoError(t, err)
httpResp, err := http.DefaultClient.Do(req)
require.NoError(t, err)
require.Equal(t, 200, httpResp.StatusCode)

// Verify tempo_live_store_ready metric is 1
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.Equals(1), "tempo_live_store_ready"))
})
}

// TestLiveStoreReadinessWithCatchUp verifies that readiness waiting works correctly
// and metrics are recorded
func TestLiveStoreReadinessWithCatchUp(t *testing.T) {
util.RunIntegrationTests(t, util.TestHarnessConfig{
ConfigOverlay: "config-livestore-readiness-enabled.yaml",
Components: util.ComponentsRecentDataQuerying,
}, func(h *util.TempoHarness) {
liveStoreA := h.Services[util.ServiceLiveStoreZoneA]
liveStoreB := h.Services[util.ServiceLiveStoreZoneB]

// Stop liveStoreB to simplify the test
require.NoError(t, liveStoreB.Stop())

// Wait for LiveStore to be ready
h.WaitTracesWritable(t)

// Write some traces to create Kafka lag
for i := 0; i < 5; i++ {
info := tempoUtil.NewTraceInfo(time.Now(), "")
require.NoError(t, h.WriteTraceInfo(info, ""))
}

// Wait for traces to be processed
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.GreaterOrEqual(5), "tempo_live_store_traces_created_total"))

// Stop the LiveStore
require.NoError(t, liveStoreA.Stop())

// Write more traces during downtime to create lag
for i := 0; i < 3; i++ {
require.NoError(t, h.WriteTraceInfo(tempoUtil.NewTraceInfo(time.Now(), ""), ""))
time.Sleep(100 * time.Millisecond)
}

// Restart LiveStore
require.NoError(t, liveStoreA.Start(h.TestScenario.NetworkName(), h.TestScenario.SharedDir()))

// Wait for it to become ready (it should catch up)
require.NoError(t, liveStoreA.WaitReady())

// Verify /ready endpoint returns 200
req, err := http.NewRequest("GET", "http://"+liveStoreA.Endpoint(3200)+"/ready", nil)
require.NoError(t, err)
httpResp, err := http.DefaultClient.Do(req)
require.NoError(t, err)
require.Equal(t, 200, httpResp.StatusCode)

// Verify tempo_live_store_ready metric is 1
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.Equals(1), "tempo_live_store_ready"))

// Verify catch_up_duration metric was recorded
// The metric should have at least one observation
metrics, err := liveStoreA.SumMetrics([]string{"tempo_live_store_catch_up_duration_seconds"})
require.NoError(t, err)
require.Greater(t, metrics[0], 0.0, "catch_up_duration should have been recorded")
})
}

// TestLiveStoreReadinessMaxWaitTimeout verifies that LiveStore becomes ready
// after readiness_max_wait even if lag is still high
func TestLiveStoreReadinessMaxWaitTimeout(t *testing.T) {
util.RunIntegrationTests(t, util.TestHarnessConfig{
ConfigOverlay: "config-livestore-readiness-timeout.yaml",
Components: util.ComponentsRecentDataQuerying,
}, func(h *util.TempoHarness) {
liveStoreA := h.Services[util.ServiceLiveStoreZoneA]
liveStoreB := h.Services[util.ServiceLiveStoreZoneB]

// Stop liveStoreB to simplify the test
require.NoError(t, liveStoreB.Stop())

// Wait for LiveStore to be ready initially
h.WaitTracesWritable(t)

// Write some traces
for i := 0; i < 3; i++ {
info := tempoUtil.NewTraceInfo(time.Now(), "")
require.NoError(t, h.WriteTraceInfo(info, ""))
}

// Wait for traces to be processed
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.GreaterOrEqual(3), "tempo_live_store_traces_created_total"))

// Stop the LiveStore
require.NoError(t, liveStoreA.Stop())

// Write many traces during downtime to create significant lag
// With readiness_target_lag=100ms and readiness_max_wait=5s,
// the LiveStore should become ready after 5s even if lag is high
for i := 0; i < 50; i++ {
require.NoError(t, h.WriteTraceInfo(tempoUtil.NewTraceInfo(time.Now(), ""), ""))
time.Sleep(200 * time.Millisecond) // Create lag that exceeds target
}

// Restart LiveStore
startTime := time.Now()
require.NoError(t, liveStoreA.Start(h.TestScenario.NetworkName(), h.TestScenario.SharedDir()))

// It should become ready due to max_wait timeout (5s)
require.NoError(t, liveStoreA.WaitReady())
elapsed := time.Since(startTime)

// Should have waited close to max_wait (5s), but not too long
require.Less(t, elapsed, 15*time.Second, "should become ready within reasonable time")

// Verify /ready endpoint returns 200
req, err := http.NewRequest("GET", "http://"+liveStoreA.Endpoint(3200)+"/ready", nil)
require.NoError(t, err)
httpResp, err := http.DefaultClient.Do(req)
require.NoError(t, err)
require.Equal(t, 200, httpResp.StatusCode)

// Verify tempo_live_store_ready metric is 1
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.Equals(1), "tempo_live_store_ready"))
})
}

// TestLiveStoreReadinessRestartWithLag verifies restart scenario with accumulated Kafka lag
func TestLiveStoreReadinessRestartWithLag(t *testing.T) {
util.RunIntegrationTests(t, util.TestHarnessConfig{
ConfigOverlay: "config-livestore-readiness-enabled.yaml",
Components: util.ComponentsRecentDataQuerying,
}, func(h *util.TempoHarness) {
liveStoreA := h.Services[util.ServiceLiveStoreZoneA]
liveStoreB := h.Services[util.ServiceLiveStoreZoneB]

// Stop liveStoreB to simplify the test
require.NoError(t, liveStoreB.Stop())

// Wait for initial readiness
h.WaitTracesWritable(t)

// Write initial traces
for i := 0; i < 3; i++ {
info := tempoUtil.NewTraceInfo(time.Now(), "")
require.NoError(t, h.WriteTraceInfo(info, ""))
}

// Wait for traces to be processed
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.GreaterOrEqual(3), "tempo_live_store_traces_created_total"))

// Verify ready state before restart
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.Equals(1), "tempo_live_store_ready"))

// Stop LiveStore
require.NoError(t, liveStoreA.Stop())

// Write traces during downtime to accumulate lag
for i := 0; i < 10; i++ {
require.NoError(t, h.WriteTraceInfo(tempoUtil.NewTraceInfo(time.Now(), ""), ""))
time.Sleep(100 * time.Millisecond)
}

// Restart LiveStore
require.NoError(t, liveStoreA.Start(h.TestScenario.NetworkName(), h.TestScenario.SharedDir()))

// Initially, LiveStore should not be ready (503) while catching up
// Note: This check is timing-sensitive and might pass if catch-up is very fast
req, err := http.NewRequest("GET", "http://"+liveStoreA.Endpoint(3200)+"/ready", nil)
require.NoError(t, err)
httpResp, err := http.DefaultClient.Do(req)
require.NoError(t, err)
// During catch-up, we might see 503
if httpResp.StatusCode == 503 {
t.Log("LiveStore correctly returns 503 during catch-up")
}

// Wait for it to become ready after catching up
require.NoError(t, liveStoreA.WaitReady())

// Verify /ready endpoint returns 200
req, err = http.NewRequest("GET", "http://"+liveStoreA.Endpoint(3200)+"/ready", nil)
require.NoError(t, err)
httpResp, err = http.DefaultClient.Do(req)
require.NoError(t, err)
require.Equal(t, 200, httpResp.StatusCode)

// Verify tempo_live_store_ready metric is 1
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.Equals(1), "tempo_live_store_ready"))

// Verify some traces have been processed
require.NoError(t, liveStoreA.WaitSumMetrics(e2e.GreaterOrEqual(1), "tempo_live_store_traces_created_total"))

// Verify catch_up_duration metric was recorded
metrics, err := liveStoreA.SumMetrics([]string{"tempo_live_store_catch_up_duration_seconds"})
require.NoError(t, err)
require.Greater(t, metrics[0], 0.0, "catch_up_duration should have been recorded")
})
}
16 changes: 16 additions & 0 deletions modules/livestore/config.go
@@ -48,6 +48,16 @@ type Config struct {
// Block configuration
BlockConfig common.BlockConfig `yaml:"block_config"`

// ReadinessTargetLag is the target consumer lag threshold before the live-store
// is considered ready to serve queries. The live-store will wait until lag drops
// below this value. Set to 0 to disable readiness waiting (default, backward compatible).
ReadinessTargetLag time.Duration `yaml:"readiness_target_lag"`

// ReadinessMaxWait is the maximum time to wait for catching up at startup.
// If this timeout is exceeded, the live-store becomes ready anyway.
// Only used if ReadinessTargetLag > 0. Default: 30m.
Contributor: Can this create a read outage if both zones have lag?

Contributor Author (@oleg-kozlyuk-grafana, Jan 19, 2026): It should not. The wait is bounded by ReadinessMaxWait, so if warpstream is very slow, the live store will simply fall back to the old behavior after a while.

EDIT: also, this behavior is only triggered at start, before the live-store is marked ready; any behavior after the startup sequence is unchanged.
ReadinessMaxWait time.Duration `yaml:"readiness_max_wait"`

// testing config
holdAllBackgroundProcesses bool `yaml:"-"` // if this is set to true, the live store will never release its background processes
}
@@ -83,13 +93,19 @@ func (cfg *Config) RegisterFlagsAndApplyDefaults(prefix string, f *flag.FlagSet)

cfg.CommitInterval = 5 * time.Second

// Readiness config - default to disabled (backward compatible)
cfg.ReadinessTargetLag = 0
cfg.ReadinessMaxWait = 30 * time.Minute

// Initialize block config with defaults
cfg.BlockConfig.RegisterFlagsAndApplyDefaults(prefix+".block", f)

// Register flags for existing fields
f.DurationVar(&cfg.CompleteBlockTimeout, prefix+".complete-block-timeout", cfg.CompleteBlockTimeout, "Duration to keep blocks in the live store after they have been flushed.")
f.UintVar(&cfg.QueryBlockConcurrency, prefix+".concurrent-blocks", cfg.QueryBlockConcurrency, "Number of concurrent blocks to query for metrics.")
f.Float64Var(&cfg.Metrics.TimeOverlapCutoff, prefix+".metrics.time-overlap-cutoff", cfg.Metrics.TimeOverlapCutoff, "Time overlap cutoff ratio for metrics queries (0.0-1.0).")
f.DurationVar(&cfg.ReadinessTargetLag, prefix+".readiness-target-lag", cfg.ReadinessTargetLag, "Target lag threshold before live-store is ready. 0 disables waiting (backward compatible).")
f.DurationVar(&cfg.ReadinessMaxWait, prefix+".readiness-max-wait", cfg.ReadinessMaxWait, "Maximum time to wait for catching up at startup. Only used if readiness-target-lag > 0.")

cfg.WAL.RegisterFlags(f) // WAL config has no flags, only defaults
cfg.WAL.Version = encoding.DefaultEncoding().Version()