Skip to content

Several fixes and improvements to block flushing, wal replay, file deletion, and logging#593

Merged
mdisibio merged 6 commits intografana:masterfrom
mdisibio:file-already-closed-error
Mar 22, 2021
Merged

Several fixes and improvements to block flushing, wal replay, file deletion, and logging#593
mdisibio merged 6 commits intografana:masterfrom
mdisibio:file-already-closed-error

Conversation

@mdisibio
Copy link
Copy Markdown
Contributor

@mdisibio mdisibio commented Mar 15, 2021

What this PR does:
This PR improves some stability, error handling, and logging for various things:

Main changes:

  • Introduce exponential backoff for flushing blocks to backend (s3/gcs/etc), with a max backoff of 120s. Flushes are attempted forever
  • Introduce exponential backoff for converting wal->complete blocks. We have seen this error due to a corrupted wal file when the disk was full. Backoff is from 30s->60s with 3 attempts, yielding around 120 seconds of retry time, after which the wal file is permanently deleted and data is (probably) lost.
  • Fix regression where the WAL file was deleted as soon as the complete block was created, instead of only after a successful flush to the backend. This was broken unintentionally in the previous round of flush changes.
  • Introduce jitter up to 10s for regular flushes which will help when flushing many tenants. Flushes initiated by /flush or /shutdown do not have jitter.

Minor changes:

  • Fix "file already closed" errors by nil'ing out file handles as soon as they are closed
  • Fix "enqueue on closed queue" panic by forking priority queues from cortex and changing method to return error instead of panic. Additionally to cut down on noisy errors, a guard flag is kept on the state of the flush queues, when shutting down the flusher abandons the tasks (no data is lost because wal still exists)
  • Completed blocks now use the same ID as the source WAL block, which should make troubleshooting easier (i.e. file in /wal/completed/ matches its source file in /wal/)
  • WAL replay doesn't create the ingester instance for that tenant until data is encountered, which lets 0-length files be replayed and deleted like expected (instead of creating a new instance+head block thus perpetuating the problem)
  • Increment ingester_failed_flushes_total metric if error occurs deleting WAL block
  • Various log message changes

Which issue(s) this PR fixes:
Fixes #357 #579 #598

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Copy Markdown
Collaborator

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements. Thank you for catching that bug where we leave wal files hanging around for empty tenants. That has been bugging me for awhile.

Some comments/thoughts but overall looks good.

Comment thread modules/ingester/flush.go Outdated
Comment thread modules/ingester/flush.go Outdated
Comment thread modules/ingester/flush.go Outdated
Comment thread modules/ingester/flush.go
Comment thread modules/ingester/instance.go Outdated
Comment thread modules/ingester/flush.go
Comment thread modules/ingester/flush.go
Copy link
Copy Markdown
Collaborator

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor concern, but you are welcome to merge as is.

@mdisibio mdisibio requested a review from joe-elliott March 19, 2021 20:20
@mdisibio mdisibio merged commit 8e79e79 into grafana:master Mar 22, 2021
@mdisibio mdisibio deleted the file-already-closed-error branch May 27, 2021 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ingester Panic Under Pressure while being Shutdown Tempo leaves 0 length wal files forever Research flush backoff behavior

3 participants