
async option will be dropped. #1522


Description

@st0012
Collaborator

The async option is unique to the Ruby SDK. It was designed to help send events asynchronously through different backends (e.g. a Ruby thread, a Sidekiq worker, etc.). Depending on the backend, it can pose a threat to the system due to its extra memory consumption. So it's an option with some trade-offs.

But since version 4.1, the SDK has its own managed background worker (implemented with the well-known concurrent-ruby library). It can handle most of the use cases the async option was designed for.

The async Option Approach

  1. The SDK serializes the event and event hint into JSON-compatible Ruby hashes.
  2. It passes the event payload and hint to the block.
  3. In general, the block enqueues a background job with the above data (see the sketch after this list).
    • Some earlier apps spawn a new Ruby thread to send the data. This is not recommended.
    • With background job libraries like Sidekiq or Resque, this means adding objects to Redis.
    • With delayed_job, this means adding a new delayed_job record.
  4. A background worker (e.g. a Sidekiq worker) then picks up the event and hint and sends them.
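
For context, a typical async setup with ActiveJob looked roughly like this (SentryJob is a user-defined job here, not something the SDK ships; the names are illustrative):

# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"
  # hand the serialized event and hint off to a background job
  config.async = lambda { |event, hint| SentryJob.perform_later(event, hint) }
end

# app/jobs/sentry_job.rb
class SentryJob < ActiveJob::Base
  queue_as :default

  def perform(event, hint)
    # step 4 above: a worker picks up the payload and sends it
    Sentry.send_event(event, hint)
  end
end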

Pros

Users can customize their event sending logic, but generally it's just a worker that calls Sentry.send_event(event, hint).

Cons

  • The event payload (usually dozens of KBs) is copied twice: first into the intermediate storage, and then again when the background worker process loads it.
  • When there is an event spike, it can flood the intermediate storage (e.g. Redis) and take down the entire system.

The Background Worker

  1. The SDK passes the event and its hint to the background worker (a pool of threads managed by concurrent-ruby).
  2. A worker then picks up the event, serializes it, and sends it (a rough sketch of this flow follows the list).
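
This isn't the SDK's actual code, just a minimal concurrent-ruby sketch of the mechanism described above (pool size, 30-slot queue, and discard behavior mirror the defaults discussed in this issue):

require "concurrent"

# A small thread pool with a bounded queue that discards work when full.
executor = Concurrent::ThreadPoolExecutor.new(
  min_threads: 0,
  max_threads: Concurrent.processor_count, # default pool size follows CPU cores
  max_queue: 30,                           # bounded queue, so a spike can't eat all memory
  fallback_policy: :discard                # overflowing events are dropped, not queued
)

# Capturing an event just posts a block; serialization and the HTTP request
# happen on a worker thread instead of the request thread.
executor.post do
  # serialize the event and send it to Sentry here
end

# Pending jobs die with the process unless the pool is drained before exit.
executor.shutdown
executor.wait_for_termination(2)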

Pros

  • It doesn't allocate extra memory other than the original event payload.
  • It's faster.
  • It doesn't require any user code.
  • The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.

Cons

  • Unsent events die with the process. Generally speaking, the queue time in the background worker is very short, and the chance of missing events for this reason is small in web apps. But in scripts, the process often exits before the worker is able to send the event. This is why hint: { background: false } is required in rake integrations.
    • However, I don't think this problem can be solved with the async option either.

This drawback has been addressed in #1617.
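
For reference, the synchronous escape hatch in a short-lived script looks roughly like this (run_my_task is a placeholder; the hint is the same one the rake integration passes):

begin
  run_my_task
rescue => e
  # bypass the background worker so the event is sent before the process exits
  Sentry.capture_exception(e, hint: { background: false })
  raise
end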

Missing Events During A Spike Because of Queue Limit

I know many users are concerned that the background worker's 30-event queue limit will make them lose events during a spike. But as the maintainer and a user of this SDK, I don't worry about it, because:

  1. The spike is likely to be an urgent case, and that'll probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background worker threads is determined by the number of processor cores on your machine.
    • They're a lot faster than the async approach with a Sidekiq/Resque/etc. worker, for the reasons described above.
    • The 30-event queue is only shared within a process/web instance (depending on your concurrency model), not at a global level.
      If a spike is big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too, and/or do greater damage to your system.
  3. Sentry has a rate-limiting mechanism to prevent overload on the platform side. It works by rejecting new events and telling the SDK, via a 429 response, not to send new events. When the SDK receives a 429 response from Sentry during a spike, it'll stop sending all events for a given period of time (a sketch of this pattern follows the list).
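
Not the SDK's actual transport, just a generic sketch of that client-side pattern: remember the Retry-After window after a 429 and drop events locally until it has passed.

require "net/http"

# Hypothetical sender: drops events while a rate-limit window is active.
class RateLimitedSender
  def initialize
    @resume_at = nil
  end

  def send_event(uri, json_body)
    return :dropped if @resume_at && Time.now < @resume_at

    response = Net::HTTP.post(uri, json_body, "Content-Type" => "application/json")

    if response.code == "429"
      # honor the server's backoff hint; fall back to 60 seconds if absent
      delay = (response["Retry-After"] || "60").to_i
      @resume_at = Time.now + delay
      :rate_limited
    else
      :sent
    end
  end
end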

What I'm trying to say is: it's not possible to expect Sentry to accept all events during a big spike, regardless of which approach you use. But when a spike happens, async is more likely to become another bottleneck and/or cause other problems in your system.

My Opinion

The async option seems redundant now and it could sometimes cause more harm. So I think we should drop it in version 5.0.

Questions

The above analysis is only based on my personal usage of the SDK and a few cases I helped debug. So if you're willing to share your experience, I'd like to know more.

Even though the decision has been made, we still would like to hear feedback about it:

  • Do you use the async option in your apps?
    • If you do, what's the motivation? Will you still use it after reading the above description?
    • If you don't, is it an intentional decision? If it is, what's the reason behind it?
  • Do you disable the background worker with the background_worker_threads config option (see the sketch after this list)?
    • If you do, why?
  • Or any feedback related to this topic.
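
For reference, this is the option that question refers to; in the 4.x versions discussed here, setting it to 0 skips the worker pool and sends events on the calling thread:

Sentry.init do |config|
  config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"

  # 0 disables the worker pool and sends events synchronously on the
  # calling thread; any positive number sets the pool size explicitly.
  config.background_worker_threads = 0
end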

Activity

pinned this issue on Aug 1, 2021
added this to the 5.0.0 milestone on Aug 1, 2021
josh-m-sharpe

josh-m-sharpe commented on Aug 3, 2021

@josh-m-sharpe

Long-time user of sentry-ruby and sentry-raven. I didn't know this feature existed until a current migration of a large Rails app to Sentry. I noticed it while reading through the docs and attempted to set it up, but immediately ran into issues and elected to defer in order to simplify the migration. In addition, we have a large number of complex queues. Injecting this option into our system would likely require a bit of thought so as not to bowl over important queues while keeping error reporting timely.

I suppose this is a vote to drop it?

louim

louim commented on Aug 20, 2021

@louim
Contributor

Hey! We currently use the async option, mostly because it was recommended in the docs when we set up the app a long time ago (not sure the background worker option was even a thing at that time 👴🏼).

I'm curious about that part:

The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.

What would happen if there is a spike in events, let's say from a noisy error filling the queue. Would another error happening at the same time be silently dropped because the queue is full? Or am I misunderstanding how it works?

I'd like to switch away from async because the JSON serialization in the async path means we have to check two versions of the payload when doing custom processing in before_send, e.g.:

data = event.to_hash
errors = data.dig(:exception, :values) if data[:exception]
errors = data.dig("exception", "values") if data["exception"]
st0012

st0012 commented on Aug 21, 2021

@st0012
Collaborator, Author

not sure the background worker option was even a thing at that time 👴🏼

The background worker was added in v4.1.0, so it's a new thing for most users, I think 🙂

What would happen if there is a spike in events, let's say from a noisy error filling the queue. Would another error happening at the same time be silently dropped because the queue is full? Or am I misunderstanding how it works?

As of now, the queue doesn't distinguish between events. So if there's a spike of a particular error, other errors may not make it into the queue. But personally I'm not worried about this, because:

  1. The spike is likely to be an urgent case, and that'll probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background worker threads is determined by the number of processor cores on your machine.
    • They're a lot faster than the async approach with a Sidekiq/Resque/etc. worker, for the reasons I described in the issue.
    • The 30-event queue is only shared within a process/web instance (depending on your concurrency model), not at a global level.

If there's a spike big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too, and/or do greater damage to your system.

kzaitsev

kzaitsev commented on Sep 27, 2021

@kzaitsev
Contributor

At my job, we're doing fine with async because we already have Sidekiq in our stack. I don't understand why async can't remain an optional way to deliver events instead of being deprecated. As an alternative, you could highlight the possible issues with async in the documentation.

Maybe I don't understand the problem because we use Sentry only for exceptions, without APM.

st0012

st0012 commented on Sep 27, 2021

@st0012
Collaborator, Author

@Bugagazavr

There are two main costs to keeping this option around:

  1. Among all Sentry SDKs, sentry-ruby is the only one that supports such an option. This means we always need to consider this extra condition when making changes to the event sending logic, and it'll make future SDK alignment harder. (It certainly made the sentry-raven -> sentry-ruby conversion harder.)
  2. We need to spend additional effort maintaining the code for this option, and sometimes the result isn't pretty:

def dispatch_async_event(async_block, event, hint)
  # We have to convert to a JSON-like hash, because background job
  # processors (esp ActiveJob) may not like weird types in the event hash
  event_hash = event.to_json_compatible

  if async_block.arity == 2
    hint = JSON.parse(JSON.generate(hint))
    async_block.call(event_hash, hint)
  else
    async_block.call(event_hash)
  end
rescue => e
  loggable_event_type = event_hash["type"] || "event"
  log_error("Async #{loggable_event_type} sending failed", e, debug: configuration.debug)
  send_event(event, hint)
end

if defined?(ActiveJob)
  module Sentry
    parent_job =
      if defined?(::ApplicationJob) && ::ApplicationJob.ancestors.include?(::ActiveJob::Base)
        ::ApplicationJob
      else
        ::ActiveJob::Base
      end

    class SendEventJob < parent_job
      # the event argument is usually large and creates noise
      self.log_arguments = false if respond_to?(:log_arguments=)

      # this will prevent infinite loop when there's an issue deserializing SentryJob
      if respond_to?(:discard_on)
        discard_on ActiveJob::DeserializationError
      else
        # mimic what discard_on does for Rails 5.0
        rescue_from ActiveJob::DeserializationError do
          logger.error "Discarded #{self.class} due to a #{exception}. The original exception was #{error.cause.inspect}."
        end
      end

      def perform(event, hint = {})
        Sentry.send_event(event, hint)
      end
    end
  end
else
  module Sentry
    class SendEventJob; end
  end
end

If the upside were high, these costs wouldn't be an issue; that's why we've had the option for many years. But since we already have a better solution to the problem (the background worker) with far fewer downsides, I don't think it's worth keeping now.

github-actions

github-actions commented on Oct 28, 2021

@github-actions

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

29 remaining items

st0012

st0012 commented on May 28, 2022

@st0012
Collaborator, Author

@ariccio You can simply delete it 😉

vadviktor

vadviktor commented on Aug 2, 2022

@vadviktor

We at my company have been plagued by the async feature for 1-2 years now; hitting the payload size limit and the rate limit made us scratch our heads about how to safeguard against them (before Sentry remedied them internally). How did we catch these issues? By seeing Sidekiq brought to its knees and important messages left unprocessed.

Right now, we have decided to end the async reign, and I was glad to see this feature being deprecated since our last version update. 🙌

added a commit that references this issue on Aug 8, 2022
added a commit that references this issue on Aug 2, 2023
sl0thentr0py

sl0thentr0py commented on Dec 10, 2024

@sl0thentr0py
Member

async has been deprecated for long enough and is removed in the next 6.0 major branch, so we can close this placeholder issue now.
