BigQuery DML jobs fail on streaming tables – missing native mechanism to wait for streaming buffer flush #59408

@radhwene

Description

Apache Airflow version

2.11.0

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

When using BigQuery operators in Apache Airflow (for example BigQueryInsertJobOperator) to run DML statements (UPDATE, DELETE, MERGE), tasks fail if the target table still contains rows in the BigQuery streaming buffer.

This behavior is expected and documented on the BigQuery side, but Airflow currently retries the task without any awareness of the streaming buffer state, which leads to repeated failures and fragile pipelines.

This issue is especially common when operating on BigQuery tables that are continuously populated via streaming (CDC pipelines, Dataflow jobs, the BigQuery Storage Write API, etc.).
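A minimal DAG that runs into this failure might look like the sketch below. The project, dataset, and table names are placeholders; the assumption is only that streamed_table still has rows in its streaming buffer when the UPDATE runs.

```python
# Hypothetical reproduction: a DML statement against a table that is
# still being streamed into. Names are placeholders, not real resources.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="bq_dml_on_streaming_table",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    # Fails with "UPDATE or DELETE statement would affect rows in the
    # streaming buffer" while the buffer is non-empty.
    update_rows = BigQueryInsertJobOperator(
        task_id="update_rows",
        configuration={
            "query": {
                "query": (
                    "UPDATE `my-project.my_dataset.streamed_table` "
                    "SET status = 'done' WHERE status = 'pending'"
                ),
                "useLegacySql": False,
            }
        },
    )
```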

What you think should happen instead?

A Sensor-based approach — a sensor that waits until the target table's streaming buffer has been flushed before downstream DML tasks run — aligns with Airflow’s explicit and composable design philosophy and avoids changing the behavior of existing operators.

It would provide a clear, reusable mechanism for users to handle this documented BigQuery limitation in a controlled and non-blocking way.
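The check such a sensor would poll can be sketched in plain Python. The helper name and signature below are hypothetical; the only assumption taken from BigQuery's documented behavior is that the tables.get REST response includes a streamingBuffer field while rows are pending in the buffer.

```python
import time


def wait_for_buffer_flush(get_table, poll_interval=60.0, timeout=3600.0):
    """Poll until the table no longer reports a streaming buffer.

    get_table: a callable returning the table resource as a dict in the
    BigQuery tables.get REST representation. The API includes the
    "streamingBuffer" key only while rows are pending in the buffer.
    Returns True once the buffer is flushed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if "streamingBuffer" not in get_table():
            return True
        time.sleep(poll_interval)
    return False
```

In a real sensor this loop would not run inline; the poke/deferral machinery of Airflow would drive it, with each poke performing the single "streamingBuffer" membership check.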

How to reproduce

  1. Create a BigQuery table that is continuously populated via streaming
    (for example using the BigQuery Storage Write API, Dataflow, or tabledata.insertAll).

  2. Ensure that the table contains rows in the streaming buffer
    (this can be verified via the BigQuery Tables API: the streamingBuffer field is present).

  3. Create an Airflow DAG using the Google provider with a BigQuery operator, for example:

    • BigQueryInsertJobOperator
    • executing a DML statement such as UPDATE, DELETE, or MERGE
    • targeting the streaming table
  4. Trigger the DAG while the streaming buffer is still present.

  5. Observe that the BigQuery job fails with an error similar to:
    UPDATE or DELETE statement would affect rows in the streaming buffer, which is not supported

  6. Observe that Airflow retries the task without checking the streaming buffer state, causing repeated failures
    until the buffer is eventually flushed.
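Step 2 above — verifying the buffer via the Tables API — amounts to inspecting the tables.get response. A minimal helper (the function name is mine, not an existing API) that also reads the documented estimatedRows field:

```python
def streaming_buffer_rows(table_resource: dict) -> int:
    """Rows still pending in the streaming buffer (0 once flushed).

    table_resource is the dict returned by the BigQuery tables.get REST
    method. "streamingBuffer" is only present while the buffer is
    non-empty, and its "estimatedRows" value is serialized as a string.
    """
    buf = table_resource.get("streamingBuffer")
    if not buf:
        return 0
    return int(buf.get("estimatedRows", "0"))
```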


Expected result

Airflow should provide a mechanism to optionally detect the presence of the BigQuery streaming buffer
and wait (or defer/reschedule) until it is flushed before executing the DML job.
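As one possible shape for this mechanism, a hypothetical BigQueryStreamingBufferSensor (this class and its parameters do not exist in the Google provider today; they are shown only to illustrate the proposal) could be wired upstream of the DML task in reschedule mode, so the worker slot is released between pokes:

```python
# Hypothetical sensor usage; only mode/poke_interval/timeout are real
# BaseSensorOperator parameters — the class itself is the proposal.
wait_for_flush = BigQueryStreamingBufferSensor(
    task_id="wait_for_flush",
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="streamed_table",
    mode="reschedule",   # free the worker slot between pokes
    poke_interval=120,
    timeout=60 * 60,
)

wait_for_flush >> update_rows  # update_rows: the existing DML task
```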

Operating System

Linux

Versions of Apache Airflow Providers

Apache Airflow 2.11.x
Google Cloud Composer 2

Deployment

Google Cloud Composer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
