-
Notifications
You must be signed in to change notification settings - Fork 16.1k
Description
Apache Airflow version
2.11.0
If "Other Airflow 2/3 version" selected, which one?
No response
What happened?
When using BigQuery operators in Apache Airflow (for example BigQueryInsertJobOperator) to run DML statements (UPDATE, DELETE, MERGE), tasks fail if the target table still contains rows in the BigQuery Streaming buffer.
This behavior is expected and documented on the BigQuery side, but Airflow currently retries the task without any awareness of the streaming buffer state, which leads to repeated failures and fragile pipelines.
This issue is especially common when operating on BigQuery tables that are continuously populated via streaming (CDC pipelines, Dataflow jobs, BigQuery Storage Write API, etc.)
What you think should happen instead?
The Sensor-based approach aligns with Airflow’s explicit and composable design philosophy
and avoids changing the behavior of existing operators.
It provides a clear and reusable mechanism for users to handle this documented BigQuery
limitation in a controlled and non-blocking way.
How to reproduce
-
Create a BigQuery table that is continuously populated via streaming
(for example using the BigQuery Storage Write API, Dataflow, ortabledata.insertAll). -
Ensure that the table contains rows in the streaming buffer
(this can be verified via the BigQuery Tables API: thestreamingBufferfield is present). -
Create an Airflow DAG using the Google provider with a BigQuery operator, for example:
BigQueryInsertJobOperator- executing a DML statement such as
UPDATE,DELETE, orMERGE - targeting the streaming table
-
Trigger the DAG while the streaming buffer is still present.
-
Observe that the BigQuery job fails with an error similar to:
UPDATE or DELETE statement would affect rows in the streaming buffer, which is not supported -
Observe that Airflow retries the task without checking the streaming buffer state, causing repeated failures
until the buffer is eventually flushed.
Expected result
Airflow should provide a mechanism to optionally detect the presence of the BigQuery streaming buffer
and wait (or defer/reschedule) until it is flushed before executing the DML job.
Operating System
Linuex
Versions of Apache Airflow Providers
Apache Airflow 2.11.x
GCP cloud composer 2
Deployment
Google Cloud Composer
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct