De-duplication in BigQuery with an asynchronous streaming ETL pipeline

Our Data Warehouse team is evaluating BigQuery as a column-store data warehouse and has some questions about its capabilities and best use. Our existing ETL pipeline consumes events asynchronously through a queue and stores them idempotently in our current database technology. This idempotent architecture lets us replay several hours or days of events to fix errors and data failures without any risk of duplication.

When testing BigQuery, we experimented with the real-time streaming insert API, passing our unique key as the insertId. This gives us de-duplication within a short window, but replaying the data stream at a later time still produces duplicates. We therefore need an elegant way to remove duplicates in or near real time to avoid data inconsistencies.
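For reference, here is a minimal sketch of the streaming-insert approach described above, using the google-cloud-bigquery Python client. The table name, field names, and the event_id key are hypothetical placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical destination table and event schema.
    TABLE = "my-project.warehouse.events"

    def stream_events(events):
        """Stream events, passing each event's unique key as the insertId.

        BigQuery de-duplicates on insertId only on a best-effort basis and
        only within a short window, so replaying the stream hours or days
        later will still produce duplicates.
        """
        rows = [
            {"event_id": e["event_id"],
             "payload": e["payload"],
             "event_ts": e["event_ts"]}
            for e in events
        ]
        errors = client.insert_rows_json(
            TABLE,
            rows,
            row_ids=[e["event_id"] for e in events],  # sent as insertId
        )
        if errors:
            raise RuntimeError(f"Streaming insert failed: {errors}")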

We have several questions and would be grateful for answers to any of them. Any additional guidance on using BigQuery in an ETL architecture is also welcome.

  • Is there a common implementation for de-duplicating real-time streaming data beyond using insertId?
  • If we attempt a "delsert" (a delete followed by an insert via the BigQuery API), is the delete guaranteed to complete before the insert, or do the two operations run asynchronously?
  • Is it possible to stream into a staging table and then run a scheduled merge into the destination table? This is a common pattern for other column-store ETL technologies, but we haven't seen any documentation suggesting its use with BigQuery.

1 answer


We allow duplicates and write our logic and queries so that every entity is treated as streamed data. For example, a user profile is streamed data: many rows accumulate over time, and when we need the latest state we simply read the most recent row.
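As an illustration of that read pattern, here is a sketch assuming a hypothetical user_profiles table keyed by user_id with an event_ts timestamp; only the newest row per key is returned:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Duplicates stay in the table; the query picks the newest row per user.
    LATEST_PROFILES_SQL = """
    SELECT * EXCEPT (rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
      FROM `my-project.warehouse.user_profiles`
    )
    WHERE rn = 1
    """

    for row in client.query(LATEST_PROFILES_SQL).result():
        print(row)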

Delsert doesn't fit, in my opinion, because you are limited to 96 DML statements per day per table. That means you would need to land rows in a temp table in batches, then later issue a single DML statement that processes the whole batch of rows and updates the current table from the temp table.
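One way that batch pattern might look, under the same hypothetical schema: duplicates accumulate in a staging table, and a single scheduled MERGE statement reconciles each batch into the main table, keeping DML usage to one statement per batch:

    from google.cloud import bigquery

    client = bigquery.Client()

    # One MERGE per batch stays far below the per-table DML quota.
    MERGE_SQL = """
    MERGE `my-project.warehouse.user_profiles` AS main
    USING (
      -- Collapse staging duplicates to the newest row per user first.
      SELECT * EXCEPT (rn)
      FROM (
        SELECT
          *,
          ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
        FROM `my-project.warehouse.user_profiles_staging`
      )
      WHERE rn = 1
    ) AS staged
    ON main.user_id = staged.user_id
    WHEN MATCHED AND staged.event_ts > main.event_ts THEN
      UPDATE SET payload = staged.payload, event_ts = staged.event_ts
    WHEN NOT MATCHED THEN
      INSERT (user_id, payload, event_ts)
      VALUES (staged.user_id, staged.payload, staged.event_ts)
    """

    client.query(MERGE_SQL).result()  # run on a schedule, e.g. hourly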



If you're considering delsert, it may be simpler to stick with a read-only query that selects the latest row, as sketched above.

Streaming followed by a scheduled merge is possible. You can also overwrite data in a table, for example to remove duplicates, or run a scheduled batch query over the temp table's contents and write the result to the main table. It amounts to the same thing: let duplicates accumulate, then process them with a query. When the query writes back to the same table, this is also called re-materialization.
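A sketch of that re-materialization step under the same hypothetical schema: a scheduled query reads the table, drops duplicates, and overwrites the table with its own de-duplicated contents. Note this rewrites the whole table, so it is a batch clean-up rather than a real-time one:

    from google.cloud import bigquery

    client = bigquery.Client()

    TABLE = "my-project.warehouse.user_profiles"

    # Overwrite the table with its own de-duplicated contents.
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(TABLE),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    DEDUP_SQL = f"""
    SELECT * EXCEPT (rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
      FROM `{TABLE}`
    )
    WHERE rn = 1
    """

    client.query(DEDUP_SQL, job_config=job_config).result()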
