Cloud Dataflow: creating tables in BigQuery

I have a pipeline that reads streaming data from Cloud Pub/Sub. The data is processed by Dataflow and then stored in one big BigQuery table; each Pub/Sub message carries an account_id. Is there a way to create new tables on the fly when a new account_id is identified, and then populate them with the data for that account?

I know this can be done by updating the pipeline for every new account, but in an ideal world Cloud Dataflow would generate these tables programmatically in code.



2 answers


I wanted to share a few options I see.

Option 1 - wait for the "Partitioning on a non-date field" feature.

It is not known when this will be implemented and made available to us, so it might not be what you want right now. But when it goes live, it will be the best option for such scenarios.



Option 2 - hash the account_id into a predefined number of buckets. In this case you can pre-create all of these tables, and your code contains logic that picks the corresponding destination table based on the account hash. The same hashing logic should then be used in the queries that read this data.
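A minimal sketch of this bucketing idea with the Beam Java SDK, assuming hypothetical names (project my_project, dataset my_dataset, 16 pre-created tables events_bucket_00 through events_bucket_15, and an account_id field in each row); none of these come from the question, they only illustrate the routing logic.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Route each row to one of N pre-created bucket tables based on a hash of account_id.
public class BucketedWrite {

  // Must match the number of pre-created tables.
  private static final int NUM_BUCKETS = 16;

  // Stable bucket for an account_id; the same formula has to be used in the queries
  // that later read this data.
  static int bucketFor(String accountId) {
    return Math.floorMod(accountId.hashCode(), NUM_BUCKETS);
  }

  static void writeBucketed(PCollection<TableRow> rows) {
    rows.apply(
        BigQueryIO.writeTableRows()
            // Pick the destination table per element from the account hash.
            .to(
                (ValueInSingleWindow<TableRow> value) -> {
                  String accountId = (String) value.getValue().get("account_id");
                  String tableSpec =
                      String.format(
                          "my_project:my_dataset.events_bucket_%02d", bucketFor(accountId));
                  return new TableDestination(tableSpec, "events bucketed by account hash");
                })
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            // The bucket tables already exist, so never create them here.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
  }
}
```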



The API for creating BigQuery tables is at https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert.
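As a rough sketch of what that call looks like through the google-cloud-bigquery Java client (which wraps tables.insert under the hood); the dataset name, per-account table naming, and schema below are made up for illustration.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Creates a per-account table the first time a new account_id is seen.
public class AccountTableCreator {

  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

  Table createAccountTable(String accountId) {
    // Hypothetical dataset, naming scheme, and schema; adjust to the real event payload.
    TableId tableId = TableId.of("my_dataset", "events_" + accountId);
    Schema schema =
        Schema.of(
            Field.of("account_id", LegacySQLTypeName.STRING),
            Field.of("payload", LegacySQLTypeName.STRING),
            Field.of("event_time", LegacySQLTypeName.TIMESTAMP));
    // Issues the tables.insert request linked above.
    return bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
  }
}
```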



However, it would be easier to keep all accounts in one static table that includes account_id as a column.
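A minimal sketch of that single-table layout with the Beam Java SDK; the table name and schema are assumptions. The point is that account_id travels along as an ordinary column, so per-account data becomes a WHERE account_id = ... filter instead of a separate table.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// Write every event into one static table; account_id is just another column.
public class SingleTableWrite {

  static void writeAll(PCollection<TableRow> rows) {
    TableSchema schema =
        new TableSchema()
            .setFields(
                Arrays.asList(
                    new TableFieldSchema().setName("account_id").setType("STRING"),
                    new TableFieldSchema().setName("payload").setType("STRING"),
                    new TableFieldSchema().setName("event_time").setType("TIMESTAMP")));

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my_project:my_dataset.events") // one fixed destination table
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```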







