Google dataflow to write to bigquery table

I compared the performance of processing data and outputting them to Bigquery tables and files, the difference is significant:

input: 1.5M records from about 600 files transform: build / transform multiple fields in each record, build a key and emit a key, value pairs; eventually, the records for each key go to one target, file or table;

it took 7 minutes to write 13 files and more than 60 minutes to write to 13 bigquery tables;

Try to understand if this is the expected result or did I not do it right? what factors should be considered when writing to bigquery table?

Please help, this might be an indicative stopper for what I am trying to do.


source to share

1 answer

For batch jobs, Dataflow imports data into BigQuery, writes it to GCS, and then runs BigQuery jobs to import that data into BigQuery. If you want to know how long it takes for BigQuery jobs to run, I can take a look at the BigQuery jobs running in your project.

You can try the following commands to get information about your BigQuery import jobs.

  bq ls -j <PROJECT ID>:


The above command should show you a list of tasks and things like duration. (Note the colon at the end of the project ID. I think the colon is required).

Then you can try

bq show -j <JOB ID>


For more information on the assignment.

Please note that you must be the owner of the project to be able to see tasks that other users are doing. This applies to BigQuery jobs performed by Dataflow because Dataflow uses a service account.



All Articles