Reading batches of data from BigQuery to Datalab

I have a large table in BigQuery (~45M rows, ~13 GB of data). I would like to process this data in my Google Datalab notebook to compute some basic statistics with pandas and later visualize it with matplotlib in a Datalab cell. Loading the entire dataset into a pandas DataFrame doesn't seem worthwhile (at the very least I would run into RAM problems).

Can I read data from BigQuery in batches (say 10K rows) to be used in Datalab?

Thanks in advance!



2 answers


If your goal is to visualize the data, would sampling be better than loading a small batch?

You can sample your data, for example:

import google.datalab.bigquery as bq

# Pull a ~1% random sample of the table into a pandas DataFrame
df = bq.Query(sql='SELECT image_url, label FROM coast.train WHERE rand() < 0.01').execute().result().to_dataframe()
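With ~45M rows, WHERE rand() < 0.01 returns roughly 450K rows; adjust the threshold until the sample fits comfortably in memory.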

      



Or, use a convenience class:

from google.datalab.ml import BigQueryDataSet

# Draw a random sample of 1000 rows from the table
sampled_df = BigQueryDataSet(table='myds.mytable').sample(1000)
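Here sample(1000) pulls 1,000 randomly selected rows into sampled_df; increase the argument if you need a larger sample for your statistics.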

      



Have you tried simply iterating over the table? The Table object is iterable: it uses a paged fetcher to stream data from the BigQuery table, with a default page size of 1024 rows.
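A minimal sketch of that approach, assuming a placeholder table name myds.mytable and a batch size of 10K rows (both are illustrative, not from the question):

import google.datalab.bigquery as bq
import pandas as pd

table = bq.Table('myds.mytable')  # placeholder; use your own dataset.table

batch = []
for row in table:                 # rows are streamed page by page (default page size 1024)
    batch.append(row)
    if len(batch) >= 10000:       # process roughly 10K rows at a time
        df = pd.DataFrame(batch)
        # ... compute your pandas statistics on df here ...
        batch = []

if batch:                         # don't forget the final partial batch
    df = pd.DataFrame(batch)

This keeps only one batch in memory at a time, so the full 45M-row table never has to fit in RAM.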









