Missing records using chunksize - Pandas and Google Analytics API integration
I am working on automating some reports with pandas and the Google Analytics API. When I query several dimensions at once, the result set is well above the default 10k max_results limit.
To get around this, I pass a large max_results value and specify a chunksize. My intention is then to iterate over the resulting generator to build one large DataFrame that I can run all my operations against.
from pandas.io import ga
import pandas as pd

max_results = 1000000
chunks = ga.read_ga(metrics=["visits"],
                    dimensions=["date", "browser", "browserVersion",
                                "operatingSystem", "operatingSystemVersion",
                                "isMobile", "mobileDeviceInfo"],
                    start_date="2012-12-01",
                    end_date="2012-12-31",
                    max_results=max_results,
                    chunksize=5000)
stats = pd.concat([chunk for chunk in chunks])
stats.groupby(level="date").sum()
However, some records are clearly not being pulled: the daily visit totals do not match what Google Analytics reports.
I do not face this problem when selecting only a few dimensions. For example...
test = ga.read_ga(metrics=["visits"], dimensions=["date"],
                  start_date="2012-12-01", end_date="2012-12-31")
test.groupby(level="date").sum()
... produces the same numbers as Google Analytics.
Thanks in advance for your help.
The 10,000-row total is a limit imposed by the Google Analytics API, not by pandas ( https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults ).
The pandas code should use start_index to issue multiple paginated requests and work within that limit, but it is not doing so correctly here. I've flagged this as a bug in pandas: https://github.com/pydata/pandas/issues/2805 I'll look at it when I get a chance. If you could show the expected data versus what you got through pandas, that would be helpful.
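To illustrate the pagination the API expects: each request returns at most a page of max_results rows, and the caller advances start_index by the page size until a short page signals the end. This is a minimal sketch, not pandas's internal implementation; run_query is a hypothetical stand-in for the real API call.

```python
def fetch_all(run_query, page_size=10000):
    """Collect every row by paging through the API.

    run_query(start_index, max_results) must return a list of rows;
    the Google Analytics API's start_index is 1-based.
    """
    start_index = 1
    rows = []
    while True:
        page = run_query(start_index=start_index, max_results=page_size)
        rows.extend(page)
        if len(page) < page_size:  # short page: no more data
            break
        start_index += page_size
    return rows
```

With a correct loop like this, the 10,000-row cap only bounds the size of each request, not the total number of rows retrieved.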
As a workaround, I suggest iterating over each day in the date range and making one request per day.
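That per-day workaround can be sketched as follows. fetch_day is a placeholder for the real ga.read_ga call restricted to a single date; the helper name concat_daily is my own, not part of pandas.

```python
import pandas as pd

def concat_daily(fetch_day, start_date, end_date):
    """Call fetch_day(day) once per day in [start_date, end_date]
    and stack the resulting DataFrames into one."""
    days = pd.date_range(start_date, end_date, freq="D")
    frames = [fetch_day(day.strftime("%Y-%m-%d")) for day in days]
    return pd.concat(frames)

# Against the real API this would look something like:
# stats = concat_daily(
#     lambda day: ga.read_ga(metrics=["visits"],
#                            dimensions=["date", "browser"],
#                            start_date=day, end_date=day),
#     "2012-12-01", "2012-12-31")
```

Since each single-day query stays well under the 10,000-row cap in this case, no rows are silently dropped.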