Missing records using chunksize - Pandas and Google Analytics API integration
I am working on automating some reports with pandas and the Google Analytics API. When I query several dimensions at once, the result set is well above the default 10k max_results limit.
To get around this, I pass a large max_results value and specify a chunksize. My intention is then to iterate over the resulting generator to build one large DataFrame that I can run all my operations against.
from pandas.io import ga
import pandas as pd

max_results = 1000000
chunks = ga.read_ga(metrics=["visits"],
                    dimensions=["date", "browser", "browserVersion",
                                "operatingSystem", "operatingSystemVersion",
                                "isMobile", "mobileDeviceInfo"],
                    start_date="2012-12-01",
                    end_date="2012-12-31",
                    max_results=max_results,
                    chunksize=5000)
stats = pd.concat([chunk for chunk in chunks])
stats.groupby(level="date").sum()
However, some records are clearly not being pulled: the daily visit totals do not match what Google Analytics reports.
I do not face this problem when selecting only a few dimensions. For example...
test = ga.read_ga(metrics=["visits"], dimensions=["date"],
                  start_date="2012-12-01", end_date="2012-12-31")
test.groupby(level="date").sum()
... produces the same numbers as Google Analytics.
Thanks in advance for your help.
The 10,000-row total is a limit imposed by the Google Analytics API, not by pandas ( https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults ).
The pandas code should use start_index to issue multiple paginated requests and work within that limit, but it is not doing so correctly here. I've flagged this as a bug in pandas: https://github.com/pydata/pandas/issues/2805 I'll look at it when I get a chance. If you could show the expected data versus what you got through pandas, that would be helpful.
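To illustrate the pagination the API expects: each request returns at most a page of max_results rows, and the caller advances start_index by the page size until a short page signals the end. This is a minimal sketch, not pandas's internal implementation; run_query is a hypothetical stand-in for the real API call.

```python
def fetch_all(run_query, page_size=10000):
    """Collect every row by paging through the API.

    run_query(start_index, max_results) must return a list of rows;
    the Google Analytics API's start_index is 1-based.
    """
    start_index = 1
    rows = []
    while True:
        page = run_query(start_index=start_index, max_results=page_size)
        rows.extend(page)
        if len(page) < page_size:  # short page: no more data
            break
        start_index += page_size
    return rows
```

With a correct loop like this, the 10,000-row cap only bounds the size of each request, not the total number of rows retrieved.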
As a workaround, I suggest iterating over each day in the date range and making one request per day.
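That per-day workaround can be sketched as follows. fetch_day is a placeholder for the real ga.read_ga call restricted to a single date; the helper name concat_daily is my own, not part of pandas.

```python
import pandas as pd

def concat_daily(fetch_day, start_date, end_date):
    """Call fetch_day(day) once per day in [start_date, end_date]
    and stack the resulting DataFrames into one."""
    days = pd.date_range(start_date, end_date, freq="D")
    frames = [fetch_day(day.strftime("%Y-%m-%d")) for day in days]
    return pd.concat(frames)

# Against the real API this would look something like:
# stats = concat_daily(
#     lambda day: ga.read_ga(metrics=["visits"],
#                            dimensions=["date", "browser"],
#                            start_date=day, end_date=day),
#     "2012-12-01", "2012-12-31")
```

Since each single-day query stays well under the 10,000-row cap in this case, no rows are silently dropped.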