Missing records using chunksize - Pandas and Google Analytics API integration

I am working on automating some reports with pandas and the Google Analytics API. When querying multiple dimensions for shared data, the resulting recordset is well above the default 10,000-row max_results limit imposed by pandas.

To get around this, I pass a large number as the max_results parameter and specify a chunksize. My intention is then to iterate over the resulting generator to build one large DataFrame that I can run all of my operations across.

from pandas.io import ga
import pandas as pd

# Request up to one million rows, delivered as a generator of
# 5,000-row DataFrame chunks.
max_results = 1000000
chunks = ga.read_ga(metrics=["visits"],
                    dimensions=["date", "browser", "browserVersion",
                                "operatingSystem", "operatingSystemVersion",
                                "isMobile", "mobileDeviceInfo"],
                    start_date="2012-12-01",
                    end_date="2012-12-31",
                    max_results=max_results,
                    chunksize=5000)

# Stitch the chunks back into one DataFrame and total visits by day.
stats = pd.concat(chunks)
stats.groupby(level="date").sum()


However, some records are clearly not being pulled, because the daily visit totals do not match what Google Analytics reports.

I do not face this problem when selecting only a few dimensions. For example...

# With only one dimension, the totals match Google Analytics.
test = ga.read_ga(metrics=["visits"], dimensions=["date"],
                  start_date="2012-12-01", end_date="2012-12-31")

test.groupby(level="date").sum()


... produces the same numbers as Google Analytics.

Thanks in advance for your help.



1 answer


The 10,000-row total is a limit imposed by the Google Analytics API itself, not by pandas ( https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults ).

The pandas code paginates with start_index, issuing multiple queries to work around this limit. I've flagged this as a bug in pandas: https://github.com/pydata/pandas/issues/2805 and will take a look when I get a chance. If you could show the expected data versus what you got through pandas, that would be helpful.
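For reference, here is a minimal sketch of that pagination scheme done by hand. It assumes ga.read_ga accepts a start_index keyword mirroring the API's start-index parameter (an assumption, since this is normally handled internally); the page size and dimensions are illustrative.

from pandas.io import ga
import pandas as pd

page_size = 5000
frames = []
start_index = 1  # the Core Reporting API's start-index is 1-based
while True:
    # Assumed keyword: start_index may not be exposed in every version.
    page = ga.read_ga(metrics=["visits"],
                      dimensions=["date", "browser"],
                      start_date="2012-12-01",
                      end_date="2012-12-31",
                      start_index=start_index,
                      max_results=page_size)
    frames.append(page)
    if len(page) < page_size:  # a short page means this was the last one
        break
    start_index += page_size

stats = pd.concat(frames)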



As a workaround, I suggest iterating over each day and making one request per day.
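A minimal sketch of that workaround, building the date range with pandas and issuing one single-day query per iteration (dimensions taken from the question; concatenation at the end mirrors the original code):

from pandas.io import ga
import pandas as pd

frames = []
for day in pd.date_range("2012-12-01", "2012-12-31", freq="D"):
    date_str = day.strftime("%Y-%m-%d")
    # One request per day, so each result set should stay under the
    # 10,000-row cap (assuming no single day exceeds it).
    frames.append(ga.read_ga(metrics=["visits"],
                             dimensions=["date", "browser", "browserVersion",
                                         "operatingSystem",
                                         "operatingSystemVersion",
                                         "isMobile", "mobileDeviceInfo"],
                             start_date=date_str,
                             end_date=date_str))

stats = pd.concat(frames)
stats.groupby(level="date").sum()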
