API design on top of BigQuery

I have an AppEngine application that tracks different types of impression data across multiple sites. We are currently collecting about 40 million records per month, and the main BigQuery table is approaching 15 GB after 6 weeks of data collection; our estimates show that within another 6 weeks we will be collecting over 100 million records per month. A relatively small dataset in big data terms, but one that can grow fairly quickly.

Now that collection is working, we need to build an API on top of BigQuery that lets us analyze the data and deliver the results to the dashboards we provide.

My concern is that most client queries only cover a few days of data, yet BigQuery queries are effectively full table scans, so the API may become slower to respond over time as the table grows and BigQuery has to process more data to return the same results.

So my question is: should we split the BigQuery data into smaller tables, say one per month or per week, to reduce the amount of data each query has to process, or would it be "smarter" to preprocess the data and store the results in the NDB datastore? The latter would give us an incredibly fast API, but requires us to preprocess everything, including data that some clients may never need.
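For concreteness, here is a minimal sketch of what the preprocessing option could look like: one NDB entity per site per day, so the dashboard reads a few small Datastore entities instead of scanning BigQuery. The model and property names are made up for the example.

```python
from google.appengine.ext import ndb


class DailyImpressionSummary(ndb.Model):
    """Hypothetical pre-aggregated row: one entity per site per day."""
    site_id = ndb.StringProperty(required=True)
    date = ndb.DateProperty(required=True)
    impressions = ndb.IntegerProperty(default=0)
    clicks = ndb.IntegerProperty(default=0)

    @classmethod
    def for_range(cls, site_id, start, end):
        # Dashboard reads become cheap Datastore range queries
        # instead of BigQuery table scans.
        return cls.query(
            cls.site_id == site_id,
            cls.date >= start,
            cls.date <= end,
        ).order(cls.date).fetch()
```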

Or am I perhaps optimizing prematurely?

1 answer


Based on my experience analyzing the performance of similar projects in BigQuery: if you only care about performance, you don't need to change anything. The BigQuery optimizer can figure out a lot on its own, and if the query uses a WHERE clause covering just a few days, performance will be good. From a billing standpoint, however, you will pay more and more as your data grows, because every query scans the whole table. To save money, shard the data into per-month or even per-week tables. With table wildcard functions such as TABLE_DATE_RANGE you can still query across all the data when you need to, so you won't lose any functionality.
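As an illustration, here is roughly what a sharded-table query could look like from App Engine, assuming legacy SQL, day-sharded tables named impressions_YYYYMMDD, and a site_id column (all hypothetical); credentials and error handling are simplified.

```python
import httplib2
from googleapiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

# Legacy SQL: TABLE_DATE_RANGE expands to only the sharded tables that
# cover the requested window, so only a few days of data are scanned.
# Dataset, table prefix, and column names are hypothetical.
QUERY = """
SELECT site_id, COUNT(*) AS impressions
FROM TABLE_DATE_RANGE(
    [mydataset.impressions_],
    TIMESTAMP('2014-06-01'),
    TIMESTAMP('2014-06-07'))
GROUP BY site_id
"""

# Authenticate as the App Engine service account and call the
# BigQuery v2 REST API via google-api-python-client.
credentials = AppAssertionCredentials(
    scope='https://www.googleapis.com/auth/bigquery')
service = build('bigquery', 'v2',
                http=credentials.authorize(httplib2.Http()))

response = service.jobs().query(
    projectId='my-project',
    body={'query': QUERY}).execute()

for row in response.get('rows', []):
    site, count = row['f'][0]['v'], row['f'][1]['v']
```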


