BigQuery - 6-Year Migration, Table / Query Design

I am putting together a Spark job to migrate 6 years of ecommerce orders / events into our new BigQuery warehouse, to complement the new streaming capability. The migrated data is in the same model as the new data, and views will expose the parts we need.

We have 4 datasets:

  • raw (source data - hierarchical JSON)

  • views (e.g. DIMs on top of the raw data - usually flat, but not always)

  • pre_aggs (e.g. rolling up the currency totals, with some key rollup dimensions included - flat)

  • reports (as you'd expect)
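
To give a feel for the raw-to-views relationship above, here is a purely hypothetical sketch of the kind of flattening view that could sit in the views dataset - every dataset, table, and column name here is invented for illustration only:

#standardSQL
-- Hypothetical: flatten line items out of the raw, nested order documents
SELECT
  o.orderId,
  o.orderPlacedTimestamp,
  item.sku,
  item.quantity
FROM `project.raw.orders` AS o,
  UNNEST(o.lineItems) AS item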

Unlike the new live data, which is streamed and unbounded, this historical migration is batch and bounded (I don't have to worry about late-arriving events / watermarks and duplicates, for example). I can also split the data manually by event time (orderPlacedTimestamp) and have it land in the correct date-partitioned table (or suffix?). The total data is about 2 GB compressed and around 2 million rows in BQ, so not massive, but the structures are fairly complex and nested. I have the option of writing both the raw data and the aggregates as materialised tables from Spark, so I really want to follow best practice and optimise for performance (query speed matters more than cost here and is worth paying a little extra for).

I found this really good blog post on a SQL / query lambda architecture, which is some inspiration, and I'll try to do something similar.

https://www.linkedin.com/pulse/building-robust-real-time-etl-google-bigquery-lambda-novozhilov

I am still unsure how best to store / partition this data and then build the time-based queries to match. Week-on-week and month-on-month comparisons are likely to be the most common.

My options:

  • Everything in one table - seems nice and simple, with no ongoing table management, but it means a full scan for every query, when often I only want to go back a year or at most two from a given point in time?

  • One table per time period, e.g. yearly or monthly:

order_history_2017 or order_history_201701

We have a calendar dimension lookup that holds keys on each row which could be used for the suffix above, e.g. 201701 for January 2017.

Monthly granularity would mean 72 tables, which seems quite a lot - maybe yearly is a better fit?

For argument's sake, say we go with monthly tables: what is the best way in BQ (standard SQL) to query only the tables with the correct suffix that cover a continuous time range, constructing the table name on the fly with the right suffix (dynamically, perhaps)?

For example, say I want to query all orders (each order has an orderPlacedTimestamp) between 2017-01-10 and 2017-02-10 - this would mean scanning (and unioning?) only the order_history_201701 and order_history_201702 tables in this case, doing the BETWEEN as below:

SELECT *
FROM (
  SELECT * FROM order_history_201701
  UNION ALL
  SELECT * FROM order_history_201702
)
WHERE DATE(orderPlacedTimestamp) BETWEEN DATE("2017-01-10") AND DATE("2017-02-10")
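
If you do go with monthly shards, one way to avoid hand-writing the UNION ALL is a wildcard table with a filter on _TABLE_SUFFIX. A minimal sketch, assuming the shards live in a dataset called project.dataset and are named order_history_YYYYMM (dataset and field names are illustrative):

#standardSQL
SELECT *
FROM `project.dataset.order_history_*`
WHERE _TABLE_SUFFIX BETWEEN '201701' AND '201702'
  AND DATE(orderPlacedTimestamp) BETWEEN DATE '2017-01-10' AND DATE '2017-02-10'

Because the _TABLE_SUFFIX filter is a constant expression, BigQuery only scans the matching shards, so the bytes processed stay close to the explicit UNION ALL version.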


Then I could have scenarios where this historical data needs to be unioned with the live (streamed) data too - wrapped up in a view, similar to the lambda design in the article.
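
As a rough illustration of that lambda-style wrapper, here is a minimal sketch of a view unioning a historical table with a hypothetical streaming table (orders_all, order_history, orders_live and the dataset name are all made up, and the two sides would need compatible schemas for the UNION ALL to work):

#standardSQL
CREATE OR REPLACE VIEW `project.dataset.orders_all` AS
SELECT * FROM `project.dataset.order_history`  -- batch / historical data
UNION ALL
SELECT * FROM `project.dataset.orders_live`    -- live streamed data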

These are the options I was thinking about.

So many options with BQ! :)

Anyway, that's my current thinking - any words of wisdom on table design and query construction / optimisation would be hugely appreciated.

Thanks a bunch everyone!



1 answer


My recommendation is to seriously consider BigQuery's Partitioned Tables functionality.

While streaming to partitioned tables has some limitations (you can only stream into partitions within the last 30 days in the past and 5 days in the future relative to the current date), there are no such limitations for load jobs or queries.
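
For the date-range use case in the question, that would mean one date-partitioned table and a filter on the _PARTITIONTIME pseudo column, so only the matching partitions get scanned. A minimal sketch, assuming a hypothetical partitioned table order_history whose rows were loaded into partitions keyed by order date:

#standardSQL
SELECT *
FROM `project.dataset.order_history`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-10') AND TIMESTAMP('2017-02-10')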

A while back I considered abusing this feature to effectively partition by an arbitrary column rather than by date, by mapping the attribute values onto dates between "0001-01-01" and "9999-12-31" (3,652,058 days - that's how many distinct attribute values could be partitioned by). The POC was conceptually successful, but I still didn't like it, especially given the promise (at least that's how it felt to me at the time) from the Google team to provide partitioning by a date or int column of the table. I decided to wait.

In the meantime, to refresh and double-check that loading into or querying a partitioned table is still not restricted to any particular partition range (as it somehow sounded in one of the SO posts), I quickly checked, and you can see the result below.

Step 1. Create a partitioned table - project.dataset.partitioned_table

I just used the UI to do this.
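
For completeness, the same thing can be scripted rather than clicked - a sketch using Standard SQL DDL, assuming the single INT64 column a that the steps below use:

#standardSQL
-- Ingestion-time partitioned table, equivalent to the UI setup above
CREATE TABLE `project.dataset.partitioned_table` (a INT64)
PARTITION BY _PARTITIONDATE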

Step 2. Insert query results into different partitions of the table.

#standardSQL
SELECT 13 AS a

with project.dataset.partitioned_table$YYYYMMDD as the destination table (you could also use a DML INSERT for this).
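
If you take the DML route, an ingestion-time partitioned table lets you target a specific partition by writing to the _PARTITIONTIME pseudo column - a sketch of the same step, using one of the date/value pairs from the result below:

#standardSQL
-- Writes the row into the 9999-12-31 partition (the demo's a = 13)
INSERT INTO `project.dataset.partitioned_table` (_PARTITIONTIME, a)
VALUES (TIMESTAMP('9999-12-31'), 13)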

I ran this multiple times for a range of dates from 0001-01-01 all the way to 9999-12-31.

Step 3. Check the result

#standardSQL
SELECT DATE(_partitiontime) AS partition_, a
FROM `project.dataset.partitioned_table`
ORDER BY a

The result was (keep in mind - the partition_ format here is YYYY-MM-DD):

partition_      a
----------      --
2017-07-16      1
2017-07-16      2
2017-07-16      3
2017-07-16      4
2017-07-15      5
2017-07-14      6
2010-01-01      7
2001-01-01      8
1001-01-01      9
0001-01-01      10
4001-01-01      11
7001-01-01      12
9999-12-31      13


