Why do these two queries report the same "GB processed" (and therefore the same cost)?

My test data consists of 27,768,767 rows. My schema includes a "message" column of type STRING; these strings vary in length but are usually several hundred characters long. There is also a "user_id" column of type INTEGER. Here are two queries that return 0 rows (their WHERE clauses match nothing in my data). To my surprise, they both report 4.69 GB processed.

SELECT * FROM logtesting.logs WHERE user_id=1;

Query complete (1.7s elapsed, 4.69 GB processed)


...

SELECT * FROM logtesting.logs WHERE message CONTAINS 'this string never appears';

Query complete (2.1s elapsed, 4.69 GB processed)


Since integers are stored in 8 bytes, I would expect the data processed by the first (user_id) query to be something like 213 MB (28 million rows * 8 bytes per user_id). The second (message) query is harder to estimate because the strings vary in length, but I would expect it to be several times larger than the user_id query.
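The arithmetic behind that estimate can be sketched in Python, using the question's own row count and the 8-bytes-per-INTEGER figure:

```python
# Estimate the bytes one would expect BigQuery to charge for scanning
# only the user_id column: 8 bytes per INTEGER value, times the row count.
ROWS = 27_768_767        # row count from the question
BYTES_PER_INT64 = 8      # BigQuery charges 8 bytes per INTEGER value

user_id_bytes = ROWS * BYTES_PER_INT64
print(f"user_id column: {user_id_bytes:,} bytes "
      f"(~{user_id_bytes / 2**20:.0f} MiB)")
# → user_id column: 222,150,136 bytes (~212 MiB)
```

So a user_id-only scan should be on the order of a couple hundred megabytes, nowhere near the 4.69 GB actually reported.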

Is my understanding of how BigQuery calculates query costs incorrect?



1 answer


No matter what you do, BigQuery scans all the rows of every column your query references (though not necessarily all the columns in the table), so the bytes processed are the same for both queries because your table doesn't change. The WHERE clause only means the data will NOT be RETURNED; it still has to be read and processed.

The only way to skip processing is to not select all of your columns. BigQuery storage is column-based, so if you don't need all of your attributes, don't reference all of them in the query; columns you don't reference aren't scanned at all. THIS is what reduces the cost :)
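A rough sketch of the savings from column pruning, comparing the reported SELECT * scan against a user_id-only scan (this assumes the UI's "4.69 GB" is binary gigabytes; the 8 bytes per INTEGER value matches the question):

```python
# Compare the reported full-table scan with a single-column scan.
ROWS = 27_768_767
SELECT_STAR_BYTES = 4.69 * 2**30  # as reported for SELECT * (assumed GiB)

user_id_only = ROWS * 8           # 8 bytes per INTEGER value
print(f"SELECT user_id: ~{user_id_only / 2**30:.2f} GiB")
print(f"SELECT *      : ~{SELECT_STAR_BYTES / 2**30:.2f} GiB")
print(f"savings       : ~{SELECT_STAR_BYTES / user_id_only:.0f}x")
# → SELECT user_id: ~0.21 GiB
# → SELECT *      : ~4.69 GiB
# → savings       : ~23x
```

In other words, rewriting the first query as `SELECT user_id FROM logtesting.logs WHERE user_id=1` would cut the bytes processed by roughly a factor of twenty, because the wide "message" column would never be read.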



Historically, "select *" was not supported at all, precisely so that people wouldn't discover this cost the hard way.
