What is the best way to run a sampling query against a huge table in Impala?

I have a huge table (over 1 billion rows) in Impala. I need to sample ~100,000 rows from it multiple times. What's the best way to query for random sample rows?

+3




3 answers


As Jeff mentioned, what you are asking for is not possible yet, but we have an internal aggregate function that takes up to 200,000 samples (using reservoir sampling) and returns the samples as a single comma-separated string. There is no way to change the number of samples yet. If the table has fewer than 200,000 rows, they will all be returned. If you are wondering how this works, see the implementation of the aggregate function and the reservoir sampling structure.

There is no way to "explode" or split the results back into individual rows, so I'm not sure how useful this is.
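That said, if you only need individual elements rather than whole rows, Impala's split_part() string function can pull the n-th sample out of the returned string. A sketch, assuming split_part() is available in your Impala version and that the samples are separated by ", " as in the output below:

```sql
-- Extract the third sampled id from the comma-separated result of sample().
-- (functional.alltypestiny is the demo table used in the example that follows;
-- split_part() is 1-indexed.)
SELECT split_part(sample(id), ', ', 3) AS third_sample
FROM functional.alltypestiny;
```

This still only recovers one element per call, so it does not turn the sample back into a proper result set.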

For example, sampling from a table with 8 rows is trivial:



> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id)             |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s

      

(For context: this was added in a recent release to support histogram statistics in the planner, which sadly isn't ready yet.)

+2




Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow progress on it.



0




Knowing that TABLESAMPLE is not available, a workaround is to add an RVAL column (a random 32-bit integer) to each record, and then repeat the sampling by adding "WHERE RVAL > x AND RVAL < y" for suitable values of x and y. Samples drawn from non-overlapping intervals [x1, y1], [x2, y2], ... will be independent. You can also use "WHERE RVAL % 10000 = 1", "= 2", etc. to select disjoint independent subsets.
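A minimal sketch of this approach. The table and column names (big_table, rval) are hypothetical; rand() is Impala's built-in random function returning a double in [0, 1):

```sql
-- One-time step: materialize a copy of the table with a random
-- 32-bit value attached to each row.
CREATE TABLE big_table_rval AS
SELECT t.*, CAST(rand() * 4294967296 AS BIGINT) AS rval
FROM big_table t;

-- Repeatable sample of roughly 100,000 rows out of 1 billion
-- (about 0.01% of the 32-bit range):
SELECT * FROM big_table_rval
WHERE rval >= 0 AND rval < 430000;

-- Independent samples come from non-overlapping rval ranges,
-- or from the modulo trick (change the remainder for a new subset):
SELECT * FROM big_table_rval
WHERE rval % 10000 = 1;
```

Because rval is fixed once at table-creation time, each WHERE clause selects the same rows every run, which is what makes the repeated sampling reproducible.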

0








