What is the best way to run a sampling query against a huge table in Impala?

I have a huge table (over 1 billion rows) in Impala. I need to sample ~100,000 rows from it multiple times. What's the best way to query for random sample rows?

+3




3 answers


As Jeff mentioned, what you are asking for is not possible yet, but we have an internal aggregate function that takes up to 200,000 samples (using reservoir sampling) and returns the samples as a single comma-separated string. There is no way to change the number of samples yet. If the table has fewer than 200,000 rows, they will all be returned. If you are wondering how this works, see the implementation of the aggregate function and the reservoir sampling structure.

There is no way to "explode" or split the results back into individual rows, so I'm not sure how useful this is.
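That said, if you only need individual elements rather than whole rows, Impala's split_part() string function can pull the n-th sample out of the returned string. A sketch, assuming split_part() is available in your Impala version and that the samples are separated by ", " as in the output below:

```sql
-- Extract the third sampled id from the comma-separated result of sample().
-- (functional.alltypestiny is the demo table used in the example that follows;
-- split_part() is 1-indexed.)
SELECT split_part(sample(id), ', ', 3) AS third_sample
FROM functional.alltypestiny;
```

This still only recovers one element per call, so it does not turn the sample back into a proper result set.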

For example, sampling from a table with 8 rows is trivial:



> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id)             |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s

      

(For context: this was added in a recent release to support histogram statistics in the planner, which sadly isn't ready yet.)

+2




Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow progress on it.



0




Knowing that TABLESAMPLE is not available, a workaround is to add an RVAL column (a random 32-bit integer) to each record, and then repeat the sampling by adding "WHERE RVAL > x AND RVAL < y" for suitable values of x and y. Samples drawn from non-overlapping intervals [x1, y1], [x2, y2], ... will be independent. You can also use "WHERE RVAL % 10000 = 1", "= 2", etc. to select disjoint independent subsets.
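A minimal sketch of this approach. The table and column names (big_table, rval) are hypothetical; rand() is Impala's built-in random function returning a double in [0, 1):

```sql
-- One-time step: materialize a copy of the table with a random
-- 32-bit value attached to each row.
CREATE TABLE big_table_rval AS
SELECT t.*, CAST(rand() * 4294967296 AS BIGINT) AS rval
FROM big_table t;

-- Repeatable sample of roughly 100,000 rows out of 1 billion
-- (about 0.01% of the 32-bit range):
SELECT * FROM big_table_rval
WHERE rval >= 0 AND rval < 430000;

-- Independent samples come from non-overlapping rval ranges,
-- or from the modulo trick (change the remainder for a new subset):
SELECT * FROM big_table_rval
WHERE rval % 10000 = 1;
```

Because rval is fixed once at table-creation time, each WHERE clause selects the same rows every run, which is what makes the repeated sampling reproducible.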

0








