Dask dataframes: known_divisions and performance
I have several parquet files with a column named idx
that I would like to use as the index. The resulting dataframe has about 13M rows. I know that I can read the files and set the index this way (which is slow, ~40 s):
```python
df = dd.read_parquet("file-*.parq")
df = df.set_index("idx")
```
or this other way (which is fast, ~40 ms):
```python
df = dd.read_parquet("file-*.parq", index="idx")
```
A simple operation such as computing the length is over 4 times faster with the second method. What I don't understand is:
- In the first case df.known_divisions returns True, while in the second it returns False. I expected the opposite behavior. I then ran various operations on df, and the version without known divisions always performed better. I am scratching my head over whether this is intentional or not.
- The number of partitions equals the number of files. How can I set a different number of partitions?
UPDATE
It is not just computing len() that is faster. In my calculation I create 4 new dataframes using groupby, apply and join several times, and these are the timings:
| |Load and reindex (s)|Load with index (s)|
|:-----------------|-------------------:|------------------:|
| load | 12.5000 | 0.0124 |
| grp, apply, join | 11.4000 | 6.2700 |
| compute() | 146.0000 | 125.0000 |
| TOTAL | 169.9000 | 131.2820 |
When you use the first method, dask loads the data and partitions the rows by the value of the chosen column (which involves shuffling all the chunks on disk) before doing whatever computation you asked for. For computing the length this is all wasted time, since knowing the divisions does not help there at all; but further computations that use the index (for example, join operations) will be much faster.
In the second version, you are asserting that your chosen column is the index, but dask does not shuffle the data without an explicit request. If min/max statistics are stored in the parquet metadata, and the chunks form a monotonic series (i.e., all the "idx" values in the second chunk are greater than all the values in the first, and so on), then you will still have known divisions and optimized performance for index-based operations. If those conditions are not met, the index column is set but the divisions are unknown, which again is fine for computing the length.