Dask dataframes: known_divisions and performance

I have several parquet files, each with a column named idx that I would like to use as the index. The resulting dataframe has about 13M rows. I know that I can read the files and assign the index this way (which takes ~40 s):

import dask.dataframe as dd

df = dd.read_parquet("file-*.parq")
df = df.set_index("idx")

or another way (which is fast, ~40 ms):

df = dd.read_parquet("file-*.parq", index="idx")

A simple operation such as computing the length is over four times faster in the second case. What I don't understand:

  • In the first case, df.known_divisions returns True, and in the second it returns False. I expected the opposite behavior. I then ran some operations on top of df, and the version without known divisions always gave the best performance. I am scratching my head over whether this is intentional or not.
  • The number of partitions equals the number of files. How can I set a different number of partitions? (A sketch using repartition follows this list.)
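On the second point: the answer below does not cover partition counts, but dask's repartition method is the usual tool for this. A minimal sketch, assuming the same "file-*.parq" files as above; the division values in the comment are made up for illustration:

import dask.dataframe as dd

df = dd.read_parquet("file-*.parq", index="idx")

# Target a specific number of partitions directly:
df8 = df.repartition(npartitions=8)
print(df8.npartitions)  # 8

# Or, when divisions are known, pass explicit index boundaries
# (these idx values are purely illustrative; N boundaries give N-1 partitions):
# df3 = df.repartition(divisions=[0, 4_000_000, 8_000_000, 13_000_000])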

UPDATE: It is not just computing len that is faster. In my calculation, I create 4 new dataframes using groupby, apply, and join multiple times; these are the timings:

|                  |Load and reindex (s)|Load with index (s)|
|:-----------------|-------------------:|------------------:|
| load             |            12.5000 |            0.0124 |
| grp, apply, join |            11.4000 |            6.2700 |
| compute()        |           146.0000 |          125.0000 |
| TOTAL            |           169.9000 |          131.2820 |
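For context, a hypothetical sketch of a pipeline with the shape described above; the column names "grp" and "val" and the mean aggregation are assumptions, since the real computation is not shown in the question:

import dask.dataframe as dd

df = dd.read_parquet("file-*.parq", index="idx")

# Per-group aggregate, standing in for the groupby/apply step:
grp_mean = df.groupby("grp")["val"].mean().to_frame("grp_mean")

# Join the per-group result back onto the main frame; joins and other
# index-aligned operations are where known divisions pay off:
out = df.merge(grp_mean, left_on="grp", right_index=True)

result = out.compute()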



1 answer


When you use the first method, dask loads the data and partitions the rows by the values of the chosen column (which involves shuffling all of the on-disk chunks) before doing whatever computation you asked for. For computing the length, this is all wasted time, since the index information contributes nothing there, but subsequent computations involving that index (e.g., join operations) will be much faster.
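A minimal sketch of what the first method buys you (the idx slice bounds below are hypothetical):

import dask.dataframe as dd

# First method: set_index shuffles/sorts the data by "idx", so dask
# afterwards knows which idx range each partition covers.
df = dd.read_parquet("file-*.parq").set_index("idx")
print(df.known_divisions)  # True
print(df.divisions)        # the idx boundary values between partitions

# Index-based operations can now prune partitions: this slice only
# touches partitions whose divisions overlap the requested range.
sub = df.loc[1000:2000].compute()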



In the second version, you are asserting that the chosen column is the index, but dask does not shuffle the data without an explicit request to do so. If statistics were saved in the parquet metadata, and the max/min values of each parquet chunk form a monotonic series (i.e., all of the "idx" values in the second chunk are greater than all of the values in the first, and so on), then you will still get known divisions and optimized performance for index-based operations. If those conditions are not met, the index column is set but the divisions are unknown, which is again perfectly fine for computing the length.
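A sketch of checking this at read time. Note that the keyword for deriving divisions from parquet statistics varies by dask version; recent versions spell it calculate_divisions, which is an assumption about the installed version rather than something from the question:

import dask.dataframe as dd

# Second method: index declared at read time, no shuffle performed.
df = dd.read_parquet("file-*.parq", index="idx")
print(df.known_divisions)  # False here: divisions were never derived

# Ask dask to read the per-file min/max statistics and derive divisions
# from them; this only yields known divisions if the files' idx ranges
# are sorted and non-overlapping, as described above.
df2 = dd.read_parquet("file-*.parq", index="idx", calculate_divisions=True)
print(df2.known_divisions)  # True if the statistics form a monotonic series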
