Python Dask - vertical concatenation of 2 DataFrames

I have the following Dask DataFrame in Python:

          A         B      C      D      E      F
0         1         4      8      1      3      5
1         6         6      2      2      0      0
2         9         4      5      0      6     35
3         0         1      7     10      9      4
4         0         7      2      6      1      2

I am trying to combine 2 Dask DataFrames vertically:

ddf_i = ddf + 11.5
dd.concat([ddf, ddf_i], axis=0)


but I am getting this error:

Traceback (most recent call last):
      ...
      File "...", line 572, in concat
        raise ValueError('All inputs have known divisions which cannot '
    ValueError: All inputs have known divisions which cannot be concatenated in order. Specify interleave_partitions=True to ignore order


However, if I try:

dd.concat([ddf, ddf_i], axis=0, interleave_partitions=True)


then it seems to work. Is there a downside to setting interleave_partitions=True (in terms of performance/speed)? Or is there another way to vertically stack Dask DataFrames?



2 answers


If you check the divisions of the dataframe with ddf.divisions, you will find (assuming a single partition) that it holds the index boundaries: (0, 4). This is useful to Dask: when you perform an operation on the data, it knows not to touch partitions that cannot contain the required index values. It also makes some windowing operations much faster when the index is right for the job.



When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the index values had non-overlapping ranges in the two dataframes.



I have a similar problem when concatenating two dataframes with a datetime index.

The datetime-indexed dataframes ddf1 and ddf2 have these divisions:

ddf1.divisions: (Timestamp('2018-03-03 13:04:44.497929'), Timestamp('2018-03-03 13:23:04.759840'))
ddf2.divisions: (Timestamp('2018-03-03 07:09:04.184453'), Timestamp('2018-03-03 07:32:46.815356'))



The two ddfs cannot be concatenated unless interleave_partitions=True is specified. Is this caused by a limitation of datetime index support in Dask? Or do I need to convert the index myself?
