Pandas: DataFrame.sum() or DataFrame().as_matrix().sum()?

I am writing a function that calculates the conditional probability between all pairs of columns in a pd.DataFrame with ~800 columns. I wrote several versions of this function and found a very large difference in computation time between two variants:

col_sums = data.sum()  # simple column sum over the 800 x 800 DataFrame

Option #1: (col_sums and data are a Series and a DataFrame, respectively)

[This runs in a loop over index1 and index2 to cover all combinations]

joint_occurance = data[index1] * data[index2]
sum_joint_occurance = joint_occurance.sum()
max_single_occurance = max(col_sums[index1], col_sums[index2])
cond_prob = sum_joint_occurance / max_single_occurance  # symmetric conditional probability
results[index1][index2] = cond_prob
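For concreteness, here is a minimal runnable sketch of the Option #1 loop, using a small 50 x 50 stand-in for the 800 x 800 DataFrame and `.loc` for the assignment (to avoid chained indexing); the data and sizes are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real 800 x 800 binary DataFrame
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(50, 50)))

col_sums = data.sum()
results = pd.DataFrame(0.0, index=data.columns, columns=data.columns)

for index1 in data.columns:
    for index2 in data.columns:
        joint_occurance = data[index1] * data[index2]
        sum_joint_occurance = joint_occurance.sum()
        max_single_occurance = max(col_sums[index1], col_sums[index2])
        if max_single_occurance:  # guard against an all-zero column
            results.loc[index1, index2] = sum_joint_occurance / max_single_occurance
```

Every access in the inner loop (`data[index1]`, `col_sums[index1]`, `results.loc[...]`) goes through pandas label-based indexing, which is where the overhead accumulates.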


Vs.

Option #2: [while traversing index1 and index2 to get all combinations] The only difference is that, instead of using the DataFrame, I exported the data to a np.array before the loop:

new_data = data.T.as_matrix()  # type: np.ndarray
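A corresponding sketch of Option #2, where all per-element work happens on a plain NumPy array (note that `DataFrame.as_matrix()` was removed in pandas 1.0; `.to_numpy()` is the modern equivalent). The sizes and data are again illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(50, 50)))

# Pull the data out of pandas once; rows of new_data are the original columns
new_data = data.T.to_numpy()
col_sums = new_data.sum(axis=1)
n = new_data.shape[0]
results = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        joint = (new_data[i] * new_data[j]).sum()  # plain ndarray ops, no label lookup
        denom = max(col_sums[i], col_sums[j])
        if denom:
            results[i, j] = joint / denom
```

The loop body is the same arithmetic as Option #1; the only change is that every access is a positional ndarray lookup instead of a pandas label lookup.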


Option #1 execution time: ~1700 s
Option #2 execution time: ~122 s

Questions:

  • Is converting a DataFrame's contents to a np.array best for computational tasks?
  • Is the .sum() procedure in pandas significantly different from the .sum() procedure in NumPy, or is the speed difference due to the label-based access to the data?
  • Why are these execution times so different?
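One way to probe the gap directly is to time the same per-pair operation on a DataFrame and on the extracted array. This is an assumed setup for illustration; absolute times will vary by machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

# 800 x 800 frame like the one in the question (random values as a stand-in)
df = pd.DataFrame(np.random.default_rng(0).random((800, 800)))
arr = df.to_numpy()

# Time one "pair" step from the loop: multiply two columns and sum
t_pandas = timeit.timeit(lambda: (df[0] * df[1]).sum(), number=1000)
t_numpy = timeit.timeit(lambda: (arr[:, 0] * arr[:, 1]).sum(), number=1000)
print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s  (1000 iterations)")
```

Multiplied by the ~640,000 pair iterations in the full run, even a small per-iteration overhead compounds into the minutes-long difference observed above.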




1 answer


While reading the documentation, I came across:

Section 7.1.1 Fast scalar value getting and setting: Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the get_value method, which is implemented on all of the data structures:



In [656]: s.get_value(dates[5])
Out[656]: -0.67368970808837059
In [657]: df.get_value(dates[5], 'A')
Out[657]: -0.67368970808837059

Best guess: because I reference individual data items from the data block many times (~640,000 per matrix), I think the slowdown came from the way I was accessing the data (i.e. "indexing with [] has to handle a lot of cases"), so I should use the get_value() method for scalar lookups instead.
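Note that `get_value()` has since been deprecated and removed in modern pandas; `.at` (label-based) and `.iat` (position-based) are the current fast scalar accessors. A small sketch with a made-up 3 x 3 frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["A", "B", "C"])

x = df.at[1, "B"]   # fast label-based scalar access
y = df.iat[1, 1]    # fast position-based scalar access
assert x == y == 4  # row 1 is [3, 4, 5], so column "B" holds 4
```

Both skip most of the case analysis that `df[...]` indexing performs, which is exactly the overhead the documentation passage above describes.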









