Creating a pandas DataFrame from series with differing indexes

I have two series that are indexed in the same format. Excerpts of both are shown below (given the size of the data, I won't show the entire set):

>>> s1
Out[52]: 
parameter_id  parameter_type_cs_id
4959          1                        -0.2664122
4960          1                      -0.004289398
4961          1                      -0.006652875
4966          1                      -0.004208685
4967          1                       -0.02268688
4968          1                       -0.05958452
4969          1                       -0.01133198
4970          1                       -0.01968251
4972          1                       -0.05860331
4974          1                       -0.08260008
4975          1                       -0.05402012
4979          1                        -0.0308407
4980          1                       -0.02232495
4987          1                        -0.2315813
4990          1                       -0.02171027
...
727241        1                            -0.00156766
727242        1                          -0.0009964491
727243        1                           -0.007068732
727244        1                           -0.003500738
727245        1                           -0.006572505
727246        1                          -0.0005814131
728060        1                             -0.0144799
728062        1                             -0.0418521
728063        1                            -0.01367948
728065        1                            -0.03625054
728066        1                            -0.06806824
728068        1                           -0.007910916
728071        1                           -0.005482052
728073        1                           -0.005845178
intercept                             [-11.4551819018]
Name: coef, Length: 1529, dtype: object

>>> s2
Out[53]: 
parameter_id  parameter_type_cs_id
4958          1                       -0.001683882
4959          1                          -1.009859
4960          1                      -0.0004456379
4961          1                       -0.005564386
4963          1                         -0.9145955
4964          1                      -0.0009077246
4965          1                      -0.0003179153
4966          1                      -0.0006907124
4967          1                        -0.02125838
4968          1                        -0.02443978
4969          1                       -0.002665334
4970          1                       -0.003135213
4971          1                      -0.0003539563
4972          1                        -0.03684852
4973          1                      -0.0001203596
...
728044        1                          -0.0003084855
728060        1                              -0.925618
728061        1                           -0.001192743
728062        1                             -0.9203911
728063        1                           -0.002522615
728064        1                          -0.0003572484
728065        1                           -0.003475959
728066        1                            -0.02329697
728068        1                           -0.001412785
728069        1                           -0.002095895
728070        1                          -9.790675e-05
728071        1                          -0.0003013977
728072        1                          -0.0003369116
728073        1                           -0.000249748
intercept                             [-12.1281459287]
Name: coef, Length: 1898, dtype: object

      

The index formats are the same, so I'm trying to put them in a DataFrame:

d = {'s1': s1, 's2': s2}
df = pd.DataFrame(d)

      

However, I notice that almost everything ends up as NaN, which I find surprising. I looked at the indices of the individual series and noticed that in the DataFrame they are strings instead of having the same format as in the series:

>>> s1.index.values
Out[54]: 
array([(4959, 1), (4960, 1), (4961, 1), ..., (728071, 1), (728073, 1),
       ('intercept', '')], dtype=object)

>>> s2.index.values
Out[55]: 
array([(4958, 1), (4959, 1), (4960, 1), ..., (728072, 1), (728073, 1),
       ('intercept', '')], dtype=object)

      

But they are strings in the DataFrame:

>>> df.index.values
Out[56]: 
array([('4959', '1'), ('4960', '1'), ('4961', '1'), ..., ('8666', '1'),
       ('9638', '1'), ('intercept', '')], dtype=object)

      

Why is it changing the type and causing this problem?

Even stranger to me, if I try the same as above on a smaller set, I see the behavior I would expect (not all NaN, and the indices are not converted):

s1_ = s1[:15]
s2_ = s2[:15]
d_ = {'s1': s1_, 's2': s2_}
df_ = pd.DataFrame(d_) #<---- This has the behavior I would expect

      

EDIT: I found a way that works, but I'm not sure why it works. If I convert both series to DataFrames and then join them, it works as expected:

df_1 = pd.DataFrame({'s1': s1})
df_2 = pd.DataFrame({'s2': s2})
new_df = df_1.join(df_2) #WHY DOES THIS WAY WORK!?!?

      



2 answers


The indices are converted to strings because the last index entry

intercept                             [-11.4551819018]

in your series is a string. The pandas documentation for DataFrames states that when a DataFrame is built from Series, it retains the indexing of the Series, so the string in the last row causes the index values of all rows to be converted.

Your solution of creating two DataFrames and joining them works because the indexing is sequential: you are using the same data structure (DataFrame) throughout instead of converting from one data structure (Series) to another (DataFrame). It looks like a pandas-specific quirk. I would stick with your solution.
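To check whether an index level really mixes integers and strings, as this answer suggests, a minimal sketch along these lines may help. The helper name `level_value_types` and the three-row series are hypothetical stand-ins for the data in the question, and the intercept value is simplified to a plain float:

```python
import pandas as pd


def level_value_types(index, level=0):
    """Return the sorted names of the Python types found in one index level."""
    return sorted({type(value).__name__ for value in index.get_level_values(level)})


# Hypothetical miniature of the series in the question: integer ids plus
# one ('intercept', '') entry at the end.
idx = pd.MultiIndex.from_tuples(
    [(4959, 1), (4960, 1), ("intercept", "")],
    names=["parameter_id", "parameter_type_cs_id"],
)
s = pd.Series([-0.2664122, -0.004289398, -11.4551819018], index=idx, name="coef")

types_found = level_value_types(s.index)
print(types_found)  # more than one entry means the level mixes types
```

If the list contains more than one type name, the level holds mixed values, which is the situation described above.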



I don't have your DataFrame, but here is a small example showing that pandas builds the DataFrame as expected (using pandas 0.15.1 and Python 3.4). As expected, NaNs are introduced where the indices do not match.

The last row of your data is ('intercept', ''), while all other rows are numbers. So ('intercept', '') ends up in the index of each series, and this probably causes the values in the index to be "promoted" to strings.



>>> import pandas as pd
>>> s1 = pd.Series([1,2,3], index=pd.MultiIndex.from_tuples([(1,1),(1,2),(1,3)], names=['a','b']))
>>> s1
a  b
1  1    1
   2    2
   3    3
dtype: int64
>>> s2 = pd.Series([100,200,300], index=pd.MultiIndex.from_tuples([(1,2),(1,3),(1,4)], names=['a','b']))
>>> 
>>> s2
a  b
1  2    100
   3    200
   4    300
dtype: int64
>>> df = pd.DataFrame({'s1':s1, 's2':s2})
>>> df
     s1   s2
a b         
1 1   1  NaN
  2   2  100
  3   3  200
  4 NaN  300
>>> df.index.values
array([(1, 1), (1, 2), (1, 3), (1, 4)], dtype=object)
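
If the string entry is indeed what disturbs the index, one possible workaround is to set the ('intercept', '') row aside before building the DataFrame, so that both indexes contain only integers. This is only a sketch under that assumption: `drop_label` is a hypothetical helper, and the small series stand in for the real data, with float values in place of the list-valued intercepts.

```python
import pandas as pd


def drop_label(s, label="intercept"):
    """Copy s without rows whose first index level equals label."""
    kept = [(key, value) for key, value in s.items() if key[0] != label]
    # Rebuild the index from the remaining tuples so no string is left in it.
    index = pd.MultiIndex.from_tuples([key for key, _ in kept], names=s.index.names)
    return pd.Series([value for _, value in kept], index=index, name=s.name)


names = ["parameter_id", "parameter_type_cs_id"]
s1 = pd.Series(
    [-0.2664122, -0.004289398, -11.4551819018],
    index=pd.MultiIndex.from_tuples(
        [(4959, 1), (4960, 1), ("intercept", "")], names=names
    ),
    name="coef",
)
s2 = pd.Series(
    [-0.001683882, -1.009859, -12.1281459287],
    index=pd.MultiIndex.from_tuples(
        [(4958, 1), (4959, 1), ("intercept", "")], names=names
    ),
    name="coef",
)

# Both indexes are now integer-only, as in the small example above.
df = pd.DataFrame({"s1": drop_label(s1), "s2": drop_label(s2)})
print(df)
```

The intercepts can then be kept in a separate structure, since they are not per-parameter coefficients anyway.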

      
