Pandas add variable as column and match original table (but they are different lengths)
What I am asking is hard to put into a title, so I made this example to show my question. Here's an example table:
df = pd.DataFrame({'Number': [1,2,3,4,5,6,7,8,9], 'Col1':['a','b','c','d','e','f','g','h','i']})
The next step is to extract df['Number'] and iterate over it:
number = [i*i for i in df['Number']]
Output:
[1, 4, 9, 16, 25, 36, 49, 64, 81]
Now I have a variable 'number', which is a list.
The key step is to split this list into the values below 40 and the values above 40:
number1 = [i for i in number if i < 40]
number2 = [i for i in number if i > 40]
What I want is to add number1 and number2 back to df, even though each is shorter than the table.
That is, add a new column "Type" whose rows line up with the original index and whose content is the label "number1" or "number2", not the values "1, 4, 9 ... 81".
I think you need numpy.where with a boolean mask:
print (df.Number.pow(2) < 40)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
8 False
Name: Number, dtype: bool
df['Type'] = np.where(df.Number.pow(2) < 40, 'number1', 'number2')
#same as
#df['Type'] = np.where(df.Number ** 2 < 40, 'number1', 'number2')
#another solution
#df['Type'] = np.where(df.Number.pow(2).lt(40), 'number1', 'number2')
print (df)
Col1 Number Type
0 a 1 number1
1 b 2 number1
2 c 3 number1
3 d 4 number1
4 e 5 number1
5 f 6 number1
6 g 7 number2
7 h 8 number2
8 i 9 number2
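As a self-contained check, the sketch below rebuilds the question's two lists and confirms that the rows labelled by numpy.where hold the same squares. The names squares, labelled1 and labelled2 are introduced here for illustration only. Note one assumption: the else branch of numpy.where catches everything that is not below 40, so a value of exactly 40 would be labelled 'number2', whereas the question's lists would drop it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Col1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']})

# The lists from the question, kept only for the comparison below.
number = [i * i for i in df['Number']]
number1 = [i for i in number if i < 40]
number2 = [i for i in number if i > 40]

# Vectorised labelling: one label per row, so the lengths always match df.
squares = df['Number'] ** 2
df['Type'] = np.where(squares < 40, 'number1', 'number2')

# Squares of the rows carrying each label (illustrative names).
labelled1 = squares[df['Type'] == 'number1'].tolist()
labelled2 = squares[df['Type'] == 'number2'].tolist()
print(labelled1 == number1, labelled2 == number2)
```

Because the label is computed per row, there is no need to align the shorter lists with the index afterwards.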
Timing: numpy.where is faster:
df = pd.DataFrame({'Number': [1,2,3,4,5,6,7,8,9], 'Col1':['a','b','c','d','e','f','g','h','i']})
#[9000 rows x 5 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
df['Type'] = np.where(df.Number.pow(2).lt(40), 'number1', 'number2')
df['Type1'] = 'number' + (1 + ((df.Number**2)>40).astype(int)).astype(str)
# Rule to produce new values
def f(row):
    if row['Number']**2 > 40:
        val = 'Number2'
    else:
        val = 'Number1'
    return val

df['Type2'] = df.apply(f, axis=1)
#print (df)
In [218]: %timeit df['Type'] = np.where(df.Number.pow(2).lt(40), 'number1', 'number2')
1000 loops, best of 3: 1.63 ms per loop
In [219]: %timeit df['Type1'] = 'number' + (1 + ((df.Number**2)>40).astype(int)).astype(str)
100 loops, best of 3: 13.5 ms per loop
In [220]: %timeit df['Type2'] = df.apply(f, axis=1)
10 loops, best of 3: 127 ms per loop
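The %timeit lines above need IPython. A rough way to repeat the comparison in a plain script is the standard timeit module, as sketched below. The function names with_where, with_concat and with_apply are made up for this example, and the numbers will differ by machine and by pandas/numpy version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'Number': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                     'Col1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']})
big = pd.concat([base] * 1000).reset_index(drop=True)  # 9000 rows


def with_where(frame):
    # Vectorised comparison plus numpy.where.
    return np.where(frame['Number'].pow(2).lt(40), 'number1', 'number2')


def with_concat(frame):
    # Boolean -> int -> str, then string concatenation.
    return 'number' + (1 + (frame['Number'] ** 2 > 40).astype(int)).astype(str)


def with_apply(frame):
    # Row-by-row Python function call.
    return frame.apply(
        lambda row: 'number2' if row['Number'] ** 2 > 40 else 'number1',
        axis=1)


for func in (with_where, with_concat, with_apply):
    # Average over three calls; increase number for steadier figures.
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```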
EDIT:
I am creating helper columns for better understanding:
df['pow'] = df.Number.pow(2)
df['comparing val'] = 40
df['val'] = df.Number.pow(2).lt(40)
print (df)
Col1 Number pow comparing val val
0 a 1 1 40 True
1 b 2 4 40 True
2 c 3 9 40 True
3 d 4 16 40 True
4 e 5 25 40 True
5 f 6 36 40 True
6 g 7 49 40 False
7 h 8 64 40 False
8 i 9 81 40 False
Here's my creative approach:
Data:
In [23]: df
Out[23]:
Col1 Number
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
Solution:
In [24]: df['Type'] = 'number' + (1 + ((df.Number**2)>40).astype(int)).astype(str)
Result:
In [25]: df
Out[25]:
Col1 Number Type
0 a 1 number1
1 b 2 number1
2 c 3 number1
3 d 4 number1
4 e 5 number1
5 f 6 number1
6 g 7 number2
7 h 8 number2
8 i 9 number2
Explanation:
In [29]: ((df.Number**2)>40).astype(int)
Out[29]:
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
Name: Number, dtype: int32
In [30]: 1 + ((df.Number**2)>40).astype(int)
Out[30]:
0 1
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
Name: Number, dtype: int32
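The steps above turn the boolean mask into 0/1, add 1, and glue the result onto the string 'number'. A possible variation, sketched here and not part of the original answer, is to map the mask to the labels directly with Series.map. The names is_large, via_arithmetic and via_map are illustrative. The int32 shown above is platform dependent, so other systems may print int64.

```python
import pandas as pd

df = pd.DataFrame({'Number': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Col1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']})

# True where the square is above 40.
is_large = df['Number'] ** 2 > 40

# Arithmetic route from the answer: False/True -> 0/1 -> 1/2 -> 'number1'/'number2'.
via_arithmetic = 'number' + (1 + is_large.astype(int)).astype(str)

# Mapping route: look each boolean up in a small dict of labels.
via_map = is_large.map({False: 'number1', True: 'number2'})

print(via_arithmetic.tolist() == via_map.tolist())
```

The mapping route also works when the labels do not share a common prefix.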
Create a custom function and then apply it to each row with DataFrame.apply:
import pandas as pd
# Rule to produce new values
def f(row):
    if row['Number']**2 > 40:
        val = 'Number2'
    else:
        val = 'Number1'
    return val
df = pd.DataFrame({'Number': [1,2,3,4,5,6,7,8,9], 'Col1':['a','b','c','d','e','f','g','h','i']})
# Apply the function to construct new column
df['Type'] = df.apply(f, axis=1)
print (df)
Output:
Col1 Number Type
0 a 1 Number1
1 b 2 Number1
2 c 3 Number1
3 d 4 Number1
4 e 5 Number1
5 f 6 Number1
6 g 7 Number2
7 h 8 Number2
8 i 9 Number2
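Since the rule only looks at one column, a lighter variant is to apply the function to that column alone with Series.apply, so pandas does not have to build a row object for every call. This is a sketch under that assumption, and label is a hypothetical helper name.

```python
import pandas as pd


def label(value):
    """Return the label for a single number (hypothetical helper)."""
    return 'Number2' if value ** 2 > 40 else 'Number1'


df = pd.DataFrame({'Number': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Col1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']})

# Apply the rule to the 'Number' column only; the result keeps df's index.
df['Type'] = df['Number'].apply(label)
print(df)
```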