Creating new columns in the DataFrame

I have a DataFrame with multiple columns:

   'a'  'b'  'c'  'd'
0  'x'   3    3    5
1  'y'   2    3    6
2  'z'   1    4    1

      

I want to create some new columns that depend on the data. For every possible value in column "a" I want two new columns (I have a list of all the distinct values in column "a"; there are only a few of them). Each new column has two conditions. For the first new column, column "a" must equal the required value (for example, "x") and column "b" must equal column "c". For the second new column, column "a" must still equal the required value, but column "b" must equal column "d" (column "b" will be either "c" or "d"). If both conditions are met, the new column gets 1; otherwise it gets 0.

This is how it would look with the above DataFrame example, given that:

a. Desired value for columns 'e' and 'f' is 'x'

b. Desired value for columns 'g' and 'h' is 'y'

c. Desired value for columns 'j' and 'k' is 'z'

d. Columns 'e', 'g', 'j' mean columns 'b' and 'c' are equal

e. Columns 'f', 'h', 'k' mean columns 'b' and 'd' are equal

   'a'  'b'  'c'  'd'  'e'  'f'  'g'  'h'  'j'  'k'
0  'x'   3    3    5    1    0    0    0    0    0
1  'y'   2    3    6    0    0    0    0    0    0 
2  'z'   1    4    1    0    0    0    0    0    1

      

I have tried using the apply function for each case. Here is the call for when we want to check that column "a" is "x" and that columns "b" and "c" are equal:

data['d']= data.apply(lambda row: assignEvent(row, 'x', row['c']), axis=1 )

      

With the assignEvent function here:

def assignEvent(row, event, venue):
    """
    :param event: the desired event we're looking for
    :param venue: Either column 'c' or 'd' 
    """

    if (str(row['a'])==event) & (str(venue)==str(row['b'])):
            return 1
    else:
            return 0

      

It doesn't work: when I'm done, all the values in the new columns are 0. I'm not sure why, because I tested it and I know that execution reaches the if statement in my function.



2 answers


I've changed a couple of things. First, the values in column a contain literal quote characters, so I strip them with replace inside the assignEvent function. Second, I pass only the column name as the venue parameter and let the function look that column up on the row.

def assignEvent(row, event, venue):
    """
    :param event: the desired event we're looking for
    :param venue: name of the column to compare with 'b', either 'c' or 'd'
    """
    # Strip the literal quotes stored in column 'a' before comparing.
    if row['a'].replace("'", "") == event and row[venue] == row['b']:
        return 1
    else:
        return 0

df['dd'] = df.apply(lambda row: assignEvent(row, 'x', 'c'), axis=1)

      



Output:

     a  b  c  d  dd
0  'x'  3  3  5   1
1  'y'  2  3  6   0
2  'z'  1  4  1   0
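The reason the original version returned all zeros can be reproduced in a few lines. This is only a minimal sketch, assuming column a really stores strings that include the quote characters (as the printed frame suggests); the frame construction and the a_clean name are illustrative and not part of the question.

```python
import pandas as pd

# Assumed reconstruction of the sample data: column 'a' holds strings
# that include literal single quotes, e.g. "'x'" rather than "x".
df = pd.DataFrame({
    "a": ["'x'", "'y'", "'z'"],
    "b": [3, 2, 1],
    "c": [3, 3, 4],
    "d": [5, 6, 1],
})

# Comparing against the bare letter never matches, so every row gets 0.
never_matches = (df["a"] == "x").any()

# Stripping the quotes once, up front, avoids calling replace() per row.
a_clean = df["a"].str.strip("'")  # hypothetical helper Series
matches = (a_clean == "x").tolist()
```

Cleaning the column once also means the comparison value can stay 'x' everywhere else in the code.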

      



Method:

I'm going to present an approach that avoids apply, for better speed and scalability. It looks like you are essentially aiming to add columns containing two different sets of indicator variables for the entries in data['a'], depending on the conditions you described in your question. If that is not the case and only a subset of the values in column a should receive indicators, see the Addendum.

Getting indicator variables is simple:

dummies = pd.get_dummies(data['a'])
dummies
Out[335]: 
   'x'  'y'  'z'
0    1    0    0
1    0    1    0
2    0    0    1

      

Identifying the rows where the conditions are true is also simple, shown here using numpy.where:

np.where(data['b'] == data['c'], 1, 0)
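For comparison, the same 0/1 vector can also be obtained by casting the boolean comparison directly. A small sketch on an assumed two-column frame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Assumed subset of the sample data; only 'b' and 'c' are needed here.
data = pd.DataFrame({"b": [3, 2, 1], "c": [3, 3, 4]})

# np.where maps True/False to 1/0 and returns a NumPy array.
via_where = np.where(data["b"] == data["c"], 1, 0)

# Casting the boolean Series gives the same values and keeps the index.
via_astype = (data["b"] == data["c"]).astype(int)
```

The second form returns a Series, which can be convenient when the result needs to stay aligned with the DataFrame's index.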

      

To combine them, we can multiply the two element-wise (the condition is broadcast across the dummy columns), after reshaping the output of np.where into a column:

np.array([np.where(data['b'] == data['c'], 1, 0)]).T*dummies
Out[338]: 
   'x'  'y'  'z'
0    1    0    0
1    0    0    0
2    0    0    0
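The same row-wise masking can be written without the manual transpose by using DataFrame.mul with axis=0. This is a sketch of an alternative spelling, not the code used in this answer; the explicit astype(int) is there on the assumption that some pandas versions return boolean dummies.

```python
import pandas as pd

# Assumed reconstruction of the sample data (column 'd' is not needed here).
data = pd.DataFrame({
    "a": ["'x'", "'y'", "'z'"],
    "b": [3, 2, 1],
    "c": [3, 3, 4],
})

dummies = pd.get_dummies(data["a"]).astype(int)
b_eq_c = (data["b"] == data["c"]).astype(int)

# axis=0 aligns the Series with the row index, so each dummy row is
# multiplied by that row's 0/1 condition value.
masked = dummies.mul(b_eq_c, axis=0)
```

Only row 0 satisfies b == c, so only its 'x' indicator survives the multiplication.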

      

To do this for both conditions, attach the results to the original data, and format everything as you indicated, I use the following:

def col_a_dummies(data):
    dummies = pd.get_dummies(data['a'])
    b_c = np.array([np.where(data['b'] == data['c'], 1, 0)]).T*dummies
    b_d = np.array([np.where(data['b'] == data['d'], 1, 0)]).T*dummies
    return pd.concat([data[['a', 'b', 'c', 'd']], b_c, b_d], axis=1)

def format_dummies(dummies):
    dummies.columns = ['a', 'b', 'c', 'd', 'e', 'g', 'j', 'f', 'h', 'k']
    return dummies.sort_index(axis=1)

data = format_dummies(col_a_dummies(data))
data
Out[362]: 
     a  b  c  d  e  f  g  h  j  k
0  'x'  3  3  5  1  0  0  0  0  0
1  'y'  2  3  6  0  0  0  0  0  0
2  'z'  1  4  1  0  0  0  0  0  1
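If relying on column order and sort_index for the naming feels fragile, the column names can be assigned explicitly. The following is a sketch of a different, still vectorized, formulation rather than the get_dummies method above; the targets mapping is hypothetical and simply encodes the naming from the question.

```python
import pandas as pd

# Assumed reconstruction of the sample data.
data = pd.DataFrame({
    "a": ["'x'", "'y'", "'z'"],
    "b": [3, 2, 1],
    "c": [3, 3, 4],
    "d": [5, 6, 1],
})

# Hypothetical mapping: value in 'a' -> (name for b == c, name for b == d).
targets = {"'x'": ("e", "f"), "'y'": ("g", "h"), "'z'": ("j", "k")}

result = data.copy()
for value, (bc_name, bd_name) in targets.items():
    is_value = result["a"] == value
    # Each new column is 1 only where both conditions hold.
    result[bc_name] = (is_value & (result["b"] == result["c"])).astype(int)
    result[bd_name] = (is_value & (result["b"] == result["d"])).astype(int)
```

This loops over the handful of distinct values, not over rows, so it should still scale with the number of rows.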

      

Addendum: This method still largely works if the DataFrame is filtered before being passed to get_dummies. It does add one further requirement: the DataFrame's index must be unique.

def filtered_col_a_dummies(data, values):
    filtered = data[data['a'].isin(values)]
    dummies = pd.get_dummies(filtered['a'])
    b_c = np.array([np.where(filtered['b'] == filtered['c'], 1, 0)]).T*dummies
    b_d = np.array([np.where(filtered['b'] == filtered['d'], 1, 0)]).T*dummies
    return pd.concat([data[['a', 'b', 'c', 'd']], b_c, b_d], axis=1).fillna(0)
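The index matters because concat with axis=1 aligns rows by index label. A minimal sketch, assuming a frame that arrives with repeated labels (the data and variable names are illustrative), of resetting the index before filtering:

```python
import pandas as pd

# Assumed example with a repeated index label, e.g. after stacking frames.
data = pd.DataFrame(
    {"a": ["'x'", "'y'", "'z'"], "b": [3, 2, 1], "c": [3, 3, 4], "d": [5, 6, 1]},
    index=[0, 0, 1],
)
had_duplicates = not data.index.is_unique

# A fresh RangeIndex restores uniqueness before filtering.
data = data.reset_index(drop=True)

values = ["'x'", "'z'"]
filtered = data[data["a"].isin(values)]
dummies = pd.get_dummies(filtered["a"]).astype(int)
b_d = dummies.mul((filtered["b"] == filtered["d"]).astype(int), axis=0)

# Rows dropped by the filter come back as NaN after alignment, hence fillna(0).
combined = pd.concat([data[["a", "b", "c", "d"]], b_d], axis=1).fillna(0)
```

Note that the NaN round trip turns the indicator columns into floats; an astype(int) on those columns restores integers if that matters.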

      




%timeit results

With three rows, this is already faster:

def assignEvent(row, event, venue):
    """
    :param event: the desired event we're looking for
    :param venue: Either column 'c' or 'd' 
    """

    if (row['a']==event) & (row[venue]==row['b']):
            return 1
    else:
            return 0

def no_sort_format_dummies(dummies):
    dummies.columns = ['a', 'b', 'c', 'd', 'e', 'g', 'j', 'f', 'h', 'k']
    return dummies

%timeit data.apply(lambda row: assignEvent(row, "'x'", 'c'), axis=1)
1000 loops, best of 3: 467 µs per loop
# needs to be repeated six times in total, total time 2.80 ms, ignoring assignment

%timeit format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.58 ms per loop

      

or

%timeit no_sort_format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.07 ms per loop

      

if you don't sort the columns.

If filtered:

%timeit format_dummies(filtered_col_a_dummies(data, ("'x'", "'y'", "'z'")))
100 loops, best of 3: 3.92 ms per loop

      

With 300 rows, the difference becomes more pronounced:

%timeit data.apply(lambda row: assignEvent(row, "'x'", 'c'), axis=1)
100 loops, best of 3: 10.9 ms per loop

%timeit format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.73 ms per loop

%timeit no_sort_format_dummies(col_a_dummies(data))
100 loops, best of 3: 2.14 ms per loop

%timeit format_dummies(filtered_col_a_dummies(data, ("'x'", "'y'", "'z'")))
100 loops, best of 3: 4.04 ms per loop
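Outside IPython, a comparable measurement can be set up with the standard timeit module. This is a rough sketch, assuming the 300-row frame is the three sample rows repeated 100 times (the answer does not say how it was built); with_apply and vectorized are hypothetical stand-ins, and the vectorized function is a plain boolean comparison, not the get_dummies pipeline timed above. No timings are quoted because they depend on the machine and library versions.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "a": ["'x'", "'y'", "'z'"],
    "b": [3, 2, 1],
    "c": [3, 3, 4],
    "d": [5, 6, 1],
})
# Assumed construction of a 300-row frame: the sample repeated 100 times.
data = pd.concat([base] * 100, ignore_index=True)

def with_apply(frame):
    # Row-by-row check, in the spirit of assignEvent.
    return frame.apply(
        lambda row: int(row["a"] == "'x'" and row["b"] == row["c"]), axis=1
    )

def vectorized(frame):
    # Whole-column comparison with no Python-level loop over rows.
    return ((frame["a"] == "'x'") & (frame["b"] == frame["c"])).astype(int)

# Total seconds for 20 calls of each variant.
apply_seconds = timeit.timeit(lambda: with_apply(data), number=20)
vector_seconds = timeit.timeit(lambda: vectorized(data), number=20)
```

Both functions should produce the same column, so the comparison measures only the cost of row-wise apply against column-wise operations.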

      
