Adding Rows Based on a Numeric Column Value

I may be wrong, but my data is going to be analyzed, and I believe I need one row for each application.

My dataframe looks something like this:

ID   Job Title  Number Applied  Hired  Feature(Math)
 1  Accountant               3      2              1
 2   Marketing               1      1              0
 3     Finance               1      1              1

      

I need to make it look like (1 = yes, 0 = no):

ID   Job Title  Number Applied  Hired  Feature(Math)       
 1  Accountant               1      0              1
 2  Accountant               1      1              1
 3  Accountant               1      1              1
 4   Marketing               1      1              0
 5     Finance               1      1              1

      

I need to add a line for each person who applied. Number Applied should always be 1. Once this is complete, we can drop the column Number Applied.

There are additional features that I have not included here. The goal of the analysis is to apply a machine learning algorithm to predict whether a person will get a job based on their skill set. My current dataframe does not work for this: when I convert Hired to yes or no, each aggregated row counts only once, so it looks as if only 2 people with math skills were hired instead of 3.
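To make the miscount concrete, here is a minimal sketch that rebuilds the sample frame shown above and compares the true number of math-skilled hires with what a yes/no conversion of Hired would report. The variable names (`math_rows`, `true_hires`, `flagged_hires`) are only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

math_rows = df[df["Feature(Math)"] == 1]

# True number of hired people with math skills: 2 (Accountant) + 1 (Finance)
true_hires = int(math_rows["Hired"].sum())

# After converting Hired to a yes/no flag, each aggregated row counts once
flagged_hires = int((math_rows["Hired"] > 0).sum())

print(true_hires, flagged_hires)  # 3 2
```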

2 answers


This is the approach I used earlier to "expand" a set of aggregated samples.

import pandas as pd

def iterdicts(df):
    """
    Utility to iterate over rows of a data frame as dictionaries.
    """
    col = df.columns
    for row in df.itertuples(name=None, index=False):
        yield dict(zip(col, row))

def deaggregate(dicts, *columns):
    """
    Deaggregate an iterable of dictionaries `dicts` where the numbers in `columns`
    are assumed to be aggregated counts.
    """
    for row in dicts:
        # emit one output row per unit of the largest count in this row
        for i in range(max(row[c] for c in columns)):
            d = dict(row)

            # replace each count by a 0/1 indicator
            d.update({c: int(i < row[c]) for c in columns})
            yield d

def unroll(df, *columns):
    return pd.DataFrame(deaggregate(iterdicts(df), *columns))

      

Then you can do



unroll(df, 'Number Applied', 'Hired')

      

   Feature(Math)  Hired  ID   Job Title  Number Applied
0              1      1   1  Accountant               1
1              1      1   1  Accountant               1
2              1      0   1  Accountant               1
3              0      1   2   Marketing               1
4              1      1   3     Finance               1
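A quick way to check an expansion like this is to confirm that the column totals are unchanged. The sketch below uses a compact Python 3 re-implementation of the same idea (it is not the code above, and the sample frame is rebuilt from the question), then compares the sums before and after.

```python
import pandas as pd

def unroll(df, *columns):
    """Compact sketch of the deaggregation idea: one output row per unit of count."""
    rows = []
    for row in df.to_dict("records"):
        for i in range(max(row[c] for c in columns)):
            d = dict(row)
            # replace each count by a 0/1 indicator
            d.update({c: int(i < row[c]) for c in columns})
            rows.append(d)
    return pd.DataFrame(rows)

# Sample data copied from the table in the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

out = unroll(df, "Number Applied", "Hired")

# The expansion should preserve the totals of both count columns
same_applied = out["Number Applied"].sum() == df["Number Applied"].sum()
same_hired = out["Hired"].sum() == df["Hired"].sum()
print(len(out), same_applied, same_hired)
```

Note that this check assumes Hired never exceeds Number Applied in any row; otherwise extra rows would be generated.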

      


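# Repeat each row once per applicant (the index labels are repeated too)
d1 = df.loc[df.index.repeat(df['Number Applied'])]

# Within each job title, flag the last `Hired` copies as hired.
# Note: this assumes each Job Title occurs in only one row of df.
hired = (
    d1.groupby('Job Title').cumcount() >=
        d1['Number Applied'] - d1['Hired']
).astype(int)

# Set every count to 1 and replace Hired with the 0/1 flag
d1.assign(**{'Number Applied': 1, 'Hired': hired})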

   ID   Job Title  Number Applied  Hired  Feature(Math)
0   1  Accountant               1      0              1
0   1  Accountant               1      1              1
0   1  Accountant               1      1              1
1   2   Marketing               1      1              0
2   3     Finance               1      1              1
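The result above still carries the repeated index, the original ID values, and the Number Applied column. A possible follow-up, sketched here under the assumption that the goal is the layout from the question (fresh IDs, no Number Applied), is shown below. Grouping on the index level instead of 'Job Title' is my own variation, chosen so that repeated job titles would not interfere.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

# Repeat each row once per applicant
d1 = df.loc[df.index.repeat(df["Number Applied"])]

# Position of each copy within its original row (0, 1, 2, ...)
pos = d1.groupby(level=0).cumcount()

# The last `Hired` copies of each original row are marked as hired
hired = (pos >= d1["Number Applied"] - d1["Hired"]).astype(int)

result = (
    d1.assign(Hired=hired.to_numpy())   # positional assignment, no index alignment
      .drop(columns="Number Applied")   # every value would be 1 anyway
      .reset_index(drop=True)
)

# Renumber the IDs so that each application gets its own
result["ID"] = range(1, len(result) + 1)

print(result)
```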

      


