Adding Rows Based on a Numeric Column Value
I may be approaching this the wrong way, but my data is going to be analyzed, and I need one row per application.
My dataframe looks something like this:
ID  Job Title   Number Applied  Hired  Feature(Math)
1   Accountant  3               2      1
2   Marketing   1               1      0
3   Finance     1               1      1
I need to make it look like (1 = yes, 0 = no):
ID  Job Title   Number Applied  Hired  Feature(Math)
1   Accountant  1               0      1
2   Accountant  1               1      1
3   Accountant  1               1      1
4   Marketing   1               1      0
5   Finance     1               1      1
I need to add a row for each person who applied. Number Applied should always be 1. Once this is complete, we can drop the Number Applied column.
There are additional features that I have not included. The goal of the analysis is to apply a machine learning algorithm to predict whether a person will get a job based on their skill set. My current dataframe does not work for this because, when I convert Hired to yes or no, it counts only 2 people with math skills as hired instead of 3.
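To make the problem concrete, here is a minimal sketch that rebuilds the example frame from the table above and shows the miscount. The variable names (`math_rows`, `hired_as_flag`, `hired_actual`) are only illustrative and not part of the original data.

```python
import pandas as pd

# The example data from the question, typed in by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

# Rows where the applicants have the math feature.
math_rows = df[df["Feature(Math)"] == 1]

# Treating Hired as a yes/no flag counts rows, not people.
hired_as_flag = int((math_rows["Hired"] > 0).sum())

# Summing the aggregated counts gives the real number of hires.
hired_actual = int(math_rows["Hired"].sum())

print(hired_as_flag, hired_actual)
```

The flag-based count sees two rows, while the aggregated column actually records three hires, which is why each applicant needs a row of their own.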
This is the approach I used earlier to "expand" a set of aggregated samples.
import pandas as pd

def iterdicts(df):
    """
    Utility to iterate over rows of a data frame as dictionaries.
    """
    col = df.columns
    for row in df.itertuples(name=None, index=False):
        yield dict(zip(col, row))

def deaggregate(dicts, *columns):
    """
    Deaggregate an iterable of dictionaries `dicts` where the numbers in `columns`
    are assumed to be aggregated counts.
    """
    for row in dicts:
        # emit one row per unit of the largest count (range replaces Python 2's xrange)
        for i in range(max(row[c] for c in columns)):
            d = dict(row)
            # replace each count by a 0/1 indicator
            d.update({c: int(i < row[c]) for c in columns})
            yield d

def unroll(df, *columns):
    return pd.DataFrame(deaggregate(iterdicts(df), *columns))
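Then you can do the following (depending on your pandas version, the columns may come out in their original order rather than sorted by name):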
unroll(df, 'Number Applied', 'Hired')
   Feature(Math)  Hired  ID   Job Title  Number Applied
0              1      1   1  Accountant               1
1              1      1   1  Accountant               1
2              1      0   1  Accountant               1
3              0      1   2   Marketing               1
4              1      1   3     Finance               1
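As a quick sanity check, the expansion should preserve the totals: five rows overall and three hires. Below is a compact, self-contained sketch of the same idea that builds the rows from `to_dict("records")`. The helper name `unroll_records` is hypothetical, and the frame is the question's example typed in by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

def unroll_records(frame, *columns):
    """Expand aggregated count columns into one 0/1 row per unit (illustrative helper)."""
    rows = []
    for row in frame.to_dict("records"):
        # one output row per unit of the largest count in this input row
        for i in range(max(row[c] for c in columns)):
            expanded_row = dict(row)
            # the first row[c] copies get 1, the remaining copies get 0
            expanded_row.update({c: int(i < row[c]) for c in columns})
            rows.append(expanded_row)
    return pd.DataFrame(rows)

expanded = unroll_records(df, "Number Applied", "Hired")

# The expanded frame should account for every applicant and every hire.
print(len(expanded) == df["Number Applied"].sum())
print(expanded["Hired"].sum() == df["Hired"].sum())
```

Note that this approach puts the hired applicants first within each job, whereas the desired output in the question lists the applicant who was not hired first. For model training the row order should not matter.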
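Alternatively, repeat each row Number Applied times, then flag the last Hired copies within each job as hired:
# repeat each row once per applicant
d1 = df.loc[df.index.repeat(df['Number Applied'])]
# within each job, the first (applied - hired) copies get 0 and the rest get 1
hired = (
    d1.groupby('Job Title').cumcount() >=
    d1['Number Applied'] - d1['Hired']
).astype(int)
# set Number Applied to 1 and replace Hired with the 0/1 flag
d1.assign(**{'Number Applied': 1, 'Hired': hired})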
   ID   Job Title  Number Applied  Hired  Feature(Math)
0   1  Accountant               1      0              1
0   1  Accountant               1      1              1
0   1  Accountant               1      1              1
1   2   Marketing               1      1              0
2   3     Finance               1      1              1
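The output above still carries the repeated index and the original ID values. To get to the layout the question asks for, one possible finishing step is to drop Number Applied, reset the index, and renumber ID. This is only a sketch: grouping on the index level instead of Job Title is my own assumption, made so that duplicate job titles in other rows would not be mixed together.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

# Repeat each row once per applicant.
d1 = df.loc[df.index.repeat(df["Number Applied"])]

# Position of each copy within its original row (assumption: group on the
# index level rather than on Job Title, in case job titles repeat).
pos = d1.groupby(level=0).cumcount()

# The first (applied - hired) copies are 0, the remaining copies are 1.
hired = (pos >= d1["Number Applied"] - d1["Hired"]).astype(int)

result = (
    d1.assign(Hired=hired)
      .drop(columns="Number Applied")
      .reset_index(drop=True)
)

# Give every applicant their own ID, as in the desired output.
result["ID"] = range(1, len(result) + 1)
print(result)
```

For the example data this should yield five rows with ID running from 1 to 5 and the non-hired accountant listed first, matching the table in the question minus the dropped column.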