Python: dump all unique combinations with constraint, in Pandas DataFrame
You will have to forgive me, as I am currently learning Python. I have a Pandas DataFrame with the following work attribute columns: Name, Position, HourlyPay.
DF
Index  Name   Position    HourlyPay
0      John   Analyst     15.00
1      Mike   Programmer  18.00
2      Lisa   Supervisor  16.75
4      Frank  Analyst     15.50
I want to output another DataFrame (as shown below) with all possible unique combinations of an n-person team, using their positions as column headers and adding another column, TotalHourlyPay, that sums their HourlyPay, sorted from the highest TotalHourlyPay down.
uniqueDf
Index  Analyst  Programmer  Supervisor  TotalHourlyPay
0      Frank    Mike        Lisa        50.25
1      John     Mike        Lisa        49.75
I used 3 positions for my example uniqueDf above, but this can change at times. For example, 2 Analyst positions can run at the same time, so I want to be able to dynamically add or remove position columns whenever I need to. A second example is shown below.
secondExampleDf
Index  Analyst  Analyst  Programmer  Supervisor  TotalHourlyPay
0      Frank    John     Mike        Lisa        65.25
This is a very simple example of a much larger dataset. I tried this problem but my code is not worth showing. The closest I got was using itertools.combinations on the df.Name column. I also tried to add a summed TotalHourlyPay column using a join or merge between two DataFrames, but I couldn't get that to work either.
possibleCombinations = list(itertools.combinations(df.Name, 3))
uniqueDf = pd.DataFrame(possibleCombinations, columns=['Employee1', 'Employee2', 'Employee3'])
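What I imagine the next step looks like is roughly the sketch below (just a sketch, assuming names are unique; it maps names back to their pay instead of using a join, and it still doesn't enforce the position constraint, which is the part I can't figure out):
# look up each name's pay, then sum across the three employee columns
payByName = df.set_index('Name')['HourlyPay']
uniqueDf['TotalHourlyPay'] = uniqueDf[['Employee1', 'Employee2', 'Employee3']].apply(
    lambda row: payByName[list(row)].sum(), axis=1)
uniqueDf = uniqueDf.sort_values('TotalHourlyPay', ascending=False)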
I'm just asking if anyone can point me in the right direction. I know SO is not about writing code for you, which I definitely don't want. I'm just confused about what to do next; at this point every link in my Google searches is already highlighted as visited. Any help is appreciated.
Thanks.
This code gives the desired output:
import pandas as pd
import itertools
# definition of dataframe
df = pd.DataFrame()
df["Index"] = [0, 1, 2, 4, 5, 6, 8, 9, 10]
df["Name"] = ["John", "Mike", "Lisa", "Franck", "Peter", "Suzanne", "Laura", "Sam", "Manon"]
df["Position"] = ["Analyst", "Programmer", "Supervisor", "Analyst", "Programmer", "Programmer", "Supervisor", "Analyst", "Analyst"]
df["HourlyPay"] = [15.00, 18.00, 16.75, 15.50, 17.00, 18.00, 16.00, 12.00, 13.00]
# dict of dataframes by position
unique_positions = list(df["Position"].unique())
pos_dfs = {}
for pos in unique_positions:
    pos_dfs[pos] = df.loc[df["Position"] == pos].reset_index()
# required positions with count
req_pos_count = pd.DataFrame.from_dict({"count":{"Analyst": 2, "Supervisor": 1, "Programmer": 1}})
req_pos_unique = list(req_pos_count.index.unique())
req_pos_dfs = [pos_dfs[pos] for pos in req_pos_unique]
# one entry per required slot, e.g. ["Analyst", "Analyst", "Supervisor", "Programmer"]
which_pos = [item for _, row in req_pos_count.iterrows() for item in [row.name] * row["count"]]
# unique column names for the slots, e.g. "0_Analyst", "1_Analyst", ...
which_pos_count = [str(i) + "_" + pos for i, pos in enumerate(which_pos)]
# combinations
# combinations of row numbers within each position's dataframe
pos_dfs_rows = [list(itertools.combinations(range(len(pos_df)), req_pos_count.loc[req_pos_unique[i]]["count"]))
                for i, pos_df in enumerate(req_pos_dfs)]
# cartesian product across positions, flattened to one list of row numbers per team
pos_dfs_rows_comb = [[it for item in sublist for it in item]
                     for sublist in list(itertools.product(*pos_dfs_rows))]
# building of result
uniqueDf = pd.DataFrame(index=range(len(pos_dfs_rows_comb)), columns=which_pos_count+["TotalHourlyPay"])
for k, comb in enumerate(pos_dfs_rows_comb):
    # .ix is deprecated; .iloc selects by row number within each position's dataframe
    rows = [pos_dfs[which_pos[i]].iloc[ind] for i, ind in enumerate(comb)]
    tp = pd.concat(rows, axis=1, ignore_index=True).transpose()
    uniqueDf.loc[k, which_pos_count] = list(tp["Name"])
    uniqueDf.loc[k, "TotalHourlyPay"] = tp["HourlyPay"].sum()
uniqueDf.sort_values(by="TotalHourlyPay", ascending=False, inplace=True)
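The key idea above is that the combinations are taken within each position first, so two Analyst slots can never hold the same person and a given pair is only counted once; the cartesian product across positions then assembles the teams. A quick check of the result (just an inspection step, not required) is:
print(uniqueDf.head())
which should now list the best-paid teams first.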
Here is the process I followed.
First, get indicator variables for the positions each person holds and does not hold:
position = pd.get_dummies(df['Position']).astype(bool)
not_position = ~pd.get_dummies(df['Position'], prefix='not').astype(bool)
df1 = pd.concat([df, position, not_position], axis=1)
Then create possible combinations:
df2 = df1.merge(df1, left_on='Programmer', right_on='not_Programmer', suffixes=['', '_y'])
df3 = df2.merge(df1, left_on='Supervisor', right_on='not_Supervisor', suffixes=['', '_z'])
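The merges use the boolean flags as join keys: the first one only pairs rows whose Programmer flags differ, and the second one only attaches rows whose Supervisor flag differs from the first row's, so much of the invalid cross product is pruned before the explicit filter in the next step. If you want to sanity-check the intermediate frame (column names assumed from the suffixes above), something like this should work:
print(df3[['Name', 'Name_y', 'Name_z']].head())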
Then remove the duplicate orderings and extract the fields that are still needed:
df4 = df3[(df3['Analyst']) & (df3['Programmer_y']) & (df3['Supervisor_z'])]
df4.loc[:, ['Name', 'Name_y', 'Name_z', 'HourlyPay', 'HourlyPay_y', 'HourlyPay_z']]
  Name   Name_y  Name_z  HourlyPay  HourlyPay_y  HourlyPay_z
0 John   Mike    Lisa    15.0       18.0         16.75
1 Frank  Mike    Lisa    15.5       18.0         16.75
After that, you can take the sum of the pay columns across each row, drop the now-redundant individual pay columns, and rename the remaining columns to get a result like your uniqueDf.
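A minimal sketch of that final step (column names assumed from the snippet above) could look like:
result = df4.loc[:, ['Name', 'Name_y', 'Name_z', 'HourlyPay', 'HourlyPay_y', 'HourlyPay_z']].copy()
# sum the three pay columns into TotalHourlyPay, then drop them
result['TotalHourlyPay'] = result[['HourlyPay', 'HourlyPay_y', 'HourlyPay_z']].sum(axis=1)
result = result.drop(columns=['HourlyPay', 'HourlyPay_y', 'HourlyPay_z'])
# rename the name columns after their positions and sort highest pay first
result = result.rename(columns={'Name': 'Analyst', 'Name_y': 'Programmer', 'Name_z': 'Supervisor'})
result = result.sort_values('TotalHourlyPay', ascending=False).reset_index(drop=True)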