Pandas parsing strings for consecutive dates
After "chaining" lines and counting consecutive months from a CSV file.
I am currently reading a CSV file with 5 columns of interest (based on insurance policies):
CONTRACT_ID START-DATE END-DATE CANCEL_FLAG OLD_CON_ID
123456 2015-05-30 2016-05-30 0 8788
123457 2014-03-20 2015-03-20 0 12000
123458 2009-12-20 2010-12-20 0 NaN
...
I want to calculate the number of consecutive months that the chain of contracts goes.
Example: taking START-DATE
from a contract at the "front" of the chain (oldest contract) and END-DATE
from the end of the chain (new contract). The oldest contract is determined either before the termination of the contract in the chain, or the one that has no value OLD_CON_ID
.
Each line represents a contract and prev_Con_ID points to the previous contract ID. The desired outcome is the number of months that the contractual chains come back before the break (i.e. the customer did not have a contract for a certain period of time). If there is nothing in this column, then this is the first contract in this chain.
CANCEL_FLAG should also shorten the chain, because a value of 1 means that the contract has been canceled.
The current code counts the number of active contracts for each year by editing the data tactics:
df_contract = df_contract[
(df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) &
(df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) & (df_contract['CANCEL_FLAG'] == 0 )
]
df_contract = df_contract[df_contract['CANCEL_FLAG'] == 0
]
activecount = df_contract.count()
print activecount['CONTRACT_ID']
Here are the first 6 lines of code where I create the dataframes and change the datetime values:
file_name = 'EXAMPLENAME.csv' df = pd.read_csv(file_name) df_contract = pd.read_csv(file_name) df_CUSTOMERS = pd.read_csv(file_name) df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE']) df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
The ideal output is something like:
FIRST_CONTRACT_ID CHAIN_LENGTH CON_MONTHS
1234567 5 60
1500001 1 4
800 10 180
Then those data points will be displayed.
EDIT2: CSV file has changed, it could be simpler now. The question has been updated.
source to share
After a lot of trial and error, I got the job!
This finds the time difference between the first and last contracts in the chain and finds the length of the chain.
Not the cleanest code, but it works:
test = 'START_DATE'
df_short = df_policy[['OLD_CON_ID',test,'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID':'PID','CONTRACT_ID':'CID'},
inplace = True)
df_test = df_policy[['CONTRACT_ID','END_DATE']]
df_test.rename(columns={'CONTRACT_ID':'CID','END_DATE': 'PED'}, inplace = True)
df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID':'PPID','CID':'PID'}, inplace = True)
df_merge1 = pd.merge(df_short, df_copy2,
how='left',
on=['PID'])
df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace = True)
df_merge1.rename(columns={'START_DATE_x':'1_EFF','START_DATE_y':'2_EFF'}, inplace=True)
Copy, merge, fill and rename code is repeated for 5 merged data frames:
df_merged = pd.merge(df_merge5, df_test,
how='right',
on=['CID'])
df_merged['TOTAL_MONTHS'] = ((df_merged['PED'] - df_merged['6_EFF']
)/np.timedelta64(1,'M'))
df_merged4 = df_merged[
(df_merged['PED'] >= pd.to_datetime('2015-07-06'))
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(['PED','1_EFF','2_EFF','3_EFF','4_EFF','5_EFF'], axis=1).apply(lambda row: len(pd.unique(row)), axis=1) -3
Hope my code will be understood and will help someone in the future.
source to share