Pandas parsing strings for consecutive dates

After "chaining" lines and counting consecutive months from a CSV file.

I am currently reading a CSV file with 5 columns of interest (based on insurance policies):

CONTRACT_ID   START-DATE           END-DATE        CANCEL_FLAG    OLD_CON_ID
123456        2015-05-30           2016-05-30       0             8788
123457        2014-03-20           2015-03-20       0             12000
123458        2009-12-20           2010-12-20       0             NaN
...

      

I want to calculate the number of consecutive months that the chain of contracts goes.

Example: taking START-DATE

from a contract at the "front" of the chain (oldest contract) and END-DATE

from the end of the chain (new contract). The oldest contract is determined either before the termination of the contract in the chain, or the one that has no value OLD_CON_ID

.

Each line represents a contract and prev_Con_ID points to the previous contract ID. The desired outcome is the number of months that the contractual chains come back before the break (i.e. the customer did not have a contract for a certain period of time). If there is nothing in this column, then this is the first contract in this chain.

CANCEL_FLAG should also shorten the chain, because a value of 1 means that the contract has been canceled.

The current code counts the number of active contracts for each year by editing the data tactics:

df_contract = df_contract[
(df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) & 
(df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) & (df_contract['CANCEL_FLAG'] == 0 )
]
df_contract = df_contract[df_contract['CANCEL_FLAG'] == 0
]
activecount = df_contract.count()
print activecount['CONTRACT_ID']

      

Here are the first 6 lines of code where I create the dataframes and change the datetime values:

file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)

df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])

      

The ideal output is something like:

FIRST_CONTRACT_ID       CHAIN_LENGTH       CON_MONTHS
1234567                 5                  60
1500001                 1                  4
800                     10                 180

      

Then those data points will be displayed.

EDIT2: CSV file has changed, it could be simpler now. The question has been updated.

+1


source to share


2 answers


After a lot of trial and error, I got the job!

This finds the time difference between the first and last contracts in the chain and finds the length of the chain.

Not the cleanest code, but it works:

test = 'START_DATE'


df_short = df_policy[['OLD_CON_ID',test,'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID':'PID','CONTRACT_ID':'CID'}, 

inplace = True)
df_test = df_policy[['CONTRACT_ID','END_DATE']]
df_test.rename(columns={'CONTRACT_ID':'CID','END_DATE': 'PED'}, inplace = True)


df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID':'PPID','CID':'PID'}, inplace = True)

df_merge1 = pd.merge(df_short, df_copy2,
    how='left',
    on=['PID'])

df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace = True)
df_merge1.rename(columns={'START_DATE_x':'1_EFF','START_DATE_y':'2_EFF'}, inplace=True)

      



Copy, merge, fill and rename code is repeated for 5 merged data frames:

df_merged = pd.merge(df_merge5, df_test,
    how='right',
    on=['CID'])

df_merged['TOTAL_MONTHS'] = ((df_merged['PED'] - df_merged['6_EFF']
                             )/np.timedelta64(1,'M'))

df_merged4 = df_merged[
    (df_merged['PED'] >= pd.to_datetime('2015-07-06')) 
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(['PED','1_EFF','2_EFF','3_EFF','4_EFF','5_EFF'], axis=1).apply(lambda row: len(pd.unique(row)), axis=1) -3

      

Hope my code will be understood and will help someone in the future.

0


source


Not sure if I have met your requirements completely, but does something like this work ?:



df_contract['TOTAL_YEARS'] = (df_contract['END_DATE'] - df_contract['START_DATE']
                             )/np.timedelta64(1,'Y')

df_contract['TOTAL_YEARS'][(df['CANCEL_FLAG'] == 1) && (df['TOTAL_YEARS'] > 1)] = 1

      

0


source







All Articles