Pandas: using a combination of "Apply" and "regex" strings over 5 million lines

Question

Problem: I am trying to classify each row of my dataframe based on its description column. To do this, I want to extract keywords using a list of common words. First, I split each key phrase into words (i.e. "Food Store" becomes "Food" and "Store"). Then I check whether a description in my dataframe contains both "Food" and "Store". Unfortunately, the code I wrote is too slow. How can I optimize it to work with 5 million rows of data?

Sample data

Here are the first 30 rows of my dataframe:

   bank_report_id transaction_date  amount                                        description type_codes              category
0              14698       2016-04-26   -3.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
1              14698       2016-04-25 -110.00                                  ROGERSWL 1TIME _V                    Uncategorized
2              14698       2016-04-25  -10.50                                     SUBWAY # x6664               Restaurants/Dining
3              14698       2016-04-25   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
4              14698       2016-04-25  -73.75                                    TICKETMASTER CA                    Entertainment
5              14698       2016-04-25   -6.20                                     HAPPY ONE STOP                 Home Improvement
6              14698       2016-04-25   -7.74                                    BOOSTERJUICE-19               Restaurants/Dining
7              14698       2016-04-25  -28.49                                    LEISURE-FIRST O                    Uncategorized
8              14698       2016-04-22   -3.16                                    MCDONALD #400               Restaurants/Dining
9              14698       2016-04-22   -0.50  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
10             14698       2016-04-22  -10.50                                     SUBWAY # x6664               Restaurants/Dining
11             14698       2016-04-21  -19.87                                     TRAFALGAR ESSO                    Gasoline/Fuel
12             14698       2016-04-21   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
13             14698       2016-04-20   -3.76                                    MCDONALD #400               Restaurants/Dining
14             14698       2016-04-20   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
15             14698       2016-04-20  -40.00                                     TRAFALGAR ESSO                    Gasoline/Fuel
16             14698       2016-04-19  -10.07                                     TRAFALGAR ESSO                    Gasoline/Fuel
17             14698       2016-04-19   -5.21                                    TIM HORTONS #24               Restaurants/Dining
18             14698       2016-04-19   -3.50  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
19             14698       2016-04-18   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
20             14698       2016-04-18   -5.21                                    TIM HORTONS #24               Restaurants/Dining
21             14698       2016-04-18  -22.57                                     WAL-MART #3170              General Merchandise
22             14698       2016-04-18  -16.94                                    URBAN PLANET #1                   Clothing/Shoes
23             14698       2016-04-18  -12.95                                     LCBO/RAO #0545               Restaurants/Dining
24             14698       2016-04-18  -13.87                                     TRAFALGAR ESSO                    Gasoline/Fuel
25             14698       2016-04-18  -41.75                                     NON-TD ATM W/D             ATM/Cash Withdrawals
26             14698       2016-04-18   -4.19                                     SUBWAY # x6338               Restaurants/Dining
27             14698       2016-04-15   -0.50  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings
28             14698       2016-04-15  -35.06                                       UNION BURGER               Restaurants/Dining
29             14698       2016-04-15  -25.00                                     PIONEER STN #1                      Electronics

Here is a small subset of the word list:

['Exxon Mobil', 'Shell', 'Food Store', 'Pizza', 'Walgreens', 'Payday Loan', 'NSF', 'Lincoln', 'Apartment', 'Homes']

My attempt at a solution:

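import re

import pandas as pd

def get_matches(row):
    # Load the key phrases and split each one into lower-case words
    # (note: this runs again for every row passed to apply)
    keywords = pd.read_csv('Keywords.csv', encoding='ISO-8859-1')['description'].apply(lambda x: x.lower()).str.split(
        " ").tolist()

    # Split this row's description into lower-case words
    split_description = [d.lower() for d in row['description'].split(" ")]

    thematches = []
    for group in keywords:
        # For each word of the phrase: does it occur in any word of the description?
        matches = [any([bool(re.search(y, x)) for x in split_description]) for y in group]

        # Keep the phrase only if every one of its words was found
        if all(matches):
            thematches.append(" ".join(group))

    if len(thematches) > 0:
        return thematches
    else:
        return "NA"

df['match'] = df.apply(get_matches, axis=1)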

Desired output:

    bank_report_id transaction_date  amount                                        description type_codes              category              match
0            14698       2016-04-26   -3.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
1            14698       2016-04-25 -110.00                                  ROGERSWL 1TIME _V                    Uncategorized           [rogers]
2            14698       2016-04-25  -10.50                                     SUBWAY # x6664               Restaurants/Dining           [subway]
3            14698       2016-04-25   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
4            14698       2016-04-25  -73.75                                    TICKETMASTER CA                    Entertainment    [ticket master]
5            14698       2016-04-25   -6.20                                     HAPPY ONE STOP                 Home Improvement                 NA
6            14698       2016-04-25   -7.74                                    BOOSTERJUICE-19               Restaurants/Dining            [juice]
7            14698       2016-04-25  -28.49                                    LEISURE-FIRST O                    Uncategorized                 NA
8            14698       2016-04-22   -3.16                                    MCDONALD #400               Restaurants/Dining       [mcdonald's]
9            14698       2016-04-22   -0.50  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
10           14698       2016-04-22  -10.50                                     SUBWAY # x6664               Restaurants/Dining           [subway]
11           14698       2016-04-21  -19.87                                     TRAFALGAR ESSO                    Gasoline/Fuel             [esso]
12           14698       2016-04-21   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
13           14698       2016-04-20   -3.76                                    MCDONALD #400               Restaurants/Dining       [mcdonald's]
14           14698       2016-04-20   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
15           14698       2016-04-20  -40.00                                     TRAFALGAR ESSO                    Gasoline/Fuel             [esso]
16           14698       2016-04-19  -10.07                                     TRAFALGAR ESSO                    Gasoline/Fuel             [esso]
17           14698       2016-04-19   -5.21                                    TIM HORTONS #24               Restaurants/Dining  [tim hortons, rt]
18           14698       2016-04-19   -3.50  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
19           14698       2016-04-18   -1.00  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
20           14698       2016-04-18   -5.21                                    TIM HORTONS #24               Restaurants/Dining  [tim hortons, rt]
21           14698       2016-04-18  -22.57                                     WAL-MART #3170              General Merchandise               [rt]
22           14698       2016-04-18  -16.94                                    URBAN PLANET #1                   Clothing/Shoes     [urban planet]
23           14698       2016-04-18  -12.95                                     LCBO/RAO #0545               Restaurants/Dining                 NA
24           14698       2016-04-18  -13.87                                     TRAFALGAR ESSO                    Gasoline/Fuel             [esso]
25           14698       2016-04-18  -41.75                                     NON-TD ATM W/D             ATM/Cash Withdrawals                 NA
26           14698       2016-04-18   -4.19                                     SUBWAY # x6338               Restaurants/Dining           [subway]
27           14698       2016-04-15   -0.50  Simply Save TD EVERY DAY SAVINGS ACCOUNT xxxxx...                          Savings      [simply save]
28           14698       2016-04-15  -35.06                                       UNION BURGER               Restaurants/Dining           [burger]
29           14698       2016-04-15  -25.00                                     PIONEER STN #1                      Electronics          [pioneer]

python pandas regex classification apply

Riley Hun 10 jul. 17 at 14:57

2 answers

Deena · Answer 1 · 2017-07-10T19:06:28+0000

I would do two things:

1. Since you are only using the 'description' column, try exporting it as a list with df.description.tolist(). Process the strings in that list, then use pd.concat to attach your results. I believe this can eliminate the pandas overhead. NumPy arrays are known to be even more optimized, but I'm not entirely sure that really applies to string operations. You can try it, though.
2. Parallelize your code. joblib offers an excellent, user-friendly interface (https://pythonhosted.org/joblib/parallel.html).
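The two suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions: the KEYWORDS list is a hypothetical stand-in for the contents of Keywords.csv, the function names are made up for this example, and keywords are assumed to be literal text, so a plain substring test replaces re.search. The key phrases are split once, outside the per-row work, and the results are assigned back as a column rather than with pd.concat.

```python
# Hypothetical stand-in for the phrases stored in Keywords.csv.
KEYWORDS = ["Simply Save", "Esso", "Tim Hortons", "Food Store"]

# Split every phrase into lower-case words ONCE, not once per row.
keyword_groups = [phrase.lower().split() for phrase in KEYWORDS]


def match_description(description, groups):
    """Return the phrases whose words all occur in the description, else "NA"."""
    text = description.lower()
    # Assumption: keywords are literal text, so `word in text` is enough.
    hits = [" ".join(words) for words in groups if all(w in text for w in words)]
    return hits if hits else "NA"


def match_all(descriptions, groups):
    """Serial version: process a plain Python list of strings."""
    return [match_description(d, groups) for d in descriptions]


def match_all_parallel(descriptions, groups, n_jobs=4, chunk_size=100_000):
    """Optional parallel version; needs the third-party joblib package."""
    from joblib import Parallel, delayed  # lazy import: serial path has no dependency

    chunks = [descriptions[i:i + chunk_size]
              for i in range(0, len(descriptions), chunk_size)]
    results = Parallel(n_jobs=n_jobs)(
        delayed(match_all)(chunk, groups) for chunk in chunks
    )
    # Flatten the per-chunk results back into one list in row order.
    return [m for chunk in results for m in chunk]


# In practice: descriptions = df["description"].tolist()
descriptions = ["TRAFALGAR ESSO", "TIM HORTONS #24", "HAPPY ONE STOP"]
matches = match_all(descriptions, keyword_groups)
# The list is in row order, so it can be assigned back: df["match"] = matches
```

Whether the parallel version pays off depends on the chunk size and on how expensive the per-row work is after the keyword loading has been moved out of the loop, so it is worth timing the serial version first.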

Rayhane mama · Answer 2 · 2017-07-10T19:42:23+0000

You can try something like this:

df['match'] = df['description'].apply(lambda x: [l for l in match_list if l.lower() in x.lower()])

It is usually faster to use pandas' map or apply with a list comprehension than to loop over the rows explicitly.

If you don't like the [] in the places where there are no matches, you can change them to np.nan or whatever:

df['match'] = df.match.apply(lambda y: np.nan if len(y)==0 else y)
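A possible refinement of the one-liner above, offered as a sketch rather than as part of the original answer: lower-case the keyword list once instead of once per row, and cache the result per distinct description, since the sample data repeats many descriptions (e.g. "TRAFALGAR ESSO"). The match_list contents and the find_matches name are illustrative assumptions; the caching uses functools.lru_cache from the standard library.

```python
from functools import lru_cache

# Hypothetical sample of the keyword list used in the answer.
match_list = ["Simply Save", "Subway", "Esso", "Tim Hortons", "rt"]

# Lower-case the keywords once, not once per row.
lowered = tuple(k.lower() for k in match_list)


@lru_cache(maxsize=None)
def find_matches(description):
    """Return the keywords contained in the description (substring match)."""
    text = description.lower()
    # A tuple is returned so that the cached value cannot be mutated by callers.
    return tuple(k for k in lowered if k in text)


descriptions = [
    "SUBWAY # x6664",
    "TIM HORTONS #24",
    "SUBWAY # x6664",  # repeated description: served from the cache
    "NON-TD ATM W/D",
]
matches = [list(find_matches(d)) for d in descriptions]

# With a dataframe, the equivalent call would be:
# df["match"] = df["description"].map(lambda d: list(find_matches(d)))
```

Rows without a match come back as empty lists here, so the np.nan replacement shown above still applies unchanged.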

Output (only the relevant column):

0         [simply save]
1              [rogers]
2              [subway]
3         [simply save]
4                   NaN
5                   NaN
6               [juice]
7                   NaN
8          [mcdonald's]
9         [simply save]
10             [subway]
11               [esso]
12        [simply save]
13         [mcdonald's]
14        [simply save]
15               [esso]
16               [esso]
17    [tim hortons, rt]
18        [simply save]
19        [simply save]
20    [tim hortons, rt]
21                 [rt]
22       [urban planet]
23                  NaN
24               [esso]
25                  NaN
26             [subway]
27        [simply save]
28             [burger]
29            [pioneer]

Hope this was helpful.
