Python fuzzywuzzy error: expected string or buffer

Question

I am using fuzzywuzzy to find the closest matches for company names in a CSV file. I am comparing manually matched strings against unmatched strings in the hope of finding some useful close matches, but fuzzywuzzy raises an "expected string or buffer" error. My code:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]  
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()


    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

This creates the following error:

Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

It has to do with how pandas imports the file with the specified encoding, which I added to prevent UnicodeDecodeErrors but which had the knock-on effect of causing this error. I tried forcing the value to a string with str(sssf), but that doesn't work.

I have singled out the line that causes the error: #N/A,,,,,, (line 29 of the data below). I assumed the # character was causing the error, but oddly enough it is not; the A character appears to be the problem, because the file runs once it is deleted. What is strange to me is that the line two lines below, N/A, parses just fine, whereas line 29 still fails to parse when I remove the # character, even though the field then looks just like the field below it.

sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,

string python-2.7 fuzzywuzzy

woodbine 03 June 15 at 22:26

2 answers

Your variable sss_true contains:

[
    u'N21 LTD.',
    u'N2 CHECK LIMITED',
    u'N2 CHECK LTD',
    u'N2 GROUP LTD',
    u'N2 VISUAL COMMUNICATIONS LTD',
    u'N3 DISPLAY GRAPHICS LTD',
    u'N3O LIMITED',
    u'N9 DESIGN',
    nan              # <---- note this
]

Once you get rid of the not-a-number (NaN) value, everything works as expected.
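To see why a single NaN among the choices is enough to break the call, note that the traceback fails inside processor(choice), where fuzzywuzzy runs a regex substitution over each choice. Below is a minimal sketch of that failure mode. It is written for Python 3 (the question uses Python 2.7), and the pattern is a simplified stand-in chosen for illustration, not fuzzywuzzy's actual regex:

```python
import re

# Simplified stand-in for the preprocessing regex (an assumption for this sketch).
pattern = re.compile(r"\W")

nan = float("nan")  # what a missing CSV field turns into after read_csv

# A real string passes through the substitution without trouble.
cleaned = pattern.sub(" ", "N9 DESIGN")

# A float NaN is not a string, so the regex engine rejects it.
try:
    pattern.sub(" ", nan)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(cleaned)
print(error_message)
```

On Python 3 the message reads "expected string or bytes-like object" rather than "expected string or buffer", but the cause is the same. This is also why wrapping only the query in str(sssf) does not help: the value that fails here is one of the choices in sss_true.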

dlask June 15. '15 at 6:29

J Richard Snape · Accepted Answer · 2015-06-15T10:40:00+0000

pandas.read_csv parses the string 'N/A' (and likewise '#N/A') as not-a-number (NaN) by default.

In your case, this means that you end up with a float NaN value, not a string. In your sample data, this happens in two places:

The third line from the bottom (the line highlighted in the question) results in sss_false[-3] being NaN.

The second line from the bottom results in sss_true[-1] being NaN.
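The default behaviour can be checked on a small inline sample. This is a sketch for Python 3 with a reasonably recent pandas; CSV_TEXT is a cut-down, three-column stand-in for usm_clean.csv, not the real file:

```python
import io

import pandas as pd

# Cut-down stand-in for usm_clean.csv (hypothetical sample, three columns only).
CSV_TEXT = """sss,sid,match_manual
N9 DESIGN,07342091_com,true
#N/A,,
N A,,
N/A,red_general_xtr,true
"""

# Default settings: '#N/A' and 'N/A' are both on pandas' built-in NA list.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
names_default = df_default["sss"].tolist()

# Count how many company names were turned into missing values.
n_missing = int(df_default["sss"].isna().sum())

print(names_default)
print(n_missing)
```

'N A' (with a space) is not on the default NA list, so it survives as an ordinary string, which matches the behaviour described in the question.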

Option 1

If you want to parse the string 'N/A' as a string instead of NaN, the way to do it is to replace

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")

with

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')

The meaning of these additional options is described in the pandas docs:

na_values : list-like or dict, default None

Additional strings to recognize as NA/NaN. If a dict is passed, specific per-column NA values.

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False, the default NaN values are overridden; otherwise they are appended to.

So the above modification tells pandas to recognize only an empty string as NA and to discard the defaults, which include 'N/A' and '#N/A'.
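A quick way to confirm the effect of these two options is to read a small inline sample with them set. This is a sketch for Python 3; CSV_TEXT is a hypothetical cut-down stand-in for usm_clean.csv, and na_values is passed as a one-element list, which should behave the same as the bare '' above:

```python
import io

import pandas as pd

# Cut-down stand-in for usm_clean.csv (hypothetical sample, three columns only).
CSV_TEXT = """sss,sid,match_manual
N9 DESIGN,07342091_com,true
#N/A,,
N A,,
N/A,red_general_xtr,true
"""

# Only the empty string counts as missing; the built-in NA list is switched off.
df = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False, na_values=[""])

names = df["sss"].tolist()                # every company name stays a string
n_empty_sid = int(df["sid"].isna().sum())  # empty fields are still treated as NaN

print(names)
print(n_empty_sid)
```

Because empty fields still become NaN, the isnull()/notnull() split on match_manual in the question keeps working as before.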

Option 2

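If you want to discard the rows with 'N/A' in the first column, you need to remove the NaN members from sss_true and sss_false. One way to do it:

# Keep only real strings; NaN is a float, so it is filtered out.
# basestring covers both str and unicode on Python 2 (use str on Python 3).
sss_true = [x for x in sss_true if isinstance(x, basestring)]
sss_false = [x for x in sss_false if isinstance(x, basestring)]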
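An alternative that is not part of the answer above is to drop the affected rows from the DataFrame before splitting it, using DataFrame.dropna with the subset argument. The sketch below is for Python 3, and CSV_TEXT is a hypothetical cut-down stand-in for usm_clean.csv:

```python
import io

import pandas as pd

# Cut-down stand-in for usm_clean.csv (hypothetical sample, three columns only).
CSV_TEXT = """sss,sid,match_manual
N9 DESIGN,07342091_com,true
#N/A,,
N A,,
N/A,red_general_xtr,true
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Drop every row whose company name was parsed as NaN.
df = df.dropna(subset=["sss"])

# Same split as in the question, now on NaN-free names.
df_false = df[df["match_manual"].isnull()]
df_true = df[df["match_manual"].notnull()]
sss_false = df_false["sss"].tolist()
sss_true = df_true["sss"].tolist()

print(sss_true)
print(sss_false)
```

Filtering at the DataFrame level keeps the other columns of each row aligned with the names that remain, which may matter if you later want to look up the sid of the best match.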
