Python fuzzywuzzy error string or buffer awaiting

I am using fuzzywuzzy to find closest matches in csv for company names. I am comparing manually matched strings to unmatched strings in the hopes of finding some useful closeness matches, however, I am getting a string or buffer error in fuzzywuzzy. My code:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]  
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()


    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

      

This creates the following error:

Traceback (most recent call last):
File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
mmm = process.extractOne(sssf, sss_true) # find best choice
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
best_list = extract(query, choices, processor, scorer, limit=1)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
processed = processor(choice)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

      

It has to do with the effects of imports in pandas with the specified encoding, which I added to prevent UnicodeDecodeErrors

, but had a knock effect causing this error. I tried to force the object to use str(sssf)

, but that doesn't work.

So I have highlighted the line causing the error here: #N/A,,,,,,

(line 29 in the code below). I assumed the error #

was causing the error, but oddly enough, the A

char issue wasn't causing it , because the file is running when it's deleted. What is strange to me is that the line with two lines below N/A

parses just fine, however line 29 will not be parsed when I remove the character #

, even if that field looks like the field below.

sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,

      

+3


source to share


2 answers


pandas.read_csv

Parses a string 'N/A'

as not a number by default ( NaN

)

In your case, this means that you end up with a value NaN

, not a string. In your sample data, this happens in two places

The third line from the bottom (the line highlighted in the question) results in sss_false[-3] == nan

The last line results in sss_true[-1] == nan

.


Option 1

If you want to parse a string 'N/A'

as a string instead NaN

, the way to do it is to replace

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")

      

from



df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')

      

The meaning of these additional options is described in the pandas docs .

na_values : list-like or dict, the default None

Additional lines to recognize as NA / NaN. If dict is accepted, values ​​for each NA column

keep_default_na : bool, default True

If na_values ​​are specified and keep_default_na is False, the default NaN values ​​are overridden, otherwise they are appended to

So the above modification tells pandas to recognize an empty string as NA and discard the default 'N/A'


Option 2

If you want to flush the rows with 'N/A'

in the first column, you need to remove the members NaN

from sss_true

and sss_false

. one way to do it:

sss_true = [x for x in sss_true if type(x) != str]
sss_false = [x for x in sss_false if type(x) != str]

      

+2


source


Your variable sss_true

contains:

[
    u'N21 LTD.',
    u'N2 CHECK LIMITED',
    u'N2 CHECK LTD',
    u'N2 GROUP LTD',
    u'N2 VISUAL COMMUNICATIONS LTD',
    u'N3 DISPLAY GRAPHICS LTD',
    u'N3O LIMITED',
    u'N9 DESIGN',
    nan              # <---- note this
]

      



Once you get rid of the not-a-number value , everything starts working as expected.

0


source







All Articles