Python fuzzywuzzy error: expected string or buffer
I am using fuzzywuzzy to find the closest matches for company names in a CSV file. I am comparing manually matched strings to unmatched strings in the hope of finding some useful close matches; however, I am getting an "expected string or buffer" error from fuzzywuzzy. My code:
from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding="ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()
    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true)  # find best choice
        print sssf + str(tuple(mmm))
This creates the following error:
Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer
It has to do with the effects of importing in pandas with the specified encoding, which I added to prevent UnicodeDecodeErrors but which had a knock-on effect causing this error. I tried to force the object to a string with str(sssf), but that doesn't work.
I have pinpointed the line causing the error: #N/A,,,,,, (line 29 in the data below). I assumed the # character was causing the error, but oddly enough that does not seem to be it, because the file only runs when the line is deleted. What is strange to me is that the line two below it, N/A, parses just fine, whereas line 29 will not parse even when I remove the # character and the field looks just like the one below.
sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,
pandas.read_csv parses strings such as 'N/A' and '#N/A' as not-a-number (NaN) by default.
In your case, this means that you end up with a NaN value (a float), not a string. In your sample data, this happens in two places:
The fourth line from the bottom, #N/A (the line highlighted in the question), results in NaN at sss_false[-3].
The second line from the bottom, N/A, results in NaN at sss_true[-1].
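To see why a NaN in either list produces this particular TypeError, here is a minimal sketch using only the standard library. It assumes, based on the traceback, that the default processor hands each choice to a compiled regex's sub(); the pattern and the clean() helper below are stand-ins for illustration, not fuzzywuzzy's actual code. It is written so that it also runs on Python 3, where the message reads "expected string or bytes-like object".

```python
import re

# Stand-in for the regex applied to each choice (assumed pattern,
# used only for illustration).
non_alnum = re.compile(r"\W")

nan = float("nan")  # a missing value from pandas shows up as a float NaN


def clean(value):
    """Replace non-alphanumeric characters with spaces (hypothetical helper)."""
    return non_alnum.sub(u" ", value)


cleaned = clean(u"N2E & BACK LTD")  # a real string is processed normally

try:
    clean(nan)
    error = None
except TypeError as exc:
    error = exc  # a float is not a string, so the regex call rejects it

print(repr(cleaned))
print(type(nan).__name__, repr(error))
```

Note that the traceback in the question fails inside processor(choice), that is, on an element of sss_true, which may be why wrapping only the query in str(sssf) did not help.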
Option 1
If you want the string 'N/A' to be parsed as a string instead of NaN, replace
df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
with
df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')
The meaning of these additional options is described in the pandas docs:
na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False, the default NaN values are overridden, otherwise they are appended to
So the above modification tells pandas to recognize only the empty string as NA and to discard the defaults, which means 'N/A' is kept as a string.
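As a rough check of Option 1, the sketch below reads a few rows adapted from the question's sample data from an in-memory string instead of usm_clean.csv (the shortened three-column layout is an assumption made for brevity). It compares the default parsing with the keep_default_na=False variant, passing na_values as a one-element list.

```python
import io

import pandas as pd

# Three rows adapted from the sample data in the question.
csv_text = u"""sss,sid,match_manual
N9 DESIGN,07342091_com,true
#N/A,,
N/A,red_general_xtr,true
"""

# Default behaviour: 'N/A' and '#N/A' are treated as missing values.
df_default = pd.read_csv(io.StringIO(csv_text))

# Option 1: only the empty string counts as missing.
df_option1 = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=[""]
)

print(df_default["sss"].tolist())
print(df_option1["sss"].tolist())
print(df_option1["match_manual"].isnull().tolist())
```

With this change the empty match_manual field is still detected as missing, so the isnull()/notnull() split in the question keeps working.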
Option 2
If you want to drop the rows with 'N/A' in the first column, you need to remove the NaN members from sss_true and sss_false. One way to do it:
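# NaN is a float, so keeping only the non-float entries drops it
# (works for both str and unicode names)
sss_true = [x for x in sss_true if not isinstance(x, float)]
sss_false = [x for x in sss_false if not isinstance(x, float)]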
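An alternative for Option 2, sketched here on the assumption that dropping the rows at the DataFrame level is acceptable, is to call dropna on the sss column before splitting into matched and unmatched names. The in-memory CSV is a shortened stand-in for usm_clean.csv.

```python
import io

import pandas as pd

# Four rows adapted from the sample data in the question.
csv_text = u"""sss,sid,match_manual
N9 DESIGN,07342091_com,true
#N/A,,
N A,,
N/A,red_general_xtr,true
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop every row whose name was parsed as NaN ('#N/A' and 'N/A' here).
df = df.dropna(subset=["sss"])

sss_false = df[df["match_manual"].isnull()]["sss"].tolist()
sss_true = df[df["match_manual"].notnull()]["sss"].tolist()

print(sss_false)
print(sss_true)
```

Both lists then contain only strings, so they can be passed to process.extractOne as in the question.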
Your variable sss_true contains:
[
u'N21 LTD.',
u'N2 CHECK LIMITED',
u'N2 CHECK LTD',
u'N2 GROUP LTD',
u'N2 VISUAL COMMUNICATIONS LTD',
u'N3 DISPLAY GRAPHICS LTD',
u'N3O LIMITED',
u'N9 DESIGN',
nan # <---- note this
]
Once you get rid of the not-a-number value, everything works as expected.