Python 3.5 - Getting Counter to report zero-count keywords

I am doing textual analysis of texts in which, due to PDF-to-txt conversion errors, two words are sometimes run together into one. So instead of matching whole words, I want to match substrings.

For example, I have a line:

mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'


And I'm looking for

key_words=['loss', 'debt', 'debts', 'elephant']


The output should look like this:

Filename Debt Debts Loss Elephant
mystring  2    1     1    0


The code below works well for me, except for a few glitches: 1) it does not report zero-frequency words (so "elephant" will not appear in the output); 2) the output depends on the order of the words in key_words (i.e. sometimes I get one count each for "debt" and "debts", and sometimes it reports 2 for "debt" while "debts" is not reported at all). I could live with the second point if I could print the variable names in the dataset, but I am not sure how to do that.

Below is the relevant code. Thank you! PS. Needless to say, this isn't the most elegant piece of code, but I'm still a beginner.

import collections
import csv
import glob
import re
from string import punctuation

bad = set(['debts', 'debt'])

csvfile = open("freq_10k_test.csv", "w", newline='', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)
# Compile the pattern once, outside the loops.
pattern = re.compile("|".join(bad), flags=re.IGNORECASE)

for filename in glob.glob('*.txt'):
    with open(filename, encoding='utf-8', errors='ignore') as f:
        file_name = [filename]
        new_review = [f.read()]
        freq_all = []

        for review in new_review:
            review_processed = review.lower()
            for p in punctuation:
                review_processed = review_processed.replace(p, '')
            freq_iter = collections.Counter(pattern.findall(review_processed))

            frequency = [value for (key, value) in sorted(freq_iter.items())]
            freq_all.append(frequency)

    fulldata = [[file_name[i]] + freq for i, freq in enumerate(freq_all)]
    writer.writerows(fulldata)
    csvfile.flush()
    csvfile.flush()




2 answers


You can simply pre-initialize the counter, something like this:

freq_iter = collections.Counter()
freq_iter.update({x:0 for x in bad})
freq_iter.update(pattern.findall(review_processed))   
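Folded into the question's setup, the pre-initialization might look like this (a standalone sketch with an inline sample string instead of the .txt files):

```python
import collections
import re

bad = set(['debts', 'debt'])
pattern = re.compile("|".join(bad), flags=re.IGNORECASE)

review_processed = 'the loss of income made us go into debt'

freq_iter = collections.Counter()
freq_iter.update({x: 0 for x in bad})               # every keyword starts at 0
freq_iter.update(pattern.findall(review_processed))

print(sorted(freq_iter.items()))   # [('debt', 1), ('debts', 0)]
```

Because every keyword is seeded with 0, sorted(freq_iter.items()) now always has one entry per keyword, so the CSV columns line up across files.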


The best part about Counter is that you don't need to pre-initialize it - you can just do c = Counter(); c['key'] += 1 - but nothing prevents you from pre-initializing some values to 0 if you want.

For the debt / debts issue, this is simply not a well-defined problem. What do you want the code to do in that case? If you want it to match the longest matching pattern, sort the list from longest to shortest first; that will resolve it. If you want both to be reported, you may need to do multiple searches and keep all the results.
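A minimal sketch of the longest-first approach, sorting the alternatives by length before building the pattern (the sample text is made up):

```python
import re

bad = ['debt', 'debts']

# Put longer alternatives first so 'debts' is tried before 'debt'.
pattern = re.compile("|".join(sorted(bad, key=len, reverse=True)),
                     flags=re.IGNORECASE)

print(pattern.findall('debtor debts my debt'))   # ['debt', 'debts', 'debt']
```

With the longer alternative first, 'debts' is matched as a whole instead of being swallowed as 'debt' plus a stray 's'.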

Updated to add some information on why it can't find debts: it has more to do with how regex matching works than anything else. In an alternation like debt|debts, re.findall tries the branches left to right, so debt matches first; and once it finds a match, that text is not included in subsequent matches:

In [2]: re.findall('(debt|debts)', 'debtor debts my debt')
Out[2]: ['debt', 'debt', 'debt']




If you really want to find all instances of each word, you need to do this separately:

In [3]: re.findall('debt', 'debtor debts my debt')
Out[3]: ['debt', 'debt', 'debt']

In [4]: re.findall('debts', 'debtor debts my debt')
Out[4]: ['debts']
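Those separate searches can be combined into one Counter, which keeps overlapping keywords from shadowing each other (a standalone sketch on the same sample text):

```python
import re
from collections import Counter

text = 'debtor debts my debt'
key_words = ['debt', 'debts']

# One findall per keyword, so 'debt' and 'debts' are counted independently.
counts = Counter({kw: len(re.findall(kw, text)) for kw in key_words})

print(counts['debt'], counts['debts'])   # 3 1
```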


However, perhaps what you are really looking for is whole words. In that case, use the \b anchor to require a word boundary:

In [13]: re.findall(r'\bdebt\b', 'debtor debts my debt')
Out[13]: ['debt']

In [14]: re.findall(r'(\b(?:debt|debts)\b)', 'debtor debts my debt')
Out[14]: ['debts', 'debt']


I don't know if this is what you want or not... in this case it was able to distinguish debt from debts correctly, but it missed debtor, because that is only a substring match and we asked it not to match those.

Depending on your use case, you might want to look into stemming the text... I believe there is a stemmer in nltk that is pretty simple (I have only used it once, so I won't try to post an example; the question "Combining text stemming and removal of punctuation in NLTK and scikit-learn" might be helpful). It should reduce debt, debts and debtor all to the same root word debt, and do similar things for other words. This may or may not be helpful; I don't know what you are doing with it.
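As a rough illustration of the idea, here is a toy suffix-stripper - not a real stemmer, and for real work you would use something like nltk's PorterStemmer:

```python
def toy_stem(word):
    """Strip a few hard-coded suffixes; a stand-in for a real stemmer."""
    for suffix in ('or', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['debt', 'debts', 'debtor']])
# ['debt', 'debt', 'debt']
```

Once all three variants map to the same stem, a single Counter key covers them.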



For simple substring counts:

mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
for kw in key_words:
  count = mystring.count(kw)
  print('%s %s' % (kw, count))

Or for words:

from collections import defaultdict
from string import punctuation

words = mystring.split()   # keep duplicates; a set would cap every count at 1
key_words=['loss', 'debt', 'debts', 'elephant']
d = defaultdict(int)
for word in words:
  d[word.strip(punctuation)] += 1   # so 'debts.' counts as 'debts'

for kw in key_words:
  print('%s %s' % (kw, d[kw]))
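The same whole-word tally can be done with collections.Counter, which also reports 0 for missing keywords (using the question's sample string; note that whole-word matching will not find 'loss' inside 'lossof'):

```python
from collections import Counter
from string import punctuation

mystring = ('The lossof our income made us go into debt but this is '
            'not too bad as we like some debts.')
key_words = ['loss', 'debt', 'debts', 'elephant']

# Strip surrounding punctuation so 'debts.' counts as 'debts'.
counts = Counter(w.strip(punctuation) for w in mystring.lower().split())
for kw in key_words:
    print('%s %s' % (kw, counts[kw]))   # missing keys report 0
```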
