Is there a faster way to loop through the set and replace MWEs in the sentence? - Python

The task is to chunk multi-word expressions (MWEs).

Given an MWE dictionary, I need to hyphenate the MWEs found in the input sentences, for example:

**Input:** i have got an ace of diamonds in my wet suit .
**Output:** i have got an ace-of-diamonds in my wet-suit .

      

I currently iterate over the sorted dictionary, check whether each MWE appears in the sentence, and replace it wherever it occurs. But that wastes a lot of iterations.

Is there a better way to do this? One idea is to generate all possible n-grams of the sentence and intersect them with the dictionary, i.e. chunker2():

import re, time, codecs

mwe_list = set(i.strip() for i in codecs.open( \
            "wn-mwe-en.dic", "r", "utf8").readlines())

def chunker(sentence):
  for item in mwe_list:
    if item in sentence or item.replace("-", " ") in sentence:
      #print item
      mwe_item = '-'.join(item.split(" "))
      r = re.compile(re.escape(mwe_item).replace('\\-', '[- ]'))
      sentence = re.sub(r, mwe_item, sentence)
  return sentence

def chunker2(sentence):
    nodes = []
    tokens = sentence.split(" ")
    for i in range(0, len(tokens)):
        for j in range(i, len(tokens) + 1):
            nodes.append(" ".join(tokens[i:j]))
    n = sorted(set(i for i in nodes if i and len(i.split(" ")) > 1))

    intersect = mwe_list.intersection(n)

    for i in intersect:
        print i
        sentence = sentence.replace(i, i.replace(" ", "-"))

    return sentence

s = "i have got an ace of diamonds in my wet suit ."

time.clock()
print chunker(s)
print time.clock()

time.clock()
print chunker2(s)
print time.clock()

      



2 answers


I would try to do it like this:

  • For each sentence, create the set of its n-grams up to a specific length (that of the longest MWE in your list).
  • Now it's a simple mwe_ngrams.intersection(sentence_ngrams), and then find/replace them.

You won't have to waste time iterating over all the elements of the original set.
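The n-gram intersection approach above can be sketched like this (Python 3 syntax; the two dictionary entries are stand-ins for the contents of wn-mwe-en.dic):

```python
# Stand-in dictionary; in practice it would be loaded from wn-mwe-en.dic.
mwe_list = {"ace of diamonds", "wet suit"}
max_len = max(len(mwe.split()) for mwe in mwe_list)  # longest MWE, in words

def ngrams_up_to(tokens, n):
    """All n-grams of length 2..n as space-joined strings."""
    return {
        " ".join(tokens[i:i + k])
        for k in range(2, n + 1)
        for i in range(len(tokens) - k + 1)
    }

def chunk(sentence):
    tokens = sentence.split()
    # Only n-grams no longer than the longest dictionary entry are generated,
    # so the intersection stays small.
    for mwe in mwe_list.intersection(ngrams_up_to(tokens, max_len)):
        sentence = sentence.replace(mwe, mwe.replace(" ", "-"))
    return sentence

print(chunk("i have got an ace of diamonds in my wet suit ."))
# i have got an ace-of-diamonds in my wet-suit .
```

Capping n-gram length at the longest dictionary entry is what keeps the sentence-side set small regardless of how large the MWE dictionary is.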




Here's a slightly faster version of chunker2:

def chunker3(sentence):
    tokens = sentence.split(' ')
    len_tokens = len(tokens)
    nodes = set()

    for i in xrange(0, len_tokens):
        for j in xrange(i, len_tokens + 1):
            chunks = tokens[i:j]

            if len(chunks) > 1:
                nodes.add(' '.join(chunks))

    intersect = mwe_list.intersection(nodes)

    for i in intersect:
        print i
        sentence = sentence.replace(i, i.replace(' ', '-'))

    return sentence

      



First, a 2x improvement: since you are replacing MWEs with hyphenated versions anyway, you can preprocess the dictionary (wn-mwe-en.dic) to strip all hyphens from the MWEs in the set, eliminating one of the two string comparisons per entry. If you allow hyphens inside a sentence, you will also have to preprocess the sentence the same way, presumably online, for a minor penalty. This should roughly halve the execution time.

Next, a slight improvement: immutable tuples are generally faster to iterate over than a set or list (which are mutable, so the iterator must account for elements moving in memory at each step). The set() conversion eliminates duplicates, as you intended; converting the result to a tuple then keeps it fixed in memory, allowing low-level iteration optimizations in the Python interpreter and its compiled libraries.

Finally, you should probably split both the sentence and the MWEs into words or tokens before doing the comparisons; this would cut the number of character comparisons by roughly the average word length (about 4x if your words average 4 characters). You could also nest another loop that uses the first word of each MWE as an anchor for all the MWEs sharing that first word, which further shortens the string comparisons needed. But I'll leave that lion's share to experiments on real data. And depending on interpreter vs. compiled-library efficiency, doing all this nested-loop splitting at the Python level could actually slow things down.
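The first-word anchor idea can be sketched as follows (Python 3 syntax; the dictionary entries are stand-ins, and the matching is greedy, longest candidate first):

```python
from collections import defaultdict

# Stand-in dictionary; in practice it would be loaded from wn-mwe-en.dic.
mwe_list = {"ace of diamonds", "wet suit", "wet blanket"}

# Index MWEs by their first word, so each sentence token triggers
# comparisons only against the entries sharing that anchor word.
index = defaultdict(list)
for mwe in mwe_list:
    words = mwe.split()
    index[words[0]].append(words)

def chunk(sentence):
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the candidate MWEs anchored at this token, longest first.
        for cand in sorted(index.get(tokens[i], []), key=len, reverse=True):
            if tokens[i:i + len(cand)] == cand:
                out.append("-".join(cand))
                i += len(cand)
                break
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(chunk("i have got an ace of diamonds in my wet suit ."))
# i have got an ace-of-diamonds in my wet-suit .
```

Tokens that never occur as the first word of any MWE cost only one dictionary lookup, which is where the savings over scanning the whole set come from.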



So here are the first two simple "sure" bets. This should be about 2x faster despite the preprocessing, provided your sentences are not very short.

mwe_list = set(i.strip() for i in codecs.open("wn-mwe-en.dic", "r", "utf8").readlines())
mwe_list = tuple(mwe.replace('-', ' ').strip() for mwe in mwe_list)
sentence = sentence.replace('-', ' ').strip()

def chunker(sentence):
  for item in mwe_list:
    if item in sentence:
    ...

      

I couldn't find the .dic file on my system, or I would have tested against it.
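A complete, runnable version of that preprocessed loop might look like this (Python 3 syntax; the dictionary entries are stand-ins, since the .dic file isn't available here):

```python
# Stand-in dictionary; in practice these lines come from wn-mwe-en.dic.
raw_mwes = {"ace-of-diamonds", "wet-suit"}

# Preprocess once: store every MWE space-separated, so the loop needs
# only a single substring test per entry instead of two.
mwe_list = tuple(mwe.replace("-", " ").strip() for mwe in raw_mwes)

def chunker(sentence):
    # Normalize incoming hyphens the same way as the dictionary.
    sentence = sentence.replace("-", " ").strip()
    for item in mwe_list:
        if item in sentence:
            sentence = sentence.replace(item, item.replace(" ", "-"))
    return sentence

print(chunker("i have got an ace of diamonds in my wet suit ."))
# i have got an ace-of-diamonds in my wet-suit .
```

Note the dictionary normalization happens once at load time, while the sentence normalization runs per call, which is the "minor penalty" mentioned above.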







