Python 3 Dictionary for Weighted Inverted Indexes

First, this is homework, so I am only looking for suggestions. I am writing a program that generates a weighted inverted index. A weighted inverted index is a dictionary with a word as the key; the value is a list of lists, where each inner list contains a document number and the number of times that word appears in that document.

For example,

{"a": [[1, 2],[2,1]]}
The word "a" appears twice in document 1 and once in document 2.
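To make the structure concrete, here is a minimal sketch of how such an index could be read back. The helper name `count_in_document` is hypothetical and exists only for this illustration.

```python
# A tiny weighted inverted index in the format described above:
# word -> list of [document number, occurrence count] pairs.
index = {"a": [[1, 2], [2, 1]]}

def count_in_document(index, word, doc_number):
    """Return how often `word` occurs in document `doc_number` (0 if absent)."""
    for doc, count in index.get(word, []):
        if doc == doc_number:
            return count
    return 0

print(count_in_document(index, "a", 1))  # 2
print(count_in_document(index, "a", 3))  # 0
```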

      

I am practicing with two small files.

file1.txt:

    Where should I go
    When I want to have
    A smoke,
    A pancake, 
    and a nap.

      

file2.txt:

    I do not know
    Where my pancake is
    I want to take a nap.

      

Here is my program code:

def cleanData(myFile):
    file = open(myFile, "r")

    data = file.read()
    wordList = []

    #All numbers and end-of-sentence punctuation
    #replaced with the empty string
    #No replacement of apostrophes
    formattedData = data.strip().lower().replace(",","")\
                 .replace(".","").replace("!","").replace("?","")\
                 .replace(";","").replace(":","").replace('"',"")\
                 .replace("1","").replace("2","").replace("3","")\
                 .replace("4","").replace("5","").replace("6","")\
                 .replace("7","").replace("8","").replace("9","")\
                 .replace("0","")

    words = formattedData.split() #creates a list of all words in the document
    for word in words:
        wordList.append(word)     #adds each word in a document to the word list
    return wordList

def main():

    fullDict = {}

    files = ["file1.txt", "file2.txt"]
    docNumber = 1

    for file in files:
        wordList = cleanData(file)

        for word in wordList:
            if word not in fullDict:
                fullDict[word] = []
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
            else:
                listOfValues = list(fullDict.values())
                for x in range(len(listOfValues)):
                    if docNumber == listOfValues[x][0]:
                        listOfValues[x][1] +=1
                        fullDict[word] = listOfValues
                        break
                fileList = [docNumber,1]
                fullDict[word].append(fileList)

        docNumber +=1
    return fullDict

      

What I am trying to do is generate something like this:

{"a": [[1,3],[2,1]], "nap": [[1,1],[2,1]]}

      

What I get is this:

{"a": [[1,1],[1,1],[1,1],[2,1]], "nap": [[1,1],[2,1]]}

      

It records all occurrences of each word in all documents, but it records the repetitions separately. I cannot figure it out. Any help would be appreciated! Thank you in advance.:)

2 answers


There are two main problems in your code.

Problem 1

        listOfValues = list(fullDict.values())
        for x in range(len(listOfValues)):
            if docNumber == listOfValues[x][0]:

      

Here you take all the values of the dictionary, regardless of the current word, and try to increment the count there. Instead, you must increment the count in the list that belongs to the current word. So you have to change it to

listOfValues = fullDict[word]
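A small demonstration of why the original comparison never succeeds may help. The dictionary contents below are assumed sample state, chosen only for illustration.

```python
# Assumed state of the index after the first "a" in document 1 was recorded.
fullDict = {"where": [[1, 1]], "a": [[1, 1]]}
docNumber = 1

# Original approach: the postings of every word, one list per word.
all_values = list(fullDict.values())
print(all_values[0])                    # [[1, 1]]  (a list of pairs)
print(all_values[0][0])                 # [1, 1]    (a pair, not a number)
print(docNumber == all_values[0][0])    # False: an int never equals a list

# Corrected approach: only the postings of the current word.
postings = fullDict["a"]
print(postings[0][0])                   # 1
print(docNumber == postings[0][0])      # True
```

Because the comparison in the original code is always False, the count is never incremented and a new `[docNumber, 1]` pair is appended every time.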

      

Problem 2

        fileList = [docNumber,1]
        fullDict[word].append(fileList)

      

Besides incrementing the word count, you always append a new entry to fullDict[word]. But you should append it only if docNumber does not already exist in listOfValues. So you can use an else clause on the for loop, like this

    for word in wordList:
        if word not in fullDict:
            ....
        else:
            listOfValues = fullDict[word]
            for x in range(len(listOfValues)):
                ....
            else:
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
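If the else clause on a for loop is unfamiliar, here is a minimal, self-contained sketch of how it behaves. The function name `find_first_even` is made up for this example; the point is that the else branch runs only when the loop ends without hitting `break`.

```python
def find_first_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            result = f"found {n}"
            break          # skips the else branch below
    else:
        # Reached only when the loop finished without a break,
        # including the case where `numbers` is empty.
        result = "no even number"
    return result

print(find_first_even([1, 3, 4, 5]))  # found 4
print(find_first_even([1, 3, 5]))     # no even number
```

In the index-building loop, this means the inner for loop needs a `break` after incrementing the count, so that the else branch appends `[docNumber, 1]` only when no matching document number was found.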

      



After making these two changes, I got the following output:

{'a': [[1, 3], [2, 1]],
 'and': [[1, 1]],
 'do': [[2, 1]],
 'go': [[1, 1]],
 'have': [[1, 1]],
 'i': [[1, 2], [2, 2]],
 'is': [[2, 1]],
 'know': [[2, 1]],
 'my': [[2, 1]],
 'nap': [[1, 1], [2, 1]],
 'not': [[2, 1]],
 'pancake': [[1, 1], [2, 1]],
 'should': [[1, 1]],
 'smoke': [[1, 1]],
 'take': [[2, 1]],
 'to': [[1, 1], [2, 1]],
 'want': [[1, 1], [2, 1]],
 'when': [[1, 1]],
 'where': [[1, 1], [2, 1]]}

      


Here are several suggestions for improving your code.

  • Instead of using lists to store the document number and the count, you can use a dictionary. It will make your life easier.

  • Instead of counting manually, you can use collections.Counter.

  • Instead of chaining multiple replace calls, you can use a simple regex like this

    formattedData = re.sub(r'[,.!?;:"0-9]', '', data.strip().lower())
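The three suggestions above could be combined roughly as follows. This is only a sketch under some assumptions: the function name `build_index` is hypothetical, the documents are passed in as strings instead of being read from files, and the result maps each word to a `{document number: count}` dictionary rather than the list-of-lists format from the question.

```python
import re
from collections import Counter

def build_index(documents):
    """Build word -> {doc number: count} from a list of document strings."""
    index = {}
    for doc_number, text in enumerate(documents, 1):
        # Remove punctuation and digits (the comma is included so that
        # "smoke," becomes "smoke"), then split on whitespace.
        words = re.sub(r'[,.!?;:"0-9]', '', text.lower()).split()
        # Counter tallies every word of this document in one pass.
        for word, count in Counter(words).items():
            index.setdefault(word, {})[doc_number] = count
    return index

docs = ["A smoke, a pancake, and a nap.", "I want to take a nap."]
index = build_index(docs)
print(index["a"])    # {1: 3, 2: 1}
print(index["nap"])  # {1: 1, 2: 1}
```

With a dictionary per word, looking up or updating the count for one document is a single key access, so no inner search loop is needed.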
    
          

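If I were to clean up cleanData, I would do it like this:

import re
def cleanData(myFile):
    # The with statement closes the file automatically
    with open(myFile, "r") as input_file:
        data = input_file.read()
    # Remove punctuation and digits (apostrophes are kept), then split into words
    return re.sub(r'[,.!?;:"0-9]', '', data.strip().lower()).split()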

      

In main, you can use the loop improvements suggested by Brad Boudlon, like this

def main():
    fullDict = {}
    files = ["file1.txt", "file2.txt"]
    for docNumber, currentFile in enumerate(files, 1):
        for word in cleanData(currentFile):
            if word not in fullDict:
                fullDict[word] = [[docNumber, 1]]
            else:
                for x in fullDict[word]:
                    if docNumber == x[0]:
                        x[1] += 1
                        break
                else:
                    fullDict[word].append([docNumber, 1])
    return fullDict

      



I prefer to write for loops without the len and range functions. Because the inner lists are mutable, you don't need the index: iterating gives you each inner list directly, and you can modify it in place. I replaced the for loop with the following and got the same result as the others.



for word in wordList:
    if word not in fullDict:
        fullDict[word] = [[docNumber, 1]]
    else:
        for val in fullDict[word]:
            if val[0] == docNumber:
                val[1] += 1
                break
        else:
            fullDict[word].append([docNumber, 1])
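The reason this works without an index can be seen in a short sketch. The variable name `postings` and its sample contents are assumed for illustration only.

```python
postings = [[1, 1], [2, 1]]

# The loop variable refers to the same inner list object that is stored
# in `postings`, so changing an element mutates the original in place.
for val in postings:
    if val[0] == 2:
        val[1] += 1

print(postings)  # [[1, 1], [2, 2]]

# Rebinding the loop variable, by contrast, leaves `postings` unchanged.
for val in postings:
    val = [val[0], 0]

print(postings)  # [[1, 1], [2, 2]]
```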

      
