Python 3 Dictionary for Weighted Inverted Indexes
First, this is homework, so I would just like suggestions. I am writing a program that generates a weighted inverted index. A weighted inverted index is a dictionary with a word as the key; the value is a list of lists, where each inner list holds a document number and the number of times the word appears in that document.
For example,
{"a": [[1, 2],[2,1]]}
The word "a" appears twice in document 1 and once in document 2.
I am practicing with two small files.
file1.txt:
Where should I go
When I want to have
A smoke,
A pancake,
and a nap.
file2.txt:
I do not know
Where my pancake is
I want to take a nap.
Here is my program code:
def cleanData(myFile):
    file = open(myFile, "r")
    data = file.read()
    wordList = []
    #All numbers and end-of-sentence punctuation
    #replaced with the empty string
    #No replacement of apostrophes
    formattedData = data.strip().lower().replace(",","")\
                    .replace(".","").replace("!","").replace("?","")\
                    .replace(";","").replace(":","").replace('"',"")\
                    .replace("1","").replace("2","").replace("3","")\
                    .replace("4","").replace("5","").replace("6","")\
                    .replace("7","").replace("8","").replace("9","")\
                    .replace("0","")
    words = formattedData.split() #creates a list of all words in the document
    for word in words:
        wordList.append(word) #adds each word in a document to the word list
    return wordList
def main():
    fullDict = {}
    files = ["file1.txt", "file2.txt"]
    docNumber = 1
    for file in files:
        wordList = cleanData(file)
        for word in wordList:
            if word not in fullDict:
                fullDict[word] = []
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
            else:
                listOfValues = list(fullDict.values())
                for x in range(len(listOfValues)):
                    if docNumber == listOfValues[x][0]:
                        listOfValues[x][1] +=1
                        fullDict[word] = listOfValues
                        break
                fileList = [docNumber,1]
                fullDict[word].append(fileList)
        docNumber +=1
    return fullDict
What I am trying to do is generate something like this:
{"a": [[1,3],[2,1]], "nap": [[1,1],[2,1]]}
What I get is this:
{"a": [[1,1],[1,1],[1,1],[2,1]], "nap": [[1,1],[2,1]]}
It records all occurrences of each word in all documents, but it records the repetitions separately. I cannot figure it out. Any help would be appreciated! Thank you in advance.:)
There are two main problems in your code.
Problem 1
listOfValues = list(fullDict.values())
for x in range(len(listOfValues)):
    if docNumber == listOfValues[x][0]:
Here you take all the values in the dictionary, regardless of the current word, and try to increment the count there, but you must increment the count only in the list that belongs to the current word. So you have to change it to
listOfValues = fullDict[word]
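To see why the original comparison never matches, here is a minimal sketch using a small hand-built index (the sample data is made up for illustration). Each element of list(fullDict.values()) is the whole list for one word, so its element [0] is a [docNumber, count] pair rather than a document number.

```python
# Hand-built sample index, for illustration only.
fullDict = {"a": [[1, 1]], "nap": [[1, 1]]}
docNumber = 1

# Original approach: one list per word, for every word in the dictionary.
allValues = list(fullDict.values())
first_of_all = allValues[0][0]               # a [docNumber, count] pair
wrong_match = (docNumber == first_of_all)    # int compared with a list

# Suggested approach: only the list that belongs to the current word.
listOfValues = fullDict["a"]
right_match = (docNumber == listOfValues[0][0])  # int compared with an int

print(first_of_all, wrong_match, right_match)
```

Because the first comparison is always False, the break in the original loop is never reached.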
Problem 2
fileList = [docNumber,1]
fullDict[word].append(fileList)
Besides trying to increment the count across all words, you always append a new entry to fullDict[word]. But you should append it only if docNumber does not already exist in listOfValues. So you can use an else clause on the for loop, like this
for word in wordList:
    if word not in fullDict:
        ....
    else:
        listOfValues = fullDict[word]
        for x in range(len(listOfValues)):
            ....
        else:
            fileList = [docNumber, 1]
            fullDict[word].append(fileList)
After making these two changes, I got the following output
{'a': [[1, 3], [2, 1]],
'and': [[1, 1]],
'do': [[2, 1]],
'go': [[1, 1]],
'have': [[1, 1]],
'i': [[1, 2], [2, 2]],
'is': [[2, 1]],
'know': [[2, 1]],
'my': [[2, 1]],
'nap': [[1, 1], [2, 1]],
'not': [[2, 1]],
'pancake': [[1, 1], [2, 1]],
'should': [[1, 1]],
'smoke': [[1, 1]],
'take': [[2, 1]],
'to': [[1, 1], [2, 1]],
'want': [[1, 1], [2, 1]],
'when': [[1, 1]],
'where': [[1, 1], [2, 1]]}
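The fix relies on Python's for ... else: the else branch runs only when the loop finishes without hitting break. A minimal sketch of that behaviour, using a hypothetical helper name (add_occurrence) that is not part of the original code:

```python
def add_occurrence(postings, doc_number):
    """Increment the count for doc_number, or append a new [doc, 1] pair.

    Hypothetical helper, shown only to illustrate for ... else.
    """
    for pair in postings:
        if pair[0] == doc_number:
            pair[1] += 1
            break              # found: the else branch below is skipped
    else:
        # Reached only when the loop ended without break.
        postings.append([doc_number, 1])
    return postings


postings = [[1, 2]]
add_occurrence(postings, 1)    # document 1 is already listed
add_occurrence(postings, 2)    # document 2 is new
print(postings)                # [[1, 3], [2, 1]]
```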
Here are several suggestions for improving your code.
- Instead of using lists to store the document number and count, you can use a dictionary. It will make your life easier.
- Instead of counting manually, you can use collections.Counter.
- Instead of chaining multiple replace calls, you can use a simple regex like this
formattedData = re.sub(r'[,.!?;:"0-9]', '', data.strip().lower())
If I were to clean up cleanData, I would do it like this:
import re

def cleanData(myFile):
    with open(myFile, "r") as input_file:
        data = input_file.read()
    # The character class covers the same characters as the original replace chain.
    return re.sub(r'[,.!?;:"0-9]', '', data.strip().lower()).split()
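As a quick check that does not need the files, here is a sketch that applies a character class covering the same characters as the original chain of replace calls (including the comma) to a string. The helper name clean_text is hypothetical and only used for this illustration.

```python
import re


def clean_text(text):
    # Hypothetical helper: remove , . ! ? ; : " and digits, lowercase, then split.
    return re.sub(r'[,.!?;:"0-9]', '', text.strip().lower()).split()


sample = "A smoke,\nA pancake,\nand a nap."
print(clean_text(sample))
# ['a', 'smoke', 'a', 'pancake', 'and', 'a', 'nap']
```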
In the loop in main, you can use the improvements suggested by Brad Boudlon, like this
def main():
    fullDict = {}
    files = ["file1.txt", "file2.txt"]
    for docNumber, currentFile in enumerate(files, 1):
        for word in cleanData(currentFile):
            if word not in fullDict:
                fullDict[word] = [[docNumber, 1]]
            else:
                for x in fullDict[word]:
                    if docNumber == x[0]:
                        x[1] += 1
                        break
                else:
                    fullDict[word].append([docNumber, 1])
    return fullDict
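The dictionary and collections.Counter suggestions could be combined roughly as follows. This is only a sketch under assumptions: build_index is a hypothetical name, it takes already-cleaned word lists instead of file names, and it changes the value format from a list of [docNumber, count] pairs to a dictionary mapping docNumber to count, which may not be what the assignment requires.

```python
from collections import Counter


def build_index(documents):
    """documents: list of word lists, one per document (numbered from 1)."""
    index = {}
    for doc_number, words in enumerate(documents, 1):
        # Counter tallies every word of one document in a single pass.
        for word, count in Counter(words).items():
            index.setdefault(word, {})[doc_number] = count
    return index


docs = [["a", "smoke", "a", "pancake", "and", "a", "nap"],
        ["i", "want", "to", "take", "a", "nap"]]
index = build_index(docs)
print(index["a"])      # {1: 3, 2: 1}
print(index["nap"])    # {1: 1, 2: 1}
```

If the list-of-lists format is needed, each inner dictionary can be converted with [[doc, count] for doc, count in index[word].items()].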
I prefer for loops that do not iterate using len and range. Since the values are all mutable lists, you don't need the index; you only need each inner list, which can then be changed in place. I replaced the for loop with the following and got the same result as the others.
for word in wordList:
    if word not in fullDict:
        fullDict[word] = [[docNumber, 1]]
    else:
        for val in fullDict[word]:
            if val[0] == docNumber:
                val[1] += 1
                break
        else:
            fullDict[word].append([docNumber, 1])
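Iterating without an index works here because the loop variable refers to the same inner list object that is stored in the dictionary. A small sketch with made-up sample data, contrasted with a list of plain integers where rebinding the loop variable changes nothing:

```python
postings = [[1, 2], [2, 1]]    # sample [docNumber, count] pairs

for val in postings:
    # val is the same list object that postings holds, so this edits it in place.
    if val[0] == 2:
        val[1] += 1

print(postings)    # [[1, 2], [2, 2]]

# Integers are immutable: c += 1 only rebinds the name c.
counts = [1, 2]
for c in counts:
    c += 1

print(counts)      # [1, 2]
```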