How to find doubled words in a file

Question

I am having some problems with my code. I am trying to find doubled words in a file, such as "the the", and then print the line where each one occurs. So far my code works for counting lines, but it reports every word that is repeated anywhere in the file, not just words that are repeated back to back. What do I need to change so that it only counts doubled words?

my_file = input("Enter file name: ")
lst = []
count = 1
with open(my_file, "r") as dup:
    for line in dup:
        linedata = line.split()
        for word in linedata:
            if word not in lst:
                lst.append(word)
            else:
                print("Found word: {""} on line {}".format(word, count))
                count = count + 1
dup.close()

+3

python-3.x

RandomUser 03 Apr 17 at 13:35

3 answers

my_file = input("Enter file name: ")
with open(my_file, "r") as dup:
    for line_num, line in enumerate(dup, start=1):  # number lines from 1
        words_in_line = line.split()
        # compare each word with the word immediately before it
        duplicates = [word for i, word in enumerate(words_in_line[1:]) if words_in_line[i] == word]
        # duplicates now holds the doubled words found on this line;
        # do whatever you want with it, e.g. print it together with line_num
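
The same neighbour-by-neighbour comparison can also be written with zip and wrapped in a function, so it can be tried without a file. This is only a sketch: the name find_doubled_words is made up for illustration, and it assumes words are separated by whitespace and that line numbers start at 1.

```python
def find_doubled_words(lines):
    """Return (line_number, word) pairs for words repeated back to back.

    Hypothetical helper for illustration, not part of the original answer.
    """
    found = []
    for line_num, line in enumerate(lines, start=1):
        words = line.split()
        # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
        for previous, current in zip(words, words[1:]):
            if previous == current:
                found.append((line_num, current))
    return found


sample = ["this is the the first line", "nothing doubled here", "it is is fine"]
print(find_doubled_words(sample))  # [(1, 'the'), (3, 'is')]
```

An open file can be passed in place of the list, since iterating over a file yields its lines.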

+1

Maciek 03 Apr 17 at 13:45

Place the code below in a file named THISfile.py and run it to see what happens:

# myFile = input("Enter file name: ")
# line No 2: line with with double 'with'
# line No 3: double ( word , word ) is not a double word
myFile="THISfile.py"
lstUniqueWords = []
noOfFoundWordDoubles = 0
totalNoOfWords       = 0
lineNo               = 0
lstLineNumbersWithWordDoubles = []
with open(myFile, "r") as myFile:
    for line in myFile:
        lineNo+=1 # memorize current line number 
        lineWords = line.split()
        if len(lineWords) > 0: # scan line only if it contains words
            currWord = lineWords[0] # remember already 'visited' word
            totalNoOfWords += 1
            if currWord not in lstUniqueWords: 
                lstUniqueWords.append(currWord) 
                # put 'visited' word word into lstAllWordsINmyFile (if it is not already there)
            lastWord = currWord # we are done with current, so current becomes last one
            if len(lineWords) > 1 : # proceed only if line has two or more words
                for word in lineWords[1:] : # loop over all other words
                    totalNoOfWords += 1
                    currWord = word
                    if currWord not in lstUniqueWords: 
                        lstUniqueWords.append(currWord) 
                        # put 'visited' word into lstAllWordsINmyFile (if it is not already there)
                    if( currWord == lastWord ): # duplicate word found: 
                        noOfFoundWordDoubles += 1
                        print("Found double word: ['{""}'] in line {}".format(currWord, lineNo))
                        lstLineNumbersWithWordDoubles.append(lineNo)
                    lastWord = currWord 
                    #        ^--- now after all all work is done, the currWord is considered lastWord
print(
    "noOfDoubles", noOfFoundWordDoubles, "\n",
    "totalNoOfWords", totalNoOfWords, "uniqueWords", len(lstUniqueWords), "\n",
    "linesWithDoubles", lstLineNumbersWithWordDoubles
)

The output should be:

Found double word: ['with'] in line 2
Found double word: ['word'] in line 19
Found double word: ['all'] in line 33
noOfDoubles 3 
 totalNoOfWords 221 uniqueWords 111 
 linesWithDoubles [2, 19, 33]

You can now check the comments in the code to get a better understanding of how this works. Have fun coding :)
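
If the file is large, the "not in" test on a list gets slow, because the whole list is searched for every word. A rough sketch of the same counters using a set instead is shown here. The function name word_stats and its return shape are assumptions made for this example, and it splits on whitespace just like the code above.

```python
def word_stats(lines):
    """Hypothetical helper: count words, unique words and doubled words."""
    unique_words = set()       # set membership is O(1) on average
    total_words = 0
    lines_with_doubles = []    # one entry per doubled word found
    for line_no, line in enumerate(lines, start=1):
        last_word = None       # reset per line, as in the code above
        for word in line.split():
            total_words += 1
            unique_words.add(word)
            if word == last_word:
                lines_with_doubles.append(line_no)
            last_word = word
    return total_words, len(unique_words), lines_with_doubles


print(word_stats(["a a b", "b c", "c c c"]))  # (8, 3, [1, 3, 3])
```

The number of doubles is simply the length of the returned list of line numbers.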

0

Claudio 03 Apr 17 at 15:36

Claudio · Accepted Answer · 2017-04-03T15:50:02+0000

Here's just a clean answer to the question:

"What do I need to change so that it only counts doubled words?"

Here you are:

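my_file = input("Enter file name: ")
count = 0  # current line number
with open(my_file, "r") as dup:
    for line in dup:
        count = count + 1
        linedata = line.split()
        lastWord = ''  # reset on every line, so only doubles within one line are reported
        for word in linedata:
            if word == lastWord:
                print("Found word: {} on line {}".format(word, count))
            lastWord = word  # the current word becomes the previous one
# no dup.close() needed: the with block closes the file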
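
Note that this compares words exactly, so "The the" or "was was," would not be reported. If that matters, one possible variant is to normalize each word before comparing. The sketch below assumes that ignoring case and surrounding punctuation is what you want; the function name doubled_words_normalized is made up for this example.

```python
import string


def doubled_words_normalized(lines):
    """Hypothetical variant: ignore case and surrounding punctuation."""
    results = []
    for count, line in enumerate(lines, start=1):
        last_word = ''
        for raw in line.split():
            # "The" and "the," both normalize to "the"
            word = raw.strip(string.punctuation).lower()
            # skip tokens that were punctuation only (empty after stripping)
            if word and word == last_word:
                results.append((count, word))
            last_word = word
    return results


print(doubled_words_normalized(["The the cat sat.", "It was was, indeed, fine."]))
# [(1, 'the'), (2, 'was')]
```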
