Java scanner not completely reading every line in .txt


This program tries to split a text file into words and then count how many times each word is used. The scanner seems to read only part of each line and I have no idea why. This is my first time using this scanning method.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;


public class WordStats {

    public static void main(String args[]){
        ArrayList<String> words = new ArrayList<>(1);
        ArrayList<Integer> num = new ArrayList<>(1);
        Scanner sc2 = null;
        try {
            sc2 = new Scanner(new File("source.txt"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();  
        }
        while (sc2.hasNextLine()) {
            Scanner s2 = new Scanner(sc2.nextLine());
            boolean set=false;
            while (s2.hasNext()) {
                num.add(1);
                String s = s2.next().replaceAll("[^A-Za-z ]", " ").toLowerCase().trim();
                for(int i=0;i<words.size(); i++){
                    if(s.equals(words.get(i))){
                        num.set(i,num.get(i)+1);
                        set=true;
                    }
                }
                if(!set){
                    words.add(s);
                    num.add(1);
                }
            }
        }
        for(int i=0;i<words.size();i++){
            System.out.println(words.get(i)+" "+num.get(i));
        }
    }
}

The text file is the Gettysburg Address:

ABRAHAM LINCOLN, "GETTYSBURG ADDRESS" (19 NOVEMBER 1863)

Fourscore and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth.

The original line breaks are preserved. My output seems to count only part of each line, and it also counts whitespace as a word twice. Output:

abraham 1
lincoln 1
gettysburg 1
address 1
 2
november 1
fourscore 1
and 5
seven 1
years 1
ago 1
our 2
fathers 1
brought 1
forth 1
on 2
this 3
continent 1
a 7
new 2
nation 5
conceived 2
in 4
liberty 1
now 1
we 8
are 2
engaged 1
but 2

The problem may lie somewhere other than the scanning, but I am more familiar with that part of the code and do not think it does.

+3

java file

dilucidis 11 Sep 14 at 16:13


3 answers

The problem is that your code adds 1 to the num list unconditionally on every iteration of the loop. This shifts num out of step with words, producing the wrong output.

Removing num.add(1); from the top of the nested while loop would fix that misalignment. However, it is better to use a Map<String,Integer> to keep track of the counts. In addition to keeping counts and words always in sync, this change lets you remove the inner for loop entirely and rely on the map's fast lookup instead.
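As a rough illustration of the map-based approach, here is a minimal sketch. The class name WordCountSketch and the helper countWords are made up for this example, and it reads from a string rather than source.txt so that it runs on its own. It keeps the question's cleanup expression and, as an added assumption, skips tokens that are empty after cleanup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCountSketch {

    // Hypothetical helper: counts the words in a block of text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Scanner tokens = new Scanner(text);
        while (tokens.hasNext()) {
            // Same cleanup expression as in the question.
            String word = tokens.next().replaceAll("[^A-Za-z ]", " ").toLowerCase().trim();
            if (word.isEmpty()) {
                continue; // assumption: ignore tokens with no letters, such as "(19"
            }
            // Stores 1 for a new word, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        tokens.close();
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("We are met. We have come, we say.");
        System.out.println(counts.get("we"));  // prints 3
        System.out.println(counts.get("met")); // prints 1
    }
}
```

Because each word is its own key, there is no second list that can drift out of step, and no flag to reset.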

+1

dasblinkenlight 11 Sep 14 at 16:21


The logic is a little off. You have two parallel lists that should always have the same number of elements, but they are not added to in parallel.

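    // Needs java.util.Map and java.util.TreeMap in addition to the question's imports.
    // A TreeMap keeps the words sorted alphabetically for the final printout.
    Map<String, Integer> wordFrequencies = new TreeMap<>();

    while (sc2.hasNextLine()) {
        Scanner s2 = new Scanner(sc2.nextLine());
        while (s2.hasNext()) {
            String word = s2.next().replaceAll("[^A-Za-z ]", " ")
                .toLowerCase().trim();
            // First sighting stores 1; later sightings increment the stored count.
            Integer n = wordFrequencies.get(word);
            wordFrequencies.put(word, n == null ? 1 : 1 + n);
        }
    }
    for (Map.Entry<String, Integer> entry : wordFrequencies.entrySet()) {
        System.out.printf("%-40s %5d%n", entry.getKey(), entry.getValue());
    }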

+1

Joop eggen 11 Sep 14 at 16:26


Justin · Accepted Answer · 2014-09-11T18:34:56+0000

You need to reset your boolean set at the start of this while loop:

 while (s2.hasNext()) {
     set = false;

Otherwise, after the first repeated word on a line, set stays true for the rest of that line and no new words are added to your list.

The empty "word" in the output comes from how your replaceAll call handles "(19" and "1863)": those tokens contain no alphabetic characters, so nothing is left of them after trim().
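Putting the fixes from the answers together, a corrected version of the original parallel-list approach might look like the sketch below. The class name WordStatsFixed and the helper count are invented for this example, and the input is a string rather than source.txt so the sketch is self-contained. Skipping empty tokens and breaking out of the search loop are additions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordStatsFixed {

    // Hypothetical helper so the logic can run on a string instead of source.txt.
    static void count(String text, List<String> words, List<Integer> num) {
        Scanner lines = new Scanner(text);
        while (lines.hasNextLine()) {
            Scanner s2 = new Scanner(lines.nextLine());
            while (s2.hasNext()) {
                boolean set = false; // reset for every word, not once per line
                String s = s2.next().replaceAll("[^A-Za-z ]", " ").toLowerCase().trim();
                if (s.isEmpty()) {
                    continue; // "(19" and "1863)" have no letters left after cleanup
                }
                for (int i = 0; i < words.size(); i++) {
                    if (s.equals(words.get(i))) {
                        num.set(i, num.get(i) + 1);
                        set = true;
                        break; // a word appears in the list at most once
                    }
                }
                if (!set) {
                    // The only place either list grows, so the indexes stay aligned.
                    words.add(s);
                    num.add(1);
                }
            }
        }
        lines.close();
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        List<Integer> num = new ArrayList<>();
        count("a new nation\nand a new birth (19 1863)", words, num);
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i) + " " + num.get(i));
        }
    }
}
```

For the two-line sample in main, this prints a 2, new 2, nation 1, and 1, birth 1, one pair per line, with no empty entry.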
