One HashMap key holding a class: count the key and get the counter

I am working on my own database project. I have an input file obtained from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/

After processing, I have 1400 separate files (named 00001.txt ... 01400.txt). After applying stemming, I store them separately in a specific folder, call it StemmedFolder, in the following format:

at StemmedFolder: 00001.txt includes:

investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing

      

at StemmedFolder: 00756.txt includes:

remark
eddi
viscos
compress
mix
flow
lu
ting

      

Etc....

I wrote code that does the following:

  • Read StemmedFolder and count unique words
  • Sort alphabetically
  • Add the document ID
  • Save each result to a new file (00001.txt to 01400.txt) as described

(I can provide my code for these four steps if anyone needs to see the implementation.)


The output for each file is written to a separate file (1400 files, named 00001.txt, 00002.txt, ...) in a specific folder, call it FrequencyFolder, in the following format:

The FrequencyFolder: 00001.txt includes:

00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....

      

The FrequencyFolder: 00999.txt includes:

00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....

      

The FrequencyFolder: 01400.txt includes:

01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....

      


______________

Now my question is:

I need to merge those 1400 files into a single txt file in the following format, with some calculations:

'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequency=2docs: [[Doc_00562,6],[Doc_01111,1]]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....

'zzzz' totalFrequency=1doc: [[Doc_01235,1]]

      


Thanks for taking the time to read this long post.



1 answer


You can use a Map of Lists:

Map<String, List<FileInformation>> statistics = new HashMap<>();

In the above map, the key is a word, and the value is a List<FileInformation> describing the statistics of the individual files containing that word. The class FileInformation can be declared like this:

class FileInformation {
    int occurrenceCount;
    String fileName;

    //getters and setters
}

      



To populate the above map, follow these steps:

  • Read each file in FrequencyFolder.
  • When you first encounter a word, put it in the Map as a key.
  • Create a FileInformation object, set occurrenceCount to the number of occurrences found, and set fileName to the name of the file in which the word was found. Add this object to the List<FileInformation> for the key created in step 2.
  • The next time you come across the same word in another file, create a new FileInformation object and add it to the List<FileInformation> of the map entry for that word.
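
The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name FrequencyIndexSketch, the helper methods addLine and buildIndex, and the constructor on FileInformation are hypothetical names introduced here. It assumes every useful line has the form docId,word,count (as in the FrequencyFolder samples) and silently skips anything else, such as the "...." filler lines.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyIndexSketch {

    // Same fields as FileInformation in the answer; the constructor is an addition for brevity.
    static class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Parses one line such as "00001,aerodynam,2" and records it under its word.
    static void addLine(Map<String, List<FileInformation>> statistics, String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            return; // assumption: blank or malformed lines are ignored
        }
        int count;
        try {
            count = Integer.parseInt(parts[2].trim());
        } catch (NumberFormatException e) {
            return; // assumption: a non-numeric count means the line is not data
        }
        // computeIfAbsent creates the list the first time a word is seen
        // and reuses it on every later occurrence.
        statistics.computeIfAbsent(parts[1], k -> new ArrayList<>())
                  .add(new FileInformation(parts[0], count));
    }

    // Reads every *.txt file in the given folder and merges them into one map.
    static Map<String, List<FileInformation>> buildIndex(Path folder) throws IOException {
        Map<String, List<FileInformation>> statistics = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file)) {
                    addLine(statistics, line);
                }
            }
        }
        return statistics;
    }
}
```

Calling FrequencyIndexSketch.buildIndex(Paths.get("FrequencyFolder")) would then return the merged map, assuming the folder sits in the working directory. Here the document ID from the first column is stored in fileName, since it already matches the file name without the extension.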

Once the Map is filled, printing the statistics should be a piece of cake.

for (String word : statistics.keySet()) {
  List<FileInformation> fileInfos = statistics.get(word);
  int totalFrequency = 0;
  for (FileInformation fileInfo : fileInfos) {
      // sum up the occurrenceCount for the word to get the total frequency
      totalFrequency += fileInfo.occurrenceCount;
  }
}
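
To get from the filled map to the output lines asked for in the question, something along these lines might work. It is a sketch under assumptions: StatisticsReportSketch, formatEntry and report are hypothetical names, totalFrequency is taken to be the sum of occurrenceCount over all files (the sample output in the question could also be read as the number of documents), and the documents are ordered by descending count because the sample appears to list them that way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class StatisticsReportSketch {

    // Same fields as FileInformation in the answer; the constructor is an addition for brevity.
    static class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Builds one output line for a word, e.g.
    // 'lift' totalFrequency=4 docs: [[Doc_00999,3],[Doc_00001,1]]
    static String formatEntry(String word, List<FileInformation> fileInfos) {
        // Copy before sorting so the caller's list is left untouched.
        List<FileInformation> sorted = new ArrayList<>(fileInfos);
        sorted.sort(Comparator.comparingInt((FileInformation f) -> f.occurrenceCount).reversed());

        int totalFrequency = 0;
        StringJoiner docs = new StringJoiner(",", "[", "]");
        for (FileInformation f : sorted) {
            totalFrequency += f.occurrenceCount;
            docs.add("[Doc_" + f.fileName + "," + f.occurrenceCount + "]");
        }
        return "'" + word + "' totalFrequency=" + totalFrequency + " docs: " + docs;
    }

    // Returns one formatted line per word, in alphabetical order of the words.
    static List<String> report(Map<String, List<FileInformation>> statistics) {
        List<String> lines = new ArrayList<>();
        // A TreeMap copy iterates its keys in sorted order; a HashMap gives no such guarantee.
        for (Map.Entry<String, List<FileInformation>> e : new TreeMap<>(statistics).entrySet()) {
            lines.add(formatEntry(e.getKey(), e.getValue()));
        }
        return lines;
    }
}
```

The lines returned by report could then be written to a single file, for example with Files.write. If the number of documents is what is wanted instead of the summed count, fileInfos.size() gives it directly.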

      
