HashMap with one key per word holding a class: merging per-file counts and getting the totals
I am working on my own database project. I have an input file obtained from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
After processing the 1400 separate files (named 00001.txt, ..., 01400.txt) and applying stemming, I store them separately in a specific folder, call it StemmedFolder, in the following format:
In StemmedFolder, 00001.txt contains:
investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing
In StemmedFolder, 00756.txt contains:
remark
eddi
viscos
compress
mix
flow
lu
ting
Etc....
I wrote code that does the following:
- Reads StemmedFolder and counts the unique words
- Sorts them alphabetically
- Adds the document ID
- Saves each result to a new file, 00001.txt to 01400.txt, as described
(I can provide my code for these 4 steps if anyone needs to see the implementation or suggest changes.)
Each input file produces a separate output file (1400 files, named 00001.txt, 00002.txt, ...) in a specific folder, call it FrequencyFolder, in the following format:
The FrequencyFolder: 00001.txt includes:
00001,aerodynam,2 00001,agre,3 00001,angl,1 00001,attack,7 00001,basi,4 ....
The FrequencyFolder: 00999.txt includes:
00999,aerodynam,5 00999,evalu,1 00999,lift,3 00999,ratio,2 00999,result,9 ....
The FrequencyFolder: 01400.txt includes:
01400,subtract,1 01400,support,1 01400,theoret,1 01400,theori,1 01400,.....
______________
Now my question is:
I need to merge those 1400 files into a single txt file in the following format, with some calculations:
'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequency=2docs: [[Doc_00562,6],[Doc_01111,1]]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....
'zzzz' totalFrequency=1doc: [[Doc_01235,1]]
Thanks for taking the time to read this long post.
You can use a Map of Lists.
Map<String, List<FileInformation>> statistics = new HashMap<>();
In the above map, the key is a word, and the value is a List<FileInformation> describing the statistics of the individual files that contain that word. The class FileInformation can be declared like this:
class FileInformation {
int occurrenceCount;
String fileName;
//getters and setters
}
To populate the above map, follow these steps:
1. Read each file in FrequencyFolder.
2. When you first meet a word, put it as a key in the Map.
3. Create a FileInformation object, set occurrenceCount to the number of occurrences found, and set fileName to the name of the file in which the word was found. Add this object to the List<FileInformation> corresponding to the key created in step 2.
4. The next time you come across the same word in another file, create a new FileInformation object and add it to the List<FileInformation> of the existing map entry for that word.
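The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions that are not in the question: records are taken to be whitespace-separated `docId,word,count` triples, the helper names `IndexBuilder`, `addRecords` and `buildFromFolder` are made up for the example, `FileInformation` gets a constructor instead of setters, and a `TreeMap` is swapped in for the `HashMap` so that the words come out in alphabetical order. It needs Java 8 or later for `computeIfAbsent`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Same idea as the FileInformation class above, with a constructor for brevity.
class FileInformation {
    final String fileName;      // here: the document ID, e.g. "00001"
    final int occurrenceCount;  // how often the word occurs in that document

    FileInformation(String fileName, int occurrenceCount) {
        this.fileName = fileName;
        this.occurrenceCount = occurrenceCount;
    }
}

// Hypothetical helper, not part of the original answer.
class IndexBuilder {

    // Parses "docId,word,count" records (assumed whitespace-separated)
    // and adds one FileInformation per record to the list for that word.
    static void addRecords(String text, Map<String, List<FileInformation>> statistics) {
        for (String record : text.trim().split("\\s+")) {
            String[] parts = record.split(",");
            if (parts.length != 3) {
                continue; // skip blank or malformed records
            }
            int count;
            try {
                count = Integer.parseInt(parts[2]);
            } catch (NumberFormatException e) {
                continue; // skip records whose count is not a number
            }
            // Steps 2-4: create the list on first sight of the word, then append.
            statistics.computeIfAbsent(parts[1], k -> new ArrayList<>())
                      .add(new FileInformation(parts[0], count));
        }
    }

    // Step 1: read every .txt file in the folder (iteration order is unspecified).
    static Map<String, List<FileInformation>> buildFromFolder(Path folder) throws IOException {
        Map<String, List<FileInformation>> statistics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                addRecords(text, statistics);
            }
        }
        return statistics;
    }
}
```

Something like `IndexBuilder.buildFromFolder(Paths.get("FrequencyFolder"))` would then return the filled map, assuming the folder sits in the working directory.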
Once the Map is filled, printing the statistics should be a piece of cake.
for (String word : statistics.keySet()) {
    List<FileInformation> fileInfos = statistics.get(word);
    int totalFrequency = 0;
    for (FileInformation fileInfo : fileInfos) {
        // sum up the occurrenceCount for the word to get the total frequency
        totalFrequency += fileInfo.occurrenceCount;
    }
}
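To get from that loop to the output lines in the question, a formatting helper along these lines might work. It is a sketch with assumptions of its own: `IndexPrinter`, `formatEntry` and `totalOccurrences` are hypothetical names, `fileName` is assumed to hold the bare document ID, and the documents are sorted by descending count because that is how the sample output appears to be ordered. Note that in the question's sample the number after `totalFrequency=` matches the number of documents rather than the summed count, so the sketch prints the list size and offers the sum as a separate method. `FileInformation` is repeated only so the block compiles on its own.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Repeated from the answer so this block is self-contained.
class FileInformation {
    final String fileName;      // assumed to be the document ID, e.g. "00001"
    final int occurrenceCount;

    FileInformation(String fileName, int occurrenceCount) {
        this.fileName = fileName;
        this.occurrenceCount = occurrenceCount;
    }
}

// Hypothetical helper, not part of the original answer.
class IndexPrinter {

    // Sum of the per-document counts for one word.
    static int totalOccurrences(List<FileInformation> fileInfos) {
        int total = 0;
        for (FileInformation fileInfo : fileInfos) {
            total += fileInfo.occurrenceCount;
        }
        return total;
    }

    // Builds one output line in the style of the question's sample.
    static String formatEntry(String word, List<FileInformation> fileInfos) {
        // Copy before sorting so the caller's list is left untouched.
        List<FileInformation> sorted = new ArrayList<>(fileInfos);
        sorted.sort(Comparator.comparingInt((FileInformation f) -> f.occurrenceCount).reversed());

        StringBuilder line = new StringBuilder();
        line.append('\'').append(word).append("' totalFrequency=")
            .append(sorted.size()) // swap in totalOccurrences(fileInfos) if the sum is wanted
            .append(sorted.size() == 1 ? "doc: [" : "docs: [");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            FileInformation f = sorted.get(i);
            line.append("[Doc_").append(f.fileName).append(',')
                .append(f.occurrenceCount).append(']');
        }
        return line.append(']').toString();
    }
}
```

Calling `formatEntry` for each entry of the map and writing the returned lines with a `BufferedWriter` would produce the merged file.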