PriorityQueue with indices for keeping sorted counts

A problem that I often face in Java (usually when writing computational linguistics code) is the need to count the number of occurrences of some elements in a dataset and then sort the elements according to their count. The simplest concrete example is word counting: I need to count the number of occurrences of each word in a text file and then sort the words by their counts to find the most frequently used words.

Unfortunately, Java doesn't seem to have a good data structure for this task. I need to use the words as indices of a collection while I count, so that I can efficiently look up the right counter to increment every time I read a word, but the values I want to sort by are the counts, not the words.

Map<String, Integer> provides the interface I need to look up the counter associated with a word, but Maps can only be sorted by their keys (e.g. TreeMap). PriorityQueue is a good heap implementation that will sort by whatever comparator you give it, but it provides no way to access elements by some kind of index and no way to update and re-heapify an element (other than removing and re-adding it). Its single type parameter also means that I need to bundle the words and their counts into one object in order to use it.

My current "solution" is to store the counts in a Map while counting them, and then copy them all into a PriorityQueue to sort them:

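// Counts gathered during the counting pass (countStuff() is defined elsewhere)
Map<String, Integer> wordCounts = countStuff();
// Max-heap on count; PriorityQueue rejects an initial capacity below 1
PriorityQueue<NamedCount> sortedCounts = new PriorityQueue<>(Math.max(1, wordCounts.size()),
                                             Collections.reverseOrder());
// Copy every word/count pair into the heap
for (Map.Entry<String, Integer> count : wordCounts.entrySet()) {
    sortedCounts.add(new NamedCount(count.getKey(), count.getValue()));
}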

      

(Note that NamedCount is just a simple pair<String, int> that implements Comparable to compare the integers.) But this is inefficient, especially since the data set can be very large, and keeping two copies of the count set in memory is wasteful.
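A minimal sketch of what such a NamedCount could look like, assuming it only needs to hold a word and its count and to order itself by count. The field names, the toString format, and the small demo in main are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.PriorityQueue;

/** Hypothetical word/count pair, ordered by count only. */
final class NamedCount implements Comparable<NamedCount> {
    final String name;
    final int count;

    NamedCount(String name, int count) {
        this.name = name;
        this.count = count;
    }

    @Override
    public int compareTo(NamedCount other) {
        // Integer.compare avoids the overflow risk of "count - other.count".
        return Integer.compare(count, other.count);
    }

    @Override
    public String toString() {
        return name + ": " + count;
    }

    public static void main(String[] args) {
        // reverseOrder() turns the default min-heap into a max-heap on count.
        PriorityQueue<NamedCount> sortedCounts =
                new PriorityQueue<>(Collections.reverseOrder());
        sortedCounts.add(new NamedCount("the", 12));
        sortedCounts.add(new NamedCount("heap", 3));
        sortedCounts.add(new NamedCount("word", 7));
        // poll() hands back the entries from the highest count down.
        while (!sortedCounts.isEmpty()) {
            System.out.println(sortedCounts.poll());
        }
    }
}
```

Because compareTo looks only at the count, two different words with the same count compare as equal; that is acceptable for a heap but would not be for a sorted set.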

Is there a way to get random access to the objects inside a PriorityQueue, so that I can keep just one copy of the counts in the PriorityQueue and re-heapify as they are updated? Would it make sense to use a Map<String, NamedCount> that stores "pointers" to the objects in the PriorityQueue<NamedCount>?



2 answers


First, for the base data structure, Guava's Multiset<String> is generally preferable to Map<String, Integer>, in the same way that Set<String> is preferable to Map<String, Boolean>. It is a cleaner API and encapsulates the incrementing.
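To illustrate what "encapsulates the incrementing" refers to, here is the bookkeeping a plain Map<String, Integer> needs, written with the JDK only. MapCounting and countWords are illustrative names, not part of Guava or the JDK.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class MapCounting {
    /** Counts occurrences with a plain Map; every update must handle the "absent" case. */
    static Map<String, Integer> countWords(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // merge() stores 1 for a new word, otherwise adds 1 to the stored count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("foo", "bar", "baz", "baz"));
        System.out.println(counts.get("baz"));             // 2
        // A missing word has no entry, so a default must be supplied explicitly.
        System.out.println(counts.getOrDefault("qux", 0)); // 0
    }
}
```

With a Multiset, the loop body shrinks to a single add(word) call, and count(word) returns 0 for elements that were never added.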

Now, if it were me, I would implement a custom Multiset that adds some extra logic to index the counts and return them. Something like this:

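import com.google.common.collect.ForwardingMultiset;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.TreeMultimap;

class IndexedMultiset<T extends Comparable<T>> extends ForwardingMultiset<T> {

    // Holds the actual element counts
    private final Multiset<T> delegate = HashMultiset.create();
    // Secondary index: count -> elements with that count, sorted by count
    private final TreeMultimap<Integer, T> countIndex = TreeMultimap.create();

    @Override
    protected Multiset<T> delegate() {
        return delegate;
    }

    @Override
    public int add(T element, int occurrences) {
        int prev = super.add(element, occurrences);
        // Move the element from its old count bucket to its new one
        countIndex.remove(prev, element);
        countIndex.put(count(element), element);
        return prev;
    }

    @Override
    public boolean add(T element) {
        // Routes through add(element, 1) so the index stays in sync
        return super.standardAdd(element);
    }

    // similar for remove, setCount, etc.

}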

      

Then I would add whatever count-based query functionality is required. For example, retrieving an iterable of word/count pairs in descending order of count might look something like this:



public Iterable<CountEntry<T>> descendingCounts() {
    return countIndex.keySet().descendingSet().stream()
            .flatMap((count) -> countIndex.get(count).stream())
            .map((element) -> new CountEntry<>(element, count(element)))
            .collect(Collectors.toList());
}

public static class CountEntry<T> {
    private final T element;
    private final int count;

    public CountEntry(T element, int count) {
        this.element = element;
        this.count = count;
    }

    public T element() {
        return element;
    }

    public int count() {
        return count;
    }

    @Override
    public String toString() {
        return element + ": " + count;
    }
}

      

And all of this would be used like this:

public static void main(String... args) {
    IndexedMultiset<String> wordCounts = new IndexedMultiset<>();

    wordCounts.add("foo");
    wordCounts.add("bar");
    wordCounts.add("baz");
    wordCounts.add("baz");

    System.out.println(wordCounts.descendingCounts()); //[baz: 2, bar: 1, foo: 1]


    wordCounts.add("foo");
    wordCounts.add("foo");
    wordCounts.add("foo");

    System.out.println(wordCounts.descendingCounts()); //[foo: 4, baz: 2, bar: 1]
}
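The same indexing idea can also be sketched without Guava, which may help show what the count index does on each update. This is a JDK-only analogue under assumed names (IndexedCounter and its methods are hypothetical), not a replacement for the Multiset-based class: a HashMap holds the counts, and a TreeMap from count to the set of elements with that count plays the role of the TreeMultimap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical JDK-only analogue of the indexed multiset idea. */
final class IndexedCounter<T extends Comparable<T>> {
    private final Map<T, Integer> counts = new HashMap<>();
    // count -> elements that currently have that count, sorted by count
    private final TreeMap<Integer, TreeSet<T>> countIndex = new TreeMap<>();

    /** Increments the count of element and returns its previous count. */
    int add(T element) {
        int prev = counts.getOrDefault(element, 0);
        int next = prev + 1;
        counts.put(element, next);
        if (prev > 0) {
            // Take the element out of its old bucket, dropping the bucket if it empties.
            TreeSet<T> oldBucket = countIndex.get(prev);
            oldBucket.remove(element);
            if (oldBucket.isEmpty()) {
                countIndex.remove(prev);
            }
        }
        countIndex.computeIfAbsent(next, k -> new TreeSet<>()).add(element);
        return prev;
    }

    int count(T element) {
        return counts.getOrDefault(element, 0);
    }

    /** Returns "element: count" strings, highest count first, ties in natural order. */
    List<String> descendingCounts() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, TreeSet<T>> bucket : countIndex.descendingMap().entrySet()) {
            for (T element : bucket.getValue()) {
                result.add(element + ": " + bucket.getKey());
            }
        }
        return result;
    }
}
```

Each add costs a hash lookup plus a few O(log n) tree operations, so the sorted view stays current without a separate sorting pass. The price is that every element is referenced from both structures.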

      



If you can use third-party libraries, Guava's Multiset is specifically designed to solve this problem:



Multiset<String> multiset = HashMultiset.create();
for (String word : words) {
  multiset.add(word);
}
System.out.println(Multisets.copyHighestCountFirst(multiset));
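Where Guava is not an option, a similar highest-count-first ordering can be sketched with the JDK alone by sorting the entries of a count map once counting is finished. This is a different technique from the Multiset call above, and HighestCountFirst and sortByCount are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HighestCountFirst {
    /** Returns the word/count entries sorted by count descending, ties broken by word. */
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        // Copies only the entry references, not the words or counts themselves.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(
                Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : new String[] {"foo", "bar", "baz", "baz"}) {
            counts.merge(word, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> entry : sortByCount(counts)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The returned list shares its entries with the map, so it should be read before the map is modified again.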

      
