Java <key, value> collection to extract the minimum element in O(1) or O(log n) worst case

I iterate through a huge file, reading a key and a value from each line. I need to keep a specific number (for example, 100k) of the elements with the highest values. To store them, I figured I need a collection that lets me check the minimum value in O(1) or O(log n); if the value just read is higher than that minimum, I remove the element with the minimum value and insert the new one. Which collection allows me to do this? The values are not unique, so BiMap probably doesn't fit here.

EDIT: The end goal is to get the best [key, value] pairs to be used later. Let's say my file looks like this (the first column is the key, the second is the value):
3 6
5 9
2 7
1 6
4 5
Suppose I am looking for the best two elements and an algorithm to find them. I decided to use a key-based collection to store the best items. The first two items (<3, 6>, <5, 9>) will obviously be added to the collection, since its capacity is 2. But when I get to the third row, I need to check whether <2, 7> can be added to the collection, so I need to be able to check whether 7 is greater than the minimum value in the collection (6).

+3




5 answers


It looks like you don't really need a key-based structure, because you are just looking for the largest N values with their respective keys, and the keys are not used for sorting or lookup in this problem.

I would use a PriorityQueue with the minimum value at the root. This lets you read the smallest element in constant time and, if the next value is larger, remove the minimum and insert the new element in O(log N).



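import java.util.Comparator;

// Holds one key/value pair read from the file.
class V {
    int key;
    int value;
}

// Orders pairs by value, so a PriorityQueue keeps the minimum at the root.
class ComparatorV implements Comparator<V> {
    @Override
    public int compare(V a, V b) {
        return Integer.compare(a.value, b.value);
    }
}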
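The approach described above can be sketched end to end as follows. This is only a minimal illustration, not the answerer's code: the names TopN, Pair and topN are hypothetical, the input is assumed to be already parsed into int pairs, and List.of assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical names throughout: TopN, Pair and topN are illustrative only.
class TopN {
    // One key/value pair; both fields are ints, as in the sample file.
    static final class Pair {
        final int key;
        final int value;
        Pair(int key, int value) { this.key = key; this.value = value; }
        @Override public String toString() { return "<" + key + ", " + value + ">"; }
    }

    // Returns the n pairs with the highest values, sorted by descending value.
    static List<Pair> topN(Iterable<Pair> input, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on value: peek() returns the smallest retained pair in O(1).
        PriorityQueue<Pair> heap =
                new PriorityQueue<>(Comparator.comparingInt((Pair p) -> p.value));
        for (Pair p : input) {
            if (heap.size() < n) {
                heap.add(p);                          // still filling up: O(log n)
            } else if (p.value > heap.peek().value) {
                heap.poll();                          // drop the current minimum: O(log n)
                heap.add(p);                          // insert the better pair: O(log n)
            }
        }
        List<Pair> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingInt((Pair p) -> p.value).reversed());
        return result;
    }

    public static void main(String[] args) {
        // The five rows from the question.
        List<Pair> rows = List.of(new Pair(3, 6), new Pair(5, 9), new Pair(2, 7),
                                  new Pair(1, 6), new Pair(4, 5));
        System.out.println(topN(rows, 2)); // prints [<5, 9>, <2, 7>]
    }
}
```

Because the heap never holds more than n pairs, memory stays bounded by n no matter how large the file is, and pairs with equal values are kept without any special handling.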

      

+1




In your particular situation, you can use a TreeSet, and to get around the uniqueness of elements in a set, you can store pairs that are comparable but never compare as equal. This lets you sidestep the Set contract, which specifies that a Set does not contain equal elements.

The documentation for TreeSet says:

The behavior of a set is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Set interface.

Therefore, using a TreeSet with a Comparable that is inconsistent with equals should be fine in this situation. If you ever need to compare your ChessPair objects for a different reason (maybe some other algorithm in the same app) where the comparison must be consistent with equals, then supply a Comparator for that other use. Note that TreeSet has a constructor that takes a Comparator, so you can use that instead of having ChessPair implement Comparable.

Note: a TreeSet generally provides more flexibility than a PriorityQueue because of all its utility methods, but by violating the "consistent with equals" expectation of Set, some of TreeSet's functionality is lost. For example, you can still remove the first element of the set with TreeSet.pollFirst, but you cannot remove an arbitrary element with remove, since that relies on finding an element that compares as equal.
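The trade-off in the note above can be observed with a small sketch. It is an illustration under assumptions rather than part of the answer: pairs are modelled as int[]{key, value} arrays, and NeverEqualDemo and BY_VALUE_NEVER_EQUAL are hypothetical names.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical demo class; pairs are modelled as int[]{key, value}.
class NeverEqualDemo {
    // Orders pairs by value and never reports equality. This deliberately
    // breaks the usual Comparator contract, as the answer describes.
    static final Comparator<int[]> BY_VALUE_NEVER_EQUAL =
            (a, b) -> a[1] < b[1] ? -1 : 1;

    public static void main(String[] args) {
        TreeSet<int[]> set = new TreeSet<>(BY_VALUE_NEVER_EQUAL);
        int[] first = {3, 6};
        set.add(first);
        set.add(new int[] {1, 6});   // same value as the first pair, still stored
        set.add(new int[] {5, 9});

        System.out.println(set.size());          // prints 3
        System.out.println(set.remove(first));   // prints false: no element compares as equal
        System.out.println(set.contains(first)); // prints false, for the same reason
        System.out.println(set.pollFirst()[1]);  // prints 6: removal from the ends still works
        System.out.println(set.size());          // prints 2
    }
}
```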

As for the requirement of O(1) or, at worst, O(log n), the documentation also states:

This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).

I also provide an optimization below that reduces the minimum-value query to O(1).

Example

TreeSet<ChessPair> set = new TreeSet<>();

      

and

public class ChessPair implements Comparable<ChessPair>
{
    final int location;
    final int value;

    public ChessPair(final int location, final int value)
    {
        this.location = location;
        this.value = value;
    }

    // Deliberately never returns 0, so pairs with equal values can coexist in a TreeSet.
    @Override
    public int compareTo(ChessPair o)
    {
        if(value < o.value) return -1;
        return 1;
    }
}

      

Now you have an ordered set containing your pairs of numbers. They are ordered by value, you can have duplicate values, and you can get the corresponding locations. You can also easily grab the first element (set.first), the last (set.last), get a subset (set.subSet(a,b)), or iterate over the first (or last, using descendingSet) n elements. This provides everything you asked for.

Usage example

You have indicated that you want to keep the top 100,000 items. So I would use one algorithm for the first 100,000 items, which are simply added every time.

for(int i = 0; i < 100000 && dataSource.hasNext(); i += 1)
{
    ChessPair p = dataSource.next(); // or whatever you do to get the next line
    set.add(p);
}

      

and then another after that



while(dataSource.hasNext())
{
    ChessPair p = dataSource.next();
    if(p.value > set.first().value)
    {
        set.pollFirst(); // removes and returns the current minimum
        set.add(p);
    }
}

      

Optimization

In your case, you can build an optimization into the step where you compare against the lowest value. The simple version above performs an O(log(n)) operation every time it compares against the minimum, since set.first() is O(log(n)). Instead, you can store the minimum value in a local variable.

This optimization scales well. Its impact is negligible (no gain, no loss) when n is close to the total dataset size (e.g. you want the best 100 values out of 110), but when the total dataset is significantly larger than n (e.g. 100,000 out of 100,000,000,000), querying the minimum value will be your most common operation, and it will now be constant time.

So now we have (after loading the initial n values):

int minimum = set.first().value;
while(dataSource.hasNext())
{
    ChessPair p = dataSource.next();
    if(p.value > minimum)
    {
        set.pollFirst(); // removes and returns the current minimum
        set.add(p);
        minimum = set.first().value;
    }
}

      

Now your most common operation, querying the minimum value, is constant time (O(1)); your second most common operation, add, is worst case O(log(n)); and your least common operation, remove, is also worst case O(log(n)).

For arbitrarily large datasets, most inputs fail the comparison against the cached minimum and are therefore processed in constant O(1) time, provided the values do not arrive in mostly ascending order.

See java.util.TreeSet


Previous answer (now outdated)

Based on the edit to the question and the discussion in its comments, I no longer believe my original answer is correct. I leave it below for reference.

If you want a Map collection that gives you quick access to items based on order, then you need an ordered Map, for which there is a sub-interface, SortedMap. Luckily for you, Java has a great implementation of SortedMap: TreeMap, a Map backed by a red-black tree, which is an ordered tree.

Red-black trees are nice because they rotate branches to keep the tree balanced. That is, you will not end up with a tree that branches n times in one direction, yielding n layers, just because your data happens to be sorted already. You are guaranteed approximately log(n) layers in the tree, so it is always fast and guarantees log(n) queries even in the worst case.

Try java.util.TreeMap for your situation. Its documentation page also links to Map and SortedMap. You should check out SortedMap too, so you can see where TreeMap gets the specific functionality you are looking for. It lets you get the first key, the last key, and a sub-map covering a range of the map.

For your situation, it is probably enough to just grab an iterator from the TreeMap and iterate over the first n pairs, where n is the number of lowest (or highest) values you want.
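For reference, the SortedMap operations mentioned in this withdrawn answer look roughly like the sketch below. TreeMapDemo and firstN are hypothetical names, and the data is the sample from the question. Note that a TreeMap orders entries by key, not by value, which is presumably why this approach does not fit the edited question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class illustrating first key, last key, sub-map and "first n".
class TreeMapDemo {
    // Returns the first n entries of the map in ascending key order.
    static <K, V> List<Map.Entry<K, V>> firstN(TreeMap<K, V> map, int n) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> e : map.entrySet()) {
            if (out.size() == n) {
                break;
            }
            out.add(e);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> map = new TreeMap<>();
        map.put(3, 6);
        map.put(5, 9);
        map.put(2, 7);
        map.put(1, 6);
        map.put(4, 5);

        System.out.println(map.firstKey());    // prints 1
        System.out.println(map.lastKey());     // prints 5
        System.out.println(map.subMap(2, 4));  // prints {2=7, 3=6} (upper bound is exclusive)
        System.out.println(firstN(map, 2));    // prints [1=6, 2=7]
    }
}
```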

0




Use a TreeSet, which offers O(log n) insertion and O(log n) access to the lowest or highest item.

Your class should:

  • Implement Comparable

  • Not override equals()

To keep only the top 100K items, use this code:

Item item; // to add
if (treeSet.size() == 100_000) {
    if (treeSet.first().compareTo(item) < 0) {
        treeSet.remove(treeSet.first());
        treeSet.add(item);
    }
} else {
    treeSet.add(item);
}

      

0




If you need a collection ordered by value, you can use a TreeSet that stores tuples of your keys and values. TreeSet has O(log(n)) access time.

class KeyValuePair<Key, Value extends Comparable<Value>> implements Comparable<KeyValuePair<Key, Value>> {
    Key key;
    Value value;

    KeyValuePair(Key key, Value value) {
        this.key = key;
        this.value = value;
    }

    // Orders pairs by value only; pairs with equal values compare as equal.
    @Override
    public int compareTo(KeyValuePair<Key, Value> other) {
        return this.value.compareTo(other.value);
    }
}

      

Alternatively, instead of implementing Comparable, you can pass a Comparator to the collection at creation time.

Then you can get the first (lowest) value with treeSet.first().value.
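Passing a Comparator at creation time might look like the sketch below. It is an assumption-laden illustration, not this answer's code: KV, ComparatorTreeSetDemo and BY_VALUE_THEN_KEY are hypothetical names, and the tie-break on key is an addition that assumes keys are unique, so that pairs sharing a value are not silently dropped by the set.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical demo class; KV is an illustrative pair type with int fields.
class ComparatorTreeSetDemo {
    static final class KV {
        final int key;
        final int value;
        KV(int key, int value) { this.key = key; this.value = value; }
    }

    // Order by value first, then by key, so pairs with equal values stay distinct.
    // Assumes keys are unique, as they are in the sample file.
    static final Comparator<KV> BY_VALUE_THEN_KEY =
            Comparator.comparingInt((KV p) -> p.value).thenComparingInt(p -> p.key);

    public static void main(String[] args) {
        TreeSet<KV> treeSet = new TreeSet<>(BY_VALUE_THEN_KEY);
        treeSet.add(new KV(3, 6));
        treeSet.add(new KV(5, 9));
        treeSet.add(new KV(2, 7));
        treeSet.add(new KV(1, 6));   // same value as <3, 6>, kept thanks to the tie-break
        treeSet.add(new KV(4, 5));

        System.out.println(treeSet.size());        // prints 5
        System.out.println(treeSet.first().value); // prints 5
        System.out.println(treeSet.last().value);  // prints 9
    }
}
```

With a comparator like this one, remove and contains keep working, because two pairs compare as equal only when both value and key match.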

-1




Something like this?

First, a data structure that can be sorted by value:

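import java.util.Objects;

class Entry implements Comparable<Entry> {
    public final String key;
    public final long value;
    public Entry(String key, long value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public int compareTo(Entry other) {
        // Long.compare avoids the narrowing and overflow of this.value - other.value
        return Long.compare(this.value, other.value);
    }

    @Override
    public int hashCode() {
        // hash code based on the same fields that an equals() override would use
        return Objects.hash(key, value);
    }
}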

      

Then some code that works with a PriorityQueue. Sorting is based on value, not on key as in TreeMap, because of the compareTo method defined in Entry. If the queue grows beyond maxSize (100,000 here), the bottom entry (the one with the lowest value) is removed.

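import java.util.PriorityQueue;

public class ProcessData {
    private final int maxSize;
    private final PriorityQueue<Entry> largestEntries;

    public ProcessData(int maxSize) {
        this.maxSize = maxSize;
        // Created here rather than in a field initializer, where maxSize would still be 0.
        // Capacity is maxSize + 1 because an entry is added before the minimum is evicted.
        this.largestEntries = new PriorityQueue<>(maxSize + 1);
    }

    public void addKeyValue(String key, long value) {
        largestEntries.add(new Entry(key, value));
        if (largestEntries.size() > maxSize) {
            largestEntries.poll(); // evict the entry with the lowest value
        }
    }
}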

      

-1

