Find highest N numbers in an infinite list

I was asked this question in a recent Java interview.

Given a list of millions of items, keep the list of the highest n items. Sorting the list in descending order and then taking the first n elements is definitely inefficient due to the size of the list.

This is what I did, I would appreciate it if someone could provide a more efficient or elegant solution as I believe this can be solved with PriorityQueue

:

public TreeSet<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList, 
final int highestValCount) {

    TreeSet<Integer> highestNNumbers = new TreeSet<Integer>();

    for (int number : largeNumbersList) {
        if (highestNNumbers.size() < highestValCount) {
            highestNNumbers.add(number);
        } else {
            for (int i : highestNNumbers) {
                if (i < number) {
                    highestNNumbers.remove(i);
                    highestNNumbers.add(number);
                    break;
                }
            }
        }
    }
    return highestNNumbers;
}

      

+3


source to share


4 answers


You don't need nested loops, just keep inserting and removing the smallest number when the set is too big:

public Set<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList, 
  final int highestValCount) {

  TreeSet<Integer> highestNNumbers = new TreeSet<Integer>();

  for (int number : largeNumbersList) {
    highestNNumbers.add(number);
    if (highestNNumbers.size() > highestValCount) {
      highestNNumbers.pollFirst();
    }
  }
  return highestNNumbers;
}

      



The same code should work with PriorityQueue

. The lead time should be O(n log highestValCount)

anyway.

PS As pointed out in another answer, you can optimize this a little more (at the expense of readability) by keeping track of the smallest number, avoiding unnecessary insertions.

+3


source


The loop for

at the bottom is unnecessary because you can tell right away whether to save number

or not.

TreeSet

allows you to find the smallest element in O(log N)

* . Compare this smallest element with number

. If number

more, add it to the set and remove the smallest item. Otherwise, keep going to the next item largeNumbersList

.

In the worst case, the original list is sorted in ascending order, because you will need to replace the element in TreeSet

at each step. In this case, the algorithm will take O(K log N)

, where K

is the number of items in the original list, improving the log N K by solving the sorting of the array.



Note. If your list consists of Integer

s, you can use a non-comparison linear sorting algorithm to get the total asymptotic complexity up to O(K)

. This does not mean that the linear solution will necessarily be faster than the original for any fixed number of elements.

* You can store the value of the smallest element when you go to make it O(1)

.

+6


source


It is possible to support amortization of O (1) processing of new items and O (n) queries of the current top items as follows:

Maintain a 2n buffer, and whenever you see a new item, add it to the buffer. When the buffer fills up, use quickselect or another linear median search algorithm to select the current top n elements and discard the rest. This is an O (n) operation, but you only need to perform it every n elements, which balances to O (1) amortized time.

This is the algorithm that Guava uses for Ordering.leastOf , which pulls the top n elements from an Iterator or Iterable. In practice, it is fast enough to compete with the PriorityQueue approach and it is much more robust to worst case entry.

+4


source


I would say your question as said is impossible. It is impossible to find the highest elements n

in List

without completely passing it. And there is no way to completely overcome the infinite List

.

However, the text of your question is different from the title. There is a significant massive difference between very large and infinite. Please keep this in mind.

To answer a possible question, I would start by implementing a buffer class to encapsulate the behavior of keeping the top n

, lets call it TopNBuffer

:

class TopNBuffer<T extends Comparable<T>> {
    private final NavigableSet<T> backingSet = new TreeSet<>();

    private final int limit;

    public TopNBuffer(int limit) {
        this.limit = limit;
    }

    public void add(final T t) {
        if (backingSet.add(t) && backingSet.size() > limit) {
            backingSet.pollFirst();
        }
    }

    public SortedSet<T> highest() {
        return Collections.unmodifiableSortedSet(backingSet);
    }
}

      

Everything we do here on add

, if the number is not unique, and adding the number makes it Set

over its limit, then we just remove the bottom-most element from Set

.

The method highest

gives an unmodifiable view of the current highest items. So, in Java 8 syntax, all you have to do is:

final TopNBuffer<Integer> topN = new TopNBuffer<>(n);
largeNumbersList.foreach(topN::add);
final Set<Integer> highestN = topN.highest();

      

I think it's not enough in an interview environment to just slap a lot of code into a method. Demonstration of understanding of OO programming and separation of concerns are also important.

0


source







All Articles