Sum of the maximum element from all possible subsets of size 'k' of the array

I have a very large list with about 10,000 items and each item has an integer of 5 billion. I would like to find the sum of the maximum elements from all possible subsets of size "k" (user-specified) of an array whose maximum size is 10000 elements. The only solution that comes to my mind is to generate each of the subsets (using itertools) and find its maximum element. But that would take an insane amount of time! What would be a pythonic way to solve this problem?

+3


source to share


1 answer


Don't use python, use math first. This is a combinatorial problem: if you have an array of S

n numbers (n is large) and generates all possible subsets of size k, you want to compute the sum of the maximum elements of the subsets.

Assuming the numbers are different (although they also work if they aren't), you can calculate exactly how often each will appear in the subset, and continue from there without even creating the subset. You had to pass it on math.stackexchange.com

, they would take you apart in an instant. Here it is, but without the pretty math notation:

Sort the array in ascending order and let it S_1

be the smallest (first) number, the S_2

next smallest, etc. (Note: indexing from 1).



  • S_n

    , the largest element is obviously the maximal element of any subset it is a part of and there are exactly (n-1 choose k-1)

    such subsets.

  • Of the subsets that do not contain S_n, there are (n-2 choose k-1)

    subsets containing S_{n-1}

    , in which it is the largest element.

  • Continue this until you reach S_k

    the smallest number k-th

    (counting from the smallest), which is the maximum exactly one subset: (k-1 choose k-1) = 1

    . Smaller numbers (from S_1

    to S_{k-1}

    ) can never be maximum: each set of elements k

    will contain something larger.

  • Sum up the above (n-k+1 terms)

    and there your answer is:

    S_n*(n-1 choose k-1) + S_{n-1}*(n-2 choose k-1) + ... + S_k*(k-1 choose k-1)
    
          

    Writing terms from smallest to largest is just the sum

    Sum(i=k..n) S_i * (i-1 choose k-1)    
    
          

If we were on math.stackexchange, you would get it in correct mathematical notation, but you get the idea.

+6


source







All Articles