A strategy for splitting a set to obtain the minimum sum of variances from subsets

Question

A strategy for splitting a set to obtain the minimum sum of variances from subsets

The problem is this: I have a set of numbers and I need to divide it into k subsets. I have to find the best partitioning strategy to get the minimum amount of variance from each subset. The subset cannot be empty (Deviation is the square of the standard deviation.)

k is an integer greater than 0. The approximation can be 1e + 7

This is my solution so far, which works for some examples, but not always:

sort the pattern (set of numbers) in ascending order.
calculate the distance of two continuous elements. Build a list of lists, the sublist has left element index and distance (ie [[Idx, dist], [idx, dist] ......]). sort the list in descending order by distance.
use the indices on the list I have, get the indices from left to right to split the sort by ascending sort.

Python code:

class MinimumVariancePartition(object):
    def minDev(self, mixedSamples, k):
        # mixedSamples is a tuple, k is an integer.

        samples_ascending = sorted(mixedSamples)

        # Build a list of lists contains indices and distances.
        idx_dist = []
        for index in range(len(samples_ascending) - 1):
            starting_idx = index
            dist = abs(samples_ascending[index] - samples_ascending[index + 1])
            idx_dist.append([starting_idx, dist])

        sorted_idx_dist = sorted(idx_dist, key=lambda x: x[1], reverse=True)

        # Get a list of indices to split the sample.
        split_idx = []
        for i in range(k - 1):
            split_idx.append(sorted_idx_dist[i][0])

        # Get a list of subsets.    
        partitions = []
        if len(split_idx) == 0:
            partitions.append(mixedSamples)
        else:
            split_idx = sorted(split_idx)
            prev = 0
            for idx in split_idx:
                partitions.append(samples_ascending[prev:idx + 1])
                prev = idx + 1
            partitions.append(samples_ascending[prev:])

        # Compute the sum of variances
        result = 0
        for partition in partitions:
            variance = self.variance(partition)
            result += variance
        return result

    def variance(self, partition):
        # Compute variance of a subset
        size = len(partition)
        s = sum(partition)
        mean = float(s) / size
        variance = 0
        for n in partition:
            temp = round(n - mean, 14)**2  # use round() to avoid float number 'trick'
            variance += temp
        variance /= size
        return variance

passed tests:

input: (3, 4, 7, 10), 1
output: 7.5

input: (1000,500,1,500), 3
output: 0.0

input: (42,234,10,1,123,545,436,453,74,85,34,999), 5
output: 1700.7397959183672

failed tests:

input: (197, 611, 410, 779, 203, 15, 727, 446, 992, 722, 439, 296, 201, 820, 416, 272, 89, 146, 687, 203, 598, 65, 865, 945, 446, 783, 581, 270, 960, 22, 970, 698, 456, 706, 14, 901, 371, 688, 914, 925, 551, 15, 326, 620, 842, 82, 594, 99, 827, 660), 21
expected output: 757.3225
actual output: 824.586388889

input: (359, 408, 124, 89, 26, 878, 677, 341, 166, 434, 886, 539, 227, 420, 655, 330, 835, 378, 763, 401, 883, 332, 215, 424, 365, 841, 113, 825, 777, 969, 970, 668, 602, 708, 874, 930, 423, 549, 236), 13
expected output: 1588.0486111111109
actual output: 2163.79166667

input: (706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956), 4
expected output: 8172.465
actual output: 11259.875

I think the problem in my solution might be determining the index of the partition index, but still don't know why it doesn't work.

+3

python algorithm statistics

ChesterL 15 nov. 14 at 18:31

source to share

1 answer

kraskevich · Accepted Answer · 2014-11-15T18:58:14+0000

It doesn't work because the idea of your algorithm is wrong (splitting only by considering the distance between two adjacent elements does not always give the optimal solution).

Instead, you can use dynamic programming:
1. Sort the array.
2. Suppose f(first_free, sets_count)

is the minimum sum of variances if the item first_free

is the first item that has not yet been added to any set and the tags are accurately generated sets_count

.
3. Base unit f(0, 0) = 0

. It matches empty prefixes.
4. Transitions look like this:

for first_free = 0 ... n - 1:
    for new_first_free = first_free + 1 ... n:
        for sets_count = 0 ... k - 1:
            f(new_first_free, sets_count + 1) = min(f(new_first_free, sets_count + 1),
                f(first_free, sets_count) + variance of the subset [first_free, new_first_free - 1])

Answer f(n, k)

(where n

is the number of elements in the set).

Here is my implementation (it can be optimized, it's just a sketch, but it works correctly):

a = [706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956]
k = 4
mem = dict()
INF = 1e10


def get_variance(partition):
    size = len(partition)
    s = sum(partition)
    mean = float(s) / size
    variance = 0
    for n in partition:
        temp = round(n - mean, 14) ** 2
        variance += temp
    variance /= size
    return variance


def calc(pos, cnt):
    if (pos, cnt) in mem.keys():
        return mem[(pos, cnt)]
    if pos == 0 and cnt >= 0:
        return 0.0
    if cnt < 0:
        return INF
    res = INF
    for old_pos in range(0, pos):
        res = min(res, calc(old_pos, cnt - 1) + get_variance(a[old_pos: pos]))
    mem[(pos, cnt)] = res
    return res


if __name__ == '__main__':
    a.sort()
    print(calc(len(a), k))

A strategy for splitting a set to obtain the minimum sum of variances from subsets

More articles: