A strategy for splitting a set to obtain the minimum sum of variances from subsets
The problem is this: I have a set of numbers and I need to divide it into k subsets. I have to find the best partitioning strategy to get the minimum amount of variance from each subset. The subset cannot be empty (Deviation is the square of the standard deviation.)
k is an integer greater than 0. The approximation can be 1e + 7
This is my solution so far, which works for some examples, but not always:
- sort the pattern (set of numbers) in ascending order.
- calculate the distance of two continuous elements. Build a list of lists, the sublist has left element index and distance (ie [[Idx, dist], [idx, dist] ......]). sort the list in descending order by distance.
- use the indices on the list I have, get the indices from left to right to split the sort by ascending sort.
Python code:
class MinimumVariancePartition(object):
def minDev(self, mixedSamples, k):
# mixedSamples is a tuple, k is an integer.
samples_ascending = sorted(mixedSamples)
# Build a list of lists contains indices and distances.
idx_dist = []
for index in range(len(samples_ascending) - 1):
starting_idx = index
dist = abs(samples_ascending[index] - samples_ascending[index + 1])
idx_dist.append([starting_idx, dist])
sorted_idx_dist = sorted(idx_dist, key=lambda x: x[1], reverse=True)
# Get a list of indices to split the sample.
split_idx = []
for i in range(k - 1):
split_idx.append(sorted_idx_dist[i][0])
# Get a list of subsets.
partitions = []
if len(split_idx) == 0:
partitions.append(mixedSamples)
else:
split_idx = sorted(split_idx)
prev = 0
for idx in split_idx:
partitions.append(samples_ascending[prev:idx + 1])
prev = idx + 1
partitions.append(samples_ascending[prev:])
# Compute the sum of variances
result = 0
for partition in partitions:
variance = self.variance(partition)
result += variance
return result
def variance(self, partition):
# Compute variance of a subset
size = len(partition)
s = sum(partition)
mean = float(s) / size
variance = 0
for n in partition:
temp = round(n - mean, 14)**2 # use round() to avoid float number 'trick'
variance += temp
variance /= size
return variance
passed tests:
input: (3, 4, 7, 10), 1
output: 7.5
input: (1000,500,1,500), 3
output: 0.0
input: (42,234,10,1,123,545,436,453,74,85,34,999), 5
output: 1700.7397959183672
failed tests:
input: (197, 611, 410, 779, 203, 15, 727, 446, 992, 722, 439, 296, 201, 820, 416, 272, 89, 146, 687, 203, 598, 65, 865, 945, 446, 783, 581, 270, 960, 22, 970, 698, 456, 706, 14, 901, 371, 688, 914, 925, 551, 15, 326, 620, 842, 82, 594, 99, 827, 660), 21
expected output: 757.3225
actual output: 824.586388889
input: (359, 408, 124, 89, 26, 878, 677, 341, 166, 434, 886, 539, 227, 420, 655, 330, 835, 378, 763, 401, 883, 332, 215, 424, 365, 841, 113, 825, 777, 969, 970, 668, 602, 708, 874, 930, 423, 549, 236), 13
expected output: 1588.0486111111109
actual output: 2163.79166667
input: (706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956), 4
expected output: 8172.465
actual output: 11259.875
I think the problem in my solution might be determining the index of the partition index, but still don't know why it doesn't work.
source to share
It doesn't work because the idea of your algorithm is wrong (splitting only by considering the distance between two adjacent elements does not always give the optimal solution).
Instead, you can use dynamic programming:
1. Sort the array.
2. Suppose f(first_free, sets_count)
is the minimum sum of variances if the item first_free
is the first item that has not yet been added to any set and the tags are accurately generated sets_count
.
3. Base unit f(0, 0) = 0
. It matches empty prefixes.
4. Transitions look like this:
for first_free = 0 ... n - 1:
for new_first_free = first_free + 1 ... n:
for sets_count = 0 ... k - 1:
f(new_first_free, sets_count + 1) = min(f(new_first_free, sets_count + 1),
f(first_free, sets_count) + variance of the subset [first_free, new_first_free - 1])
- Answer
f(n, k)
(wheren
is the number of elements in the set).
Here is my implementation (it can be optimized, it's just a sketch, but it works correctly):
a = [706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956]
k = 4
mem = dict()
INF = 1e10
def get_variance(partition):
size = len(partition)
s = sum(partition)
mean = float(s) / size
variance = 0
for n in partition:
temp = round(n - mean, 14) ** 2
variance += temp
variance /= size
return variance
def calc(pos, cnt):
if (pos, cnt) in mem.keys():
return mem[(pos, cnt)]
if pos == 0 and cnt >= 0:
return 0.0
if cnt < 0:
return INF
res = INF
for old_pos in range(0, pos):
res = min(res, calc(old_pos, cnt - 1) + get_variance(a[old_pos: pos]))
mem[(pos, cnt)] = res
return res
if __name__ == '__main__':
a.sort()
print(calc(len(a), k))
source to share