Calculate the median for a huge amount of data

I have a huge amount of data (> 10000000 elements) of type int, and after each new element I want to calculate the median (so I will end up with > 1,000,000 medians). Should I maintain a sorted list and insert each new item into it so that I can read off the median every time, or should I append and then re-sort the list on each insertion?

Also, would a std::vector be a suitable data structure for this, or would another data structure give better complexity?

Note: I cannot use std::set, as there may be duplicates in the data. If I use std::multiset, finding the median becomes harder, since I would have to iterate from the beginning to the middle to read its value.
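For reference, here is a minimal sketch of the sorted-vector approach described above (the function name `insert_and_median` is my own, not from the question). Each insertion finds its position by binary search with `std::lower_bound` in O(log n), but shifting the tail of the vector makes the insert O(n) overall; the upside is that the median is then a direct index lookup:

```cpp
#include <algorithm>
#include <vector>

// Insert x into the already-sorted vector v and return the new median.
// Binary search finds the spot in O(log n); the insert itself is O(n)
// because elements after the position must shift right.
double insert_and_median(std::vector<int>& v, int x) {
    v.insert(std::lower_bound(v.begin(), v.end(), x), x);
    std::size_t n = v.size();
    // Odd count: the middle element. Even count: mean of the two middles.
    return n % 2 ? v[n / 2]
                 : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

Re-sorting the whole vector after every append would cost O(n log n) per element, so of the two options in the question, keeping the vector sorted and inserting in place is clearly cheaper.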



1 answer


I would use std::multiset, since it handles duplicates and keeps the elements sorted automatically. I would insert the numbers one by one while maintaining an iterator that points at the median, stepping it forward or backward depending on whether the new element is greater or less than the current median.
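The idea above can be sketched as follows (a sketch, not the answerer's actual code; the class name `RunningMedian` and the convention that the iterator tracks the *lower* median are my own assumptions). Each insert is O(log n), and the median is read in O(1) from the tracked iterator instead of walking from the start to the middle:

```cpp
#include <iterator>
#include <set>

// Running median over a stream of ints: a std::multiset keeps the data
// sorted, and `mid` always points at the lower median, i.e. the element
// at index (n - 1) / 2 after n insertions.
class RunningMedian {
public:
    void insert(int x) {
        nums.insert(x);  // equal keys are inserted after existing ones
        if (nums.size() == 1) {
            mid = nums.begin();
        } else if (x < *mid) {
            // x landed to the left of mid, pushing mid's index up by one.
            // On an even total size the lower-median index is unchanged,
            // so step back; on an odd size it moved up, so stay put.
            if (nums.size() % 2 == 0) --mid;
        } else {
            // x landed at or after mid; mid's index is unchanged.
            // On an odd total size the lower-median index grew by one.
            if (nums.size() % 2 == 1) ++mid;
        }
    }

    double median() const {
        if (nums.size() % 2 == 1) return *mid;
        // Even count: average the lower median and its successor.
        return (*mid + *std::next(mid)) / 2.0;
    }

private:
    std::multiset<int> nums;
    std::multiset<int>::iterator mid;
};
```

Note that a plain (hint-less) `multiset::insert` places a duplicate key after the existing equal elements, which is why `x < *mid` versus `x >= *mid` is the right test even when the new value equals the current median.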



Note that if the data set is too large to hold comfortably in memory, you can page runs of the highest and lowest items out to files; the median is unlikely ever to move that far, and if it does, you can page them back in.


