Finding the median without data structures

Question

Finding the median without data structures

(my code is written in Java, but the question is agnostic, I'm just looking for an algorithm idea)

So here's the problem: I created a method that just finds the median of a dataset (given as an array). Here's the implementation:

public static double getMedian(int[] numset) {
    ArrayList<Integer> anumset = new ArrayList<Integer>();
    for(int num : numset) {
        anumset.add(num);
    }
    anumset.sort(null);

    if(anumset.size() % 2 == 0) {
        return anumset.get(anumset.size() / 2);
    } else {
        return (anumset.get(anumset.size() / 2)
                   + anumset.get((anumset.size() / 2) + 1)) / 2;
    }
}

The teacher at school I came to challenged me to write a method to find the median again, but without using any data structures. This includes anything that can contain more than one value, so this includes strings, any form of arrays, etc. For a long time I even tried to imagine an idea and I was stumped. Any ideas?

+3

language-agnostic median

jake May 21 '15 at 14:56

source to share

3 answers

Sort array in place. Take an element in the middle of the array as you already do. No additional storage required.

It will take a while n log n

in Java. The best time possible is linear (you must check each element at least once to ensure the correct answer). For pedagogical purposes, the additional complexity reduction is not worth it.

If you cannot modify the array in-place, you will have to trade in significant additional time complexity to avoid using additional storage proportional to half the size of the input. (If you agree to accept approximations, you are not.)

+1

Jay Kominek May 21 '15 at 15:45

source to share

Some not very effective ideas:

For each value in the array, iterate through the array, counting the number of values less than the current value. If this number is "half" the length of the array, you have the median. O (n ^ 2) (Need to think about how to handle duplicate median values.)

You can improve performance slightly by keeping track of the minimum and maximum values. For example, if you have already determined that 50 is too high to be the median, then you can loop the counting score through the array for each value that is greater than or equal to 50. Similarly, if you have already determined that 25 is too low, you can skip the counting count for each value less than or equal to 25.

In C ++:

    int Median(const std::vector<int> &values) {
        assert(!values.empty());
        const std::size_t half = values.size() / 2;
        int min = *std::min_element(values.begin(), values.end());
        int max = *std::max_element(values.begin(), values.end());
        for (auto candidate : values) {
            if (min <= candidate && candidate <= max) {
                const std::size_t count =
                    std::count_if(values.begin(), values.end(), [&](int x)
                                    { return x < candidate; });
                if (count == half)     return candidate;
                else if (count > half) max = candidate;
                else                   min = candidate;
            }
        }
        return min + (max - min) / 2;
    }

Terrible performance, but it doesn't use data structures and doesn't modify the input array.

+1

Adrian mcarthy May 21 '15 at 18:08

source to share

Jerry coffin · Accepted Answer · 2015-05-21T22:16:30+0000

The usual task algorithm is the Hoare Select algorithm. This is very similar to quicksort, except that in quicksort you recursively sort both halves after partitioning, but for selection you only recursively call the section containing the object of interest.

For example, consider an input like this, where we find the fourth element:

[7, 1, 17, 21, 3, 12, 0, 5]

We will arbitrarily use the first item ( 7

) as our pivot. First, we split it like this (with the axis marked with *:

[1, 3, 0, 5,] * 7, [17, 21, 12]

We are looking for the fourth element and 7 is the fifth element, so we split (only) on the left side. We'll again use the first element as our pivot, giving (using {

and }

to mark the part of the input that we're just ignoring for now).

[0] 1 [3, 5] {7, 17, 21, 12}

1

turned out to be the second element, so we need to split elements on the right (3 and 5):

{0, 1} 3 [5] {7, 17, 21, 12}

Using it 3

as a pivot element, we end up with nothing left and 5

right. 3

is the third element, so we need to look to the right. This is the only element, so ( 5

) is our median.

By ignoring the unused side, this reduces the complexity from O (n log n) for sorting to only O (N) [although I overuse the notation a little - in this case we are dealing with the expected behavior, not the worst case, since big-O is usually does].

There is also a median of the median algorithm if you want to provide good behavior (on average, to a lesser extent).

This provides a guaranteed O (N) complexity.

Finding the median without data structures

More articles: