Beating the medians
You're collecting a fairly large number of subsystem Q measurements in a std::vector and need to find the median. What's a good way to leverage the standard C++ library for this simple task?
If we follow the definition of median as the middle sample in an ordered sequence, we might end up with an implementation similar to the following (ignore the issue of a potentially even number of samples for now – assume that subsystem Q samples cannot be meaningfully interpolated so that we always need to pick one):
using Q::sample;
sample& Q::median_sample(std::vector<sample>& samples)
{
std::sort(samples.begin(), samples.end());
return samples.at(samples.size() / 2);
}
Simple and straightforward code, right? We're letting std::sort do all the heavy lifting for us, making the sequence ordered, after which the median sample is conveniently located in the middle of the vector. Still, there's a problem with the above code that may not be obvious unless you have thought about it.
Making the computer work hard is all well and good – that's what it's paid to do, after all. Indeed, good programming practice are sometimes similar to good employment practice; we pay for our resources (silicon- and carbon-based, respectively) whether they work or not, and idle resources are wasted resources.
But inefficient work is also a kind of waste. Neither computers nor employees should have to break their backs working harder than necessary. While C++ has a union, it's not likely to complain about worker exploitation. Your users may not be as accommodating about substandard performance, however.
By sorting the entire vector even though we're only going to use the middle sample, we are doing an awful lot of work needlessly. Luckily for us, the C++ standard library offers an often overlooked algorithm called std::nth_element which is well suited for solving a variety of selection problems:
template <class RandIter>
void std::nth_element(RandIter first, RandIter nth, RandIter last);
The first and third arguments to std::nth_element indicate a range, while the second argument indicates a position (the nth element) in that range. The algorithm arranges for this position to contain the element that would be in that position if the entire range were sorted. In addition it will move any elements that were before the nth position but would be after the nth position in a sorted range so that they are after the nth position, and vice versa.
The latter guarantee is not critical for median element calculation, but this property is helpful for understanding the algorithm and frequently useful in its own right. To help visualize what std::nth_element does, consider an unsorted range like this:
![[99] [110] [93] [102] [259] [94] [93] [95] [99] [101] [99] [102] [137] [112] [97]](/static/2005/05/before-nth_element.png)
In this case the middle element is our chosen nth element. After std::nth_element, the data will be arranged in a way similar to this:
![[95] [93] [93] [94] [97] [99] [99] [99] [102] [259] [101] [102] [137] [112] [110]](/static/2005/05/after-nth_element.png)
I had to say similar because there is no guarantee regarding the relative order within subranges on either side of the nth element. It's interesting to note that if we were to sort those two subranges separately we would end up with a fully sorted range. This suggests that doing std::nth_element recursively on subranges might be a valid sorting algorithm. In fact, that would be an implementation of quicksort.
The operation performed by std::nth_element is actually a generalized form of the quicksort partitioning step. Since doing one step of a divide-and-conquer algorithm must be faster than doing the entire thing, it should come as no surprise whatsoever that the average complexity of std::nth_element is Θ(n), i.e. linear, while that of std::sort is Θ(n log n).
Getting back to the original problem, the (average) complexity of our median calculation can be reduced by a factor of log n by using std::nth_element rather than sorting:
using Q::sample;
sample& Q::median_sample(std::vector<sample>& samples)
{
std::vector<sample>::iterator i = samples.begin();
std::vector<sample>::size_type m = samples.size() / 2;
std::nth_element(i, i + m, samples.end());
return samples.at(m);
}
That log n factor gets significant fairly quickly, and in practice the difference is even larger than the complexity bounds might lead you to believe. Choosing the appropriate STL algorithm can easily make the difference between multi- and sub-second response times for your application.
28 May, 2005
Feedback
Feedback is closed for this entry.