How to improve the performance of boost_map searches
I am using boost::icl::interval_map
to map byte ranges to a set of strings. The map is loaded from a (sorted) disk file and then I go through the code using the following code.
The problem is that the search is very slow.
In my test, I entered 66425 ranges into the map. I have profiled the code and basically> 99.9% of the time is spent on various Boost features (there is no specific feature that is slow, there is a lot of time spreading over many features).
What can be done to make it faster?
Is my tree unbalanced (how do I know how it can be rebalanced?)
Is using set <string> the problem?
Calculates the intersection of the display and the problem window? (Although this is what I need, so I can't see how to do it).
using namespace std;
typedef set<string> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;
void
find_range (const ranges *map, uint64_t start, uint64_t end,
void (*f) (uint64_t start, uint64_t end, const char *object,
void *opaque),
void *opaque)
{
boost::icl::interval<uint64_t>::type window;
window = boost::icl::interval<uint64_t>::right_open (start, end);
ranges r = *map & window;
ranges::iterator iter = r.begin ();
while (iter != r.end ()) {
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ()) {
f (start, end, iter2->c_str (), opaque);
iter2++;
}
iter++;
}
}
The first few profile entries:
% cumulative self self total
time seconds seconds calls us/call us/call name
9.77 0.13 0.13 21866814 0.01 boost::icl::interval_bounds::interval_bounds(unsigned char)
6.02 0.21 0.08 9132134 0.01 boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::lower(boost::icl::discrete_interval<unsigned long, std::less> const&)
6.02 0.29 0.08 6004967 0.01 boost::enable_if<boost::icl::is_discrete_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::is_empty<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
5.26 0.36 0.07 21210093 0.00 boost::icl::discrete_interval<unsigned long, std::less>::bounds() const
5.26 0.43 0.07 11964109 0.01 std::less<unsigned long>::operator()(unsigned long const&, unsigned long const&) const
4.51 0.49 0.06 35761849 0.00 std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_S_left(std::_Rb_tree_node_base const*)
4.51 0.55 0.06 12009934 0.00 boost::icl::operator==(boost::icl::interval_bounds, boost::icl::interval_bounds)
3.76 0.60 0.05 12078493 0.00 boost::icl::discrete_interval<unsigned long, std::less>::upper() const
3.76 0.65 0.05 12077959 0.00 boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type>::type boost::icl::upper<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
3.76 0.70 0.05 8837475 0.01 boost::icl::interval_bounds::bits() const
3.76 0.75 0.05 6004967 0.01 boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::domain_less_equal<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&)
3.01 0.79 0.04 5891650 0.01 boost::icl::is_right_closed(boost::icl::interval_bounds)
Dataset: http://oirase.annexia.org/tmp/bmap.txt
Full Code: http://git.annexia.org/?p=virt-bmap.git;a=tree
source to share
In this answer, I present three optimizations:
-
replacing objects
std::set
withboost::container::flat_set
to improve location (and probably reallocation costs as most of the object sets are <4 items)UPDATE . In my latest version below, simply replacing
boost::container::flat_map
thestd::set
reduced performance by afind_range
factor of ~ 2x andfind_range_ex
by a factor of ~ 4x on my test system -
replaces the object identifier
std::string
withstring_atom
(which is technicallychar const*
, but logically unique). This method is similar to interned strings in various VM implementations (e.g. Java / .NET).NOTE . The current implementation is
make_atom
extremely simplistic and never frees atoms. You might want to put rows back in the deque, use Boost Flyweights, a pool allocator, or some combination of these to make it smarter. However, whether it depends on your use cases -
replace the intersection of the map with a call
equal_range
, which saves a ton of time by avoiding copying (large amounts) of data_ UPDATE When applying this optimization alone, the speedup is already 26 ~ 30x . Note that memory usage roughly doubles at ~ 20MiB compared to enabling all three optimizations _
Data volume and efficiency
Before I get started, I like to know what the data looks like. So, after writing some code to parse this bmap.txt
input, we get:
Parsed ok
Histogram of 66425 input lines
d: 3367
f: 20613
p: 21222
v: 21223
ranges size: 6442450944
ranges iterative size: 21223
Min object set: 1.000000
Max object set: 234.000000
Average object set: 3.129859
Min interval width: 1024.000000
Max interval width: 2526265344.000000
Average interval width: 296.445177k
First: [0,1048576)
Last: [3916185600,6442450944)
String atoms: 23904 unique in 66425 total
Atom efficiency: 35.986451%
As you can see, sets are usually ~ 3 items, and many are duplicated.
Using the object naming method make_atom
with boost::flat_set
reduced memory allocation from ~ 15GiB ~ 10Gib .
This optimization also reduces string comparison with pointer matching for set set and Combiner strategy interval_map
, so for large datasets this can have a lot of speedups.
Query efficiency
Query performance is indeed severely crippled by the partial copy of the input.
Don't copy, but view the overlap range by simply replacing:
const ranges r = *map & window;
ranges::const_iterator iter = r.begin ();
while (iter != r.end ()) {
from
auto r = map->equal_range(window);
ranges::const_iterator iter = r.first;
while (iter != r.second) {
On my system, running 10,000 of the same randomized queries from both versions, the speedup is 32x :
10000 'random' OLD lookups resulted in 156729884 callbacks in 29148ms
10000 'random' NEW lookups resulted in 156729884 callbacks in 897ms
real 0m31.715s
user 0m31.664s
sys 0m0.012s
Currently, parsing / statistics are used at runtime. Full reference code here: In Coliru
#define BOOST_RESULT_OF_USE_DECTYPE
#define BOOST_SPIRIT_USE_PHOENIX_V3
/* virt-bmap examiner plugin
* Copyright (C) 2014 Red Hat Inc.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <assert.h>
#include <boost/icl/interval.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/interval_map.hpp>
#include <boost/container/flat_set.hpp>
using namespace std;
/* Maps intervals (uint64_t, uint64_t) to a set of strings, where each
* string represents an object that covers that range.
*/
static size_t atoms_requested = 0;
static size_t atoms_unique_created = 0;
using string_atom = const char*;
string_atom make_atom(std::string&& s)
{
static std::set<std::string> s_atoms;
atoms_requested += 1;
auto it = s_atoms.find(s);
if (it != s_atoms.end())
return it->c_str();
atoms_unique_created += 1;
return s_atoms.insert(std::move(s)).first->c_str();
}
typedef boost::container::flat_set<string_atom> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;
ranges*
new_ranges (void)
{
return new ranges ();
}
void
free_ranges (ranges *mapv)
{
ranges *map = (ranges *) mapv;
delete map;
}
extern "C" void
insert_range (void *mapv, uint64_t start, uint64_t end, const char *object)
{
ranges *map = (ranges *) mapv;
objects obj_set;
obj_set.insert (obj_set.end(), object);
map->add (std::make_pair (boost::icl::interval<uint64_t>::right_open (start, end), // SEHE added std::
obj_set));
}
extern "C" void
iter_range (void *mapv, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
ranges *map = (ranges *) mapv;
ranges::iterator iter = map->begin ();
while (iter != map->end ()) {
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ()) {
f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
iter2++;
}
iter++;
}
}
extern "C" void
find_range (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
const ranges *map = (const ranges *) mapv;
boost::icl::interval<uint64_t>::type window;
window = boost::icl::interval<uint64_t>::right_open (start, end);
const ranges r = *map & window;
ranges::const_iterator iter = r.begin ();
while (iter != r.end ()) {
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ()) {
f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
iter2++;
}
iter++;
}
}
extern "C" void
find_range_ex (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
const ranges *map = (const ranges *) mapv;
boost::icl::interval<uint64_t>::type window;
window = boost::icl::interval<uint64_t>::right_open (start, end);
#if 0
const ranges r = *map & window;
ranges::const_iterator iter = r.begin ();
while (iter != r.end ()) {
#else
auto r = map->equal_range(window);
ranges::const_iterator iter = r.first;
while (iter != r.second) {
#endif
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ()) {
f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
iter2++;
}
iter++;
}
}
#include <memory>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
#include <fstream>
#include <chrono>
std::map<char, size_t> histo;
bool insert_line_of_input(ranges& bmap_data, uint64_t b, uint64_t e, char type, std::string& object) {
if (!object.empty())
histo[type]++;
//std::cout << std::hex << b << " " << e << " " << type << " " << object << "\n";
#if 0
object.insert(object.begin(), ':');
object.insert(object.begin(), type);
#endif
insert_range(&bmap_data, b, e, make_atom(std::move(object)));
return true;
}
std::vector<std::pair<uint64_t, uint64_t> > generate_test_queries(ranges const& bmap_data, size_t n) {
std::vector<std::pair<uint64_t, uint64_t> > queries;
queries.reserve(n);
for (size_t i = 0; i < n; ++i)
{
auto start = (static_cast<uint64_t>(rand()) * rand()) % bmap_data.size();
auto end = start + rand();
queries.emplace_back(start,end);
}
return queries;
}
ranges read_mapfile(const char* fname) {
std::ifstream ifs(fname);
boost::spirit::istream_iterator f(ifs >> std::noskipws), l;
ranges bmap_data;
namespace phx = boost::phoenix;
using namespace boost::spirit::qi;
uint_parser<uint64_t, 16> offset;
if (!phrase_parse(f,l,
("1 " >> offset >> offset >> char_("pvdf") >> as_string[lexeme[+graph] >> attr('/') >> lexeme[*~char_("\r\n")]])
[ _pass = phx::bind(insert_line_of_input, phx::ref(bmap_data), _1, _2, _3, _4) ] % eol >> *eol,
blank))
{
exit(255);
} else
{
std::cout << "Parsed ok\n";
}
if (f!=l)
std::cout << "Warning: remaining input '" << std::string(f,l) << "\n";
return bmap_data;
}
void report_statistics(ranges const& bmap_data) {
size_t total = 0;
for (auto e : histo) total += e.second;
std::cout << "Histogram of " << total << " input lines\n";
for (auto e : histo)
std::cout << e.first << ": " << e.second << "\n";
namespace ba = boost::accumulators;
ba::accumulator_set<double, ba::stats<ba::tag::mean, ba::tag::max, ba::tag::min> >
object_sets, interval_widths;
for (auto const& r : bmap_data)
{
auto width = r.first.upper() - r.first.lower();
assert(width % 1024 == 0);
interval_widths(width);
object_sets(r.second.size());
}
std::cout << std::fixed;
std::cout << "ranges size: " << bmap_data.size() << "\n";
std::cout << "ranges iterative size: " << bmap_data.iterative_size() << "\n";
std::cout << "Min object set: " << ba::min(object_sets) << "\n" ;
std::cout << "Max object set: " << ba::max(object_sets) << "\n" ;
std::cout << "Average object set: " << ba::mean(object_sets) << "\n" ;
std::cout << "Min interval width: " << ba::min(interval_widths) << "\n" ;
std::cout << "Max interval width: " << ba::max(interval_widths) << "\n" ;
std::cout << "Average interval width: " << ba::mean(interval_widths)/1024.0 << "k\n" ;
std::cout << "First: " << bmap_data.begin()->first << "\n" ;
std::cout << "Last: " << bmap_data.rbegin()->first << "\n" ;
std::cout << "String atoms: " << atoms_unique_created << " unique in " << atoms_requested << " total\n";
std::cout << "Atom efficiency: " << (atoms_unique_created*100.0/atoms_requested) << "%\n";
}
void perform_comparative_benchmarks(ranges const& bmap_data, size_t number_of_queries) {
srand(42);
auto const queries = generate_test_queries(bmap_data, number_of_queries);
using hrc = std::chrono::high_resolution_clock;
{
auto start = hrc::now();
size_t callbacks = 0;
for (auto const& q: queries)
{
find_range(&bmap_data, q.first, q.second,
[](uint64_t start, uint64_t end, const char *object, void *opaque) {
++(*static_cast<size_t*>(opaque));
}, &callbacks);
}
std::cout << number_of_queries << " 'random' OLD lookups resulted in " << callbacks
<< " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
}
{
auto start = hrc::now();
size_t callbacks = 0;
for (auto const& q: queries)
{
find_range_ex(&bmap_data, q.first, q.second,
[](uint64_t start, uint64_t end, const char *object, void *opaque) {
++(*static_cast<size_t*>(opaque));
}, &callbacks);
}
std::cout << number_of_queries << " 'random' NEW lookups resulted in " << callbacks
<< " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
}
}
int main() {
auto bmap = read_mapfile("bmap.txt");
report_statistics(bmap);
perform_comparative_benchmarks(bmap, 1000);
#if 0 // to dump ranges to console
for (auto const& r : bmap)
{
std::cout << r.first << "\t" << r.second.size() << "\t";
std::copy(r.second.begin(), r.second.end(), std::ostream_iterator<std::string>(std::cout, "\t"));
std::cout << "\n";
}
#endif
}
source to share
You have every chance against you with this code.
-
Memory usage.
With 66425 ranges, you are going through the L1 cache, and with rowset enabled, you are also using L2D and may even exceed L3. This means that you will often have a latency of 50-200 cpu cycles for each data access, and your out-of-order execution will only span 32 cycles, which means the processor will constantly stall. This is greatly facilitated if each memory is accessed sequentially through hardware preferencers. -
Pointer chasing the map.
Maps and collection are usually implemented as rb_tree with pointers. (interval_map could be different?). To further exacerbate the problem, pointer access will ensure that data is not consistently available, which means high latency will hit you. -
Pointer / virtual function call. Surprisingly, this doesn't show up in the top 12 unless you use more interval functions internally
f
. Later, when you solve other problems, you can see that each call to this function will delay X cycles for each call, where X is the length of the CPU pipeline.
If you are using perf to get performance data, add the execution result with perf stat -d
. This should reveal the problems mentioned above with a lot of cache misses and an unoccupied processor.
Usage is set<string>
bad because its pointer is chased, you should use vector<string>
instead, you will need to sort it yourself. This should speed up access f
, but does not mitigate other problems.
Adding a distributor that implements the arena interval_map
can speed up access, since the data must be better localized.
source to share