C ++ random access iterators for containers with items loaded on demand
I am currently working on a small project that requires loading messages from a file. Messages are saved sequentially in the file and the files can get huge, so loading the entire contents of the file into memory is not warranted.
Therefore, we decided to implement a class FileReader
that can quickly navigate to specific elements in a file and load them on demand. Usually used something along the following lines
SpecificMessage m;
FileReader fr;
fr.open("file.bin");
fr.moveTo(120); // Move to Message #120
fr.read(&m); // Try deserializing as SpecificMessage
The FileReader itself works great. So we thought about adding STL-compatible iterator support: A random access iterator that provides read-only references for specific messages. Used as follows
for (auto iter = fr.begin<SpecificMessage>(); iter != fr.end<SpecificMessage>(); ++iter) {
// ...
}
Note: the above assumes that the file contains only SpecificMessages. We used boost::iterator_facade
to simplify implementation.
Now my question boils down to this: how to properly implement an iterator? Since FileReader
it doesn't actually contain the sequence of messages inside, but downloads them on demand.
What we have tried so far:
Storing a message as a member of an iterator
This approach stores the message in an iterator instance. Which is great for simple use cases, but not good for more complex purposes. For example. std::reverse_iterator
has a dereference operation that looks like this:
reference operator*() const
{ // return designated value
_RanIt _Tmp = current;
return (*--_Tmp);
}
This breaks our approach when a reference is returned from a temporary iterator.
Make reference type equal to value type
@DDrmmr in the comments suggesting to make the reference type equal to the value type so that a copy of the internally stored object is returned. However, I think this is not true for a reverse iterator that implements the -> operator as
pointer operator->() const {
return (&**this);
}
which itself shares, calls the * operator, which then returns a copy of the temporary and finally returns the address of that temporary.
Saving a message from the outside
Alternatively, although I want to store the message externally:
SpecificMessage m;
auto iter = fr.begin<SpecificMessage>(&m);
// ...
which also seems to be wrong for
auto iter2 = iter + 2
which will have both iter2
, and iter
will point to the same content.
source to share
As I hinted in another answer, you could use memory mapped files. In a comment, you asked:
As far as memory mapped files are concerned, that doesn't seem to be what I want to have, how could you provide them with an iterator over SpecificMessages for them?
Ok, if your SpecificMessage is a POD type, you can just iterate over the raw memory directly. If not, you can have a deserialization helper (as you already have) and use Boost to do the deserialization on demand. transform_iterator
Note that we can make a managed file with memory, which effectively means that you can just use it like a normal heap and you can store all the standard containers. This includes node-based containers ( map<>
for example), dynamically sized containers (for example vector<>
) in addition to fixed sized containers ( array<>
) - and any combination of these.
Here's a demo that uses a simple SpecificMessage
one that contains a string and (de) derives directly into shared memory:
using blob_t = shm::vector<uint8_t>;
using shared_blobs = shm::vector<blob_t>;
The part you are interested in will be the consumption part:
bip::managed_mapped_file mmf(bip::open_only, DBASE_FNAME);
shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());
using It = boost::transform_iterator<LazyLoader<SpecificMessage>, shared_blobs::const_reverse_iterator>;
// for fun, let reverse the blobs
for (It first(table->rbegin()), last(table->rend()); first < last; first+=13)
std::cout << "blob: '" << first->contents << "'\n";
// any kind of random access is okay, though:
auto random = rand() % table->size();
SpecificMessage msg;
load(table->at(random), msg);
std::cout << "Random blob #" << random << ": '" << msg.contents << "'\n";
So this prints every 13th message in reverse order, followed by a random blob.
Full demo
The online example uses source strings as "messages".
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/container/scoped_allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <iostream>
#include <boost/iterator/transform_iterator.hpp>
#include <boost/range/iterator_range.hpp>
static char const* DBASE_FNAME = "database.map";
namespace bip = boost::interprocess;
namespace shm {
using segment_manager = bip::managed_mapped_file::segment_manager;
template <typename T> using allocator = boost::container::scoped_allocator_adaptor<bip::allocator<T, segment_manager> >;
template <typename T> using vector = bip::vector<T, allocator<T> >;
}
using blob_t = shm::vector<uint8_t>;
using shared_blobs = shm::vector<blob_t>;
struct SpecificMessage {
// for demonstration purposes, just a string; could be anything serialized
std::string contents;
// trivial save/load serialization code:
template <typename Blob>
friend bool save(Blob& blob, SpecificMessage const& msg) {
blob.assign(msg.contents.begin(), msg.contents.end());
return true;
}
template <typename Blob>
friend bool load(Blob const& blob, SpecificMessage& msg) {
msg.contents.assign(blob.begin(), blob.end());
return true;
}
};
template <typename Message> struct LazyLoader {
using type = Message;
Message operator()(blob_t const& blob) const {
Message result;
if (!load(blob, result)) throw std::bad_cast(); // TODO custom excepion
return result;
}
};
///////
// for demo, create some database contents
void create_database_file() {
bip::file_mapping::remove(DBASE_FNAME);
bip::managed_mapped_file mmf(bip::open_or_create, DBASE_FNAME, 1ul<<20); // Even sparse file size is limited on Coliru
shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());
std::ifstream ifs("main.cpp");
std::string line;
while (std::getline(ifs, line)) {
table->emplace_back();
save(table->back(), SpecificMessage { line });
}
std::cout << "Created blob table consisting of " << table->size() << " blobs\n";
}
///////
void display_random_messages() {
bip::managed_mapped_file mmf(bip::open_only, DBASE_FNAME);
shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());
using It = boost::transform_iterator<LazyLoader<SpecificMessage>, shared_blobs::const_reverse_iterator>;
// for fun, let reverse the blobs
for (It first(table->rbegin()), last(table->rend()); first < last; first+=13)
std::cout << "blob: '" << first->contents << "'\n";
// any kind of random access is okay, though:
auto random = rand() % table->size();
SpecificMessage msg;
load(table->at(random), msg);
std::cout << "Random blob #" << random << ": '" << msg.contents << "'\n";
}
int main()
{
#ifndef CONSUMER_ONLY
create_database_file();
#endif
srand(time(NULL));
display_random_messages();
}
source to share
You're in trouble because your iterator doesn't meet the requirements of a direct iterator. In particular:
-
*i
must be an lvalue reference tovalue_type
orconst value_type
([forward.iterators] /1.3) -
*i
cannot be a reference to an object stored in the iterator itself, due to the requirement that two iterators are equal if and only if they are bound to the same object ([forward.iterators] / 6)
Yes, these requirements are a huge pain in the butt, and yes, that means things like std::vector<bool>::iterator
are not random access iterators, although some standard library implementations incorrectly state they are.
EDIT: The following proposed solution is horribly broken as dereferencing the temporary iterator returns a reference to an object, which may not work until the reference is used. For example, after the auto& foo = *(i + 1);
referenced object foo
may have been released. The implementation reverse_iterator
referenced in the OP will cause the same problem.
I suggest you split the design into two classes: FileCache
one that stores file resources and a cache of loaded messages, and FileCache::iterator
one that contains the message number and lazily retrieves it from FileCache
when dereferenced. The implementation can be as simple as storing the container weak_ptr<Message>
in FileCache
and shared_ptr<Message>
out of an iterator: Simple hit demo
source to share
I must admit that I cannot fully understand what you have while holding the current MESSAGE as an Iter member. I would associate each iterator with the FileReader it needs to read, and implement it as a lightweight encapsulation of the read index for FileReader: :( read | moveTo). The most important overwrite method is boost::iterator_facade<...>::advance(...)
, which changes the current index and tries to pull a new MESSAGE from the FileReader. If that fails, it ignores the iterator as invalid and dereferencing will fail.
template<class MESSAGE,int STEP>
class message_iterator;
template<class MESSAGE>
class FileReader {
public:
typedef message_iterator<MESSAGE, 1> const_iterator;
typedef message_iterator<MESSAGE,-1> const_reverse_iterator;
FileReader();
bool open(const std::string & rName);
bool moveTo(int n);
bool read(MESSAGE &m);
// get the total count of messages in the file
// helps us to find end() and rbegin()
int getMessageCount();
const_iterator begin() {
return const_iterator(this,0);
}
const_iterator end() {
return const_iterator(this,getMessageCount());
}
const_reverse_iterator rbegin() {
return const_reverse_iterator(this,getMessageCount()-1);
}
const_reverse_iterator rend() {
return const_reverse_iterator(this,-1);
}
};
// declaration of message_iterator moving over MESSAGE
// STEP is used to specify STEP size and direction (e.g -1 == reverse)
template<class MESSAGE,int STEP=1>
class message_iterator
: public boost::iterator_facade<
message_iterator<MESSAGE>
, const MESSAGE
, boost::random_access_traversal_tag
>
{
typedef boost::iterator_facade<
message_iterator<MESSAGE>
, const MESSAGE
, boost::random_access_traversal_tag
> super;
public:
// constructor associates an iterator with its FileReader and a given position
explicit message_iterator(FileReader<MESSAGE> * p=NULL,int n=0): _filereader(p),_idx(n),_valid(false) {
advance(0);
}
bool equal(const message_iterator & i) const {
return i._filereader == _filereader && i._idx == _idx;
}
void increment() {
advance(+1);
}
void decrement() {
advance(-1);
}
// overwrite with central functionality. Move to a given relative
// postion and check wether the position can be read. If move/read
// fails we flag the iterator as incalid.
void advance(int n) {
_idx += n*STEP;
if(_filereader!=NULL) {
if( _filereader->moveTo( _idx ) && _filereader->read(_m)) {
_valid = true;
return;
}
}
_valid = false;
}
// Return a ref to the currently cached MESSAGE. Throw
// an acception if positioning at this location in advance(...) failes.
typename super::reference dereference() const {
if(!_valid) {
throw std::runtime_error("access to invalid pos");
}
return _m;
}
private:
FileReader<MESSAGE> * _filereader;
int _idx;
bool _valid;
MESSAGE _m;
};
source to share
Boost PropertyMap
You can avoid writing the bulk of your code with the Boost PropertyMap:
#include <boost/property_map/property_map.hpp>
#include <boost/property_map/function_property_map.hpp>
using namespace boost;
struct SpecificMessage {
// add some data
int index; // just for demo
};
template <typename Message>
struct MyLazyReader {
typedef Message type;
std::string fname;
MyLazyReader(std::string fname) : fname(fname) {}
Message operator()(size_t index) const {
Message m;
// FileReader fr;
// fr.open(fname);
// fr.moveTo(index); // Move to Message
// fr.read(&m); // Try deserializing as SpecificMessage
m.index = index; // just for demo
return m;
}
};
#include <iostream>
int main() {
auto lazy_access = make_function_property_map<size_t>(MyLazyReader<SpecificMessage>("file.bin"));
for (int i=0; i<10; ++i)
std::cout << lazy_access[rand()%256].index << "\n";
}
Sample output
103 198 105 115 81 255 74 236 41 205
Using memory files
You can save a map of objects index → BLOBs in general vector<array<byte, N>>
, flat_map<size_t, std::vector<uint8_t> >
or similar.
So now you only need to deserialize from myshared_map[index].data()
( begin()
and end()
in the case of BLOB resizing)
source to share