How to get around the lack of a NUL terminator in strings returned from mmap()?
When mmap()ing a text file like
int fd = open("file.txt", O_RDWR);
struct stat sb;
fstat(fd, &sb);
char *text = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
the contents of the file are mapped directly into memory, and text will not contain a NUL terminator, so working on it with the normal string functions is not safe. On Linux (at least) the remaining bytes of the unused part of the last page are zero-filled, so you effectively get a NUL terminator in all cases where the file size is not a multiple of the page size.
But relying on that feels messy, and other mmap() implementations (FreeBSD, I think) don't zero-fill partial pages. When mapping files whose size is a multiple of the page size, the NUL terminator will also be missing.
Are there any sensible ways to get around this or add a NUL terminator?
Things I have considered
- Using the strn*() functions exclusively and tracking the distance to the end of the buffer.
  - Pros: no need for a NUL terminator
  - Cons: additional tracking required to determine the distance to the end of the file when parsing text; some str*() functions have no strn*() analogue, for example strstr.
- As another answer suggested, make an anonymous mapping at a fixed address right after the mapping of your text file.
  - Pros: Normal C str*() functions can be used
  - Cons: Using MAP_FIXED is not thread safe; looks like a terrible hack
- mmap() an extra byte, make the mapping writable, and write the NUL terminator. The OpenGroup mmap man page says that you can make the mapping larger than the size of your object, but that accessing data outside of the actual mapped object will generate SIGBUS.
  - Pros: Normal C str*() functions can be used
  - Cons: Requires handling (ignoring?) SIGBUS, which could potentially mean something else happened. I'm not actually sure whether writing a NUL terminator will work?
- Expand files whose size is a multiple of the page size by one byte with ftruncate().
  - Pros: Normal C str*() functions can be used; ftruncate() will write a NUL byte to the newly allocated area for you
  - Cons: means that we have to write to files, which may not be possible or acceptable in all cases; doesn't address the problem for mmap() implementations that don't zero-fill partial pages
- Just read() the file into some malloc()'d memory and forget about mmap().
  - Pros: Avoids all of these workarounds; easy to malloc() an extra byte for the NUL
  - Cons: different performance characteristics than mmap()
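For reference, the last option could look roughly like the sketch below. slurp_file is a hypothetical name, not a library function. It assumes a regular file whose size fstat() reports correctly, and it does not retry on EINTR.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read a whole regular file into a heap buffer and
 * append a NUL terminator. Returns NULL on error; the caller frees the
 * result. If len_out is non-NULL it receives the number of bytes read
 * (not counting the NUL). */
char *slurp_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size < 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)sb.st_size;
    char *buf = malloc(len + 1);          /* one extra byte for the NUL */
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)
            break;                        /* file shrank while reading */
        got += (size_t)n;
    }
    close(fd);

    buf[got] = '\0';
    if (len_out)
        *len_out = got;
    return buf;
}
```

The returned buffer can be handed to strstr() and friends, with the usual caveat that a NUL byte inside the file itself would cut the string short.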
Solution #1 seems to be generally the best, and just requires some extra work on the part of the text-reading functions.
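To illustrate the kind of extra work involved, here is a minimal sketch of a length-bounded stand-in for strstr(). bounded_find is a hypothetical name, and the code assumes both lengths are known to the caller. Some libcs (glibc, the BSDs) ship a non-standard memmem() that does the same job.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-bounded analogue of strstr(): find the first
 * occurrence of needle[0..nlen) inside hay[0..hlen). Neither buffer
 * needs a NUL terminator. Returns NULL if there is no match. */
const char *bounded_find(const char *hay, size_t hlen,
                         const char *needle, size_t nlen)
{
    if (nlen == 0)
        return hay;
    if (nlen > hlen)
        return NULL;

    const char *p = hay;
    const char *last = hay + (hlen - nlen);   /* last possible match start */

    while (p <= last) {
        /* Jump to the next candidate first byte, staying inside the buffer. */
        p = memchr(p, needle[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, nlen) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

With a mapping like the one above, a call would be bounded_find(text, sb.st_size, "needle", 6), and the search never reads past the end of the file.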
Are there any better alternatives or are these the best solutions? Are there aspects of these solutions that I have not considered that make them more or less attractive?
I would suggest a paradigm shift here.
You are looking at the entire universe in terms of '\0'-terminated strings that define your text. Instead of looking at the world this way, why not try looking at a world where text is defined as a sequence, delimited by a beginning and an ending iterator?
You mmap your file and then initially set the beginning iterator, call it beg_iter, to the start of the mmap-ed segment, and the ending iterator, call it end_iter, to the first byte following the last byte in the mmap-ed segment, that is beg_iter+number_of_pages*pagesize. Then, until either
A) end_iter equals beg_iter, or
B) end_iter[-1] is not a null character,
C) decrement end_iter and go back to step A.
When you're done, you have a pair of iterators, a beginning value and an ending value, that delimit your text string.
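In plain C the trimming loop might be sketched like this. trim_trailing_nuls is a hypothetical name, and the loop tests the byte just before the end pointer. Note that it relies on the tail of the last page being zero-filled, and it would also strip any NUL bytes that genuinely end the file.

```c
#include <stddef.h>

/* Steps A to C: move end back while the byte just before it is NUL.
 * Returns the new end pointer; [beg, result) is the text. */
const char *trim_trailing_nuls(const char *beg, const char *end)
{
    while (end != beg && end[-1] == '\0')
        end--;
    return end;
}
```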
Of course, in this case your iterators are plain char *, but that really doesn't matter much. The important thing is that you now have a rich set of algorithms and templates from the C++ standard library at your disposal, which let you implement many complex operations, both mutating (for example std::transform) and non-mutating (for example std::find).
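If you have to stay in C, the same begin/end style still works with the mem*() functions. As a rough sketch, here is a newline counter over a [beg, end) range, using memchr() in the role std::find would play. count_newlines is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Count '\n' bytes in [beg, end) without requiring a NUL terminator. */
size_t count_newlines(const char *beg, const char *end)
{
    size_t n = 0;
    while (beg < end) {
        const char *nl = memchr(beg, '\n', (size_t)(end - beg));
        if (nl == NULL)
            break;
        n++;
        beg = nl + 1;       /* continue right after the newline found */
    }
    return n;
}
```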
Null-terminated strings are really a holdover from the days of plain C. In C++, null-terminated strings are somewhat archaic and mundane. Modern C++ code should use std::string objects and sequences delimited by beginning and ending iterators.
One small footnote: instead of figuring out how much NUL padding you ended up mmap()-ing, you might find it easier to fstat() the file and get its length in bytes before mmap-ing it. Then you know exactly how much got mapped, and you don't need to reverse-engineer it by looking at the padding.
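A minimal sketch of that footnote, assuming a POSIX system: map_text is a hypothetical helper that returns the mapping together with the exact length reported by fstat(), so the text is simply [result, result + *len).

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and report its exact length.
 * On success returns the mapping and stores the byte count in *len.
 * Returns NULL on error or for an empty file (mmap() rejects a zero
 * length). The caller releases the mapping with munmap(). */
const char *map_text(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)sb.st_size;
    return p;
}
```

The end iterator is then just the returned pointer plus *len, with no scanning for padding at all.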