How to work around the lack of a NUL terminator in strings returned from mmap()?

When mmap()ing a text file, like so

struct stat sb;
int fd = open("file.txt", O_RDWR);
fstat(fd, &sb);
char *text = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

      

the contents of the file are mapped directly into memory, and text will not contain a NUL terminator, so operating on it with the normal string functions is not safe. On Linux (at least) the remaining bytes of the unused part of the last page are zero-filled, so effectively you get a NUL terminator in all cases where the file size is not a multiple of the page size.

But relying on that feels dirty, and other mmap() implementations (FreeBSD, I think) don't zero-fill partial pages. When mapping files whose size is a multiple of the page size, the NUL terminator will also be missing.
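
The page-size arithmetic behind this can be made concrete. The helper below is only a sketch, and tail_slack is a hypothetical name, not a library function: it computes how many bytes lie between the end of the file data and the end of its last mapped page. A result of 0 means the file fills its last page exactly, so there is no trailing byte at all, zero-filled or otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes between the end of the file data
 * and the end of its last mapped page. page_size would normally come from
 * sysconf(_SC_PAGESIZE). */
static size_t tail_slack(size_t file_size, size_t page_size)
{
    assert(page_size > 0);
    size_t used = file_size % page_size;   /* bytes used in the last page */
    return used == 0 ? 0 : page_size - used;
}
```

With a 4096-byte page, a 100-byte file leaves 3996 bytes of slack, while a 4096-byte or 8192-byte file leaves none.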

Are there any sensible ways to get around this or add a NUL terminator?

Things I have considered

  • Using the strn*() functions exclusively and tracking the distance to the end of the buffer.
    • Pros: no need for a NUL terminator
    • Cons: additional tracking required to know the distance to the end of the file when parsing text; some str*() functions have no strn*() counterpart, for example strstr().
  • As another answer suggested, make an anonymous mapping at a fixed address immediately after the mapping of your text file.
    • Pros: can use the normal C str*() functions
    • Cons: using MAP_FIXED is not thread safe; seems like an awful hack
  • mmap() an extra byte, make the mapping writable, and write the NUL terminator. The OpenGroup mmap man page says that you can make a mapping larger than the size of your object, but that accessing data outside the actual mapped object will generate SIGBUS.
    • Pros: can use the normal C str*() functions
    • Cons: requires handling (ignoring?) SIGBUS, which could potentially mean something else happened; I'm really not sure whether writing a NUL terminator will work
  • Expand files whose size is a multiple of the page size by one byte with ftruncate().
    • Pros: can use the normal C str*() functions; ftruncate() will write a NUL byte to the newly allocated area for you
    • Cons: means that we have to write to the files, which may not be possible or acceptable in all cases; doesn't solve the problem for mmap() implementations that don't zero-fill partial pages
  • Just read() the file into some malloc()'d memory and forget about mmap().
    • Pros: avoids all of these workarounds; easy to malloc() an extra byte for the NUL
    • Cons: different performance characteristics than mmap()
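
For comparison, the last option in the list (read() into malloc()'d memory) could be sketched roughly as follows. The name slurp_file is hypothetical, error handling is kept minimal (an interrupted read() is simply treated as a failure), and the file is assumed to be a regular file small enough to fit in memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a whole file into a malloc()'d buffer with one
 * extra byte for the NUL terminator. Returns NULL on any error; the caller
 * free()s the result. *len_out receives the number of bytes read. */
static char *slurp_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size < 0) {
        close(fd);
        return NULL;
    }

    size_t size = (size_t)sb.st_size;
    char *buf = malloc(size + 1);          /* +1 for the terminator */
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    size_t got = 0;
    while (got < size) {                   /* read() may return short counts */
        ssize_t n = read(fd, buf + got, size - got);
        if (n < 0) {
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)
            break;                         /* file shrank; keep what we have */
        got += (size_t)n;
    }
    close(fd);

    buf[got] = '\0';                       /* always terminated */
    *len_out = got;
    return buf;
}
```

The returned buffer can be passed to any str*() function, at the cost of copying the file once.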

Solution #1 seems generally to be the best, and just requires some extra work on the part of the functions reading the text.
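
To give an idea of that extra work, here is a minimal sketch of a length-bounded stand-in for strstr(), built only on memchr() and memcmp() from ISO C. The name find_bounded is hypothetical. Some C libraries (glibc, for instance) offer memmem() for the same job, but it is not part of ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-bounded stand-in for strstr(): find the first
 * occurrence of needle[0..nlen) inside hay[0..hlen). Neither buffer needs
 * a NUL terminator. Returns NULL when there is no match. */
static const char *find_bounded(const char *hay, size_t hlen,
                                const char *needle, size_t nlen)
{
    assert(hay != NULL && needle != NULL);
    if (nlen == 0)
        return hay;                 /* same convention as strstr() */
    if (nlen > hlen)
        return NULL;

    const char *p = hay;
    const char *last = hay + (hlen - nlen);   /* last possible start */
    while (p <= last) {
        /* Jump to the next candidate first byte inside the valid window. */
        p = memchr(p, needle[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, nlen) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

With the mapping from the question, a call would look like find_bounded(text, sb.st_size, "key=", 4), and the search never reads past the last byte of the file.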

Are there any better alternatives or are these the best solutions? Are there aspects of these solutions that I have not considered that make them more or less attractive?



1 answer


I would suggest a paradigm shift here.

You are looking at an entire universe made of '\0'-delimited strings that define your text. Instead of looking at the world this way, why don't you try looking at a world where text is a sequence defined by a start iterator and an end iterator?

You mmap your file, then initially set the start iterator, call it beg_iter, to the start of the mmap-ed segment, and the end iterator, call it end_iter, to the first byte following the last byte in the mmap-ed segment, that is beg_iter+number_of_pages*pagesize. Then, until either

A) end_iter is equal to beg_iter, or

B) end_iter[-1] is not a null character,

C) decrement end_iter and return to step A.
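
Written in plain C to match the question's code, steps A to C might look like the sketch below. The name trim_trailing_nul is hypothetical, and the caller is assumed to pass the bounds of the whole mapped region.

```c
#include <assert.h>
#include <stddef.h>

/* Steps A-C in plain C: starting from the end of the mapped region, move
 * end back while the byte just before it is NUL. Returns the new end, so
 * that [beg, end) holds the text without the zero padding. */
static const char *trim_trailing_nul(const char *beg, const char *end)
{
    assert(beg != NULL && beg <= end);
    while (end != beg && end[-1] == '\0')
        end--;
    return end;
}
```

One caveat of this approach: a file that legitimately ends in NUL bytes would have them trimmed as well, since they cannot be told apart from padding.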

When you're done, you have a pair of iterators, a start value and an end value, that define your text string.

Of course, in this case your iterators are plain char *, but that really doesn't matter. The important thing is that you now find yourself with a rich set of algorithms and templates from the C++ Standard Library at your disposal, which let you implement many complex operations, both mutating (for example std::transform) and non-mutating (for example std::find).
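
The answer has C++ in mind here. Staying in C to match the question's code, the same half-open range [beg, end) also works with the mem*() family, where memchr() plays roughly the role of std::find. A minimal sketch, with count_lines as a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count '\n' bytes in the half-open range [beg, end) without relying on a
 * NUL terminator. */
static size_t count_lines(const char *beg, const char *end)
{
    assert(beg != NULL && beg <= end);
    size_t lines = 0;
    for (const char *p = beg; p < end; p++) {
        /* Search only the bytes that remain inside the range. */
        p = memchr(p, '\n', (size_t)(end - p));
        if (p == NULL)
            break;
        lines++;                    /* the loop's p++ steps past the newline */
    }
    return lines;
}
```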

Null-terminated strings are really a holdover from the days of plain C. In C++, null-terminated strings are somewhat archaic and mundane. Modern C++ code should use std::string objects and sequences defined by start and end iterators.

One small footnote: instead of figuring out how much NUL padding you ended up mmap()-ing, you might find it easier to fstat() the file and get its length in bytes before mmap()-ing it. Then you know exactly how much got mapped, and you don't have to reverse-engineer it by looking at the padding.
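
A sketch of that footnote in C, under the assumption that the file is a regular, non-empty file: the end pointer comes straight from st_size, so no padding needs to be inspected. Both struct text_range and map_text are hypothetical names introduced for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical pair of pointers describing the mapped text as [beg, end). */
struct text_range {
    const char *beg;
    const char *end;
};

/* Map path read-only and set end from st_size rather than from the padding.
 * Returns 0 on success, -1 on error or for an empty file (a zero-length
 * mmap() fails). Release with munmap((void *)r->beg, r->end - r->beg). */
static int map_text(const char *path, struct text_range *r)
{
    assert(path != NULL && r != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    r->beg = p;
    r->end = r->beg + sb.st_size;   /* one past the last byte of the file */
    return 0;
}
```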
