How to use regexec with memory mapped files?

I am trying to find a regex in a large memory file using the regexec () function. I found that the program crashes when the file size is a multiple of the page size.

Is there a regexec () function that has length as an optional argument?

Or:

How do I find a regex in a memory mapped file?

Here is a minimal example due to which ALWAYS (if I run less than a program of three threads does not crash):

ls -la ttt.txt 
-rwx------ 1 bob bob 409600 Jun 14 18:16 ttt.txt

gcc -Wall mal.c -o mal -lpthread -g && ./mal
[1]    11364 segmentation fault (core dumped)  ./mal

      

And the program:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <regex.h>

void* f(void*arg) {
  int size = 409600;
  int fd = open("ttt.txt", O_RDONLY);
  char* text = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);

  fd = open("/dev/zero", O_RDONLY);
  char* end = mmap(text + size, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
  close(fd);

  assert(text+size == end);

  regex_t myre;
  regcomp(&myre, "XXXXX", REG_EXTENDED);
  regexec(&myre, text, 0, NULL, 0);
  regfree(&myre);
  return NULL;
}

int main(int argc, char* argv[]) {
  int n = 10;
  int i;
  pthread_t t[n];
  for (i = 0; i < n; ++i) {
    pthread_create(&t[n], NULL, f, NULL);
  }
  for (i = 0; i < n; ++i) {
    pthread_join(t[n], NULL);
  }
  return 0;
}

      

PS This is the output from gdb:

gdb ./mal 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/bob/prog/c/mal...done.
(gdb) r

Starting program: /home/srdjan/prog/c/mal 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff77ff700 (LWP 11817)]
[New Thread 0x7ffff6ffe700 (LWP 11818)]
[New Thread 0x7ffff6799700 (LWP 11819)]
[New Thread 0x7fffeffff700 (LWP 11820)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6799700 (LWP 11819)]
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
72  ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
#1  0x00007ffff78df254 in __regexec (preg=0x7ffff6798e80, string=0x7fffef79b000 'a' <repeats 200 times>..., nmatch=<optimized out>, 
pmatch=0x0, eflags=<optimized out>) at regexec.c:245
#2  0x00000000004008e6 in f (arg=0x0) at mal.c:24
#3  0x00007ffff7bc4e9a in start_thread (arg=0x7ffff6799700) at pthread_create.c:308
#4  0x00007ffff78f24bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()
(gdb) 

      

+2


source to share


2 answers


The problem is what is being regexec()

used to match a null-terminated string against a precompiled template buffer, but the mmap

ed file is not necessarily (really, not usually) null terminated. So it looks outside the end of the file to find a NUL character (0 bytes).



You'll need a version regexec()

that takes a buffer and size argument instead of a null-terminated string, but doesn't seem to exist.

+2


source


Celada correctly identifies the problem - the file data does not necessarily contain a null delimiter.

You can fix the problem by mapping the page of zeros right after the file:

int fd;
char *text;

fd = open("ttt.txt", O_RDONLY);
text = mmap(NULL, 409600, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);

fd = open("/dev/zero", O_RDONLY);
mmap(text + 409600, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
close(fd);

      



(Note that you can close fd

immediately after mmap()

, because it mmap()

adds a link to the description of the open file).

You should of course add error checking to the above. In addition, many UNIX systems support a flag MAP_ANONYMOUS

that you can use instead of opening /dev/zero

(but this is not the case for POSIX).

+2


source







All Articles