Quick string search to find an element of a string array that matches a givven pattern

I have an array of constant strings that I iterate over to find the index of an element, which is the string that contains the search pattern. Which search algorithm should I choose to improve the search speed for this item? I have no time limit before running the application to prepare lookup tables if needed.

I corrected the question - I am not doing an exact string match - I am looking for a pattern inside an element that is in an array

array of strings:
[0] Red fox jumps over the fence
[1] Blue table
[2] Red flowers on the fence

      

I need to find an element that contains the word "table" - in this case its element is 1

I like 50,000 iterations of a set of 30 arrays that can hold up to 30,000 lines of at least 128 characters. Now I am using old type brute force strstr

which is too slow ...

Ok, posting part of my function, the first one strstr

is to look through the unresolved array of strings if there are any occurrences, then a rough search follows. I know I can speed up this part, but I am not optimizing this approach ...

// sheets[i].buffer - contains a page of a text which is split into lines
// fullfunccall - is the pattern
// sheets[i].cells[k] - are the pointers to lines in a buffer

for( i=0; i<sheetcount; i++) {
  if( i!= currsheet && sheets[i].name && sheets[i].name[0] != '\0') {
    if( strstr(sheets[i].buffer, fullfunccall )) {
      usedexternally = 1;

      int foundatleastone = 0;
      for( k=0; k<sheets[i].numcells; k++ ) {
        strncpy_s(testline, MAX_LINE_SIZE, sheets[i].cells[k]->line, sheets[i].cells[k]->linesize);
        testline[sheets[i].cells[k]->linesize] = '\0';

        if( strstr(testline, fullfunccall )) {
          dependency_num++;

          if( dependency_num >= MAX_CELL_DEPENDENCIES-1) {
            printf("allocation for sheet cell dependencies is insuficcient\n");
            return;
          }

          sheets[currsheet].cells[currcellposinsheet]->numdeps = dependency_num+1;
          foundatleastone++;
          sheets[currsheet].cells[currcellposinsheet]->deps[dependency_num] = &sheets[i].cells[k];
        }
      }

      if( foundatleastone == 0 ) {
        printf("error locating dependency for external func: %s\n", fullfunccall );
        return;
      }
    }
  };
}

      

+3


source to share


3 answers


You wrote that your "haystack" (set of search strings) is approximately 30,000 lines with approx. 200 characters. You also wrote that "needle" (a search term) is a string of 5 or 20 characters.

Based on this, you can pre-copy the hash table that displays any 5-character subsequence of the string (s) in the haystack it appears in. For 30,000 lines (200 characters each) there are no more than 30,000 * (200-5) = 5,850,000 different 5-character substrings. If you hash each of them into a 16-bit checksum, you will need a minimum of 11MB of memory (for hash keys) plus some pointers to point to the string (s) where the substring occurs.

For example, given a simplified haystack

static const char *haystack[] = { "12345", "123456", "23456", "345678" };

      



you are precompiling a hashmap that maps any possible 5-character string such that

12345 => haystack[0], haystack[1]
23456 => haystack[1], haystack[2]
34567 => haystack[3]
45678 => haystack[4]

      

With this, you can take the first five characters of the given key (5 to 20 characters long), hash it, and then execute the usual strstr

one through all the lines to which the key is mapped by the hash.

+2


source


For each sheet that you process, you can build a suffix array as described in this article . Before starting your search, read the sheet, find the starting lines (as integer indices in the sheet buffer), create a suffix array and sort it as described in the article.

Now if you are looking for lines where a pattern occurs, such as "table", you can search for the next entry after "table" and the next entry after "tablf", match where you moved the right letter, odometer style.

If both indices are the same, there is no match. If they are different, you will get a list of pointers in the sheet:

"tab. And now ..."
----------------------------------------------------------------
"table and ..."                0x0100ab30
"table water for ..."          0x0100132b
"tablet computer ..."          0x01000208
----------------------------------------------------------------
"tabloid reporter ..."

      

This will give you a list of pointers from which, by subtracting the base pointer of the sheet buffer, you can get the offsets of the integers. Comparison with the starting lines will give you the line numbers corresponding to those pointers. (The line numbers are sorted, so you can do a binary search here.)

The memory overhead is an array of pointers that are the same size as the leaf buffer, so for 30,000 lines of 200 characters, this would be about 48MB on a 64-bit machine. (The overhead of row indexes is negligible.)



Sorting the array will take a long time, but only done once for each sheet.

Edit . The idea seems to work well. I have executed it and can scan a dictionary of about 130,000 words in a text file of almost 600k in one second:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define die(...) exit((fprintf(stderr, "Fatal: " __VA_ARGS__), \
    putc(10, stderr), 1))



typedef struct Sheet Sheet;    

struct Sheet {
    size_t size;    /* Number of chars */
    char *buf;      /* Null-terminated char buffer */
    char **ptr;     /* Pointers into char buffer */
    size_t nline;   /* number of lines */
    int *line;      /* array of offset of line beginnings */
    size_t naux;    /* size of scratch array */
    char **aux;     /* scratch array */
};


/*
 *      Count occurrence of c in zero-terminated string p.
 */
size_t strcount(const char *p, int c)
{
    size_t n = 0;

    for (;;) {
        p = strchr(p, c);
        if (p == NULL) return n;
        p++;
        n++;        
    }

    return 0;
}

/*
 *      String comparison via pointers to strings.
 */
int pstrcmp(const void *a, const void *b)
{
    const char *const *aa = a;
    const char *const *bb = b;

    return strcmp(*aa, *bb);
}

/*
 *      Pointer comparison.
 */
int ptrcmp(const void *a, const void *b)
{
    const char *const *aa = a;
    const char *const *bb = b;

    if (*aa == *bb) return 0;   
    return (*aa < *bb) ? -1 : 1;
}

/*
 *      Create and prepare a sheet, i.e. a text file to search.
 */
Sheet *sheet_new(const char *fn)
{
    Sheet *sheet;
    FILE *f = fopen(fn, "r");
    size_t n;
    int last;
    char *p;
    char **pp;

    if (f == NULL) die("Couldn't open %s", fn);

    sheet = malloc(sizeof(*sheet));
    if (sheet == NULL) die("Allocation failed");

    fseek(f, 0, SEEK_END);
    sheet->size = ftell(f);
    fseek(f, 0, SEEK_SET);

    sheet->buf = malloc(sheet->size + 1);
    sheet->ptr = malloc(sheet->size * sizeof(*sheet->ptr));

    if (sheet->buf == NULL) die("Allocation failed");
    if (sheet->ptr == NULL) die("Allocation failed");

    fread(sheet->buf, 1, sheet->size, f);
    sheet->buf[sheet->size] = '\0';
    fclose(f);

    sheet->nline = strcount(sheet->buf, '\n');
    sheet->line = malloc(sheet->nline * sizeof(*sheet->line));

    sheet->aux = NULL;
    sheet->naux = 0;

    n = 0;
    last = 0;
    p = sheet->buf;
    pp = sheet->ptr;
    while (*p) {
        *pp++ = p;
        if (*p == '\n') {
            sheet->line[n++] = last;
            last = p - sheet->buf + 1;
        }
        p++;
    }

    qsort(sheet->ptr, sheet->size, sizeof(*sheet->ptr), pstrcmp);

    return sheet;
}

/*
 *      Clean up sheet.
 */
void sheet_delete(Sheet *sheet)
{
    free(sheet->buf);
    free(sheet->ptr);
    free(sheet->line);
    free(sheet->aux);
    free(sheet);
}

/*
 *      Binary range search for string pointers.
 */
static char **pstr_bsearch(const char *key,
    char **arr, size_t high)
{
    size_t low = 0;

    while (low < high) {
        size_t mid = (low + high) / 2;
        int diff = strcmp(key, arr[mid]);

        if (diff < 0) high = mid;
        else low = mid + 1;
    }

    return arr + low;
}

/*
 *      Binary range search for line offsets.
 */
static const int *int_bsearch(int key, const int *arr, size_t high)
{
    size_t low = 0;

    while (low < high) {
        size_t mid = (low + high) / 2;
        int diff = key - arr[mid];

        if (diff < 0) high = mid;
        else low = mid + 1;
    }

    if (low < 1) return NULL;
    return arr + low - 1;
}

/*
 *      Find occurrences of the string key in the sheet. Returns the
 *      number of lines in which the key occurs and assigns up to
 *      max lines to the line array. (If max is 0, line may be NULL.)
 */
int sheet_find(Sheet *sheet, char *key,
    int line[], int max)
{
    char **begin, **end;
    int n = 0;
    size_t i, m;
    size_t last;

    begin = pstr_bsearch(key, sheet->ptr, sheet->size);
    if (begin == NULL) return 0;

    key[strlen(key) - 1]++;
    end = pstr_bsearch(key, sheet->ptr, sheet->size);
    key[strlen(key) - 1]--;
    if (end == NULL) return 0;
    if (end == begin) return 0;

    m = end - begin;
    if (m > sheet->naux) {
        if (sheet->naux == 0) sheet->naux = 0x100;
        while (sheet->naux < m) sheet->naux *= 2;
        sheet->aux = realloc(sheet->aux, sheet->naux * sizeof(*sheet->aux));
        if (sheet->aux == NULL) die("Re-allocation failed");        
    }

    memcpy(sheet->aux, begin, m * sizeof(*begin));
    qsort(sheet->aux, m, sizeof(*begin), ptrcmp);

    last = 0;
    for (i = 0; i < m; i++) {
        int offset = sheet->aux[i] - sheet->buf;
        const int *p;

        p = int_bsearch(offset, sheet->line + last, sheet->nline - last);

        if (p) {
            if (n < max) line[n] = p - sheet->line;
            last = p - sheet->line + 1;
            n++;
        }
    }

    return n;
}

/*
 *      Example client code
 */
int main(int argc, char **argv)
{
    Sheet *sheet;
    FILE *f;

    if (argc != 3) die("Usage: %s patterns corpus", *argv);

    sheet = sheet_new(argv[2]);

    f = fopen(argv[1], "r");
    if (f == NULL) die("Can't open %s.", argv[1]);
    for (;;) {
        char str[80];
        int line[50];
        int i, n;

        if (fgets(str, sizeof(str), f) == NULL) break;
        strtok(str, "\n");
        n = sheet_find(sheet, str, line, 50);
        printf("%8d %s\n", n, str);

        if (n > 50) n = 50;
        for (i = 0; i < n; i++) printf("    [%d] %d\n", i, line[i] + 1);
    }
    fclose(f);

    sheet_delete(sheet);

    return 0;
}

      

The implementation has its rough edges, but it works. I don't particularly like the scratch array and additional sorting by the found range of pointers, but it turns out that even sorting a large array of suffixes doesn't take too long.

You can expand this solution to more sheets if you like.

+2


source


I find the DFA to be the most practical, since it reads each character of the input at most once - more precisely, it reads each character of the input once and stops as soon as the pattern is not exactly defined (if configured correctly). With DFA, you can also check multiple templates at the same time.

Two good (but different) implementations of DFA algorithms that have been well tested in practice are

It is impossible to tell what is right for your task unless you provide more about it.

edit: DFA remains for "Deterministic State Machines"

edit: since you indicated that your patterns are exact substrings, the most common solution is the KMP algorithm (Knuth-Morris-Pratt)

0


source







All Articles