C ++ suffix tree library with simple usage examples

I'm looking for a suffix library (which has a linear time construct) and all I found is PATL, but PATL has no documentation and I can't figure out any of the examples. So, is there a suffix tree library for C ++ that has decent documentation?

PATL home: http://code.google.com/p/patl/

Motivation: I need to process a large number of strings and find frequent common substrings and report if there are more than n occurrences of any substring in t seconds. I have implemented a tree (using a counter in the nodes, it is not actually a counter, but a std :: vector of the visit time, since I said I need time), but it is very slow. So I thought about concatenating (concatenating some random things between lines so that substrings don't span more than one line) a certain number of messages (say 30 seconds of data), and then built a suffix tree on that line.


source to share

3 answers

Take a look at the SeqAn library , which offers high-performance implementations of various search algorithms and data structures with documentation.

For example, the suffix array class can be used as a replacement for suffixes.

Also, your problem sounds inherently complex, I'm not sure how much you can speed it up. In its general formulation, its the multiple alignment problem, which is NP hard. You can probably turn this into something more palatable, since you are only interested in exact submatrices, but its still tricky.



You might want to take a look at the implementations done for the Pizza & Chili project . They don't have suffix trees, but suffix arrays and various compressed indices. A regular (uncompressed) suffix array should be perfect for your purposes, although it is not a suffix tree.

(You will find the downloadable code in the Index Collection section.)



SDSL is very mature, with suffix tree implementations, suffix , wavelet tree and many other structures in C ++.

" The Data String Library (SDSL) is a powerful and flexible C ++ 11 library that implements compressed data structures. In total, the library contains basic information about 40 scientific publications. Concise data structures can represent an object (such as bitvector or tree) in space close to the information-theoretic lower limit of the object, effectively supporting the operations of the original object The theoretical time complexity of the operation performed on the classical data structure and the equivalent concise data structure (in most cases) are identical.

A list of structures implemented in SDSL can be found here .

Medium LCP example - longest common prefix search using suffix tree (example from SDSL sources, file text-statistics.cpp


#include <sdsl/suffix_trees.hpp>
#include <iostream>

using namespace std;
using namespace sdsl;

typedef cst_sct3<> cst_t;
typedef cst_t::char_type char_type;

int main(int argc, char* argv[])
    if (argc < 2) {
        cout << "Usage: "<< argv[0] << " file" << endl;
        cout << "(1) Generates the CST of file." << endl;
        cout << "(2) Calculates the avg LCP value and the runs in the BWT." << endl;
        return 1;
    cst_t cst;
    construct(cst, argv[1], 1);

    long double runs = 1;
    long double avg_lcp = 0;
    if (cst.csa.size()) {
        char_type prev_bwt = cst.csa.bwt[0];
        for (uint64_t i=1; i<cst.csa.size(); ++i) {
            char_type bwt = cst.csa.bwt[i];
            if (prev_bwt != bwt) {
                runs += 1.0;
            prev_bwt = bwt;
            avg_lcp += cst.lcp[i];
        avg_lcp /= cst.csa.size();
        for (size_t k=0; k<=5; k++) {
            cout << "H_" << k << ": " << Hk(cst,k).first << endl;
        cout << "avg LCP: " << avg_lcp << endl;
        cout << "runs in BWT: " << runs << endl;




All Articles