Meta information in DAWG / DAFSA

I would like to implement a string search data structure for dynamic strings that will support efficient search and insert. I am currently using trie, but I would like to reduce the memory footprint if possible. This Wikipedia article describes DAWG / DAFSA, which will obviously save a lot of trie space by compressing suffixes. However, while it will clearly check if the string is legal, it is not obvious to me if there is a way to exclude illegal lines. For example, using the words "cite" and "cat" where "t" and "e" are terminal states, the DAWG / DAFSA would look like this:

      c
     / \
    a   i
     \ /
      t
      |
      e

      

and "cit" and "cate" will be misunderstood as legal strings without any meta information.

Questions:

1) Is there a preferred way to store string / path metadata (such as legality) in DAWG / DAFSA?

2) If DAWG / DAFSA is incompatible with the requirements (efficient search / insert and storage of meta information), what is the best data structure to use? A minimum amount of memory would be nice, but perhaps not entirely necessary.

+3


source to share


1 answer


In a DAWG, you simply squeeze states together if they are completely indistinguishable from each other. This means that you would not actually combine the T nodes for CAT and CITE together for the very reason you gave it — it gives you either a false positive for CIT or a false negative for CAT.

DAWGs are usually most effective for static dictionaries when you have a huge number of words with common suffixes. For example, a DAWG for all English could save a lot of space by combining all the "s" suffixes at the end of a set of words and most of the ING suffixes from gerunds. If you're going to be doing a lot of attachments or deletions, the DAWG is almost certainly the wrong data structure to work with, because adding or removing one word from the DAWG can cause ripple effects that require a lot of branches that were previously merged with split or vice versa.



Honestly, for reasonably sized datasets, trie is not a bad call. Three for all English will only use something like 26MB, which isn't very much. I would only go with a DAWG if the use of space is really great and you don't do a lot of insertions or exceptions.

Hope this helps!

+3


source







All Articles