An efficient way to determine whether a randomly generated code in .NET contains any obscene words.

We have an application that generates random base 35 codes [0-9A-Z], excluding the letter O. I'm looking for a way to detect codes containing any obscene English words without having to search through a list of 10,000 entries for each code. With tens of thousands of codes generated, the lookup time against these huge lists of salacious words would either overwhelm our servers or require many more instances to compensate for the performance degradation.

Due to the nature of this code generator, the obscenity check must be fast and efficient.

Note that dropping vowels is not an option, since base 35 is required, and a mixed corpus is not an option either. The problem here is more than just the algorithm; efficient string matching and searching is only a small part of it.

Part of this would entail taking a list of strings, finding commonly repeated substrings of a given length (say, 3), and omitting every word in the list that contains one of those common substrings, producing an optimized list (a sketch follows below). This would help cut down the long obscenity filter lists found in the wild.
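
As a sketch of that pruning step (OptimizeList is an illustrative helper, not existing code; assumes System.Linq and case-insensitive ASCII input):

    //Drop any word that contains a shorter kept word as a substring,
    //since the shorter entry already catches it.
    static List<string> OptimizeList(IEnumerable<string> words)
    {
        var sorted = words.Select(w => w.ToUpperInvariant())
                          .Distinct()
                          .OrderBy(w => w.Length)
                          .ToList();
        var kept = new List<string>();
        foreach (var word in sorted)
        {
            //Shorter words were processed first, so any redundant
            //longer word is filtered out here.
            if (!kept.Any(shorter => word.Contains(shorter)))
                kept.Add(word);
        }
        return kept;
    }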

2 answers


Based on @yazanpro's trie suggestion, I was able to create a solution that not only searches for words, but also cuts off redundant entries: once ABC is stored as obscene, there is no need to check for ABCFACE, ABCHOLE, ABCHEAD, and so on.

The tree-hydration code currently contains logic that reformats an unoptimized word list into an optimized one, emitted ten words per line. Copy that output and paste it back in as the new reference list.

The value stored on each tree node could be optimized so that the full word is not kept, by changing IsObscene to a settable property instead of deriving it from the word. However, keeping the word makes debugging easier:

    public class ObsceneValue
    {
        public bool IsObscene
        {
            get { return Word != null; }
        }

        public string Word { get; set; }
        public char Character { get; set; }

        public ObsceneValue(char character, string fullWord = null)
        {
            Character = character;
            Word = fullWord;
        }
    }
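
For comparison, the optimized variant described above might look like this (a sketch: IsObscene becomes a settable flag and Word is dropped, at the cost of losing the word for debugging):

    public class ObsceneValue
    {
        //Settable flag instead of deriving obscenity from a stored word.
        public bool IsObscene { get; set; }
        public char Character { get; set; }

        public ObsceneValue(char character, bool isObscene = false)
        {
            Character = character;
            IsObscene = isObscene;
        }
    }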

A node to represent the tree structure:

    public class ObsceneNode
    {
        public ObsceneNode(ObsceneValue value)
        {
            Value = value;
            Children = new Dictionary<char, ObsceneNode>();
        }

        public ObsceneValue Value { get; set; }
        public Dictionary<char, ObsceneNode> Children { get; set; }

        public bool HasChild(char character)
        {
            return Children.ContainsKey(character);
        }

        //Returns the existing child for this character, or adds and
        //returns a new one.
        public ObsceneNode SafeAddChild(ObsceneValue value)
        {
            if (HasChild(value.Character))
            {
                return GetChild(value.Character);
            }

            var node = new ObsceneNode(value);
            Children.Add(value.Character, node);
            return node;
        }

        public ObsceneNode GetChild(char character)
        {
            if (HasChild(character))
            {
                return Children[character];
            }

            return null;
        }
    }


The actual obscenity filter. It includes debug logic that omits the duplicate entries found in almost every obscenity list on the internet; such entries would otherwise require redundant checks, or in this case, a deeper tree than necessary:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class SmutBlocker
    {
        //For complete pruning, shorter words should appear before any
        //longer words that contain them.
        public static string[] WORDS = new[]
        {
            "ANY","LIST","OF","WORDS"
        };

        private static ObsceneNode _root;

        static SmutBlocker()
        {
            //abbreviatedList can be omitted once an optimized list of
            //terms is set to WORDS.
            var abbreviatedList = new List<string>();
            _root = new ObsceneNode(new ObsceneValue(default(char), null));
            var iter = _root;
            foreach (var word in WORDS)
            {
                for (var i = 0; i < word.Length; i++)
                {
                    //A shorter word already terminates here, so this
                    //longer word is redundant; skip the rest of it.
                    if (iter.Value.IsObscene)
                        break;

                    var isObscene = (word.Length - 1) == i;
                    iter = iter.SafeAddChild(new ObsceneValue(word[i], isObscene ? word : null));
                    //If the terminal node already existed (this word is a
                    //prefix of one added earlier), mark it obscene now.
                    if (isObscene && iter.Value.Word == null)
                        iter.Value.Word = word;
                    //The below list is to capture the optimized list.
                    if (isObscene)
                        abbreviatedList.Add(word.ToUpper());
                }
                iter = _root;
            }

            //Purely for fixing a non-optimized list: emits the kept terms,
            //ten per line, ready to paste back in as WORDS.
            //Remove once the list is optimized.
            var output = String.Empty;
            for (var i = 0; i < abbreviatedList.Count; i += 10)
            {
                var segment = abbreviatedList.Skip(i).Take(10);
                output += "\"" + String.Join("\",\"", segment) + "\"," + Environment.NewLine;
            }
        }

        //Finally, the actual IsObscene check that walks the tree.
        public static bool IsObscene(string text)
        {
            //One active trie iterator per starting position still matching.
            var iters = new List<ObsceneNode>(text.Length);
            for (var i = 0; i < text.Length; i++)
            {
                iters.Add(_root);
                var c = text[i];

                for (int j = iters.Count - 1; j >= 0; j--)
                {
                    if (iters[j].HasChild(c))
                    {
                        iters[j] = iters[j].GetChild(c);
                        if (iters[j].Value.IsObscene)
                        {
                            return true;
                        }
                    }
                    else
                    {
                        //This starting position can no longer match; drop it.
                        iters.RemoveAt(j);
                    }
                }
            }

            return false;
        }
    }
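
To use it, generate a code and reject it when the filter matches. The generator below is a hypothetical stand-in for the real base 35 one (note the alphabet omits O):

    //Hypothetical base 35 code generator: 0-9 plus A-Z without O.
    const string Alphabet = "0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ";
    var rng = new Random();
    var code = new string(Enumerable.Range(0, 8)
        .Select(_ => Alphabet[rng.Next(Alphabet.Length)])
        .ToArray());

    if (SmutBlocker.IsObscene(code))
    {
        //Discard and regenerate instead of issuing this code.
        Console.WriteLine("Rejected: " + code);
    }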




Considering that a traditional scan through the word list does not perform well in your scenario, a Trie data structure might be a good place to start. Trie traversal is quite fast: O(m), where m is the length of the search string. In other words, it is effectively constant time per lookup if the Trie is well structured.



Here is an example in C#.
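
As an illustration of the O(m) lookup, a minimal sketch (TrieNode and Lookup are illustrative names, not a specific library):

    using System.Collections.Generic;

    class TrieNode
    {
        public bool IsTerminal;
        public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    }

    //One dictionary hop per character: m characters means O(m) work,
    //independent of how many words are stored in the trie.
    static bool Lookup(TrieNode root, string s)
    {
        var node = root;
        foreach (var c in s)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsTerminal;
    }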
