How to extract phrases and then words from a string of text?

I have a search method that takes a user-entered string, splits it at each whitespace, and then proceeds to find matches based on a list of shared terms:

string[] terms = searchTerms.ToLower().Trim().Split( ' ' );

      

Now I have been given a further requirement: to be able to search for phrases via double-quote delimiters, a la Google. So if the search terms provided were:

"a line of" text

the search would match occurrences of "a line of" and "text" rather than the four separate terms [the opening and closing double quotes would also need to be removed before searching].

How can I achieve this in C#? I would guess that regular expressions are the way to go, but I haven't used them much, so I don't know if they are the best solution.

If you need more information, please ask. Thanks in advance for your help.

+2




6 answers


Here's a regex pattern that will return the matches in a group named 'term':

("(?<term>[^"]+)"\s*|(?<term>[^ ]+)\s*)+

      

So for the input:



"a line" of text

      

The output items identified by the group "term" will be:

a line
of
text
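
As a minimal sketch of how this pattern might be used from C#: .NET keeps every capture of a repeated group, so the individual terms can be read from the Captures collection of the "term" group. The class and method names below (SearchTermParser, Parse) are hypothetical, and the ToLower().Trim() call is carried over from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SearchTermParser
{
    // The pattern from the answer; inside a verbatim string each quote is doubled.
    private const string Pattern = @"(""(?<term>[^""]+)""\s*|(?<term>[^ ]+)\s*)+";

    public static List<string> Parse(string searchTerms)
    {
        var terms = new List<string>();
        Match match = Regex.Match(searchTerms.ToLower().Trim(), Pattern);
        if (!match.Success)
            return terms; // e.g. empty input

        // Both alternatives share the group name "term", so one collection
        // holds the quoted phrases and the single words in input order.
        foreach (Capture capture in match.Groups["term"].Captures)
            terms.Add(capture.Value);

        return terms;
    }
}
```

Calling SearchTermParser.Parse("\"a line\" of text") should yield the three terms listed above, with the quotes already removed.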

      

+2




Regular expressions are definitely the way to go...

You should check this MSDN link for information on the Regex class: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx

and here is a great link for learning the regex syntax: http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx

Then, as a code example, you can do something along these lines:



string searchString = "a line of";

Match m = Regex.Match(textToSearch, searchString);

      

Or, if you just want to know whether the text contains the string or not:

bool success = Regex.Match(textToSearch, searchString).Success;
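
One caveat with the calls above is that searchString is interpreted as a regex pattern, so a user-entered phrase containing characters such as '.' or '+' may not match literally. A possible sketch, assuming a literal, case-insensitive phrase match is what is wanted (PhraseSearch and ContainsPhrase are hypothetical names):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class PhraseSearch
{
    // Returns true when textToSearch contains the phrase, ignoring case.
    public static bool ContainsPhrase(string textToSearch, string phrase)
    {
        // Regex.Escape turns regex metacharacters in the phrase into literals.
        string pattern = Regex.Escape(phrase);
        return Regex.IsMatch(textToSearch, pattern, RegexOptions.IgnoreCase);
    }
}
```

For a plain literal search like this, string.IndexOf with a StringComparison argument would also work; the regex form is kept here only to stay close to the answer.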

      

+1




Use the regex builder here:

http://gskinner.com/RegExr/

and you will be able to adjust the regex as you need it.

+1




Use regular expressions ...

string textToSearchIn = "\"string\" text";
string result = Regex.Match(textToSearchIn, "(?<=\").*?(?=\")").Value;

or, if there is more than one, put them in a match collection...

MatchCollection allPhrases = Regex.Matches(textToSearchIn, "(?<=\").*?(?=\")");

+1




Knuth-Morris-Pratt (the KMP algorithm) is a well-known linear-time algorithm for finding substrings in strings (well, technically not strings, but byte arrays): it needs O(n + m) comparisons for a haystack of length n and a needle of length m.

using System.Collections.Generic;

namespace KMPSearch
{
    public class KMPSearch
    {
        public static int NORESULT = -1;

        private string _needle;
        private string _haystack;
        private int[] _jumpTable;

        public KMPSearch(string haystack, string needle)
        {
            Haystack = haystack;
            Needle = needle;
        }

        // Builds the KMP failure table: JumpTable[i] is the length of the longest
        // proper prefix of Needle[0..i] that is also a suffix of it.
        public void ComputeJumpTable()
        {
            int needleLength = Needle.Length;
            JumpTable = new int[needleLength];

            int k = 0;
            for (int i = 1; i < needleLength; i++)
            {
                // Fall back to shorter borders until the next character extends one.
                while (k > 0 && Needle[k] != Needle[i])
                    k = JumpTable[k - 1];

                if (Needle[k] == Needle[i])
                    k++;

                JumpTable[i] = k;
            }
        }

        public int[] MatchAll()
        {
            List<int> matches = new List<int>();
            int offset = 0;
            int needleLength = Needle.Length;
            int m = Match(offset);

            while (m != NORESULT)
            {
                matches.Add(m);
                offset = m + needleLength;
                m = Match(offset);
            }

            return matches.ToArray();
        }

        public int Match()
        {
            return Match(0);
        }

        public int Match(int offset)
        {
            int haystackLength = Haystack.Length;
            int needleLength = Needle.Length;

            // An empty needle is treated as "no match" so that MatchAll always terminates.
            if (needleLength == 0 || offset < 0 || needleLength > haystackLength - offset)
                return NORESULT;

            ComputeJumpTable();

            int needleIndex = 0;

            for (int haystackIndex = offset; haystackIndex < haystackLength; haystackIndex++)
            {
                // On a mismatch, fall back to the longest border of the matched prefix.
                while (needleIndex > 0 && Haystack[haystackIndex] != Needle[needleIndex])
                    needleIndex = JumpTable[needleIndex - 1];

                if (Haystack[haystackIndex] == Needle[needleIndex])
                    needleIndex++;

                // Full match: return the index at which it starts.
                if (needleIndex == needleLength)
                    return haystackIndex - needleLength + 1;
            }

            return NORESULT;
        }

        public string Needle
        {
            get { return _needle; }
            set { _needle = value; }
        }

        public string Haystack
        {
            get { return _haystack; }
            set { _haystack = value; }
        }

        public int[] JumpTable
        {
            get { return _jumpTable; }
            set { _jumpTable = value; }
        }
    }
}

      

Usage:

using System;
using System.Collections.Generic;
using System.Text;
using System.Reflection;
namespace KMPSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: " + Environment.GetCommandLineArgs()[0] + " haystack needle");
            }
            else
            {
                KMPSearch search = new KMPSearch(args[0], args[1]);
                int[] matches = search.MatchAll();
                foreach (int i in matches)
                    // Parentheses are needed; otherwise "+" concatenates i and 1 as text.
                    Console.WriteLine("Match found at position " + (i + 1));
            }
        }

    }
}

      

0




Try this; it returns an array of the matches found in the text (Cast<Match>() needs using System.Linq). For example, for "a line of" text " notepad":

string textToSearch = "\"a line of\" text \" notepad\"";

MatchCollection allPhrases = Regex.Matches(textToSearch, "(?<=\").*?(?=\")");

var RegArray = allPhrases.Cast<Match>().ToArray();

      

Output (the match values, with surrounding spaces kept): {"a line of", " text ", " notepad"}
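
Since the array above holds Match objects rather than strings, a small sketch of turning it into a trimmed string[] may help. QuotedText and Extract are hypothetical names, and the trimming and empty-value filtering are assumptions about what a search would want:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class QuotedText
{
    // Runs the lookaround pattern from the answer and returns trimmed match values.
    public static string[] Extract(string textToSearch)
    {
        return Regex.Matches(textToSearch, "(?<=\").*?(?=\")")
            .Cast<Match>()
            .Select(m => m.Value.Trim())
            .Where(value => value.Length > 0)
            .ToArray();
    }
}
```

Note that this pattern matches any text lying between two double quotes, so the unquoted text between a closing quote and the next opening quote is returned too, while words before the first quote or after the last one are not.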

0








