MySQL fetch maximum length of match string

I need to return all text result (s), if any, that share the total length bounded by the left substring common to the search string.

Given a search for "StackOverflow" in a table column containing

"Stack",
"Sta", 
"StackOv", 
"StackOverthrow",
"StackOverSlow",
"StackFlow", 
"Soverflow",
"StackOverCrow",
"StackOverSlow",
etc. 

      

the query will return "StackOverthrow" because it contains the most matching characters, and StackOverSlow and StackOverCrow in the unique result set. I am currently doing something inefficient, which has to start with a LIKE search for the first characters and keep iterating and expanding the search string until nothing is found and the last good result is stored.

i.e.

select names from table where name like 'XX%';


 "S" ->Results
 "St"->Results
 . .
 "StackOver"->Results 
 "StackOverf"-> No results (Last result returning items beginning with StackOver etc  as being the correct answer)

      

I know this approach is extremely inefficient, can anyone provide a single query to achieve this result? I know that I could search all combinations at once and filter the longest results in the code, however, I think the DB should be better at this.

Edit1: Please note that the above example simplifies somewhat. The vast majority of data in the database is between 2 and 10 characters, with the most common match length being about 3 characters. The table contains more than 100 thousand records.

Edit2: Apologies, I need to clarify that there may be more than one correct result, and that the results may contain duplicates that need to be removed. Nowadays, with my ineffective method of selecting individuals is easy.

+2


source to share


3 answers


Indexed on name

, the following should be extremely efficient:

SELECT DISTINCT name
FROM   myTable
WHERE  name LIKE CASE
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'S%') THEN '%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'St%') THEN 'S%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Sta%') THEN 'St%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stac%') THEN 'Sta%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stack%') THEN 'Stac%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackO%') THEN 'Stack%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOv%') THEN 'StackO%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOve%') THEN 'StackOv%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOver%') THEN 'StackOve%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverf%') THEN 'StackOver%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverfl%') THEN 'StackOverf%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflo%') THEN 'StackOverfl%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflow%') THEN 'StackOverflo%'
  ELSE 'StackOverflow%'
END

      



Have a look at the sqlfiddle .

+3


source


Not sure why you look at the smallest first. I would do it in reverse ... try LONGEST EXACT first, and if not found, open 1 character at a time until you find it.



0


source


You can make a request after creating a saved Levenshtein Distance function . This can lead to better results for you.

This is not my code. I got this from here . It seems to be well tested on sqlfiddle.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN 
            SET cost = 0; ELSE SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
            IF c > c_temp THEN 
              SET c = c_temp; 
            END IF;
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END;

      

Then your request might look something like this:

SELECT names, levenshtein(`names`, 'StackOverflow') as dist
FROM mytable
ORDER BY dist;

      

This is how it looks on the sqlfiddle .

The results will look like this, the smallest distance will be the closest:

NAMES           DIST
StackOverthrow  3
StackFlow       4
Soverflow       4
StackOv         6
Stack           8
Sta             10

      

0


source







All Articles