MySQL fetch maximum length of match string
I need to return all text result (s), if any, that share the total length bounded by the left substring common to the search string.
Given a search for "StackOverflow" in a table column containing
"Stack",
"Sta",
"StackOv",
"StackOverthrow",
"StackOverSlow",
"StackFlow",
"Soverflow",
"StackOverCrow",
"StackOverSlow",
etc.
the query will return "StackOverthrow" because it contains the most matching characters, and StackOverSlow and StackOverCrow in the unique result set. I am currently doing something inefficient, which has to start with a LIKE search for the first characters and keep iterating and expanding the search string until nothing is found and the last good result is stored.
i.e.
select names from table where name like 'XX%';
"S" ->Results
"St"->Results
. .
"StackOver"->Results
"StackOverf"-> No results (Last result returning items beginning with StackOver etc as being the correct answer)
I know this approach is extremely inefficient, can anyone provide a single query to achieve this result? I know that I could search all combinations at once and filter the longest results in the code, however, I think the DB should be better at this.
Edit1: Please note that the above example simplifies somewhat. The vast majority of data in the database is between 2 and 10 characters, with the most common match length being about 3 characters. The table contains more than 100 thousand records.
Edit2: Apologies, I need to clarify that there may be more than one correct result, and that the results may contain duplicates that need to be removed. Nowadays, with my ineffective method of selecting individuals is easy.
source to share
Indexed on name
, the following should be extremely efficient:
SELECT DISTINCT name
FROM myTable
WHERE name LIKE CASE
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'S%') THEN '%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'St%') THEN 'S%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Sta%') THEN 'St%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stac%') THEN 'Sta%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stack%') THEN 'Stac%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackO%') THEN 'Stack%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOv%') THEN 'StackO%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOve%') THEN 'StackOv%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOver%') THEN 'StackOve%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverf%') THEN 'StackOver%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverfl%') THEN 'StackOverf%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflo%') THEN 'StackOverfl%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflow%') THEN 'StackOverflo%'
ELSE 'StackOverflow%'
END
source to share
You can make a request after creating a saved Levenshtein Distance function . This can lead to better results for you.
This is not my code. I got this from here . It seems to be well tested on sqlfiddle.
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
Then your request might look something like this:
SELECT names, levenshtein(`names`, 'StackOverflow') as dist
FROM mytable
ORDER BY dist;
This is how it looks on the sqlfiddle .
The results will look like this, the smallest distance will be the closest:
NAMES DIST
StackOverthrow 3
StackFlow 4
Soverflow 4
StackOv 6
Stack 8
Sta 10
source to share