How do you detect duplicates in a list of strings?

I have a sequence of SQL calls that I want to use to detect loops (and thus unnecessary duplicate sql calls), but I became aware of this more general problem.

Given a list, let's say [a,b,c,b,c,a,b,c,b,c,a,b,b]

Can I somehow turn this into a,[[b,c]*2,a]*2,b*2

or, [a,[b,c]*2]*2,a,b*2

That is, the detection of repetitions (possibly nested).

+4


source to share


4 answers


Look at the Lempel-Ziv-Welsh compression algorithm . It is built around detecting duplicate strings and using them for compression. I suppose you can use Trie  for it.



+4


source


If you can sort it first, then it's easy to go through it one more time to find duplicate runs. Sorting something like free-form, of course, since SQL queries sound a little scary.



0


source


I am not an expert in this field, but you can check some compression algorithms, it seems to me that this is exactly what they do.

0


source


If the string is large enough, an interesting approach is to run a compression tool (like gzip, bzip, or 7zip). These tools work by finding repetitions (at different levels) and substituting them with pointers to the first instance of the text (or dictionary). The squeeze you achieve is a measure of repetition. Dropping the file (you have to write code to do this) will give you duplicate content.

0


source







All Articles