How do you detect duplicates in a list of strings?
I have a sequence of SQL calls that I want to use to detect loops (and thus unnecessary duplicate sql calls), but I became aware of this more general problem.
Given a list, let's say
[a,b,c,b,c,a,b,c,b,c,a,b,b]
Can I somehow turn this into
a,[[b,c]*2,a]*2,b*2
or, [a,[b,c]*2]*2,a,b*2
That is, the detection of repetitions (possibly nested).
source to share
Look at the Lempel-Ziv-Welsh compression algorithm . It is built around detecting duplicate strings and using them for compression. I suppose you can use Trie for it.
source to share
If the string is large enough, an interesting approach is to run a compression tool (like gzip, bzip, or 7zip). These tools work by finding repetitions (at different levels) and substituting them with pointers to the first instance of the text (or dictionary). The squeeze you achieve is a measure of repetition. Dropping the file (you have to write code to do this) will give you duplicate content.
source to share