How do you detect duplicates in a list of strings?

Question

I have a sequence of SQL calls in which I want to detect loops (and thus unnecessary duplicate SQL calls), but it made me aware of a more general problem.

Given a list, say [a,b,c,b,c,a,b,c,b,c,a,b,b]

can I somehow turn this into a,[[b,c]*2,a]*2,b*2

or [a,[b,c]*2]*2,a,b*2?

That is, I want to detect repetitions (possibly nested).

+4

string algorithm analysis

Greg 08 dec. '08 at 15:12

4 answers

If you can sort the list first, a single pass over it is then enough to find runs of duplicates. Of course, sorting something as free-form as SQL queries sounds a little scary.
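
A minimal sketch of the sort-then-scan idea, assuming the calls are available as plain strings (the function name `duplicate_runs` is made up for illustration). Note that sorting throws away the original order, so this finds repeated items but not the loops or nested repetitions the question asks about.

```python
from itertools import groupby

def duplicate_runs(items):
    """Return {item: count} for every item that occurs more than once."""
    counts = {}
    # After sorting, equal items are adjacent, so groupby yields one run per value.
    for key, run in groupby(sorted(items)):
        n = sum(1 for _ in run)
        if n > 1:
            counts[key] = n
    return counts

calls = ["a", "b", "c", "b", "c", "a", "b", "c", "b", "c", "a", "b", "b"]
print(duplicate_runs(calls))  # {'a': 3, 'b': 6, 'c': 4}
```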

0

unwind 08 dec. '08 at 15:18

I am not an expert in this field, but you could look at some compression algorithms. It seems to me that this is exactly what they do.

0

Bombe 08 dec. '08 at 15:19

If the string is large enough, an interesting approach is to run a compression tool (like gzip, bzip2, or 7zip) on it. These tools work by finding repetitions (at different levels) and replacing them with pointers to the first instance of the text (or to a dictionary). The compression you achieve is a measure of how repetitive the input is. Dumping the compressed file (you will have to write code to do this) will give you the content of the repetitions.
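
As a rough illustration of using compression as a repetition measure, here is a sketch built on Python's standard `zlib` module (DEFLATE, the same scheme gzip uses). The helper name `compression_ratio` and the newline separator are assumptions for the example. It only gives a score; it does not dump the repeated content.

```python
import zlib

def compression_ratio(items, sep="\n"):
    """Compressed size divided by raw size; lower means more repetition."""
    raw = sep.join(items).encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress, treat as incompressible
    return len(zlib.compress(raw, 9)) / len(raw)

# Hypothetical query logs: one looping call versus 200 distinct calls.
repetitive = ["SELECT * FROM users WHERE id = 1"] * 200
varied = [f"SELECT * FROM t{i} WHERE id = {i * 7919}" for i in range(200)]

print(compression_ratio(repetitive))
print(compression_ratio(varied))
```

The repetitive log should score much lower than the varied one, which is the signal the answer describes.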

0

Diomidis Spinellis 08 dec. '08 at 15:20

Yuval F · Accepted Answer · 2008-12-08T15:19:14+0000

Look at the Lempel-Ziv-Welch (LZW) compression algorithm. It is built around detecting repeated strings and using them for compression. I suppose you could use a trie for it.
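
A hedged sketch of how an LZW-style pass could surface repeated phrases in a list of tokens. It is a simplification: the dictionary is a plain set of tuples rather than a trie, no compressed codes are emitted, and it reports reused phrases rather than the nested form from the question. The name `lzw_repeated_phrases` is made up for the example.

```python
from collections import Counter

def lzw_repeated_phrases(tokens):
    """Count multi-token phrases that the LZW dictionary matches again."""
    dictionary = {(t,) for t in tokens}  # seed with every single token
    reused = Counter()
    w = ()  # longest phrase matched so far
    for t in tokens:
        candidate = w + (t,)
        if candidate in dictionary:
            w = candidate  # keep extending the current match
        else:
            # w was learned earlier and matched again here, so it is a repeat.
            if len(w) > 1:
                reused[w] += 1
            dictionary.add(candidate)  # learn the new, longer phrase
            w = (t,)
    if len(w) > 1:
        reused[w] += 1
    return reused

calls = ["a", "b", "c", "b", "c", "a", "b", "c", "b", "c", "a", "b", "b"]
print(lzw_repeated_phrases(calls))
```

On longer inputs the learned phrases grow by one token per match, so a loop that repeats many times shows up as progressively longer reused phrases.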
