How do I find unique records in a large dataset?
I have 100 million lines of data; each line is a single word of no more than 15 characters. The data is spread across several files.
My goal is to find unique words among all files.
One solution is to import all the words into a database and add a unique key on that field, but that is too slow for a dataset this large.
Is there a faster solution?
Thanks
I'm not sure there are many faster ways than using a database. Personally, I usually use a UNIX shell script for this:
cat * | sort | uniq
I don't know how fast that will be with 100,000,000 words, and I'm not sure how fast you need it to be. (For example, do you need to run this many times, or just once? If only once, I would go with sort and uniq and let it run overnight, if you can.)
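If you do go the sort route, GNU sort can drop the duplicates itself and take a bit of tuning for a dataset this size; a rough sketch (the buffer size, core count and output file name are just placeholders):

# -u drops duplicates during the sort, -S sets the in-memory buffer,
# --parallel uses several cores, -T picks the temp directory, and
# LC_ALL=C avoids slow locale-aware comparisons.
LC_ALL=C sort -u -S 2G --parallel=4 -T /tmp * > unique_words.txt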
Alternatively, you can write a script in ruby or a similar language that stores words in an associative array. I suspect it will almost certainly be slower than the database approach.
If you really need speed, and need to do this task (or ones like it) often, then you could write something in C, but to me that looks a bit like overkill.
Ben
If you have that much data, it should be in an SQL server. That is why SQL was developed in the first place. If you keep working with these files, you will always run into performance problems.
Even if the files are modified by external programs (or arrive via FTP), you will need to create an import process that runs overnight.
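Just to show the shape of such an import process, here is a minimal sketch using the sqlite3 command-line shell (the database and file names are made up, and it assumes the words contain no commas or quotes; any SQL server's bulk loader would play the same role):

sqlite3 words.db <<'EOF'
CREATE TABLE IF NOT EXISTS words(w TEXT);
.mode csv
.import words1.txt words
.import words2.txt words
EOF

# let the database compute the unique set
sqlite3 words.db "SELECT DISTINCT w FROM words;" > unique_words.txt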
If there is significant duplication within the individual files, it might be faster to deduplicate file by file and then combine the results. Something like:
{ for n in * ; do sort -u "$n" ; done } | sort -u
(I'm assuming GNU bash and GNU sort)
I think the choice of the best solution will depend largely on the distribution of duplicates and the number of individual files, which you haven't told us about.
Given myhusky's clarification (plenty of duplicates, 10–20 files), I would definitely suggest this as a good solution. In particular, dense duplication will make sort -u faster than sort | uniq.
You can have speed, space, or your sanity. Pick any two.
Throwing it all into a database sacrifices speed and space, as you've learned, but it is easy.
If space is your main concern (memory, disk space), then split the work. Filter all the 1-character lines from the files and use one of the solutions above (sort, uniq). Repeat with the 2-character lines from each file, and so on. The unique results from each pass form your solution set.
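A rough sketch of that pass-per-length idea in the shell, assuming GNU tools and words of at most 15 characters (the output file name is just a placeholder):

# one pass per word length; each pass only has to hold words of that length
for len in $(seq 1 15); do
    awk -v L="$len" 'length($0) == L' * | sort -u
done > unique_words.txt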
If speed is your main concern, read each file exactly once, building a hash table (dictionary, whatever) to spot duplicates. Depending on the hash implementation, this can eat up a lot of memory (or disk). But it will be fast.
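In the shell, awk's associative arrays give you exactly that kind of hash-table pass; a minimal sketch (memory use grows with the number of distinct words, and the output name is a placeholder):

# print each word the first time it is seen, skip later occurrences
awk '!seen[$0]++' * > unique_words.txt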
If you need to conserve both speed and space, consider mixing the two methods. But be prepared to sacrifice the third item.
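One possible way to mix them, for what it's worth: run one pass per word length so the hash only ever holds words of one length, at the cost of reading the files once per length (output name is a placeholder again):

# per-length passes bound the memory; the hash inside each pass avoids a full sort
for len in $(seq 1 15); do
    awk -v L="$len" 'length($0) == L && !seen[$0]++' *
done > unique_words.txt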