How do I find unique records in a large dataset?

I have 100 million lines of data; each line is a single word of no more than 15 characters. The data is spread across several files.

My goal is to find the unique words across all of the files.

One solution is to import all of the words into a database and put a unique key on that column, but that is too slow for a dataset this size.

Is there a faster solution?

Thanks

+1




7 replies


I'm not sure there are many ways faster than using a database. Personally, I usually use a UNIX shell pipeline for this:

cat * | sort | uniq

I don't know how fast that will be with 100,000,000 words, and I'm not sure how fast you need it to be. (For example, do you need to run this many times, or just once? If only once, I would go with the sort and uniq approach and let it run overnight if you can.)



Alternatively, you could write a script in Ruby or a similar language that stores the words in an associative array. I suspect it would almost certainly be slower than the database approach, though.
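As a rough illustration, here is a minimal sketch of that idea in Python (standing in for "Ruby or a similar language"); the glob pattern and output file name are placeholders, not part of the original answer:

# Minimal sketch: collect the distinct words from several files into an
# in-memory set (an associative structure), one word per line.
# The glob pattern and output file name are illustrative placeholders.
import glob

unique_words = set()
for path in glob.glob("words-*.txt"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                unique_words.add(word)

# Write the distinct words back out, one per line.
with open("unique.txt", "w", encoding="utf-8") as out:
    for word in unique_words:
        out.write(word + "\n")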

I think that if you really need speed, and need to perform this task (or ones like it) often, then you could write something in C, but to me that looks a bit like overkill.

Ben

+3




Using a database for this is insane. 100 million records of 15 characters fit in RAM. If there is at least some duplication, just build a trie. You should be able to process 50 MB/second or so on a modern machine.
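The answer gives no code; a minimal Python sketch of the trie idea, with illustrative class and method names of my own, might look like this:

# A toy trie built from nested dicts. insert() returns True only the first
# time a word is seen, so the caller can count (or emit) the unique words.
# The class and method names are illustrative, not from the answer above.
class Trie:
    END = "$"  # end-of-word marker

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        if Trie.END in node:
            return False          # word was already present
        node[Trie.END] = True
        return True               # first time this word has been seen


trie = Trie()
unique = [w for w in ["apple", "banana", "apple"] if trie.insert(w)]
print(unique)  # ['apple', 'banana']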



+1




If you need to stick with the file structure, then you need to index the files in some way and then maintain that index.

Otherwise, I would recommend moving to a database and transferring all operations on these files to the database.

0




You could store the words in a hash table. Assuming there are quite a few duplicates, the O(1) lookup time would be a big performance win. The loop is simple (a sketch follows the list):

  • Read a line.
  • Look the word up in the hash table.
  • If it is not found, add it to the table.
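A minimal Python rendering of those three steps, reading from standard input (the dict, file names, and script name are mine, not the answerer's):

# Read a line, look the word up in a hash table (a plain dict here),
# and add it if it is missing.
# Hypothetical usage:  cat file1 file2 | python unique_words.py > unique.txt
import sys

seen = {}                             # dict used as the hash table
for line in sys.stdin:                # read the line
    word = line.strip()
    if word and word not in seen:     # O(1) average-case lookup
        seen[word] = True             # not found, so add it to the table
        print(word)                   # emit each word the first time it appears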
0




If you have a lot of data, it should live in a SQL server; that is what SQL was developed for in the first place. If you keep working with these flat files, you will keep running into performance problems.

Even if these files are modified by external programs (or via FTP), you can create an import process that runs overnight.

0




If there is significant duplication within the individual files, it may be faster to deduplicate file by file and then combine the results. Something like:

{ for n in * ; do sort -u "$n" ; done ; } | sort -u

(I'm assuming GNU bash and GNU sort)

I think the choice of the best solution will depend largely on the distribution of duplicates and the number of individual files, which you have not shared with us.


Considering myhusky's clarification (lots of duplicates, 10–20 files), I would definitely suggest this as a good solution. In particular, with dense duplication the per-file sort -u passes will be faster than a single sort | uniq over all the data.

0




You can have speed, space, or ease of implementation. Pick any two.

Throwing it all into a database sacrifices speed and space, as you have learned, but it was easy.

If space is your main concern (memory, disk space), then split the work. Filter all of the 1-character lines out of the files and run one of the solutions above (sort, uniq) on them. Repeat with the 2-character lines from each file, and so on. The unique words from each pass together form your solution set.
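A minimal sketch of that pass-per-length split in Python, assuming placeholder file names; each pass only holds the words of a single length in memory:

# Split-by-length idea: one pass per word length, so only the words of a
# single length are held in memory at any time.
# The input glob and output file name are placeholders.
import glob

MAX_LEN = 15
files = glob.glob("words-*.txt")

with open("unique.txt", "w", encoding="utf-8") as out:
    for length in range(1, MAX_LEN + 1):
        words_of_length = set()
        for path in files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    word = line.strip()
                    if len(word) == length:
                        words_of_length.add(word)
        for word in words_of_length:
            out.write(word + "\n")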

If speed is your main concern, read each file exactly once, building a hash table (dictionary, whatever) to detect duplicates. Depending on the hash implementation, this can put a heavy load on memory (or disk), but it will be fast.

If you need to conserve both speed and space, consider mixing the two methods, but be prepared to sacrifice the third point.

0








