How to store visited URLs and maintain a job queue while writing a crawler

I am writing a crawler. I store the visited URLs in a Redis set and the job queue in a Redis list. As the data grows it eats up memory, and my machine only has 4 GB. How can I persist these without Redis? If I store them in files, I don't see how to avoid keeping them in memory as well.
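For reference, a minimal sketch of the pattern described above using redis-py; the key names and the dedupe-on-enqueue check are illustrative, not my exact code:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(url):
    # Only queue URLs we have not seen before; SADD returns 1 for new members.
    if r.sadd("visited", url):
        r.lpush("jobs", url)

def next_job():
    # Blocking pop from the job queue; returns None if nothing arrives in time.
    item = r.brpop("jobs", timeout=5)
    return item[1].decode() if item else None
```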

If I used MySQL to store this, I expect it would be much slower than Redis.

I have 5 machines with 4 GB of memory each, so if anyone has a way to create a Redis cluster, that would help as well. I know how to set up a cluster for failover, but what I need is a weighted cluster.

Thanks

1 answer


If you're just doing basic add/remove operations on sets and lists, take a look at twemproxy (nutcracker). With it you can spread your keys across all of your nodes.
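A minimal nutcracker.yml sketch, assuming your five Redis nodes sit behind one proxy; the pool name, IPs, and ports are placeholders. The trailing number on each server line is its weight, which is also how you would express the weighted distribution you mentioned:

```yaml
crawler_pool:
  listen: 127.0.0.1:22121      # your crawler's redis client connects here
  hash: fnv1a_64
  distribution: ketama          # consistent hashing across the pool
  redis: true
  auto_eject_hosts: false
  servers:
    - 10.0.0.1:6379:1
    - 10.0.0.2:6379:1
    - 10.0.0.3:6379:2           # weight 2: receives roughly twice the keys
```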

As for the usage pattern itself: are you deleting or expiring jobs and URLs once they're processed? How much repetition is there in the system? For example, do you crawl the same URLs repeatedly? If so, perhaps all you need is to map each URL to its last crawl time, and instead of a job queue, pull in the URLs that are new or whose last crawl falls outside a given window, as in the sketch below.
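One way to implement that is a Redis sorted set with the last crawl timestamp as the score; this is only a sketch of the idea, with illustrative key names and window:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)
RECRAWL_WINDOW = 24 * 3600  # recrawl anything older than a day (illustrative)

def mark_crawled(url):
    # Score is the last crawl time; re-adding an existing URL just updates its score.
    r.zadd("last_crawl", {url: time.time()})

def add_new(url):
    # New URLs get score 0 so they are immediately due for crawling.
    r.zadd("last_crawl", {url: 0})

def due_urls(limit=100):
    # URLs whose last crawl is outside the window (or that have never been crawled).
    cutoff = time.time() - RECRAWL_WINDOW
    return [u.decode() for u in
            r.zrangebyscore("last_crawl", 0, cutoff, start=0, num=limit)]
```

With this layout the sorted set replaces both the visited set and the job queue, so you keep one entry per URL instead of two.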



Without details on how your crawler actually works and how it interacts with Redis, that is about all I can suggest. If memory keeps growing, it probably means that you are never flushing old data out of the DB.
