Save deque to text file

I am writing a crawler in Python so that Ctrl + C does not cause my crawler to run on the next run, I need to store the deque handler in a text file (one item per line) and update it every iteration, the update operation should be very fast. To avoid reinventing the wheel, I ask if there is an installed module for this?

+1


source to share


4 answers


Alternatively, you can customize the exit function and sort the deque on exit.



Pickle exit function

+4


source


You can use pickle to serialize your lists.



+1


source


I'm not sure if I understood the question correctly, I'm just curious, so here are some questions and suggestions:

Are you planning to catch the Ctrl + C interrupt and do a deque? What happens if the crawler fails for any reason, such as an unhandled exception or crash? Do you lose your queue status and start over? from the documentation:

Note

The exit function is not called when the program is killed by a signal, when an internal Python error is detected, or when os._exit () is called.

What happens if you visit the same URI again, are you maintaining a visited list, or something else?

I think you should maintain some kind of visit and session / status information for every URI you are viewing. You can use the visit information to decide whether to bypass a URI or not when you visit the same URI next time. Other information - session information - for the last session with this URI will help in collecting only incremental material, and if the page doesn't change, it doesn't need to be brought up, saving some I / O costs, duplicates, etc.

This way you don't have to worry about ctrl + C or crashing. If the crawler descends for any reason, say after crawling 60K messages, when there is still 40K left, next time the crawler fills up the queue, although the queue may be huge, the crawler can check if it has visited the URI or not and what is was the state of the page when crawling it - optimization - whether the page is required for a new record that it changed or not.

Hope this helps.

+1


source


Some things that come to my mind:

  • leave the file descriptor open (don't close the file every time you write something)
  • or write a file every n elements and catch a close signal to write the current unwritten elements
0


source







All Articles