Memory leak in Scrapy

I wrote the following code to scrape email addresses (for testing purposes):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from crawler.items import EmailItem

class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    # Follow every extracted link and pass each response to process_item.
    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer = response.url
        items = list()
        # Collect every e-mail-looking string on the page into a list of items.
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            items.append(emailitem)
        return items

Unfortunately, it looks like the requests are not released properly: in the Scrapy telnet console the number of live requests grows by about 5 per second. After ~3 min and ~10k scraped pages my system starts swapping (8 GB of RAM). Has anyone figured out what is wrong? I already tried to drop the reference and to "copy" the string using

emailitem['email'] = ''.join(email)

without success. After scraping, the items are stored in a BerkeleyDB, counting their occurrences (using a pipeline), so the references should be gone after that.
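For reference, the counting pipeline is roughly of this shape (a simplified, hypothetical sketch using the bsddb3 bindings; the class and file names are placeholders, not my exact code):

# Hypothetical sketch of a counting pipeline backed by a BerkeleyDB hash database.
# EmailCountPipeline and email_counts.db are illustrative names only.
from bsddb3 import db

class EmailCountPipeline(object):

    def open_spider(self, spider):
        self.counts = db.DB()
        # Open (or create) a hash-type BerkeleyDB file holding the counters.
        self.counts.open('email_counts.db', None, db.DB_HASH, db.DB_CREATE)

    def process_item(self, item, spider):
        key = item['email'].encode('utf-8')
        value = self.counts.get(key)
        count = int(value) if value is not None else 0
        self.counts.put(key, str(count + 1).encode('ascii'))
        return item

    def close_spider(self, spider):
        self.counts.close()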

What is the difference between returning a list of items and yielding each item separately?

EDIT:

After quite a bit of debugging, it turned out that the requests are not released, so I get:

$> nc localhost 6023
>>> prefs()
Live References
Request 10344   oldest: 536s ago
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> r.url
<GET http://news.google.com>

which is actually the start URL. Does anyone know what the problem is? What is holding on to the reference to the Request object?
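For anyone reproducing this: the same trackref tooling can also list every live request from the telnet console, which makes it easier to see what piles up behind the start URL (using scrapy.utils.trackref.iter_all, next to the get_oldest call shown above):

>>> from scrapy.utils.trackref import iter_all
>>> urls = [r.url for r in iter_all('Request')]
>>> len(urls)   # should roughly match the count reported by prefs()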

EDIT2:

After running for ~12 hours on a server (with 64 GB of RAM), the used RAM is ~16 GB (according to ps, even though ps is not really the right tool for this). The problem is that the crawl rate has dropped significantly, and the number of scraped items has been stuck at 0 items/min for hours:

INFO: Crawled 122902 pages (at 82 pages/min), scraped 3354 items (at 0 items/min)

EDIT3: I did an objgraph analysis that leads to the following graph (thanks to @Artur Gaspar): Python Objgraph Backlink

It doesn't look like something I can influence, though?

2 answers


In the end, the answer for me was to use a disk-based request queue in combination with a working directory passed as a runtime parameter.

This means adding the following to settings.py:

DEPTH_PRIORITY = 1 
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

After that, starting the crawler with the following command line makes the changes persistent in the given directory:

scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}

See the Scrapy docs for details.

An additional advantage of this approach is that the crawl can be paused and resumed at any time. My spider has now been running for more than 11 days, blocking ~15 GB of memory (file cache memory for the FIFO disk queues).
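For example, a single Ctrl-C shuts the spider down gracefully, and re-running the exact same command with the same JOBDIR resumes the crawl where it left off (the directory must not be shared by different spiders or runs). Using the emailextractor spider from the question:

scrapy crawl emailextractor -s JOBDIR=crawls/emailextractor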



If you yield each item separately, the code is executed differently by the Python interpreter: it is no longer a function, but a generator.

This way, the complete list is never created; each item only gets its memory allocated when the code that uses the generator asks for the next item.
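For example, the callback from the question could yield each EmailItem as it is found instead of building the list first (a sketch reusing the question's own code):

    def process_item(self, response):
        refer = response.url
        # Yielding turns this callback into a generator: each item is handed
        # to the engine and pipelines one at a time instead of piling up in a list.
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            yield emailitem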



So it may be that you don't have a memory leak at all; you just have a lot of allocated memory, roughly 10k pages times the memory used by the list of items for one page.

Of course, you could still have a real memory leak; there are tips for debugging memory leaks in Scrapy here.







