Connecting to Databases in Scrapy

The Scrapy documentation nicely describes how to work with a database in its MongoDB pipeline example.

If I write another pipeline that also needs database access, I will have to create another connection. If I write a downloader middleware (for logging requests to the database), that's yet another. Database connections are quite expensive, so this seems rather wasteful. SQLAlchemy, for example, uses a connection pool for exactly this reason, along the lines of the sketch below.
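(For comparison, a minimal sketch of SQLAlchemy's pooling; the connection URL and query are placeholders:)

from sqlalchemy import create_engine

# create_engine() keeps an internal connection pool; connect() checks a
# connection out, and close() returns it to the pool instead of closing it.
engine = create_engine('mysql://user:password@localhost/dbname', pool_size=5)

conn = engine.connect()
conn.execute("SELECT 1")
conn.close()  # returned to the pool, ready for reuse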

So, my question(s): Is there a better way to establish a connection once and reuse it across extensions, middlewares, and pipelines? And are there any issues with the async nature of Scrapy versus the blocking standard DBAPI2 (namely: would it be better, or pointless, to look at using twisted.enterprise.adbapi instead)?

I sketched an extension along the following lines (assume the methods are wired up correctly via signals):

import MySQLdb

class SharedDbExtension(object):

    def __init__(self):
        # One connection per crawler process, shared by all spiders
        self.db = MySQLdb.connect('...')

    def spider_opened(self, spider):
        # Expose the shared connection on the spider object
        spider.db = self.db

    def spider_closed(self, spider):
        spider.db.close()

Thanks in advance.



1 answer


You can keep your shared code in a dedicated module as a singleton and call that code from any of your item pipelines, downloader middlewares, spider middlewares, or extensions.
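For example, here is a minimal sketch of such a shared module (the file name connection.py and the helper get_connection() are illustrative, not part of Scrapy):

# connection.py -- a module is imported only once per process, so its
# module-level state effectively acts as a singleton.
import MySQLdb

_db = None

def get_connection():
    # Create the connection lazily on first use, then reuse it.
    global _db
    if _db is None:
        _db = MySQLdb.connect('...')
    return _db

def close_connection():
    global _db
    if _db is not None:
        _db.close()
        _db = None

Any pipeline, middleware, or extension can then do from connection import get_connection, and they will all share the same handle.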



As for twisted.enterprise.adbapi, it would definitely be better to use it if it is up to the task, so that your database access doesn't block your crawls.
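Here is a minimal sketch of an item pipeline built on adbapi.ConnectionPool (the connection parameters, table, and column names are made up for illustration; Scrapy will wait on the Deferred returned by process_item):

from twisted.enterprise import adbapi

class MySQLStorePipeline(object):

    def __init__(self):
        # The pool runs each query in a thread and returns a Deferred,
        # so the Twisted reactor (and the crawl) is never blocked.
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            host='localhost',
                                            user='user',
                                            passwd='password',
                                            db='dbname')

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._insert_item, item)
        d.addErrback(self._handle_error)
        # Hand the item on to the next pipeline once the insert is done.
        d.addCallback(lambda _: item)
        return d

    def _insert_item(self, tx, item):
        # 'items', 'url' and 'title' are hypothetical names.
        tx.execute("INSERT INTO items (url, title) VALUES (%s, %s)",
                   (item['url'], item['title']))

    def _handle_error(self, failure):
        # Log the failure; don't re-raise, so the crawl keeps running.
        print(failure)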
