Connecting to Databases in Scrapy
The Scrapy documentation describes how to work with a database in its MongoDB pipeline example.
If I write another pipeline that also needs database access, I will have to create another connection. If I write a downloader middleware (to record requests in the database), then yet another. Database connections are fairly expensive, so this seems wasteful. SQLAlchemy, for example, uses a connection pool for exactly this reason.
My question(s): Is there a better way to establish a connection once and reuse it across extensions, middlewares and pipelines? Are there any issues with the async nature of Scrapy and the blocking standard DBAPI2 (i.e., would it be better, or pointless, to look at twisted.enterprise.adbapi)?
I sketched an extension along the following lines (assume it is wired up correctly via signals):
import MySQLdb

class DatabaseExtension:
    def __init__(self):
        self.db = MySQLdb.connect('...')

    def spider_opened(self, spider):
        spider.db = self.db

    def spider_closed(self, spider):
        spider.db.close()
Thanks in advance.
You can put the shared code in a module of its own, e.g. as a singleton, and call it from any of your item pipelines, downloader middlewares, spider middlewares or extensions.
As for twisted.enterprise.adbapi, it would definitely be the better choice if you are up to the task, so that your database calls don't block the crawl.