Scrapy run from a script will not export data

I am trying to run Scrapy from within a script, and I cannot get it to create an export file.

I tried to export the data in two ways:

  • With an item pipeline
  • With a feed export

Both of these methods work when I run scrapy from the command line, but don't work when I run scrapy from a script.

I am not the only one with this problem; here are two other unanswered questions, which I didn't notice until after I posted mine.

Here is my code for running Scrapy from within a script. It includes settings for writing the output file with both the item pipeline and the feed exporter.

from twisted.internet import reactor

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
import logging

from external_links.spiders.test import MySpider
from scrapy.utils.project import get_project_settings
settings = get_project_settings()

#manually set settings here
settings.set('ITEM_PIPELINES',
             {'external_links.pipelines.FilterPipeline': 100,
              'external_links.pipelines.CsvWriterPipeline': 200},
             priority='cmdline')
settings.set('DEPTH_LIMIT', 1, priority='cmdline')
settings.set('LOG_FILE', 'Log.log', priority='cmdline')
settings.set('FEED_URI', 'output.csv', priority='cmdline')
settings.set('FEED_FORMAT', 'csv', priority='cmdline')
settings.set('FEED_EXPORTERS',
             {'csv': 'external_links.exporter.CsvOptionRespectingItemExporter'},
             priority='cmdline')
settings.set('FEED_STORE_EMPTY', True, priority='cmdline')

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('reactor running...')
reactor.run()
log.msg('Reactor stopped...')


After running this code, the log says "Saved csv feed (341 items) in: output.csv" but output.csv does not exist.
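One thing that may be worth ruling out is a working-directory mismatch: when the crawl is driven from a script (for example from an IDE), the process may not run from the project root, so a relative FEED_URI like output.csv can be written somewhere other than where you are looking. A minimal check, using only the standard library:

import os

# Show where the script is actually running from and where a relative
# 'output.csv' would end up being written.
print(os.getcwd())
print(os.path.abspath('output.csv'))

# An absolute file URI removes the ambiguity entirely, e.g.:
# settings.set('FEED_URI', 'file:///tmp/output.csv', priority='cmdline')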

Here is my feed exporter code:

from scrapy.utils.project import get_project_settings
settings = get_project_settings()

#manually set settings here
settings.set('ITEM_PIPELINES',
             {'external_links.pipelines.FilterPipeline': 100,
              'external_links.pipelines.CsvWriterPipeline': 200},
             priority='cmdline')
settings.set('DEPTH_LIMIT', 1, priority='cmdline')
settings.set('LOG_FILE', 'Log.log', priority='cmdline')
settings.set('FEED_URI', 'output.csv', priority='cmdline')
settings.set('FEED_FORMAT', 'csv', priority='cmdline')
settings.set('FEED_EXPORTERS',
             {'csv': 'external_links.exporter.CsvOptionRespectingItemExporter'},
             priority='cmdline')
settings.set('FEED_STORE_EMPTY', True, priority='cmdline')


from scrapy.contrib.exporter import CsvItemExporter


class CsvOptionRespectingItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)


Here is my pipeline code:

import csv

class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))

    def process_item(self, item, spider):  # item needs to be second in this list, otherwise we get the spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item
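Separately from the feed export, one thing to watch in this pipeline is that items2.csv is never explicitly flushed or closed, so rows can sit in the buffer until the interpreter exits. A sketch of the same pipeline using Scrapy's open_spider/close_spider hooks (same fields and file name, just restructured):

import csv

class CsvWriterPipeline(object):
    """Writes selected item fields to items2.csv, one row per item."""

    def open_spider(self, spider):
        # Open the file when the spider starts so it can be closed cleanly later.
        self.csvfile = open('items2.csv', 'wb')
        self.csvwriter = csv.writer(self.csvfile)

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item

    def close_spider(self, spider):
        # Flush buffered rows to disk when the crawl finishes.
        self.csvfile.close()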




1 answer


I had the same problem.

Here's what works for me:

  • Put the feed export URI in settings.py:

    FEED_URI='file:///tmp/feeds/filename.jsonlines'

  • Create a scrape.py script next to your scrapy.cfg, with the following content:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())

    process.crawl('yourspidername')  # 'yourspidername' is the name of one of the spiders of the project
    process.start()  # the script will block here until the crawling is finished

  • Run: python scrape.py



Result: the file is created.

Note: I have no pipelines in my project, so I am not sure whether your pipelines will filter your results.

Also: there is a section in the docs on common problems that helped me.
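If you would rather not hard-code the feed URI in settings.py, a variant of the same CrawlerProcess approach can apply the feed settings programmatically before starting the crawl. This is only a sketch; it assumes the spider from the question is registered under the name 'test' (adjust to whatever your MySpider's name attribute actually is):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Override the feed settings in code instead of settings.py.
settings.set('FEED_URI', 'output.csv', priority='cmdline')
settings.set('FEED_FORMAT', 'csv', priority='cmdline')

process = CrawlerProcess(settings)
process.crawl('test')  # spider name as declared in the spider's `name` attribute (assumed here)
process.start()        # blocks until the crawl is finished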
