Starting a new job for each start URL using scrapyd

I have two separate spiders ...

  • Spider 1 will get a list of URLs from HTML pages

  • Spider 2 will use the URLs scraped by the previous spider as its start URLs and scrape those pages

Now what I am trying to do is schedule this so that, every hour or so, spider 2 runs for the whole list of URLs in parallel, all at the same time.

I deployed it to scrapyd and pass a start_urls value from a Python script to the deployed spider as an argument, like this:

import requests

# schedule one scrapyd job per start URL
for url in start_urls:
    r = requests.post("http://localhost:6800/schedule.json",
                      params={
                          'project': 'project',
                          'spider': 'spider',
                          'start_urls': url
                      })


Inside the spider, I read this start_urls argument from kwargs and assign it to the spider's start_urls.
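
It looks roughly like the sketch below (the class name and the parse body are placeholders, not the real project code):

import scrapy


class PageSpider(scrapy.Spider):
    # 'spider' matches the name passed to schedule.json above
    name = 'spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapyd forwards the schedule.json parameter as a keyword
        # argument, so each job receives a single URL string here
        self.start_urls = [kwargs.get('start_urls')]

    def parse(self, response):
        # placeholder: extract and yield items from the page
        pass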

But what I noticed is that when I pass multiple URLs to the same deployed spider in a for loop, the jobs never run in parallel:

only one job is running at any given time, while the other jobs sit in the pending state (not running).
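
To check this I query scrapyd's listjobs.json endpoint; something like the snippet below (reusing the same project name as above) shows one entry under "running" and the rest under "pending":

import requests

# listjobs.json reports pending, running and finished jobs for a project
resp = requests.get("http://localhost:6800/listjobs.json",
                    params={'project': 'project'})
jobs = resp.json()
print(len(jobs['running']), "running /", len(jobs['pending']), "pending")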

The scrapyd service settings are left at their defaults, apart from these two options:

max_proc    = 100
max_proc_per_cpu = 25


How can I achieve real parallelism with Python/Scrapy/scrapyd?

Or will I need to use Python's multiprocessing.Pool.apply_async or some other solution?
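
For reference, the multiprocessing variant I have in mind is roughly the sketch below (the spider name, placeholder URLs and pool size are assumptions; it relies on each worker process starting its own Twisted reactor via CrawlerProcess):

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider(url):
    # each worker process runs one crawl with its own reactor
    process = CrawlerProcess(get_project_settings())
    process.crawl('spider', start_urls=url)
    process.start()


if __name__ == '__main__':
    start_urls = ['http://example.com/a', 'http://example.com/b']  # placeholders
    pool = multiprocessing.Pool(processes=len(start_urls))
    for url in start_urls:
        pool.apply_async(run_spider, args=(url,))
    pool.close()
    pool.join()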
