Starting a new job for each start URL using scrapyd

I have two separate spiders ...

  • Spider 1 will get a list of URLs from HTML pages

  • Spider 2 will use the URLs scraped by the previous spider as its start URLs and scrape those pages

Now what I am trying to do is schedule this so that, every hour or so, spider 2 runs for the whole list of URLs in parallel, all at the same time.

I deployed it to scrapyd and pass a start_urls value from a Python script to the deployed spider as an argument, like this:

import requests

# schedule one scrapyd job per start URL
for url in start_urls:
    r = requests.post("http://localhost:6800/schedule.json",
                      params={
                          'project': 'project',
                          'spider': 'spider',
                          'start_urls': url
                      })


Inside the spider, I read this start_urls argument from kwargs and assign it to the spider's start_urls.
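
It looks roughly like the sketch below (the class name and the parse body are placeholders, not the real project code):

import scrapy


class PageSpider(scrapy.Spider):
    # 'spider' matches the name passed to schedule.json above
    name = 'spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapyd forwards the schedule.json parameter as a keyword
        # argument, so each job receives a single URL string here
        self.start_urls = [kwargs.get('start_urls')]

    def parse(self, response):
        # placeholder: extract and yield items from the page
        pass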

But what I noticed is that when I pass multiple URLs to the same deployed spider in a for loop, the jobs never run in parallel:

only one job is running at any given time, while the other jobs sit in the pending state (not running).
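
To check this I query scrapyd's listjobs.json endpoint; something like the snippet below (reusing the same project name as above) shows one entry under "running" and the rest under "pending":

import requests

# listjobs.json reports pending, running and finished jobs for a project
resp = requests.get("http://localhost:6800/listjobs.json",
                    params={'project': 'project'})
jobs = resp.json()
print(len(jobs['running']), "running /", len(jobs['pending']), "pending")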

The scrapyd service settings are left at their defaults, apart from these two options:

max_proc    = 100
max_proc_per_cpu = 25


How can I achieve real parallelism with Python/Scrapy/scrapyd?

Or will I need to use Python's multiprocessing.Pool.apply_async or some other solution?
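
For reference, the multiprocessing variant I have in mind is roughly the sketch below (the spider name, placeholder URLs and pool size are assumptions; it relies on each worker process starting its own Twisted reactor via CrawlerProcess):

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider(url):
    # each worker process runs one crawl with its own reactor
    process = CrawlerProcess(get_project_settings())
    process.crawl('spider', start_urls=url)
    process.start()


if __name__ == '__main__':
    start_urls = ['http://example.com/a', 'http://example.com/b']  # placeholders
    pool = multiprocessing.Pool(processes=len(start_urls))
    for url in start_urls:
        pool.apply_async(run_spider, args=(url,))
    pool.close()
    pool.join()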
