Starting a new job (launch) for each start URL using Scrapyd
I have two separate spiders ...
- Spider 1 collects a list of URLs from HTML pages
- Spider 2 uses the URLs scraped by the previous spider as its start URLs and parses those pages
What I am trying to do is schedule this so that every hour or so the second spider runs for the whole list of URLs in parallel, all at the same time.
I deployed it to Scrapyd and pass a start URL from a Python script as an argument to each scheduled job, like:
import requests

for url in start_urls:
    r = requests.post("http://localhost:6800/schedule.json",
                      params={
                          'project': 'project',
                          'spider': 'spider',
                          'start_urls': url
                      })
Inside the spider I read this start_urls argument from kwargs and assign it to self.start_urls.
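For reference, a minimal sketch of how the spider might read that argument (the spider name and callback are placeholders, not my actual code):

    import scrapy

    class PageSpider(scrapy.Spider):
        # placeholder name; the deployed spider uses its own
        name = 'spider'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # take the single URL passed via schedule.json and use it as the only start URL
            self.start_urls = [kwargs.get('start_urls')]

        def parse(self, response):
            # page-parsing logic goes here
            pass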
But what I notice is that when I pass multiple URLs to the same deployed spider in this for loop, the jobs never run in parallel:
only one job is running at any given time, and the other jobs sit in a pending state (not working).
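This is what Scrapyd's listjobs.json endpoint reports as well; a quick check (project name is a placeholder):

    import requests

    # ask Scrapyd for the job queue of this project
    r = requests.get("http://localhost:6800/listjobs.json",
                     params={'project': 'project'})
    status = r.json()
    print(len(status['running']), "running,", len(status['pending']), "pending")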
My Scrapyd service settings are the defaults, apart from these two settings:
max_proc = 100
max_proc_per_cpu = 25
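For context, those two options sit in the [scrapyd] section of my scrapyd.conf, roughly like this (only the relevant part shown):

    [scrapyd]
    max_proc         = 100
    max_proc_per_cpu = 25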
How can I achieve real parallelism with Python/Scrapy/Scrapyd here,
or do I need to fall back on Python's multiprocessing.Pool.apply_async or some other solution?