Loop issue in Scrapy + Selenium + PhantomJS

I was trying to create a small eBay scraper (college assignment). I've figured out a lot already, but I ran into a problem with my loop.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from selenium import webdriver
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        super(myProjectSpider, self).__init__()
        # Skip image downloads; PhantomJS renders much faster without them.
        service_args = ['--load-images=no']
        self.driver = webdriver.PhantomJS(
            executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd',
            service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        # Walk the variation <select>, one <option> index at a time.
        for abc in range(2, 50):
            abc = str(abc)
            # Existence check first: reading a missing node would return wrong text.
            jackson = self.driver.execute_script(
                "return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', "
                "document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson:
                # Create a fresh item on every iteration: reusing one mutable
                # item across yields is a classic source of duplicated rows.
                item = loopitems()
                item['title'] = self.driver.execute_script(
                    "return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', "
                    "document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break


URLs (start_urls are read from a txt file):

http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7


I am running Scrapy version 0.24.6 and PhantomJS version 2.0. The goal is to navigate to the URLs and extract the variations or attributes from the eBay variation dropdown. The if statement at the beginning of the loop checks whether the element exists, because Selenium returns an error (with an incorrect title) if it cannot find the element. I also yield inside the loop because I need each variation on its own line. I am using execute_script because it is about 100 times faster than Selenium's find_element_by_xpath.
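As an aside on the existence check: a minimal alternative sketch, assuming the same option XPath. Selenium's find_elements (plural) returns an empty list instead of raising when nothing matches, so the whole existing set of options can be iterated in one go with no per-index probe, though it will be slower than the execute_script route measured above:

def parse(self, response):
    self.driver.get(response.url)
    # find_elements returns [] when nothing matches -- no existence probe needed.
    options = self.driver.find_elements_by_xpath('//div[5]/div[2]/select/option')
    for option in options[1:]:  # options[0] is usually the "Select..." placeholder
        item = loopitems()
        item['title'] = option.text
        yield item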

The main problem I'm having is how Scrapy returns my items. If I use a single URL as my start_url, it works as it should (all items come back in neat order). But when I add more URLs, I get a completely different result: my elements are scrambled, some elements are returned multiple times, and the output changes almost every run. After countless tests I noticed that yield was somehow the cause; when I removed it and simply printed the results, they came out correctly. But I really need each element on a new line, and the only way I know to do that is with yield (maybe there is a better way?).
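If strict output order is not truly required, one way to make the interleaving harmless is to emit a single item per listing, so rows from different URLs cannot mix. A sketch, assuming two hypothetical fields ('url' and 'titles') added to loopitems, which the original item class may not define:

def parse(self, response):
    self.driver.get(response.url)
    options = self.driver.find_elements_by_xpath('//div[5]/div[2]/select/option')
    # One item per listing: all variation titles travel together, and the
    # source URL tags the row, so concurrent responses cannot mix rows up.
    item = loopitems()
    item['url'] = response.url                       # hypothetical field
    item['titles'] = [o.text for o in options[1:]]   # hypothetical field
    yield item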

At this point I have just duplicated the code, manually changing the XPath index each time. That works as expected, but I really need to be able to iterate over the elements in the future. If anyone sees a bug in my code or a better way to do this, please tell me. All answers are helpful...

Thanks

1 answer


If I understand correctly what you want to do, I think this might help you.

Crawling the URLs in order:

The problem is that start_urls are not processed in order. They are all turned into requests by the start_requests method, and the responses come back to the parse method asynchronously, in whatever order they finish.



Maybe this helps:

# Crawl the first URL now; queue the rest (reading the file only once).
urls = [l.strip() for l in open('bobo.txt').readlines()]
start_urls = urls[:1]
other_urls = urls[1:]
other_urls.reverse()

def parse(self, response):
    # ... extract and yield your items here ...

    # Request the next URL only after this response has been handled,
    # so the URLs are crawled strictly one at a time, in order.
    if len(self.other_urls) != 0:
        url = self.other_urls.pop()
        yield Request(url=url, callback=self.parse)