Wrapping a Selenium driver (and other blocking calls) with asyncio run_in_executor

I am experimenting with my first little scraper in Python and I want to use asyncio to fetch multiple sites at the same time. I already wrote a function that works with aiohttp, but since aiohttp.request() does not execute JavaScript, it is not ideal for scraping some dynamic web pages. That motivated me to try Selenium with PhantomJS as a headless browser.

There are some code snippets demonstrating the use of BaseEventLoop.run_in_executor - for example here - but the documentation is sparse and my copy-and-paste mojo is not strong enough.

If anyone would be kind enough to explain how to use asyncio to wrap blocking calls in general, or to explain what is going on in this particular case, I would appreciate it! Here is what I have knocked together so far:

@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, string, int) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        response = yield from future
        print(response)
        if response.status == 200:
            body = BeautifulSoup(self.driver.page_source)
            self.results.append((url, body))
        else:
            self.results.append((url, ''))
    except:
        self.results.append((url, ''))


The response comes back as "None" - why?



1 answer


This is not an asyncio or run_in_executor issue. The Selenium API simply cannot be used this way. First, driver.get does not return anything - see the Selenium docs. Second, it is not possible to get status codes with Selenium directly; see this question.
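
As a general illustration of the pattern the question asks about, any blocking callable can be handed to run_in_executor. Here is a minimal, self-contained sketch in the same generator-coroutine style, where time.sleep just stands in for the blocking call (driver.get, a requests call, file I/O, and so on):

import asyncio
import time


def blocking_work(seconds):
    # An ordinary blocking callable; time.sleep stands in for the real work.
    time.sleep(seconds)
    return 'done after %s second(s)' % seconds


@asyncio.coroutine
def wrapped(loop, seconds):
    # run_in_executor submits the callable to a thread pool (None means the
    # loop's default executor) and returns a future this coroutine can wait
    # on without blocking the event loop.
    result = yield from loop.run_in_executor(None, blocking_work, seconds)
    return result


loop = asyncio.get_event_loop()
print(loop.run_until_complete(wrapped(loop, 1)))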

This code worked for me:



@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, str) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        # driver.get blocks until the page has loaded and returns None,
        # so there is no response object (and no status code) to inspect.
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        yield from future
        # Once the page is loaded, the rendered HTML is on the driver itself.
        body = BeautifulSoup(self.driver.page_source)
        self.results.append((url, body))
    except Exception:
        self.results.append((url, ''))
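
To actually fetch several sites at once, the coroutines still have to be scheduled on the event loop together. A rough usage sketch, assuming scraper is an instance of the class that owns this method along with self.driver (a PhantomJS webdriver) and self.results:

import asyncio

urls = ['http://example.com', 'http://example.org']
loop = asyncio.get_event_loop()
tasks = [scraper.fetch_page_pjs(url) for url in urls]
# gather schedules every fetch on the loop; note that one WebDriver instance
# handles one page at a time, so sharing self.driver across tasks will not
# give truly parallel page loads - use one driver per task for that.
loop.run_until_complete(asyncio.gather(*tasks))
print(scraper.results)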
