Python urllib does not return the HTML I see with Inspect Element

Question

I am trying to crawl the results from this link:

url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"

When I inspect the page with Firebug, I see the HTML and I know what I need to do to extract the tweets. The problem is that the response I get from urlopen does not contain the same HTML. I only get tags. What am I missing?

Sample code below:

from urllib import urlopen  # Python 2
from bs4 import BeautifulSoup

def get_tweets(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    tweets = soup.find("div", "results")
    # first link inside each tweet block
    category_links = [tweet.a["href"] for tweet in tweets.findAll("div", "result-tweet")]
    return category_links

url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"
cat_links = get_tweets(url)

Thanks, YB

python html web-scraping urllib

ybb 19 Sep '14 at 1:18

1 answer

alecxe · Accepted Answer · 2014-09-19T02:02:06+0000

The problem is that the content of the results div is filled in by an additional HTTP call and JavaScript executed on the browser side. urllib only "sees" the initial HTML page, which does not contain the required data.
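
To make this concrete, here is a minimal Python 3 sketch. The HTML string is a made-up stand-in for what the server returns at first (the real Topsy markup is not reproduced here, and the script name is hypothetical): the results container exists but is empty, and a script would fill it later. A parser working on that string, here the standard library's html.parser, finds nothing inside the div because no JavaScript ever runs.

```python
from html.parser import HTMLParser

# Hypothetical initial page source: the container is present but empty,
# and a script (never executed by urllib or a parser) would populate it.
INITIAL_HTML = """
<html><body>
  <div id="results"></div>
  <script src="/static/load-results.js"></script>
</body></html>
"""


class ResultsCounter(HTMLParser):
    """Counts the tags nested inside the div with id="results"."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while we are inside the results div
        self.children = 0    # tags seen inside the results div

    def handle_starttag(self, tag, attrs):
        if self.div_depth:
            self.children += 1
            if tag == "div":
                self.div_depth += 1
        elif tag == "div" and dict(attrs).get("id") == "results":
            self.div_depth = 1

    def handle_endtag(self, tag):
        if self.div_depth and tag == "div":
            self.div_depth -= 1


def count_result_children(html):
    """Return how many tags the results div contains in the given HTML."""
    parser = ResultsCounter()
    parser.feed(html)
    return parser.children


print(count_result_children(INITIAL_HTML))  # prints 0
```

The same function returns a non-zero count when it is fed the markup as the browser shows it after the script has run, which is the difference between the raw response and what Firebug displays.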

One option is to follow @Himal's suggestion and simulate the underlying request to trackbacks.js that the page sends to fetch the tweet data. The result is in JSON format, which you can load() with the json module from the standard library:

import json
import urllib2

url = 'http://otter.topsy.com/trackbacks.js?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&infonly=0&call_timestamp=1411090809443&apikey=09C43A9B270A470B8EB8F2946A9369F3'
data = json.load(urllib2.urlopen(url))
for tweet in data['response']['list']:
    print tweet['permalink_url']

Prints:

http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832

That was the "down to the metal" option.
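
For readers on Python 3, where urllib2 no longer exists, the same idea might look like the sketch below. It assumes the endpoint and the response / list / permalink_url layout shown above. The helper names build_trackbacks_url, extract_permalinks and fetch_permalinks are made up for illustration, the API key has to be supplied by the caller, and the call_timestamp parameter is left out on the unverified assumption that it is optional.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the request shown above; it may no longer respond.
TRACKBACKS_ENDPOINT = "http://otter.topsy.com/trackbacks.js"


def build_trackbacks_url(page_url, apikey):
    """Build the data URL; urlencode percent-escapes the page URL."""
    query = urlencode({"url": page_url, "infonly": 0, "apikey": apikey})
    return TRACKBACKS_ENDPOINT + "?" + query


def extract_permalinks(payload):
    """Pull the tweet links out of an already decoded JSON payload."""
    return [tweet["permalink_url"] for tweet in payload["response"]["list"]]


def fetch_permalinks(page_url, apikey):
    """Network call: download the JSON and return the tweet links."""
    with urlopen(build_trackbacks_url(page_url, apikey)) as response:
        return extract_permalinks(json.load(response))
```

Keeping the parsing step separate from the network call means it can be tried on a saved response without hitting the server.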

Otherwise, you can take a "high-level" approach and not worry about what is going on under the hood. Let a real browser load the page and interact with it via selenium WebDriver:

from selenium import webdriver

driver = webdriver.Chrome()  # can be Firefox(), PhantomJS() and more
driver.get("http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F")

for tweet in driver.find_elements_by_class_name('result-tweet'):
    print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')

driver.close()

Prints:

http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832

Here is how you can extend the second option to collect all tweets across the paginated results:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

BASE_URL = 'http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&offset={offset}'

driver = webdriver.Chrome()

# get tweets count
driver.get('http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F')
tweets_count = int(driver.find_element_by_xpath('//li[@data-name="all"]/a/span').text)

for x in xrange(0, tweets_count, 10):
    driver.get(BASE_URL.format(offset=x))

    # page header appears in case no more tweets found
    try:
        driver.find_element_by_xpath('//div[@class="page-header"]/h3')
    except NoSuchElementException:
        pass
    else:
        break

    # wait for results
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "results"))
    )

    # get tweets
    for tweet in driver.find_elements_by_class_name('result-tweet'):
        print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')

driver.close()
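
The paging arithmetic in the loop above can be checked on its own, without a browser. The sketch below is Python 3 and uses a hypothetical helper, page_urls. It assumes, as the loop does, that each page holds 10 tweets and that the offset query parameter selects the first tweet of a page.

```python
def page_urls(base_url, tweets_count, page_size=10):
    """Return one URL per results page, formatted with its offset.

    base_url must contain an "{offset}" placeholder, like BASE_URL above.
    """
    return [base_url.format(offset=offset)
            for offset in range(0, tweets_count, page_size)]


# Example with a placeholder address (not a real endpoint): 25 tweets
# give three pages, starting at offsets 0, 10 and 20.
for url in page_urls("http://example.com/trackback?offset={offset}", 25):
    print(url)
```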
