Scraping web data with Python

Question

I am trying to write some code to scrape data from the IMDb Top 250 webpage. Below is the code I wrote. It works and gives me the results I intended, but the problem is the number of results it returns. When I run it on my laptop, it returns 23 results, i.e. the first 23 movies as listed by IMDb. But when I run it from a friend's machine, it returns the correct 250 results. Why is this happening? What should be done to avoid it?

from bs4 import BeautifulSoup
import requests
import sys
from StringIO import StringIO

try:
    import cPickle as pickle
except ImportError:
    import pickle

url = 'http://www.imdb.com/chart/top'

response = requests.get(url)
soup = BeautifulSoup(response.text)

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.titleColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

print(len(movies))

for index in range(0, len(movies)):
    data = {"movie": movies[index].get_text(),
            "link": links[index],
            "starCast": crew[index],
            "rating": ratings[index],
            "vote": votes[index]}
    imdb.append(data)

print(imdb)


Test run result from my laptop:
['9.21', '9.176', '9.015', '8.935', '8.914', '8.903', '8.892', '8.889', '8.877', '8.817', '8.786', '8.76', '8.737', '8.733', '8.716', '8.703', '8.7', '8.69', '8.69', '8.678', '8.658', '8.629', '8.619']
23

python web-scraping beautifulsoup

Reutzesen Sep 16 14 at 12:06

1 answer

Matthew kelley · Answer 1 · 2015-11-11T06:01:14+0000

I realize this is a pretty old question, but I liked the idea and reworked the code so that the individual fields of each movie are easier to access. I fixed this up for my own use, but I'm sharing it in the hope that it helps someone else.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re

# Download IMDB Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

# Store each item into dictionary (data), then put those into a list (imdb)
for index, cell in enumerate(movies):
    # Separate movie into: 'place', 'title', 'year'
    # Instead of "2.       The Godfather        (1972)"
    movie = ' '.join(cell.get_text().split())
    # Parse rank, title and year from the text itself, so ranks with any
    # number of digits work and periods inside a title are preserved
    place, movie_title, year = re.match(
        r'(\d+)\.\s+(.*?)\s+\((.*?)\)$', movie).groups()
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

# Print out some info
for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])
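
The title-splitting step in the loop can also be pulled out into its own function, which makes it easy to check without requesting the IMDb page. This is only a minimal sketch, assuming each cell's text looks like the "2.       The Godfather        (1972)" example in the comment above. The names parse_movie_cell and _CELL_PATTERN are hypothetical and not part of the original code:

```python
import re

# Hypothetical helper, not part of the original answer.
# Assumed layout: rank, a dot, the title, then a four-digit year in parentheses.
_CELL_PATTERN = re.compile(r'^(\d+)\.\s+(.*?)\s+\((\d{4})\)$')


def parse_movie_cell(text):
    """Split '2.   The Godfather   (1972)' into place, title and year."""
    # Collapse newlines and runs of spaces into single spaces first.
    normalized = ' '.join(text.split())
    match = _CELL_PATTERN.match(normalized)
    if match is None:
        # The cell layout is an assumption; report an unexpected format
        # by returning None instead of raising.
        return None
    place, title, year = match.groups()
    return {"place": int(place), "movie_title": title, "year": year}


print(parse_movie_cell("2.       The Godfather        (1972)"))
```

Returning None for text that does not match lets the calling loop skip or log an odd row rather than crash partway through the list.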
