Getting image from url using BeautifulSoup
I am trying to extract the important images (not thumbnails or other GIFs) from a Wikipedia page using the following code. However, imgs comes back with length 0. Any suggestions on how to fix it?
Code:
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(html)
imgs = soup.findAll("div",{"class":"image"})
Also, if someone could explain in detail how to use findAll by looking at the original element in the web page, that would be great.
On that page the class "image" is on the "a" tags, not on "div" tags:
>>> img_links = soup.findAll("a", {"class":"image"})
>>> for img_link in img_links:
...     print img_link.img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
Or, better yet, use the CSS selector a.image > img:
>>> for img in soup.select('a.image > img'):
...     print img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
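To see how the findAll arguments map onto the element in the page source, here is a minimal sketch on a hand-written HTML fragment. The fragment, its URLs and the variable names are made up for illustration and only loosely modeled on the markup above. The first argument is the tag name and the dict lists attributes the tag must have; find_all is the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only: one linked image wrapped in
# <a class="image">, and one linked icon without that class.
html = """
<div class="thumb">
  <a href="/wiki/File:Example.jpg" class="image">
    <img src="//upload.example.org/thumb/Example.jpg/100px-Example.jpg" width="100">
  </a>
</div>
<a href="/wiki/Other"><img src="//upload.example.org/icon.gif"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name first, then a dict of attribute values the tag must carry.
links = soup.find_all("a", {"class": "image"})
srcs_from_find_all = [link.img["src"] for link in links]

# The same match written as a CSS selector: <img> directly inside <a class="image">.
srcs_from_select = [img["src"] for img in soup.select("a.image > img")]

# The only <div> here has class "thumb", so searching for div.image finds nothing.
divs = soup.find_all("div", {"class": "image"})

print(srcs_from_find_all)
print(srcs_from_select)
print(len(divs))
```

In practice, inspect the image in the browser's developer tools, note the tag that actually carries the class, and pass that tag name and class to find_all.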
UPD (downloading the images using urllib.urlretrieve):
from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2

url = "http://en.wikipedia.org/wiki/Main_Page"
soup = BeautifulSoup(urllib2.urlopen(url))

for img in soup.select('a.image > img'):
    # src is protocol-relative ("//upload..."), so resolve it against the page URL
    img_url = urlparse.urljoin(url, img['src'])
    # use the last path segment as the local file name
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)
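The urljoin step matters because the src values start with "//" and have no scheme, so they cannot be passed to urlretrieve directly. The sketch below shows only that resolution step, using one of the src values from the output above. The try/except import is an assumption added so the snippet runs on both Python 2 and Python 3; it makes no network request.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

page_url = "http://en.wikipedia.org/wiki/Main_Page"
src = "//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg"

# A protocol-relative URL inherits the scheme of the page it appeared on.
img_url = urljoin(page_url, src)

# The last path segment serves as the local file name.
file_name = src.split("/")[-1]

print(img_url)
print(file_name)
```

Note that the file name is taken as-is from the URL, so a name such as Christuss%C3%A4ule_8.jpg keeps its percent-encoding on disk.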