Using python to view a link and print data

Question

I'm writing a web scraper to collect Drake's lyrics. The scraper has to visit one page (the main MetroLyrics artist page), then follow each individual song link and print out the lyrics.

I am having trouble visiting the second link. I have searched for help with BeautifulSoup and am pretty confused. Can you help?

# this is intended to print all of the drake song lyrics on metrolyrics

from pyquery import PyQuery as pq
from lxml import etree
import requests
from bs4 import BeautifulSoup

# this visits the website
response = requests.get('http://www.metrolyrics.com/drake-lyrics.html')

# this separates the different types of content
doc = pq(response.content)

# this finds the titles in the content
titles = doc('.title')

# this visits each title, then prints each verse
for title in titles:
    # this visits each title
    response_title = requests.get(title)
    # this separates the content
    doc2 = pq(response_title.content)
    # this finds the song lyrics
    verse = doc2('.verse')
    # this prints the song lyrics
    print verse.text

In response_title = requests.get(title), Python doesn't recognize that title is a link, which makes sense. How can I get the actual link there? I appreciate your help.

python web-scraping beautifulsoup

Margot mazur 03 June 15 at 21:55

2 answers

Alex Paramonov · Answer 1 · 2015-06-03T22:00:45+0000

Replace

response_title = requests.get(title)

with

response_title = requests.get(title.attrib['href'])

Complete working script (including the fix noted in the comment below):

#!/usr/bin/python

from pyquery import PyQuery as pq
from lxml import etree
import requests
from bs4 import BeautifulSoup

# this visits the website
response = requests.get('http://www.metrolyrics.com/drake-lyrics.html')

# this separates the different types of content
doc = pq(response.content)

# this finds the titles in the content
titles = doc('.title')

# this visits each title, then prints each verse
for title in titles:
    # this visits each song link; each title is an lxml element,
    # so pass its href attribute rather than the element itself
    # was: response_title = requests.get(title)
    response_title = requests.get(title.attrib['href'])
    # this separates the content
    doc2 = pq(response_title.content)
    # this finds the song lyrics
    verse = doc2('.verse')
    # this prints the song lyrics; text() is a method and must be called
    print(verse.text())
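
Why the fix works: iterating over a PyQuery selection yields lxml elements, not strings, so requests.get needs the element's href attribute rather than the element itself. The same idea can be tried offline with only the standard library. This is a minimal sketch that swaps pyquery for html.parser; the TitleLinkParser class, the SAMPLE_HTML markup and the example.com URLs are all made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list includes 'title'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # An attribute without a value is reported as None, hence the "or".
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "title" in classes and href:
            self.links.append(href)


# Hypothetical markup standing in for the artist page; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td><a class="title" href="http://example.com/song-one.html">Song One</a></td></tr>
  <tr><td><a class="title hot" href="http://example.com/song-two.html">Song Two</a></td></tr>
  <tr><td><a class="other" href="http://example.com/about.html">About</a></td></tr>
  <tr><td><a class="title">No link here</a></td></tr>
</table>
"""

parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Only the anchors that carry both the title class and an href end up in parser.links, which is the list of strings you would then hand to requests.get one at a time.
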

Padraic cunningham · Answer 2 · 2015-06-04T00:02:06+0000

If you want to get all of the text using BeautifulSoup:

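import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.metrolyrics.com/drake-lyrics.html')
# lazily yield the href of every <a class="title"> on the artist page
urls = (a["href"] for a in BeautifulSoup(r.content, "html.parser").find_all("a", "title", href=True))
# lazily fetch each song page and collect its <p class="verse"> elements
verses = (BeautifulSoup(requests.get(url).content, "html.parser").find_all("p", "verse") for url in urls)

for verse in verses:
    print([v.text for v in verse])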
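
Both answers pass each href straight to requests.get, which is fine as long as the page uses absolute URLs. If a page used relative links instead (an assumption here, not something the question confirms), each href would first have to be resolved against the page URL. A small sketch with urllib.parse.urljoin; the absolute_links helper and the song paths are hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.metrolyrics.com/drake-lyrics.html"


def absolute_links(page_url, hrefs):
    """Resolve each href against page_url; absolute hrefs pass through unchanged."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical hrefs: root-relative, document-relative and already absolute.
hrefs = [
    "/some-song-lyrics-drake.html",
    "other-song-lyrics-drake.html",
    "http://www.metrolyrics.com/third-song-lyrics-drake.html",
]

links = absolute_links(PAGE_URL, hrefs)
for link in links:
    print(link)
```

Since urljoin leaves absolute URLs untouched, running the hrefs through it before the request is harmless when the links are already complete.
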
