How to clear deeply nested links using python beautifulSoup
I am trying to build a spider / website crawler for academic purposes to grab text from academic publications and add related links to a URL stack. I am trying to crawl 1 website called "PubMed". I can't seem to grab the links I want. Here is my code with an example page, this page should be the representative of others in their database:
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
from bs4 import BeautifulSoup
import requests
r = requests.get(website)
soup = BeautifulSoup(r.content)
I have split the html tree into multiple read-only variables so that it can fit 1 screen width.
key_text = soup.find('div', {'class':'grid'}).find('div',{'class':'col twelve_col nomargin shadow'}).find('form',{'id':'EntrezForm'})
side_column = key_text.find('div', {'xmlns:xi':'http://www.w3.org/2001/XInclude'}).find('div', {'class':'supplemental col three_col last'})
side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]
for link in side_links:
print link
if you look at the html source using the chrome checkout element there should be several other nested divs with links in 'side_links'. However, the above code throws the following error:
Traceback (most recent call last):
File "C:/Users/ballbag/Copy/web_scraping/google_search.py", line 22, in <module>
side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]
IndexError: list index out of range
if you go to the url there is a column on the right called "linked links" containing the urls I want to clean up. But I can't get to them. There is an expression that says I am trying to enter a div and I suspect it has something to do with it. Can anyone help grab these links? I would really appreciate any pointers
source to share
The problem is the sidebar is being loaded with an additional asynchronous request.
The idea here would be as follows:
- maintain a web cleanup session using
requests.Session
- parse the url that is used to get the sidebar.
- follow this link and get links from
div
cclass="portlet_content"
code:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.ncbi.nlm.nih.gov'
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content)
url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])
# parsing the side bar
soup = BeautifulSoup(session.get(url).content)
for a in soup.select('div.portlet_content ul li.brieflinkpopper a'):
print a.text, urljoin(base_url, a.get('href'))
Printing
The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
...
Cited in PMC http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23265702/citedby/?tool=pubmed
source to share