Scraping page content from divs with BeautifulSoup

I am trying to scrape the title, summary, date and link from http://www.indiainfoline.com/top-news for each div with class 'row'.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
productDivs = soup.findAll('div', attrs={'class': 'row'})
for div in productDivs:
    result = {}
    try:
        heading = div.find('p', attrs={'class': 'heading fs20e robo_slab mb10'})
        title = heading.get_text()
        article_link = "http://www.indiainfoline.com" + heading.find('a')['href']
        summary = div.find('p')
    except AttributeError:
        continue

But none of the components get scraped. Any suggestions on how to fix this?



2 answers


There are many elements with class=row in the page's HTML source, so you need to filter down to the section that actually contains the row data. In your case, all 16 expected rows live inside the element with id="search-list", so select that section first and then the rows. Since .select returns a list, use [0] to retrieve the element. Once you have the row data, iterate over it and extract the heading, article URL, summary, etc.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
section = soup.select('#search-list')
rowdata = section[0].select('.row')

# Skip the first row (a header), then extract each article's fields
for row in rowdata[1:]:
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    article_link = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[0].text
    print(heading)
    print(article_link)
    print(summary)



Output:

PFC board to consider bonus issue; stock surges by 4%     
http://www.indiainfoline.com/article/news-top-story/pfc-pfc-board-to-consider-bonus-issue-stock-surges-by-4-117080300814_1.html
PFC board to consider bonus issue; stock surges by 4%
...
...

      



Try this:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3

link = 'http://www.indiainfoline.com/top-news'
soup = BeautifulSoup(urlopen(link), "lxml")
fixed_html = soup.prettify()

# Grab the first <ul class="row"> and print its first <li>
ul = soup.find('ul', attrs={'class': 'row'})
print(ul.find('li'))

      

You'll get



<li class="animated" onclick="location.href='/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html';">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 col-xs-12 ">
<p class="heading fs20e robo_slab mb10"><a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a></p>
<p><!--style="color: green !important"-->
<img class="img-responsive visible-xs mob-img" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
                                            Pharma major, Lupin announced on Thursday that the company has received the United States Food and Drug Administra...
                                                                        </p>
<p class="source fs12e">India Infoline News Service |                                           
                                            Mumbai                          15:42 IST |                                          August 03, 2017                 </p>
</div>
<div class="col-lg-3 col-md-3 col-sm-3 hidden-xs pl0 listing-image">
<img class="img-responsive" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
</div>
</div>
</li>

      

You can of course print fixed_html to see the full prettified page source.
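To pull the individual fields out of a <li> like the one printed above, you can parse it the same way. A small sketch using a condensed copy of that snippet (html.parser here, to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Condensed version of the <li> shown above
li_html = '''
<li class="animated">
<div class="row">
<p class="heading fs20e robo_slab mb10"><a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a></p>
<p class="source fs12e">India Infoline News Service | Mumbai 15:42 IST | August 03, 2017</p>
</div>
</li>
'''

li = BeautifulSoup(li_html, "html.parser")
heading = li.select_one('p.heading a')          # the anchor inside the heading <p>
title = heading.get_text(strip=True)
link = 'http://www.indiainfoline.com' + heading['href']
date = li.select_one('p.source').get_text(strip=True)

print(title)  # Lupin gets USFDA nod to market Rosuvastatin Calcium
print(link)
print(date)
```

The same select_one calls work unchanged when applied to each <li> of the live page, since the class names come straight from the markup above.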
