How to extract tags from HTML using BeautifulSoup in Python

I am trying to parse an HTML page that, simplified, looks like this:

<div class="anotherclass part">
  <a href="http://example.com" >
    <div class="column abc"><strike>&#163;3.99</strike><br>&#163;3.59</div>
    <div class="column def"></div>
    <div class="column ghi">1 Feb 2013</div>
    <div class="column jkl">
      <h4>A title</h4>
      <p>
        <img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
    </div>
  </a>
</div>

      

I am new to Python and have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

I have this code:

import re

from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
  html = fp.read()

soup = BeautifulSoup(html)

# The "part" class sits on the outer <div>; flags are passed to re.compile itself
parts = soup.findAll('div', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
  mypart={}

  # ghi
  mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')} ).string
  # def
  mypart['def'] = part.find(attrs={"class": re.compile('def')} ).string
  # h4
  mypart['title'] = part.find('h4').string

  # jkl
  mypart['other'] = part.find('p').string

  # abc
  pattern = re.compile( r'\&\#163\;(\d{1,}\.?\d{2}?)' )
  theprices = re.findall( pattern, str(part) )
  if len(theprices) == 2:
    mypart['price'] = theprices[1]
    mypart['rrp'] = theprices[0]
  elif len(theprices) == 1:
    mypart['price'] = theprices[0]
    mypart['rrp'] = theprices[0]
  else:
    mypart['price'] = None
    mypart['rrp'] = None

      

I want to extract any text from the classes def and ghi, which I think my script is doing correctly.

I also want to extract the two prices from abc, which my script currently does in a rather clunky way. Sometimes this part has two prices, sometimes one, and sometimes none.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from the class jkl, which my script fails to do. I thought getting the string of the p tag would work, but I can't figure out why it doesn't. The date in this part always matches the date in the class ghi, so it should be easy to replace or remove.

Any advice? Thank you!


1 answer


First, if you add convertEntities=bs.BeautifulSoup.HTML_ENTITIES to the constructor call, as in

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

      

then HTML entities such as &#163; are converted to the corresponding Unicode character, e.g. £. This lets you use a simpler regular expression to find the prices.
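
To see why decoding the entities first simplifies the pattern, here is a minimal sketch. It uses html.unescape from the Python 3 standard library as a stand-in for the convertEntities option (that substitution is my assumption; the answer itself relies on BeautifulSoup 3), and the sample string is taken from the question's HTML.

```python
import re
from html import unescape  # Python 3 stdlib; stands in for convertEntities here

raw = "<strike>&#163;3.99</strike><br>&#163;3.59"

# Without decoding, the pattern has to spell out the entity itself.
escaped_prices = re.findall(r"&#163;(\d+\.\d{2})", raw)

# After decoding, the entity is a plain pound sign (U+00A3).
decoded = unescape(raw)
decoded_prices = re.findall(r"\u00a3(\d+\.\d{2})", decoded)

print(escaped_prices)  # ['3.99', '3.59']
print(decoded_prices)  # ['3.99', '3.59']
```

Both patterns find the same prices; the difference is only that the second one no longer needs to know how the page encoded the currency symbol.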


Now, given part, you can find the text content of the <div> holding the prices using its contents attribute:

In [37]: part.find(attrs={"class": re.compile('abc')}).contents
Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']

      

All we need to do is extract the number from each element, or skip it if there is no number:

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

price = []
for item in part.find(attrs={"class": re.compile('abc')}).contents:
    item = parse_price(item.string)
    if item:
        price.append(item)
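
As a quick check of how parse_price behaves, the sketch below feeds it plain strings instead of parsed nodes. The sample inputs are hypothetical stand-ins for the three items in contents: the struck-out price, the <br /> (whose .string is None), and the bare price.

```python
import re

def parse_price(text):
    """Return the first decimal number in text as a float, or None."""
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        # TypeError: text is None; AttributeError: no match was found
        return None

# Hypothetical inputs mirroring the strings of the three items in contents
samples = ["\u00a33.99", None, "\u00a33.59"]
parsed = [parse_price(s) for s in samples]
print(parsed)  # [3.99, None, 3.59]

# Caveat: the pattern requires a decimal point, so a whole-number price is skipped
print(parse_price("\u00a34"))  # None
```

Note that the loop above tests `if item:`, so a price of 0.0 would also be dropped; if that matters for your data, `if item is not None:` is the stricter check.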

      

At this point, price will be a list of 0, 1, or 2 floats. We would like to say

mypart['rrp'], mypart['price'] = price

      

but that won't work if price is [] or contains only one element.
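
A small sketch of that failure mode, using a hypothetical helper name (try_unpack) purely for illustration: unpacking into two names raises ValueError unless the list has exactly two elements.

```python
def try_unpack(price):
    """Attempt the two-name unpacking; return None if the length is not 2."""
    try:
        rrp, current = price
    except ValueError:
        # Raised for [] and for one-element lists alike
        return None
    return rrp, current

print(try_unpack([3.99, 3.59]))  # (3.99, 3.59)
print(try_unpack([3.59]))        # None
print(try_unpack([]))            # None
```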

Your method of handling the three cases with if..else is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you want something a little terser, you can do the following:

Since we want to repeat the same price when price contains only one item, you might think of itertools.cycle.



In the case where price is an empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).

So, to combine both cases into one expression, we could use

price = itertools.cycle(price or [None])
mypart['rrp'], mypart['price'] = next(price), next(price)

      

The function next peels off the values in the iterator price one at a time. Since price cycles through its values, it never ends; it just keeps yielding values in sequence and then, if necessary, starts over again, which is exactly what we want.
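
To make the three cases concrete, here is a minimal sketch that wraps the same expression in a function. The name rrp_and_price is hypothetical, introduced only for this illustration.

```python
import itertools

def rrp_and_price(prices):
    """Return (rrp, price) from a list of 0, 1, or 2 parsed prices."""
    # An empty list is falsy, so `prices or [None]` falls back to [None]
    cycled = itertools.cycle(prices or [None])
    return next(cycled), next(cycled)

print(rrp_and_price([3.99, 3.59]))  # (3.99, 3.59)
print(rrp_and_price([3.59]))        # (3.59, 3.59)
print(rrp_and_price([]))            # (None, None)
```

With a single price, the second next call wraps around to the start, so both values are the same, matching the behaviour of the original if..else version.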


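part.find('p').string returns None here because the <p> tag has more than one child (the <img> tag and the text), and .string is only set when a tag has a single string child. The text "A, List, Of, Terms, To, Extract - 1 Feb 2013" can instead be retrieved using the contents attribute, as before: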

# jkl
mypart['other'] = [item for item in part.find('p').contents
                   if not isinstance(item, bs.Tag) and item.string.strip()]
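
Since the question asks for the terms without the date, and says the date always matches the one in ghi, one possible follow-up step is sketched below. The helper name split_terms and its arguments are hypothetical; it works on plain strings, so it does not depend on BeautifulSoup.

```python
def split_terms(text, date):
    """Drop a trailing ' - <date>' suffix, then split the rest on commas."""
    suffix = " - " + date
    text = text.strip()
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    # Strip each term and skip empty pieces left by stray commas
    return [term.strip() for term in text.split(",") if term.strip()]

terms = split_terms("A, List, Of, Terms, To, Extract - 1 Feb 2013", "1 Feb 2013")
print(terms)  # ['A', 'List', 'Of', 'Terms', 'To', 'Extract']
```

In the loop, the first argument would be the string found in mypart['other'] and the second the text of the ghi element, assuming both are present.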

      


So, the complete runnable code looks like this:

import BeautifulSoup as bs
import os
import re
import itertools as IT

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

filename = os.path.expanduser("~/tmp/file.html")
with open(filename) as fp:
    html = fp.read()

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
    mypart = {}
    # abc
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item:
            price.append(item)

    price = IT.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)

    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]

    print(mypart)

      

which yields

{'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}

      
