Finding / parsing a specific element of an HTML5 page in Python

I am trying to learn how to find and parse data from HTML5 web pages for use in a database. To start, I want to extract data from just the first element matching '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'.

I tried html5lib, lxml (from lxml import html) and XPath, but the lack of documentation for my specific use case is frustrating, and I can't find out how to achieve this.

The data I want to find and store:

The URL http://csgo.steamanalyst.com/id/120565/
from <span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/'

And the two numbers from "addToCart(1852864,1108)", as id1: '1852864' and id2: '1108',

in <button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'

      

The HTML code I am trying to learn from:

<!DOCTYPE html> 

<div class='row'><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852864'>StatTrak&#8482; Desert Eagle | Conspiracy (Factory New)</a><br /><small class='text-muted'>StatTrak&#8482; Classified Pistol</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>1,108</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>1,451</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&StatTrak=1&search_item=+Desert+Eagle+%7C+Conspiracy+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1841001'>★ Karambit | Doppler (Factory New)</a><br /><small class='text-muted'>★ Covert Knife</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>155,000</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/62403692/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>30,300</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=%E2%98%85+Karambit+%7C+Doppler+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852853'>AK-47 | Redline (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>441</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/1420/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>520</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=AK-47+%7C+Redline+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852846'>M4A1-S | Master Piece (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>6,618</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120409/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>8,905</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=M4A1-S+%7C+Master+Piece+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem4' onclick='addToCart(1852846,6618)'>Add to cart</button></center></div>
    </div>

      



2 answers


Use the html parser in the lxml library. In the working example below, your HTML is assigned to myhtml. There might be a more elegant way to parse text from a button attribute, but that's a start.



>>> from lxml import html
>>> tree = html.fromstring(myhtml)
>>> mybuttons = tree.xpath('//button[@class="btn btn-orange" and @onclick]')
>>> len(mybuttons)
4
>>> for button in mybuttons:
...     (id1, id2) = button.attrib['onclick'].replace('(', ' ').replace(',', ' ').replace(')', ' ').split()[1:]
...     print(id1, id2)
... 
1852864 1108
1841001 155000
1852853 441
1852846 6618
>>> myurl = tree.xpath('//span[@class="market-name"]/a')
>>> for u in myurl:
...     href = u.attrib['href']
...     print(href)
... 
http://csgo.steamanalyst.com/id/120565/
http://csgo.steamanalyst.com/id/62403692/
http://csgo.steamanalyst.com/id/1420/
http://csgo.steamanalyst.com/id/120409/
>>> 
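
If only the first item block is needed, as the question asks, the XPath can be anchored on the first matching div, and a regular expression can pull the two numbers out of onclick. What follows is a minimal sketch and not part of the answer above: the myhtml string here is a shortened, well-formed stand-in for the page, and first_item, ITEM_XPATH and ADD_TO_CART_RE are hypothetical names chosen for illustration.

```python
import re
from lxml import html

# Shortened stand-in for the question's page (assumption: same class names).
myhtml = """
<div class='row'>
  <div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'>
    <span class='market-name'><a href='http://csgo.steamanalyst.com/id/120565/'>Suggested Price: 1,451</a></span>
    <button class='btn btn-orange' onclick='addToCart(1852864,1108)'>Add to cart</button>
  </div>
  <div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'>
    <span class='market-name'><a href='http://csgo.steamanalyst.com/id/62403692/'>Suggested Price: 30,300</a></span>
    <button class='btn btn-orange' onclick='addToCart(1841001,155000)'>Add to cart</button>
  </div>
</div>
"""

# Captures the two numbers in e.g. "addToCart(1852864,1108)".
ADD_TO_CART_RE = re.compile(r'addToCart\((\d+),\s*(\d+)\)')
# "(...)[1]" selects the first matching div in document order.
ITEM_XPATH = '(//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"])[1]'


def first_item(page):
    """Return (url, id1, id2) for the first item block, or None if absent."""
    tree = html.fromstring(page)
    blocks = tree.xpath(ITEM_XPATH)
    if not blocks:
        return None
    block = blocks[0]
    # A relative XPath (leading ".") searches only inside this block.
    hrefs = block.xpath('.//span[@class="market-name"]/a/@href')
    onclicks = block.xpath('.//button[@onclick]/@onclick')
    if not hrefs or not onclicks:
        return None
    match = ADD_TO_CART_RE.search(onclicks[0])
    if match is None:
        return None
    return hrefs[0], match.group(1), match.group(2)


result = first_item(myhtml)
print(result)
```

The ids come back as strings; convert them with int() before storing them if the database column is numeric.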

      



I used a simpler library for a similar problem:

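import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Matches e.g. "addToCart(1852864,1108)" and captures both numbers.
ADD_TO_CART_RE = re.compile(r'addToCart\((\d+),(\d+)\)')

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.in_market = 0   # span nesting depth since the last market-name span
    self.markets = {}    # maps market href -> [id1, id2]
    self.market = None   # href of the item currently being parsed

  def handle_starttag(self, tag, attrs):
    attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
    if tag == 'span':
      if 'market-name' in (attrs.get('class') or ''):
        self.in_market = 1
        self.market = None  # a new item starts here
      elif self.in_market:
        self.in_market += 1
    elif self.in_market:
      # The sample HTML never closes the market-name span, so in_market
      # is still non-zero when the button of the same item arrives.
      if tag == 'a' and 'href' in attrs and self.market is None:
        # Keep only the first link; later ones (e.g. "Search") are ignored.
        self.market = attrs['href']
      elif tag == 'button' and 'onclick' in attrs:
        match = ADD_TO_CART_RE.match(attrs['onclick'])
        if match and self.market is not None:
          self.markets[self.market] = [match.group(1), match.group(2)]

  def handle_endtag(self, tag):
    if tag == 'span' and self.in_market:
      self.in_market -= 1

  def handle_data(self, data):
    pass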

      



Ask me questions if the code is not clear to you.
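
To show how a parser of this kind is driven, here is a self-contained sketch that uses only the standard library (html.parser in Python 3). It is not the code above: ItemParser and the sample string are illustrative assumptions. It pairs each link that follows a market-name span with the next addToCart button and keeps the results in page order, so the first item is simply items[0].

```python
import re
from html.parser import HTMLParser

# Captures the two numbers in e.g. "addToCart(1852864,1108)".
ADD_TO_CART_RE = re.compile(r'addToCart\((\d+),\s*(\d+)\)')


class ItemParser(HTMLParser):
    """Collects (href, id1, id2) tuples in page order."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.pending_href = None   # link found after the latest market-name span
        self.expect_link = False   # True right after a market-name span opens

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # html.parser passes a list of (name, value) pairs
        classes = (attrs.get('class') or '').split()
        if tag == 'span' and 'market-name' in classes:
            self.expect_link = True
        elif tag == 'a' and self.expect_link and attrs.get('href'):
            self.pending_href = attrs['href']
            self.expect_link = False
        elif tag == 'button' and self.pending_href is not None:
            match = ADD_TO_CART_RE.search(attrs.get('onclick') or '')
            if match:
                self.items.append(
                    (self.pending_href, match.group(1), match.group(2)))
                self.pending_href = None


# Hypothetical, shortened input with the same class names as the question.
sample = (
    "<span class='market-name'>"
    "<a href='http://csgo.steamanalyst.com/id/120565/'>Suggested Price</a></span>"
    "<a href='/?loc=shop_search' class='btn btn-primary'>Search</a>"
    "<button id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button>"
    "<span class='market-name'>"
    "<a href='http://csgo.steamanalyst.com/id/1420/'>Suggested Price</a></span>"
    "<button id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button>"
)

parser = ItemParser()
parser.feed(sample)   # feed() may be called repeatedly with chunks of the page
parser.close()
first = parser.items[0]
print(first)
```

Because the parser works on a stream of tags rather than a tree, it does not depend on the page being well-formed, which suits the unclosed tags in the question's HTML.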
