Finding / parsing a specific element in an HTML5 page with Python
I am trying to learn how to find and parse data from HTML5 web pages for use in a database. For now I only want to extract data from the first element matching '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'.
I tried html5lib and lxml (from lxml import html, with XPath), but I could not find documentation covering my specific use case, so I can't work out how to achieve this.
The data I want to find and store:
http://csgo.steamanalyst.com/id/120565/
from <span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/'
and the two numbers from "addToCart(1852864,1108)", as id1: '1852864' and id2: '1108',
in <button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'
The HTML I am trying to learn from:
<!DOCTYPE html>
<div class='row'><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852864'>StatTrak™ Desert Eagle | Conspiracy (Factory New)</a><br /><small class='text-muted'>StatTrak™ Classified Pistol</small><img style='margin-top:-25px;' src='256fx256f' />
<div class='item-add'>
<div class='item-amount'><span class='icon-logo'></span>1,108</div>
<div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>1,451</a></div>
<div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&StatTrak=1&search_item=+Desert+Eagle+%7C+Conspiracy+%28Factory+New%29' class='btn btn-primary'>Search</a>
<br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button></center></div>
</div>
</div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1841001'>★ Karambit | Doppler (Factory New)</a><br /><small class='text-muted'>★ Covert Knife</small><img style='margin-top:-25px;' src='256fx256f' />
<div class='item-add'>
<div class='item-amount'><span class='icon-logo'></span>155,000</div>
<div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/62403692/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>30,300</a></div>
<div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=%E2%98%85+Karambit+%7C+Doppler+%28Factory+New%29' class='btn btn-primary'>Search</a>
<br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button></center></div>
</div>
</div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852853'>AK-47 | Redline (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
<div class='item-add'>
<div class='item-amount'><span class='icon-logo'></span>441</div>
<div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/1420/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>520</a></div>
<div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=AK-47+%7C+Redline+%28Field-Tested%29' class='btn btn-primary'>Search</a>
<br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button></center></div>
</div>
</div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852846'>M4A1-S | Master Piece (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
<div class='item-add'>
<div class='item-amount'><span class='icon-logo'></span>6,618</div>
<div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120409/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>8,905</a></div>
<div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=M4A1-S+%7C+Master+Piece+%28Field-Tested%29' class='btn btn-primary'>Search</a>
<br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem4' onclick='addToCart(1852846,6618)'>Add to cart</button></center></div>
</div>
Use the html parser from the lxml library. In the working example below, your HTML is assigned to the variable myhtml. There may be a more elegant way to pull the numbers out of the button's onclick attribute, but this is a start.
>>> from lxml import html
>>> tree = html.fromstring(myhtml)
>>> mybuttons = tree.xpath('//button[@class="btn btn-orange" and @onclick]')
>>> len(mybuttons)
4
>>> for button in mybuttons:
...     # "addToCart(1852864,1108)" -> ["addToCart", "1852864", "1108"]
...     (id1, id2) = button.attrib['onclick'].replace('(', ' ').replace(',', ' ').replace(')', ' ').split()[1:]
...     print(id1, id2)
...
1852864 1108
1841001 155000
1852853 441
1852846 6618
>>> myurl = tree.xpath('//span[@class="market-name"]/a')
>>> for u in myurl:
...     href = u.attrib['href']
...     print(href)
...
http://csgo.steamanalyst.com/id/120565/
http://csgo.steamanalyst.com/id/62403692/
http://csgo.steamanalyst.com/id/1420/
http://csgo.steamanalyst.com/id/120409/
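The chain of replace() calls above could be swapped for a regular expression. This is only a minimal sketch, assuming the onclick value always has the form addToCart(<digits>,<digits>); the helper name parse_add_to_cart is made up for illustration and is not part of lxml.

```python
import re

# Assumed format: addToCart(<digits>,<digits>), optional spaces tolerated.
ADD_TO_CART_RE = re.compile(r'addToCart\(\s*(\d+)\s*,\s*(\d+)\s*\)')


def parse_add_to_cart(onclick):
    """Return (id1, id2) as strings, or None if the text does not match."""
    match = ADD_TO_CART_RE.search(onclick)
    return match.groups() if match else None


print(parse_add_to_cart('addToCart(1852864,1108)'))  # ('1852864', '1108')
```

Inside the loop it would be called as parse_add_to_cart(button.attrib['onclick']). Returning None on a mismatch lets the caller skip buttons whose onclick does something else instead of failing on tuple unpacking.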
I used a simpler library for a similar problem:
import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

ADD_TO_CART_RE = re.compile(r'addToCart\((\d+),(\d+)\)')

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_market = False  # True right after <span class="market-name">
        self.markets = {}       # market href -> [id1, id2]
        self.market = None      # market href of the current item

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)     # attrs arrives as a list of (name, value) pairs
        if tag == 'span' and 'market-name' in (attrs.get('class') or ''):
            self.in_market = True
        elif tag == 'a' and self.in_market and 'href' in attrs:
            # Only the first link after the span is the market URL. The sample
            # HTML never closes that span, so end tags cannot be relied on.
            self.market = attrs['href']
            self.in_market = False
        elif tag == 'button' and self.market is not None:
            match = ADD_TO_CART_RE.match(attrs.get('onclick') or '')
            if match:
                self.markets[self.market] = [match.group(1), match.group(2)]
                self.market = None

# Usage: parser = MyParser(); parser.feed(myhtml); then read parser.markets
Ask if anything in the code is unclear.
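Since the question only asks for the first item, here is a rough sketch of a parser that ignores everything after the first complete match. It assumes Python 3 and the standard library only; the class name FirstItemParser and the shortened sample string are invented for illustration and merely imitate the shape of the markup in the question.

```python
import re
from html.parser import HTMLParser

ADD_TO_CART_RE = re.compile(r'addToCart\((\d+),(\d+)\)')


class FirstItemParser(HTMLParser):
    """Collect the market link and cart ids of the first item only."""

    def __init__(self):
        super().__init__()
        self.after_span = False  # just saw <span class="market-name">
        self.href = None         # link found right after that span
        self.ids = None          # (id1, id2) from the button's onclick

    def handle_starttag(self, tag, attrs):
        if self.ids is not None:
            return  # first item already captured; ignore the rest
        attrs = dict(attrs)  # list of (name, value) pairs -> dict
        if tag == 'span' and attrs.get('class') == 'market-name':
            self.after_span = True
        elif tag == 'a' and self.after_span:
            self.href = attrs.get('href')
            self.after_span = False
        elif tag == 'button' and self.href is not None:
            match = ADD_TO_CART_RE.match(attrs.get('onclick') or '')
            if match:
                self.ids = match.groups()


# Hypothetical, shortened sample: two items shaped like the question's HTML.
sample = """
<div class='col-xs-12'><span class='market-name'>
<a href='http://csgo.steamanalyst.com/id/120565/'>Suggested Price</a></span>
<button onclick='addToCart(1852864,1108)'>Add to cart</button></div>
<div class='col-xs-12'><span class='market-name'>
<a href='http://csgo.steamanalyst.com/id/62403692/'>Suggested Price</a></span>
<button onclick='addToCart(1841001,155000)'>Add to cart</button></div>
"""

parser = FirstItemParser()
parser.feed(sample)
print(parser.href, parser.ids)
```

The parser still reads the whole input; it simply stops recording once ids is set, which keeps the logic short at the cost of some wasted work on long pages.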