Reading some content from a web page in Python
I am trying to read some data from the web using a Python module.
I can fetch the page, but I am having difficulty parsing the data and extracting the information I need.
My code is below. Any help is appreciated.
#!/usr/bin/python2.7 -tt
import urllib
import urllib2

def Connect2Web():
    aResp = urllib2.urlopen("https://uniservices1.uobgroup.com/secure/online_rates/gold_and_silver_prices.jsp")
    web_pg = aResp.read()
    print web_pg

# Define a main() function that fetches and prints the page.
def main():
    Connect2Web()

# This is the standard boilerplate that calls the main() function.
if __name__ == '__main__':
    main()
When I print web_pg, I get the whole web page.
I want to extract some information from it (for example, find "SILVER PASSBOOK ACCOUNT"
and get its rate), but I am having difficulty parsing this HTML document.
You can use regular expressions to get the required data:
import urllib
import urllib2
import re

def Connect2Web():
    aResp = urllib2.urlopen("https://uniservices1.uobgroup.com/secure/online_rates/gold_and_silver_prices.jsp")
    web_pg = aResp.read()

    # The label cell followed by four value cells; (.*?) is non-greedy so
    # each group stops at the end of its own <td>.
    pattern = "<td><b>SILVER PASSBOOK ACCOUNT</b></td>" + "<td>(.*?)</td>" * 4
    m = re.search(pattern, web_pg)
    if m:
        print "SILVER PASSBOOK ACCOUNT:"
        print "\tCurrency:", m.group(1)
        print "\tUnit:", m.group(2)
        print "\tBank Sells:", m.group(3)
        print "\tBank Buys:", m.group(4)
    else:
        print "Nothing found"
Don't forget to re.compile the pattern if you are doing your matches in a loop.
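As a minimal sketch of that tip, the pattern can be compiled once and reused for every line. This is written for Python 3 and runs against a made-up HTML fragment: SAMPLE_HTML, the values in it, and the extract_rates helper are illustrative assumptions, not the bank's real markup.

```python
import re

# Hypothetical fragment shaped like the rows the pattern above expects.
SAMPLE_HTML = (
    "<tr><td><b>GOLD SAVINGS ACCOUNT</b></td>"
    "<td>SGD</td><td>1 GM</td><td>65.10</td><td>64.80</td></tr>\n"
    "<tr><td><b>SILVER PASSBOOK ACCOUNT</b></td>"
    "<td>SGD</td><td>1 OZ</td><td>40.25</td><td>38.75</td></tr>\n"
)

# Compile once: one bold label cell followed by four value cells.
ROW_RE = re.compile(r"<td><b>([^<]+)</b></td>" + r"<td>(.*?)</td>" * 4)


def extract_rates(html):
    """Return {account name: (currency, unit, bank sells, bank buys)}."""
    rates = {}
    for line in html.splitlines():
        m = ROW_RE.search(line)  # reuses the compiled pattern on each line
        if m:
            rates[m.group(1)] = m.groups()[1:]
    return rates


rates = extract_rates(SAMPLE_HTML)
for name, (currency, unit, sells, buys) in rates.items():
    print(name, currency, unit, sells, buys)
```

The same compiled object could be applied to several downloaded pages in turn without rebuilding the pattern each time.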
Using regular expressions to parse XML/HTML is not recommended, although it can sometimes work. It is better to use an HTML parser and the DOM API. Here's an example:
import html5lib
import urllib2

aResp = urllib2.urlopen("https://uniservices1.uobgroup.com/secure/online_rates/gold_and_silver_prices.jsp")
t = aResp.read()
# Parse the page into a DOM tree.
dom = html5lib.parse(t, treebuilder="dom")
# Collect every table row on the page.
trlist = dom.getElementsByTagName("tr")
# Print the text of one cell in the third row from the end.
print trlist[-3].childNodes[1].firstChild.childNodes[0].nodeValue
You can iterate over trlist to find the data you are interested in.
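One way to do that iteration, sketched in Python 3. To stay self-contained, it uses the standard library's xml.dom.minidom on a small well-formed fragment instead of html5lib. As far as I know, html5lib's "dom" treebuilder produces minidom objects, so the same calls should work on dom from the example above, but treat that as an assumption. The SAMPLE fragment, its values, and the cell_text and find_row helpers are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical, well-formed stand-in for the parsed page.
SAMPLE = """<table>
<tr><td><b>GOLD SAVINGS ACCOUNT</b></td><td>SGD</td><td>1 GM</td></tr>
<tr><td><b>SILVER PASSBOOK ACCOUNT</b></td><td>SGD</td><td>1 OZ</td></tr>
</table>"""


def cell_text(node):
    """Concatenate all text nodes below a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.nodeValue
    return "".join(cell_text(child) for child in node.childNodes)


def find_row(dom, label):
    """Return the cell texts of the first <tr> whose first cell equals label."""
    for tr in dom.getElementsByTagName("tr"):
        cells = [cell_text(td).strip() for td in tr.getElementsByTagName("td")]
        if cells and cells[0] == label:
            return cells
    return None


dom = minidom.parseString(SAMPLE)
row = find_row(dom, "SILVER PASSBOOK ACCOUNT")
print(row)
```

Matching on the label text rather than on a fixed index such as trlist[-3] should keep working if rows are added to or removed from the table.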
Added from comment: html5lib is a third-party module. See the html5lib site. The easy_install or pip program should be able to install it.
You can also try Grablib, and/or you can use XPath (with or without Grab). It may be useful to you later, so here are some examples:
from grab import Grab

# address is the URL of the page to fetch (defined elsewhere)
g = Grab()
g.go(address)
user_div = g.xpath('//*/div[@class="user_profile"]')  # main <div> to parse
country = user_div.find('*/*/a[@class="country-name"]')
region = user_div.find('*/*/a[@class="region"]')  # look for <a class="region">
city = user_div.find('*/*/a[@class="city"]')
friends = [i.text_content() for i in user_div.findall('dl[@class="friends_list"]/dd/ul/li/a[@rel="friend"]')]

# Another possibility: select a value by the text of its neighbouring tag.
# Given markup like:
# <dl> <dt>if only that tag contains this text</dt> <dd>Text to grab</dd> </dl>
val = user_div.xpath(u"dl/dt[contains(text(),'%s')]/../dd/text()" % 'if only that tag contains this text')
# print val[0] <- will contain 'Text to grab'
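If Grab is not installed, the attribute-based lookups above can be tried with the limited XPath subset in Python's standard xml.etree.ElementTree. This is a swapped-in stand-in, not Grab's API, and it does not support contains(). The PROFILE markup and every value in it are made up to mirror the selectors used above (Python 3).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that mirrors the selectors used above.
PROFILE = """<div class="user_profile">
  <p><span><a class="country-name">Singapore</a></span></p>
  <p><span><a class="region">Central</a></span></p>
  <dl class="friends_list"><dd><ul>
    <li><a rel="friend">alice</a></li>
    <li><a rel="friend">bob</a></li>
    <li><a rel="other">carol</a></li>
  </ul></dd></dl>
</div>"""

user_div = ET.fromstring(PROFILE)

# find() returns the first match, or None when nothing matches.
country = user_div.find('*/*/a[@class="country-name"]')
region = user_div.find('*/*/a[@class="region"]')
city = user_div.find('*/*/a[@class="city"]')  # absent in this sample

# findall() returns every match; only rel="friend" links are kept.
friends = [a.text for a in user_div.findall(
    'dl[@class="friends_list"]/dd/ul/li/a[@rel="friend"]')]

print(country.text, region.text, city, friends)
```

Checking the result of find() for None before reading .text avoids an AttributeError when a page lacks one of the expected elements.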
Good luck.