How to use lxml to get url

Question

I want to fetch a page from a URL and load it into lxml so that I can use XPath to extract the data I need.
Any guidance would be appreciated, thank you very much.

import requests
from lxml.html import parse

res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = parse(res.content)
name = doc.xpath("//meta[@itemprop='name']/@content")
print name

Running this code raises an error:

   doc = parse(res.content)
  File "/Users/ome/djangoenv/lib/python2.7/site-packages/lxml/html/__init__.py", line 786, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95015)
IOError

python html lxml

user2492364 Dec 24. 14 at 3:53

2 answers

Presumably res.content is a string containing the content of the page. parse takes a filename or a file-like object, so your code is passing the content of the page as if it were a filename. That is probably not what you want. To build a tree from a string, use fromstring rather than parse.
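As a rough illustration of that difference, here is a minimal sketch. It assumes a small hard-coded HTML document (the html_bytes value and the 'Example Shop' name are made up for the example) in place of res.content, so it runs without network access. It shows fromstring handling in-memory markup, and parse handling the same markup once it is wrapped in a file-like object.

```python
import io

import lxml.html

# Hypothetical stand-in for res.content, so the sketch needs no network access.
html_bytes = (
    b"<html><head><meta itemprop='name' content='Example Shop'></head>"
    b"<body></body></html>"
)

# fromstring() parses markup that is already in memory and returns the root element.
root = lxml.html.fromstring(html_bytes)
names_from_string = root.xpath("//meta[@itemprop='name']/@content")

# parse() expects a filename or a file-like object, so the bytes are wrapped in BytesIO.
tree = lxml.html.parse(io.BytesIO(html_bytes))
names_from_parse = tree.xpath("//meta[@itemprop='name']/@content")

print(names_from_string)
print(names_from_parse)
```

Both calls should give the same XPath result here. For a page that has already been downloaded, fromstring is usually the simpler choice because no wrapper object is needed.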

icktoofay Dec 24. 14 at 4:02

alecxe · Accepted Answer · 2014-12-24T04:07:11+0000

res.content is a string containing the HTML of the page.

You need to use lxml.html.fromstring():

import lxml.html
import requests

res = requests.get('http://www.ipeen.com.tw/comment/778246')

doc = lxml.html.fromstring(res.content)
name = doc.xpath(".//meta[@itemprop='name']/@content")
print name
