lxml "IOError: Error reading file" when parsing Facebook mobile pages in a Python scraper

I am using a modified version of the script from the "facebook post with python" post:

#!/usr/bin/python2 -u
# -*- coding: utf8 -*-

facebook_email = "YOUR_MAIL@DOMAIN.TLD"
facebook_passwd = "YOUR_PASSWORD"

import cookielib, urllib2, urllib, time, sys
from lxml import etree

jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)       
opener = urllib2.build_opener(cookie)

headers = {
    "User-Agent" : "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
    "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,text/png,*/*;q=0.5",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "utf-8",
    "Content-type": "application/x-www-form-urlencoded",
    "Host": "m.facebook.com"
}

try:
    params = urllib.urlencode({'email':facebook_email,'pass':facebook_passwd,'login':'Log+In'})
    req = urllib2.Request('http://m.facebook.com/login.php?m=m&refsrc=m.facebook.com%2F', params, headers)
    res = opener.open(req)
    html = res.read()

except urllib2.HTTPError, e:
    print e.msg
except urllib2.URLError, e:
    print e.reason[1]

def fetch(url):
    req = urllib2.Request(url,None,headers)
    res = opener.open(req)
    return res.read()

body = unicode(fetch("http://www.facebook.com/photo.php?fbid=404284859586659&set=a.355112834503862.104278.354259211255891&type=1"), errors='ignore')
tree = etree.parse(body)
r = tree.xpath('/see_prev')
print r.text


Problems arise when executing the code:

$ ./facebook_fetch_coms.py
Traceback (most recent call last):
  File "./facebook_fetch_coms_classic_test.py", line 42, in <module>
    tree = etree.parse(body)
  File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56230)
  File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)
  File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)
  File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)
  File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74691)
IOError: Error reading file '<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="description" content="Facebook helps you connect and share with the people in your life."


The goal is to first get the link with id=see_prev using lxml, then use a while loop to open all the comments, and finally get all the messages in a file. Any help would be much appreciated!

Edit: I am using Python 2.7.2 on Arch Linux x86_64 with lxml 2.3.3.


This is your problem:

tree = etree.parse(body)


The documentation says "source is a filename or file object containing XML data." You provided a string, so lxml takes the text of your HTTP response body as the name of the file you want to open. No such file exists, so you get an IOError.


The error message even says "Error reading file" and then shows your XML string as the name of the file it is trying to read, which is a strong hint about what is going on.
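The failure can be reproduced without any network access. The following is a minimal sketch, not the asker's script: the `body` value is a made-up stand-in for the downloaded page, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption added only so the snippet runs where lxml is not installed (its `parse()` also treats a plain string as a file name).

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree stands in for lxml here;
    # its parse() also interprets a plain string as a file name.
    import xml.etree.ElementTree as etree

# Hypothetical document text, standing in for the downloaded page.
body = '<html><head><title>Facebook</title></head></html>'

try:
    etree.parse(body)       # the string is interpreted as a path, not as markup
    failed = False
except IOError as exc:      # IOError is an alias of OSError on Python 3
    failed = True
    print("parse() tried to open a file: %s" % exc)
```

The exact wording of the message differs between lxml and the standard library, but in both cases the markup shows up as the file name that could not be opened.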

You probably want etree.XML(), which takes its input from a string. Or you can just do tree = etree.parse(res) to read directly from the HTTP response into lxml (the result of opener.open() is a file-like object, and etree.parse() should be perfectly happy to consume it).
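Both options can be sketched side by side. This is an illustration under stated assumptions rather than a drop-in fix: `SAMPLE` is invented markup, `io.BytesIO` plays the role of the file-like object returned by `opener.open()`, and the standard-library fallback import is there only so the snippet runs without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which exposes the same
    # XML() and parse() entry points used below.
    import xml.etree.ElementTree as etree

# Hypothetical markup standing in for the HTTP response body.
SAMPLE = b'<html><body><a id="see_prev" href="/more">See previous</a></body></html>'

# Option 1: parse markup that is already held in memory; returns the root element.
root = etree.XML(SAMPLE)

# Option 2: parse from a file-like object; returns an ElementTree.
# BytesIO stands in for the response object returned by opener.open().
tree = etree.parse(io.BytesIO(SAMPLE))

# Look the link up by its id attribute, whatever its tag name is.
link_from_string = root.find(".//*[@id='see_prev']")
link_from_file = tree.getroot().find(".//*[@id='see_prev']")

print("%s %s" % (link_from_string.get("href"), link_from_file.get("href")))
```

Note that etree.XML() returns the root element while etree.parse() returns a tree object, hence the getroot() call in the second case. Both are strict XML parsers, so they only succeed if the page is well-formed.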



