Parsing HTML from a local file
I am using Google App Engine with Python. I want to parse an HTML file from the same project as my Python script into a tree. I've tried several things, for example an absolute URL (http://localhost:8080/nl/home.html) and a relative URL (/nl/home.html). Neither works. I am using this code:
class HomePage(webapp2.RequestHandler):
    def get(self):
        path = self.request.path
        htmlfile = etree.parse(path)
        template = jinja_environment.get_template('/nl/template.html')
        pagetitle = htmlfile.find(".//title").text
        body = htmlfile.get_element_by_id("body").toString()
It returns the following error: IOError: Error reading file '/nl/home.html': Could not load external entity "/nl/home.html"
Does anyone know how to get an HTML file tree from the same project with Python?
EDIT
This is the working code:
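# Python 2: urllib.urlopen opens a local file when the string has no URL scheme.
import logging
import urllib

import webapp2
from lxml import html

class HomePage(webapp2.RequestHandler):
    def get(self):
        # Drop the leading "/" so the path is relative to the working directory.
        path = self.request.path.replace("/", "", 1)
        logging.info(path)
        htmlfile = html.fromstring(urllib.urlopen(path).read())
        template = jinja_environment.get_template('/nl/template.html')
        pagetitle = htmlfile.find(".//title").text
        body = innerHTML(htmlfile.get_element_by_id("body"))

def innerHTML(node):
    # Concatenate the serialized child elements of node.
    buildString = ''
    for child in node:
        buildString += html.tostring(child)
    return buildString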
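One detail of innerHTML worth noting: iterating over a node visits only its child elements, so any text that appears before the first child (the node's own .text) is left out. Here is a minimal sketch of a variant that keeps it. It assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra dependencies, which means it only accepts well-formed markup. The name inner_html and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def inner_html(node):
    """Serialize everything inside node: leading text plus all children."""
    # node.text is the text before the first child element (may be None).
    parts = [node.text or ""]
    for child in node:
        # tostring() also emits the child's tail, i.e. the text that follows it.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


# Hypothetical sample document, well-formed so ElementTree can parse it.
SAMPLE = (
    '<html><head><title>Home</title></head>'
    '<body><div id="body">Hi <b>there</b>!</div></body></html>'
)

doc = ET.fromstring(SAMPLE)
page_title = doc.find(".//title").text
body_node = doc.find(".//*[@id='body']")
body = inner_html(body_node)
```

With lxml the same idea should carry over by replacing ET.tostring with lxml.html.tostring, though that has not been checked here.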
I believe your error is in your file path. You are assuming that your application directory is the root directory on the server. That is not necessarily the case. Actually, I couldn't find any documentation on where the files end up, so this is what I'm doing (it works on the dev server; I haven't tested it beyond that yet):
I am assuming that Google preserves the relative locations of the files in my application. So if I know the location of one file, I can work out the location of the rest. Fortunately, Python lets you locate the current source file programmatically, like this:
import os

def get_src_dir():
    # Directory containing this source file, with symlinks resolved.
    return os.path.dirname(os.path.realpath(__file__))
get_src_dir() returns the directory that contains the source file.
os.path.join(get_src_dir(), rel_path_to_asset)
will now give you the path to your asset. rel_path_to_asset is the path to the asset relative to the source file the get_src_dir() function is in ...
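To connect this with the question, here is a rough sketch of turning a request path such as /nl/home.html into a filesystem path under the source directory. The helper asset_path and its base_dir parameter are hypothetical names added for illustration. The check that rejects paths outside the base directory is an extra precaution, not something the original code does.

```python
import os


def get_src_dir():
    # Directory containing this source file, as in the answer above.
    return os.path.dirname(os.path.realpath(__file__))


def asset_path(rel_path, base_dir=None):
    """Map a request-style path like '/nl/home.html' to a file on disk.

    base_dir is a hypothetical parameter; it defaults to get_src_dir().
    """
    if base_dir is None:
        base_dir = get_src_dir()
    base = os.path.realpath(base_dir)
    # os.path.join discards base when the second argument is absolute,
    # so strip the leading slash(es) first.
    full = os.path.realpath(os.path.join(base, rel_path.lstrip("/")))
    # Assumed safety check: refuse paths that climb out of base_dir.
    if full != base and not full.startswith(base + os.sep):
        raise ValueError("path escapes the application directory: %r" % rel_path)
    return full
```

In the handler, the result of asset_path(self.request.path) could then be opened with open() and its contents passed to html.fromstring, instead of relying on the current working directory.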