Screen scraper with PHP and XPath
Does anyone know how to maintain text formatting when using XPath to retrieve data?
I am currently extracting all blocks
<div class="info">
<h5>title</h5>
text <a href="somelink">anchor</a>
</div>
from the page. The problem is that when I access nodeValue, I only get plain text. How can I grab the contents including the formatting, i.e. with the h5 and a tags still in the markup?
Thanks in advance. I've searched every combination imaginable on Google with no luck.
If you have a DOMElement $element as part of a DOMDocument $dom, then you will want to do something like:
$string = $dom->saveXml($element);
The nodeValue of an element is really just its textual value, not the structured XML.
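To make the difference concrete, here is a minimal sketch of the approach above. The sample markup, the XPath expression and the variable names ($html, $blocks) are assumptions for illustration, not taken from the asker's actual page.

```php
<?php
// Sample markup standing in for the scraped page (assumed for illustration).
$html = '<html><body><div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$blocks = [];
foreach ($xpath->query('//div[@class="info"]') as $element) {
    // nodeValue would only give the concatenated text of the element.
    // Passing the node to saveXML() serializes it together with its tags.
    $blocks[] = $dom->saveXML($element);
}

echo $blocks[0], "\n";
```

If the page is HTML rather than well-formed XML, $dom->saveHTML($element) may give output closer to the original markup.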
I would like to add to Ciaran McNulty's answer.
You can do the same in SimpleXML like this:
$simplexml->node->asXml(); // saveXml() is now an alias
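As a rough sketch of the SimpleXML variant combined with an XPath query: the wrapping root element and the variable names here are assumptions made so the snippet is self-contained.

```php
<?php
// Assumed sample document; SimpleXML needs well-formed XML.
$xml = '<root><div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div></root>';

$simplexml = simplexml_load_string($xml);

// xpath() returns an array of SimpleXMLElement objects.
$matches = $simplexml->xpath('//div[@class="info"]');

$markup = [];
foreach ($matches as $div) {
    // asXML() without arguments returns the element as a string, tags included.
    $markup[] = $div->asXML();
}

echo $markup[0], "\n";
```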
And to expand on the quote:
The NodeValue of an element is actually a text value, not structured XML.
You can imagine your node like this:
<div class="info">
<__toString()> </__toString()>
<h5>title</h5>
<__toString()> text </__toString()>
<a href="somelink">anchor</a>
<__toString()> </__toString()>
</div>
Here a call to $element->nodeValue is like calling $element->__toString(), which only picks up the __toString() pieces. The imaginary __toString() element I made up is officially defined as an XML_TEXT_NODE.
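The text-node picture can be checked with a short sketch like the one below. The inline markup and the $kinds array are assumptions for illustration; note that for a DOMElement, nodeValue concatenates the text of all descendant text nodes, including those inside child elements.

```php
<?php
$dom = new DOMDocument();
// Whitespace between the tags is kept, so it shows up as text nodes too.
$dom->loadXML('<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>');

$kinds = [];
foreach ($dom->documentElement->childNodes as $child) {
    // Text children have nodeType XML_TEXT_NODE; elements have XML_ELEMENT_NODE.
    $kinds[] = $child->nodeType === XML_TEXT_NODE ? 'text' : $child->nodeName;
}

echo implode(', ', $kinds), "\n";

// The tags are gone here; only the text content remains.
$plain = $dom->documentElement->nodeValue;
```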
XPath is intended to be used embedded in a host language (for example the DOM API, XSLT, XQuery, ...) and cannot be used standalone. The original question does not indicate what the desired embedding is.
Below is a very simple and short solution for the case where XPath is embedded in XSLT.
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="div[@class='info']">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
when applied to this XML document:
<html>
<body>
<div class="info">
<h1>title1</h1> text1
<a href="somelink1">anchor1</a>
</div>
Something else here
<div class="info">
<h2>title2</h2> text2
<a href="somelink2">anchor2</a>
</div>
Something else here
<div class="info">
<h3>title3</h3> text3
<a href="somelink3">anchor3</a>
</div>
</body>
</html>
produces the desired output:
<div class="info">
<h1>title1</h1> text1
<a href="somelink1">anchor1</a>
</div>
Something else here
<div class="info">
<h2>title2</h2> text2
<a href="somelink2">anchor2</a>
</div>
Something else here
<div class="info">
<h3>title3</h3> text3
<a href="somelink3">anchor3</a>
</div>
You need to make sure your XPath query ends at the <div class="info"> element. However, because of the way XPath works, you will still get all the "sub-tags" as separate nodes. You just need to combine them.
You may also be able to do the joining in XPath itself; however, since I haven't used it that way, I can't tell what problems you might run into.
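One way to do the combining in PHP is sketched below. The helper name innerXml() is hypothetical, and the sample markup is an assumption; the idea is simply to serialize each child node and concatenate the results.

```php
<?php
// Hypothetical helper: returns the markup inside an element, without the element's own tags.
function innerXml(DOMNode $element): string
{
    $out = '';
    foreach ($element->childNodes as $child) {
        // saveXML() on a child serializes text nodes and element nodes alike.
        $out .= $element->ownerDocument->saveXML($child);
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadXML('<div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div>');

$inner = innerXml($dom->documentElement);

echo $inner, "\n";
```

If the outer div itself is wanted as well, serializing the element directly with saveXML() avoids the loop.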
div/node()
should do the trick.
Input example:
<div class="info">
some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>
Sample XSLT stylesheet:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<newtag>
<xsl:copy-of select="div/node()"/>
</newtag>
</xsl:template>
</xsl:stylesheet>
Output example:
<?xml version="1.0" encoding="utf-8"?>
<newtag> some<h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>