Screen scraper with PHP and XPath

Does anyone know how to maintain text formatting when using XPath to retrieve data?

I am currently extracting all blocks

<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>

from the page. The problem is that when I access nodeValue, I only get plain text. How can I grab the content including its markup, i.e. with the h5 and a tags still in place?

Thanks in advance. I've searched every combination imaginable on Google and no luck.

+1




5 answers


If you have a DOMElement $element that is part of the DOMDocument $dom, then you will want to do something like:

$string = $dom->saveXml($element);

      



The NodeValue of an element is actually a text value, not structured XML.
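For the question's case, the difference between the two calls can be sketched as follows. This is a minimal example rather than a full scraper: the inline $html string stands in for the fetched page, and the array keys 'text' and 'markup' are just illustrative names.

```php
<?php
// Hypothetical input; a real scraper would load the downloaded page instead.
$html = '<html><body><div class="info"><h5>title</h5> text '
      . '<a href="somelink">anchor</a></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$blocks = [];
foreach ($xpath->query('//div[@class="info"]') as $element) {
    $blocks[] = [
        // nodeValue flattens the element to its text content.
        'text'   => $element->nodeValue,
        // saveXML() with a node argument serializes that node, tags included.
        'markup' => $dom->saveXML($element),
    ];
}

print_r($blocks);
```

If the page is HTML rather than XHTML, $dom->saveHTML($element) may be the better fit, since it serializes the node with HTML rules instead of XML rules.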

+2




I would like to add to Ciaran McNulty's answer.

You can do the same in SimpleXML:

$simplexml->node->asXML(); // saveXML() is an alias of asXML()
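A slightly fuller sketch of the SimpleXML route, assuming the input is well-formed XML (SimpleXML will not parse tag soup). The <root> wrapper and the $markup array are made up for the example.

```php
<?php
// Hypothetical, well-formed input wrapped in an illustrative <root> element.
$xml = '<root><div class="info"><h5>title</h5> text '
     . '<a href="somelink">anchor</a></div></root>';

$simplexml = simplexml_load_string($xml);

$markup = [];
foreach ($simplexml->xpath('//div[@class="info"]') as $node) {
    // asXML() without a filename returns the node serialized as a string.
    $markup[] = $node->asXML();
}

print_r($markup);
```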

      

To expand on the quote:

The nodeValue of an element is actually a text value, not structured XML.

You can picture your node like this:

<div class="info">
    <__toString()> </__toString()>
    <h5>title</h5>
    <__toString()> text </__toString()>
    <a href="somelink">anchor</a>
    <__toString()> </__toString()>
</div>

      

A call to $element->nodeValue is then like a call to $element->__toString(): it collects only the __toString() pieces (including the text inside child elements such as h5 and a) and drops the tags. The imaginary <__toString()> element I made up is officially defined as an XML_TEXT_NODE.
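The text nodes in that picture are real and can be listed. A small sketch, using the div from the question as a standalone XML document; the $kinds array and its labels are illustrative only.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>'
);

// Label each direct child of the div by its node type.
$kinds = [];
foreach ($dom->documentElement->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $kinds[] = 'text';
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        $kinds[] = 'element:' . $child->nodeName;
    } else {
        $kinds[] = 'other';
    }
}

print_r($kinds);
// The element's nodeValue is the concatenation of all text beneath it.
var_dump($dom->documentElement->nodeValue);
```

The whitespace between tags shows up as text nodes of its own, which is why the list alternates between text and element entries.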

+1




XPath is designed to be hosted by another language (for example the DOM API, XSLT, XQuery, ...) and cannot be used stand-alone. The original question does not say which host is wanted.

Below is a very short and simple solution for the case where XPath is embedded in XSLT.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>

      

when applied to this XML document:

<html>
    <body>
        <div class="info">
            <h1>title1</h1> text1
            <a href="somelink1">anchor1</a>
        </div>
        Something else here
        <div class="info">
            <h2>title2</h2> text2
            <a href="somelink2">anchor2</a>
        </div>
        Something else here
        <div class="info">
            <h3>title3</h3> text3
            <a href="somelink3">anchor3</a>
        </div>
    </body>
</html>

      

produces the desired output:

<div class="info">
  <h1>title1</h1> text1
    <a href="somelink1">anchor1</a>
</div>
        Something else here
<div class="info">
  <h2>title2</h2> text2
  <a href="somelink2">anchor2</a>
</div>
        Something else here
<div class="info">
  <h3>title3</h3> text3
  <a href="somelink3">anchor3</a>
</div>
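Since the question is about PHP, the stylesheet above can be run from PHP roughly as follows. This is a sketch that assumes the xsl extension is installed; the transform() helper and the shortened input document are made up for the example.

```php
<?php
// The stylesheet from this answer, unchanged.
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
XSL;

// Hypothetical, shortened version of the input document.
$xml = '<html><body><div class="info"><h1>title1</h1> text1 '
     . '<a href="somelink1">anchor1</a></div> Something else here </body></html>';

// Illustrative helper: apply an XSLT 1.0 stylesheet to an XML string.
function transform(string $xml, string $xsl): string
{
    $source = new DOMDocument();
    $source->loadXML($xml);

    $stylesheet = new DOMDocument();
    $stylesheet->loadXML($xsl);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($stylesheet);

    return (string) $processor->transformToXml($source);
}

// Only run the demo when the xsl extension is actually available.
if (extension_loaded('xsl')) {
    echo transform($xml, $xsl);
}
```

The "Something else here" text appears in the result because XSLT's built-in template rules copy text nodes that no template matches. Adding an empty <xsl:template match="text()"/> to the stylesheet suppresses it.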

      

+1




You need to make sure your XPath query ends at the <div class="info"> element itself rather than at something inside it. However, because of the way XPath works, you will still get all the "subtags" as separate nodes. You just need to combine them.

You may also be able to do the joining in XPath itself, but since I haven't tried that I can't tell what problems you might run into.

0




div/node()

should do the trick.

Input example:

<div class="info">
  some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>

      

Sample XSLT stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
        <newtag>
                <xsl:copy-of select="div/node()"/>
        </newtag>
</xsl:template>

</xsl:stylesheet>

      

Output example:

<?xml version="1.0" encoding="utf-8"?>
<newtag> some<h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>
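The same node() selection also works without XSLT, directly from PHP's DOM API. A minimal sketch, assuming the div is the document root as in the input example above; $inner is an illustrative name.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<div class="info">some <h5>title</h5> text '
    . '<a href="somelink">anchor</a> more text</div>'
);

$xpath = new DOMXPath($dom);

// node() matches every child of the div: text nodes and elements alike.
$inner = '';
foreach ($xpath->query('/div/node()') as $node) {
    // Serialize each child in document order and join the pieces.
    $inner .= $dom->saveXML($node);
}

echo $inner;
```

This yields the content of the div without the surrounding div tags, which is the counterpart of wrapping the copied nodes in <newtag> in the stylesheet.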

      

0

