Screen scraper with PHP and XPath

Question

Does anyone know how to maintain text formatting when using XPath to retrieve data?

I am currently extracting all blocks

<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>

from the page. The problem is that when I access nodeValue, I only get plain text. How can I retrieve the content with its markup intact, i.e. with the h5 and a tags still present in the result?

Thanks in advance. I've searched every combination imaginable on Google with no luck.

+1

php xpath screen-scraping

user137621 07 jan. 09 at 13:31

5 answers

I would like to add to Ciaran McNulty's answer.

You can do the same in SimpleXML:

$simplexml->node->asXml(); // saveXml() is now an alias
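As a rough sketch of the SimpleXML route, the snippet below loads a small sample document (the `root` wrapper and the markup are made up for illustration), selects the blocks with XPath, and compares the string cast with `asXML()`. It assumes the input is well-formed XML.

```php
<?php
// Hypothetical sample document; the <root> wrapper exists only for this example.
$simplexml = simplexml_load_string(
    '<root><div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div></root>'
);

// xpath() returns an array of SimpleXMLElement objects.
$divs = $simplexml->xpath('//div[@class="info"]');

// Casting to string yields only the text directly inside the div.
$text = (string) $divs[0];

// asXML() on a child element returns that element serialized with its tags.
$markup = $divs[0]->asXML();

echo $markup, "\n";
```

The string cast drops the h5 and a elements entirely, while `asXML()` keeps them.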

And to expand on this quote:

The NodeValue of an element is actually a text value, not structured XML.

You can picture your node like this:

<div class="info">
    <__toString()> </__toString()>
    <h5>title</h5>
    <__toString()> text </__toString()>
    <a href="somelink">anchor</a>
    <__toString()> </__toString()>
</div>

Calling $element->nodeValue is then like calling $element->__toString(): it only picks up the contents of the __toString() elements, not the tags around them. The imaginary __toString() element is officially defined as an XML_TEXT_NODE.
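To make the text-node picture concrete, here is a minimal sketch that walks the children of the sample div with PHP's DOM extension and records which ones are text nodes. The `$kinds` array is just an illustrative helper, and the markup is assumed to be well-formed XML.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>');

$kinds = [];
foreach ($dom->documentElement->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        // The "imaginary" nodes from the picture above.
        $kinds[] = 'text';
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        $kinds[] = 'element:' . $child->nodeName;
    }
}

echo implode(', ', $kinds), "\n";
// prints: text, element:h5, text, element:a, text
```

The whitespace between the tags shows up as text nodes of its own, because DOMDocument preserves whitespace by default.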

+1

null 08 jan. '09 at 9:42

XPath is intended to be used embedded in a host language (for example the DOM API, XSLT, XQuery, ...) and cannot be used stand-alone. The original question does not say which host is wanted.

Below is a very simple and short solution for the case where XPath is embedded in XSLT.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>

when applied to this XML document:

<html>
    <body>
        <div class="info">
            <h1>title1</h1> text1
            <a href="somelink1">anchor1</a>
        </div>
        Something else here
        <div class="info">
            <h2>title2</h2> text2
            <a href="somelink2">anchor2</a>
        </div>
        Something else here
        <div class="info">
            <h3>title3</h3> text3
            <a href="somelink3">anchor3</a>
        </div>
    </body>
</html>

produces the desired output:

<div class="info">
  <h1>title1</h1> text1
    <a href="somelink1">anchor1</a>
</div>
        Something else here
<div class="info">
  <h2>title2</h2> text2
  <a href="somelink2">anchor2</a>
</div>
        Something else here
<div class="info">
  <h3>title3</h3> text3
  <a href="somelink3">anchor3</a>
</div>
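Since the question is about PHP, a transformation like this could be run with the XSLTProcessor class. This is only a sketch: it assumes the optional xsl extension is enabled, and it uses a shortened, made-up input document rather than the full one above.

```php
<?php
// Assumption: PHP's xsl extension is available; bail out politely if not.
if (!extension_loaded('xsl')) {
    exit("The xsl extension is not enabled.\n");
}

$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>
    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

// Shortened sample input for illustration.
$xml = new DOMDocument();
$xml->loadXML(
    '<html><body><div class="info"><h1>title1</h1> text1 '
    . '<a href="somelink1">anchor1</a></div>Something else here</body></html>'
);

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

// transformToXml() returns the result as a string (or false on failure).
$result = $processor->transformToXml($xml);
echo $result, "\n";
```

Text outside the matched div elements still appears in the result, because XSLT's built-in template rules copy text nodes through.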

+1

Dimitre Novatchev 10 jan. 09 at 21:14

You need to make sure your XPath query ends at the <div class="info"> element. However, because of the way XPath works, you will still get all the "subtags" as separate nodes; you just need to combine them.

You may also be able to join them with XPath itself, but since I haven't used it for that, I can't tell what problems you might run into.

0

Glen Solsberry 07 jan. 09 at 13:38

div/node() should do the trick.

Input example:

<div class="info">
  some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>

Sample XSLT stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
        <newtag>
                <xsl:copy-of select="div/node()"/>
        </newtag>
</xsl:template>

</xsl:stylesheet>

Output example:

<?xml version="1.0" encoding="utf-8"?>
<newtag> some<h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>
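The same `node()` selection can be tried directly from PHP without XSLT. The following is a sketch under the assumption that the input is well-formed XML; `$inner` is an illustrative variable that collects the serialized child nodes.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<div class="info">some <h5>title</h5> text <a href="somelink">anchor</a> more text</div>'
);

$xpath = new DOMXPath($dom);

// node() matches text nodes as well as elements, so nothing is lost.
$inner = '';
foreach ($xpath->query('/div/node()') as $node) {
    // saveXML($node) serializes a single node, markup included.
    $inner .= $dom->saveXML($node);
}

echo $inner, "\n";
// prints: some <h5>title</h5> text <a href="somelink">anchor</a> more text
```

This yields the contents of the div without the div tag itself, which is roughly what the stylesheet above copies into newtag.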

0

phihag 07 jan. At 13:54

Ciaran McNulty · Accepted Answer · 2009-01-07T13:37:29+0000

If you have a DOMElement $element that is part of a DOMDocument $dom, then you will want to do something like:

$string = $dom->saveXml($element);

The NodeValue of an element is actually a text value, not structured XML.
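Putting this together with the XPath query from the question, a minimal sketch might look like the following. The sample markup is hypothetical and is loaded with `loadXML()` to keep the example predictable; a real scraped page would more likely be loaded with `loadHTML()`.

```php
<?php
// Hypothetical stand-in for the scraped page.
$page = '<html><body>'
      . '<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadXML($page);

$xpath = new DOMXPath($dom);

$plain  = [];
$markup = [];
foreach ($xpath->query('//div[@class="info"]') as $element) {
    // nodeValue flattens the element to its text content.
    $plain[] = $element->nodeValue;
    // saveXML($element) serializes just this node, tags included.
    $markup[] = $dom->saveXML($element);
}

echo $markup[0], "\n";
// prints: <div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>
```

Passing a node to `saveXML()` serializes only that node and leaves out the XML declaration, so the result can be used as a fragment.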
