Remove an HTML element from a parsed HTML document

I have parsed an HTML document using PHP Simple HTML DOM Parser. The parsed document has a ul tag with some li tags in it. One of these li tags contains one of those dreaded AddThis buttons that I want to remove.

To make matters worse, the list item has no class or ID, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it using the parser.

What I want to do is search for the string "addthis.com" in all li elements and remove any element that contains that string.

<ul>
    <li>Foobar</li>
    <li>addthis.com</li><!-- How do I remove this? -->
    <li>Foobar</li>
</ul>

      

FYI: This is purely a hobby project in my quest to learn PHP, not a case of stealing content for profit.

All suggestions are welcome!



3 answers


I could not find a method to remove nodes explicitly, but a node can be removed by setting its outertext to an empty string.



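// Assumes the library file (simple_html_dom.php) has already been included.
$html = new simple_html_dom();
// Second and third arguments: do not lowercase tags, do not strip line breaks (preserves formatting)
$html->load(file_get_contents("test.html"), false, false);

foreach ($html->find('ul li') as $element) {
  // Assumes the button is rendered as <a class="addthis_button"> inside the list item
  if (count($element->find('a.addthis_button')) > 0) {
    $element->outertext = ""; // blanking the outer text drops the whole <li> from the output
  }
}

echo $html;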

      



Another option is to use jQuery in the browser after parsing. Something like this:



$('li').each(function () {
    // Remove any list item whose markup contains the string, not only an exact match
    if ($(this).html().indexOf("addthis.com") !== -1) {
        $(this).remove();
    }
});

      



This solution uses the DOMDocument class and DOMNode::removeChild:

$str = "<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove = 'addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
// Collect matches first: $elements is a live list, so removing nodes
// while iterating over it would skip items.
$domElemsToRemove = array();
foreach ($elements as $element) {
  $pos = strpos($element->textContent, $remove); // or similarly $element->nodeValue
  if ($pos !== false) {
    $domElemsToRemove[] = $element;
  }
}
foreach ($domElemsToRemove as $domElement) {
  $domElement->parentNode->removeChild($domElement);
}
// saveHTML() returns the whole document, including the doctype and the
// html/body wrappers that loadHTML() adds; the addthis.com item is gone.
$str = $doc->saveHTML();
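
As a variation on the same idea, the matching step could be pushed into an XPath expression, and only the body contents returned instead of the full document. This is a minimal sketch, not a definitive implementation: `removeListItemsContaining` is a hypothetical helper name, and it assumes the search string contains no single quote, since it is pasted straight into the XPath expression.

```php
<?php
// Hypothetical helper: strip every <li> whose text contains $needle.
function removeListItemsContaining(string $html, string $needle): string
{
    // Assumption of this sketch: the needle has no single quote,
    // because it is embedded in a single-quoted XPath string literal.
    if (strpos($needle, "'") !== false) {
        throw new InvalidArgumentException('Needle must not contain a single quote.');
    }

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = "//li[contains(., '" . $needle . "')]";

    // Copy the result into a plain array before removing anything.
    foreach (iterator_to_array($xpath->query($query)) as $li) {
        $li->parentNode->removeChild($li);
    }

    // Serialize only the children of <body>, skipping the doctype and
    // the html/body wrappers that loadHTML() adds.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeListItemsContaining(
    '<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>',
    'addthis.com'
);
```

Note that `contains(., ...)` only looks at the text inside the li. If the string appears only in an attribute such as a link's href, the expression would need to test that attribute instead, for example `//li[.//a[contains(@href, 'addthis.com')]]`.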

      
