Regular expression to remove nested character-formatting tags

I need a regex to match and remove nested character-formatting tags that wrap the entire content of a paragraph tag, using the Simple HTML DOM Parser.

Input:

<p><b><i>Lorem Ipsum Content</i></b></p>

      

Expected result: <p>Lorem Ipsum Content</p>

In the example below, the regex should match and remove only the <b> tags, because <b> is the only tag that wraps the entire content of the paragraph.

Input: <p><b>Text <i> some more text </i>text inside </b></p>

Output: <p>Text <i> some more text </i>text inside </p>

Thanks.



2 answers


It will look something like this:

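foreach ($html->find('p') as $p) {
  // Strip a tag only while it wraps the paragraph's entire content.
  // (\w+) captures the tag name, [^>]* skips attributes, \1 requires the same closing tag,
  // and the s flag lets . match newlines.
  // Caveat: <b>a</b> x <b>b</b> also matches, since the first <b> pairs with the last </b>.
  while (preg_match('/^\s*<(\w+)[^>]*>(.*)<\/\1>\s*$/s', $p->innertext, $m)) {
    $p->innertext = $m[2];
  }
}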

      



Note that \1 in the regex is a backreference to the tag name captured by the first group, so the closing tag has to match the opening one. It may not be strictly necessary, but I added it as a bonus.
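A possible refinement, sketched on plain strings so it runs without the parser: the loop only peels a tag when that tag is balanced inside the captured content, which avoids pairing the first <b> with an unrelated last </b>. The helper names strip_wrapping_tags and is_balanced, and the tag list (b, i, u, em, strong), are assumptions made for illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper: true when the opening and closing <$tag> tags found
// inside $inner are balanced, i.e. the outer pair really encloses everything.
function is_balanced(string $inner, string $tag): bool
{
    $depth = 0;
    $pattern = '/<(\/?)' . preg_quote($tag, '/') . '\b[^>]*>/i';
    preg_match_all($pattern, $inner, $hits, PREG_SET_ORDER);
    foreach ($hits as $hit) {
        $depth += ($hit[1] === '/') ? -1 : 1;
        if ($depth < 0) {
            return false; // a closing tag appears before its opener
        }
    }
    return $depth === 0;
}

// Hypothetical helper: peel off formatting tags that wrap the whole content.
// The tag list is an assumption; extend it as needed.
function strip_wrapping_tags(string $inner): string
{
    $pattern = '/^\s*<(b|i|u|em|strong)\b[^>]*>(.*)<\/\1>\s*$/is';
    while (preg_match($pattern, $inner, $m) && is_balanced($m[2], $m[1])) {
        $inner = $m[2]; // keep only what was inside the wrapping pair
    }
    return $inner;
}
```

With the parser, the result would presumably be assigned back with $p->innertext = strip_wrapping_tags($p->innertext). For the two inputs in the question, the function returns "Lorem Ipsum Content" and "Text <i> some more text </i>text inside " respectively, while "<b>a</b> text <b>b</b>" is left unchanged.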



Not elegant, and possibly only a partial solution.

  1. Trim (strip) the input string.

  2. while True:

  3. Replace <i> with "".

  4. Replace <b> with "".

  5. Replace each remaining character-formatting tag with "".

  6. ...

  7. If no match was found in steps 3 to 6, break.

And the regex for step 3 is the following. Replace each match with <p>$1</p>, where $1 is the captured content.



<p>\s*<i>(.*)<\/i>\s*<\/p>

      

For the <b> tag, replace <i> with <b> in the pattern, and so on for the other tags.
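The steps above can be sketched roughly as follows. The function name unwrap_paragraph and the default tag list are assumptions for illustration. The pattern is also slightly stricter than the one in the answer: the captured content may not contain another tag of the same name, so a paragraph like <p><b>a</b> text <b>b</b></p> is left alone.

```php
<?php
// Hypothetical helper: for each formatting tag, drop a pair that wraps the
// whole paragraph, and repeat until a full pass changes nothing.
function unwrap_paragraph(string $html, array $tags = ['i', 'b', 'u']): string
{
    $html = trim($html);                      // step 1: trim the input
    while (true) {                            // step 2: loop
        $before = $html;
        foreach ($tags as $tag) {             // steps 3 to 6: one replace per tag
            $t = preg_quote($tag, '/');
            // The lookahead stops the capture from crossing another <tag> or </tag>.
            $pattern = '/^<p>\s*<' . $t . '>((?:(?!<\/?' . $t . '\b).)*)<\/' . $t . '>\s*<\/p>$/s';
            $html = preg_replace($pattern, '<p>$1</p>', $html);
        }
        if ($html === $before) {              // step 7: nothing matched, stop
            break;
        }
    }
    return $html;
}
```

Called on the two inputs from the question, this returns <p>Lorem Ipsum Content</p> and <p>Text <i> some more text </i>text inside </p>. Opening tags with attributes are not handled here, in line with the pattern given in the answer.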
