Regular expression to remove nested character formatting tags that wrap a whole paragraph
I need a regex to match and remove consecutive character formatting tags that wrap the entire content of a paragraph tag, using Simple HTML DOM Parser.
Input:
<p><b><i>Lorem Ipsum Content</i></b></p>
Expected result: <p>Lorem Ipsum Content</p>
In the example below, the regex should match and remove only the <b> tag, since it is the only tag that wraps the entire paragraph content.
Input: <p><b>Text <i> some more text </i>text inside </b></p>
Output: <p>Text <i> some more text </i>text inside </p>
Thanks.
It will look something like this:
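foreach ($html->find('p') as $p) {
    // Peel off one wrapping tag per iteration until nothing wraps the whole paragraph.
    // (\w+) captures only the tag name, so tags with attributes still pair up via \1;
    // the s modifier lets .* span line breaks.
    while (preg_match('/^\s*<(\w+)\b[^>]*>(.*)<\/\1>\s*$/s', $p->innertext, $m)) {
        $p->innertext = $m[2];
    }
}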
Note that \1 in the regex is a backreference to the first capture group, so the closing tag has to match the opening one. It may not be strictly necessary, but I added it as a bonus.
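For anyone who wants to try the pattern outside PHP, here is a minimal Python sketch of the same peel-off loop. The names WRAPPER and unwrap_inner are made up for illustration, and the function works on a plain string that stands in for the paragraph's inner HTML.

```python
import re

# Hypothetical helper names; the pattern mirrors the idea of the PHP loop above.
# (\w+)\b captures the tag name only, \1 requires the same name in the closing
# tag, and re.DOTALL lets .* span line breaks.
WRAPPER = re.compile(r"^\s*<(\w+)\b[^>]*>(.*)</\1>\s*$", re.DOTALL)

def unwrap_inner(inner: str) -> str:
    """Strip tags that wrap the whole string, one layer per iteration."""
    while True:
        m = WRAPPER.match(inner)
        if not m:
            return inner
        inner = m.group(2)  # keep only what was inside the wrapper

print(unwrap_inner("<b><i>Lorem Ipsum Content</i></b>"))  # Lorem Ipsum Content
```

One caveat: the pattern only checks that the string starts with an opening tag and ends with a closing tag of the same name. Input such as <b>a</b> and <b>c</b> also satisfies that, so the outer tags get stripped even though neither wraps the whole paragraph. If that case can occur, walking the DOM children is probably safer than a regex.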
Not elegant, and possibly only a partial solution.
1. Trim (strip) the input string.
2. while True:
3. Replace <i> with "".
4. Replace <b> with "".
5. Replace each of the other character formatting tags with "".
6. ...
7. If no match is found in steps 3 to 6, then break.
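The regex for step 3 is:
<p>\s*(<i>)\s*.*(<\/i>)\s*<\/p>
Groups 1 and 2 capture the opening and closing tag, which are the parts to replace with "".
For the <b> tag, replace <i> with <b>, and so on for the other tags.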
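The steps above could be sketched in Python roughly as follows. This is an adaptation, not the exact regex from the answer: it captures the content between the tags and writes it back, since that is easier to express with re.subn. The tag list and the name strip_wrapping_tags are assumptions, and tags with attributes are not handled.

```python
import re

# Assumed tag list; the answer only names <i> and <b>.
FORMATTING_TAGS = ("i", "b", "u", "em", "strong")

def strip_wrapping_tags(html: str, tags=FORMATTING_TAGS) -> str:
    """Remove formatting tags that wrap the whole content of a <p>."""
    html = html.strip()                          # step 1
    while True:                                  # step 2
        changed = False
        for tag in tags:                         # steps 3 to 6
            pattern = rf"<p>\s*<{tag}>(.*?)</{tag}>\s*</p>"
            # Keep the captured content, drop the wrapping tag pair.
            new_html, count = re.subn(pattern, r"<p>\1</p>", html, flags=re.DOTALL)
            if count:
                html, changed = new_html, True
        if not changed:                          # step 7
            return html

print(strip_wrapping_tags("<p><b><i>Lorem Ipsum Content</i></b></p>"))
# <p>Lorem Ipsum Content</p>
```

As with any regex approach here, a paragraph that starts and ends with two separate elements of the same tag is a limitation: it matches the pattern too and would be stripped incorrectly.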