Retrieving inner text of HTML tags using regular expressions
I am having trouble getting this data:
<tr>
<td><span class="bodytext"><b>Contact:</b><b></b></span><span style='font-size:10.0pt;font-family:Verdana;
mso-bidi-font-family:Arial'><b> </b>
<span class="bodytext">John Doe</span>
</span></td>
</tr>
<tr>
<td><span class="bodytext">PO Box 2112</span></td>
</tr>
<tr>
<td><span class="bodytext"></span></td>
</tr>
<!--*********************************************************
-->
<tr>
<td><span class="bodytext"></span></td>
</tr>
<tr>
<td><span class="bodytext">JOHAN</span> NSW 9700</td>
</tr>
<tr>
<td><strong>Phone:</strong>
02 9999 9999
</td>
</tr>
Basically, I want to capture everything after "Contact:" and before "Phone:", minus the HTML. However, these two labels may not always be present, so what I really need is to capture everything between the two colons (:) that are not inside an HTML tag. The number of <span class="bodytext">***data***</span> elements can vary, so I need some kind of loop to match them.
I would prefer to use regular expressions, although I could probably do it using loops and string matching.
Also, I would like to know the syntax for non-capturing groups in PHP regex.
Any help would be greatly appreciated!
If I understand correctly, you are only interested in the text between the HTML tags. To ignore the tags, just strip them first:
$text = preg_replace('/<[^<>]+>/', '', $html);
To capture everything between "Contact:" and "Phone:" use:
if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}
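Putting the two steps together, a minimal sketch might look like the following. The sample markup is a shortened, hypothetical version of the HTML in the question, and the final whitespace cleanup is an addition of mine, not part of the snippets above.

```php
<?php
// Hypothetical sample input, shortened from the markup in the question.
$html = <<<'HTML'
<tr>
<td><span class="bodytext"><b>Contact:</b></span>
<span class="bodytext">John Doe</span></td>
</tr>
<tr>
<td><span class="bodytext">PO Box 2112</span></td>
</tr>
<tr>
<td><strong>Phone:</strong>
02 9999 9999
</td>
</tr>
HTML;

// Step 1: remove the tags.
$text = preg_replace('/<[^<>]+>/', '', $html);

// Step 2: capture the text between the two labels (s flag: . matches newlines).
$result = preg_match('/Contact:(.*?)Phone:/s', $text, $regs) ? $regs[1] : '';

// Extra step (an assumption about the desired output): collapse the
// whitespace left behind by the removed tags into single spaces.
$result = trim(preg_replace('/\s+/', ' ', $result));

echo $result, "\n"; // John Doe PO Box 2112
```

Because the number of span elements does not matter once the tags are gone, no explicit loop is needed for this approach.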
To capture everything between two colons use:
if (preg_match('/:([^:]*):/', $text, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}
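One thing to watch with the two-colon pattern is that the capture also contains whatever label precedes the second colon. The sketch below shows this on a hypothetical, already stripped string; dropping the last word afterwards assumes the closing label is a single word, which may not hold for your data.

```php
<?php
// Hypothetical text, as it might look after the tags were stripped.
$text = "Contact: John Doe PO Box 2112 Phone: 02 9999 9999";

if (preg_match('/:([^:]*):/', $text, $regs)) {
    $between = trim($regs[1]);
} else {
    $between = "";
}
echo $between, "\n"; // John Doe PO Box 2112 Phone

// Assumption: the label before the second colon is one word, so it can be
// removed by deleting the final run of non-whitespace characters.
$withoutLabel = preg_replace('/\s*\S+$/', '', $between);
echo $withoutLabel, "\n"; // John Doe PO Box 2112
```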
It seems that the knee-jerk answer to questions like this is "OMG, don't use regular expressions! Use Beautiful Soup instead!" I personally prefer not to pull in external libraries for small tasks like this, and a regex is a good alternative.
An easy way to strip out all HTML tags, which is one way to solve this problem, is to use this regex:
$text = preg_replace("/<.*?>/s", "", $text); // the s flag lets . match newlines, so tags split across lines are removed too
then you can use whatever method you like to grab the relevant text content.
Non-capturing groups are written as: (?:this won't be captured)
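As an illustration of the (?:...) syntax, here is a small sketch in which the labels are grouped without being captured, so the wanted text stays in group 1. The alternative labels "Name" and "Tel" are hypothetical and only there to show why grouping is useful.

```php
<?php
// Hypothetical text, as it might look after the tags were stripped.
$text = "Contact: John Doe Phone: 02 9999 9999";

// (?:Contact|Name) and (?:Phone|Tel) group the alternatives but create no
// capture, so the only numbered group is the text between the labels.
preg_match('/(?:Contact|Name):\s*(.*?)\s*(?:Phone|Tel):/s', $text, $m);

echo $m[1], "\n";     // John Doe
echo count($m), "\n"; // 2 (the full match plus one captured group)
```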
Sounds like screen scraping; you can also use strip_tags() after finding the information you wanted.
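A minimal sketch of the strip_tags() route, using one row from the question as hypothetical input:

```php
<?php
// One row of the table from the question, on a single line.
$html = '<tr><td><span class="bodytext">JOHAN</span> NSW 9700</td></tr>';

// strip_tags() is built into PHP and removes the HTML tags from a string.
$text = trim(strip_tags($html));

echo $text, "\n"; // JOHAN NSW 9700
```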