Retrieving inner text of HTML tags using regular expressions

Question

I am having trouble getting this data:

              <tr>
                <td><span class="bodytext"><b>Contact:</b><b></b></span><span style='font-size:10.0pt;font-family:Verdana;
  mso-bidi-font-family:Arial'><b> </b> 
                      <span class="bodytext">John Doe</span> 
                     </span></td>
              </tr>
              <tr>
                <td><span class="bodytext">PO Box 2112</span></td>
              </tr>
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>

              <!--*********************************************************


              -->
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>



              <tr>
                <td><span class="bodytext">JOHAN</span> NSW 9700</td>
              </tr>
              <tr>
                <td><strong>Phone:</strong> 
                02 9999 9999
                    </td>
              </tr>

Basically, I want to capture everything after "Contact:" and before "Phone:", minus the HTML. However, these two labels may not always exist, so what I really need is to capture everything between the two colons (:) that are not inside an HTML tag. The number of <span class="bodytext">***data***</span> elements can actually change, so I need some kind of loop to match them.

I would prefer to use regular expressions, although I could probably do it using loops and string matches.

Also, I would like to know the syntax for non-capturing groups in PHP regex.

Any help would be greatly appreciated!

php regex

atomicharri Dec 18 '08 at 2:28

3 answers

Jan Goyvaerts · Answer 1 · 2008-12-18T02:38:55+0000

If I understand correctly, you are only interested in the text between the HTML tags. To ignore the tags, just strip them first:

$text = preg_replace('/<[^<>]+>/', '', $html);

To capture everything between "Contact:" and "Phone:", use:

if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}

To capture everything between two colons use:

if (preg_match('/:([^:]*):/', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
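
As a rough end-to-end sketch of the steps above, the snippet below strips the tags, captures the block between the two labels, and then splits it into trimmed, non-empty lines. The `$html` value is a shortened, hypothetical version of the markup in the question, and the splitting step and the `$lines` variable are additions for illustration, not part of the original answer.

```php
<?php
// Hypothetical, shortened sample of the markup from the question.
$html = <<<HTML
<tr>
  <td><span class="bodytext"><b>Contact:</b></span>
      <span class="bodytext">John Doe</span></td>
</tr>
<tr>
  <td><span class="bodytext">PO Box 2112</span></td>
</tr>
<tr>
  <td><span class="bodytext"></span></td>
</tr>
<tr>
  <td><span class="bodytext">JOHAN</span> NSW 9700</td>
</tr>
<tr>
  <td><strong>Phone:</strong>
  02 9999 9999
  </td>
</tr>
HTML;

// Step 1: strip the tags, as in the answer above.
$text = preg_replace('/<[^<>]+>/', '', $html);

// Step 2: capture the block between the two labels.
$lines = [];
if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
    // Step 3 (an addition): split on line breaks, trim each piece,
    // and drop the pieces that end up empty.
    $pieces = array_map('trim', preg_split('/\R/', $regs[1]));
    $lines  = array_values(array_filter($pieces, 'strlen'));
}

print_r($lines);
```

Because the tags are removed before matching, the number of span elements between the labels does not matter; each one simply contributes another line of text.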

nickf · Answer 2 · 2008-12-18T02:39:27+0000

It seems that the reflexive answer to questions like this is "OMG, don't use regular expressions! Use Beautiful Soup instead!!". I personally prefer not to use external libraries for small tasks like this, and regex is a good alternative.

An easy way to strip out all HTML tags, which is one way to solve this problem, is to use this regex:

// The s modifier lets . match newlines, so tags that wrap across lines are stripped too.
$text = preg_replace("/<.*?>/s", "", $text);

Then you can use whatever method you like to grab the relevant text content.

The syntax for non-capturing groups is: (?:this won't be captured)
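
To make the difference concrete, here is a small sketch comparing a capturing group with a non-capturing one. The sample string and the variable names `$capturing` and `$nonCapturing` are made up for illustration.

```php
<?php
$text = 'Contact: John Doe Phone: 02 9999 9999';

// Capturing group: the alternation takes slot 1, the value takes slot 2.
preg_match('/(Contact|Phone):\s*(\w+)/', $text, $capturing);

// Non-capturing group: (?:...) still groups the alternation,
// but the value is now the first (and only) captured group.
preg_match('/(?:Contact|Phone):\s*(\w+)/', $text, $nonCapturing);

var_dump(count($capturing));     // 3 entries: full match plus two groups
var_dump(count($nonCapturing));  // 2 entries: full match plus one group
```

Using (?:...) keeps the indexes of the match array stable when a group is only needed for alternation or repetition.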

Phill pafford · Answer 3 · 2009-10-05T13:33:27+0000

Sounds like screen scraping. You can also use strip_tags() after finding the information you wanted.
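
A minimal sketch of that suggestion, assuming the fragment of interest has already been located; the `$cell` value is a hypothetical piece of markup in the style of the question.

```php
<?php
// Hypothetical fragment in the style of the question's markup.
$cell = '<td><span class="bodytext">JOHAN</span> NSW 9700</td>';

// strip_tags() removes the HTML tags and keeps the text between them;
// trim() then drops any leading or trailing whitespace.
$plain = trim(strip_tags($cell));

echo $plain, PHP_EOL;
```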
