RegEx: match text that is not inside and part of an HTML tag

Question

How can I collect all content that lies outside of HTML tags?

My pseudo HTML:

<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>

I have tried this regex:

(?<=^|>)[^><]+?(?=<|$)

which gives me: "aaa bbb ccc ddd"

All I need is a way to ignore the HTML tags, together with the text they enclose, and return only: "bbb ccc"

regex

crustymalte 09 June '09 at 15:20

3 answers

karim79 · Answer 1 · 2009-06-09T15:29:41+0000

Regexes are a clumsy and unreliable way to work with markup. I would suggest using a DOM parser like SimpleHtmlDom:

require_once 'simple_html_dom.php';
// Print the text content of every hyperlink on the specified page.
// find() returns an array of elements; selectors such as 'a.pretty' also work - see the docs.
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext, "\n";
}

If you want to do this on the client, you can use a library like jQuery:

$('a').each(function() {
    alert($(this).text());
});

Fritz G. Mehner · Answer 2 · 2009-06-09T15:32:30+0000

Find a suitable regex that matches a complete tag (for example, in a library like http://regexlib.com/) and remove the tags with the s/// substitution operator. Then use what is left.
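
A minimal sketch of that removal step, written in JavaScript with String.prototype.replace standing in for s///. The tag pattern and the helper name stripTags are assumptions made for illustration, not something taken from regexlib.com, and the pattern is a simplification (it breaks on a '>' inside an attribute value, for instance).

```javascript
// Hypothetical helper: remove anything that looks like an opening,
// closing, or self-closing tag. Simplified pattern, not a real HTML parser.
function stripTags(html) {
  return html.replace(/<\/?[a-zA-Z][^>]*>/g, "");
}

// The pseudo HTML from the question.
const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';

// Collapse the leftover whitespace into single spaces.
const text = stripTags(sample).split(/\s+/).filter(Boolean).join(" ");

console.log(text); // aaa bbb ccc ddd
```

Note that this only drops the tags themselves. The text enclosed by <h1> and <div> survives, so on its own it gives "aaa bbb ccc ddd" rather than the "bbb ccc" the question asks for.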

crustymalte · Answer 3 · 2009-06-09T21:07:54+0000

Thanks everyone,

combining both expressions into one is dirty work, but it gives me the opposite result: it matches everything I want to throw away.

(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)

As a pseudo string:

<h1>aaa</h1>

bbb <img src="bla" /> ccc

<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>

<div>dsada</div> hbhgjh

For simplicity, I am using this tool.
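
A sketch of how the combined expression above could be used, assuming the goal is to delete every match and keep the remainder. It is written in JavaScript; the function name textOutsideElements is made up for this example. The result depends on '.' not matching newlines, so every element has to open and close on the same line, as in the pseudo strings here.

```javascript
// The combined expression from this answer, with the global flag added.
// First alternative: a tag pair plus its content. Second: a single tag.
const unwanted = /(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)/g;

// Hypothetical helper: delete every match, then tidy up the whitespace.
function textOutsideElements(html) {
  return html.replace(unwanted, "").replace(/\s+/g, " ").trim();
}

// Pseudo HTML from the question.
const question = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';

// Pseudo string from this answer.
const answer = [
  "<h1>aaa</h1>",
  'bbb <img src="bla" /> ccc',
  "<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>",
  "<div>dsada</div> hbhgjh",
].join("\n\n");

console.log(textOutsideElements(question)); // bbb ccc
console.log(textOutsideElements(answer));   // bbb ccc jhgvjhgjh zhg zt hbhgjh
```

If a standalone tag is followed by a closing tag on the same line (for example `<img /> ccc <div>ddd</div>`), the first alternative swallows the text in between, so this remains a rough tool rather than a replacement for a DOM parser.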
