RegEx: match text that is not inside or part of an HTML tag

How can I match all content that lies outside of HTML tags?

My pseudo HTML:

<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>

      

I have tried this regex:

(?<=^|>)[^><]+?(?=<|$)

      

which gives me: "aaa bbb ccc ddd"

What I need is a way to also skip the text enclosed by HTML tags and return only: "bbb ccc"

+1




3 answers


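Regexes are a clumsy and unreliable way to work with markup. I would suggest using a DOM parser like SimpleHtmlDom:

require_once 'simple_html_dom.php'; // adjust the path to where the library lives
//print the textual content of all hyperlinks on the specified page.
//you can use selectors, e.g. 'a.pretty' - see the docs
//find('a') returns an array of elements, so loop over it
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext, "\n";
}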

      



If you want to do this on the client, you can use a library like jQuery:

$('a').each(function() {
    alert($(this).text());
});
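
For the markup in the question, a DOM-based sketch could collect only the text nodes that sit directly under the container, which skips text nested inside elements such as h1 and div. The helper name topLevelText is hypothetical, and the function assumes it receives an already parsed container node.

```javascript
// Hypothetical helper: collect text that is a direct child of `root`,
// ignoring text nested inside child elements.
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

function topLevelText(root) {
  return Array.from(root.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE) // keep text nodes only
    .map((node) => node.nodeValue.trim())          // drop surrounding whitespace
    .filter((text) => text.length > 0)             // skip whitespace-only nodes
    .join(' ');
}
```

In a browser, a container could be obtained with new DOMParser().parseFromString(html, 'text/html').body; for the pseudo HTML in the question the helper would then be expected to return "bbb ccc".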

      

+6




Find a suitable regex that matches complete tags (for example, in a library like http://regexlib.com/) and remove the matches using the s/// substitution operator. Then use what remains.
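
A rough sketch of that idea in JavaScript, where String.prototype.replace plays the role of s///. The function name textOutsideElements and both patterns are assumptions made for illustration, not something taken from regexlib.com, and they will not cope with nested elements of the same name or with malformed markup.

```javascript
// Sketch only: regexes cannot handle nested or malformed markup reliably.
function textOutsideElements(html) {
  return html
    // 1. drop paired elements together with their content, e.g. <h1>aaa</h1>
    .replace(/<([a-zA-Z][\w-]*)\b[^>]*>[\s\S]*?<\/\1\s*>/g, ' ')
    // 2. drop any remaining standalone tags, e.g. <img ... />
    .replace(/<[^>]+>/g, ' ')
    // 3. collapse the leftover whitespace
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(textOutsideElements(sample)); // bbb ccc
```

The backreference \1 ties each closing tag to the name of its opening tag, which is why the img tag is left for the second pass.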



0




Thanks everyone.

Combining both expressions would be a dirty workaround; what I would really like is the inverse of what this expression matches:

(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)

      

My pseudo string:

<h1>aaa</h1>

bbb <img src="bla" /> ccc

<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>

<div>dsada</div> hbhgjh
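
One way to get the inverse of a pattern is to keep the gaps between its matches. The sketch below does that in JavaScript with the expression from this post (only the global flag is added); the helper name textBetweenMatches is hypothetical. Note that "." does not cross line breaks here, so the outcome depends on the pseudo string being split over several lines.

```javascript
// The pattern from the post above, with the global flag added for matchAll().
const tagPattern = /(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)/g;

// Hypothetical helper: return the pieces of `input` that `pattern` does NOT match.
function textBetweenMatches(input, pattern) {
  const pieces = [];
  let last = 0; // end offset of the previous match
  for (const m of input.matchAll(pattern)) {
    if (m.index > last) pieces.push(input.slice(last, m.index));
    last = m.index + m[0].length;
  }
  if (last < input.length) pieces.push(input.slice(last)); // tail after the last match
  return pieces.map((s) => s.trim()).filter((s) => s.length > 0);
}

const pseudo = [
  '<h1>aaa</h1>',
  '',
  'bbb <img src="bla" /> ccc',
  '',
  '<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>',
  '',
  '<div>dsada</div> hbhgjh',
].join('\n');

console.log(textBetweenMatches(pseudo, tagPattern));
// [ 'bbb', 'ccc', 'jhgvjhgjh zhg zt', 'hbhgjh' ]
```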

      

For simplicity, I am using this tool.

0

