Best way to handle mixed HTML and user input?
In a PHP application I am writing, I would like users to be able to enter a mix of HTML and plain text that contains angle brackets. When I display this text, the HTML tags should be rendered and any angle brackets that are not part of a tag should be displayed literally. For example, the user should be able to type:
<b> 5 > 3 = true</b>
When displayed, the user should see (in bold):
5 > 3 = true
What is the best way to parse this, i.e. find all the angle brackets that are not part of HTML tags and convert them to &gt; and &lt;?
I would recommend having users enter BBCode-style markup, which you then replace with HTML tags:
[b]This is bold[/b]
[i]this is italic with a > 'greater than' sign there[/i]
This gives you more control over how user input is converted into HTML, although I admit it can look like an unnecessary burden.
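The BBCode idea could be sketched roughly as follows. This is only a minimal illustration, not a full BBCode parser: the function name `bbcode_to_html` and the two-tag whitelist are assumptions made for the example. It escapes everything with `htmlspecialchars()` first and then converts properly paired `[b]` and `[i]` tags.

```php
<?php
// Hypothetical helper (not from any library): escape all user input,
// then turn a small whitelist of paired BBCode tags into HTML.
function bbcode_to_html(string $input): string
{
    // Escape everything first so stray < and > are displayed literally.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Assumed whitelist: BBCode tag => HTML element.
    $tags = ['b' => 'b', 'i' => 'i'];

    foreach ($tags as $bb => $el) {
        // Match only paired [tag]...[/tag]; non-greedy, may span lines.
        $pattern = '/\[' . $bb . '\](.*?)\[\/' . $bb . '\]/is';
        $html = preg_replace($pattern, '<' . $el . '>${1}</' . $el . '>', $html);
    }

    return $html;
}

echo bbcode_to_html("[b]5 > 3 = true[/b]"), "\n";
// prints: <b>5 &gt; 3 = true</b>
```

Unpaired tags such as a lone `[b]` are left as plain text in this sketch, which keeps the generated HTML balanced.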
If you allow users to enter HTML, you have a much more serious problem to solve than a few stray angle brackets: HTML is really hard to validate and filter correctly, and if you get it wrong you open yourself up to XSS attacks. I wrote a library that does this; someone else has already posted a link to it here, so I won't repeat it.
However, to answer your question, the most reliable way to convert stray angle brackets to their escaped forms is to parse the HTML with DOM / libxml and then reserialize it. Anything based on regular expressions or similar is doomed to fail on edge cases. You could also write your own parser, but that takes some work too.
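A minimal sketch of the parse-and-reserialize step, assuming PHP's bundled DOM extension, might look like this. The function name `normalize_fragment` and the wrapper `<div>` are illustrative choices, not part of any library. Note that this only normalizes stray brackets; it does not filter dangerous tags or attributes, so it is not an XSS defence on its own.

```php
<?php
// Hypothetical helper: parse a user-supplied fragment with DOM/libxml and
// serialize it back, so stray angle brackets in text come out escaped.
// This does NOT remove <script>, event handlers, etc.
function normalize_fragment(string $fragment): string
{
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment in a container element; the leading XML
    // declaration is a common trick to make libxml treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $fragment . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Serialize only the wrapper's children, not the added html/body.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

// The stray ">" inside the <b> element is serialized as &gt;.
echo normalize_fragment('<b> 5 > 3 = true</b>'), "\n";
```

The serialized output would still need to go through a proper HTML filter before being shown to other users.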
A better way would be to do the opposite: instead of finding and escaping the non-HTML angle brackets, escape everything first, then look for the escaped forms of <b> and </b> (that is, &lt;b&gt; and &lt;/b&gt;) and unescape only those special cases. This way you don't run the risk of a user injecting malicious HTML into your page (if you try to escape only what needs escaping, you risk missing something important).
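As a rough sketch of this escape-first approach, the following assumes a hypothetical helper `escape_with_whitelist` and an illustrative whitelist of `b` and `i`. Only bare tags without attributes are restored; everything else stays escaped.

```php
<?php
// Hypothetical helper: escape everything, then restore only bare
// whitelisted tags such as <b> and </b> (no attributes allowed).
function escape_with_whitelist(string $input, array $allowed = ['b', 'i']): string
{
    // Step 1: escape every special character in the input.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Step 2: unescape only &lt;tag&gt; and &lt;/tag&gt; for allowed tags.
    foreach ($allowed as $tag) {
        $pattern = '/&lt;(\/?)' . preg_quote($tag, '/') . '&gt;/i';
        $html = preg_replace($pattern, '<${1}' . $tag . '>', $html);
    }

    return $html;
}

echo escape_with_whitelist('<b> 5 > 3 = true</b> <script>alert(1)</script>'), "\n";
// prints: <b> 5 &gt; 3 = true</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

This sketch does not check that restored tags are balanced, so an unclosed `<b>` would still need handling, for example by counting opening and closing tags before output.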