HTMLEncode script tags only

I'm working on StackQL.net, a simple website that lets you run custom T-SQL queries against the Stack Overflow public dataset. It's ugly (I'm not a graphic designer), but it works.

One choice I have made is not to HTML-encode the contents of the post bodies. That way you see some of the formatting from the posts in your query results. It will even load images, and I'm fine with that.

But I'm concerned that this will also keep <script> tags active. Someone could plant a malicious script in a Stack Overflow answer; they could even delete it right away, so no one sees it. One of the more common queries people try on their first visit is a simple Select * from posts, so with a little timing a script like this could end up running in many browsers. I want to make sure this is not a concern before I update to the October export (hopefully released soon).

What is the best and safest way to ensure that only script tags are ultimately encoded?

+2




5 answers


You can adapt the HTMLSanatize script to suit your needs. It was written by Jeff Atwood to allow certain kinds of HTML to be shown. Since it was written for Stack Overflow, it should fit your purpose as well.



I don't know how up to date it is with what Jeff currently has deployed, but it's a good starting point.

+3




Don't forget onclick, onmouseover, etc., or javascript: pseudo-URLs (<img src="javascript:evil!Evil!">), or CSS (style="property: expression(evil!Evil!);"), or ...

There are many attack vectors outside of simple script elements.



Make a whitelist, not a blacklist.
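To make the whitelist idea concrete, here is a minimal sketch in Python using only the standard library's html.parser. The tag list, the attribute list, and the names WhitelistSanitizer and sanitize are illustrative assumptions, not what Stack Overflow or StackQL actually uses; a maintained sanitizer library is a safer choice for production.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED_TAGS = {
    "p": set(), "br": set(), "b": set(), "i": set(), "em": set(),
    "strong": set(), "code": set(), "pre": set(), "blockquote": set(),
    "ul": set(), "ol": set(), "li": set(),
    "a": {"href"}, "img": {"src", "alt"},
}
VOID_TAGS = {"br", "img"}            # tags that never get a closing tag
URL_ATTRS = {"href", "src"}          # attributes whose value is a URL
ALLOWED_SCHEMES = {"http", "https"}  # everything else (javascript:, data:, ...) is dropped


def is_safe_url(value):
    """Accept only absolute http(s) URLs; relative URLs are rejected too."""
    try:
        return urlsplit(value.strip()).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped; their text content is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_TAGS[tag]:
                continue  # drops onclick, onmouseover, style, ...
            if name in URL_ATTRS and not is_safe_url(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))  # text is always re-escaped


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

With this sketch, sanitize('<p onclick="evil()">hi</p>') returns '<p>hi</p>': anything not explicitly listed is removed, so a new attack vector has to get onto the list before it can get through.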

+2




If the posts are in XHTML format, you can run an XSL transformation and encode or strip the tags and attributes you don't want. It gets a little easier if you use something like TinyMCE or CKEditor to provide a WYSIWYG editor that outputs XHTML.

+1




What about simply breaking the <script> tags? Escaping just the < and > of that tag, so that you end up with &lt;script&gt;, could be a simple and straightforward way.

Of course, links are another vector. You must also disable every instance of href='javascript:' and every attribute starting with on*.
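As a rough illustration of the "escape only the script tags" idea, here is a sketch in Python. The regex and the name escape_script_tags are assumptions made for this example, and the second call shows why this step alone is not enough.

```python
import re

# Matches opening and closing script tags, with or without attributes.
SCRIPT_TAG = re.compile(r"<(/?\s*script\b[^>]*)>", re.IGNORECASE)


def escape_script_tags(html_text):
    """Turn <script ...> and </script> into inert text; leave all other markup alone."""
    return SCRIPT_TAG.sub(r"&lt;\1&gt;", html_text)


print(escape_script_tags("<b>x</b><script>alert(1)</script>"))
# An event handler passes through unchanged, so it would still run in the browser:
print(escape_script_tags('<img src=x onerror="evil()">'))
```

The first call defuses the script element, but the second input comes back untouched, which is exactly the links-and-on* problem described above.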

Just to be sure, nuke it from orbit.

0




"But I'm concerned that this will also keep <script> tags active."

Oh, that is just the beginning of the malicious HTML content that can cause cross-site scripting. There are also event handlers; inline and linked CSS (expressions, behaviors, bindings); Flash and other embeddable plugins; iframes to exploit sites; javascript: and other dangerous schemes (there are more than you think!) in every place that can accept a URL; meta-refresh; UTF-8 overlongs; UTF-7 mis-sniffing; data binding; VML and other non-HTML content; broken markup parsed as script by permissive browsers...

In short, any quick-fix attempt to sanitize HTML with a simple regex will fail.
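One classic way such a regex fails, sketched in Python. The pattern and the name naive_strip are made up for this demonstration and do not come from any real sanitizer.

```python
import re

# A typical quick fix: delete anything that looks like a script element.
NAIVE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip(html_text):
    return NAIVE.sub("", html_text)


# The attacker splits the tag name around a decoy element. Removing the
# decoys glues the outer pieces back together into a working script tag.
payload = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
print(naive_strip(payload))  # prints <script>alert(1)</script>
```

Running the regex in a loop until nothing changes closes this particular hole, but not the event handlers, URL schemes, and encoding tricks listed above.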

Either escape everything, so that any HTML is displayed as plain text, or use a sanitizer built on a full parser and a whitelist. (And keep it up to date, because even that is hard work and newly discovered holes turn up in them often.)
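The first option is a one-liner in most languages. A sketch in Python, where render_as_text is just an illustrative name:

```python
from html import escape


def render_as_text(post_body):
    """Escape &, <, > and quotes so the browser shows the markup instead of running it."""
    return escape(post_body, quote=True)


print(render_as_text('<script>alert("x")</script>'))
# prints &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

You lose the formatting and images this way, which is the trade-off the question is trying to avoid, but nothing in the post body can execute.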

But aren't you using the same Markdown system as SO to render the posts? That would be the obvious thing to do. I cannot guarantee that there are no holes in Markdown that would allow cross-site scripting (there certainly have been in the past, and there are probably a few more obscure ones left, since it is a rather complex system). But at least you would be no more exposed than SO is!

0








