Is there an HTML/CSS normalizer that works?
A long time ago I wrote a style "normalizer" program to scan the ASP/HTML code of a large batch of classic ASP pages (most of which were originally created from MS Word documents, so naturally they were littered with style tags and massive one-off styles). The normalizer generated a minimal set of styles and classes and a new "processed" ASP/HTML document, so that the processed document produced exactly the same rendered result as the original (verified by comparing screenshots).
From time to time I run into the need for such a program again, and I am toying with the idea of writing it for commercial release.
I haven't found anything like it (the HTML::Normalize Perl module and the HTML Tidy project just clean up tags).
So my questions are:
- Is there such a tool already, commercial or otherwise?
- If not, is there really a need for it?
- If so, what features would make it really useful?
Re #3: for example, building a base stylesheet for a set of pages, or making all pages use a given base stylesheet; preserving classic ASP commands and #includes; preserving inline ASP.NET scripts; etc. The more specific and numerous the suggestions, the better.
Example:
Old HTML with inline styles
<html><head>
<title>title</title>
<style type='text/css'>
.cls1 { font-family: arial; font-size: 10px; font-weight: bold; }
</style>
</head>
<body>
<% somefunction() %>
<div class='cls1' style='font-size:10px;'>test div</div>
</body>
</html>
New HTML
<html><head>
<title>title</title>
<style type='text/css'>
.cls1 { font-family: arial; font-size: 10px; font-weight: bold; }
</style>
</head>
<body>
<% somefunction() %>
<div class='cls1'>test div</div>
</body>
</html>
Note that the div no longer has an inline style, because it was redundant with the cls1 class.
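The redundancy check in this example can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the helper names `parse_declarations` and `strip_redundant` are hypothetical, the comparison is purely textual, and the cascade, specificity, `!important`, and shorthand properties are all ignored. A real normalizer would need a proper CSS parser and a rendering check such as the screenshot comparison mentioned above.

```python
def parse_declarations(css_text):
    """Parse 'prop: value; ...' into a dict keyed by lowercased property name."""
    decls = {}
    for part in css_text.split(";"):
        if ":" not in part:
            continue  # skip empty fragments such as the one after a trailing ';'
        prop, value = part.split(":", 1)
        # Collapse internal whitespace so 'arial ,  sans-serif' compares consistently.
        decls[prop.strip().lower()] = " ".join(value.split())
    return decls


def strip_redundant(inline_style, class_rule):
    """Return the inline style minus declarations the class rule already sets."""
    inline = parse_declarations(inline_style)
    from_class = parse_declarations(class_rule)
    kept = {p: v for p, v in inline.items() if from_class.get(p) != v}
    return "; ".join(f"{p}: {v}" for p, v in kept.items())


# Declarations of .cls1 from the example above.
cls1 = "font-family: arial; font-size: 10px; font-weight: bold;"

print(repr(strip_redundant("font-size:10px;", cls1)))
print(repr(strip_redundant("font-size:10px; color: red", cls1)))
```

When the result is an empty string, as for the div in the example, the whole `style` attribute can be dropped.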
EDIT: Removed the term "sanitizer", since I'm not focusing on XSS attacks or filtering input in comments, just consolidating lots of custom styles and random CSS classes into a minimal, consistent set of stylesheets.
Well, I can't say definitively that it "works" for all of this, but Tidy does a little more than clean up tags.
See the HTML Tidy settings, especially those specific to Microsoft Word (e.g. word-2000).
If you want to know whether you've done a reasonable job, you should try these tests (if you're using something like Tidy, you probably haven't).
Some options:
- HTML Purifier in PHP
- lxml.html.clean in Python
- feedparser has an aggressive cleaner in Python
- LiveJournal's code in Perl
Anything that uses regular expressions and doesn't parse the markup would be suspect in my mind (and just too hard to get right).
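To illustrate the difference, here is a minimal sketch that collects inline styles with Python's standard `html.parser` instead of regular expressions. The class name `InlineStyleCollector` and the `normalize` helper are hypothetical, and the sample markup is made up for the example. The parser takes care of attribute-name case and quoting, which a hand-written regex would have to handle explicitly. How server-side `<% ... %>` blocks should be treated is not addressed here.

```python
from collections import Counter
from html.parser import HTMLParser


def normalize(style):
    """Canonicalize a style attribute: lowercase properties, sorted, no extra spaces."""
    decls = []
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            decls.append(f"{prop.strip().lower()}:{' '.join(value.split())}")
    return ";".join(sorted(decls))


class InlineStyleCollector(HTMLParser):
    """Count how often each distinct inline style appears in a document."""

    def __init__(self):
        super().__init__()
        self.styles = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names and strips the surrounding quotes.
        for name, value in attrs:
            if name == "style" and value:
                self.styles[normalize(value)] += 1


page = """
<div class='cls1' style='font-size:10px;'>test div</div>
<p STYLE="font-size: 10px">same style, different spelling</p>
<span style="color: red; font-size:10px">another style</span>
"""

collector = InlineStyleCollector()
collector.feed(page)
print(collector.styles.most_common())
```

Styles that occur many times are candidates for a shared class; one-off styles can be checked for redundancy against existing rules.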
Old question, but some people may still find this useful. Check out http://necolas.github.com/normalize.css/. It works well!