RegEx to convert Word output to an HTML ordered list
I need a complex regex and I don't know whether it can be written.
I'm trying to clean up some awful HTML output from MS Word. Here's an example of the dandy job it does with an ordered (numbered) list.
<p>
1
Proin Facili Habitasse Hymenaeos Ligula Litora Luctus Mi</p>
<p>
2.
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</p>
<p>
3
Ac Nec Netus Penatibus Purus Cras Mollis</p>
Beautiful, isn't it? Paragraph tags and non-breaking spaces ...
I'm wondering if a regex can be written to replace it with the following:
<ol>
<li>
1.
Proin Facili Habitasse Hymenaeos Ligula Litora Luctus Mi</li>
<li>
2.
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</li>
<li>
3.
Ac Nec Netus Penatibus Purus Cras Mollis</li>
</ol>
The difficulty is that the number of non-breaking spaces can vary from one to several, and the lists can be of different lengths. A complete absence of non-breaking spaces seems to be rare, and it only seems to happen once the list gets longer (for example, when going from 9 to 10 or from 99 to 100).
In any case, if this is possible, it would be amazing. Otherwise, I can search for long runs of non-breaking spaces and then apply list formatting manually, but that is not as fast as doing it automatically.
First: all the standard caveats apply to this question: you should not (and in general cannot) parse or process HTML, valid or not, with a regex. For the many reasons why, I recommend searching the web and/or SO.
That said (and provided your paragraph tags cannot be nested!), you cannot do this in one replacement. First you need to wrap <ol> and </ol> tags around the runs of paragraphs that look like ordered lists. I am assuming a paragraph is an ordered list item when it starts with <p> NUMBER. (a paragraph tag, optional white space, a number, and a full stop).
regex : (?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)
replacement : <ol>$1</ol>
Brief explanation:
// regex
(?s) # enable DOT-ALL matching
( # open group 1
(?: # open non-capturing group
<p>\s*\d+\. # match '<p>', zero or more white space characters, a number and a full stop
(?:(?!</p>).)* # match any character, as long as no '</p>' starts at that position, zero or more times
</p> # match '</p>'
\s* # match zero or more white space characters
) # close non-capturing group
+ # repeat the non-capturing group one or more times
) # close group 1
// replacement
<ol> # insert '<ol>'
$1 # insert what is matched by the regex in group 1
</ol> # insert '</ol>'
Your string will now contain:
<ol><p>1.
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </p>
<p>2.
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </p>
<p>3.
Ac Nec Netus Penatibus Purus Cras Mollis </p></ol>
Then replace all of these paragraphs (including their numbers!) with <li> and </li> tags:
regex : (?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>
replacement : <li>$1</li>
Again, a short explanation:
// regex
(?s) # enable DOT-ALL matching
<p> # match '<p>'
\s* # match zero or more white space characters
\d+ # match one or more digits
\. # match a dot
( # start group 1
(?:(?!</p>).)* # match any character, as long as no '</p>' starts at that position, zero or more times
) # end group 1
</p> # match '</p>'
// replacement
<li> # insert '<li>'
$1 # insert what is matched by the regex in group 1
</li> # insert '</li>'
Your string will now look like this:
<ol><li>
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </li>
<li>
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </li>
<li>
Ac Nec Netus Penatibus Purus Cras Mollis </li></ol>
But again: be very careful. If there is one small mistake in an opening or closing tag, you may well end up with something far worse than what you started with!
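For what it's worth, here is a minimal JavaScript sketch of the two replacements above, chained in one function. The function name wordParagraphsToOrderedList is made up for illustration. JavaScript regex literals do not accept the inline (?s) prefix, so the s (dotAll) flag is used instead. The (?:\s|&nbsp;)* after the number is my own addition, on the assumption that Word's non-breaking spaces arrive as literal &nbsp; entities; drop it if they do not.

```javascript
// Sketch only: assumes <p> tags have no attributes and are never nested.
function wordParagraphsToOrderedList(html) {
  // Step 1: wrap each run of numbered paragraphs in <ol>...</ol>.
  const runs = /((?:<p>\s*\d+\.(?:(?!<\/p>).)*<\/p>\s*)+)/gs;
  const wrapped = html.replace(runs, '<ol>$1</ol>');

  // Step 2: turn each numbered paragraph into a list item, dropping the
  // number and any white space or &nbsp; entities that follow it
  // (the &nbsp; handling is an assumption, not part of the original regex).
  const items = /<p>\s*\d+\.(?:\s|&nbsp;)*((?:(?!<\/p>).)*)<\/p>/gs;
  return wrapped.replace(items, '<li>$1</li>');
}

const sample = '<p>1.&nbsp;&nbsp; One</p>\n<p>2. Two</p>\n<p>Not a list</p>';
console.log(wordParagraphsToOrderedList(sample));
```

Paragraphs that do not start with a number and a full stop are left untouched, so ordinary text between two lists ends one <ol> and the next numbered run starts a new one.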
Not exactly what you are asking for, but HTML output from Microsoft Word has long been considered very poor and many people have tried to clean it up. As a result, there are a large number of HTML cleanup tools, and a quick Google search suggests the HTML Tidy Library Project or others can help you. Don't reinvent the wheel if you don't have to!
No, this is not possible with a regular expression, because HTML is not a regular language.
Instead, take any HTML parser, find the consecutive <p> nodes inside a common parent node whose content starts with sequential numbers, and move them as <li> nodes into a new <ol> node.
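A rough sketch of that idea in JavaScript, using only standard DOM methods. The function name paragraphsToOrderedLists and its two parameters are hypothetical: parent is the element to scan and doc is the Document used to create nodes (in a browser, document). It simplifies in two ways that may not suit you: it only checks that a paragraph starts with a number and a full stop, not that the numbers are sequential, and it copies textContent, so inline markup inside a paragraph is lost.

```javascript
// Sketch only: groups runs of consecutive numbered <p> children into <ol> lists.
function paragraphsToOrderedLists(parent, doc) {
  // \s also matches U+00A0, which is what &nbsp; becomes once parsed.
  const numbered = /^\s*\d+\.\s*/;
  let list = null; // the <ol> currently being filled, if any

  // Copy the child list first, because the loop modifies the parent.
  for (const node of Array.from(parent.children)) {
    const isItem = node.tagName === 'P' && numbered.test(node.textContent);
    if (!isItem) {
      list = null; // any other element ends the current run
      continue;
    }
    if (!list) {
      list = doc.createElement('ol');
      parent.insertBefore(list, node);
    }
    const li = doc.createElement('li');
    li.textContent = node.textContent.replace(numbered, '').trim();
    list.appendChild(li);
    parent.removeChild(node);
  }
}
```

In a browser this could be called as paragraphsToOrderedLists(someDiv, document) after the Word HTML has been loaded into someDiv.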
I use this JavaScript, wrapped in a function, to clean up the contents of a .doc file loaded into a DIV. This is by no means a complete solution. Improvements are welcome.
h = h.replace(/<[/]?(font|st1|shape|path|lock|imagedata|stroke|formulas|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>/gi, '')
h = h.replace(/<([^>]*)style="([^>"]*)"([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style='([^>']*)'([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style=([^> ]*) ([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style=([^>]*)>/gi, '<$1>')
h = h.replace(/<([^>]*)class="([^>"]*)"([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class='([^>']*)'([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class=([^> ]*) ([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class=([^>]*)>/gi, '<$1>')
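One possible way to wrap this in a function, sketched under a few assumptions: the name stripWordMarkup is made up, the eight style/class replacements are folded into a single attribute pattern (a swapped-in variant, not the exact lines above), lang is added to the stripped attributes, and the attribute pass is repeated until nothing changes because each pass removes at most one attribute per tag.

```javascript
// Sketch only: a condensed variant of the replacements above.
function stripWordMarkup(html) {
  // Remove Word-specific and purely presentational tags, keeping their contents.
  const junkTags = /<\/?(?:font|st1|shape|path|lock|imagedata|stroke|formulas|span|xml|del|ins|[ovwxp]:\w+)\b[^>]*?>/gi;
  // Match one class/style/lang attribute (double-quoted, single-quoted or bare).
  const junkAttr = /<([^>]*?)\s+(?:class|style|lang)=(?:"[^"]*"|'[^']*'|[^\s>]+)([^>]*)>/gi;

  let out = html.replace(junkTags, '');
  let prev;
  do {
    prev = out;
    out = out.replace(junkAttr, '<$1$2>'); // drops one attribute per tag per pass
  } while (out !== prev);
  return out;
}

const dirty = '<p class=MsoNormal style=\'margin:0\'><span lang="EN">Hi</span><o:p></o:p></p>';
console.log(stripWordMarkup(dirty)); // <p>Hi</p>
```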
I also found this VB solution on Tim Mackey's blog:
Private Function CleanHtml(ByVal html As String) As String
html = Regex.Replace(html, "<[/]?(font|link|m|a|st1|meta|object|style|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase)
' Each pass strips at most one matching attribute per tag, so the same pattern is applied twice.
html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
html = customClean(html, "<!--[if", "<![endif]-->")
html = customClean(html, "<!-- /*", "-->")
Return html
End Function
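Private Function customClean(ByVal html As String, ByVal begStr As String, ByVal endStr As String) As String
Dim i As Integer = html.IndexOf(begStr)
While i >= 0
' Look for the end marker only after the start marker
Dim j As Integer = html.IndexOf(endStr, i + begStr.Length)
' Stop on an unterminated block instead of throwing or looping forever
If j < 0 Then Exit While
html = html.Remove(i, (j - i) + endStr.Length)
i = html.IndexOf(begStr, i)
End While
Return html
End Function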
Hope it helps.