Regex to convert Word output to an HTML ordered list

I need a complex regex and I don't know if it can be written.

I'm trying to clean up some awful HTML output from MS Word. Here's a dandy example of what it does with an ordered (numbered) list.

<p>

1. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


Proin Facili Habitasse Hymenaeos Ligula Litora Luctus Mi</p>

<p>

2. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</p>

<p>

3. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


Ac Nec Netus Penatibus Purus Cras Mollis</p>

Beautiful, isn't it? Paragraph tags and non-breaking spaces ...

I'm wondering if a regex can be written to replace it with the following:

<ol>


<li>

1. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


Proin Facili Habitasse Hymenaeos Ligula Litora Luctus Mi</li>

<li>

2. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</li>

<li>

3. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;


Ac Nec Netus Penatibus Purus Cras Mollis</li>


</ol>

The difficulty is that the number of &nbsp;s can vary from one to several, and the lists can be of different lengths. The absence of &nbsp;s seems to be rare, and it only seems to happen as the list gets longer (for example, when going from 9 to 10 or 99 to 100).

In any case, if this is possible, it would be amazing. Otherwise, I can search for long strings of &nbsp;s and then apply the list formatting manually, but that is not as fast as doing it automatically.

+2




5 answers


First: all the standard answers apply to this question: you (should | can | must) not parse or process HTML (valid or not) with a regex. For the wide variety of reasons not to do this, I recommend searching the web and/or SO.

That said (and assuming your paragraph tags cannot be nested!), you cannot do this in one replacement. First you need to wrap <ol> and </ol> tags around the paragraphs that look like ordered-list items. I am assuming a paragraph is an ordered-list item when it starts with <p> NUMBER. (a paragraph tag, optional whitespace, a number, and a full stop).

regex       : (?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)
replacement : <ol>$1</ol>

      

Brief explanation:

// regex
(?s)                # enable DOT-ALL matching
(                   # open group 1
  (?:               #   open non-capturing group
    <p>\s*\d+\.     #     match '<p>', zero or more whitespace characters, a number and a full stop
    (?:(?!</p>).)*  #     [if looking ahead there is no '</p>', only then match any character] zero or more times
    </p>            #     match '</p>'
    \s*             #     match zero or more whitespace characters
  )                 #   close non-capturing group
  +                 #   repeat the non-capturing group one or more times
)                   # close group 1

// replacement
<ol>                # insert '<ol>'
$1                  # insert what is matched by the regex in group 1
</ol>               # insert '</ol>'

      

Your string will now contain:

<ol><p>1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </p>

<p>2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </p>

<p>3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </p></ol>

      



Then replace all of those paragraph tags (including their numbers!) with <li> and </li> tags:

regex       : (?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>
replacement : <li>$1</li>

      

Again, a short explanation:

// regex
(?s)               # enable DOT-ALL matching
<p>                # match '<p>'
\s*                # match zero or more white space characters
\d+                # match one or more digits
\.                 # match a dot
(                  # start group 1
  (?:(?!</p>).)*   #   [if looking ahead there is no '</p>', only then match any character] zero or more times
)                  # end group 1
</p>               # match '</p>'

// replacement
<li>               # insert '<li>'
$1                 # insert what is matched by the regex in group 1
</li>              # insert '</li>'

      

Your string will now look like this:

<ol><li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </li>

<li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </li>

<li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </li></ol>
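For anyone who wants to try the two replacements above outside a regex tester, here is a minimal sketch in Python. The language choice and the helper name word_paragraphs_to_ol are my own assumptions, not part of the answer; in Python the (?s) switch becomes re.DOTALL and $1 becomes \1. It carries the same caveats as the regexes themselves.

```python
import re

# Step 1: a run of consecutive numbered paragraphs (same pattern as above).
NUMBERED_RUN = re.compile(r"((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)", re.DOTALL)
# Step 2: a single numbered paragraph; group 1 is the text after the number.
NUMBERED_P = re.compile(r"<p>\s*\d+\.((?:(?!</p>).)*)</p>", re.DOTALL)


def word_paragraphs_to_ol(html: str) -> str:
    """Apply both replacements in order (hypothetical helper name)."""
    wrapped = NUMBERED_RUN.sub(r"<ol>\1</ol>", html)   # wrap each run in <ol>
    return NUMBERED_P.sub(r"<li>\1</li>", wrapped)     # <p>N. ...</p> -> <li>...</li>


sample = "<p>1.&nbsp;&nbsp;First</p>\n<p>2.&nbsp;Second</p>\n<p>Plain text</p>"
result = word_paragraphs_to_ol(sample)
print(result)
```

The unnumbered paragraph is left alone, and the whitespace that follows the last numbered paragraph ends up just before </ol>.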

      

But again: be very careful. If there is one small mistake in an opening or closing tag, you may well end up with something far worse than what you started with!

+3




Not exactly what you are asking for, but HTML output from Microsoft Word has long been considered very poor and many people have tried to clean it up. As a result, there are a large number of HTML cleanup tools, and a quick Google search suggests the HTML Tidy Library Project or others can help you. Don't reinvent the wheel if you don't have to!



+1




All those &nbsp;s do not matter; you just need the following:

/<p>( *[0-9]+.*?)<\/p>/<li>\1<\/li>/
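A rough Python sketch of this single-substitution idea, with two adaptations that are my assumptions rather than part of the pattern above: \s* replaces the literal spaces so that the newlines Word puts after <p> are accepted, and re.DOTALL lets the item text span several lines. Note that this keeps the number inside the <li> and does not add the surrounding <ol>.

```python
import re

# Adapted pattern (assumption): \s* instead of " *", plus DOTALL so that "."
# also matches the line breaks inside Word's paragraphs.
ITEM = re.compile(r"<p>(\s*[0-9]+.*?)</p>", re.DOTALL)

sample = "<p>\n1 &nbsp;&nbsp;\nFirst item</p><p>2. &nbsp;Second item</p>"
converted = ITEM.sub(r"<li>\1</li>", sample)
print(converted)
```

Because the full stop is not required here, a paragraph numbered "1" without a dot is converted as well.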

      

0




No, this is not possible with a regular expression, because HTML is not a regular language.

Instead, take any HTML parser, find the consecutive <p> nodes inside a common parent node whose content starts with sequential numbers, and place them as <li> nodes in a new <ol> node.
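As a rough illustration of the parser-based approach, here is a sketch that uses only Python's standard html.parser module. It is an event-based rewrite rather than true DOM manipulation, and everything in it is an assumption for illustration: the class and function names are hypothetical, a list is assumed to start at 1 and continue with consecutive numbers, paragraphs are assumed not to be nested, and comments and doctype declarations are simply dropped.

```python
import re
from html.parser import HTMLParser

# Leading item number, optional full stop, then any whitespace or &nbsp; padding.
LEADING_NUMBER = re.compile(r"\s*(\d+)\.?(?:\s|&nbsp;)*")


class NumberedParagraphsToList(HTMLParser):
    """Rebuild the markup, turning runs of numbered <p> nodes into one <ol>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &nbsp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []          # pieces of the rewritten document
        self.para = None       # pieces of the <p> being read, or None
        self.para_open = ""    # original text of that <p ...> start tag
        self.in_list = False   # True while an <ol> is open
        self.last_number = 0   # number of the previous list item

    def _close_list(self):
        if self.in_list:
            self.out.append("</ol>")
        self.in_list, self.last_number = False, 0

    def _emit(self, text):
        if self.para is not None:
            self.para.append(text)      # still inside a <p>
            return
        if text.strip():
            self._close_list()          # real content ends any open list
        self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.para is None:
            self.para, self.para_open = [], self.get_starttag_text()
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_endtag(self, tag):
        if tag != "p" or self.para is None:
            self._emit(f"</{tag}>")
            return
        content, self.para = "".join(self.para), None
        match = LEADING_NUMBER.match(content)
        number = int(match.group(1)) if match else None
        if number == 1:
            self._close_list()          # a "1" always starts a fresh list
        if number is not None and number == self.last_number + 1:
            if not self.in_list:
                self.out.append("<ol>")
                self.in_list = True
            self.out.append(f"<li>{content[match.end():]}</li>")
            self.last_number = number
        else:
            self._close_list()          # not the next item: keep the paragraph
            self.out.append(f"{self.para_open}{content}</p>")

    def result(self):
        # An unclosed trailing <p> is passed through unchanged.
        pending = "" if self.para is None else self.para_open + "".join(self.para)
        self._close_list()
        return "".join(self.out) + pending


def paragraphs_to_ordered_list(html: str) -> str:
    parser = NumberedParagraphsToList()
    parser.feed(html)
    parser.close()
    return parser.result()


sample = "<p>\n1. &nbsp;&nbsp;\nFirst</p>\n<p>\n2. &nbsp;\nSecond</p>\n<p>Plain</p>"
converted = paragraphs_to_ordered_list(sample)
print(converted)
```

Unlike the regex answers, this version also strips the &nbsp; padding and checks that the numbers actually follow one another, so a paragraph that merely begins with a number such as a year is not pulled into a list.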

0




I am using this JavaScript, wrapped in a function, to clean up the content of a .doc file loaded into a DIV. This is by no means a complete solution. Improvements are welcome.

// Strip Word-specific and purely presentational tags (opening and closing).
h = h.replace(/<[/]?(font|st1|shape|path|lock|imagedata|stroke|formulas|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>/gi, '');

// Remove style attributes: double-quoted, single-quoted, unquoted with more
// attributes after it, and unquoted at the end of the tag.
h = h.replace(/<([^>]*)style="([^>"]*)"([^>]*)>/gi, '<$1 $3>');
h = h.replace(/<([^>]*)style='([^>']*)'([^>]*)>/gi, '<$1 $3>');
h = h.replace(/<([^>]*)style=([^> ]*) ([^>]*)>/gi, '<$1 $3>');
h = h.replace(/<([^>]*)style=([^>]*)>/gi, '<$1>');

// Remove class attributes, same four cases.
h = h.replace(/<([^>]*)class="([^>"]*)"([^>]*)>/gi, '<$1 $3>');
h = h.replace(/<([^>]*)class='([^>']*)'([^>]*)>/gi, '<$1 $3>');
h = h.replace(/<([^>]*)class=([^> ]*) ([^>]*)>/gi, '<$1 $3>');
h = h.replace(/<([^>]*)class=([^>]*)>/gi, '<$1>');

      

I also found this VB solution on Tim Mackey's blog:

' Requires: Imports System.Text.RegularExpressions
Private Function CleanHtml(ByVal html As String) As String
    ' Strip Word-specific and presentational tags (opening and closing).
    html = Regex.Replace(html, "<[/]?(font|link|m|a|st1|meta|object|style|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase)
    ' Strip class/lang/style/size/face and namespaced attributes. Each pass removes
    ' at most one attribute per tag, which is presumably why it is run twice.
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    ' Remove conditional comments and embedded style comments.
    html = customClean(html, "<!--[if", "<![endif]-->")
    html = customClean(html, "<!-- /*", "-->")
    Return html
End Function

Private Function customClean(ByVal html As String, ByVal begStr As String, ByVal endStr As String) As String
    Dim i As Integer
    Dim j As Integer
    While html.Contains(begStr)
        i = html.IndexOf(begStr, 0)
        ' Look for the end marker after the start marker, and stop if it is
        ' missing; otherwise Remove would throw on a negative length.
        j = html.IndexOf(endStr, i)
        If j < 0 Then Exit While
        html = html.Remove(i, ((j - i) + endStr.Length))
    End While
    Return html
End Function
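For comparison, the marker-stripping loop in customClean can be sketched in a few lines of Python. The function name remove_between is hypothetical, and two choices here are mine rather than the blog's: the end marker is searched for only after the start marker, and an unterminated block is left untouched instead of raising an error.

```python
def remove_between(html: str, start: str, end: str) -> str:
    """Delete every span from `start` through the next `end`, inclusive."""
    while True:
        i = html.find(start)
        if i < 0:
            return html                      # no more start markers
        j = html.find(end, i + len(start))
        if j < 0:
            return html                      # unterminated block: leave it alone
        html = html[:i] + html[j + len(end):]


sample = "a<!--[if gte mso 9]><xml>x</xml><![endif]-->b<!-- /* Style */ -->c"
cleaned = remove_between(sample, "<!--[if", "<![endif]-->")
cleaned = remove_between(cleaned, "<!-- /*", "-->")
print(cleaned)
```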

      

Hope it helps.

0








