.NET: convert .doc to .htm results in fancy characters
I'm using MS Word automation to save a .doc as .htm. If there are bullet characters in the .doc file, they are saved fine in the .htm, but when I try to read the .htm file into a string (so I can subsequently post it to the database for final storage as a string, not a blob), the bullets are converted to question marks or other characters, depending on the encoding used to load the file into the string.
I use this to read text:
string html = File.ReadAllText(myFileSpec);
I also tried StreamReader but got the same results (maybe File.ReadAllText uses it internally).
I've also tried specifying each encoding through the second overload of File.ReadAllText:
string html = File.ReadAllText(originalFile, Encoding.ASCII);
I've tried every encoding exposed as a static property on the Encoding type.
Any ideas?
On my system (using US-English) Word saves *.htm files in the Windows-1252 code page. If your system uses this code page too, then that is the encoding you should read the file with:
string html = File.ReadAllText(originalFile, Encoding.GetEncoding(1252));
It's also possible that whatever you are using to view the results is what produces the question marks, so be sure to check that too.
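To illustrate why the choice of encoding changes what the bullet turns into, here is a minimal sketch that decodes one byte three ways. It assumes the bullet was stored as the Windows-1252 byte 0x95, which may not be what Word wrote in your file; the class and method names are made up for the example. It is written for current .NET, where code page 1252 has to be registered first.

```csharp
using System;
using System.Text;

public static class BulletDecodingDemo
{
    // Decodes a single byte with the given encoding and returns the resulting text.
    public static string DecodeByte(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value });
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 can be resolved;
        // on .NET Framework the code page is available without this call.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Assumption: the bullet as it would appear in a Windows-1252 file.
        const byte bulletByte = 0x95;

        foreach (var encoding in new[] { Encoding.GetEncoding(1252), Encoding.ASCII, Encoding.UTF8 })
        {
            string text = DecodeByte(bulletByte, encoding);
            // Print the code point rather than the character, so the console's
            // own encoding cannot distort the result.
            Console.WriteLine("{0}: U+{1:X4}", encoding.WebName, (int)text[0]);
        }
    }
}
```

Under that assumption, code page 1252 should yield the bullet (U+2022), ASCII a literal question mark (U+003F), and UTF-8 the replacement character (U+FFFD), which many viewers also draw as a question mark.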
Isn't the problem that Word's conversion from .doc to .html turns the bullet points into question marks (and that it has nothing to do with File.ReadAllText or StreamReader etc.)? I.e. by the time it gets to File.ReadAllText, it's already a question mark.
When I convert a simple list to HTML in Word 2003 I get:
<ul style='margin-top:0cm' type=disc>
<li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
<span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 1</span>
</li>
<li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
<span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 2</span>
</li>
</ul>
It's ugly, but there is nothing in it that could turn into a question mark.
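One way to settle whether the damage is already in the file is to look at its raw bytes before any decoding happens. The following is only a sketch; the class and method names are hypothetical, and it expects the path to the Word-generated .htm file as the first command-line argument.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RawByteInspector
{
    // Returns "offset: 0xNN" for every byte above 0x7F, i.e. every byte whose
    // meaning depends on the encoding used to decode the file.
    public static List<string> FindNonAsciiBytes(byte[] data)
    {
        var hits = new List<string>();
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] > 0x7F)
            {
                hits.Add(string.Format("{0}: 0x{1:X2}", i, data[i]));
            }
        }
        return hits;
    }

    public static void Main(string[] args)
    {
        // args[0]: path to the .htm file that Word produced.
        byte[] data = File.ReadAllBytes(args[0]);
        foreach (string hit in FindNonAsciiBytes(data))
        {
            Console.WriteLine(hit);
        }
    }
}
```

If the positions where bullets should be hold 0x3F (a literal question mark), the character was lost when the file was saved. If they hold bytes above 0x7F, the bullet survived the save and the encoding used for reading is the more likely culprit.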
OK, apparently I lied in my first statement. I thought I had tried every encoding, but I hadn't tried this:
data = File.ReadAllText(tempFile, Encoding.Default);
You would think that the overload of this method where you do NOT specify an encoding would work fine, on the assumption that the default encoding is, well, Encoding.Default. However, it uses Encoding.UTF8 by default. Hope this helps someone else.
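The difference between the two overloads can be reproduced with a small round trip through a temporary file. This is a sketch under assumptions: the byte 0x95 stands in for a Windows-1252 bullet, the names are invented for the example, and code page 1252 is named explicitly instead of Encoding.Default, because Encoding.Default is the system ANSI code page on .NET Framework but UTF-8 on .NET Core and later.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadAllTextDemo
{
    // Writes raw bytes to a temp file and reads them back twice: once with the
    // overload that takes no encoding, once with an explicit encoding.
    public static (string WithoutEncoding, string WithEncoding) ReadBothWays(
        byte[] bytes, Encoding explicitEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, bytes);
            return (File.ReadAllText(path), File.ReadAllText(path, explicitEncoding));
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Assumption: "a", a Windows-1252 bullet, "b", with no byte order mark.
        byte[] bytes = { (byte)'a', 0x95, (byte)'b' };
        var result = ReadBothWays(bytes, Encoding.GetEncoding(1252));

        // Print code points so the console encoding cannot hide the difference.
        Console.WriteLine("no encoding:  U+{0:X4}", (int)result.WithoutEncoding[1]);
        Console.WriteLine("code page 1252: U+{0:X4}", (int)result.WithEncoding[1]);
    }
}
```

With no byte order mark in the file, the first read falls back to UTF-8 and cannot decode the byte, while the second read maps it to the bullet.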