.NET: convert .doc to .htm results in fancy characters

Question

I used MS Word automation to save a .doc as a .htm. If there are bullet characters in the .doc file, they are saved fine to the .htm, but when I try to read the .htm file into a string (so I can subsequently send it to the database for final storage as a string, not a blob), the bullets are converted to question marks or other characters, depending on the encoding used to load the file into the string.

I use this to read text:

string html = File.ReadAllText(myFileSpec);

I also tried StreamReader but got the same results (perhaps File.ReadAllText uses it internally).

I've also tried specifying each encoding in the second parameter of the File.ReadAllText overload:

string html = File.ReadAllText(originalFile, Encoding.ASCII);

I've tried all the encodings available on the Encoding type.

Any ideas?

c# encoding

Todd price 07 nov. '08 at 18:45

5 answers

Have you tried opening the file in binary mode? If you open it in text mode, I think it will mangle the Unicode characters.
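One way to act on this suggestion is to look at the raw bytes before any decoding happens. The following is only a rough sketch: the class and method names (RawBytes, NonAsciiBytes, FromFile) are made up for illustration and are not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class RawBytes
{
    // Lists every byte above 0x7F as "offset: 0xNN".
    // These are the bytes whose meaning depends on the encoding.
    public static List<string> NonAsciiBytes(byte[] data)
    {
        var hits = new List<string>();
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] > 0x7F)
            {
                hits.Add($"{i}: 0x{data[i]:X2}");
            }
        }
        return hits;
    }

    // Reads the file in binary mode (no decoding at all) and scans it.
    public static List<string> FromFile(string path)
    {
        return NonAsciiBytes(File.ReadAllBytes(path));
    }
}
```

Calling RawBytes.FromFile(myFileSpec) and printing the entries shows which byte values Word actually wrote for the bullets, which narrows down the encoding to try.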

osp70 07 nov. '08 at 18:51

Isn't the problem that Word's conversion from .doc to .html turns the bullet points into question marks (and it has nothing to do with File.ReadAllText or StreamReader etc.)?

That is, by the time File.ReadAllText gets to it, it's already a question mark.

When I convert a simple bulleted list to HTML in Word 2003, I get:

 <ul style='margin-top:0cm' type=disc> 
     <li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
       <span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 1</span>
     </li> 
     <li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
       <span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 2</span>
     </li> 
 </ul>

It's ugly, but there is nothing in it that could turn into a question mark.

inspite 07 nov. '08 at 19:54

What do these characters look like in the HTML file? What is the encoding declaration of this file (in the "Content-Type" meta tag)? Ideally, these characters should be converted to entities or UTF-8 characters.
Answering these questions can lead you to a solution ... :-)
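To answer the second question programmatically, a small sketch like the one below could pull the declared charset out of the file's meta tag. The names CharsetSniffer and DeclaredCharset are hypothetical, and the 4096-byte window and the regular expression are assumptions about where and how the declaration appears.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Returns the charset named in the first "charset=..." declaration
    // found near the top of the file, or "" if none is found.
    public static string DeclaredCharset(byte[] data)
    {
        // Only the start of the file is scanned (assumed to hold the <head>).
        int length = Math.Min(data.Length, 4096);

        // ISO-8859-1 maps every byte to one character, so decoding cannot fail
        // and the ASCII text of the meta tag survives intact.
        string head = Encoding.GetEncoding("iso-8859-1").GetString(data, 0, length);

        Match m = Regex.Match(
            head,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);

        return m.Success ? m.Groups[1].Value : "";
    }
}
```

Passing File.ReadAllBytes(myFileSpec) to DeclaredCharset returns a name such as "windows-1252" when the file declares one. Whether that name can be handed to Encoding.GetEncoding depends on the runtime's supported encodings.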

PhiLho 07 nov. '08 at 19:59

OK, apparently I lied in my first statement. I thought I tried every encoding, but I haven't tried this:

data = File.ReadAllText(tempFile, Encoding.Default);

You would think that the overload of this method that does NOT take an encoding would work fine, expecting the default encoding to be, well, Encoding.Default. However, it uses Encoding.UTF8 by default. Hope this helps someone else.

Todd price 07 nov. '08 at 21:06

Jeffrey l whitledge · Accepted Answer · 2008-11-07T20:37:34+0000

On my system (using US-English), Word saves *.htm files in the Windows-1252 code page. If your system uses this code page too, then that is what you should read the file as:

string html = File.ReadAllText(originalFile, Encoding.GetEncoding(1252));

It's also possible that whatever you are using to view the results is creating the question marks for you, so be sure to check that too.
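To illustrate why the choice of encoding matters here, the sketch below decodes the same single byte three ways. It assumes the bullet was written as byte 0x95, which is the bullet character in Windows-1252. The names BulletDecoding, Windows1252 and Decode are made up for this example. It also assumes a modern runtime (.NET Core 3.0 or later), where code page 1252 has to be registered first; on the .NET Framework, CodePagesEncodingProvider needs the System.Text.Encoding.CodePages package, or the registration line can simply be dropped.

```csharp
using System;
using System.Text;

static class BulletDecoding
{
    // Returns the Windows-1252 encoding.
    public static Encoding Windows1252()
    {
        // Needed on .NET Core / .NET 5+, where legacy code pages are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Decodes raw bytes with the given encoding.
    public static string Decode(byte[] data, Encoding encoding)
    {
        return encoding.GetString(data);
    }
}

// Example (assuming the bullet is stored as byte 0x95):
//   byte[] bullet = { 0x95 };
//   BulletDecoding.Decode(bullet, BulletDecoding.Windows1252())  // U+2022, the bullet
//   BulletDecoding.Decode(bullet, Encoding.ASCII)                // "?"
//   BulletDecoding.Decode(bullet, Encoding.UTF8)                 // U+FFFD, the replacement character
```

The same bytes come out as a bullet, a question mark, or a replacement character depending only on the decoder, which matches the symptoms described in the question.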
