Detecting encoding conversion problems

Most of the content on my company's website starts out as a Word document (encoded in Windows-1252) and is eventually copied and pasted into our UTF-8 encoded content management system. The conversion usually chokes on a few characters (special break characters, smart quotes, scientific notation) that need to be cleaned up manually, but of course a few always slip through.

What do you think is the best way to spot them?

0




3 answers


How exactly do you do the conversion?

I've seen this copy-from-Word problem many times, and it should be easy to fix.

The typical characters you mention are in the range 0x80-0x9F, which is where the Windows-1252 code page differs from the ISO-8859-1 code page. This range is undefined in ISO-8859-1.
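As a rough illustration of how that range can be used to spot damaged text after the fact: when Windows-1252 bytes 0x80-0x9F are decoded as ISO-8859-1, most decoders (Python's `latin-1` codec among them) turn them into the invisible control characters U+0080 to U+009F, which almost never belong in web copy. The sketch below scans for those, plus U+FFFD, the replacement character some converters emit. The function name and the choice to include U+FFFD are assumptions for illustration, not something from the question.

```python
import re

# U+0080-U+009F: what Windows-1252 bytes 0x80-0x9F become when they are
# decoded as ISO-8859-1. U+FFFD is included on the assumption that the
# converter may emit the Unicode replacement character instead.
SUSPECT = re.compile("[\u0080-\u009f\ufffd]")


def find_suspect_characters(text):
    """Return (offset, code point) pairs for characters that hint at a bad conversion."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in SUSPECT.finditer(text)]


# Simulate the mistake: smart quotes saved as Windows-1252, read as ISO-8859-1.
damaged = "\u201cquoted\u201d".encode("cp1252").decode("latin-1")
print(find_suspect_characters(damaged))
```

A check like this only catches characters that were mis-mapped. Characters that the conversion dropped outright leave nothing behind to detect.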



You are probably doing the conversion from ISO-8859-1 (or possibly ISO-8859-15) instead of from Windows-1252, which causes it to drop the characters in that range.

You should either adjust the source encoding of your conversion or, if that is not possible (I'm not familiar with C#, but I doubt that is the case), use a code page chart to fix those 32 characters separately from the main conversion.
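Both options can be sketched in a few lines. The example is in Python rather than C#, purely for illustration, and the helper name `repair_c1_characters` is hypothetical. It assumes the damaged text was produced by decoding Windows-1252 bytes as ISO-8859-1, so that the affected characters survive as U+0080 to U+009F rather than being dropped.

```python
def repair_c1_characters(text):
    """Re-map U+0080-U+009F to the Windows-1252 characters they were meant to be."""
    repaired = []
    for ch in text:
        if "\u0080" <= ch <= "\u009f":
            try:
                # Recover the original byte, then decode it with the right code page.
                ch = ch.encode("latin-1").decode("cp1252")
            except UnicodeDecodeError:
                # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
                # unassigned, so those positions are kept unchanged.
                pass
        repaired.append(ch)
    return "".join(repaired)


raw = "\u201cquoted\u201d \u2013 done".encode("cp1252")  # bytes as Word would store them

# Option 1: decode with the correct source encoding in the first place.
right = raw.decode("cp1252")

# Option 2: the text was already decoded with the wrong code page, so patch
# the affected range separately.
wrong = raw.decode("latin-1")
assert repair_c1_characters(wrong) == right
```

Option 1 is the simpler fix whenever the raw bytes are still available. Option 2 is only a fallback for text that has already been stored with the wrong mapping.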

+2




Can you save the text as .rtf and then parse it with some other program?



Or can you use Word VBA to save the text normally?

+1




As mentioned, it would be better to export the Word content in a parseable format (either RTF or XML).

There may be a good reason to use copy and paste to add content to your CMS, but with copy and paste you will probably always end up with a round of visual checking and fixing, unless you create a tool that monitors the clipboard.

When you copy from a recent version of Word, the clipboard holds the content in several different formats, one of which is XML-based. You could create something that reads the Word XML from the clipboard and replaces the plain-text version (which is probably what you are pasting into the CMS) with a cleaned-up version.

You can use the Word interop assemblies that come with Office and the standard C# clipboard functions to create one. The tool could run alongside Word, in the background, while you add content to the CMS.

+1








