What makes HTML documents created in Microsoft Word so large in code?

Below is a simple, W3C-valid document that prints "Hello World":

<!DOCTYPE html>
<html>
<head>
<meta charset = "utf-8">
<title>Hello</title>
</head>
Hello World
</html> 

      

But when I do the same with MS Word, the generated code is 449 lines long. Why do all these extra lines appear in the code?

+3




3 answers


Word declares its own XML namespaces:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

      

Word embeds document metadata:

<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>xxxxxx</o:Author>
  <o:LastAuthor>xxxxx</o:LastAuthor>
  <o:Revision>2</o:Revision>
  <o:TotalTime>0</o:TotalTime>
  <o:Created>2015-05-25T11:40:00Z</o:Created>
  <o:LastSaved>2015-05-25T11:40:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Words>1</o:Words>
  <o:Characters>11</o:Characters>
  <o:Company>Sopra Group</o:Company>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:CharactersWithSpaces>11</o:CharactersWithSpaces>
  <o:Version>12.00</o:Version>
 </o:DocumentProperties>
</xml><![endif]-->

      

Word adds CSS style definitions:



<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;
    mso-font-charset:0;
    mso-generic-font-family:roman;
    mso-font-pitch:variable;
    mso-font-signature:-536870145 1107305727 0 0 415 0;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;
    mso-font-charset:0;
    mso-generic-font-family:swiss;
    mso-font-pitch:variable;
    mso-font-signature:-536870145 1073786111 1 0 415 0;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-unhide:no;
    mso-style-qformat:yes;
    mso-style-parent:"";
    margin-top:0cm;
    margin-right:0cm;
    margin-bottom:10.0pt;
    margin-left:0cm;
    line-height:115%;
    mso-pagination:widow-orphan;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-ascii-font-family:Calibri;
    mso-ascii-theme-font:minor-latin;
    mso-fareast-font-family:Calibri;
    mso-fareast-theme-font:minor-latin;
    mso-hansi-font-family:Calibri;
    mso-hansi-theme-font:minor-latin;
    mso-bidi-font-family:"Times New Roman";
    mso-bidi-theme-font:minor-bidi;
    mso-fareast-language:EN-US;}
.MsoChpDefault
    {mso-style-type:export-only;
    mso-default-props:yes; ......

      

Word then applies those CSS classes in the body:

<p class=MsoNormal>Hello World</p>

      

You only need to keep this information if you plan to reopen and modify the document in Word later. If you are doing a simple one-off export, you can remove all of the metadata.
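For the simple-export case, the cleanup can be scripted. Here is a rough sketch in Python, assuming that regular expressions are good enough for Word's fairly regular output (they are not a general HTML parser). The function name `strip_word_markup` and the sample string are made up for this illustration, and it only handles the three kinds of additions shown above:

```python
import re

# Downlevel-hidden conditional comments: <!--[if ...]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE
)
# Namespace declarations such as xmlns:o="..." and the default xmlns="..."
XMLNS_ATTR = re.compile(r'\s+xmlns(?::\w+)?="[^"]*"')
# Whole <style> elements, so the mso-* cleanup never touches body text
STYLE_BLOCK = re.compile(r"(<style[^>]*>)(.*?)(</style>)", re.DOTALL | re.IGNORECASE)
# One mso-* declaration inside a style sheet, e.g. "mso-pagination:widow-orphan;"
MSO_DECLARATION = re.compile(r"\s*mso-[\w-]+\s*:[^;}]*;?")


def _drop_mso_declarations(match):
    """Remove mso-* declarations from the CSS text of one <style> element."""
    opening, css, closing = match.groups()
    return opening + MSO_DECLARATION.sub("", css) + closing


def strip_word_markup(html):
    """Hypothetical helper: drop Office-only additions from Word-generated HTML."""
    html = CONDITIONAL_COMMENT.sub("", html)
    html = XMLNS_ATTR.sub("", html)
    return STYLE_BLOCK.sub(_drop_mso_declarations, html)


# Shortened stand-in for the Word output quoted above
sample = (
    '<html xmlns:o="urn:schemas-microsoft-com:office:office" '
    'xmlns="http://www.w3.org/TR/REC-html40">'
    "<!--[if gte mso 9]><xml><o:DocumentProperties>"
    "<o:Author>xxxxxx</o:Author></o:DocumentProperties></xml><![endif]-->"
    "<style>p.MsoNormal {mso-style-unhide:no; margin-top:0cm; "
    'mso-style-parent:""; mso-pagination:widow-orphan;}</style>'
    "<p class=MsoNormal>Hello World</p></html>"
)
cleaned = strip_word_markup(sample)
print(len(sample), "->", len(cleaned), "characters")
```

The ordinary CSS (`margin-top:0cm`) and the `MsoNormal` class are left alone, so the page should still render the same way in a browser. After running this, Word can no longer recover the stripped settings from the file.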

+13




As explained in this link, the extra code is there for MS Office's own purposes; among other things, it is meant to make it easier for you to resume editing the document in Word. As I understand it, most of the bloat you see is just markup and document metadata. I will quote the relevant part here for future reference in case the link rots.

[...] It turns out these HTML files were created by Microsoft Word! Due to a series of different web designs and designers over the years, as well as healthy editing by the marketing department, 4 of our clients' web pages had been created or modified using Microsoft Word!

When we scrolled through the HTML file, we saw a lot of extra data that a regular web browser would never interpret. A little research explained this for us. Microsoft allows you to save your document as an HTML file. They also want you to be able to open an HTML file that was created with Microsoft Office and resume editing it as a normal document. Because Microsoft Office has all sorts of functionality that HTML and CSS do not support, Office has to keep certain information inside the HTML file between edits.

Some of the saved data is obvious: when the document was created and by whom, who made which edits and when, the number of paragraphs, etc. Other, less obvious data such as VML, DHTML behaviors, column and page spacing, style information, inline object data, etc. is also stored inside the file. All of this Office-specific data is stored inside the HTML file and wrapped in special conditional comments such as <!--[if gte mso 9]>. This hides the content from other programs that read HTML.
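The hiding effect is easy to check with any standard HTML parser. The following is only a small sketch using Python's built-in `html.parser`; the class name `VisibleTextParser` and the shortened fragment are assumptions made for this illustration. To a generic reader, the whole Office block is one ordinary comment, and only the paragraph text counts as content:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Keeps text nodes and comments apart, the way a generic HTML reader sees them."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.comments = []

    def handle_data(self, data):
        # Text nodes are what a browser would actually render
        if data.strip():
            self.text.append(data.strip())

    def handle_comment(self, data):
        # Everything between <!-- and --> arrives here as opaque text
        self.comments.append(data)


# Shortened stand-in for Word's metadata block followed by the real content
word_fragment = (
    "<!--[if gte mso 9]><xml>"
    "<o:DocumentProperties><o:Author>xxxxxx</o:Author></o:DocumentProperties>"
    "</xml><![endif]-->"
    "<p class=MsoNormal>Hello World</p>"
)

parser = VisibleTextParser()
parser.feed(word_fragment)
parser.close()

print("rendered text:", parser.text)
print("characters hidden in comments:", sum(len(c) for c in parser.comments))
```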

As Adriano Repetti pointed out, there is code to handle older versions of Office.



<!--[if gte mso 9]> ...
<!--[if gte mso 10]> ...

      

These conditions check the MS Office version to decide which markup applies. It might be worth mentioning that editing HTML in Word is not something I would recommend. Ever.
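To make the version checks concrete, here is a minimal sketch of how such a condition could be evaluated. It assumes the number after `mso` is the Office major version (the `<o:Version>12.00</o:Version>` in the other answer would then be version 12) and that a condition without an operator means "equal to". The function `condition_applies` is hypothetical and ignores compound conditions with `!`, `&` or `|`:

```python
import operator
import re

# Comparison keywords that can precede "mso" inside [if ...]
COMPARATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}
CONDITION = re.compile(r"^(?:(lt|lte|gt|gte)\s+)?mso\s+(\d+)$")


def condition_applies(condition, office_version):
    """Return True if e.g. 'gte mso 9' holds for the given Office major version."""
    match = CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    # No keyword means a plain equality test (assumption stated above)
    comparator = COMPARATORS.get(match.group(1), operator.eq)
    return comparator(office_version, int(match.group(2)))


for condition in ("gte mso 9", "gte mso 10"):
    print(condition, "applies to version 12:", condition_applies(condition, 12))
```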

Try NetBeans, it's free and awesome :)
I sound like a car salesman ... *grumbles*

+7




The additional code you see consists of:

  • Font definitions for the fonts used.
  • Document properties (the o:DocumentProperties block), which store information such as the author, dates, word count, etc.
  • Word document and math options, which include things like kerning (the space between letters), the input language, and many other settings, usually related to page layout and content.

Ultimately, all of this affects what you see on the page, so that it looks like your Word document and retains reference information such as word counts.

+1








