Reading docx files, recognizing and saving italicized text

How should I read a .docx file from Python and recognize italic text and store it as a string?

I looked at the python-docx package, but all I can find are functions for writing to a .docx file.

Thanks in advance for any help.



2 answers


My example document is TestDocument.docx.

Note: the word "Italics" is italicized directly, while "Emphasis" uses the "Emphasis" style.

If you install the python-docx module, this is a fairly simple exercise.



>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
...     for run in p.runs:
...         print(run.text, run.italic, run.bold)
... 
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]

      

The Run.italic attribute captures whether the text is directly formatted as italic, but it does not tell you whether the run has a style that renders as italic. You can detect that case by checking Run.style.name, provided you know which styles in your document appear in italic.

>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]
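
To store the result as a string, as the question asks, the two checks above can be combined. The following is only a sketch: the helper names effective_italic and italic_text are made up for illustration, and the fallback order (direct run formatting, then the run's character style, then the paragraph style, each following base_style) is a simplifying assumption, not Word's full formatting-resolution rules. It relies on the tri-state font.italic and base_style attributes that python-docx exposes on style objects, so you do not need to hard-code style names such as 'Emphasis'.

```python
def effective_italic(run, paragraph):
    """Best-effort guess at whether `run` renders in italic (assumed rules)."""
    # Direct formatting on the run wins whenever it is set (True or False).
    if run.italic is not None:
        return run.italic
    # Otherwise check the run's character style, then the paragraph style,
    # walking up each style's base_style chain until a value is found.
    for style in (run.style, paragraph.style):
        while style is not None:
            if style.font.italic is not None:
                return style.font.italic
            style = style.base_style
    return False


def italic_text(document):
    """Join the text of all italic runs, one line per paragraph."""
    lines = []
    for paragraph in document.paragraphs:
        text = "".join(run.text for run in paragraph.runs
                       if effective_italic(run, paragraph))
        if text:
            lines.append(text)
    return "\n".join(lines)


# Usage (requires python-docx and the example file):
#   from docx import Document
#   italic = italic_text(Document('TestDocument.docx'))
```

Runs are concatenated without a separator, so a word that Word split into several runs (like 'Emp' and 'hasis' above) comes back in one piece.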

      



Alternatively, you can unzip the .docx file (it is an ordinary ZIP archive), which produces a directory called word. Inside that directory is document.xml; examine its XML structure and element names so that you read only the italic text. Once that is done, all that remains is to output the text from the XML file as a string.
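
A minimal sketch of this approach using only the standard library follows. It reads word/document.xml straight from the archive instead of extracting it to disk. The function name italic_text_from_docx is hypothetical, and the sketch only looks for the direct-formatting element <w:i/> inside a run's <w:rPr>; italics that come from a style (such as "Emphasis") are defined in word/styles.xml and are not detected here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def italic_text_from_docx(path):
    """Return directly italicized text, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(f"{{{W}}}p"):
        parts = []
        for run in paragraph.iter(f"{{{W}}}r"):
            italic = run.find("w:rPr/w:i", NS)
            if italic is None:
                continue
            # <w:i/> switches italic on; an explicit w:val of
            # "false", "0" or "off" switches it off again.
            if italic.get(f"{{{W}}}val", "true") in ("false", "0", "off"):
                continue
            parts.append("".join(t.text or "" for t in run.iter(f"{{{W}}}t")))
        if parts:
            lines.append("".join(parts))
    return "\n".join(lines)


# Usage: italic = italic_text_from_docx('TestDocument.docx')
```

Because paragraphs can be nested (for example inside text boxes), text from such a document may be reported more than once; for simple documents this is unlikely to matter.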


