Reading docx files, recognizing and saving italicized text
This is what my example document looks like TestDocument.docx
.
Note. The word "italic" is in italics, but "Accent" uses the "Accent" style.
If you install the module python-docx
. This is a fairly simple exercise.
>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
... for run in p.runs:
... print run.text, run.italic, run.bold
...
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]
The attribute Run.italic
captures if the text is formatted as italic, but it doesn't know if the text block has a style that appears in italic, but it can be detected by setting Run.style.name (if you know what styles in your document appear in italic.
>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]
source to share
Your best bet is to unroll docx, which will create a directory called word. Inside that directory there is document.xml, from there you will need to examine the xml structure and keywords to read only italic text. once you're done, all you need to do is output the text string from the xml file.
source to share