How can I find a list in docx using python?
I'm trying to split a Word document that looks like this:
1.0 List item
1.1 List item
1.2 List item
2.0 List item
It is stored as a .docx file and I am using python-docx to parse it. Unfortunately, the parsed text loses the numbering at the start of each item. I am trying to determine where each ordered list item begins.
The python-docx library also allows me to access styles, but I cannot figure out how to determine if a style is a list style or not.
So far I've been experimenting with a loop like the following and checking its output:
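    # doc is the python-docx Document object loaded earlier
    for p in doc.paragraphs:
        s = p.style
        # Walk up the style inheritance chain, printing each style name
        while s.base_style is not None:
            print(s.name)
            s = s.base_style
        # Print the root style of the chain
        print(s.name)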
I used this to look for custom styles, but the chain always ended at "Normal" and never at "ListNumber".
I've tried searching for styles on the document, its paragraphs, and their runs, with no luck. I also tried inspecting p.text, but as mentioned earlier, the numbering is not retained there.
List items can be implemented in the XML in various ways. Unfortunately, the most common way of adding list items, using the toolbar (as opposed to using styles), is probably also the most difficult to work with.
Your best bet is to start by using opc-diag to look at the XML that is used inside document.xml and then formulate your strategy from there.
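If opc-diag is not at hand, a rough alternative for getting a first look at the XML is to read the part straight out of the .docx package, which is a ZIP archive. The sketch below uses only the Python standard library; the function name `dump_document_xml` is made up for illustration, and it assumes the main document part lives at `word/document.xml`, which is the usual location but not guaranteed.

```python
import zipfile
import xml.dom.minidom


def dump_document_xml(docx_file):
    """Return the main document part of a .docx as indented XML text.

    docx_file may be a filesystem path or a binary file-like object.
    Assumes the part is stored at word/document.xml (the usual location).
    """
    with zipfile.ZipFile(docx_file) as package:
        raw = package.read("word/document.xml")
    # Re-indent the XML so the paragraph structure is easier to read
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
```

Printing the result for a small test document and searching for the text of a known list item should show which elements surround it.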
The list processing API for python-docx is not yet implemented, so you will need to work at the lxml level if you want to do this with today's version.
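As a minimal sketch of what working at the XML level could look like: paragraphs numbered from the toolbar typically carry a `w:numPr` element inside their paragraph properties (`w:pPr`), so one possible strategy is to look for that element. The example below uses the standard library's ElementTree rather than lxml so that it runs without extra dependencies; the function name `find_list_paragraphs` is hypothetical, and it only catches numbering applied directly to a paragraph, not numbering inherited from a style.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def find_list_paragraphs(document_xml):
    """Return (index, numId, ilvl, text) for each paragraph with w:numPr.

    document_xml is the content of word/document.xml as str or bytes.
    Only direct numbering is detected; numbering defined on a paragraph
    style (in styles.xml) is not resolved by this sketch.
    """
    root = ET.fromstring(document_xml)
    results = []
    for index, p in enumerate(root.iter("{%s}p" % W_NS)):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # not a directly numbered paragraph
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        # Attributes such as w:val are namespace-qualified
        num_id_val = num_id.get("{%s}val" % W_NS) if num_id is not None else None
        ilvl_val = ilvl.get("{%s}val" % W_NS) if ilvl is not None else None
        # Concatenate the text of all runs in the paragraph
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W_NS))
        results.append((index, num_id_val, ilvl_val, text))
    return results
```

Here `numId` identifies which numbering definition the paragraph belongs to and `ilvl` is its indent level, so a change in either can be used as a hint for where a new list or sub-list starts. The visible numbers themselves (such as "1.1") are not stored in the paragraph; they would have to be reconstructed from the numbering definitions.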