How can I find list items in a docx file using Python?

I'm trying to split a Word document that looks like this:

1.0 List item
 1.1 List item
 1.2 List item
2.0 List item

The file is stored as docx and I am using python-docx to parse it. Unfortunately, the parsed text loses the numbering at the beginning of each item. I am trying to determine where each ordered list item starts.

The python-docx library also allows me to access styles, but I cannot figure out how to determine if a style is a list style or not.

So far I've tried walking each paragraph's style chain and checking the output. The loop looks something like this:

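    from docx import Document

    doc = Document("input.docx")  # placeholder path

    # Walk each paragraph's style inheritance chain, printing every style name.
    for p in doc.paragraphs:
        s = p.style
        while s.base_style is not None:
            print(s.name)
            s = s.base_style
        print(s.name)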

I used this to search for custom styles, but the chain ended at "Normal" and not "ListNumber".

I've tried searching for styles at the document, paragraph, and run level with no luck. I also tried looking at p.text, but as mentioned earlier, the numbering is not retained there.


1 answer


List items can be implemented in the XML in various ways. Unfortunately, the most common way of creating them, with the toolbar buttons rather than with styles, is also probably the most difficult to handle.

Your best bet is to start by using opc-diag to look at the XML that is used inside document.xml and then formulate your strategy from there.
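
If opc-diag is not at hand, a rough alternative is to read the part directly, since a docx file is a zip archive. The sketch below uses only the standard library; the function name read_document_xml is made up for illustration.

```python
import zipfile
from xml.dom import minidom


def read_document_xml(docx_file):
    """Return word/document.xml from a .docx as pretty-printed text.

    docx_file may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        raw = package.read("word/document.xml")
    # Word writes the part on a single line; re-indent it for reading.
    return minidom.parseString(raw).toprettyxml(indent="  ")
```

Printing the result of read_document_xml for your own file and searching it for w:numPr should show how the list paragraphs in that particular document are marked up.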

The list processing API for python-docx is not yet implemented, so you will need to work at the lxml level if you want to do this with today's version.
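
As a minimal sketch of what working at the lxml level might look like: a paragraph numbered from the toolbar usually carries a w:numPr element inside its w:pPr, holding a numbering id (w:numId) and an indent level (w:ilvl). The helper name direct_numbering is hypothetical, and treating a missing w:ilvl as level 0 and a w:numId of 0 as "no numbering" are assumptions to check against your own document.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _w(tag):
    """Build a namespace-qualified (Clark notation) WordprocessingML name."""
    return "{%s}%s" % (W_NS, tag)


def direct_numbering(p_element):
    """Return (num_id, level) if a w:p element has direct numbering, else None.

    Works with any element offering ElementTree-style find(), which
    includes lxml elements.
    """
    num_pr = p_element.find("%s/%s" % (_w("pPr"), _w("numPr")))
    if num_pr is None:
        return None
    num_id_el = num_pr.find(_w("numId"))
    if num_id_el is None:
        return None
    num_id = int(num_id_el.get(_w("val")))
    if num_id == 0:
        # Assumption: numId 0 switches numbering off for this paragraph.
        return None
    ilvl_el = num_pr.find(_w("ilvl"))
    # Assumption: a missing w:ilvl means the top level.
    level = int(ilvl_el.get(_w("val"))) if ilvl_el is not None else 0
    return (num_id, level)
```

With python-docx, the underlying element of each paragraph is reachable as p._p (a private attribute, so it may change between versions), which means direct_numbering(p._p) can be called for each p in doc.paragraphs. This only tells you that a paragraph is a list item and at which level; the visible label such as "1.2" is not stored in the paragraph and would have to be reconstructed by counting. Numbering applied through a style is defined in styles.xml instead and will not be found by this check.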
