Adding depth to nodes while iterating over an lxml tree

I want to store each node's depth on the node itself. For this I came up with the following recursive function:

import lxml.html

def add_depth(node, depth=0):
    node.depth = depth
    print(node.tag, node.depth)
    for n in node.iterchildren():
        add_depth(n, depth + 1)

html = """<html>
            <body>
              <div>
                <a></a>
                <h1></h1>
              </div>
            </body>
          </html>"""

tree = lxml.html.fromstring(html)

add_depth(tree)

for x in tree.iter():
    print(x)
    if not hasattr(x, 'depth'):
        print('this should not happen', x)

      

I thought this was one of the cheapest ways to add depth: a single pass gives every element a depth, and each element only needs to be visited once.

The problem is that the depth attribute doesn't seem to stick to the element. Could it be that the elements are built on the fly while iterating over the lxml tree, so an attribute added earlier is lost?

What's going on here, and what is the cheapest way to get all the elements to have depth?

Breakthrough

Using the following:

def add_depth(node, depth=0, maxd=None):
    node.depth = depth
    if maxd is None:
        maxd = []
    maxd.append((node, node.depth))
    for n in node.iterchildren():
        add_depth(n, depth + 1, maxd)
    return maxd

      

Suddenly it works. This code builds one big list of every element paired with its depth (so I can sort it). Now the elements have depth even when I iterate over the original tree. However, this is inefficient, and I don't understand why it makes a difference.

@Maximoo

tree.depth = 0
for x in tree.iter(): 
    if x.getparent() is not None:
        x.depth = x.getparent().depth + 1

AttributeError: 'HtmlElement' object has no attribute 'depth'

      



1 answer


There are a couple of problems here.

  • First, you are relying on your recursive function to update the original tree as a side effect, by setting a Python attribute on each node. I don't think this is possible.

  • Second, you don't want Python attributes here; you need the XML attributes, which you access through x.attrib.
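One likely explanation for what the question observed, as far as I understand lxml's design: the Python objects you get from the tree are lightweight proxies around the underlying C nodes. A proxy is created when an element is accessed and is released once nothing in Python refers to it, so a plain Python attribute such as depth disappears with it. The sketch below is only an illustration under that assumption; the variable names kept_alive and after_release are mine, and the behaviour relies on CPython freeing the proxy as soon as the last reference goes away.

```python
import lxml.html

tree = lxml.html.fromstring("<html><body><div></div></body></html>")

# Case 1: hold a Python reference to the <body> proxy.
body = tree[0]
body.depth = 1
# Fetching the element again returns the same live proxy object.
kept_alive = hasattr(tree[0], "depth")

# Case 2: drop the last reference, then fetch the element again.
del body
# A fresh proxy is built for <body>, without the Python attribute.
after_release = hasattr(tree[0], "depth")

print(kept_alive, after_release)
```

This would also account for the "Breakthrough" version in the question: the maxd list holds a reference to every element, which keeps each proxy (and its depth attribute) alive for as long as the list exists.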



Working code might look as follows (it is a bit awkward, because I keep converting depth between string and int, since XML attribute values cannot be integers). It doesn't use recursion, which I don't think is needed here anyway:

tree.attrib['depth'] = '0'
for x in tree.iter():
    if 'depth' not in x.attrib:
        x.attrib['depth'] = str(int(x.getparent().attrib['depth']) + 1)


print(lxml.html.tostring(tree).decode())

<html depth="0">
            <body depth="1">
              <div depth="2">
                <a depth="3"></a>
                <h1 depth="3"></h1>
              </div>
            </body>
          </html>
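For anyone who prefers the recursive shape of the original function, here is a minimal sketch that stores the depth with the element's set() method and reads it back with get(). It assumes the tree contains only ordinary elements (comments and processing instructions cannot carry attributes), and the add_depth below is my own rewrite rather than the asker's exact function.

```python
import lxml.html


def add_depth(node, depth=0):
    # XML attribute values are strings, so the depth is stored as text.
    node.set("depth", str(depth))
    for child in node.iterchildren():
        add_depth(child, depth + 1)


html = "<html><body><div><a></a><h1></h1></div></body></html>"
tree = lxml.html.fromstring(html)
add_depth(tree)

# Convert back to int when reading, for example to sort by depth.
depths = [(el.tag, int(el.get("depth"))) for el in tree.iter()]
deepest_first = sorted(depths, key=lambda pair: pair[1], reverse=True)
print(depths)
```

Because the value is written into the tree itself, it is still there on a later tree.iter() pass, and it also shows up in the serialized HTML, which may or may not be desirable.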

      
