Add depth to nodes, iterating over lxml tree
I want to add a depth to each node. For this I came up with the following recursive function:
import lxml.html

def add_depth(node, depth=0):
    node.depth = depth
    print(node.tag, node.depth)
    for n in node.iterchildren():
        add_depth(n, depth + 1)
html = """<html>
<body>
<div>
<a></a>
<h1></h1>
</div>
</body>
</html>"""
tree = lxml.html.fromstring(html)
add_depth(tree)
for x in tree.iter():
    print(x)
    if not hasattr(x, 'depth'):
        print('this should not happen', x)
I thought this was one of the cheapest ways to add depth: running it once gives every element a depth, and each element only has to be visited once.
The problem is that the depth somehow doesn't stick to the elements. Could it be that the objects I get when iterating over the lxml tree are constructed on the spot, so the depth I added is lost?
What's going on here, and what is the cheapest way to get all the elements to have depth?
Breakthrough
Using the following:
def add_depth(node, depth=0, maxd=None):
    node.depth = depth
    if maxd is None:
        maxd = []
    maxd.append((node, node.depth))
    for n in node.iterchildren():
        add_depth(n, depth + 1, maxd)
    return maxd
Suddenly it works. This code builds a huge list of all elements with their depth next to them (so I can sort it). Even when I iterate over the original tree afterwards, the elements now have a depth. However, this is inefficient, and I don't understand why it works.
@Maximoo
tree.depth = 0
for x in tree.iter():
    if x.getparent() is not None:
        x.depth = x.getparent().depth + 1
AttributeError: 'HtmlElement' object has no attribute 'depth'
There are a couple of problems here.
- First, you are trying to have your recursive function update the original tree as a side effect. I don't think this works reliably with plain Python attributes.
- Second, you don't want to use Python attributes anyway; you need to use the XML attributes, which you access through x.attrib.
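As far as I understand it, the reason Python attributes get lost is that lxml creates the Python element objects on demand as proxies for the underlying tree and discards them once nothing references them, so an attribute set on one proxy is gone when a fresh proxy is created later. That would also explain why the list-based version in the question works: the list keeps every proxy alive. A minimal sketch of that effect, assuming the same sample document collapsed onto one line (the names elements and tagged are only for illustration):

```python
import lxml.html

html = "<html><body><div><a></a><h1></h1></div></body></html>"
tree = lxml.html.fromstring(html)

# Hold a reference to every element object so that none of them is discarded.
elements = list(tree.iter())

# iter() yields elements in document order, so a parent is handled before its children.
for el in elements:
    parent = el.getparent()
    el.depth = 0 if parent is None else parent.depth + 1

# While `elements` is alive, a second traversal hands back the same objects,
# so the Python attribute is still there.
tagged = [(el.tag, el.depth) for el in tree.iter()]
print(tagged)
```

Dropping the elements list (or letting it go out of scope) brings the original problem back, so this is more an illustration of the cause than a recommended fix.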
A working version of the code might look as follows (it is a bit awkward, because I keep converting depth between string and int, since XML attribute values must be strings). It doesn't use recursion, but I think recursion is overkill here anyway:
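tree.attrib['depth'] = '0'
# iter() walks in document order, so a parent is always tagged before its children
for x in tree.iter():
    if 'depth' not in x.attrib:
        # XML attribute values are strings, so convert to int and back
        x.attrib['depth'] = str(int(x.getparent().attrib['depth']) + 1)
print(lxml.html.tostring(tree).decode())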
<html depth="0">
<body depth="1">
<div depth="2">
<a depth="3"></a>
<h1 depth="3"></h1>
</div>
</body>
</html>
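If modifying the document is not desirable, the depth can also be computed during a single traversal without storing anything on the elements. The following is only a sketch of one alternative, not part of the answer above: it uses lxml.etree.iterwalk with "start" and "end" events and keeps a running counter. The helper name depths_by_walk is made up for this example.

```python
import lxml.etree
import lxml.html

def depths_by_walk(root):
    """Return (element, depth) pairs in document order from one pass."""
    pairs = []
    depth = 0
    for event, el in lxml.etree.iterwalk(root, events=("start", "end")):
        if event == "start":
            # Entering an element: record it at the current depth, then descend.
            pairs.append((el, depth))
            depth += 1
        else:
            # Leaving an element: go back up one level.
            depth -= 1
    return pairs

html = "<html><body><div><a></a><h1></h1></div></body></html>"
tree = lxml.html.fromstring(html)
pairs = depths_by_walk(tree)
for el, depth in pairs:
    print(el.tag, depth)
```

For the sample document this should give the same depths as the attribute-based version. The returned list can be sorted by depth directly, and since it holds references to the elements it does not depend on Python attributes surviving.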