Lxml search tags by regex

Question

Lxml search tags by regex

I am trying to use lxml to get an array of tags formatted as

<TEXT1>TEXT</TEXT1>

<TEXT2>TEXT</TEXT2>

<TEXT3>TEXT</TEXT3>

I tried to use

xml_file.findall("TEXT*")

but this is looking for a letter asterisk.

I also try to use ETXPath but doesn't seem to work. Is there any API function to work with this, since TEXT is supposed to be appended with integers is not the nicest solution.

+3

python xml tags lxml

jkaluzka 15 nov. 14 at 21:44

source to share

2 answers

Here's one idea:

import lxml.etree

doc = lxml.etree.parse('test.xml')
elements = [x for x in doc.xpath('//*') if x.tag.startswith('TEXT')]

+1

JKesMc9tqIQe9M 15 nov. 14 at 22:18

source to share

Robᵩ · Accepted Answer · 2014-11-15T22:27:32+0000

Yes, you can use regular expressions in the lxml xpath .

Here's one example:

results = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

Of course, in the example you mentioned, you don't need a regex. You can use the starts-with()

xpath function :

results = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

Complete program:

from lxml import etree

root = etree.XML('''
    <root>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
      <x-TEXT4>but never four</x-TEXT4>
    </root>''')

result1 = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

result2 = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

assert(result1 == result2)

for result in result1:
    print result.text, result.tag

Consider this XML when addressing the new requirement:

<root>
   <tag>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
   </tag>
   <other_tag>
      <TEXT1>do not want to found one</TEXT1>
      <TEXT2>do not want to found two</TEXT2>
      <TEXT3>do not want to found three</TEXT3>
   </other_tag>
</root>

If you want to find all TEXT

elements that are direct children of an element <tag>

:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or if you want all elements TEXT

that are direct children of only the first element tag

:

result = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if you only want to find the first element of TEXT

each element tag

:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')][1]")
assert(' '.join(e.text for e in result) == 'one')

Resorources:

http://www.w3schools.com/xpath/
http://lxml.de/xpathxslt.html

Lxml search tags by regex

More articles: