Lxml search tags by regex

I am trying to use lxml to get an array of tags formatted as

<TEXT1>TEXT</TEXT1>

<TEXT2>TEXT</TEXT2>

<TEXT3>TEXT</TEXT3>

      

I tried to use

xml_file.findall("TEXT*")

      

but this is looking for a letter asterisk.

I also try to use ETXPath but doesn't seem to work. Is there any API function to work with this, since TEXT is supposed to be appended with integers is not the nicest solution.

+3


source to share


2 answers


Yes, you can use regular expressions in the lxml xpath .

Here's one example:

results = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

      

Of course, in the example you mentioned, you don't need a regex. You can use the starts-with()

xpath function :

results = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

      

Complete program:

from lxml import etree

root = etree.XML('''
    <root>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
      <x-TEXT4>but never four</x-TEXT4>
    </root>''')

result1 = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

result2 = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

assert(result1 == result2)

for result in result1:
    print result.text, result.tag

      


Consider this XML when addressing the new requirement:

<root>
   <tag>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
   </tag>
   <other_tag>
      <TEXT1>do not want to found one</TEXT1>
      <TEXT2>do not want to found two</TEXT2>
      <TEXT3>do not want to found three</TEXT3>
   </other_tag>
</root>

      



If you want to find all TEXT

elements that are direct children of an element <tag>

:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

      

Or if you want all elements TEXT

that are direct children of only the first element tag

:

result = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

      

Or, if you only want to find the first element of TEXT

each element tag

:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')][1]")
assert(' '.join(e.text for e in result) == 'one')

      


Resorources:

+7


source


Here's one idea:



import lxml.etree

doc = lxml.etree.parse('test.xml')
elements = [x for x in doc.xpath('//*') if x.tag.startswith('TEXT')]

      

+1


source







All Articles