Parsing xml containing default namespace to get element value using lxml

I have an xml string like this

str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex> """

      

I want to extract all URLs present inside the <loc> node, i.e. http://www.example.org/sitemap_1.xml.gz

I tried this code, but it didn't work; it returns an empty list:

from lxml import etree
root = etree.fromstring(str1)
urls = root.xpath("//loc/text()")
print(urls)
[]

      

I checked whether the root node is being parsed correctly. Serializing it returned the same XML as str1:

etree.tostring(root)

'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n<sitemap>\n<loc>http://www.example.org/sitemap_1.xml.gz</loc>\n<lastmod>2015-07-01</lastmod>\n</sitemap>\n</sitemapindex>'

      


1 answer


This is a common mistake when working with XML that has a default namespace. Your XML has a default namespace, a namespace declared without a prefix, here:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

      

Note that it is not only the element on which the default namespace is declared that is in that namespace: all descendant elements implicitly inherit it unless otherwise specified (by an explicit namespace prefix, or by a local default namespace declaration pointing to a different namespace URI). This means that in this case all elements, including loc, are in the default namespace.
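One way to see this inheritance is to print the tag of every parsed element. The following is a minimal sketch, assuming Python 3 and a shortened copy of the sitemap from the question; the variable names `xml`, `tags`, `loc` and `q` are only illustrative. lxml reports each tag in `{namespace-uri}localname` form, and `etree.QName` splits such a tag into its two parts.

```python
from lxml import etree

# Shortened copy of the sitemap from the question (illustrative data).
xml = b'''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>http://www.example.org/sitemap_1.xml.gz</loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''
root = etree.fromstring(xml)

# Each tag carries the inherited namespace URI, even though only the
# root element declares it.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)

# QName separates the namespace URI from the local name.
loc = root[0][0]  # first child of <sitemap>, i.e. <loc>
q = etree.QName(loc)
print(q.namespace, q.localname)
```

A plain `loc` in an XPath expression therefore does not match, because the element's full name includes the namespace URI.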

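In XPath 1.0, an unprefixed name such as loc only matches elements in no namespace, which is why //loc returned nothing. To select an element in a namespace, you need to map a prefix to the namespace URI and use that prefix in the XPath: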



from lxml import etree
str1 = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''
root = etree.fromstring(str1)

ns = {"d" : "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.xpath("//d:loc", namespaces=ns)[0]
# encoding="unicode" makes tostring return str instead of bytes on Python 3
print(etree.tostring(url, encoding="unicode"))

      

output:

<loc xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        http://www.example.org/sitemap_1.xml.gz
    </loc>
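To get the URL strings themselves, as the question asks, the same prefix mapping can be combined with `text()`. This is a sketch under the assumption of Python 3; the names `urls`, `urls_alt` and `urls_any` are illustrative, and the two alternatives are shown only for comparison. Because the `<loc>` content is surrounded by newlines and indentation, each value is stripped.

```python
from lxml import etree

str1 = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
    <loc>
        http://www.example.org/sitemap_1.xml.gz
    </loc>
    <lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''
root = etree.fromstring(str1)

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ns = {"d": SITEMAP_NS}

# Prefixed XPath: text() yields the raw text nodes, whitespace included.
urls = [t.strip() for t in root.xpath("//d:loc/text()", namespaces=ns)]
print(urls)  # ['http://www.example.org/sitemap_1.xml.gz']

# Alternative without XPath: iterate using the {uri}localname tag form.
urls_alt = [el.text.strip() for el in root.iter("{%s}loc" % SITEMAP_NS)]

# Alternative that ignores the namespace entirely via local-name().
urls_any = [t.strip() for t in root.xpath("//*[local-name()='loc']/text()")]
```

The `local-name()` variant matches a `loc` element in any namespace, so it is best kept for cases where that looseness is acceptable.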

      
