How can I search for mixed tag names in an ATOM XML document?

Question

I'm working with Google APIs, which can return either JSON or ATOM. ATOM is XML-based, and I want to use BeautifulSoup to parse it.

I have no problem turning the response into a BeautifulSoup object, but I am having a hard time finding elements in it. Take this fragment of the ATOM document as an example:

from bs4 import BeautifulSoup

feed = """
<cse:DataObject type="cse_thumbnail">
        <cse:Attribute name="width" value="160"/>
        <cse:Attribute name="height" value="160"/>
        <cse:Attribute name="src" value="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRAUAShHrU8LK9MLEMEcfg-rtYgLzaxUP-j30lNJJdP1P6FBdVIziH4LTY"/>
</cse:DataObject>
"""

soup = BeautifulSoup(feed)

print(soup.find_all("cse:Attribute", {"value": "160"}))

... it returns an empty list. What am I doing wrong?

python xml atom-feed web-scraping beautifulsoup

B.Mr.W. 23 Apr At 21:45

1 answer

Zero Piraeus · Answer 1 · 2015-04-27T16:47:33+0000

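Your code, as written, parses the XML as if it were HTML: no parser is specified, so BeautifulSoup falls back to an HTML parser. Because HTML tag names are not case sensitive, BeautifulSoup converts all tag names to lowercase. As the BeautifulSoup documentation puts it: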

"Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML."

Searching for the lowercase tag name works fine:

>>> soup.find_all("cse:attribute", {"value":"160"})
[<cse:attribute name="width" value="160"></cse:attribute>, 
 <cse:attribute name="height" value="160"></cse:attribute>]
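
For reference, here is a minimal self-contained sketch of that lowercase search. It assumes the built-in html.parser is named explicitly (the original code names no parser, so BeautifulSoup chooses an HTML parser on its own) and it uses a trimmed copy of the fragment. The variable names matches and names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment from the question.
feed = """
<cse:DataObject type="cse_thumbnail">
        <cse:Attribute name="width" value="160"/>
        <cse:Attribute name="height" value="160"/>
</cse:DataObject>
"""

# Assumption: use the standard-library HTML parser explicitly.
# Like the other HTML parsers, it lowercases tag names.
soup = BeautifulSoup(feed, "html.parser")

# The mixed-case name finds nothing; the lowercase name matches.
mixed_case = soup.find_all("cse:Attribute", {"value": "160"})
matches = soup.find_all("cse:attribute", {"value": "160"})

names = [tag["name"] for tag in matches]
print(len(mixed_case), names)
```

Naming the parser also keeps the result from depending on which parsers happen to be installed.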

As the quoted text mentions, an alternative is to use an XML parser, which preserves case. However, BeautifulSoup with the lxml XML parser drops the namespace prefix from the tag names in this fragment ...

>>> soup = BeautifulSoup(feed, "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<DataObject type="cse_thumbnail">
<Attribute name="width" value="160"/>
<Attribute name="height" value="160"/>
<Attribute name="src" value="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRAUAShHrU8LK9MLEMEcfg-rtYgLzaxUP-j30lNJJdP1P6FBdVIziH4LTY"/>
</DataObject>
>>> soup.find_all("cse:Attribute", {"value":"160"})
[]
>>> soup.find_all("cse:attribute", {"value":"160"})
[]
>>> soup.find_all("Attribute", {"value":"160"})
[<Attribute name="width" value="160"/>,
 <Attribute name="height" value="160"/>]

... which may not be what you want.
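
If you need both the original case and the namespace, one option outside BeautifulSoup is a namespace-aware parser such as the standard library's xml.etree.ElementTree. This is only a sketch under stated assumptions: the fragment above never declares the cse prefix, so the wrapping root element and the URI http://example.com/cse are placeholders made up for illustration. A complete feed should declare the real URI itself, and that is the value to use.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption, not the real one):
# substitute whatever the feed's xmlns:cse declaration says.
CSE_NS = "http://example.com/cse"

# Trimmed copy of the fragment from the question.
feed = """
<cse:DataObject type="cse_thumbnail">
        <cse:Attribute name="width" value="160"/>
        <cse:Attribute name="height" value="160"/>
</cse:DataObject>
"""

# Hypothetical wrapper that binds the cse prefix so the fragment parses;
# a complete feed would not need this step.
wrapped = '<root xmlns:cse="%s">%s</root>' % (CSE_NS, feed)
root = ET.fromstring(wrapped)

# Map the prefix used in the search path to the namespace URI.
ns = {"cse": CSE_NS}

# Tag names are case sensitive here, and the prefix is honoured.
matches = root.findall(".//cse:Attribute[@value='160']", ns)
lowercase = root.findall(".//cse:attribute[@value='160']", ns)

names = [el.get("name") for el in matches]
print(names, len(lowercase))
```

With this approach the search uses the mixed-case, prefixed name as it appears in the document, and the lowercase spelling finds nothing.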
