HTML between h3 / h2 tags with Xpath / BeautifulSoup
I am using Scrapy for a project, and I am getting the following HTML:
<h3><span class="my_class">First title</span></h3>
<ul>
<li>Text for the first title... li #1</li>
</ul>
<ul>
<li>Text for the first title... li #2</li>
</ul>
<h3><span class="my_class">Second title</span></h3>
<ul>
<li>Text for the second title... li #1</li>
</ul>
<ul>
<li>Text for the second title... li #2</li>
</ul>
When I use response.xpath(".//ul/li/text()").extract(), it works and gives me ["Text for the first title... li #1", "Text for the first title... li #2", "Text for the second title... li #1", "Text for the second title... li #2"]
But this is only part of what I want. I want two lists, one for First title and one for Second title, so the result would be:
first_title = ["Text for the first title... li #1", "Text for the first title... li #2"]
second_title = ["Text for the second title... li #1", "Text for the second title... li #2"]
I don't know how to achieve this yet. I am currently using Scrapy to fetch the HTML, so a solution using XPath with plain Python would be perfect for me. That said, I suspect BeautifulSoup would be useful for this kind of task.
Do you have any idea how to do this in Python?
You can use XPath and CSS selectors in Scrapy.
Here's an example solution (in an ipython session; I only changed #1 and #2 in the second block to #3 and #4 to make it more obvious):
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""<h3><span class="my_class">First title</span></h3>
...: <ul>
...: <li>Text for the first title... li #1</li>
...: <li>Text for the first title... li #2</li>
...: </ul>
...: <h3><span class="my_class">Second title</span></h3>
...: <ul>
...: <li>Text for the second title... li #3</li>
...: <li>Text for the second title... li #4</li>
...: </ul>""")
In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']
In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']
In [5]:
Edit, following the OP's question in a comment:
Each <li> tag is wrapped in its own <ul> (...). Is there a way to extend this so that it finds all the ul tags under an h3 tag?
If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count their preceding h3 siblings.
Consider this HTML snippet input:
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
The first <ul><li> line has 1 preceding h3 sibling; the third <ul><li> line has 2 preceding h3 siblings.
So, for each h3, you need to select the following ul siblings that have exactly as many preceding h3 siblings as the number of h3 elements you have seen so far.
Firstly:
following-sibling::ul[count(preceding-sibling::h3)=1]
then
following-sibling::ul[count(preceding-sibling::h3)=2]
etc.
Here's the idea in action, using enumerate() on the h3 selection (remember that XPath positions start at 1, not 0):
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
""")
In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']
The way to do it with Beautiful Soup is as follows. (I've stored the results in a dict of lists rather than in separately named lists, since you may not know ahead of time how many you will have.)
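from bs4 import BeautifulSoup
# html is assumed to hold the page source as a string (not a URL)
soup = BeautifulSoup(html, 'html.parser')
groups = soup.find_all('ul')
results = {}
for group in groups:
    # Key each <ul> by its nearest preceding <h3> so that several lists under one title are merged
    title = group.find_previous_sibling('h3').text
    results.setdefault(title, []).extend(e.text for e in group.find_all('li'))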
If you want to use BeautifulSoup, you can use the findNext method:
h3s = soup.find_all("h3")
for h3 in h3s:
    print(h3.text)
    # findNext returns only the first <ul> after this <h3>
    print(h3.findNext("ul").text)
In this case, BS is a little easier to use as it can find siblings more easily.
With simple XPath, you can do something like this:
h3s = data.xpath('//h3')
for h3 in h3s:
    print(h3.xpath('.//text()'))
    # [0] picks only the first following <ul> sibling
    print(h3.xpath('./following-sibling::ul')[0].xpath('.//text()'))
This is specific to your example above. If you want a more general approach, I would say BS is the right tool because of the methods it offers.
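As a rough illustration of what a more general Beautiful Soup approach might look like, the sketch below walks the siblings after each h3 and stops at the next h3, so it collects every ul under a title rather than only the first. It assumes Python 3, the built-in html.parser backend, and that the h3 and ul elements are siblings; HTML and lists_by_heading are hypothetical names used only for this example.

```python
from bs4 import BeautifulSoup

# Sample input: the HTML from the question.
HTML = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>
"""


def lists_by_heading(html):
    """Return {h3 text: [li texts]} for the ul siblings following each h3."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for h3 in soup.find_all("h3"):
        items = []
        # True matches every tag, so text nodes between elements are skipped.
        for sibling in h3.find_next_siblings(True):
            if sibling.name == "h3":
                break  # the next title starts here
            if sibling.name == "ul":
                items.extend(li.get_text() for li in sibling.find_all("li"))
        results[h3.get_text()] = items
    return results


if __name__ == "__main__":
    for title, items in lists_by_heading(HTML).items():
        print(title, items)
```

If two h3 elements carry the same text, the later one overwrites the earlier one in this sketch; a list of (title, items) pairs would avoid that.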