HTML between h3 / h2 tags with XPath / BeautifulSoup

I am using Scrapy for a project and I am getting the following HTML:

<h3><span class="my_class">First title</span></h3>
<ul>
    <li>Text for the first title... li #1</li>
</ul>
<ul>
    <li>Text for the first title... li #2</li>
</ul>
<h3><span class="my_class">Second title</span></h3>
<ul>
    <li>Text for the second title... li #1</li>
</ul>
<ul>
    <li>Text for the second title... li #2</li>
</ul>

      

When I use response.xpath(".//ul/li/text()").extract(), it works and gives me ["Text for the first title... li #1", "Text for the first title... li #2", "Text for the second title... li #1", "Text for the second title... li #2"]

But this is only part of what I want.

I want two lists, one for First title and one for Second title, so the result would be:

first_title = ["Text for the first title... li #1", "Text for the first title... li #2"]
second_title = ["Text for the second title... li #1", "Text for the second title... li #2"]

      

I still don't know how to achieve this. I am currently using Scrapy to get the HTML; a solution using XPath with plain Python would be perfect for me. But I suspect BeautifulSoup would be useful for this kind of task.

Do you have any idea how to do this in Python?

3 answers


You can use XPath and CSS selectors in Scrapy.

Here's an example solution (in an IPython session; I only changed #1 and #2 in the second block to #3 and #4 to make the grouping more obvious):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""<h3><span class="my_class">First title</span></h3>
   ...: <ul>
   ...:     <li>Text for the first title... li #1</li>
   ...:     <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3><span class="my_class">Second title</span></h3>
   ...: <ul>
   ...:     <li>Text for the second title... li #3</li>
   ...:     <li>Text for the second title... li #4</li>
   ...: </ul>""")

In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

      


Edit, following the OP's question in a comment:

Each <li> tag is wrapped in its own <ul> (...). Is there a way to extend this so that it looks for all the ul tags under the h3 tag?

If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count their preceding h3 siblings.

Consider this HTML snippet input:

<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>

      



The first <ul><li> line has 1 preceding h3 sibling; the third <ul><li> line has 2 preceding h3 siblings.

So, for each h3, you want the following ul siblings that have exactly as many preceding h3 siblings as the number of h3 elements you have seen so far.

Firstly:

following-sibling::ul[count(preceding-sibling::h3)=1]

then

following-sibling::ul[count(preceding-sibling::h3)=2]

etc.

Here's the idea in action, using enumerate() on the h3 selection (remember that XPath positions start at 1, not 0):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
""")

In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

      



The way to do it with Beautiful Soup is as follows. (I've stored the results in a dict of lists rather than in separately named lists, since you may not know ahead of time how many you will have.)



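from bs4 import BeautifulSoup

# html holds the page markup (for example, response.text in Scrapy)
soup = BeautifulSoup(html, 'html.parser')
results = {}
for group in soup.find_all('ul'):
    # Key each list by the nearest preceding h3, so that several
    # consecutive <ul> blocks accumulate under the same title
    title = group.find_previous_sibling('h3').text
    results.setdefault(title, []).extend(e.text for e in group.find_all('li'))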

      



If you want to use BeautifulSoup, you can use the findNext method:

h3s = soup.find_all("h3")
for h3 in h3s:
    print(h3.text)
    print(h3.findNext("ul").text)

      

In this case, BS is a little easier to use as it can find siblings more easily.
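Note that findNext("ul") returns only the first ul after each h3, whereas the question's markup puts every li in its own ul. One possible way to collect every ul up to the next heading is to walk the following siblings and stop at the next h3. This is a rough sketch, assuming bs4 is installed and the h3 and ul elements are siblings; the html string and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>
"""

soup = BeautifulSoup(html, 'html.parser')
results = {}
for h3 in soup.find_all('h3'):
    items = []
    # Visit the following h3/ul siblings in document order.
    for sibling in h3.find_next_siblings(['h3', 'ul']):
        if sibling.name == 'h3':
            break  # reached the next title, so this group is complete
        items.extend(li.get_text() for li in sibling.find_all('li'))
    results[h3.get_text()] = items
```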

With simple XPath, you can do something like this:

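# data is assumed to be an already-parsed document that supports .xpath()
h3s = data.xpath('//h3')
for h3 in h3s:
    print(h3.xpath('.//text()'))
    # Only the first ul after each h3 is used here
    print(h3.xpath('./following-sibling::ul')[0].xpath('.//text()'))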

      

This is tailored to your example above. If you want a more general approach, I would say BS is the right tool because of the methods available.
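If neither library is at hand, the same grouping can also be sketched with only the standard library's html.parser module. This is a swapped-in technique rather than one proposed in the answers, and it assumes the flat h3 / ul / li layout from the question. HeadingGrouper is a hypothetical class name.

```python
from html.parser import HTMLParser

html = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>
"""


class HeadingGrouper(HTMLParser):
    """Hypothetical helper: collect li text under the most recent h3."""

    def __init__(self):
        super().__init__()
        self.groups = {}     # title -> list of li texts
        self._title = None   # text of the most recent h3
        self._tag = None     # 'h3' or 'li' while inside one, else None
        self._buffer = []    # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ('h3', 'li'):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore closing tags such as </span> or </ul>
        text = ''.join(self._buffer).strip()
        if tag == 'h3':
            self._title = text
            self.groups.setdefault(text, [])
        elif self._title is not None:
            self.groups[self._title].append(text)
        self._tag = None


parser = HeadingGrouper()
parser.feed(html)
groups = parser.groups
```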
