Parsing data under a specific tag with BeautifulSoup

Question

I am currently parsing a webpage with this code:

boards = soup(itemprop="name")
prices = soup("span", { "class" : "price-currency" })

for board, price in zip(boards, prices):
    print(board.text.strip(), price.next_sibling)

And it prints the board and the price like this:

SURFBOARD RACK free delivery to your door 120.00
Huge Beginner Surfboard Sale! Kids & Adult Softboards all 1/2 Price!! 90.00
Mega Softboard Clearance Sale! Beginner Foam SurfBoards 1/2 Price! 90.00
Surfboard 6'2" Simon Anderson Spudnick 360.00
Surfboard Cover, Surfboard Bags, Cheap Single Surf Board Bags 50.00

The web page I'm processing is divided into 3 sections: sponsored links, top ads, and recent ads. I am printing data from all three of these sections, but I only want to get data from the last ad section that has this html in it:

<div class="module__body ad-listing">

How can I restrict the boards and prices I print to those inside this section?

Page: https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true

html python-3.x parsing parent-child beautifulsoup

Frank harb May 09 '17 at 7:27

1 answer

Bill bell · Answer 1 · 2017-05-10T14:01:53+0000

You may hate this answer. My inclination is to use the lxml module when I see complex HTML, because I can use xpath expressions.

In this case, the first xpath expression finds the collection of li elements in the HTML that you want. The loop uses two more xpath expressions: one that picks out names such as "Quicksale 6'4 Dylan Surfboard RX5" within an li item, and one that finds the collection of texts for the pricing information within the same item. Item 12 appears to be coded differently; I haven't looked into this.

>>> import requests
>>> from lxml import etree
>>> page = requests.get('https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true').text
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(page, parser=parser)
>>> recents = tree.xpath('.//div[@class="module__body ad-listing"]/ul/li')
>>> for i, recent in enumerate(recents):
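...     try:
...         i, recent.xpath('.//span[@itemprop="name"]/text()')[0].strip()
...     except IndexError:
...         '-------------> item', i, 'failed'
...         continue
...     # Look up the price on the current item. The session that produced the
...     # output below apparently queried the first item each time, so its
...     # prices repeat.
...     one_span = recent.xpath('.//span[@class="j-original-price"]')[0]
...     ' '.join([_.strip() for _ in list(one_span.itertext()) if _.strip()])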
... 
(0, "Quicksale 6'4 Dylan Surfboard RX5")
'$ 450.00 Negotiable'
(1, 'DHD 5\'9 "Switchblade" Surfboard')
'$ 450.00 Negotiable'
(2, '6ft Modern Surfboards - Highline')
'$ 450.00 Negotiable'
(3, "5'11 Channel Island T-Low surfboard")
'$ 450.00 Negotiable'
(4, 'Chill Rare Bird Surfboard 5"8')
'$ 450.00 Negotiable'
(5, 'Vintage surfboard')
'$ 450.00 Negotiable'
(6, "5'7 Annesley Blonde model")
'$ 450.00 Negotiable'
(7, 'McCoy single fin surfboard')
'$ 450.00 Negotiable'
(8, 'Sculpt surfboard')
'$ 450.00 Negotiable'
(9, '8\'1" longboard surfboard travel cover')
'$ 450.00 Negotiable'
(10, 'Longboard Surfboard')
'$ 450.00 Negotiable'
(11, "5'10 Custom Chaos Surfboard")
'$ 450.00 Negotiable'
('-------------> item', 12, 'failed')
(13, "6'0 JS lowdown")
'$ 450.00 Negotiable'
(14, 'Mega Softboard Clearance Sale! Beginner Foam SurfBoards 1/2 Price!')
'$ 450.00 Negotiable'
(15, 'Surfboard')
'$ 450.00 Negotiable'
(16, 'Surfboard 5\'10" 30 lt')
'$ 450.00 Negotiable'
(17, 'Christenson Super Sport Surfboard')
'$ 450.00 Negotiable'
(18, 'TOMO Firewire V4 Surfboard')
'$ 450.00 Negotiable'
(19, "Surfboard 6'6 baked bean")
'$ 450.00 Negotiable'
(20, 'foam surfboards')
'$ 450.00 Negotiable'
(21, 'Channel Islands surfboard')
'$ 450.00 Negotiable'
(22, 'Channel Islands Surfboard')
'$ 450.00 Negotiable'
(23, 'JS surfboard')
'$ 450.00 Negotiable'
(24, 'CLASSIC RETRO SURF FACTORY MINI MAL')
'$ 450.00 Negotiable'
(25, 'Surfboard JS')
'$ 450.00 Negotiable'
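
If you would rather stay with BeautifulSoup, the same idea (find the container first, then search only inside it) can be sketched roughly as follows. This is a minimal sketch, not tested against the live page: SAMPLE_HTML is a made-up, simplified stand-in for the real markup, recent_ads is a hypothetical helper name, and the assumption that the price is the text node right after the price-currency span comes from the code in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page: one sponsored item outside
# the target container and two recent ads inside it.
SAMPLE_HTML = """
<div class="module__body sponsored">
  <ul><li>
    <span itemprop="name">SURFBOARD RACK</span>
    <span class="price-currency">$</span>120.00
  </li></ul>
</div>
<div class="module__body ad-listing">
  <ul>
    <li>
      <span itemprop="name"> Vintage surfboard </span>
      <span class="price-currency">$</span>360.00
    </li>
    <li>
      <span itemprop="name">Surfboard cover</span>
      <span class="price-currency">$</span>50.00
    </li>
  </ul>
</div>
"""


def recent_ads(html):
    """Return (name, price) pairs found inside the ad-listing container only."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector matching a div that carries both classes.
    section = soup.select_one("div.module__body.ad-listing")
    if section is None:
        return []
    results = []
    for item in section.find_all("li"):
        name = item.find(itemprop="name")
        currency = item.find("span", class_="price-currency")
        if name is None or currency is None:
            continue  # skip items that are laid out differently
        # Assumption: the price is the text node right after the currency span.
        price = str(currency.next_sibling or "").strip()
        results.append((name.get_text(strip=True), price))
    return results


for name, price in recent_ads(SAMPLE_HTML):
    print(name, price)
```

On the sample markup this prints only the two ads inside the ad-listing div and leaves out the sponsored item. Searching per li item, rather than zipping two page-wide lists, also keeps each name paired with its own price when an item is missing one of the two.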
