Python CSS Selector BeautifulSoup

I built a very basic scraper for Airbnb listings. The goal is to walk through the listings on a search results page (for example, the one requested in the code below).

import requests
from bs4 import BeautifulSoup

first_page = BeautifulSoup(requests.get("https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1").text, 'html.parser')
listings = first_page.find_all('div', 'listing-card-wrapper')
for listing in listings:
    print(listing.select("#listing-15616363 > div.infoContainer_v72lrv > a > div.ellipsized_1iurgbx > div > span:nth-child(1) > span:nth-child(1)"))

      

The loop correctly iterates over the 18 listings on the page. However, it prints 18 empty lists, which means listing.select is not matching anything. I got the CSS selector from the "Copy selector" option in Chrome DevTools.



2 answers


This happens because listing-15616363 is the id of one specific listing (note the format listing-{listing_id}). Each listing has its own id, so the listings you loop over do not contain an element with id = 'listing-15616363'.

For example, if you want to get the url, you can do something like this:

listing.find('a', class_ = "linkContainer_55zci1")['href']
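To illustrate the point about per-listing ids, here is a minimal sketch on made-up markup. The HTML, the class name "link" and the ids listing-111 and listing-222 are assumptions for the example, not Airbnb's real markup. It suggests that a selector starting with one listing's id can match inside at most one card, while a selector without the id, or an attribute selector on the listing-{listing_id} format, works for every card.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two listing cards, each with its own id.
SAMPLE_HTML = """
<div class="listing-card-wrapper">
  <div id="listing-111"><a class="link" href="/rooms/111"><span>First flat</span></a></div>
</div>
<div class="listing-card-wrapper">
  <div id="listing-222"><a class="link" href="/rooms/222"><span>Second flat</span></a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = soup.find_all("div", "listing-card-wrapper")

# A selector copied for one listing only matches inside that listing's card.
id_specific = [card.select("#listing-111 > a > span") for card in cards]

# A selector without the id is evaluated inside each card and matches in all of them.
relative = [card.select_one("a.link > span").get_text() for card in cards]

# An attribute selector covers every id that follows the listing-{listing_id} format.
ids = [card.select_one('div[id^="listing-"]')["id"] for card in cards]

print(id_specific)
print(relative)
print(ids)
```

On the real page the generated class names would differ and may change over time, so treat the selectors here as placeholders.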

      

Alternatively, you can use lxml, which can be considerably faster than BeautifulSoup (if used correctly), something like this:



import requests
from lxml import html

url = "https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1"

response = requests.get(url)
root = html.fromstring(response.content)
result_list = []

def remove_non_ascii(text) :
    return ''.join([i if ord(i) < 128 else '' for i in text])

currency = root.xpath('//div[@itemprop="offers"]/meta[@itemprop="priceCurrency"]/@content')[0].strip()

for row in root.xpath('//div[contains(@class, "listing-card-wrapper")]'):
    if len(row):  # skip wrappers that have no child elements
        url = row.xpath('.//a[@class="linkContainer_55zci1"]/@href')[0].strip()
        title = row.xpath('.//div[@class="ellipsized_1iurgbx"]/span/text()')[0].strip()
        price = remove_non_ascii(row.xpath('.//div[@class="inline_g86r3e"]/span//text()')[0].strip())

        result_list.append({'url' : "https://www.airbnb.com" + url, 
            'title' : title, 'price' : price, 'currency' : currency})

print(result_list)

      

This will lead to:

[{'url': 'https://www.airbnb.com/rooms/5316912', 'currency': 'INR', 'price': u' 3,823', 'title': 'Small City  apt. next to the Metro'}, {'url': 'https://www.airbnb.com/rooms/16989400', 'currency': 'INR', 'price': u' 2,347', 'title': 'Cozy room close to city center'}, {'url': 'https://www.airbnb.com/rooms/17628374', 'currency': 'INR', 'price': u' 6,774', 'title': 'Cosy, quiet apartment in downtown Copenhagen'}, {'url': 'https://www.airbnb.com/rooms/1206721', 'currency': 'INR', 'price': u' 4,426', 'title': 'Apt.close to Metro, Airport and CHP'}, {'url': 'https://www.airbnb.com/rooms/13813273', 'currency': 'INR', 'price': u' 3,622', 'title': 'Large room in Vesterbro'}, {'url': 'https://www.airbnb.com/rooms/14083881', 'currency': 'INR', 'price': u' 9,322', 'title': 'City Room'}, {'url': 'https://www.airbnb.com/rooms/6221130', 'currency': 'INR', 'price': u' 5,365', 'title': 'cosy flat 2 min from Central Statio'}, {'url': 'https://www.airbnb.com/rooms/15804159', 'currency': 'INR', 'price': u' 3,823', 'title': 'Cozy, central near waterfront. Quality breakfast!'}, {'url': 'https://www.airbnb.com/rooms/17266268', 'currency': 'INR', 'price': u' 3,756', 'title': 'Cosy room in Frederiksberg'}, {'url': 'https://www.airbnb.com/rooms/2647233', 'currency': 'INR', 'price': u' 3,353', 'title': 'Bedroom & Living Room Frederiksberg'}, {'url': 'https://www.airbnb.com/rooms/12083235', 'currency': 'INR', 'price': u' 5,969', 'title': 'Wonderful Copenhagen is right here'}, {'url': 'https://www.airbnb.com/rooms/7787976', 'currency': 'INR', 'price': u' 7,042', 'title': 'Homely renovated flat with garden'}, {'url': 'https://www.airbnb.com/rooms/17556785', 'currency': 'INR', 'price': u' 1,610', 'title': u'Small Cosy home above our Caf\xe9 ( Breakfast incl )'}, {'url': 'https://www.airbnb.com/rooms/894420', 'currency': 'INR', 'price': u' 10,261', 'title': 'Wonderful apt. right in the city!'}, {'url': 'https://www.airbnb.com/rooms/17028460', 'currency': 'INR', 'price': u' 7,847', 'title': 'Nyhavn 3-bed apartment for families'}, {'url': 'https://www.airbnb.com/rooms/17651114', 'currency': 'INR', 'price': u' 6,371', 'title': 'Spacious place by canals in heart of Copenhagen'}, {'url': 'https://www.airbnb.com/rooms/10564051', 'currency': 'INR', 'price': u' 3,420', 'title': u'\u623f\u95f4\u5728\u54e5\u672c\u54c8\u6839\u7684\u5fc3\u810f'}, {'url': 'https://www.airbnb.com/rooms/17709435', 'currency': 'INR', 'price': u' 2,951', 'title': u'Hyggelig lejlighed t\xe6t p\xe5 centrum.'}]

      

You can also refer to the documentation for scraper and lxml for further understanding.
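One detail of the lxml code worth spelling out is why the XPath expressions inside the loop start with ".//". The sketch below uses made-up markup (the HTML and the class name "link" are assumptions, not the real page) to show the difference: an expression beginning with ".//" is evaluated relative to the current row, while one beginning with "//" searches the whole document even when called on a row.

```python
from lxml import html

# Hypothetical, simplified markup standing in for the real search page.
SAMPLE_HTML = """
<div>
  <div class="card listing-card-wrapper"><a class="link" href="/rooms/111">First flat</a></div>
  <div class="card listing-card-wrapper"><a class="link" href="/rooms/222">Second flat</a></div>
</div>
"""

root = html.fromstring(SAMPLE_HTML)
rows = root.xpath('//div[contains(@class, "listing-card-wrapper")]')

# ".//" is relative to the current row, so each row yields only its own link.
relative_hrefs = [row.xpath('.//a[@class="link"]/@href') for row in rows]

# "//" starts from the document root, so every row yields all links on the page.
absolute_hrefs = [row.xpath('//a[@class="link"]/@href') for row in rows]

print(relative_hrefs)
print(absolute_hrefs)
```

Forgetting the leading dot is an easy mistake to make, and it would typically show up as every row returning the first listing's data.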



When web scraping, try to use XPath or specific attributes of an element instead of CSS selectors copied from the browser, because those selectors are often too specific to a single element.

Instead of using CSS selectors, I was able to achieve what you want using the itemprop attribute in the following code:

Code:



from bs4 import BeautifulSoup
import requests

html_source = requests.get("https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1").text
first_page = BeautifulSoup(html_source, 'html.parser')

listings = first_page.find_all('div', {'itemprop':'itemListElement'})

for l in listings:
    a = l.find_next('meta')
    b = a.find_next('meta')
    c = b.find_next('meta')

    print("Name: ", a['content'])
    print("Position: ", b['content'])
    print("URL: ", c['content'])

    print("-"*15)    

      

Output:

Name:  Small City  apt. next to the Metro - Apartment - København
Position:  1
URL:  www.airbnb.com/rooms/5316912
---------------
Name:  Cozy room close to city center - Apartment - Frederiksberg
Position:  2
URL:  www.airbnb.com/rooms/16989400
---------------
Name:  Cosy, quiet apartment in downtown Copenhagen - Apartment - København
Position:  3
URL:  www.airbnb.com/rooms/17628374
---------------
Name:  Apt.close to Metro, Airport and CHP - Apartment - Copenhagen
Position:  4
URL:  www.airbnb.com/rooms/1206721
---------------
Name:  Large room in Vesterbro - Apartment - København
Position:  5
URL:  www.airbnb.com/rooms/13813273
---------------
Name:  City Room - Apartment - København
Position:  6
URL:  www.airbnb.com/rooms/14083881
---------------
Name:  cosy flat 2 min from Central Statio - Apartment - København V
Position:  7
URL:  www.airbnb.com/rooms/6221130
---------------
Name:  Cozy, central near waterfront. Quality breakfast! - Apartment - København
Position:  8
URL:  www.airbnb.com/rooms/15804159
---------------
Name:  Cosy room in Frederiksberg - Apartment - Frederiksberg
Position:  9
URL:  www.airbnb.com/rooms/17266268
---------------
Name:  Bedroom & Living Room Frederiksberg - Apartment - Frederiksberg
Position:  10
URL:  www.airbnb.com/rooms/2647233
---------------
Name:  Wonderful Copenhagen is right here - Apartment - København
Position:  11
URL:  www.airbnb.com/rooms/12083235
---------------
Name:  Homely renovated flat with garden - Apartment - Frederiksberg
Position:  12
URL:  www.airbnb.com/rooms/7787976
---------------
Name:  Small Cosy home above our Café ( Breakfast incl ) - Bed & Breakfast - København
Position:  13
URL:  www.airbnb.com/rooms/17556785
---------------
Name:  Wonderful apt. right in the city! - Apartment - Copenhagen
Position:  14
URL:  www.airbnb.com/rooms/894420
---------------
Name:  Nyhavn 3-bed apartment for families - Apartment - Copenhagen
Position:  15
URL:  www.airbnb.com/rooms/17028460
---------------
Name:  Spacious place by canals in heart of Copenhagen - Apartment - København
Position:  16
URL:  www.airbnb.com/rooms/17651114
---------------
Name:  房间在哥本哈根的心脏 - Apartment - København
Position:  17
URL:  www.airbnb.com/rooms/10564051
---------------
Name:  Hyggelig lejlighed tæt på centrum. - Apartment - København
Position:  18
URL:  www.airbnb.com/rooms/17709435
---------------
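The code above depends on the three meta tags appearing in a fixed order after each listing. As a possible variant, the sketch below looks each value up by its own itemprop instead. The markup is made up, and the itemprop values "name", "position" and "url" are assumptions modelled on schema.org ListItem rather than something confirmed from the real page; meta_content is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the itemprop values "name", "position" and "url" are
# assumptions modelled on schema.org ListItem, not copied from the real page.
SAMPLE_HTML = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Example flat A">
  <meta itemprop="position" content="1">
  <meta itemprop="url" content="www.example.com/rooms/1">
</div>
<div itemprop="itemListElement">
  <meta itemprop="position" content="2">
  <meta itemprop="name" content="Example flat B">
  <meta itemprop="url" content="www.example.com/rooms/2">
</div>
"""

def meta_content(listing, prop):
    """Return the content of the <meta itemprop=prop> inside listing, or None."""
    tag = listing.find("meta", attrs={"itemprop": prop})
    return tag["content"] if tag is not None else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    {prop: meta_content(listing, prop) for prop in ("name", "position", "url")}
    for listing in soup.find_all("div", {"itemprop": "itemListElement"})
]

for record in records:
    print(record)
```

Because each value is found by attribute rather than by position, the second sample listing is read correctly even though its meta tags are in a different order.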

      
