Web scraping with Python 3.6 and BeautifulSoup - getting Invalid URL

I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27

This is my code:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

      

I am getting the following output:

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>

      

I can open the page in my browser on the same computer without getting an error message. And when I use the same code with a different URL, it retrieves the correct HTML content:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

      

I have also tested other URLs (Reddit, Google, e-commerce sites) and didn't run into any issues. So the same code works with one URL but not the other. Where is the problem?


2 answers


This website is blocking requests that do not come from a browser, which is why you are getting the Invalid URL error. Adding a custom User-Agent header to the request works fine.



import requests
from bs4 import BeautifulSoup

# Minimal User-Agent header so the request looks like it comes from a browser
ua = {"User-Agent": "Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
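
If the site still rejects the request, sending a fuller, browser-like User-Agent string and checking the status code can help you see what is going on. This is just a sketch; the exact User-Agent value below is illustrative, and any recent browser string should do:

import requests
from bs4 import BeautifulSoup

# Illustrative browser-like User-Agent string (any recent browser UA should work)
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/58.0.3029.110 Safari/537.36"
    )
}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"

page = requests.get(url, headers=headers)
page.raise_for_status()  # raises an exception if the server still rejects the request

soup = BeautifulSoup(page.text, "lxml")
print(soup.title)  # quick sanity check that we got the real page, not the error page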

      


Change your code like this:

soup = BeautifulSoup(page.text, "lxml")

      



If you use page.content, you get the raw response as bytes and would have to convert that byte string to text yourself, so it is easier to go with page.text, which is already decoded.
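
For what it's worth, the difference between the two attributes in requests is just bytes versus str; here is a minimal sketch (using example.com as a stand-in URL):

import requests

page = requests.get("http://www.example.com")

# page.content is the raw response body as bytes; page.text is that body decoded to str
print(type(page.content))  # <class 'bytes'>
print(type(page.text))     # <class 'str'>

# Decoding the bytes manually gives roughly the same string that page.text returns
decoded = page.content.decode(page.encoding or "utf-8")
print(decoded == page.text)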
