Web scraping with Python 3.6 and BeautifulSoup - getting Invalid URL
I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27
This is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I am getting the following output:
<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>
I can open the page in my browser from the same computer without any error message. When I use the same code with a different URL, the correct HTML content is returned:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I have also tested other URLs (Reddit, Google, e-commerce sites) and didn't run into any issue. So the same code works with one URL but not the other. Where is the problem?
2 answers
This website blocks requests that do not appear to come from a browser, which is why you are getting the Invalid URL error. Adding a browser-like User-Agent header to the request fixes it:
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
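To see why the header matters, you can inspect the User-Agent that requests would send, without making any network call. This is just a sketch: `example.com` is a placeholder URL and is never actually contacted.

```python
import requests

# The default User-Agent advertises python-requests, which some sites
# (including this one) block outright.
print(requests.utils.default_headers()["User-Agent"])

# Preparing (not sending) a request through a Session shows which headers
# would actually go out on the wire; per-request headers override defaults.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request(
        "GET",
        "http://example.com",
        headers={"User-Agent": "Mozilla/5.0"},
    )
)
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```

The server only sees the prepared headers, so overriding User-Agent this way is enough to make the request look like it came from a browser.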