How can I scrape data from websites that don't return plain HTML?
I use requests and BeautifulSoup for Python to scrape HTML from basic websites, but most modern websites don't just deliver HTML as a result. I believe they run JavaScript or something (I'm not very familiar, kind of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, i.e. the cheapest price?
If this were plain HTML, I could just parse the HTML tree and find the text, but the text doesn't show up when you view the page source. If you inspect the element in your browser, you can see the price inside the HTML tags, just as you would in the normal page source of a basic website.
What's going on here, where inspect element shows the HTML but the page source doesn't? And does anyone know how to scrape this kind of data?
Many thanks!
You're spot on - the page markup is added by JavaScript after the initial server response. I haven't used BeautifulSoup, but from its documentation it looks like it doesn't execute JavaScript, so you're out of luck on that front.
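To make that concrete, here is a small sketch of what a parser that doesn't run JavaScript sees. The page below is made up for illustration, and I'm using Python's built-in `html.parser` as a stand-in for BeautifulSoup (the behaviour on this point should be the same, since neither runs scripts):

```python
from html.parser import HTMLParser

# Hypothetical initial server response: the price <div> is empty, and a
# script would fill it in later, once a real browser executes it.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "$389";
  </script>
</body></html>
"""


class PriceDivCollector(HTMLParser):
    """Collect the text found inside <div id="price">."""

    def __init__(self):
        super().__init__()
        self.inside_price = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "price":
            self.inside_price = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_price = False

    def handle_data(self, data):
        if self.inside_price:
            self.text += data


parser = PriceDivCollector()
parser.feed(RAW_HTML)
# The script is only parsed as text, never executed, so the div stays empty.
print(repr(parser.text))
```

The price only exists inside the script source, which is why it shows up in "inspect element" (the live page after the script ran) but not in the parsed page source.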
You could try Selenium, which is basically a virtual browser - people use it for front-end testing. It executes JavaScript, so it should be able to give you what you want.
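A rough sketch of how that could look with Selenium's Python bindings, assuming Selenium 4 and a local Chrome install. The URL and CSS selector in the usage comment are placeholders I made up, not the real Google Flights markup, and `rendered_texts` / `cheapest_price` are just illustrative helper names:

```python
import re


def rendered_texts(url, css_selector, timeout=15):
    """Load `url` in headless Chrome and return the text of matching elements."""
    # Imported lazily so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the page's JavaScript has added at least one matching element.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return [element.text for element in elements]
    finally:
        driver.quit()


PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def cheapest_price(texts):
    """Return the lowest dollar amount found in `texts`, or None if there is none."""
    prices = []
    for text in texts:
        for match in PRICE_RE.findall(text):
            prices.append(float(match.replace(",", "")))
    return min(prices, default=None)


# Hypothetical usage (placeholder URL and selector):
# texts = rendered_texts("https://example.com/flights?from=AAA&to=BBB", ".price")
# print(cheapest_price(texts))
```

The explicit wait matters here: the prices are only added to the page after the scripts have run, so reading the page immediately after `driver.get` may find nothing.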
But if you are specifically looking for information on Google Flights, there is an API for that: https://developers.google.com/qpx-express/v1/
You can use Scrapy, which will let you scrape the page and gives you many other spider features. Scrapy has great integration with Splash, which is a rendering service that you can use to execute JavaScript on a page. Splash can be used standalone, or you can use Scrapy-Splash.
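For the standalone route, something like the sketch below might work. It assumes a Splash instance is already running locally on its default port 8050 and uses Splash's `render.html` HTTP endpoint. The helper names are my own, and the target URL would be whatever page you want rendered:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, wait=2.0, splash_base="http://localhost:8050"):
    """Build the URL for Splash's render.html endpoint."""
    # `wait` is how long (in seconds) Splash lets the page's scripts run.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url, **kwargs):
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    with urlopen(splash_render_url(page_url, **kwargs), timeout=60) as response:
        return response.read().decode("utf-8")


# Hypothetical usage, with Splash running locally:
# html = fetch_rendered_html("https://example.com/", wait=3)
```

The HTML that comes back is the page after its scripts have run, so you can hand it to BeautifulSoup as you would with a basic site.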
Note that Splash essentially starts its own server to execute the JavaScript, so it runs alongside your main script and is called by it. Scrapy manages this through "middleware", i.e. predefined processes that run on every request: in your case, you fetch the page, run the JavaScript in Splash, and parse the results.
This can be a bit more lightweight than hooking into Selenium or the like, especially if all you're trying to do is render the page, rather than render it and then interact with different parts automatically.
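As far as I remember from the Scrapy-Splash README, the middleware is wired up in your project's `settings.py` roughly like this. Treat it as a sketch and check the README for your version. The `SPLASH_URL` value assumes Splash is running locally on port 8050:

```python
# settings.py fragment for a Scrapy project using Scrapy-Splash.

# Where the Splash server is listening (assumed: local instance, default port).
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes the Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

In the spider itself you would then yield a `SplashRequest` (from `scrapy_splash`) instead of a plain `scrapy.Request`, passing something like `args={"wait": 2}` so the page's scripts have time to run before your callback parses the response.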