How to crawl a website whose page navigation loads content dynamically
I want to crawl a site that has multiple pages, and when a page number is clicked the content is loaded dynamically. How do I scrape it?
Since the page links have no URL in their href, how can I crawl the other pages?
It would be great if someone could help me with this.
PS: The URL stays the same when clicking on another page.
If you are using Google Chrome, you can find the URL that is called dynamically under Network -> Headers in the developer tools. Based on that, you can tell whether the request is a GET or a POST.
If it is a GET request, you can read the parameters directly from the URL.
If it is a POST request, you can find the parameters under Form Data in Network -> Headers in the developer tools.
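Once you know the request type, you can replay it from Python. Here is a minimal sketch using only the standard library. The endpoint https://example.com/items and the parameter name page are made up for illustration, so replace them with whatever the Network tab shows for your site.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint and parameter name; copy the real ones from the
# Network tab of the browser's developer tools.
BASE_URL = "https://example.com/items"
PAGE_PARAM = "page"


def build_get_request(page):
    """GET: the page number travels in the query string of the URL."""
    return Request(f"{BASE_URL}?{urlencode({PAGE_PARAM: page})}")


def build_post_request(page):
    """POST: the page number travels in the form-encoded request body."""
    body = urlencode({PAGE_PARAM: page}).encode("ascii")
    return Request(BASE_URL, data=body)  # supplying data makes it a POST


def fetch(request):
    """Send the request and return the decoded response body."""
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

To crawl, call something like fetch(build_get_request(n)) in a loop over the page numbers. Some sites also check headers or cookies from the original request, in which case you would need to copy those over as well.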
You can search for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can get surprisingly far with regular expressions.
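As a rough illustration of the regular expression approach, the sketch below pulls an object literal out of an inline script. The page source and the variable name pageData are invented for the example. It only works when the literal happens to be valid JSON and contains no "};" inside it.

```python
import json
import re

# Hypothetical page source with the data embedded in an inline script.
HTML = """
<html><body>
<script>
  var pageData = {"page": 2, "items": ["alpha", "beta"]};
</script>
</body></html>
"""

# Non-greedy match from the opening brace up to the first "};".
PATTERN = re.compile(r"var\s+pageData\s*=\s*(\{.*?\});", re.DOTALL)


def extract_page_data(html):
    """Return the embedded object as a dict, or None if it is not found."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_page_data(HTML)
```

Here data ends up as a normal Python dict, so data["items"] gives the list of items. For anything more deeply nested, a pattern like this gets fragile quickly.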
Alternatively, some browser testing libraries such as splinter work by loading the page in an actual browser like Firefox or Chrome before scraping it. One of them will work if you run it on a machine with the browser installed.
Since this post was tagged with python and web-crawler, Beautiful Soup should be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html