How to crawl a website whose page navigation is loaded dynamically

I want to crawl a site that has multiple pages, and when a page number is clicked the content is loaded dynamically. How do I scrape it?

Since the URL is not present as an href, how do I crawl the other pages?

It would be great if someone could help me with this.

PS: The url stays the same when clicking on another page.

+3




6 answers


You should also consider Ghost.py, as it allows you to run arbitrary JavaScript commands, fill out forms, and take snapshots very quickly.



+2




If you are using Google Chrome, you can check the URL that is called dynamically under Network -> Headers in the developer tools. Based on that, you can determine whether it is a GET or a POST request.

If it is a GET request, you can find the parameters directly in the URL.

If it is a POST request, you can find the parameters under Form Data in Network -> Headers in the developer tools.
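Once you know the URL and the method, you can rebuild the same request in Python. Here is a minimal sketch using only the standard library. The endpoint `https://example.com/items/list` and the `page` parameter are made up for illustration; substitute whatever the Network tab actually shows for your site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; copy the real one from the Network tab.
BASE_URL = "https://example.com/items/list"


def build_get_request(page):
    """GET case: the parameters travel in the query string of the URL."""
    query = urlencode({"page": page})
    return Request(f"{BASE_URL}?{query}", method="GET")


def build_post_request(page):
    """POST case: the parameters travel in the body as form data."""
    body = urlencode({"page": page}).encode("utf-8")
    return Request(
        BASE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing either object to `urllib.request.urlopen` would send the request. Some sites also check headers or cookies, so you may need to copy those from the developer tools as well.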

+1




You can search for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can do fancy things with regular expressions.
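As a rough illustration of the regular-expression approach, suppose the page embeds its data as a JSON object assigned to a JavaScript variable. The variable name `pageData` and the sample markup below are assumptions made up for this sketch; the real page will look different.

```python
import json
import re

# Made-up sample page; a real page would be downloaded first.
SAMPLE_HTML = """
<html><body>
<script>
var pageData = {"page": 1, "items": ["a", "b"]};
</script>
</body></html>
"""

# Capture the object literal assigned to the (hypothetical) pageData variable.
# The non-greedy match stops at the first "};", so it breaks if the data
# itself contains that sequence.
PAGE_DATA_RE = re.compile(r"var\s+pageData\s*=\s*(\{.*?\});", re.DOTALL)


def extract_page_data(html):
    """Return the embedded data as a dict, or None if it is not found."""
    match = PAGE_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))
```

This only works when the embedded literal happens to be valid JSON. JavaScript objects with unquoted keys or single quotes would need extra cleanup first.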

Alternatively, some of the browser testing libraries like splinter work by loading the page in an actual browser such as Firefox or Chrome before scraping. One of them will work if you run this on a machine with the browser installed.

0




Since this post was tagged with python and web-crawler, Beautiful Soup should be mentioned: http://www.crummy.com/software/BeautifulSoup/

Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html

0




You cannot do this easily, as it is AJAX pagination (even with mechanize). Instead, open the page source and try to find out which URL request is used for the AJAX paging. Then you can issue the same request yourself and process the returned data however you like.
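A minimal sketch of that idea: call the paging endpoint once per page number and stop when nothing comes back. The URL `https://example.com/ajax/list`, the `page` parameter, and the assumption that each response is a JSON list of items are all hypothetical; adjust them to what the site really sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical AJAX endpoint; take the real one from the page source.
AJAX_URL = "https://example.com/ajax/list"


def fetch_page(page):
    """Request one page from the endpoint and decode it as a JSON list."""
    url = f"{AJAX_URL}?{urlencode({'page': page})}"
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl_all_pages(fetch=fetch_page, max_pages=100):
    """Collect items page by page until an empty page or max_pages is hit."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:
            break  # an empty response is taken to mean "no more pages"
        items.extend(batch)
    return items
```

The `fetch` argument is there so you can swap in a different fetcher, for example one that sends a POST or parses an HTML fragment instead of JSON. The `max_pages` cap guards against an endpoint that never returns an empty page.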

0




If you don't mind using gevent, GRobot is another good choice.

0








