How to load all site resources, including AJAX requests etc., in Python?

I know how to request a website and read its text with Python. In the past I tried using a library like BeautifulSoup to collect all the links a site requests, but that misses anything that doesn't appear as a full URL, such as AJAX requests and most requests back to the original domain (the http://example.com part will be missing and, more importantly, the URL isn't in the format <a href='url'>Link</a>, so BeautifulSoup will skip it).

How can I load all site resources in Python? Will this require interoperability with something like Selenium, or is there a way that isn't that hard to implement without it? I haven't used Selenium that much, so I'm not sure how difficult it will be.
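To illustrate the gap described above, here is a minimal stdlib sketch (with a made-up toy page) showing that an href-only scan, which is essentially what a naive BeautifulSoup pass does, never sees URLs that only appear inside scripts:

```python
from html.parser import HTMLParser

# A toy page: one ordinary link plus an AJAX endpoint inside a script
page = """<a href="/page.html">Link</a>
<script>fetch('/api/data.json');</script>"""

class HrefScanner(HTMLParser):
    """Collects only <a href=...> targets, like a naive link scan."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

scanner = HrefScanner()
scanner.feed(page)
print(scanner.links)  # ['/page.html'] -- '/api/data.json' is never seen
```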

Thanks!

+3


3 answers


It all depends on what exactly you want to capture and how. The closest ready-made tool that may work for you is Ghost.py:

from ghost import Ghost

ghost = Ghost()
# open() returns the page plus the list of extra resources it loaded
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status == 200 and 'jeanphix' in ghost.content

You can read more: http://jeanphix.me/Ghost.py/

+2



I would love to hear about other ways to do this, especially if they are more succinct (easier to remember), but I think this achieves my goal. It doesn't fully answer my original question, though - it's not much more than requests.get(url) - but it was enough for me in this case:



import urllib2  # Python 2; on Python 3 use urllib.request instead

url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites block the default urllib UA
request = urllib2.Request(url, None, headers)
sock = urllib2.urlopen(request)
ch = sock.read()  # raw page content
sock.close()
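For reference, a Python 3 equivalent of the snippet above using urllib.request (a sketch; the actual fetch is commented out so the example works offline):

```python
import urllib.request

url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib.request.Request(url, data=None, headers=headers)

# urllib normalizes header names, so the key is looked up as 'User-agent'
print(request.get_header('User-agent'))  # Mozilla/5.0

# Fetching would then be:
# with urllib.request.urlopen(request) as sock:
#     ch = sock.read()
```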

0



Mmm, that's a pretty interesting question. For resources whose URLs cannot be identified statically because they are generated at runtime (such as those used by scripts in general, not just AJAX), you really need to run the website, so that its scripts execute and the dynamic URLs get generated.

One option is to use something like the answer given above, which uses a third-party library like Qt to actually run the website. To collect all the URLs, you need to somehow track every request the website makes, which can be done in a similar way (the referenced example is C++, but the code is essentially the same).

Finally, once you have the URLs, you can use something like Requests to download the external resources.
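As a rough illustration of the URL-collection step, here is a heuristic sketch (with made-up sample markup) that scans raw page text for absolute URLs and quoted root-relative paths, which catches some script-only endpoints that an href scan misses; it is no substitute for actually running the page:

```python
import re

# Made-up sample: one anchor link, plus two resources only visible in a script
html = """<a href="http://example.com/page">Link</a>
<script>fetch('/api/data.json'); var img = "https://cdn.example.com/x.png";</script>"""

# Absolute URLs anywhere in the text, not just inside <a href>
urls = re.findall(r'https?://[^\s"\'<>]+', html)

# Quoted root-relative paths inside scripts (a rough heuristic)
paths = re.findall(r'["\'](/[^"\'\s]*)["\']', html)

print(urls)   # ['http://example.com/page', 'https://cdn.example.com/x.png']
print(paths)  # ['/api/data.json']
```

Found URLs could then be fetched one by one, e.g. with requests.get().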

0

