How to load all site resources, including AJAX requests etc., in Python?

I know how to request a website and read its text with Python. In the past I tried using a library like BeautifulSoup to collect all the links a site requests, but that misses anything that doesn't appear as a full URL, such as AJAX requests and most requests back to the original domain (the http://example.com part will be missing and, more importantly, the URL isn't in the format <a href='url'>Link</a>, so BeautifulSoup will skip it).

How can I load all site resources in Python? Will this require interoperability with something like Selenium, or is there a way that isn't that hard to implement without it? I haven't used Selenium that much, so I'm not sure how difficult it will be.
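To illustrate the gap described above, here is a minimal stdlib sketch (with a made-up toy page) showing that an href-only scan, which is essentially what a naive BeautifulSoup pass does, never sees URLs that only appear inside scripts:

```python
from html.parser import HTMLParser

# A toy page: one ordinary link plus an AJAX endpoint inside a script
page = """<a href="/page.html">Link</a>
<script>fetch('/api/data.json');</script>"""

class HrefScanner(HTMLParser):
    """Collects only <a href=...> targets, like a naive link scan."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

scanner = HrefScanner()
scanner.feed(page)
print(scanner.links)  # ['/page.html'] -- '/api/data.json' is never seen
```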

Thanks!

+3


3 answers


It all depends on what exactly you want to capture and how. The closest ready-made tool that may work for you is Ghost.py:

from ghost import Ghost

ghost = Ghost()
# open() returns the page plus the list of extra resources it loaded
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status == 200 and 'jeanphix' in ghost.content

You can read more: http://jeanphix.me/Ghost.py/

+2



I would love to hear about other ways to do this, especially if they are more succinct (easier to remember), but I think this achieves my goal. It doesn't fully answer my original question, though - it's not much more than requests.get(url) - but it was enough for me in this case:



import urllib2  # Python 2; on Python 3 use urllib.request instead

url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites block the default urllib UA
request = urllib2.Request(url, None, headers)
sock = urllib2.urlopen(request)
ch = sock.read()  # raw page content
sock.close()
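For reference, a Python 3 equivalent of the snippet above using urllib.request (a sketch; the actual fetch is commented out so the example works offline):

```python
import urllib.request

url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib.request.Request(url, data=None, headers=headers)

# urllib normalizes header names, so the key is looked up as 'User-agent'
print(request.get_header('User-agent'))  # Mozilla/5.0

# Fetching would then be:
# with urllib.request.urlopen(request) as sock:
#     ch = sock.read()
```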

0



Mmm, that's a pretty interesting question. For resources whose URLs cannot be identified statically because they are generated at runtime (such as those used by scripts in general, not just AJAX), you really need to run the website, so that its scripts execute and the dynamic URLs get generated.

One option is to use something like the answer given above, which uses a third-party library like Qt to actually run the website. To collect all the URLs, you need to somehow track every request the website makes, which can be done in a similar way (the referenced example is C++, but the code is essentially the same).

Finally, once you have the URLs, you can use something like Requests to download the external resources.
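As a rough illustration of the URL-collection step, here is a heuristic sketch (with made-up sample markup) that scans raw page text for absolute URLs and quoted root-relative paths, which catches some script-only endpoints that an href scan misses; it is no substitute for actually running the page:

```python
import re

# Made-up sample: one anchor link, plus two resources only visible in a script
html = """<a href="http://example.com/page">Link</a>
<script>fetch('/api/data.json'); var img = "https://cdn.example.com/x.png";</script>"""

# Absolute URLs anywhere in the text, not just inside <a href>
urls = re.findall(r'https?://[^\s"\'<>]+', html)

# Quoted root-relative paths inside scripts (a rough heuristic)
paths = re.findall(r'["\'](/[^"\'\s]*)["\']', html)

print(urls)   # ['http://example.com/page', 'https://cdn.example.com/x.png']
print(paths)  # ['/api/data.json']
```

Found URLs could then be fetched one by one, e.g. with requests.get().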

0

