Ruby + save webpage

Saving the HTML version of a web page with Ruby is very easy.

One way to do this is with the rio gem:

require 'rubygems'
require 'rio'
rio('http://www.google.com') > rio('google.html')

Would it be possible to do the same thing by parsing the HTML, making separate requests for the images, JavaScript and CSS it references, and then saving each of those as well?

I suspect that would not be very efficient.

So, is there a way to save a web page plus all of the images, CSS and JavaScript associated with that page, all automatically?
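For reference, the manual approach described above might look roughly like the sketch below (this assumes the nokogiri gem and Ruby's open-uri; the file and directory names are just placeholders):

require 'open-uri'   # provides URI.open (Ruby 2.5+)
require 'nokogiri'
require 'uri'
require 'fileutils'

base = 'http://www.google.com'
html = URI.open(base).read
File.write('page.html', html)

# collect asset references from the parsed document
doc    = Nokogiri::HTML(html)
assets = doc.css('img[src], script[src]').map { |n| n['src'] } +
         doc.css('link[rel="stylesheet"][href]').map { |n| n['href'] }

FileUtils.mkdir_p('assets')
assets.compact.uniq.each do |ref|
  begin
    abs  = URI.join(base, ref).to_s
    name = File.join('assets', File.basename(URI(abs).path))
    File.binwrite(name, URI.open(abs).read)
  rescue StandardError => e
    warn "skipping #{ref}: #{e.message}"
  end
end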



3 answers


How about just shelling out to wget, with -r for recursive retrieval and -l 1 to limit the depth to one level:

system("wget -r -l 1 http://google.com")





In most cases we can use system tools. As dimus said, you can use wget to download the page.



There are also many useful APIs in Ruby's standard library for network tasks, for example net/ftp, net/http and net/https; the Net::HTTP documentation covers them in detail. But these only fetch the response, so we would still need to parse the HTML document ourselves. Better yet, a browser-based library such as Mozilla's is another option.
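For example, fetching a single page with net/http and writing it to disk might look like this minimal sketch (the URL and file name are placeholders):

require 'net/http'
require 'uri'

uri      = URI('http://www.google.com/')
response = Net::HTTP.get_response(uri)

# only save the body if the request succeeded
File.write('google.html', response.body) if response.is_a?(Net::HTTPSuccess)

This still saves only the HTML itself; the linked assets would have to be parsed out and fetched separately, which is why the wget approach above is usually simpler.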



url = "docs.zillabyte.com"
output_dir = "/tmp/crawl"

# -E = adjust extensions (e.g. save HTML pages with a .html suffix)
# -H = span hosts (e.g. include assets from other domains) 
# -p = download all assets associated with the page
# -P = output prefix (a.k.a the directory to dump the assets)
system("wget -E -H -p '#{url}' -P '#{output_dir}'")

# read files from 'output_dir'
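Once wget has finished, the mirrored files can be read back from output_dir, for example along these lines (a sketch; what you do with each file depends on your use case):

Dir.glob(File.join(output_dir, '**', '*')).each do |path|
  next unless File.file?(path)
  # each 'path' is one downloaded asset (HTML, CSS, JS, image, ...)
  puts path
end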

      







