How to download a PDF file in Ruby without .pdf in the link

I need to download a PDF, using Ruby, from a site that does not provide a link ending in .pdf. Manually, when I click the download link, it takes me to a new page, and after a while a dialog box opens to save or open the file.

Please help me download the file.

Link



2 answers


If you just want a simple Ruby script, I would just run wget, like this:

exec 'wget "http://path.to.the.file/and/some/params"'

At that point, though, you might as well run wget directly.
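If the download is only one step in a longer script, note that exec replaces the running Ruby process, so nothing after it runs. Here is a minimal sketch using system instead, with wget's -O option to choose the output file name. The method names wget_command and download_with_wget are made up for illustration, and it assumes wget is installed and on your PATH.

```ruby
# Build the argument list for wget. Passing arguments as separate
# strings avoids shell quoting problems with "?" and "&" in URLs.
# (wget_command is a hypothetical helper name, not part of any library.)
def wget_command(url, output_path)
  ['wget', '-O', output_path, url]
end

# Run wget as a child process. Unlike exec, system returns control
# to the script: true on success, false or nil on failure.
def download_with_wget(url, output_path)
  system(*wget_command(url, output_path))
end
```

Calling download_with_wget("http://path.to.the.file/and/some/params", "thefile.pdf") would then save the response straight to thefile.pdf.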

Another way is to just run a GET on the page where you know the file is:

require 'net/http'
source = Net::HTTP.get("the.website.com", "/and/some/params")

There are several other HTTP clients you could use, but as long as you make a GET request to the endpoint where the PDF resides, it should give you the raw data. Then you can just save that data under a name ending in .pdf (or rename the file wget saved) and you will have a PDF.
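To make the "raw data" part concrete, here is a rough sketch that fetches a URL with Net::HTTP and writes the body to disk only if it starts with the "%PDF" marker that PDF files begin with. The helper names pdf_data? and fetch_pdf are made up for illustration. Note that Net::HTTP.get does not follow redirects, so this assumes the URL answers with the file directly.

```ruby
require 'net/http'
require 'uri'

# PDF files start with the ASCII marker "%PDF", so this is a cheap
# way to tell a real PDF from an HTML error or landing page.
def pdf_data?(bytes)
  bytes.to_s.b.start_with?('%PDF')
end

# Fetch the URL and save the body as a binary file.
# (fetch_pdf is a hypothetical helper; redirects are not followed.)
def fetch_pdf(url, path)
  body = Net::HTTP.get(URI(url))
  raise "response from #{url} does not look like a PDF" unless pdf_data?(body)
  File.binwrite(path, body)
  path
end
```

For example, fetch_pdf("http://the.website.com/and/some/params", "thefile.pdf") would either write thefile.pdf or raise if the server sent back something else.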

In your case, I ran the following commands to get the PDF:



wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf

      

Then open the PDF file. Note that these are Linux commands. If you want to get the file with a Ruby script, you can use something like what I mentioned earlier.

Update:

There is an additional complication that was not originally mentioned, namely that the PDF URL changes every time the PDF is updated. To make this work, you will probably need to do some web scraping. I suggest Nokogiri. With it, you can parse the page where the download link is located, pull out the current URL, and then make a request to it. Also, the server hosting the PDF file is misconfigured and crashes Chrome within seconds of opening the page.

How I worked around this: I went to the site and refreshed it, then stopped the page load (press the X where the refresh button would otherwise be). Then I right-clicked next to the download link and selected Inspect Element, and scanned the DOM for something that uniquely identifies the link (like an id). Luckily I found <strong id="telecharger"> Download</strong>. This means you can use something like page.css('strong#telecharger')[0].parent['href']

This should give you the URL. Then you can make the GET request as described above. I don't have time to write the script for you (too much work), but this should be enough to solve the problem.



Will this do it?

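require 'open-uri'
# "wb" opens the file in binary mode so the PDF bytes are written unmodified.
File.open('my_file_name.pdf', "wb") do |file|
  # URI.open is the open-uri entry point for fetching a URL.
  file.write URI.open('http://someurl.com/2013-1-2/somefile/download').read
end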

      



I do this for my projects and it works.
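Since the link itself has no file name in it, one option is to take the name from the server's Content-Disposition header when it sends one. This is only a sketch: it assumes the server sets that header (the save dialog in the browser suggests it might, but I have not checked), and the helper names filename_from_disposition and download_with_server_name are made up for illustration.

```ruby
require 'open-uri'

# Pull the file name out of a header such as
#   attachment; filename="somefile.pdf"
# and fall back to a default when the header is missing or has no name.
def filename_from_disposition(header, fallback)
  return fallback if header.nil?
  match = header.match(/filename="?([^";]+)"?/i)
  # File.basename drops any directory part the server may have sent.
  match ? File.basename(match[1].strip) : fallback
end

# Download the URL and save it under the server-suggested name.
# open-uri exposes response headers through #meta with lowercase keys.
def download_with_server_name(url, fallback = 'download.pdf')
  URI.open(url) do |remote|
    name = filename_from_disposition(remote.meta['content-disposition'], fallback)
    File.binwrite(name, remote.read)
    name
  end
end
```

If the header is absent, the file is simply saved under the fallback name, which matches the behavior of the snippet above.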
