Python: urlretrieve PDF download
I'm using urllib's urlretrieve() function in Python to try to grab some PDFs from websites. It (for me at least) stopped working and downloads corrupted data (15 KB instead of 164 KB).
I've tested this with multiple PDFs, all without success (e.g. random.pdf). I can't seem to get it to work, and I need to upload a PDF file for a project I'm working on.
Here's an example of the kind of code I'm using to load the PDF (and parse the text using pdftotext.exe):
def get_html(url):  # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2]  # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf':  # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name])  # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url
        return ""
I am a beginner programmer, so any input would be great! Thanks!
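For reference, a real PDF file starts with the magic bytes %PDF, so a quick check of the first bytes can flag a corrupted download before it ever reaches pdftotext. A minimal stdlib-only sketch (the helper name is hypothetical, not part of the script above):

```python
def looks_like_pdf(data):
    """Return True if the byte string starts with the PDF magic number."""
    return data.startswith(b'%PDF')

# A real PDF passes; a 15 KB HTML error page fails immediately:
looks_like_pdf(b'%PDF-1.4 ...')                    # True
looks_like_pdf(b'<html><body>404</body></html>')   # False
```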
I suggest trying requests. It's a really nice library that hides the whole implementation behind a simple API.
>>> import requests
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf")
>>> len(req.content)
167633
>>> req.headers
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}
By the way, the reason you only get a 15 KB download is that your URL is not correct. It should be
http://www.mathworks.com/moler/random.pdf
but you are GETting
http://www.mathworks.com/moler/random.pdf/
>>> import requests
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/")
>>> len(c.content)
14390
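That trailing slash also trips up the question's filename logic, since url.split('/')[-2] only yields the filename when the URL ends with '/'. A sketch of a version that handles both forms (Python 3 stdlib; the function name is my own):

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Extract the last path component of a URL, ignoring any trailing slash."""
    path = urlparse(url).path.rstrip('/')
    return os.path.basename(path)

filename_from_url("http://www.mathworks.com/moler/random.pdf")   # 'random.pdf'
filename_from_url("http://www.mathworks.com/moler/random.pdf/")  # 'random.pdf'
```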
Maybe it's a little late, but you can try this: write the content to a new file and read it with textract. When I skipped that step, I got junk text containing "# $".
import requests
import textract

url = "The url which downloads the file"
response = requests.get(url)
with open('./document.pdf', 'wb') as fw:
    fw.write(response.content)
text = textract.process("./document.pdf")
print('Result: ', text)