Python: urlretrieve PDF download

I'm using the urllib urlretrieve() function in Python to grab some PDFs from websites. It has (for me, at least) stopped working and now downloads corrupted data (15KB instead of 164KB).

I've tested this with multiple PDFs, all without success (e.g. random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for a project I'm working on.

Here's an example of the kind of code I'm using to download the PDF (and parse the text with pdftotext.exe):

def get_html(url): # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2] # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf': # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name]) # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url    
        return ""


I'm a beginner programmer, so any input would be great. Thanks!


3 answers


I suggest trying requests. It's a really nice library that hides the whole implementation behind a simple API.

>>> import requests
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf")
>>> len(req.content)
167633
>>> req.headers
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}


By the way, the reason you only get a 15KB download is that your URL is not correct. It should be



http://www.mathworks.com/moler/random.pdf


But you are GETting

http://www.mathworks.com/moler/random.pdf/

>>> import requests
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/")
>>> len(c.content)
14390
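
If you want to guard against this kind of mistake programmatically, one option (a sketch of mine, not part of the original answer; the helper name is hypothetical) is to check the response's Content-Type header before treating the body as a PDF:

import requests

def fetch_pdf(url):
    # Illustrative helper: fetch url and verify the server actually
    # returned a PDF rather than an HTML error page.
    resp = requests.get(url)
    resp.raise_for_status()  # raise on 4xx/5xx status codes
    content_type = resp.headers.get('content-type', '')
    if 'application/pdf' not in content_type:
        raise ValueError('expected a PDF from %s, got %s' % (url, content_type))
    return resp.content

With the trailing-slash URL above, this would fail loudly instead of silently saving a 15KB HTML page as a .pdf file.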




To write the file to disk:



with open("out.pdf", "wb") as myfile:  # "wb": PDF content is binary data
    myfile.write(req.content)
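
For larger files, a streamed variant (a sketch using requests' stream=True and iter_content options) writes the download to disk in chunks instead of buffering the whole response in memory:

import requests

# Sketch: download in chunks rather than holding the whole PDF in memory.
req = requests.get("http://www.mathworks.com/moler/random.pdf", stream=True)
with open("out.pdf", "wb") as myfile:
    for chunk in req.iter_content(chunk_size=8192):
        myfile.write(chunk)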




Maybe it's a little late, but you can try this: just write the content to a new file and read it back with textract. When I tried it without writing to a file first, I got garbage text containing "# $".

import requests
import textract

url = "The url which downloads the file"  # placeholder for the PDF's URL
response = requests.get(url)
with open('./document.pdf', 'wb') as fw:  # save the raw PDF bytes to disk
    fw.write(response.content)
text = textract.process("./document.pdf")  # extract the text from the saved PDF
print('Result: ', text)

