Python: urlretrieve PDF download

I'm using the urllib urlretrieve() function in Python to grab some PDFs from websites. It has (for me, at least) stopped working and now downloads corrupted data (15KB instead of 164KB).

I've tested this with multiple PDFs, all without success (e.g. random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for a project I'm working on.

Here's an example of the kind of code I'm using to download the PDF (and parse the text with pdftotext.exe):

def get_html(url): # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2] # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf': # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name]) # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url    
        return ""


I'm a beginner programmer, so any input would be great. Thanks!


3 answers


I suggest trying requests. It's a really nice library that hides the whole implementation behind a simple API.

>>> import requests
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf")
>>> len(req.content)
167633
>>> req.headers
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}


By the way, the reason you only get a 15KB download is that your URL is not correct. It should be



http://www.mathworks.com/moler/random.pdf


But you are GETting

http://www.mathworks.com/moler/random.pdf/

>>> import requests
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/")
>>> len(c.content)
14390
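
If you want to guard against this kind of mistake programmatically, one option (a sketch of mine, not part of the original answer; the helper name is hypothetical) is to check the response's Content-Type header before treating the body as a PDF:

import requests

def fetch_pdf(url):
    # Illustrative helper: fetch url and verify the server actually
    # returned a PDF rather than an HTML error page.
    resp = requests.get(url)
    resp.raise_for_status()  # raise on 4xx/5xx status codes
    content_type = resp.headers.get('content-type', '')
    if 'application/pdf' not in content_type:
        raise ValueError('expected a PDF from %s, got %s' % (url, content_type))
    return resp.content

With the trailing-slash URL above, this would fail loudly instead of silently saving a 15KB HTML page as a .pdf file.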




To write the file to disk:



with open("out.pdf", "wb") as myfile:  # "wb": PDF content is binary data
    myfile.write(req.content)
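
For larger files, a streamed variant (a sketch using requests' stream=True and iter_content options) writes the download to disk in chunks instead of buffering the whole response in memory:

import requests

# Sketch: download in chunks rather than holding the whole PDF in memory.
req = requests.get("http://www.mathworks.com/moler/random.pdf", stream=True)
with open("out.pdf", "wb") as myfile:
    for chunk in req.iter_content(chunk_size=8192):
        myfile.write(chunk)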




Maybe it's a little late, but you can try this: just write the content to a new file and read it back with textract. When I tried it without writing to a file first, I got garbage text containing "# $".

import requests
import textract

url = "The url which downloads the file"  # placeholder for the PDF's URL
response = requests.get(url)
with open('./document.pdf', 'wb') as fw:  # save the raw PDF bytes to disk
    fw.write(response.content)
text = textract.process("./document.pdf")  # extract the text from the saved PDF
print('Result: ', text)

