How do I change the file extension?

I am trying to scrape an .xlsx file from the Tax Foundation website. Unfortunately, I keep getting an error message that reads: "Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file." I did some research, and it suggests the fix is to change the file extension to ".xls" instead of ".xlsx". Can anyone help?

from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")

soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))

FHFA = os.chdir('C:/US_Census/Directory')

seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
    print(filename)

print(' ')
print("All files successfully downloaded.")

      

P.S. I know I could just download the file manually, but I am using this script to automate a process.



1 answer


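Your problem was this line: url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file). If you go to the website and hover over the Excel download button, you will see that it points to a much longer link, https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx (notice the 2017....238?). The href is already a complete URL on a different host, so prepending 'https://taxfoundation.org/' produced an address that does not point to the spreadsheet. As a result, you never actually downloaded an Excel file, and changing the extension to ".xls" would not help. Here is the corrected line: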

url = urllib.request.urlretrieve(href, file)



Everything else worked correctly.
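
If you want the script to keep working whether a link is absolute (as it is here) or relative, one option is to resolve each href against the page URL instead of concatenating strings. The sketch below is only an illustration, not part of the original script: the helper names resolve_href and looks_like_xlsx are made up for this example. The second helper relies on the fact that .xlsx files are ZIP archives, so it gives a rough check that the download is not, say, an HTML page saved under an .xlsx name.

```python
import zipfile
from urllib.parse import urljoin

# Page that was scraped in the question.
PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_href(page_url, href):
    """Return an absolute URL for href, resolved against page_url.

    urljoin leaves an absolute href unchanged and only fills in the
    missing parts (scheme, host, base path) of a relative one.
    """
    return urljoin(page_url, href)


def looks_like_xlsx(path):
    """Rough sanity check (hypothetical helper).

    An .xlsx file is a ZIP container, so a file that is not a ZIP
    archive cannot be a real workbook. This does not prove that the
    workbook itself is valid, only that the container looks right.
    """
    return zipfile.is_zipfile(path)
```

Inside the loop you could then call urllib.request.urlretrieve(resolve_href(PAGE_URL, href), file) and print a warning when looks_like_xlsx(file) returns False.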
