Automatic download of large files from a public HTTP URL into Google Cloud Storage

For weather processing purposes, I want to automatically receive daily weather forecast data in Google Cloud Storage.

The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (30 to 300 megabytes). File size is a major concern.

After looking at previous Stack Overflow threads, I tried two methods, both unsuccessful:

1 / First attempt with urlfetch on Google App Engine

    from google.appengine.api import urlfetch

    url = "http: //dcpc-nwp.meteo.fr/servic ..."
    result = urlfetch.fetch (url)

    [...] # Code to save in a Google Cloud Storage bucket

But the following error appears on the urlfetch line:

DeadlineExceededError: Timed out waiting for HTTP response from URL
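For completeness, urlfetch also accepts an explicit deadline argument; a sketch of that variant is below (same truncated placeholder URL as above, and 60 seconds is, as far as I know, the ceiling for ordinary App Engine request handlers), but with files this large it can still hit the limit:

    from google.appengine.api import urlfetch

    url = "http://dcpc-nwp.meteo.fr/servic..."  # placeholder: full forecast-file URL
    # Raise the fetch deadline to 60 seconds; very large downloads may still
    # exceed it and raise DeadlineExceededError.
    result = urlfetch.fetch(url, deadline=60)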

2 / Second attempt with the Storage Transfer Service

According to the documentation, it is possible to pull HTTP data directly into Cloud Storage through the Storage Transfer Service: https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata

But the transfer requires the size and MD5 of each file. This option may not work in my case because the website does not provide this information.
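For context: as I understand the documentation, the URL list that the Storage Transfer Service consumes is a tab-separated file whose rows give the URL, the size in bytes, and the base64-encoded MD5 of each object. A sketch of the expected layout, with invented values, is below; it is exactly those last two columns that I cannot fill in:

    TsvHttpData-1.0
    http://dcpc-nwp.meteo.fr/.../forecast_file.grib	68435456	1B2M2Y8AsgTpgAmY7PhCfg==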

3 / Any ideas?

Do you see any way to automatically fetch a large file over HTTP into my Cloud Storage bucket?


3 answers


3 / Workaround with a Compute Engine instance

Since it was not possible to fetch the large files from the external HTTP server using App Engine or through the Storage Transfer Service, I used a workaround with an always-running Compute Engine instance.

This instance regularly checks for new weather files, downloads them, and uploads them to the Cloud Storage bucket.
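A minimal sketch of the fetch-and-upload step that the instance runs on a schedule (e.g. via cron) is below; the URL, bucket name and chunk size are placeholders, and it assumes the requests and google-cloud-storage packages are installed:

    import requests
    from google.cloud import storage

    # Placeholder values -- the real forecast URL and bucket name differ.
    FILE_URL = "http://dcpc-nwp.meteo.fr/.../latest_forecast.grib"
    BUCKET_NAME = "my-weather-bucket"

    def fetch_and_store():
        # Stream the download so a 300 MB file is not held entirely in memory.
        response = requests.get(FILE_URL, stream=True, timeout=300)
        response.raise_for_status()

        filename = FILE_URL.split("/")[-1]
        local_path = "/tmp/" + filename
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

        # Upload the downloaded file to the bucket (kept in the same region
        # as the instance so the network traffic stays free).
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)
        blob = bucket.blob(filename)
        blob.upload_from_filename(local_path)

    if __name__ == "__main__":
        fetch_and_store()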



For scalability, maintenance and cost reasons, I would have preferred a purely managed backend solution, but on the plus side:

  • It works great on a brand-new f1-micro Compute Engine instance (no additional packages needed and only about $4/month when running 24/7)
  • Network traffic from Compute Engine to Google Cloud Storage is free when the instance and the bucket are in the same region ($0/month)

Currently, MD5 and size are required by the Storage Transfer Service; we understand that in cases like yours this can be difficult to work with, but unfortunately we don't have a great solution today.



If you can't get the size and MD5 by uploading the files yourself (temporarily), I think that is the best you can do.
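As an aside: once a temporary copy of a file is in hand, the size and the base64-encoded MD5 that the Transfer Service expects can be computed locally. A minimal sketch, with a placeholder path:

    import base64
    import hashlib
    import os

    def size_and_md5(path):
        """Return (size in bytes, base64-encoded MD5) for a local file."""
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                md5.update(chunk)
        return os.path.getsize(path), base64.b64encode(md5.digest()).decode("ascii")

    # Example (placeholder path):
    # print(size_and_md5("/tmp/forecast_file.grib"))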



The MD5 and file size can be obtained easily and quickly with the curl -I command, as described at https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
You can then configure the Storage Transfer Service to use this information.
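A minimal sketch of the same check from Python is below (the URL is a placeholder, and the Content-MD5 header is only present if the server chooses to send it):

    import requests

    # Placeholder URL -- substitute the real forecast-file URL.
    url = "http://dcpc-nwp.meteo.fr/.../forecast_file.grib"

    # Equivalent to `curl -I <url>`: fetch only the response headers.
    head = requests.head(url, allow_redirects=True, timeout=30)
    print(head.headers.get("Content-Length"))  # size in bytes, if provided
    print(head.headers.get("Content-MD5"))     # base64 MD5, only if the server sends it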

Another option is to use a serverless Cloud Function. It might look something like this in Python:

    import requests

    def download_url_file(url):
        """Download the file at `url` into /tmp and return its file name, or None on failure."""
        try:
            print('[ INFO ] Downloading {}'.format(url))
            req = requests.get(url)
            if req.status_code == 200:
                # Save the response body to /tmp, named after the last URL segment
                output_filename = url.split('/')[-1]
                output_filepath = '/tmp/{}'.format(output_filename)
                with open(output_filepath, 'wb') as f:
                    f.write(req.content)
                print('[ INFO ] Successfully downloaded to {} (file name: {})'.format(output_filepath, output_filename))
                return output_filename
            else:
                print('[ ERROR ] Status Code: {}'.format(req.status_code))
        except Exception as e:
            print('[ ERROR ] {}'.format(e))
        # Reaching this point means the download failed
        return None
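The function above only lands the file in the function's /tmp space; to finish the job it still needs to be copied into a bucket, for example with the google-cloud-storage client (the bucket name below is a placeholder). Note that a Cloud Function's memory and /tmp space are limited, so the largest files may be a tight fit.

    from google.cloud import storage

    def upload_to_bucket(local_filename, bucket_name='my-weather-bucket'):
        # Copy the file previously saved under /tmp into the Cloud Storage bucket.
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(local_filename)
        blob.upload_from_filename('/tmp/{}'.format(local_filename))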
