R httr: FTP error 421 "Too many connections from your Internet address" when downloading files
EDIT - short question: does httr have a finalizer that closes the FTP connection?
I am downloading climate forecast files from the FTP server of the NASA NEX project using the httr package.
My script:
library(httr)
var <- c("pr", "tasmin", "tasmax")
rcp <- c("rcp45", "rcp85")
mod <- c("inmcm4", "GFDL-CM3")
year <- seq(2040, 2080, 1)
for (v in var) {
  for (r in rcp) {
    # remote directory for this scenario/variable combination
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile <- paste0(v, '_day_BCSD_', r, "_r1i1p1_", m, '_', y, '.nc')
        url1 <- paste0(url, nfile)
        destfile <- paste0('mypath', r, '/', v, '/', nfile)
        GET(url = url1, authenticate(user = 'NEXGDDP', password = '', type = "basic"), write_disk(path = destfile, overwrite = FALSE))
        Sys.sleep(0.5)
      }
    }
  }
}
After a while the server terminates my connection with the following error: "421 Too many connections from your Internet address".
I read here that this has to do with the number of open connections and that I have to close them on each iteration (I'm not sure if this really makes sense, though!). Is there a way to close the FTP connection with httr?
Suggested solution (final answer)
The suggested solution is to set the maximum number of connections to the FTP server for httr:
> config(CURLOPT_MAXCONNECTS=5)
<request>
Options:
* CURLOPT_MAXCONNECTS: 5
Description
Preamble:
The httr package is a wrapper for curl. This is important because it abstracts the curl interface. In this case, we want to change the behavior of curl by changing the curl configuration through the httr abstraction.
- By default, httr automatically shares handles between requests to the same website (curl handles are managed automatically), cookies are maintained across requests, and an up-to-date root-level certificate store is used.
In this context, we do not control the FTP server, only the client's requests to the server. Hence, we can change the default behavior with httr::config to limit the number of FTP connections that are kept open.
Query httr curl ftp options
To list the FTP-related options that httr exposes, we can run the following command:
> httr_options("ftp")
     httr                     libcurl                          type
49   ftp_account              CURLOPT_FTP_ACCOUNT              string
50   ftp_alternative_to_user  CURLOPT_FTP_ALTERNATIVE_TO_USER  string
51   ftp_create_missing_dirs  CURLOPT_FTP_CREATE_MISSING_DIRS  integer
52   ftp_filemethod           CURLOPT_FTP_FILEMETHOD           integer
53   ftp_response_timeout     CURLOPT_FTP_RESPONSE_TIMEOUT     integer
54   ftp_skip_pasv_ip         CURLOPT_FTP_SKIP_PASV_IP         integer
55   ftp_ssl_ccc              CURLOPT_FTP_SSL_CCC              integer
56   ftp_use_eprt             CURLOPT_FTP_USE_EPRT             integer
57   ftp_use_epsv             CURLOPT_FTP_USE_EPSV             integer
58   ftp_use_pret             CURLOPT_FTP_USE_PRET             integer
59   ftpport                  CURLOPT_FTPPORT                  string
60   ftpsslauth               CURLOPT_FTPSSLAUTH               integer
196  tftp_blksize             CURLOPT_TFTP_BLKSIZE             integer
To open the libcurl documentation for an option, we can call curl_docs("CURLOPT_FTP_ACCOUNT").
Changing the httr request configuration
You can either change the global curl configuration of httr using set_config(), or just wrap your request using with_config(). In this case, we want to limit the maximum number of connections to the FTP server. We can look for the relevant option in the following way:
> httr_options("max")
     httr                  libcurl                       type
95   max_recv_speed_large  CURLOPT_MAX_RECV_SPEED_LARGE  number
96   max_send_speed_large  CURLOPT_MAX_SEND_SPEED_LARGE  number
97   maxconnects           CURLOPT_MAXCONNECTS           integer
98   maxfilesize           CURLOPT_MAXFILESIZE           integer
99   maxfilesize_large     CURLOPT_MAXFILESIZE_LARGE     number
100  maxredirs             CURLOPT_MAXREDIRS             integer
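Now we can look up curl_docs("CURLOPT_MAXCONNECTS"): OK, this is the option we want.
Now we have to set it. Note that config() on its own only builds a configuration object; it takes effect once it is passed to set_config() or with_config(), or to the request itself.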
> config(CURLOPT_MAXCONNECTS=5)
<request>
Options:
* CURLOPT_MAXCONNECTS: 5
ref: https://cran.r-project.org/web/packages/httr/httr.pdf
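The two ways of applying a configuration mentioned above, set_config() and with_config(), can be sketched as follows. This is a minimal sketch, assuming a recent httr in which options are named by the lowercase entries in the httr column of httr_options() (here maxconnects); the value 5 is simply the one used above, not a recommendation.

```r
library(httr)

# Build the configuration object; on its own this changes nothing.
cfg <- config(maxconnects = 5)

# Option A: apply it to every later httr request in this session.
set_config(cfg)

# Option B: apply it only to requests made inside the braces.
with_config(cfg, {
  # GET(...) calls placed here would be sent with maxconnects = 5
  invisible(NULL)
})

# Undo the global setting from option A when it is no longer wanted.
reset_config()
```

Option A is the simpler choice for a long download loop; option B keeps the setting from leaking into unrelated requests.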
Alternative RCurl approach
I know this is a little overkill, but I have included it as an alternative approach. Why? There is a subtle issue here due to network bandwidth: starting multiple concurrent FTP sessions can be slower than running them in sequence. My alternative approach would be to run the R script below, or to use curl directly from the Unix shell command line.
require(RCurl)
require(stringr)
opts = curlOptions(userpwd = "NEXGDDP:", netrc = TRUE)
rcpDir = c("rcp45", "rcp85")
varDir = c("pr", "tasmin", "tasmax")
for (rcp in rcpDir) {
  for (var in varDir) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', rcp, '/day/atmos/', var, '/r1i1p1/v1.0/')
    print(url)
    # list the file names in the remote directory
    filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, .opts = opts)
    filelist <- unlist(str_split(filenames, "\n"))
    filelist <- filelist[!filelist == ""]
    # keep only the two models, for the years 2040, 2050, ..., 2080
    filesavg <- str_detect(filelist,
                           "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0")
    filesavg <- filelist[filesavg]
    filesavg
    urlsavg <- str_c(url, filesavg)
    # download one file at a time, skipping files already present in data/
    for (file in seq_along(urlsavg)) {
      fname <- str_c("data/", filesavg[file])
      if (!file.exists(fname)) {
        print(urlsavg[file])
        bin <- getBinaryURL(urlsavg[file], .opts = opts)
        writeBin(bin, fname)
        Sys.sleep(1)
      }
    }
  }
}
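Note that the pattern used in the script selects only the years ending in 0 (2040, 2050, ..., 2080), not every year from 2040 to 2080 as in the question. The sketch below checks this on a handful of illustrative file names that follow the naming scheme above (the names, including the CCSM4 one, are made up for the example and not taken from a directory listing), and shows one possible pattern for the full range, which is an assumption and not part of the original answer.

```r
# Pattern from the script: the two models, decade years only
decades <- "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0"

# Assumed alternative: the same two models, every year from 2040 to 2080
all_years <- "(inmcm4|GFDL-CM3)_20([4-7][0-9]|80)\\.nc$"

# Illustrative file names only
files <- c(
  "pr_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc",
  "pr_day_BCSD_rcp45_r1i1p1_inmcm4_2041.nc",
  "pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc",
  "pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2090.nc",
  "pr_day_BCSD_rcp45_r1i1p1_CCSM4_2050.nc"
)

matched_decades <- files[grepl(decades, files)]      # the 2040 and 2080 files
matched_all_years <- files[grepl(all_years, files)]  # the 2040, 2041 and 2080 files
print(matched_decades)
print(matched_all_years)
```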
Code output
> require(RCurl)
> require(stringr)
> opts = curlOptions(userpwd = "NEXGDDP:", netrc = TRUE)
> rcpDir = c("rcp45", "rcp85")
> varDir = c("pr", "tasmin", "tasmax")
> for (rcp in rcpDir ) {
+ for (var in varDir ) {
+ url <- paste0( 'ftp://ftp.nccs.nasa.gov/BCSD/', rcp, '/day/atmos/', var, '/r1i1p1/v1.0/', sep = '')
+ print(url)
+ filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, .opts = opts)
+ filelist <- unlist(str_split(filenames, "\n"))
+ filelist <- filelist[!filelist == ""]
+ filesavg <- str_detect(filelist,
+ "inmcm4_20[4-8]0|GFDL-CM3_20[4-8]0")
+ filesavg <- filelist[filesavg]
+ filesavg
+ urlsavg <- str_c(url, filesavg)
+
+ for (file in seq_along(urlsavg)) {
+ fname <- str_c("data/", filesavg[file])
+ if (!file.exists(fname)) {
+ print(urlsavg[file])
+ bin <- getBinaryURL(urlsavg[file], .opts = opts)
+ writeBin(bin, fname)
+ Sys.sleep(1)
+ }
+ }
+ }
+ }
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmin/r1i1p1/v1.0/tasmin_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_GFDL-CM3_2080.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2040.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2050.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2060.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2070.nc"
[1] "ftp://ftp.nccs.nasa.gov/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2080.nc"
(Not sure if this should be an answer, but I can't fit the whole thing in a comment.)
To summarize, two alternative solutions worked, combining my approach with the one suggested by Technophobe. I've put the final code here as well, in case it is helpful for those experiencing the same problem.
Using httr:
library(httr)
# configure a proxy, in case you are in an office/university network
set_config(use_proxy(url = 'http://~in_case_you_need_a_proxy', port = paste_here_port_no))
# limit the number of simultaneous connections as suggested by Technophobe (default is 5);
# config() alone only builds the object, so it is applied here with set_config(),
# using the httr option name listed by httr_options("max")
set_config(config(maxconnects = 3))
var <- c("pr", "tasmax", "tasmin")
rcp <- c("rcp45", "rcp85")
mod <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))
for (v in var) {
  for (r in rcp) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile <- paste0(v, '_day_BCSD_', r, "_r1i1p1_", m, '_', y, '.nc')
        url1 <- paste0(url, nfile)
        destfile <- paste0('D:/destination_path/', r, '/', v, '/', nfile)
        GET(url = url1, authenticate(user = 'NEXGDDP', password = '', type = "basic"), write_disk(path = destfile, overwrite = FALSE))
        gc()  # run the garbage collector after each download
        Sys.sleep(1)
      }
    }
  }
}
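On the original question of closing the connection explicitly: httr keeps a pool of curl handles per scheme/host/port and exports handle_reset(url) to drop the pooled handle for a URL. The sketch below combines that with the gc() call used in the loop above. It is only a sketch: download_and_release is a hypothetical helper, and whether dropping the handle actually prevents error 421 on this server is an untested assumption.

```r
library(httr)

# Hypothetical helper: download one file, then drop the pooled handle
# so that nothing in the session keeps the connection's handle alive.
download_and_release <- function(file_url, destfile) {
  resp <- GET(
    url = file_url,
    authenticate(user = "NEXGDDP", password = "", type = "basic"),
    write_disk(path = destfile, overwrite = FALSE)
  )
  status <- status_code(resp)
  rm(resp)                # the response object also references the handle
  handle_reset(file_url)  # remove the pooled handle for this scheme/host/port
  gc()                    # give R the chance to finalize the unreferenced handle
  invisible(status)
}
```

Inside the loop, the GET(...) and gc() lines would then be replaced by download_and_release(url1, destfile).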
An alternative approach using RCurl
library(RCurl)
opts <- curlOptions(proxy = 'http://~in_case_you_need_a_proxy:paste_here_port_no', userpwd = "NEXGDDP:", netrc = TRUE)
var <- c("pr", "tasmax", "tasmin")
rcp <- c("rcp45", "rcp85")
mod <- c("inmcm4", "GFDL-CM3")
year <- c(seq(2036, 2050, 1), seq(2061, 2080, 1))
for (v in var) {
  for (r in rcp) {
    url <- paste0('ftp://ftp.nccs.nasa.gov/BCSD/', r, '/day/atmos/', v, '/r1i1p1/v1.0/')
    for (m in mod) {
      for (y in year) {
        nfile <- paste0(v, '_day_BCSD_', r, "_r1i1p1_", m, '_', y, '.nc')
        url1 <- paste0(url, nfile)
        destfile <- paste0('D:/destination_path/', r, '/', v, '/', nfile)
        # fetch the file into memory, then write it to disk
        bin <- getBinaryURL(url1, .opts = opts)
        writeBin(bin, destfile)
        Sys.sleep(1)
        gc()
      }
    }
  }
}
Both approaches have been tested and work. The second may still be affected by error 421, but only in a very limited number of cases (I've downloaded over 900 files for a total of about 600 GB). Hopefully this is a useful recommendation for other people working in the field.
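For the occasional 421 that still slips through, one option is to retry the failed download after a pause. A minimal base R sketch follows; with_retry is a hypothetical helper (not part of httr or RCurl), and the number of attempts and the pause length are arbitrary assumptions.

```r
# Hypothetical helper: call f() up to `times` times, pausing longer after
# each failure, and re-raise the last error if every attempt fails.
with_retry <- function(f, times = 3, pause = 5) {
  stopifnot(times >= 1)
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < times) {
      Sys.sleep(pause * attempt)  # simple linear back-off
    }
  }
  stop(result)
}
```

In the RCurl loop, the download line could then be written as bin <- with_retry(function() getBinaryURL(url1, .opts = opts)).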