Avoid increasing memory in a foreach loop in R
I am trying to create summary statistics that combine two different sets of spatial data: a large raster file and a polygon file. The idea is to get a summary of the raster values in each polygon.
Since the raster is too large to process at once, I am trying to create subtasks and process them in parallel, i.e. process each polygon from a SpatialPolygonsDataFrame one at a time.
The code works fine, but after about 100 iterations I run into memory issues. Here is my code and what I intend to do:
# session setup
library("raster")
library("rgdal")
# multicore processing.
library("foreach")
library("doSNOW")
# create a cluster with three workers for the current R session
cluster<-makeCluster(3, type = "SOCK", outfile = "")
registerDoSNOW(cluster)
getDoParWorkers() # check if it worked
# load base data
r.terra.2008<-raster("~/terra.tif")
spodf.malha.2007<-readOGR("~/","composed")
# bring both data-sets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007<-spTransform(spodf.malha.2007,CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008)==proj4string(spodf.malha.2007) # should be TRUE
# create a function to extract areas
function.landcover.sum<-function(r.landuse,spodf.pol){
  return(table(extract(r.landuse,spodf.pol)))
}
# apply it on one subset to see if it is working
function.landcover.sum(r.terra.2008,spodf.malha.2007[1,])
## parallel loop
# define package(s) to be use in the parallel loop
l.packages<-c("raster","sp")
# try a parallel loop for the first 6 polygons
l.results<-foreach(i=1:6,
                   .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ",i,".",sep=""))
  return(function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
}
The result is a list that looks like this:
l.results
[[1]]
9 10
193159 2567
[[2]]
7 9 10 12 14 16
17 256 1084 494 67 15
[[3]]
3 5 6 7 9 10 11 12
2199 1327 8840 8579 194437 1061 1073 1834
14 16
222 1395
[[4]]
3 6 7 9 10 12 16
287 102 728 329057 1004 1057 31
[[5]]
3 5 6 7 9 12 16
21 6 20 495 184261 4765 28
[[6]]
6 7 9 10 12 14
161 161 386 943 205 1515
So the result is pretty small and shouldn't be the source of a memory allocation problem. However, the next loop over the entire polygon dataset, which has > 32,000 rows, allocates more than 8 GB of memory after about 100 iterations.
# apply the parallel loop on the whole dataset
l.results<-foreach(i=1:nrow(spodf.malha.2007),
                   .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ",i,".",sep=""))
  return(function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
  # gc(reset=TRUE) # does not resolve the problem
  # closeAllConnections() # does not resolve the problem
}
What am I doing wrong?
edit: As suggested in the comments, I tried deleting the object after each iteration inside the loop, but that did not solve the problem. I also tried to rule out problems caused by each worker importing the data separately, by exporting the objects to the workers beforehand:
clusterExport(cl = cluster,
varlist = c("r.terra.2008","function.landcover.sum","spodf.malha.2007"))
without significant changes. My R version is 3.4 on a Linux platform, so presumably the patch linked in the first comment is already included in this version. I also tried the parallel package, as suggested in the first comment, but that made no difference either.
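For reference, the per-iteration cleanup attempt looked roughly like this (a minimal sketch with placeholder names vals and res; note that gc() has to run before the value is returned, because statements placed after return() never execute):

l.results<-foreach(i=1:nrow(spodf.malha.2007),
                   .packages = l.packages) %dopar% {
  vals<-extract(r.terra.2008,spodf.malha.2007[i,]) # large intermediate extraction
  res<-table(vals) # the small per-polygon summary
  rm(vals)         # delete the intermediate object on the worker
  gc(reset=TRUE)   # force garbage collection before returning
  res              # return only the small table
}

And the attempt with the parallel package was along these lines (again a sketch, not the exact code I ran):

library("parallel")
cluster<-makeCluster(3) # defaults to a PSOCK cluster
clusterEvalQ(cluster, library("raster")) # load raster on each worker
clusterExport(cl = cluster,
              varlist = c("r.terra.2008","function.landcover.sum","spodf.malha.2007"))
l.results<-parLapply(cluster, 1:nrow(spodf.malha.2007),
                     function(i) function.landcover.sum(r.terra.2008,spodf.malha.2007[i,]))
stopCluster(cluster)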