Memory issues using bigmemory to load large dataset in R
I have a large text file (> 10 million lines, > 1 GB) that I want to process one line at a time so that I never load the whole thing into memory. After processing each line, I want to store some values in a big.matrix object. Here's a simplified example:
library(bigmemory)
library(pryr)

con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5){
  print(c(address(x), refs(x)))
  y <- readLines(con, n = 1, warn = FALSE)
  x[i] <- 2L * as.integer(y)
}
close(con)
where x.csv contains one number per line:
4
18
2
14
16
Following the advice at http://adv-r.had.co.nz/memory.html, I printed the memory address of my big.matrix object, and it seems to change with each iteration of the loop:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
- Can big.matrix objects be modified in place?
- Is there a better way to load, process and then store this data? The current method is slow!
> Is there a better way to load, process and then store this data? The current method is slow!
The slowest part of your method is the call to read each line individually. We can instead "chunk" the data, reading several lines at a time, so as to stay under the memory limit while possibly speeding things up.
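To make the idea concrete before the full solution below: readLines() takes an n argument, so a single call can pull in a whole chunk of lines that is then processed vectorised. A minimal, self-contained sketch of that pattern (the file name "x.csv" and chunk_size are placeholders, not part of the code further down):

con <- file("x.csv", open = "r")
chunk_size <- 100000L                      # placeholder chunk size
while (length(chunk <- readLines(con, n = chunk_size, warn = FALSE)) > 0) {
  values <- 2L * as.integer(chunk)         # process the whole chunk at once
  # ... store or write out `values` here ...
}
close(con)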
Here's the plan:
- Figure out how many lines the file has
- Read in a chunk of those lines
- Do some operations on the chunk
- Write the chunk back out to a new file for storage
library(readr)

# Make a file
x <- data.frame(matrix(rnorm(10000), 100000, 10))
write_csv(x, "./test_set2.csv")

# Create a function to read a variable in a file and double it
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  var.top <- NULL
  i <- 1L  # start at 1 so the first chunk's line count is kept

  # Gather column names and position of objective column
  connection.names <- file(calc.file, open = "r+")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)

  # Find length of file by line
  connection.len <- file(calc.file, open = "r+")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
    i <- i + 1L
  }
  close(connection.len)

  # Make connection for doubling function
  # Loop through file and double the set variable
  connection.double <- file(calc.file, open = "r+")
  for (j in 1:length(lines.per)) {
    # Read in a chunk of the file; the if stops read.table from
    # breaking on the header row of the first chunk
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j], comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }

    # Grab the column we need and double it
    double <- data[, I(col.name)] * 2

    # Append after the first chunk so earlier output is not overwritten
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }

    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
So we get back a CSV file with the processed data. You can change double <- data[, I(col.name)] * 2
to whatever it is you need to do to each chunk.
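If you would rather keep the results in a big.matrix, as in the question, the same chunked loop can fill the matrix by row index instead of writing a CSV. A rough sketch of that variant, assuming x.csv holds one integer per line as in the question and that the line count n_lines is known in advance (or computed with a counting pass like the one in calcDouble):

library(bigmemory)

n_lines <- 5L                               # total lines in x.csv (assumed known)
chunk_size <- 2L                            # placeholder chunk size
x <- big.matrix(nrow = n_lines, ncol = 1, type = "integer")

con <- file("x.csv", open = "r")
row <- 1L
while (length(chunk <- readLines(con, n = chunk_size, warn = FALSE)) > 0) {
  # Process the whole chunk, then write it into the matching rows of x
  x[row:(row + length(chunk) - 1L), 1] <- 2L * as.integer(chunk)
  row <- row + length(chunk)
}
close(con)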