Grow ffdf data frame gradually
From the save.ffdf documentation:
Using 'save.ffdf' automatically sets the finalizers of the ff vectors to "close". This means that the data will be preserved on disk when the object is removed or the R session is closed. Data can be deleted either using 'delete' or by removing the directory where the object was saved ('dir').
I want to start with a small ffdf data frame, add a bit of new data at a time, and have it grow on disk. So I did a little experiment:
# in R
ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")
# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff ffiris$Species.ff
# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
rm(ffiris)
# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff ffiris$Species.ff
It turns out that the ff data on disk is not updated automatically when ffiris is removed. How about saving manually?
# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
save.ffdf(ffiris, "~/Desktop/iris")
# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff ffiris$Species.ff
Hmm, no luck so far. Why?
What about deleting the folder before saving?
# in R
ffiris = as.ffdf(iris)
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)
# in bash
ls ~/Desktop/iris/
# ls: /Users/ky/Desktop/iris/: No such file or directory
Even stranger. And even if all of this worked, it would still be terribly inefficient. I am looking for something like:
updateOnDisk(ffiris)
Can anyone please help?
ff and ffbase offer out-of-memory R vectors, but they introduce reference semantics that can cause problems with R idioms.
R is a functional programming language, meaning that functions do not change their arguments or other objects, but return modified copies. In ffbase we have implemented functions the R way, i.e. transform returns a copy of the original ffdf data.frame. This can be seen by looking at the filenames:
ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")
filename(ffiris) # show contents of ~/Desktop/iris
ffiris = transform(ffiris, new1 = 99) # this creates a copy of the whole data.frame!
filename(ffiris)
ffiris$new2 <- ff(rep(99, nrow(iris))) # this creates a new column, but not yet in the right directory
filename(ffiris)
save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE) # this fixes that.
transform is currently inefficient for adding a new column because it copies the entire data frame (this is R semantics). It does so because the result of transform may be only temporary, and you don't want it to change the original data.
In ffbase2 we are addressing this problem.
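To check where each column lives without reading the filenames by eye, the directories can be collected at each step. This is only a minimal sketch: it assumes ff and ffbase are installed, target_dir is a hypothetical scratch location rather than the ~/Desktop/iris folder from the question, and the exact paths printed will depend on your system.

```r
library(ff)
library(ffbase)

# Hypothetical scratch directory (an assumption); any writable path works.
target_dir <- file.path(tempdir(), "iris_ffdf")

ffiris <- as.ffdf(iris)
save.ffdf(ffiris, dir = target_dir, overwrite = TRUE)

# Directories holding the backing files right after the first save.
dirs_saved <- unique(dirname(unlist(filename(ffiris))))

# Attach one new ff column; the existing columns are not copied.
ffiris$new2 <- ff(rep(99, nrow(ffiris)))
dirs_grown <- unique(dirname(unlist(filename(ffiris))))

# Saving again moves the new column's file next to the others.
save.ffdf(ffiris, dir = target_dir, overwrite = TRUE)
dirs_resaved <- unique(dirname(unlist(filename(ffiris))))

print(dirs_saved)
print(dirs_grown)    # expected to list an extra (temporary) directory
print(dirs_resaved)
```

Adding columns with $<- and then calling save.ffdf with overwrite=TRUE is the closest thing to an "update on disk" step shown here; it avoids the full copy that transform makes.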