How to loop the dcast function in the reshape2 package in R

As a relatively new user of R, I am having trouble with any of the looping functions. I've looked at many tutorials, but the examples are usually very simple and therefore easy to follow. However, I need to create slightly more complex loops and it is very difficult for me to figure out how to do this. There are several loop related questions in this and other forums, but none of them match exactly what I need, and while I tried to adapt other answers for my current problem, I keep running into errors.

I have 2000 .csv files with data in long format (simplified example):

solution1
sol1     sol2     Istat
s1       s2       0.435
s1       s3       0.456
s1       s4       0.845
s1       s5       0.234

      

This is basically a summary of the pairwise comparisons of the 2000 individual solutions I have, with the similarity of the solutions summed up in the "Istat" value.

I am trying to dcast each of these 2000 CSV files into wide format (using the reshape2 package in R) so that they look like this:

     s1     s2     s3     s4     s5
s1   NA     0.435  0.456  0.845  0.234

      

I know how to do this for a single CSV file:

stat.cast <- dcast(solution1, sol2 ~ sol1, value.var="Istat")

      

But I cannot work out how to do it in a for loop, or even with lapply, which seems to be a possible solution as well.

The closest I was able to get with a for loop:

# Get files from directory
loopout = "/Users/jc219806/Documents/Chapter 1/ANALYSES/R work/Istat/last_LoopOut/"
# List of file names inside folder
solutions <- list.files(loopout)
# Read all 2000 files inside
all.data <- lapply(solutions, read.csv, header=TRUE)
# Loop for performing reshape cast function to each listed dataframe
for (i in 1:length(all.data))
  {
  all.cast <- dcast(all.data, sol2 ~ sol1, value.var="Istat")
  }

      

But it keeps giving me the error that it cannot find the "Istat" value in the input, even though it is present in the list of data frames (the "all.data" object in the code above).

And using lapply:

lapply(solutions, dcast(all.data, sol2 ~ sol1, value.var="Istat"))

      

I am getting the same type of error:

Error: value.var (Istat) not found in input

      

I don't understand why, because it is listed as one of the variables in each of the 2000 data frames in the list. It looks like I am not getting it to loop through each of the 2000 .csv files, but I don't know how to fix it. I also wondered whether the code could be written so that it iterates over all 2000 outputs according to the column names.

Hopefully this is not as difficult a problem as it seems to me. Any help (along with some detailed explanations) or helpful direction would be greatly and sincerely appreciated. Thanks.



3 answers


I would melt your "all.data" list and then dcast it into wide form. Something like:

## Sample data
set1 <- set2 <- data.frame(sol1 = c("s1", "s1", "s1", "s1"), 
                   sol2 = c("s2", "s3", "s4", "s5"), 
                   Istat = c(0.435, 0.456, 0.845, 0.234))
set2$Istat <- set2$Istat + 1 ## Just to see some different data

all.data <- mget(ls(pattern = "set\\d+")) ## use your actual object

## The reshaping
library(reshape2)
dcast(melt(all.data, id.vars = c("sol1", "sol2")), 
      L1 + sol1 ~ sol2, value.var = "value")
#     L1 sol1    s2    s3    s4    s5
# 1 set1   s1 0.435 0.456 0.845 0.234
# 2 set2   s1 1.435 1.456 1.845 1.234

      



If your "all.data" list has names, the "L1" column will contain those names, which can be very convenient in the long run.
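To get meaningful names in "L1" when the data come from files, the list can be named after the files as it is read in. This is only a minimal sketch: the temporary folder and the two demo CSV files are made up for illustration and stand in for the real folder of 2000 files. It also asks list.files for full paths, so that read.csv can open the files whatever the working directory is.

```r
library(reshape2)

# Hypothetical demo folder with two small CSVs standing in for the real files
demo_dir <- file.path(tempdir(), "istat_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo <- data.frame(sol1 = "s1",
                   sol2 = c("s2", "s3", "s4", "s5"),
                   Istat = c(0.435, 0.456, 0.845, 0.234))
write.csv(demo, file.path(demo_dir, "solution1.csv"), row.names = FALSE)
write.csv(transform(demo, Istat = Istat + 1),
          file.path(demo_dir, "solution2.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths rather than bare file names
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
all.data <- lapply(files, read.csv, header = TRUE)

# Name each list element after its file, without the .csv extension
names(all.data) <- tools::file_path_sans_ext(basename(files))

# Same reshaping as above; L1 now carries the file names
wide <- dcast(melt(all.data, id.vars = c("sol1", "sol2")),
              L1 + sol1 ~ sol2, value.var = "value")
wide
```

With the real data, only demo_dir would need to point at the actual folder; the demo files would not be written.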



You wrote:

for (i in 1:length(all.data))
  {
  all.cast <- dcast(all.data, sol2 ~ sol1, value.var="Istat")
  }

      

What you should have written:

all.cast <- list()
for (i in 1:length(all.data)) {
  all.cast[[i]] <- dcast(all.data[[i]], sol2 ~ sol1, value.var = "Istat")
}

      



But a more "R-esque" solution would be:

all.cast <- lapply(all.data, dcast, sol2 ~ sol1, value.var = "Istat")

      

Hopefully this makes it clear what you did wrong.
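The reason for the error message can be seen by checking what dcast actually receives. The following is a small sketch with made-up data in the same shape as the question's: "Istat" is a column of each element of the list, not of the list itself, so dcast has to be handed one element at a time.

```r
library(reshape2)

# Made-up stand-in for the list produced by lapply(solutions, read.csv)
one <- data.frame(sol1 = "s1",
                  sol2 = c("s2", "s3", "s4", "s5"),
                  Istat = c(0.435, 0.456, 0.845, 0.234))
all.data <- list(solution1 = one,
                 solution2 = transform(one, Istat = Istat + 1))

names(all.data)       # names of the list elements, not column names
names(all.data[[1]])  # the columns live one level down

# Passing the whole list fails; capture the error instead of stopping
whole.list <- tryCatch(dcast(all.data, sol2 ~ sol1, value.var = "Istat"),
                       error = function(e) "error")

# Passing one data frame at a time works
all.cast <- lapply(all.data,
                   function(d) dcast(d, sol2 ~ sol1, value.var = "Istat"))
all.cast$solution1
```

The same reasoning applies to lapply(solutions, dcast(all.data, ...)): the inner dcast call is evaluated once, on the whole list, before lapply ever runs, so it fails in the same way.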



"all.data" is a list of data frames. To iterate over a list, you can use lapply with an anonymous function (just to be explicit) and apply dcast to each element.

library(reshape2)
lapply(all.data, function(x) dcast(x, sol1 ~ sol2, value.var="Istat"))

      

Or, instead of producing a separate dcast result for each list element, you could rbind the list into a single data frame with a grouping variable for each list item, and then either use dcast or spread from library(tidyr).

library(dplyr)
library(tidyr)
# Stack the list into one data frame; "group" holds each list element's name
bind_rows(all.data, .id = "group") %>%
  spread(sol2, Istat)

      

Or using data.table

library(data.table)
dcast(rbindlist(Map(cbind, all.data, group=seq_along(all.data))),
                 group + sol1 ~sol2, value.var='Istat')

      

data

all.data <- structure(list(solution1 = structure(list(sol1 = c("s1", 
"s1", 
"s1", "s1"), sol2 = c("s2", "s3", "s4", "s5"), Istat = c(0.435, 
0.456, 0.845, 0.234)), .Names = c("sol1", "sol2", "Istat"), 
class =     "data.frame", row.names = c(NA, 
-4L)), solution2 = structure(list(sol1 = c("s1", "s1", "s1", 
"s1"), sol2 = c("s2", "s3", "s4", "s5"), Istat = c(0.42, 0.536, 
0.945, 0.324)), .Names = c("sol1", "sol2", "Istat"), 
class =    "data.frame", row.names = c(NA, 
-4L))), .Names = c("solution1", "solution2"))
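In current tidyr releases spread is superseded by pivot_wider, so the same reshaping can also be sketched as below. This is an alternative to the code above, not part of it; it assumes tidyr 1.0.0 or later plus dplyr, and it rebuilds a small list in the same shape as "all.data" so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# Small list of data frames in the same shape as "all.data"
all.data <- list(
  solution1 = data.frame(sol1 = "s1",
                         sol2 = c("s2", "s3", "s4", "s5"),
                         Istat = c(0.435, 0.456, 0.845, 0.234)),
  solution2 = data.frame(sol1 = "s1",
                         sol2 = c("s2", "s3", "s4", "s5"),
                         Istat = c(0.42, 0.536, 0.945, 0.324))
)

# Stack the list with an id column, then spread sol2 across the columns
wide <- bind_rows(all.data, .id = "group") %>%
  pivot_wider(names_from = sol2, values_from = Istat)
wide
```

Each list element becomes one row, identified by "group", with one column per value of sol2.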

      







