Subset multiple times in a data frame

Question

I want to split a data frame of 20 variables (continuous and categorical) into two parts, 70% and 30%, and repeat this split 100 times. The iris dataset works as an example as well.

data(iris)

test.rows <- sample(1:nrow(iris), 105)
iris.70 <- iris[test.rows, ]
iris.30 <- iris[-test.rows, ]

This gives me the dataframes I need. But how can I do this 100 times and save the results somewhere so I can use them later?

I have tried

output <- list()

for(i in 1:surveyed100){
   output[[i]] <- test.rows <- sample(1:nrow(surveyed100), 246) 
}

But this gives me the warning: numerical expression has 5 elements: only the first used.

I would appreciate your help.

r dataframe subset

Diego guevara torres 09 Aug 17 at 9:56

2 answers

LAP · Answer 1 · 2017-08-09T10:23:02+0000

First, create 100 samples:

samples <- list()

for(i in 1:100){
  samples[[i]] <- sample(1:nrow(surveyed100), 246)
}

Then use lapply() to store all 100 subsets in a list:

output <- lapply(samples, function(x) list(surveyed100[x,], surveyed100[-x,]))

Example using iris:

samples <- list()

for(i in 1:100){
  samples[[i]] <- sample(1:nrow(iris), 105)
}

head(samples)
[[1]]
  [1]  66 106  39  50  33 123  68  62  65 125  30  25  60  70  49  98 140  44 141  94  18  59 117  32  63 133  16 139  97 145 105  78 112  95
 [35] 128  36  37  64  10 124  40 111  17  29  51  89  99   4 135 103 101  19 115  74  73  91  11  67  84  88   1 114 138  21  77  24  69  13
 [69]  53  58 110 150   9  31 144  54 129  34  35  52 142  14 113 127  27  20  87 134 118  15  72  92  75   8 104  96 136 143   2  41 109  90
[103] 146  26   6

[[2]]
  [1]  78  84  89  75  63  81 119  51 127  20  66 106 140  65 116  72 147 141  61 113 130 136 109  49  57 149  90  56   8  46  82  55  38   4
 [35]  70  94 100 117  95  29  45  13 128  11  83  80  35  41 121  73  39  67  19  98 108 103  42   2  44 132 114 137 118  12 125  24  77  53
 [69]  28 150  92   5  43 112  60 122  15  30 104 102 120  76  47  85  40  79  33 143  48 139 148 124  36  16 138 101 115 107 134 126  74   6
[103]  52  50  10

[[3]]
  [1]  23  67  54 131  84 146  25   7  41 101 138  49  28  95  15   5  57  69 126  60  12  92  35  89  50   1  13  77 140 116 136  17 144  64
 [35]  32 139  76 102  61 130   2  44  75 100  81  31  34  46  72  33  18  79  24 133 124  62   9  88   8  66  74 125  51 127 123  52  90  39
 [69] 120  42  16  83  40 137  47  58  82 135  96  20 119  91  36  48 132  55  93 106 107 109 113  53  19 141 105 128  78 143  29   4  45  37
[103]  73  94  87

[[4]]
  [1] 125  41  37  80 136  50  91  89  44 117 132  82  78 128 146  49  61 105 145  83 111 126 100  94   7 102 112  17 120  60  36 104 123  65
 [35]  48  34  45  73  25  46 110  74  66 137 107 101 106  24  97  18 119  72  33 134  87  35 121  14  88   9  39   8  64 142  10 148  54  99
 [69] 103  95  63  11 133 141  32  96  51  81 140  76 138 127  52  75  55  26 115  19  90  16  21  86  56  22  79  53  31  23  68  13  77  30
[103]  71 116  67

[[5]]
  [1]  83   4  85 133 111  55 145  65  81  50 136  64  13  27   5 117  33  69  40 127  80  61  53 125  77  36 124 140 138  86   7   6  79  29
 [35]  21 115  23  74  93  10 132  51   2  41  49 123  94 142 120  48  19  89  28  91  14 118  43 103  87  58 149  20  56 113  82  62 104  44
 [69]  72  47 119  35 143 116 128  26  75  88   9  60  16 130 114  31   1 147  78  73   3  32  70 146 131 102  15  54 141 129  42 101  17  59
[103]  46 134 110

[[6]]
  [1]  18  20  53 106 142 125 120 109 119 129  84 146  99  51  43  91 141  89 131 124  95 135  81  42  73 112 128 133 108  27  28  47  32  76
 [35] 130 138  70  36  10  90  16  11 137  17  87   5  35  25 123  97  12 115 127  94  34 103   4  54 134  78  68  71 101 126  61  37  33   2
 [69]  88  80 144  82 150   3  21 114  58 110 136  22 105 117  79  64 102  49  98  59 132  39   8 149 121  40  29 104  55  77 147  74  50  56
[103]  48  75  23

Create the subsets:

output <- lapply(samples, function(x) list(iris[x,], iris[-x,]))

Output:

head(output[[1]][[1]])
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
66           6.7         3.1          4.4         1.4 versicolor
106          7.6         3.0          6.6         2.1  virginica
39           4.4         3.0          1.3         0.2     setosa
50           5.0         3.3          1.4         0.2     setosa
33           5.2         4.1          1.5         0.1     setosa
123          7.7         2.8          6.7         2.0  virginica

head(output[[1]][[2]])
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3           4.7         3.2          1.3         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
7           4.6         3.4          1.4         0.3  setosa
12          4.8         3.4          1.6         0.2  setosa
22          5.1         3.7          1.5         0.4  setosa
23          4.6         3.6          1.0         0.2  setosa


nrow(output[[1]][[1]])
[1] 105

nrow(output[[1]][[2]])
[1] 45
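
Since the question also asks about saving the results for later, here is a minimal sketch of one way to do that. The seed value, the train/test element names, and the temporary file path are my own assumptions for illustration, not part of the code above; round() is used so the 70% row count comes out as a whole number.

```r
# Sketch only: the seed, element names and file path below are assumptions.
set.seed(42)                              # arbitrary seed so the splits can be reproduced

n_train <- round(0.7 * nrow(iris))        # 70% of the rows

# Draw 100 index vectors without an explicit for loop.
samples <- replicate(100, sample(nrow(iris), n_train), simplify = FALSE)

# Build the 100 train/test pairs, naming the two parts for easier access.
output <- lapply(samples, function(x) list(train = iris[x, ], test = iris[-x, ]))

# Save the whole list to disk and read it back in a later session.
path <- tempfile(fileext = ".rds")        # hypothetical location; use your own path
saveRDS(output, path)
restored <- readRDS(path)
```

After this, restored[[1]]$train and restored[[1]]$test give the two parts of the first split.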

docendo discimus · Answer 2 · 2017-08-09T10:48:36+0000

You can create a small function to do this, for example:

foo <- function(dat, train_percent = 0.7) {
  n     <- seq_len(nrow(dat))
  train <- sample(n, floor(train_percent * max(n)))
  test  <- sample(setdiff(n, train))
  list(train = dat[train,], test = dat[test,])
}

Then you can easily apply this function multiple times using replicate:

replicate(100, foo(iris), simplify = FALSE)

The resulting list contains 100 items, and each item is itself a list of two data frames: the first is the "train" dataset and the second is the "test" dataset.
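
To illustrate how the stored splits might be used later, here is a small sketch. It repeats the foo helper from above so that it runs on its own; the seed and the choice of Sepal.Length as the summarised column are assumptions made only for the example.

```r
# Copy of the helper from this answer, repeated so the sketch is self-contained.
foo <- function(dat, train_percent = 0.7) {
  n     <- seq_len(nrow(dat))
  train <- sample(n, floor(train_percent * max(n)))
  test  <- sample(setdiff(n, train))
  list(train = dat[train, ], test = dat[test, ])
}

set.seed(1)  # assumption: arbitrary seed, only for reproducibility
splits <- replicate(100, foo(iris), simplify = FALSE)

# Pick out one split by position, then its parts by name.
first_train <- splits[[1]]$train
first_test  <- splits[[1]]$test

# Run the same computation on every split, here the mean
# Sepal.Length of each training set (one number per split).
train_means <- vapply(splits, function(s) mean(s$train$Sepal.Length), numeric(1))
```

The same pattern works for fitting a model on each s$train and evaluating it on s$test inside the function passed to vapply() or lapply().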
