Tidyr :: gather na.rm with missing data

Let's say I have multiple columns in a data frame that measure the same concept using different methods (for example, there are several kinds of IQ tests, and students may or may not have taken any of them). I want to combine the various methods into one column (an obvious use case for tidyr).

If the data looks something like this:

mydata <- data.frame(ID = 55:64, 
                 age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                 Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                 Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                 Test3 = c( NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

      

Naturally, I would like to do something like this (note that I use na.rm = TRUE so that the many NAs in my dataset do not each get their own row):

library(tidyr)
tests <- gather(mydata, key=IQSource, value=IQValue, c(Test1, Test2, Test3), na.rm = TRUE)
tests

      

Giving me:

   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
15 59  20    Test2     100
16 60  10    Test2     120
27 61  13    Test3     110
29 63  18    Test3      85
30 64  17    Test3     150

The problem is that I have a student (ID = 62) who has no IQ score on any of the three tests, and I do not want to lose her other data (the values in the ID and age columns).

Is there a way to tell tidyr that yes, I want to drop the NAs for rows that have data in at least one of the columns I am gathering, but that I also want to keep a row when all of the gathered columns are NA?
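To make the data loss concrete, here is a minimal sketch that compares the IDs before and after gathering. The name lost_ids is only illustrative and is not part of the original question.

```r
library(tidyr)

mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# Gather with na.rm = TRUE, exactly as in the question
tests <- gather(mydata, key = IQSource, value = IQValue,
                c(Test1, Test2, Test3), na.rm = TRUE)

# IDs that were in the input but are missing from the gathered result
lost_ids <- setdiff(mydata$ID, tests$ID)
lost_ids
```

Any ID listed in lost_ids belongs to a student whose ID and age were dropped along with the NA scores.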

3 answers


If each student has only one IQ test ...

library(tidyverse)

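mydata %>%
  gather(key=IQSource, value=IQValue, Test1:Test3) %>%
  group_by(ID) %>%
  # arrange() sorts NA last, so each student's non-missing score (if any) comes first
  arrange(IQValue) %>%
  # keep one row per student
  slice(1)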

      

      ID   age IQSource IQValue
 1    55    12    Test1     100
 2    56    12    Test1      90
 3    57    14    Test1      88
 4    58    11    Test1     115
 5    59    20    Test2     100
 6    60    10    Test2     120
 7    61    13    Test3     110
 8    62    15    Test1      NA
 9    63    18    Test3      85
10    64    17    Test3     150

      



If students can have multiple IQ tests ...

mydata %>%
  # Add an ID with multiple IQ tests
  bind_rows(data.frame(ID=65, age=13, Test1=100, Test2=100, Test3=NA)) %>%
  gather(key=IQSource, value=IQValue, Test1:Test3) %>%
  group_by(ID) %>%
  filter(!is.na(IQValue) | all(is.na(IQValue))) %>%
  filter(all(!is.na(IQValue)) | !duplicated(IQValue)) %>%
  arrange(ID, IQSource)

      

      ID   age IQSource IQValue
 1    55    12    Test1     100
 2    56    12    Test1      90
 3    57    14    Test1      88
 4    58    11    Test1     115
 5    59    20    Test2     100
 6    60    10    Test2     120
 7    61    13    Test3     110
 8    62    15    Test1      NA
 9    63    18    Test3      85
10    64    17    Test3     150
11    65    13    Test1     100
12    65    13    Test2     100
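For readers on tidyr 1.0.0 or later, where gather() is superseded by pivot_longer(), the same idea might be sketched as below. This is a variation under that assumption, not the answer's own code; the name tests_long is made up for illustration.

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

tests_long <- mydata %>%
  # Reshape to long format, keeping the NA rows for now
  pivot_longer(Test1:Test3, names_to = "IQSource", values_to = "IQValue") %>%
  group_by(ID) %>%
  # Keep every real score; for a student with no scores at all,
  # keep only the first (all-NA) placeholder row
  filter(!is.na(IQValue) | (all(is.na(IQValue)) & row_number() == 1)) %>%
  ungroup()

tests_long
```

A single filter() handles both cases, so a student with several scores keeps all of them while a student with none keeps exactly one row.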

      


I didn't find a direct solution, but you could right_join the original data.frame back onto the gathered one and then deselect all the columns you don't need.

library(tidyr)
library(dplyr)

mydata %>% 
  gather(key, val, Test1:Test3, na.rm = TRUE) %>%
  right_join(mydata) %>% 
  select(-contains("Test"))
#> Joining, by = c("ID", "age")
#>    ID age   key val
#> 1  55  12 Test1 100
#> 2  56  12 Test1  90
#> 3  57  14 Test1  88
#> 4  58  11 Test1 115
#> 5  59  20 Test2 100
#> 6  60  10 Test2 120
#> 7  61  13 Test3 110
#> 8  62  15  <NA>  NA
#> 9  63  18 Test3  85
#> 10 64  17 Test3 150

      



Alternatively, you could, of course, first create a data.frame with all the variables you want to keep and then join onto it:

id_data <- select(mydata, ID, age)

mydata %>% 
  gather(key, val, Test1:Test3, na.rm = TRUE) %>%
  right_join(id_data)
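A closely related sketch swaps right_join for left_join, starting from the ID table, with the join columns spelled out. That keeps the rows in the order of id_data and avoids the "Joining, by" message. The names long and result are only illustrative.

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# One row per student with the columns that must never be lost
id_data <- select(mydata, ID, age)

# Long format with the NA scores dropped (this is where ID 62 disappears)
long <- gather(mydata, key, val, Test1:Test3, na.rm = TRUE)

# Every row of id_data is kept; students without scores get NA in key and val
result <- left_join(id_data, long, by = c("ID", "age"))
result
```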

      


I think this will do the trick for you:

    library(dplyr)
    library(tidyr)

    # Make another data frame that holds just ID and whether all three tests are missing
    missing <- mydata %>% 
      mutate(allNA = is.na(Test1) & is.na(Test2) & is.na(Test3)) %>%
      select(ID, allNA)

    # Gather and keep NAs  
    tests <- gather(mydata, key=IQSource, value=IQValue, c(Test1, Test2, Test3), na.rm = FALSE)

    # Keep the rows that have an IQValue or belong to a student who missed all tests
    tests <- left_join(tests, missing, by = "ID") %>% 
      filter(!is.na(IQValue) | allNA)
    # Remove duplicated rows of individuals who missed all tests
    tests <- tests[!is.na(tests$IQValue) | !duplicated(tests[["ID"]]), ]
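The same idea could possibly be written without the helper data frame by computing the allNA flag before gathering, so it travels along as an ID column. This is only a sketch of that variation. It assumes Test1 is the first gathered column, which is why the placeholder row for an all-missing student is taken from Test1.

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

tests <- mydata %>%
  # Flag students who are missing all three tests
  mutate(allNA = is.na(Test1) & is.na(Test2) & is.na(Test3)) %>%
  # allNA is not gathered, so it is repeated on every long row
  gather(key = IQSource, value = IQValue, Test1:Test3) %>%
  # Keep real scores, plus a single Test1 placeholder row for all-missing students
  filter(!is.na(IQValue) | (allNA & IQSource == "Test1")) %>%
  select(-allNA)

tests
```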

      
