Na.trim on a specific column with data.table

Question

Na.trim on a specific column with data.table

I am trying to use na.trim

from a package zoo

in a single column object data.table

. This is what my data looks like:

DT <- data.table(id=c(rep("a",3),rep("b",3)),
                 col1=c(NA,1,2,NA,3,NA),col2=c(NA,NA,5,NA,NA,NA))
   id col1 col2
1:  a   NA   NA
2:  a    1   NA
3:  a    2    5
4:  b   NA   NA
5:  b    3   NA
6:  b   NA   NA

I would like to remove rows with leading NA using na.trim

and group by id. Here is the result I am expecting:

   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA

Here is what I have tried so far. This removes the leading NA but removes col2:

DT[,na.trim(col1),by=id]
   id V1
1:  a  1
2:  a  2
3:  b  3

This also doesn't work:

DT[,.SD[na.trim(col1)],by=id]
   id col1 col2
1:  a   NA   NA
2:  a    1   NA
3:  b   NA   NA

+3

r zoo data.table

Pierre lapointe May 17 '17 at 13:04

source to share

2 answers

na.trim

will be used in the same way as with data.table. See ?na.trim

for more information on its arguments.

DT[, na.trim(.SD, sides = "left", is.na = "all"), by = id]

giving:

   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA

ADDED:

In the comments, the poster clarified that na.trim

only one NN columns should work. In this case, add a column with row numbers, .I and after including the subset na.trim

using those row numbers.

DT[DT[, na.trim(data.table(col1, .I), "left"), by = id]$.I, ]

+4

G. Grothendieck May 17 '17 at 14:19

source to share

Jaap · Accepted Answer · 2017-05-17T13:11:39+0000

Possible solution without using the zoo

-package:

DT[DT[, .I[!!cumsum(!is.na(col1))], by = id]$V1]

You get:

   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA

What does it do:

With, DT[, .I[!!cumsum(!is.na(col1))], id]$V1

you create a vector of diamonds that you want to keep. By using !!cumsum(!is.na(col1))

, you make sure that only leading missing values are omitted col1

.
Then you use this vector for a subset of the data table.
!!cumsum(!is.na(col1))

does the same as cumsum(!is.na(col1))!=0

. Usage !!

converts all numbers above zero to TRUE

and all zeros to FALSE

.
.I

is optional, you can also use:, DT[DT[, !!cumsum(!is.na(col1)), by = id]$V1]

which multiplies data.table with a boolean vector.

Two alternatives with cummax

by @lmo from comments:

# alternative 1:
DT[DT[, !!(cummax(!is.na(col1))), by = id]$V1]

# alternative 2:
DT[as.logical(DT[, cummax(!is.na(col1)), by = id]$V1)]

Another alternative to @jogo:

DT[, .SD[!!cumsum(!is.na(col1))], by = id]

Another alternative to @Frank:

DT[, .SD[ rleid(col1) > 1L | !is.na(col1) ], by = id]

Na.trim on a specific column with data.table

More articles: