Split column of rows into variable number of columns using data.table in R

I want to analyze many years of income records in a Quicken house. I exported the file to qif and used the bank2csv program to render the csv. Within Quicken you can use category (eg car, tax), subcategories (eg car: service, car: fuel) and tags (eg self, spouse, son). bank2csv displays categories: subcategories / tag as a concatenated string. I want to put a category in a category column instead, a subcategory in a subcategory column, and put tags in a tag column. I saw a similar question , but alas, strsplit

then it workedunlist

and then indexed each element so that it can be written to the correct location by assignment. It won't work as sometimes there is no tag and sometimes there is no subcategory. It is easy to split the string into a list and store that list in a column, but how does one actually assign the first element of the list to one column and the second element (if it exists) to the second column. There is of course an elegant way.

simplified sample

library(data.table)
library(stringi)
dt <- data.table(category.tag=c("toys/David", "toys/David", "toys/James", "toys", "toys", "toys/James"), transaction=1:6)

      

How to create the third and fourth columns: category, tag. Some tag would beNA

I can do the following, but it doesn't take me very far. I need a way to specify the first or second element of the resulting list (as opposed to the whole list)

dt[, category:= strsplit(x = category.tag, split = "/") ]

      

+3


source to share


4 answers


Just pushed two functions transpose()

and tstrsplit()

in data.table v1.9.5.

With this we can do:

require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

      



tstrsplit

is a wrapper for transpose(strsplit(as.character(x), ...))

. And you can also pass fill=.

to fill in missing values ​​with any other value than NA

.

transpose()

can also be used for lists, data frames and data tables.

+6


source


you can use cSplit



library(splitstackshape)
dt[, c("category", "tag") := cSplit(dt[,.(category.tag)], "category.tag", "/")]
dt
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

      

+4


source


1) Tryread.table

read <- function(x) read.table(text = x, sep  = "/", fill = TRUE, na.strings = "")
dt[, c("category", "tag") := read(category.tag)]

      

No additional packages are required.

2) An alternative is to use separate

in the tidyr package:

library(tidyr)
separate(dt, category.tag, c("category", "tag"), extra = "drop")

      

The above tidyr version is 0.1.0.9000 from github. To install it, make sure the package is installed devtools R and instructs: devtools::install_github("hadley/tidyr")

.

Update: Comments and minor improvements made by Richard Scrivens. Added tidyr solution.

+2


source


Since you already have "stringi", you can also look at the argument stri_split_fixed

and simplify

:

setnames(cbind(dt, stri_split_fixed(dt$category.tag, "/", simplify = TRUE)), 
         c("V1", "V2"), c("category", "tag"))[]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

      

Although I must admit that I am partial to cSplit

: -)

+1


source







All Articles