Mining sequences from data rows
Long-time reader, first-time asker. I have an R data frame with a single column of 267,000 rows, stored as a factor with 17 levels, for example:
regions
VE
PU
PR
DE
NU
AD
DE
NO
AD
I am trying to extract them into columns as sequences of length 2 and 3, then move down one row and repeat until the end, keeping repetitions and the order of occurrence. I want to take the above and produce this:
s1 s2
VE PU
PU PR
PR DE
DE NU
NU AD
AD DE
DE NO
I've tried packages like TraMineR and arulesSequences, but I can't seem to figure them out. I think this is because my sequences are pure states: there is no timing, even in the original dataset. I also tried to write my own iterator scripts, without success. I've googled endlessly and I'm at my wits' end. The ultimate goal is to match the output against data frames of 2- or 3-permutations, code matches as 1 and non-matches as 0, and repeat that 49 times into a new data frame.
I am not an expert in programming or R, just a beginner user. Does anyone know of a script or package that can do this?
Basically you want to assign regions without its last observation to s1 and regions without its first observation to s2. You don't necessarily need additional packages for this. There are several approaches:
1) Using the functions head and tail
With their help, you can get a vector without its last observation ( head(column, -1) ) or without its first observation ( tail(column, -1) ).
Using:
new.df <- data.frame(s1 = head(df$regions,-1), s2 = tail(df$regions,-1))
This gives you:
> new.df
s1 s2
1 VE PU
2 PU PR
3 PR DE
4 DE NU
5 NU AD
6 AD DE
7 DE NO
8 NO AD
If you need three columns, you can do:
new.df <- data.frame(s1 = head(df$regions,-2),
s2 = head(tail(df$regions,-1),-1),
s3 = tail(df$regions,-2))
which gives:
> new.df
s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
2) Base subsetting
As an alternative to the head and tail functions, you can also use base subsetting:
new.df <- data.frame(s1 = df$regions[-nrow(df)],
s2 = df$regions[-1])
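The same indexing idea extends to any window length. The following is a minimal sketch, not part of the original answer: make_windows is a hypothetical helper name, and it assumes the input can be converted to a character vector.

```r
# Hypothetical helper (not from the original answer): build all overlapping
# windows of length n from a vector using plain indexing.
make_windows <- function(x, n) {
  x <- as.character(x)            # drop factor structure, keep the labels
  rows <- length(x) - n + 1       # number of complete windows
  if (rows < 1) stop("x is shorter than the window length n")
  # Column i holds the elements at positions i, i + 1, ..., i + rows - 1
  cols <- lapply(seq_len(n), function(i) x[i:(i + rows - 1)])
  names(cols) <- paste0("s", seq_len(n))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
pairs   <- make_windows(regions, 2)   # 8 rows, columns s1 and s2
triples <- make_windows(regions, 3)   # 7 rows, columns s1, s2 and s3
```

With the question's data this would be called as make_windows(df$regions, 3).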
3) Using the embed function
n <- 3
# embed() needs a plain vector or matrix, so convert a factor column first
new.df <- data.frame(embed(as.character(df$regions), n)[,n:1])
names(new.df) <- paste0('s',1:n)
which gives:
> new.df
s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
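The column reversal [, n:1] is needed because embed puts the most recent element of each window in the first column. A small illustration on a made-up vector (x and the letters are for demonstration only):

```r
x <- c("a", "b", "c", "d", "e")
m <- embed(x, 3)
# Each row of m is one window, latest element first:
#      [,1] [,2] [,3]
# [1,] "c"  "b"  "a"
# [2,] "d"  "c"  "b"
# [3,] "e"  "d"  "c"
m_ordered <- m[, 3:1]   # reverse the columns so each row reads in original order
```

Note that embed only accepts a plain vector or a matrix, so a factor column is best passed through as.character() first.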
4) Using the shift function from the data.table package
The shift function from the data.table package can also be used:
library(data.table)
dt <- as.data.table(df)
new.dt <- na.omit(dt[, .(s1 = regions,
s2 = shift(regions, 1, NA, 'lead'),
s3 = shift(regions, 2, NA, 'lead'))])
Instead of na.omit, you can also use rowSums on is.na:
new.dt <- dt[, .(s1 = regions,
s2 = shift(regions, 1, NA, 'lead'),
s3 = shift(regions, 2, NA, 'lead'))]
new.dt[rowSums(is.na(new.dt))==0]
You can also use transmute and lead from the dplyr package:
df1 <-read.table(text="regions
VE
PU
PR
DE
NU
AD
DE
NO
AD",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>% transmute(s1=regions,s2=lead(regions)) %>%na.omit
s1 s2
1 VE PU
2 PU PR
3 PR DE
4 DE NU
5 NU AD
6 AD DE
7 DE NO
8 NO AD
If you want sequences of 3, you can add another column with lead(regions, 2):
df1 %>% transmute(s1=regions,s2=lead(regions),s3=lead(regions,2)) %>%na.omit
s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
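For the matching step mentioned in the question, one possible approach is to collapse each row into a single key and compare keys with %in%. This is only a sketch under assumptions: the reference table perms and its contents are made up for illustration, and the question does not say how the permutation data frames are actually laid out.

```r
regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
pairs <- data.frame(s1 = head(regions, -1), s2 = tail(regions, -1),
                    stringsAsFactors = FALSE)

# Hypothetical reference table of pairs to look for (invented for this example)
perms <- data.frame(s1 = c("PU", "DE", "AD"), s2 = c("PR", "NO", "VE"),
                    stringsAsFactors = FALSE)

# Collapse each row into one string so whole rows can be compared at once
key_pairs <- paste(pairs$s1, pairs$s2, sep = "_")
key_perms <- paste(perms$s1, perms$s2, sep = "_")

# 1 where the observed pair occurs in the reference table, 0 otherwise
pairs$match <- as.integer(key_pairs %in% key_perms)
```

The same pattern should carry over to sequences of 3 by pasting three columns into the key. If one indicator column per permutation is needed instead, the comparison would have to be repeated for each reference row.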