Mining sequences from data rows

Long-time reader, first-time asker. I have an R data frame with a single column of 267,000 rows, stored as a factor with 17 levels, for example:

regions
VE
PU
PR
DE
NU
AD
DE
NO
AD

      

I am trying to extract sequences of length 2 and 3 as columns: take the first 2 (or 3) values, then go down one row and repeat to the end, keeping repetitions and the order of appearance. I want to take the above and produce this:

s1   s2
VE   PU
PU   PR
PR   DE
DE   NU
NU   AD
AD   DE
DE   NO

      

I've tried packages like TraMineR and arulesSequences, but I can't seem to figure them out. I think this is because my sequences are pure states: there is no timing, even in the original dataset. I also tried to write my own iterator scripts, without success. I've googled endlessly and I'm at the end of my rope. The ultimate goal is to match the outputs against data frames of 2- and 3-permutations, code matches as 1 and non-matches as 0, and process that x49 into a new data frame.

I am not an expert in programming or R, just a beginner. Does anyone know of a script or package that can do this?



2 answers


Basically, you want to assign regions without its last observation to s1 and regions without its first observation to s2. You don't necessarily need additional packages for this. There are several approaches:

1) Using the functions head and tail

With their help, you can get a vector without its last observation ( head(column, -1) ) or without its first observation ( tail(column, -1) ).

Using:

new.df <- data.frame(s1 = head(df$regions,-1), s2 = tail(df$regions,-1))

      

will get you:

> new.df
  s1 s2
1 VE PU
2 PU PR
3 PR DE
4 DE NU
5 NU AD
6 AD DE
7 DE NO
8 NO AD

      

If you need three columns, you can do:

new.df <- data.frame(s1 = head(df$regions,-2), 
                     s2 = head(tail(df$regions,-1),-1),
                     s3 = tail(df$regions,-2))

      

which gives:

> new.df
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD

      

2) Using base R subsetting

As an alternative to the head and tail functions, you can also use base R subsetting with negative indices:

new.df <- data.frame(s1 = df$regions[-nrow(df)], 
                     s2 = df$regions[-1])
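
The same index arithmetic extends to any window length. The following is only a sketch: sliding_windows is a hypothetical helper name, not something from the answer or from a package, and the example vector is the sample data from the question. It converts the input to character first, on the assumption that plain strings are wanted in the result rather than factor codes.

```r
# Hypothetical helper (not part of the original answer): build every window
# of length n from a vector using plain index ranges.
sliding_windows <- function(x, n) {
  x <- as.character(x)              # assumption: plain strings are wanted
  m <- length(x) - n + 1            # number of complete windows
  stopifnot(n >= 1, m >= 1)
  # Column k holds elements k, k + 1, ..., k + m - 1 of the input.
  cols <- lapply(seq_len(n), function(k) x[k:(k + m - 1)])
  names(cols) <- paste0("s", seq_len(n))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
win2 <- sliding_windows(regions, 2)   # 8 rows, columns s1 and s2
win3 <- sliding_windows(regions, 3)   # 7 rows, columns s1, s2 and s3
```

With a real data frame the call would be sliding_windows(df$regions, 3).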

      

3) Using the embed function

n <- 3
# embed() needs a plain vector, so convert a factor to character first
new.df <- data.frame(embed(as.character(df$regions), n)[,n:1])
names(new.df) <- paste0('s',1:n)

      

which gives:

> new.df
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
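
The [, n:1] in the embed call is there because embed puts the latest value of each window in the first column. A small sketch on a made-up four-element character vector shows the raw result and the reordered one. It assumes character input: as far as I know, embed stops with an error when given a factor, so a factor column would need as.character first.

```r
x <- c("VE", "PU", "PR", "DE")

# Raw embed output: column 1 is the later value, column 2 the earlier one.
raw <- embed(x, 2)
raw
#      [,1] [,2]
# [1,] "PU" "VE"
# [2,] "PR" "PU"
# [3,] "DE" "PR"

# Reversing the column order restores reading order (earlier value first).
fixed <- raw[, 2:1]
fixed
#      [,1] [,2]
# [1,] "VE" "PU"
# [2,] "PU" "PR"
# [3,] "PR" "DE"
```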

      

4) Using the shift function from the data.table package

The shift function from the data.table package can also be used:

library(data.table)
dt <- as.data.table(df)
new.dt <- na.omit(dt[, .(s1 = regions,
                         s2 = shift(regions, 1, NA, 'lead'),
                         s3 = shift(regions, 2, NA, 'lead'))])

      

Instead of na.omit, you can also use rowSums on is.na:

new.dt <- dt[, .(s1 = regions,
                 s2 = shift(regions, 1, NA, 'lead'),
                 s3 = shift(regions, 2, NA, 'lead'))]

new.dt[rowSums(is.na(new.dt))==0]
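
For longer windows, writing one shift call per column gets repetitive. A possible shortcut, sketched here on the sample data from the question, relies on shift accepting a vector of offsets for n and returning one shifted copy per offset. It assumes an installed data.table version in which offset 0 is accepted and returns the column unchanged.

```r
library(data.table)

regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")
dt <- data.table(regions = regions)

n <- 3
# One lead per offset 0, 1, ..., n - 1; the list becomes columns V1..Vn.
new.dt <- na.omit(dt[, shift(regions, n = 0:(n - 1), type = "lead")])
# Rename the default V1..Vn columns to s1..sn by reference.
setnames(new.dt, paste0("s", 1:n))
new.dt
```

Changing n to 2 should give the two-column version without touching the rest of the code.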

      



You can also use transmute and lead from the dplyr package:

df1 <-read.table(text="regions
VE
PU
PR
DE
NU
AD
DE
NO
AD",header=TRUE, stringsAsFactors=FALSE)

library(dplyr)
df1 %>% transmute(s1=regions,s2=lead(regions)) %>%na.omit

  s1 s2
1 VE PU
2 PU PR
3 PR DE
4 DE NU
5 NU AD
6 AD DE
7 DE NO
8 NO AD

      



If you want sequences of 3, you can add another column with lead(regions,2):

df1 %>% transmute(s1=regions,s2=lead(regions),s3=lead(regions,2)) %>%na.omit
  s1 s2 s3
1 VE PU PR
2 PU PR DE
3 PR DE NU
4 DE NU AD
5 NU AD DE
6 AD DE NO
7 DE NO AD
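
For the matching step mentioned in the question (1 for a match against a permutation table, 0 otherwise), one possible approach is to collapse each row into a single key and test membership with %in%. This is only a sketch in base R: the perms table below is made up for illustration, and row_key is a hypothetical helper name. It also assumes the region codes never contain the separator "_".

```r
regions <- c("VE", "PU", "PR", "DE", "NU", "AD", "DE", "NO", "AD")

# Pairs of consecutive regions, built with head/tail.
pairs <- data.frame(s1 = head(regions, -1), s2 = tail(regions, -1),
                    stringsAsFactors = FALSE)

# Hypothetical reference table of pairs to look for (invented values).
perms <- data.frame(s1 = c("PU", "AD", "NO"), s2 = c("PR", "DE", "VE"),
                    stringsAsFactors = FALSE)

# Collapse each row into one string so whole rows can be compared at once.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "_"))

# 1 where the pair occurs in perms, 0 where it does not.
pairs$match <- as.integer(row_key(pairs[c("s1", "s2")]) %in% row_key(perms))
pairs$match
# [1] 0 1 0 0 0 1 0 0
```

Because row_key pastes however many columns it is given, the same lines should work for three-column windows against a three-column reference table.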

      
