Finding the most common substrings in a subsequence
I am trying to find the most common substrings (i.e., subsequences in which all events are contiguous) in my sequences. The user manual says about the methods for finding a subsequence:
"The idea of a subsequence is an extension of the notion of a substring and is described in detail in Elzinga (2008). Although a substring of a sequence necessarily consists of contiguous characters, this requirement is relaxed with the notion of a subsequence. Thus, if x = abac, λ (empty string), u = b, v = bac and w = bc belong to the set of subsequences of x, but only λ, u = b and v = bac are substrings of x."
Is there a way to turn off this relaxation and only look at substrings? This is specifically for use with the seqefsub command. I can't find anything about this in the TraMineR manual, so any help is appreciated! Thanks, Andrew
Although TraMineR has no special function for substrings, you can get similar results by playing around with time constraints.
For example, if you pass a maxGap=1 constraint to seqefsub, you get frequent subsequences formed with events occurring at two consecutive time points. Below I illustrate this with the actcal data set that ships with TraMineR.
library(TraMineR)
data(actcal)
## creating a state sequence object
actcal.seq <- seqdef(actcal, 13:24,
    labels=c("> 36 hours", "19 to 36 hours", "< 19 hours", "no work"))
## transforming into an event sequence object
actcal.seqe <- seqecreate(actcal.seq, tevent="state")
## frequent subsequences without constraints
fsubs <- seqefsub(actcal.seqe, pMinSupport=.01)
library(TraMineRextras)
fsubsn <- seqentrans(fsubs)
## displaying only subsequences with at least 2 events
fsubsn[fsubsn$data$nevent>1]
## Now with the maxGap=1 constraint
cstr <- seqeconstraint(maxGap=1)
fsstr <- seqefsub(actcal.seqe, pMinSupport=.01, constraint=cstr)
fsstrn <- seqentrans(fsstr)
fsstrn[fsstrn$data$nevent>1]
In this example, you get subsequences whose events occur at successive positions (time points). To get subsequences of consecutive events regardless of the time elapsed between them, define the event sequences with timestamps that are consecutive numbers, for example:
id event timestamp
1 A 1
1 C 2
1 B 3
2 C 1
2 B 2
3 A 1
3 B 2
...
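If your events carry real, irregular timestamps, one way to obtain such consecutive numbers is to replace each timestamp by its rank within the sequence. The following is only a minimal sketch in base R: the data frame tse and its values are made up for illustration, and it assumes that no two events of the same id share a timestamp.

```r
## Hypothetical event data in time-stamped-event form.
## The timestamps are irregular, so a maxGap=1 constraint applied to them
## would not treat, e.g., A (t=2) and C (t=7) of id 1 as adjacent.
tse <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3),
  event     = c("A", "C", "B", "C", "B", "A", "B"),
  timestamp = c(2, 7, 8, 1, 10, 4, 5)
)

## Sort by id and time so that row order matches event order.
tse <- tse[order(tse$id, tse$timestamp), ]

## Replace each timestamp by its position within its own sequence (1, 2, 3, ...).
tse$timestamp <- ave(tse$timestamp, tse$id, FUN = seq_along)

tse
```

The renumbered columns can then be passed to seqecreate through its id, timestamp and event arguments, and the resulting event sequence object mined with the maxGap=1 constraint as shown earlier.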
Hope it helps