Extracting elements between symbol and space
I'm having a hard time extracting the elements between a / and a blank space. I can do this when I have two delimiter characters like < and >, but the space is throwing me. I would like the most efficient way to do this in base R, as it will be applied to thousands of vectors.
I would like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
Into this:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
EDIT:
Thanks everyone for the answers. I'm going for speed, so Andres' code wins. DWin's code wins for the best code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and was not in base R, but it is quite understandable (which is indeed the intent of the stringr package, I think, since this seems to be Hadley's philosophy with most things).
I appreciate your help. Thanks again.
I thought I would include the benchmarking, since this will be lapplied over a few thousand vectors:
    test replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3   DIRK        10000    1.29 1.216981      1.20        0
2   DWIN        10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
Similar, but slightly more concise:
#1- Separate the elements by the blank space
y=unlist(strsplit(x,' '))
#2- extract just what you want from each element:
sub('^.*/([^ ]+).*$','\\1',y)
Here ^ and $ anchor the start and end of the string, .* matches any run of characters, [^ ]+ matches one or more non-space characters, and \\1 refers to the first captured group (the part in parentheses).
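Since the question mentions lapplying this over many vectors, here is a minimal sketch of the two steps wrapped in a function. The name extract_tags and the sample sentences are made up for illustration; they are not part of the answer.

```r
# Hypothetical helper (name is an assumption): split on single spaces,
# then keep only what follows the last "/" in each token.
extract_tags <- function(s) {
  tokens <- unlist(strsplit(s, " ", fixed = TRUE))
  sub("^.*/([^ ]+).*$", "\\1", tokens)
}

# Illustrative input: a list of tagged sentences.
sentences <- list(
  "This/DT is/VBZ a/DT",
  "short/JJ sentence/NN"
)

tags <- lapply(sentences, extract_tags)
print(tags)
```

Note that sub() returns a token unchanged when the pattern does not match, so a token with no slash would come back as-is rather than being dropped.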
Use a regex pattern that matches either a forward slash or whitespace:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
I didn't read the question closely enough. This result can be indexed to extract the even-numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
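The index above hardcodes the length 24. As a sketch of a variant that does not depend on the length, a recycled logical index picks every second element. This assumes, as in the example, that every token has exactly one slash and tokens are separated by single spaces; the shortened x below is only for illustration.

```r
x <- "This/DT is/VBZ a/DT short/JJ"

# Split on either a forward slash or whitespace.
parts <- strsplit(x, "/|\\s")[[1]]

# c(FALSE, TRUE) is recycled along the vector, keeping positions 2, 4, 6, ...
tags <- parts[c(FALSE, TRUE)]
print(tags)
```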
Here's a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+            "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of the "text before the slash":
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT  VBZ  DT  JJ  NN  VBG  IN  DT  JJ  NNS  CC  VBG"
R>
after which it's just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering out the "". There may be a more compact way to do the latter.
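One possibly more compact way to drop the empty strings, as a sketch rather than the answer's own method, is to filter with nzchar() instead of reshaping into a matrix. The shortened x is only for illustration.

```r
x <- "This/DT is/VBZ a/DT"

# Replace the word before each slash (and the slash) with a space, then split.
pieces <- strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]]

# nzchar() is TRUE for non-empty strings, so this drops every "".
tags <- pieces[nzchar(pieces)]
print(tags)
```

Unlike the matrix approach, this does not rely on the empty strings and the tags strictly alternating.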
The stringr package has nice functions for working with strings, with very intuitive names. Here's how you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slash:
library(stringr)
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
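For comparison, the same extract-all-matches idea can be sketched in base R with gregexpr() and regmatches(). This is a swapped-in technique, not part of the stringr answer: it uses a Perl-style lookbehind so the slash is never included in the match, which removes the need for the second step. The shortened x is only for illustration.

```r
x <- "This/DT is/VBZ nouns,/JJ adjectives./VBG"

# (?<=/) is a lookbehind: match word characters only when preceded by "/".
# perl = TRUE is required for lookbehind support in base R regexes.
tags <- regmatches(x, gregexpr("(?<=/)\\w+", x, perl = TRUE))[[1]]
print(tags)
```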