Extracting elements between symbol and space

Question

I'm having a hard time extracting elements between a / and a blank space. I can do this when I have two characters like < and >, but the space is throwing me. I would like the most efficient way to do this in base R, as it will be applied to thousands of vectors.

I would like to turn this:

x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

Into this:

 [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"

EDIT:

Thanks everyone for the answers. I'm going for speed, so Andres' code wins. DWin's code wins for the best code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and isn't in base R, but it is quite understandable (which is indeed the intent of the stringr package, I think, since this seems to be Hadley's philosophy with most things).

I appreciate your help. Thanks again.

I thought I would include the benchmarking, since it will be lapplied over a few thousand vectors:

    test replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3   DIRK        10000    1.29 1.216981      1.20        0
2   DWIN        10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
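
A comparison along these lines can be sketched in base R alone. The wrapper names andres, dirk and dwin are hypothetical labels for the three base R answers in this thread, and system.time stands in for whatever tool produced the table above, so the figures will differ by machine. The stringr answer is left out to keep the sketch free of package dependencies.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# Hypothetical wrappers around the three base R answers in this thread
andres <- function(s) sub("^.*/([^ ]+).*$", "\\1", unlist(strsplit(s, " ")))
dirk <- function(s) {
  parts <- strsplit(gsub("[a-zA-Z.,]*/", " ", s), " ")[[1]]
  matrix(parts, ncol = 2, byrow = TRUE)[, 2]
}
dwin <- function(s) {
  parts <- strsplit(s, "/|\\s")[[1]]
  parts[seq(2, length(parts), by = 2)]
}

# Confirm the approaches agree before timing them
stopifnot(identical(andres(x), dirk(x)), identical(andres(x), dwin(x)))

# Elapsed seconds for 10000 repetitions of each approach
timings <- sapply(
  list(ANDRES = andres, DIRK = dirk, DWIN = dwin),
  function(f) system.time(for (i in seq_len(10000)) f(x))[["elapsed"]]
)
print(timings)
```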

+3

r

Tyler Rinker Mar 31 '12 at 19:51

4 answers

Use a regex pattern that matches either a forward slash or whitespace:

strsplit(x, "/|\\s" )
[[1]]
 [1] "This"        "DT"          "is"          "VBZ"         "a"           "DT"          "short"      
 [8] "JJ"          "sentence"    "NN"          "consisting"  "VBG"         "of"          "IN"         
[15] "some"        "DT"          "nouns,"      "JJ"          "verbs,"      "NNS"         "and"        
[22] "CC"          "adjectives." "VBG"

I didn't read the question closely enough. This result can be used to extract the even-numbered elements:

strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
 [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"
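
The hard-coded 24 above only fits this particular sentence. A minimal sketch that works for any length, assuming every token contains exactly one slash and tokens are separated by single whitespace characters (the names parts and tags are illustrative):

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

parts <- strsplit(x, "/|\\s")[[1]]

# The logical index c(FALSE, TRUE) is recycled along the vector,
# so it keeps every second element regardless of the total length
tags <- parts[c(FALSE, TRUE)]
print(tags)
```

If a word could itself contain a slash, the alternation would put the tags out of step, so this relies on the input being as regular as the example.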

+3

42- Mar 31 12 at 20:05

Here's a one-liner:

R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+            "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")), 
+         ncol=2, byrow=TRUE)[,2]
 [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"
R>

The key is to get rid of the "text before the slash":

R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT  VBZ  DT  JJ  NN  VBG  IN  DT  JJ  NNS  CC  VBG"
R>

after which it's just a matter of splitting the string

R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
 [1] ""    "DT"  ""    "VBZ" ""    "DT"  ""    "JJ"  ""    "NN" 
[11] ""    "VBG" ""    "IN"  ""    "DT"  ""    "JJ"  ""    "NNS"
[21] ""    "CC"  ""    "VBG"

and filtering out the "". There is probably a more compact way to do that last bit.
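
One possible way to drop the empty strings directly, instead of reshaping into a two-column matrix, is to index with nzchar. This is a sketch of an alternative, not the answer's own method, and the names parts and tags are illustrative:

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# Blank out the text before each slash, then split on single spaces
parts <- strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]]

# nzchar() is TRUE for non-empty strings, so this drops the "" entries
tags <- parts[nzchar(parts)]
print(tags)
```

Unlike the matrix version, this does not depend on the empty strings and the tags strictly alternating.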

+2

Dirk Eddelbuettel Mar 31 12 at 20:04

The stringr package has nice functions for working with strings, with very intuitive names. Here's how you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slash:

library(stringr)

str_extract_all(x, "/\\w*")
# [[1]]
#  [1] "/DT"  "/VBZ" "/DT"  "/JJ"  "/NN"  "/VBG" "/IN"  "/DT"  "/JJ"  "/NNS"
# [11] "/CC"  "/VBG"

str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
#  [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"
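
For comparison, the same extract-then-strip idea can be sketched in base R with gregexpr and regmatches. This is an assumed equivalent rather than part of the stringr answer, and the names m, with_slash and tags are illustrative:

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# Locate every slash followed by word characters
m <- gregexpr("/\\w*", x)

# Pull out the matched substrings, leading slash included
with_slash <- regmatches(x, m)[[1]]

# Keep everything from the second character onward
tags <- substring(with_slash, 2)
print(tags)
```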

+1

flodel Mar 31 12 at 21:20

aatrujillob · Accepted Answer · 2012-03-31T20:30:20+0000

Similar, but slightly more concise:

# 1. Split the string into elements at each blank space

    y <- unlist(strsplit(x, ' '))

# 2. Extract just what you want from each element:

    sub('^.*/([^ ]+).*$', '\\1', y)

Here ^ and $ are the start and end anchors, respectively; .* matches any sequence of characters; [^ ]+ matches one or more non-space characters; and \\1 refers to the first captured group, i.e. whatever the parentheses matched.
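
To make the pattern concrete, here is a small sketch of it on a single token and then over several sentences with lapply, as described in the question. The helper name extract_tags and the sample sentences list are hypothetical:

```r
# On one token: ^.* consumes "nouns,", the slash matches literally,
# and the group ([^ ]+) captures "JJ", which \\1 puts back
one_tag <- sub("^.*/([^ ]+).*$", "\\1", "nouns,/JJ")
print(one_tag)

# A token without a slash does not match, so sub() returns it unchanged
no_slash <- sub("^.*/([^ ]+).*$", "\\1", "plain")
print(no_slash)

# Hypothetical helper wrapping the two steps of this answer
extract_tags <- function(s) {
  sub("^.*/([^ ]+).*$", "\\1", unlist(strsplit(s, " ")))
}

# Applied over a list of tagged sentences
sentences <- list("This/DT is/VBZ", "a/DT short/JJ sentence/NN")
tag_list <- lapply(sentences, extract_tags)
print(tag_list)
```

Because untagged tokens pass through unchanged, input that may contain them would need an extra check, for example keeping only the elements where grepl("/", y, fixed = TRUE) is TRUE.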
