How do I extract the first number from each row into a vector in R?

Question

I am new to regex in R. I have a character vector, and I want to retrieve the first occurrence of a number in each element of the vector.

I have a vector called "shootsummary" that looks like this.

> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                         
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.                           
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.      
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.

The first number in each line is the person's age, and I want to extract these ages without mixing them with the other numbers in each line.

I used:

as.numeric(gsub("\\D", "", shootsummary))

The result was:

[1]  34128     42     23     27   6419

I'm looking for a result that contains only the ages, without the other numbers that occur after the age:

[1]  34     42     23     27   64

+3

regex vector r

user3563667 17 Sep 14 at 8:05


7 replies

stringi will be faster:

library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"

+3

akrun 17 Sep 14 at 8:14


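What about splitting each string on commas and taking the second field? This assumes the age is always the second comma-separated item: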

splitbycomma <- strsplit(shootsummary, ",")
as.numeric(  sapply(splitbycomma, "[", 2)  )

+1

Berry boessenkool 17 Sep '14 at 8:10


In R, regmatches() combined with regexpr() returns a vector with the first regex match in each element. Elements without a match are dropped from the result:

regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE))
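Because regmatches() drops elements that contain no match, the result can be shorter than the input. Here is a minimal sketch of one way to keep the positions aligned, assuming a missing age should become NA. The example strings and variable names are illustrative and not taken from the original data:

```r
# Illustrative input: the second element contains no digits at all.
x <- c("Kurt Myers, 64, shot six people", "No age is given here")

# regexpr() returns the start of the first match, or -1 if there is none.
m <- regexpr("[0-9]+", x)

# Pre-fill with NA, then write the matches only where one was found.
ages <- rep(NA_character_, length(x))
ages[m != -1] <- regmatches(x, m)

# Convert to numeric; the NA entry stays NA.
ages <- as.numeric(ages)
```

This keeps the output the same length as the input, which matters if the ages are later bound to other columns.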

+1

Tim Pietzcker 17 Sep 14 at 8:12


One option is str_extract from stringr, wrapped in as.numeric.

> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64

Update: In response to your question in the comments to this answer, here's a bit of explanation. A full description of the function can be found in its help file.

str_extract returns the first occurrence of the regular expression. It is vectorized over a character vector in its first argument.
The regular expression [0-9]+ matches any character from '0' to '9', one or more times.
as.numeric changes the resulting character vector to a numeric vector.

+1

Rich scriven 17 Sep 14 at 8:13


You can use sub:

test <- ("xff 34 sfsdg 352 efsrg")

sub(".*?(\\d+).*", "\\1", test)
# [1] "34"

How does the regular expression work?

. matches any character. The quantifier * means any number of occurrences. The ? makes that quantifier lazy, so .*? matches as few characters as possible, stopping at the first \\d (digit). The quantifier + means one or more occurrences. The parentheses around \\d+ form the first capture group. This may be followed by additional characters (.*). The second argument (\\1) replaces the entire string with the first capture group (i.e., the first number).
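To see why the ? matters, the lazy pattern can be compared with a greedy one on the same test string. This is a small sketch; perl = TRUE is an addition here (not part of the answer above), used so that lazy matching follows well-defined PCRE semantics:

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy .*? stops as early as possible, so the group captures the first number.
lazy <- sub(".*?(\\d+).*", "\\1", test, perl = TRUE)
# lazy is "34"

# Greedy .* consumes as much as it can and leaves only the final digit
# for \\d+, so the group captures "2" (the end of 352).
greedy <- sub(".*(\\d+).*", "\\1", test, perl = TRUE)
# greedy is "2"
```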

+1

Sven Hohenstein 17 Sep '14 at 8:21


You can do this easily with the function first_number() from the filesstrings package. For more general needs, the package also has a function nth_number(). Install it with install.packages("filesstrings").

library(filesstrings)
#> Loading required package: stringr
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                  "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "Dennis Clark III, 27, shot and killed his girlfriend ...",
                  "Kurt Myers, 64, shot six people in neighboring ..."
)
first_number(shootsummary)
#> [1] 34 42 23 23 27 64
nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64

0

Rory Nolan Feb 23 17 at 19:15


Avinash Raj · Accepted Answer · 2014-09-17T08:34:44+0000

You can try the following sub command:

> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"

Explanation:

^ asserts that we are at the beginning of the line.
\D* matches zero or more non-digit characters.
(\d+) then captures one or more digits into group 1 (the first number).
.* matches any character zero or more times.
$ asserts that we are at the end of the line.
Finally, all matched characters are replaced with the characters captured in the first group.
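Note that sub() returns a string unchanged when the pattern does not match, so a line without any number would come back as the full sentence. Here is a hedged sketch of applying the same pattern to a whole vector and converting to numeric, assuming lines without a number should become NA. The input strings are shortened and illustrative, and the grepl() guard is an addition, not part of the answer above:

```r
# Illustrative input: the second element has no digits.
x <- c("Aaron Alexis, 34, killed 12 people and wounded 8",
       "No age is given here")

# Same pattern as above: skip non-digits, capture the first run of digits.
first <- sub("^\\D*(\\d+).*$", "\\1", x)

# Where no digit exists, sub() left the string untouched; mark those as NA.
ages <- as.numeric(ifelse(grepl("\\d", x), first, NA_character_))
```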
