How do I extract the first number from each row into a vector in R?
I am new to regex in R. I have a vector, and I want to retrieve the first occurrence of a number in each element of it.
I have a vector called "shootsummary" that looks like this.
> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.
The first number in each line is the person's age, and I want to extract the ages from these lines without mixing them up with the other numbers in those lines.
I used:
as.numeric(gsub("\\D", "", shootsummary))
The result was:
[1] 34128 42 23 27 6419
I'm looking for a result that contains only the ages, without the other numbers that occur after the age:
[1] 34 42 23 27 64
You can try the following sub command:
> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"
Explanation:
- ^ asserts that we are at the beginning of the line.
- \D* matches zero or more non-digit characters.
- (\d+) then captures one or more digits into group 1 (the first number).
- .* matches any character zero or more times.
- $ asserts that we are at the end of the line.
- Finally, all matched characters are replaced with the characters captured in the first group.
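As a rough sketch of how this pattern behaves in practice, the snippet below wraps the call in as.numeric and shows one caveat: when a string contains no digits, sub finds no match and returns the input unchanged, so as.numeric then yields NA (with a warning). The strings are shortened stand-ins for the question's data, and the third one is a made-up edge case.

```r
# Shortened stand-ins for the question's data; the third string is a made-up edge case.
test <- c("Aaron Alexis, 34, opened fire, killing 12 people and wounding 8",
          "Pedro Vargas, 42, set fire to his apartment",
          "no digits here")

# Keep only the first run of digits in each string.
first <- sub("^\\D*(\\d+).*$", "\\1", test)
first
# The third element comes back unchanged because the pattern did not match.

# Convert to numeric; the unmatched string becomes NA (the warning is silenced here).
ages <- suppressWarnings(as.numeric(first))
ages
```

If strings without digits are possible in the real data, it is worth checking for NA values after the conversion.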
One option is str_extract from stringr, wrapped in as.numeric.
> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64
Update: In response to your question in the comments on this answer, here's a bit of explanation. A full description of the function can be found in its help file.
- str_extract returns the first occurrence of the regular expression. It is vectorized over a character vector in its first argument.
- The regular expression [0-9]+ matches any character '0' to '9', one or more times.
- as.numeric converts the resulting character vector to a numeric vector.
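If adding a dependency is not an option, roughly the same result can be sketched in base R with regexpr and regmatches. This is an alternative technique, not a description of how str_extract works internally, and extract_first_int is a hypothetical helper name. Like str_extract, it returns NA for strings that contain no digits.

```r
# Hypothetical helper: first run of digits in each string, as a number.
extract_first_int <- function(x) {
  m <- regexpr("[0-9]+", x)            # position of the first match, -1 if none
  hit <- !is.na(m) & m != -1           # which elements actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches returns only the matched pieces
  as.numeric(out)
}

# Shortened stand-ins for the question's data, plus a made-up string without digits.
x <- c("Aaron Alexis, 34, killing 12 people and wounding 8",
       "Kurt Myers, 64, after a nearly 19-hour standoff",
       "no digits here")
extract_first_int(x)
```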
You can use sub:
test <- ("xff 34 sfsdg 352 efsrg")
sub(".*?(\\d+).*", "\\1", test)
# [1] "34"
How does the regular expression work?
. matches any character. The quantifier * means any number of occurrences. The ? after .* makes it lazy, so it matches as few characters as possible and stops right before the first digit (\\d). The quantifier + means one or more occurrences. The parentheses around \\d+ form the first capture group. This may be followed by additional characters (.*). The second argument (\\1) replaces the entire string with the first capture group (i.e., the first number).
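To see why the ? matters, here is a small comparison on the same test string. Without it, the leading .* is greedy and consumes as much as it can, leaving only the final digit of the last number for the capture group.

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy .*? stops as early as possible, so the group captures the first number.
lazy <- sub(".*?(\\d+).*", "\\1", test)

# Greedy .* runs as far as it can and leaves a single digit for the group.
greedy <- sub(".*(\\d+).*", "\\1", test)

lazy    # "34"
greedy  # "2"
```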
You can do this neatly with the first_number() function from the filesstrings package; for more general needs, the package also has nth_number(). Install it with install.packages("filesstrings").
library(filesstrings)
#> Loading required package: stringr
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
"Pedro Vargas, 42, set fire to his apartment, killed six ...",
"John Zawahri, 23, armed with a homemade assault rifle ...",
"John Zawahri, 23, armed with a homemade assault rifle ...",
"Dennis Clark III, 27, shot and killed his girlfriend ...",
"Kurt Myers, 64, shot six people in neighboring ..."
)
first_number(shootsummary)
#> [1] 34 42 23 23 27 64
nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64
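For comparison, the idea behind picking the nth number can be sketched in base R with gregexpr and regmatches. This is not the package's implementation; nth_int is a hypothetical helper that only handles plain runs of digits and returns NA when a string has fewer than n numbers.

```r
# Hypothetical helper: nth run of digits in each string, NA if there are fewer than n.
nth_int <- function(x, n = 1) {
  # One character vector of digit runs per input string (empty if none found).
  all_runs <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(all_runs, function(runs) {
    if (length(runs) >= n) as.numeric(runs[n]) else NA_real_
  }, numeric(1))
}

# Shortened stand-ins for the question's data.
x <- c("Aaron Alexis, 34, killing 12 people and wounding 8",
       "Pedro Vargas, 42, set fire to his apartment")
nth_int(x, n = 1)
nth_int(x, n = 2)
```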