How do I extract the first number from each row into a vector in R?

I am new to regex in R. I have a character vector, and I want to retrieve the first occurrence of a number in each element.

I have a vector called "shootsummary" that looks like this.

> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                         
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.                           
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.      
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.  

      

The first number in each line is the person's age. I want to extract these ages without mixing them with the other numbers in each line.

I used:

as.numeric(gsub("\\D", "", shootsummary))

      

The result was:

[1]  34128     42     23     27   6419  

      

I'm looking for a result that contains only the ages, without the other numbers that occur later in each sentence:

[1]  34     42     23     27   64

      

+3




7 replies


You can try the following sub command:

> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"

      



Explanation:

  • ^ asserts that we are at the beginning of the line.
  • \D* matches zero or more non-digit characters.
  • (\d+) then captures one or more digits into group 1 (the first number).
  • .* matches any character zero or more times.
  • $ asserts that we are at the end of the line.
  • Finally, the whole match is replaced with the contents of the first group.
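
One caveat with this approach: sub() returns its input unchanged when the pattern does not match, so an element without any digit would come back as the full sentence. The sketch below is one way to guard against that in base R. The sample vector x and the names has_digit and ages are made up for illustration and are not part of the original answer.

```r
# Hypothetical sample data; the third element contains no digits.
x <- c("Aaron Alexis, 34, killing 12 people and wounding 8",
       "Pedro Vargas, 42, killed six people",
       "no digits here")

# Only run sub() on elements that actually contain a digit,
# and leave NA for the rest.
has_digit <- grepl("\\d", x)
ages <- rep(NA_real_, length(x))
ages[has_digit] <- as.numeric(sub("^\\D*(\\d+).*$", "\\1", x[has_digit]))

ages
# [1] 34 42 NA
```

Wrapping the result in as.numeric() also gives the numeric vector the question asks for, rather than a character vector.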
+2




stringi will be faster:



library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"

      

+3




What about splitting on commas and taking the second field?

splitbycomma <- strsplit(shootsummary, ",")
as.numeric(  sapply(splitbycomma, "[", 2)  )

      

+1




R's regmatches(), combined with regexpr(), returns a vector with the first regex match in each element:

regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE))
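
Note that regmatches() drops elements that have no match, so the result can be shorter than the input. A minimal sketch of one way to keep the result aligned with the input, using the -1 that regexpr() returns for non-matching elements; the sample vector x and the names m and first_num are hypothetical:

```r
# Hypothetical sample data; the second element contains no digits.
x <- c("Kurt Myers, 64, after a nearly 19-hour standoff",
       "no digits here",
       "John Zawahri, 23")

m <- regexpr("\\d+", x, perl = TRUE)   # -1 marks elements with no match

# Pre-fill with NA, then write the matches into the matching positions only.
first_num <- rep(NA_character_, length(x))
first_num[m != -1] <- regmatches(x, m)

as.numeric(first_num)
# [1] 64 NA 23
```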

      

+1




One option is str_extract from stringr, wrapped in as.numeric.

> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64

      

Update: In response to your question in the comments to this answer, here's a bit of explanation. A full description of the function can be found in the help file.

  • str_extract returns the first match of the regular expression. It is vectorized over the character vector in its first argument.
  • The regular expression [0-9]+ matches any character in the range '0' to '9', one or more times.
  • as.numeric converts the resulting character vector to a numeric vector.
+1




You can use sub:

test <- ("xff 34 sfsdg 352 efsrg")

sub(".*?(\\d+).*", "\\1", test)
# [1] "34"

      

How does the regular expression work?

The . matches any character, and the quantifier * means any number of occurrences. The ? makes that quantifier lazy, so .*? matches as few characters as possible and stops just before the first digit (\\d). The quantifier + means one or more occurrences. The parentheses around \\d+ form the first capture group. This may be followed by additional characters (.*). The second argument (\\1) replaces the entire string with the contents of the first capture group (i.e., the first number).
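
To see why the ? matters, the sketch below compares the lazy pattern with a greedy variant on the same test string. Passing perl = TRUE is an assumption made here so that both calls use PCRE semantics; the answer above uses the default engine. The names lazy and greedy are illustrative only.

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy: .*? consumes as little as possible, so the group captures the first number.
lazy <- sub(".*?(\\d+).*", "\\1", test, perl = TRUE)
lazy
# [1] "34"

# Greedy: .* consumes as much as possible and backtracks only far enough
# to leave a single digit for \\d+, so the group captures the last digit.
greedy <- sub(".*(\\d+).*", "\\1", test, perl = TRUE)
greedy
# [1] "2"
```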

+1




You can do this easily with the first_number() function from the filesstrings package; for more general needs, the package also provides nth_number(). Install it with install.packages("filesstrings").

library(filesstrings)
#> Loading required package: stringr
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                  "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "Dennis Clark III, 27, shot and killed his girlfriend ...",
                  "Kurt Myers, 64, shot six people in neighboring ..."
)
first_number(shootsummary)
#> [1] 34 42 23 23 27 64
nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64

      

0

