R: How to extract some large numbers, but not others, from a data frame

I tried to use gsub to solve this problem, but it got too complicated. I don't know how to tell the function to return only certain numbers and not others.

My problem: I have a large data frame that has one test.comments column for each test that is executed. Each entry is a large chunk of text, and I am only interested in certain numbers in it.

Example:

** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... .................................................. 1 out of 10 test samples... Call 555-122-634 with Questions

I would like to put the value 18,900,000,000 (but not the phone number or the other stray numbers) in a separate column.

Sometimes the number is surrounded by _______:

** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED

In some cases, the number is also small:

A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen

or

** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen.

I am hoping for a robust command that will return:

18,900,000,000

33400000

900

<250

It would also help to have a command that returns only numbers > 1000; I could then edit the other cases manually.

But there must be a more elegant solution?!?

edit: Thank you for your help, Sven's solution works best for me!



3 answers


Here's a possible solution with sub:

sub(".*?([<>]?[,0-9]+)[ _]+BK.*", "\\1", vec)
# [1] "18,900,000,000" "33,400,000"     "900"            "<250"  

      



where vec is a vector containing the 4 examples.
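Since the question asks for the value in a separate column, here is a minimal sketch of applying the same pattern to a data frame. The data frame `df`, the new column names `bkv.raw` and `bkv.num`, the shortened comment strings, and the fifth non-matching row are all assumptions made for illustration; only `test.comments` and the pattern come from the post.

```r
# Hypothetical data frame; the comments are shortened versions of the examples,
# plus one made-up row that contains no viral load at all.
df <- data.frame(
  test.comments = c(
    "A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected. Call 555-122-634",
    "A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
    "A calculated 900 BK virus (BKV) genome equivalents per ml were detected",
    "A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected",
    "NEGATIVE FOR BK VIRUS"
  ),
  stringsAsFactors = FALSE
)

pattern <- ".*?([<>]?[,0-9]+)[ _]+BK.*"

# sub() returns its input unchanged when nothing matches, so guard with grepl()
# and leave non-matching rows as NA.
hit <- grepl(pattern, df$test.comments)
df$bkv.raw <- NA_character_
df$bkv.raw[hit] <- sub(pattern, "\\1", df$test.comments[hit])

# Numeric version: drop commas and any "<" or ">" (so "<250" becomes 250).
df$bkv.num <- as.numeric(gsub("[<>,]", "", df$bkv.raw))
```

Keeping both columns is a design choice: `bkv.raw` preserves the "<" qualifier, which is lost once the value is converted to a number.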



This will pull the targets out of these examples (I added a fourth case):

 dput(test)
c("** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions", 
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED", 
"A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen", 
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
)

      

You will need to post a better example if this doesn't work well enough:



> gsub("(^[^>_0-9]+)([0-9,]{14}|[_]+[<0-9,]+[_]+|[,0-9]+ BK)(.+$)", 
       "\\2", test)
[1] "18,900,000,000 BK" "__33,400,000____"  "900 BK" 
[4] "__<250__________" 

      

Then you can just remove the underscores and commas. The logic is that the reports have a predefined number of spaces for the data (the field is either all digits and commas if it fills 14 characters or, if not, it is padded on both sides with underscores).
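The cleanup step could be sketched as follows. The vector `extracted` is simply the output printed above, and the names `extracted` and `cleaned` are made up for this illustration. Note that two of the results also carry a trailing " BK" from the third alternative in the pattern, so the sketch strips that as well.

```r
# Output of the gsub() call above, copied from the console.
extracted <- c("18,900,000,000 BK", "__33,400,000____", "900 BK", "__<250__________")

# Remove underscores and commas, plus the trailing " BK" where present.
cleaned <- gsub("[_,]| BK$", "", extracted)
cleaned
# [1] "18900000000" "33400000"    "900"         "<250"
```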



The two approaches below are still not completely reliable, and I'm not sure how to fix them since I'm not a very good regexer.

p1 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions"
p2 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED"
p3 <- "A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen

** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."

      

This first one does not capture 900 in the third example, and it also cuts the first number short:

pattern <- '(?:\\s+)*[\\d<>]((?:[\\d,])*(?![\\s-\\d]))'
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] " 18,900,000"

regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"

regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "<250"

      

This second one captures 900 in the third example, but it also grabs the extra numbers in the first example:

pattern <- "[\\d<>]((?:[\\d,])*)"
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] "18,900,000,000" "1"              "10"             "555"           
# [5] "122"            "634"           

regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"

regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "900"  "<250"
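One possible way to drop the extra matches, assuming the viral load is always the first number in the comment (which holds for the examples here but is not guaranteed for real reports), is to use regexpr instead of gregexpr so that only the first match per string is returned. The string below is a shortened copy of the first example, and the name `first` is made up for this sketch.

```r
p1 <- "A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected ... 1 out of 10 test samples... Call 555-122-634 with Questions"

pattern <- "[\\d<>]((?:[\\d,])*)"

# regexpr() locates only the first match in each string,
# so the "1 out of 10" and phone-number digits are ignored.
first <- regmatches(p1, regexpr(pattern, p1, perl = TRUE))
first
# [1] "18,900,000,000"
```

Be aware that regmatches() combined with regexpr() silently drops strings with no match, so on a full column the result can be shorter than the input.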

      
