R: How to extract some large numbers, but not others, from a data frame
I tried to solve this with gsub, but it got too complicated. I don't know how to tell the function to return only certain numbers and not others.
My problem: I have a large data frame with a test.comments column for each test that was performed. Each entry is a large chunk of text, and I am only interested in certain numbers in it.
Example:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen ... .................................................. .................................................. ......................... 1 out of 10 test samples ... Call 555-122-634 with questions
I would like to put the value 18,900,000,000 (but not the phone number or the other stray numbers) in a separate column.
Sometimes the number is surrounded by underscores:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED
In some cases the number is small:
A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
or
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen.
I am hoping for a reliable command that will return
18,900,000,000
33400000
900
<250
It would also help to have a command that returns only the numbers > 1000; I could then edit the other cases manually.
But there must be a more elegant solution?!?
edit: Thank you for your help, Sven's solution works best for me!
This pulls the targets out of these examples (I added a fourth case):
dput(test)
c("** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
"A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
)
I would need a better example if this doesn't work well:
> gsub("(^[^>_0-9]+)([0-9,]{14}|[_]+[<0-9,]+[_]+|[,0-9]+ BK)(.+$)",
"\\2", test)
[1] "18,900,000,000 BK" "__33,400,000____" "900 BK"
[4] "__<250__________"
Then you can just remove the underscores and commas (and the trailing " BK"). The logic is that the reports reserve a fixed number of characters for the value: it is either 14 characters of digits and commas or, if it is shorter than that, it is padded on both sides with underscores.
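The cleanup step could look roughly like the sketch below. It starts from the gsub() output shown above, copied by hand into a vector; the names extracted, cleaned and value are only illustrative. Instead of removing underscores and commas one by one, it drops every character that is not a digit or "<", which also gets rid of the trailing " BK".

```r
# Output of the gsub() call above, copied verbatim (illustrative variable name)
extracted <- c("18,900,000,000 BK", "__33,400,000____", "900 BK", "__<250__________")

# Keep only digits and the "<" sign; commas, underscores and " BK" are dropped
cleaned <- gsub("[^0-9<]", "", extracted)

# Optional numeric version: "<250" has no exact value, so it becomes NA
value <- suppressWarnings(as.numeric(cleaned))
```

Keeping cleaned as character preserves the "<250" case, while value is convenient for numeric comparisons.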
The two approaches below are still not completely reliable, and I'm not sure how to fix them since I'm not a very good regexxxer.
p1 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions"
p2 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED"
p3 <- "A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
This first one does not capture the 900 from the third example:
pattern <- '(?:\\s+)*[\\d<>]((?:[\\d,])*(?![\\s-\\d]))'
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] " 18,900,000"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "<250"
This second one picks up the extra runs of digits in the first example, but it does capture the 900 from the third example:
pattern <- "[\\d<>]((?:[\\d,])*)"
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] "18,900,000,000" "1" "10" "555"
# [5] "122" "634"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "900" "<250"
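One possible way around the extra matches, sketched here under an assumption that is not stated in the question: in every sample report the wanted value directly follows the word "calculated", separated only by spaces or underscores. If real reports deviate from that layout, the pattern needs adjusting. The vector reports holds shortened copies of the four examples, and all variable names are illustrative.

```r
# Shortened copies of the four sample reports (illustrative data)
reports <- c(
  "INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml ... 1 out of 10 test samples... Call 555-122-634 with Questions",
  "INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
  "A calculated 900 BK virus (BKV) genome equivalents per ml were detected",
  "INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml"
)

# (?i) ignores case; \K drops "calculated" and the padding from the reported match
pattern <- "(?i)calculated[\\s_]*\\K<?[\\d,]+"
m <- regexpr(pattern, reports, perl = TRUE)

# Reports without a match stay NA instead of being silently dropped
result <- rep(NA_character_, length(reports))
result[m > 0] <- regmatches(reports, m)
result <- gsub(",", "", result)

# Flag values above 1000; "<250" is not numeric and is treated as not large
numeric_value <- suppressWarnings(as.numeric(result))
is_large <- !is.na(numeric_value) & numeric_value > 1000
```

Because regexpr() returns only the first match per string, the "1 out of 10" and the phone number are never considered. On a data frame the same steps could be applied to the test.comments column and the result assigned to a new column.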