Getting good cuts in Hmisc with cut2 (unsigned [))
I am currently trying to neatly slice data using a batch Hmisc
like in the example below:
dummy <- data.frame(important_variable=seq(1:1000))
require(Hmisc)
dummy$cuts <- cut2(dummy$important_variable, g = 4)
The cuts produced are correct relative to the values:
important_variable cuts
1 1 [ 1, 251)
2 2 [ 1, 251)
3 3 [ 1, 251)
4 4 [ 1, 251)
5 5 [ 1, 251)
6 6 [ 1, 251)
> table(dummy$cuts)
[ 1, 251) [251, 501) [501, 751) [751,1000]
250 250 250 250
However, I would like the data to be presented slightly differently. For example, instead of
[ 1, 251 )
[ 251, 501 )
I would prefer notation
1 - 250
251 - 500
Since I am doing a lot for multiple variables, I am interested in a reproducible solution that can be easily applied to multiple variables.
Edit
After discussion in the comments, the solution will have to work on more messy variables like x2 <- runif(100, 5.0, 7.5)
.
source to share
We could use gsubfn
to remove parentheses as well as change the numeric part by subtracting one from the second set of numbers
library(gsubfn)
v1 <- dummy$cuts
v1New <- gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+', ~paste0(x, '-',
as.numeric(y)-1), as.character(v1))
table(v1New)
# 1-250 251-500 501-750 751-999
# 250 250 250 250
For the second case involving decimal places, we need to match the numbers along with the decimal places and capture those groups by putting them in parentheses ( ([0-9.]+)
, (\\d+\\.\\d+)
). We change the second set of capture group by converting it to "numeric" and subtracting 0.01 ( as.numeric(y)-0.01
) from it . \\s*
stands for 0 or more spaces. The spaces were uneven in the format, so we had to use this instead \\s+
, which is 1 or more spaces.
v2New <- gsubfn('\\[\\s*([0-9.]+),(\\d+\\.\\d+).*', ~paste0(x,
'-',as.numeric(y)-0.01), as.character(v2))
table(v2New)
v2New
#5.00-5.59 5.60-6.12 6.13-6.71 6.72-7.49
# 25 25 25 25
data
set.seed(24)
x2 <- runif(100, 5.0, 7.5)
v2 <- cut2(x2, g=4)
source to share
This provides a general solution for integer and decimal ranges (no need to manually specify the increment):
library(stringr)
pretty_cuts <- function(cut_str) {
# so we know when to not do something
first_val <- as.numeric(str_extract_all(cut_str[1], "[[:digit:]\\.]+")[[1]][1])
last_val <- as.numeric(str_extract_all(cut_str[length(cut_str)], "[[:digit:]\\.]+")[[1]][2])
sapply(seq_along(cut_str), function(i) {
# get cut range
x <- str_extract_all(cut_str[i], "[[:digit:]\\.]+")[[1]]
# see if a double vs an int & get # of places if decimal so
# we know how much to inc/dec
inc_dec <- 1
if (str_detect(x[1], "\\.")) {
x <- as.numeric(x)
inc_dec <- 10^(-match(TRUE, round(x[1], 1:20) == x[1]))
} else {
x <- as.numeric(x)
}
# if not the edge cases inc & dec
if (x[1] != first_val) { x[1] <- x[1] + inc_dec }
if (x[2] != last_val) { x[2] <- x[2] - inc_dec }
sprintf("%s - %s", as.character(x[1]), as.character(x[2]))
})
}
dummy <- data.frame(important_variable=seq(1:1000))
dummy$cuts <- cut2(dummy$important_variable, g = 4)
a <- pretty_cuts(dummy$cuts)
unique(dummy$cuts)
## [1] [ 1, 251) [251, 501) [501, 751) [751,1000]
## Levels: [ 1, 251) [251, 501) [501, 751) [751,1000]
unique(a)
## [1] "1 - 250" "252 - 500" "502 - 750" "752 - 1000"
x2 <- runif(100, 5.0, 7.5)
b <- pretty_cuts(cut2(x2, g=4))
unique(cut2(x2, g=4))
## [1] [5.54,6.28) [6.28,6.97) [6.97,7.50] [5.02,5.54)
## Levels: [5.02,5.54) [5.54,6.28) [6.28,6.97) [6.97,7.50]
unique(b)
## [1] "5.54 - 6.27" "6.29 - 6.97" "6.98 - 7.49" "5.03 - 5.53"
source to share