Getting good cuts in Hmisc with cut2 (unsigned [))

I am currently trying to neatly slice data using a batch Hmisc

like in the example below:

dummy <- data.frame(important_variable=seq(1:1000))
require(Hmisc)
dummy$cuts <- cut2(dummy$important_variable, g = 4)

      

The cuts produced are correct relative to the values:

  important_variable       cuts
1                  1 [  1, 251)
2                  2 [  1, 251)
3                  3 [  1, 251)
4                  4 [  1, 251)
5                  5 [  1, 251)
6                  6 [  1, 251)
> table(dummy$cuts)
[  1, 251) [251, 501) [501, 751) [751,1000] 
       250        250        250        250 

      

However, I would like the data to be presented slightly differently. For example, instead of

[ 1, 251 )

[ 251, 501 )

I would prefer notation

1 - 250

251 - 500

Since I am doing a lot for multiple variables, I am interested in a reproducible solution that can be easily applied to multiple variables.


Edit

After discussion in the comments, the solution will have to work on more messy variables like x2 <- runif(100, 5.0, 7.5)

.

+3


source to share


2 answers


We could use gsubfn

to remove parentheses as well as change the numeric part by subtracting one from the second set of numbers

 library(gsubfn)
 v1 <- dummy$cuts
 v1New <-  gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+', ~paste0(x, '-', 
                     as.numeric(y)-1), as.character(v1))
 table(v1New)
 # 1-250 251-500 501-750 751-999 
 #  250     250     250     250 

      

For the second case involving decimal places, we need to match the numbers along with the decimal places and capture those groups by putting them in parentheses ( ([0-9.]+)

, (\\d+\\.\\d+)

). We change the second set of capture group by converting it to "numeric" and subtracting 0.01 ( as.numeric(y)-0.01

) from it . \\s*

stands for 0 or more spaces. The spaces were uneven in the format, so we had to use this instead \\s+

, which is 1 or more spaces.



 v2New <- gsubfn('\\[\\s*([0-9.]+),(\\d+\\.\\d+).*', ~paste0(x,
                 '-',as.numeric(y)-0.01), as.character(v2))
 table(v2New)
 v2New
 #5.00-5.59 5.60-6.12 6.13-6.71 6.72-7.49 
 #    25        25        25        25 

      

data

 set.seed(24)
 x2 <- runif(100, 5.0, 7.5)
 v2 <- cut2(x2, g=4)

      

+3


source


This provides a general solution for integer and decimal ranges (no need to manually specify the increment):



library(stringr)

pretty_cuts <- function(cut_str) {

  # so we know when to not do something

  first_val <- as.numeric(str_extract_all(cut_str[1], "[[:digit:]\\.]+")[[1]][1])
  last_val <- as.numeric(str_extract_all(cut_str[length(cut_str)], "[[:digit:]\\.]+")[[1]][2])

  sapply(seq_along(cut_str), function(i) {

    # get cut range

    x <- str_extract_all(cut_str[i], "[[:digit:]\\.]+")[[1]]

    # see if a double vs an int & get # of places if decimal so
    # we know how much to inc/dec

    inc_dec <- 1
    if (str_detect(x[1], "\\.")) {
      x <- as.numeric(x)
      inc_dec <- 10^(-match(TRUE, round(x[1], 1:20) == x[1]))
    } else {
      x <- as.numeric(x)
    }

    # if not the edge cases inc & dec

    if (x[1] != first_val) { x[1] <- x[1] + inc_dec }
    if (x[2] != last_val)  { x[2] <- x[2] - inc_dec }

    sprintf("%s - %s", as.character(x[1]), as.character(x[2]))

  })

}

dummy <- data.frame(important_variable=seq(1:1000))
dummy$cuts <- cut2(dummy$important_variable, g = 4)
a <- pretty_cuts(dummy$cuts)

unique(dummy$cuts)
## [1] [  1, 251) [251, 501) [501, 751) [751,1000]
## Levels: [  1, 251) [251, 501) [501, 751) [751,1000]

unique(a)
## [1] "1 - 250"    "252 - 500"  "502 - 750"  "752 - 1000"

x2 <- runif(100, 5.0, 7.5)
b <- pretty_cuts(cut2(x2, g=4))

unique(cut2(x2, g=4))
## [1] [5.54,6.28) [6.28,6.97) [6.97,7.50] [5.02,5.54)
## Levels: [5.02,5.54) [5.54,6.28) [6.28,6.97) [6.97,7.50]

unique(b)
## [1] "5.54 - 6.27" "6.29 - 6.97" "6.98 - 7.49" "5.03 - 5.53"

      

+2


source







All Articles