Find frequencies in the third quartile in a table

Question

Find frequencies in the third quartile in a table

I have a large data frame (+ 239k observations on 57 variables) with some descriptions of diseases and medications prescribed for those diseases for people in different age ranges. I would like to find these drugs in the upper quartile of frequency of use for each disease description.

To make a reproducible example, I created 1000 frames of observation data:

set.seed(1);sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T));md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")));age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)

I can create a frequency table with

xt<-xtabs(~md+sk+age,df)

Then I can create a data frame for each age group

XTDF_a<-as.data.frame(xt[,,"group a"])

and then find the 3rd quartile of the frequencies of each disease with:

Q3_a<-apply(XTDF_a,2,function(x) quantile(x,probs = .75))

to which I can compare and get which drugs are above the 3rd quartile for each disease

XTDF_a>Q3_a


    sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3   TRUE   TRUE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE   TRUE
  med 5  FALSE  FALSE  FALSE  FALSE

I can conclude that med 3

is the top pick for Sickness A, etc. (I automatically iterate over this information). Then I go back and repeat the process for group b, c .... which is almost impossible with the size of the data I have (diseases are about 4200 levels, and drugs are about 1150 levels).

I'm sure there must be another, easier way to achieve this. I would appreciate a better path to follow.

+3

r frequency categorical-data contingency

PavoDive May 19 '15 at 17:02

source to share

2 answers

apply

can work on 3 arrays, and you can specify multiple dimensions to iterate over:

> apply(xt,2:3,function(x) x > quantile(x, probs = .75))
, , age = group a

       sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3   TRUE   TRUE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE   TRUE
  med 5  FALSE  FALSE  FALSE  FALSE

, , age = group b

       sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3  FALSE  FALSE  FALSE  FALSE
  med 4   TRUE  FALSE  FALSE  FALSE
  med 5  FALSE   TRUE  FALSE  FALSE

, , age = group c

       sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE  FALSE   TRUE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3  FALSE  FALSE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE  FALSE
  med 5   TRUE  FALSE  FALSE  FALSE

+2

Neal fultz May 19 '15 at 17:25

source to share

TARehman · Accepted Answer · 2015-05-19T18:09:02+0000

I think you can speed this up by writing a slightly more precise function and then using aggregate

to get the results. You can also use by

if you need a more list-based approach, which might be more useful for your next use. I think it will still be slow, but not as slow as a loop.

# Here is what you gave me originally
set.seed(1)
sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T))
md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")))
age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)

# Define a function that basically does what you did before, but uses table()
func.get_75th_meds <- function(vector_of_meds) {

    freq <- table(vector_of_meds)
    return(names(freq)[freq >= quantile(x = freq,probs = 0.75)])
}

aggregate(x = list(Meds = df$md),
          by = list(Sickness = df$sk,Group = df$age),
          FUN = func.get_75th_meds)

   Sickness   Group                       Meds
1    sick A group a               med 3, med 5
2    sick B group a               med 3, med 5
3    sick C group a med 1, med 2, med 4, med 5
4    sick D group a               med 2, med 4
5    sick A group b               med 4, med 5
6    sick B group b        med 1, med 2, med 5
7    sick C group b               med 1, med 2
8    sick D group b               med 2, med 3
9    sick A group c               med 2, med 5
10   sick B group c               med 2, med 4
11   sick C group c        med 1, med 2, med 4
12   sick D group c        med 1, med 3, med 4

EDITED TO ADD: Here's an alternative by()

with using the same function.

by(data = df$md,
   INDICES = list(Sickness = df$sk,Group = df$age),
   FUN = func.get_75th_meds)

Sickness: sick A
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
Sickness: sick B
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
... and so on

Find frequencies in the third quartile in a table

More articles: