How to show frequencies in discrete categories with range data in R?

Question

How to show frequencies in discrete categories with range data in R?

I'm trying to figure out a bunch of data I have about dinosaurs and their age ranges. So far, my data consists of a column of names followed by two columns of maximum and minimum dates in millions of years in the past, as you can see here:

GENUS           ma_max  ma_min  ma_mid    
Abydosaurus     109     94.3    101.65    
Achelousaurus   84.9    70.6    77.75    
Acheroraptor    70.6    66.043  68.3215

Geological time is divided into different ages (eg Jurassic and Cretaceous), and they are also subdivided into stages. These milestones have specific age ranges, and I created data for them:

Stage          ma_max ma_min ma_mid
Hettangian      201.6  197.0 199.30
Sinemurian      197.0  190.0 193.50
Pliensbachian   190.0  183.0 186.50
Toarcian        183.0  176.0 179.50
Aalenian        176.0  172.0 174.00
Bajocian        172.0  168.0 170.00
Bathonian       168.0  165.0 166.50
Callovian       165.0  161.0 163.00
Oxfordian       161.0  156.0 158.50
Kimmeridgian    156.0  151.0 153.50
Tithonian       151.0  145.5 148.25
Berriasian      145.5  140.0 142.75
Valanginian     140.0  136.0 138.00
Hauterivian     136.0  130.0 133.00
Barremian       130.0  125.0 127.50
Aptian          125.0  112.0 118.50
Albian          112.0   99.6 105.80
Cenomanian      99.6   93.5  96.55
Turonian        93.5   89.3  91.40
Coniacian       89.3   85.8  87.55
Santonian       85.8   83.5  84.65
Campanian       83.5   70.6  77.05
Maastrichtian   70.6   66.5  68.05

I am trying to figure out how many births are in each stage. The problem is range - for example, a genus might have a range that spans 3 or more stages, and I want each of those stages to register the presence of the genus. Is there an easy way to do this? I've been thinking about using the "shingle" from the trellis packages as suggested in a similar discussion here, but I'm very new to R and not sure if it can be implemented in such a way that the data has a range.

+3

r range

Christopher dean 15 nov. 14 at 20:02

source to share

5 answers

I would recommend the sqldf package.

library(sqldf)

Suppose your GENUS data is located in the Genus and Stage dataframe located in the stage dataframe.

res <- sqldf("select count(*) as countDinos , s.Stage, GROUP_CONCAT(g.GENUS) as names from genus g,stage s where (g.ma_max>=s.ma_min AND g.ma_max<=s.ma_max)  OR  (g.ma_min>=s.ma_min AND g.ma_min<=s.ma_max) OR (g.ma_max>s.ma_max AND g.ma_min<s.ma_min)   group by s.Stage order by s.ma_mid DESC  ")

Should give you an answer like this:

countDinos  Stage         names
   1        Albian                         Abydosaurus   
   1        Cenomanian                     Abydosaurus   
   1        Santonian                      Achelousaurus 
   2        Campanian       Achelousaurus ,Acheroraptor  
   2        Maastrichtian   Achelousaurus ,Acheroraptor

+3

Dusan Grubjesic 15 nov. 14 at 22:01

source to share

You may also want to consider using a feature foverlaps

from the latest package data.table

.

# the setup is straight forward
library(data.table)               # need version 1.9.5+ **
                                  # can only download the latest version from  
                                  # https://github.com/Rdatatable/data.table  
                                  # at the time of posting this

setDT(genus); setDT(stage)        # This line sets your data frames to data tables
setkey(genus, ma_min, ma_max)     # This keys the start and end time of your time frame
setkey(stage, ma_min, ma_max)     # This keys the start and end time of your time frame

# the opration
matches <- foverlaps(genus, stage, type="any", nomatch=0L)
matches
#            Stage ma_max ma_min ma_mid         GENUS i.ma_max i.ma_min i.ma_mid
# 1: Maastrichtian   70.6   66.5  68.05  Acheroraptor     70.6   66.043  68.3215
# 2:     Campanian   83.5   70.6  77.05  Acheroraptor     70.6   66.043  68.3215
# 3: Maastrichtian   70.6   66.5  68.05 Achelousaurus     84.9   70.600  77.7500
# 4:     Campanian   83.5   70.6  77.05 Achelousaurus     84.9   70.600  77.7500
# 5:     Santonian   85.8   83.5  84.65 Achelousaurus     84.9   70.600  77.7500
# 6:    Cenomanian   99.6   93.5  96.55   Abydosaurus    109.0   94.300 101.6500
# 7:        Albian  112.0   99.6 105.80   Abydosaurus    109.0   94.300 101.6500

# This line below gives the frequency count (see column N)
matches[, N := length(GENUS), by=Stage][]

#            Stage ma_max ma_min ma_mid         GENUS i.ma_max i.ma_min i.ma_mid N
# 1: Maastrichtian   70.6   66.5  68.05  Acheroraptor     70.6   66.043  68.3215 2
# 2:     Campanian   83.5   70.6  77.05  Acheroraptor     70.6   66.043  68.3215 2
# 3: Maastrichtian   70.6   66.5  68.05 Achelousaurus     84.9   70.600  77.7500 2
# 4:     Campanian   83.5   70.6  77.05 Achelousaurus     84.9   70.600  77.7500 2
# 5:     Santonian   85.8   83.5  84.65 Achelousaurus     84.9   70.600  77.7500 1
# 6:    Cenomanian   99.6   93.5  96.55   Abydosaurus    109.0   94.300 101.6500 1
# 7:        Albian  112.0   99.6 105.80   Abydosaurus    109.0   94.300 101.6500 1

# Of course you could also chain the two lines of code into one:
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][]

# If you prefer simplify the output by removing a few columns (6th to 8th), you could
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][,6:8 := NULL][]

#            Stage ma_max ma_min ma_mid         GENUS N
# 1: Maastrichtian   70.6   66.5  68.05  Acheroraptor 2
# 2:     Campanian   83.5   70.6  77.05  Acheroraptor 2
# 3: Maastrichtian   70.6   66.5  68.05 Achelousaurus 2
# 4:     Campanian   83.5   70.6  77.05 Achelousaurus 2
# 5:     Santonian   85.8   83.5  84.65 Achelousaurus 1
# 6:    Cenomanian   99.6   93.5  96.55   Abydosaurus 1
# 7:        Albian  112.0   99.6 105.80   Abydosaurus 1

We recommend checking this ?foverlaps

+3

KFB 15 nov. 14 at 22:44

source to share

How about finding the maximum and minimum exceedance ranges? Here's a small dataset that I hope mimics your actual data quite well - the "eras" are adjacent, and my two "dinosaurs" have a random range of time.

 dino
            dmax dmin
[1,] 1 0.5500000 0.11
[2,] 2 0.3721239 0.05
> epoch
      [,1] [,2]
 [1,]  0.1  0.0
 [2,]  0.2  0.1
 [3,]  0.3  0.2
 [4,]  0.4  0.3
 [5,]  0.5  0.4
 [6,]  0.6  0.5
 [7,]  0.7  0.6
 [8,]  0.8  0.7
 [9,]  0.9  0.8
[10,]  1.0  0.9
[11,]  1.1  1.0
> max(which(epoch[,2]<dino[1,3])):min(which(epoch[,1]>dino[1,2]))
[1] 2 3 4 5 6
> max(which(epoch[,2]<dino[2,3])):min(which(epoch[,1]>dino[2,2]))
[1] 1 2 3 4

Thus, these last two rows identify the row numbers in the matrix epoch

for which the selected dinosaur existed (row from the dinosaur matrix). If you iterate over all lines i.e.for (j in 1:nrow(GENUS_matrix) ) {do that "max(which...)) stuff}

+1

Carl Witthoft 15 nov. 14 at 21:48

source to share

You can use "findInterval" to define the stage:

> stages <- read.table(text = "Stage          ma_max ma_min ma_mid
+ Hettangian      201.6  197.0 199.30
+ Sinemurian      197.0  190.0 193.50
+ Pliensbachian   190.0  183.0 186.50
+ Toarcian        183.0  176.0 179.50
+ Aalenian        176.0  172.0 174.00
+ Bajocian        172.0  168.0 170.00
+ Bathonian       168.0  165.0 166.50
+ Callovian       165.0  161.0 163.00
+ Oxfordian       161.0  156.0 158.50
+ Kimmeridgian    156.0  151.0 153.50
+ Tithonian       151.0  145.5 148.25
+ Berriasian      145.5  140.0 142.75
+ Valanginian     140.0  136.0 138.00
+ Hauterivian     136.0  130.0 133.00
+ Barremian       130.0  125.0 127.50
+ Aptian          125.0  112.0 118.50
+ Albian          112.0   99.6 105.80
+ Cenomanian      99.6   93.5  96.55
+ Turonian        93.5   89.3  91.40
+ Coniacian       89.3   85.8  87.55
+ Santonian       85.8   83.5  84.65
+ Campanian       83.5   70.6  77.05
+ Maastrichtian   70.6   66.5  68.05", as.is = TRUE, header = TRUE)
> 
> myData <- read.table(text = "GENUS           ma_max  ma_min  ma_mid    
+ Abydosaurus     109     94.3    101.65    
+ Achelousaurus   84.9    70.6    77.75    
+ Acheroraptor    70.6    66.043  68.3215    ", as.is = TRUE, header = TRUE)
> 
> # flip around to create the intervals
> stages <- stages[rev(seq(nrow(stages))), ]
> interval <- c(stages$ma_min, tail(stages$ma_max, 1))  # create interval
> 
> # for each item get the start/end stages
> start <- findInterval(myData$ma_max,interval, all.inside = TRUE)
> end <- findInterval(myData$ma_min, interval, all.inside = TRUE)
> 
> myData$stages <- apply(cbind(start, end), 1L, function(.row){
+     paste(stages$Stage[.row[1L]:.row[2L]], collapse = ', ')
+ })
> myData
          GENUS ma_max ma_min   ma_mid                   stages
1   Abydosaurus  109.0 94.300 101.6500       Albian, Cenomanian
2 Achelousaurus   84.9 70.600  77.7500     Santonian, Campanian
3  Acheroraptor   70.6 66.043  68.3215 Campanian, Maastrichtian

+1

Data munger 15 nov. 14 at 22:51

source to share

eipi10 · Accepted Answer · 2014-11-15T21:00:45+0000

Assuming your data frames are named genus

and stage

, first create a list containing, for each, the stage

names of the genera that lived during this stage

. Then we'll add this to the dataframe stage

and also add another column that counts the number of births alive during each stage

.

The code below sapply

takes each value in turn stage

and checks which values genus

fall within that time range stage

, comparing stage

ma_max

both ma_min

with ma_max

and ma_min

for each genus

.

# List of genera that lived during each Stage
stages.genus = sapply(stage$Stage, function(x){
  genus$GENUS[which((stage$ma_max[stage$Stage==x] <= genus$ma_max & 
                       stage$ma_max[stage$Stage==x] >= genus$ma_min) |
                      (stage$ma_min[stage$Stage==x] >= genus$ma_min & 
                         stage$ma_min[stage$Stage==x] <= genus$ma_max))]
})

For each element, stages.genus

insert together all the values genus

that apply to this stage

, separated by commas, giving us a vector containing the genders that come with each value stage

. Assign this vector as a new column stage

, which we'll call genera

.

# Add list of genera by stage to the stage data frame
stage$genera = lapply(stages.genus, paste, sep=", ")

To get the number of numbers in each stage

, simply count the number of genera in each element stages.genus

and assign it to a new column stage

, which we will callNgenera

# Add count of genera for each Stage to the stage data frame
stage$Ngenera = lapply(stages.genus, length)

And here's the result:

> stage

           Stage ma_max ma_min ma_mid                      genera Ngenera
1     Hettangian  201.6  197.0 199.30                                   0
2     Sinemurian  197.0  190.0 193.50                                   0
...
16        Aptian  125.0  112.0 118.50                                   0
17        Albian  112.0   99.6 105.80                 Abydosaurus       1
18    Cenomanian   99.6   93.5  96.55                 Abydosaurus       1
19      Turonian   93.5   89.3  91.40                                   0
20     Coniacian   89.3   85.8  87.55                                   0
21     Santonian   85.8   83.5  84.65               Achelousaurus       1
22     Campanian   83.5   70.6  77.05 Achelousaurus, Acheroraptor       2
23 Maastrichtian   70.6   66.5  68.05 Achelousaurus, Acheroraptor       2

An additional option is to create a column in stage

for each genus

and set the value to 1 if genus

lived at this stage or zero otherwise:

stage[, genus$GENUS] = lapply(genus$GENUS, function(x) {
  ifelse(grepl(x, stages.genus), 1, 0)
})

Here are the additional columns we just added:

> stage[ , c(1,7:9)]   # Just show the Stage plus the three new GENUS columns

           Stage Abydosaurus Achelousaurus Acheroraptor
1     Hettangian           0             0            0
2     Sinemurian           0             0            0
...
16        Aptian           0             0            0
17        Albian           1             0            0
18    Cenomanian           1             0            0
19      Turonian           0             0            0
20     Coniacian           0             0            0
21     Santonian           0             1            0
22     Campanian           0             1            1
23 Maastrichtian           0             1            1

The last step will also help you get a stage-by-stage visualization of labor. For example:

library(reshape2)
library(ggplot2)

# Melt data into long format
stage.m = melt(stage[,c(1:4,7:9)], id.var=1:4)

# Tile plot where height of each Stage is proportional to how long it lasted
ggplot(stage.m, aes(variable, ma_mid, fill=factor(value))) +
  geom_tile(aes(height=ma_max - ma_min), colour="grey20", lwd=0.2) +
  scale_fill_manual(values=c("white","blue")) +
  scale_y_continuous(breaks=stage$ma_mid, labels=stage$Stage) +
  xlab("Genus") + ylab("Stage") +
  theme_bw(base_size=15) +
  guides(fill=FALSE)

The previous code can also be modified to use time ranges from data frames stage

and genus

if you want the blue to only cover the time range when each genus

lived, not the full range of each one stage

they lived in.

enter image description here

How to show frequencies in discrete categories with range data in R?

More articles: