How to show frequencies in discrete categories with range data in R?
I'm trying to figure out a bunch of data I have about dinosaurs and their age ranges. So far, my data consists of a column of names followed by two columns of maximum and minimum dates in millions of years in the past, as you can see here:
GENUS ma_max ma_min ma_mid
Abydosaurus 109 94.3 101.65
Achelousaurus 84.9 70.6 77.75
Acheroraptor 70.6 66.043 68.3215
Geological time is divided into different ages (eg Jurassic and Cretaceous), and they are also subdivided into stages. These milestones have specific age ranges, and I created data for them:
Stage ma_max ma_min ma_mid
Hettangian 201.6 197.0 199.30
Sinemurian 197.0 190.0 193.50
Pliensbachian 190.0 183.0 186.50
Toarcian 183.0 176.0 179.50
Aalenian 176.0 172.0 174.00
Bajocian 172.0 168.0 170.00
Bathonian 168.0 165.0 166.50
Callovian 165.0 161.0 163.00
Oxfordian 161.0 156.0 158.50
Kimmeridgian 156.0 151.0 153.50
Tithonian 151.0 145.5 148.25
Berriasian 145.5 140.0 142.75
Valanginian 140.0 136.0 138.00
Hauterivian 136.0 130.0 133.00
Barremian 130.0 125.0 127.50
Aptian 125.0 112.0 118.50
Albian 112.0 99.6 105.80
Cenomanian 99.6 93.5 96.55
Turonian 93.5 89.3 91.40
Coniacian 89.3 85.8 87.55
Santonian 85.8 83.5 84.65
Campanian 83.5 70.6 77.05
Maastrichtian 70.6 66.5 68.05
I am trying to figure out how many births are in each stage. The problem is range - for example, a genus might have a range that spans 3 or more stages, and I want each of those stages to register the presence of the genus. Is there an easy way to do this? I've been thinking about using the "shingle" from the trellis packages as suggested in a similar discussion here, but I'm very new to R and not sure if it can be implemented in such a way that the data has a range.
source to share
Assuming your data frames are named genus
and stage
, first create a list containing, for each, the stage
names of the genera that lived during this stage
. Then we'll add this to the dataframe stage
and also add another column that counts the number of births alive during each stage
.
The code below sapply
takes each value in turn stage
and checks which values genus
fall within that time range stage
, comparing stage
ma_max
both ma_min
with ma_max
and ma_min
for each genus
.
# List of genera that lived during each Stage
stages.genus = sapply(stage$Stage, function(x){
genus$GENUS[which((stage$ma_max[stage$Stage==x] <= genus$ma_max &
stage$ma_max[stage$Stage==x] >= genus$ma_min) |
(stage$ma_min[stage$Stage==x] >= genus$ma_min &
stage$ma_min[stage$Stage==x] <= genus$ma_max))]
})
For each element, stages.genus
insert together all the values genus
that apply to this stage
, separated by commas, giving us a vector containing the genders that come with each value stage
. Assign this vector as a new column stage
, which we'll call genera
.
# Add list of genera by stage to the stage data frame
stage$genera = lapply(stages.genus, paste, sep=", ")
To get the number of numbers in each stage
, simply count the number of genera in each element stages.genus
and assign it to a new column stage
, which we will callNgenera
# Add count of genera for each Stage to the stage data frame
stage$Ngenera = lapply(stages.genus, length)
And here's the result:
> stage
Stage ma_max ma_min ma_mid genera Ngenera
1 Hettangian 201.6 197.0 199.30 0
2 Sinemurian 197.0 190.0 193.50 0
...
16 Aptian 125.0 112.0 118.50 0
17 Albian 112.0 99.6 105.80 Abydosaurus 1
18 Cenomanian 99.6 93.5 96.55 Abydosaurus 1
19 Turonian 93.5 89.3 91.40 0
20 Coniacian 89.3 85.8 87.55 0
21 Santonian 85.8 83.5 84.65 Achelousaurus 1
22 Campanian 83.5 70.6 77.05 Achelousaurus, Acheroraptor 2
23 Maastrichtian 70.6 66.5 68.05 Achelousaurus, Acheroraptor 2
An additional option is to create a column in stage
for each genus
and set the value to 1 if genus
lived at this stage or zero otherwise:
stage[, genus$GENUS] = lapply(genus$GENUS, function(x) {
ifelse(grepl(x, stages.genus), 1, 0)
})
Here are the additional columns we just added:
> stage[ , c(1,7:9)] # Just show the Stage plus the three new GENUS columns
Stage Abydosaurus Achelousaurus Acheroraptor
1 Hettangian 0 0 0
2 Sinemurian 0 0 0
...
16 Aptian 0 0 0
17 Albian 1 0 0
18 Cenomanian 1 0 0
19 Turonian 0 0 0
20 Coniacian 0 0 0
21 Santonian 0 1 0
22 Campanian 0 1 1
23 Maastrichtian 0 1 1
The last step will also help you get a stage-by-stage visualization of labor. For example:
library(reshape2)
library(ggplot2)
# Melt data into long format
stage.m = melt(stage[,c(1:4,7:9)], id.var=1:4)
# Tile plot where height of each Stage is proportional to how long it lasted
ggplot(stage.m, aes(variable, ma_mid, fill=factor(value))) +
geom_tile(aes(height=ma_max - ma_min), colour="grey20", lwd=0.2) +
scale_fill_manual(values=c("white","blue")) +
scale_y_continuous(breaks=stage$ma_mid, labels=stage$Stage) +
xlab("Genus") + ylab("Stage") +
theme_bw(base_size=15) +
guides(fill=FALSE)
The previous code can also be modified to use time ranges from data frames stage
and genus
if you want the blue to only cover the time range when each genus
lived, not the full range of each one stage
they lived in.
source to share
I would recommend the sqldf package.
library(sqldf)
Suppose your GENUS data is located in the Genus and Stage dataframe located in the stage dataframe.
res <- sqldf("select count(*) as countDinos , s.Stage, GROUP_CONCAT(g.GENUS) as names from genus g,stage s where (g.ma_max>=s.ma_min AND g.ma_max<=s.ma_max) OR (g.ma_min>=s.ma_min AND g.ma_min<=s.ma_max) OR (g.ma_max>s.ma_max AND g.ma_min<s.ma_min) group by s.Stage order by s.ma_mid DESC ")
Should give you an answer like this:
countDinos Stage names
1 Albian Abydosaurus
1 Cenomanian Abydosaurus
1 Santonian Achelousaurus
2 Campanian Achelousaurus ,Acheroraptor
2 Maastrichtian Achelousaurus ,Acheroraptor
source to share
You may also want to consider using a feature foverlaps
from the latest package data.table
.
# the setup is straight forward
library(data.table) # need version 1.9.5+ **
# can only download the latest version from
# https://github.com/Rdatatable/data.table
# at the time of posting this
setDT(genus); setDT(stage) # This line sets your data frames to data tables
setkey(genus, ma_min, ma_max) # This keys the start and end time of your time frame
setkey(stage, ma_min, ma_max) # This keys the start and end time of your time frame
# the opration
matches <- foverlaps(genus, stage, type="any", nomatch=0L)
matches
# Stage ma_max ma_min ma_mid GENUS i.ma_max i.ma_min i.ma_mid
# 1: Maastrichtian 70.6 66.5 68.05 Acheroraptor 70.6 66.043 68.3215
# 2: Campanian 83.5 70.6 77.05 Acheroraptor 70.6 66.043 68.3215
# 3: Maastrichtian 70.6 66.5 68.05 Achelousaurus 84.9 70.600 77.7500
# 4: Campanian 83.5 70.6 77.05 Achelousaurus 84.9 70.600 77.7500
# 5: Santonian 85.8 83.5 84.65 Achelousaurus 84.9 70.600 77.7500
# 6: Cenomanian 99.6 93.5 96.55 Abydosaurus 109.0 94.300 101.6500
# 7: Albian 112.0 99.6 105.80 Abydosaurus 109.0 94.300 101.6500
# This line below gives the frequency count (see column N)
matches[, N := length(GENUS), by=Stage][]
# Stage ma_max ma_min ma_mid GENUS i.ma_max i.ma_min i.ma_mid N
# 1: Maastrichtian 70.6 66.5 68.05 Acheroraptor 70.6 66.043 68.3215 2
# 2: Campanian 83.5 70.6 77.05 Acheroraptor 70.6 66.043 68.3215 2
# 3: Maastrichtian 70.6 66.5 68.05 Achelousaurus 84.9 70.600 77.7500 2
# 4: Campanian 83.5 70.6 77.05 Achelousaurus 84.9 70.600 77.7500 2
# 5: Santonian 85.8 83.5 84.65 Achelousaurus 84.9 70.600 77.7500 1
# 6: Cenomanian 99.6 93.5 96.55 Abydosaurus 109.0 94.300 101.6500 1
# 7: Albian 112.0 99.6 105.80 Abydosaurus 109.0 94.300 101.6500 1
# Of course you could also chain the two lines of code into one:
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][]
# If you prefer simplify the output by removing a few columns (6th to 8th), you could
foverlaps(genus, stage, type="any", nomatch=0L)[, N := length(GENUS), by=Stage][,6:8 := NULL][]
# Stage ma_max ma_min ma_mid GENUS N
# 1: Maastrichtian 70.6 66.5 68.05 Acheroraptor 2
# 2: Campanian 83.5 70.6 77.05 Acheroraptor 2
# 3: Maastrichtian 70.6 66.5 68.05 Achelousaurus 2
# 4: Campanian 83.5 70.6 77.05 Achelousaurus 2
# 5: Santonian 85.8 83.5 84.65 Achelousaurus 1
# 6: Cenomanian 99.6 93.5 96.55 Abydosaurus 1
# 7: Albian 112.0 99.6 105.80 Abydosaurus 1
We recommend checking this ?foverlaps
source to share
How about finding the maximum and minimum exceedance ranges? Here's a small dataset that I hope mimics your actual data quite well - the "eras" are adjacent, and my two "dinosaurs" have a random range of time.
dino
dmax dmin
[1,] 1 0.5500000 0.11
[2,] 2 0.3721239 0.05
> epoch
[,1] [,2]
[1,] 0.1 0.0
[2,] 0.2 0.1
[3,] 0.3 0.2
[4,] 0.4 0.3
[5,] 0.5 0.4
[6,] 0.6 0.5
[7,] 0.7 0.6
[8,] 0.8 0.7
[9,] 0.9 0.8
[10,] 1.0 0.9
[11,] 1.1 1.0
> max(which(epoch[,2]<dino[1,3])):min(which(epoch[,1]>dino[1,2]))
[1] 2 3 4 5 6
> max(which(epoch[,2]<dino[2,3])):min(which(epoch[,1]>dino[2,2]))
[1] 1 2 3 4
Thus, these last two rows identify the row numbers in the matrix epoch
for which the selected dinosaur existed (row from the dinosaur matrix). If you iterate over all lines i.e.for (j in 1:nrow(GENUS_matrix) ) {do that "max(which...)) stuff}
source to share
You can use "findInterval" to define the stage:
> stages <- read.table(text = "Stage ma_max ma_min ma_mid
+ Hettangian 201.6 197.0 199.30
+ Sinemurian 197.0 190.0 193.50
+ Pliensbachian 190.0 183.0 186.50
+ Toarcian 183.0 176.0 179.50
+ Aalenian 176.0 172.0 174.00
+ Bajocian 172.0 168.0 170.00
+ Bathonian 168.0 165.0 166.50
+ Callovian 165.0 161.0 163.00
+ Oxfordian 161.0 156.0 158.50
+ Kimmeridgian 156.0 151.0 153.50
+ Tithonian 151.0 145.5 148.25
+ Berriasian 145.5 140.0 142.75
+ Valanginian 140.0 136.0 138.00
+ Hauterivian 136.0 130.0 133.00
+ Barremian 130.0 125.0 127.50
+ Aptian 125.0 112.0 118.50
+ Albian 112.0 99.6 105.80
+ Cenomanian 99.6 93.5 96.55
+ Turonian 93.5 89.3 91.40
+ Coniacian 89.3 85.8 87.55
+ Santonian 85.8 83.5 84.65
+ Campanian 83.5 70.6 77.05
+ Maastrichtian 70.6 66.5 68.05", as.is = TRUE, header = TRUE)
>
> myData <- read.table(text = "GENUS ma_max ma_min ma_mid
+ Abydosaurus 109 94.3 101.65
+ Achelousaurus 84.9 70.6 77.75
+ Acheroraptor 70.6 66.043 68.3215 ", as.is = TRUE, header = TRUE)
>
> # flip around to create the intervals
> stages <- stages[rev(seq(nrow(stages))), ]
> interval <- c(stages$ma_min, tail(stages$ma_max, 1)) # create interval
>
> # for each item get the start/end stages
> start <- findInterval(myData$ma_max,interval, all.inside = TRUE)
> end <- findInterval(myData$ma_min, interval, all.inside = TRUE)
>
> myData$stages <- apply(cbind(start, end), 1L, function(.row){
+ paste(stages$Stage[.row[1L]:.row[2L]], collapse = ', ')
+ })
> myData
GENUS ma_max ma_min ma_mid stages
1 Abydosaurus 109.0 94.300 101.6500 Albian, Cenomanian
2 Achelousaurus 84.9 70.600 77.7500 Santonian, Campanian
3 Acheroraptor 70.6 66.043 68.3215 Campanian, Maastrichtian
source to share