# Multiple probability distribution functions

I am having a hard time creating an efficient procedure that adds and multiplies probability density functions to predict the distribution of the time it will take to complete two steps in the process.

Let "a" represent the probability distribution function of how long it takes for process "A" to complete. Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes for process "B" to complete. Zero days = 10%, one day = 20%, etc.

Process "B" cannot start before process "A" finishes, so "B" depends on "A".

```
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
```

How can I calculate the probability density function of the time to complete "A" and "B"?

This is what I would expect as output for the following example:

```
totallength <- numeric(length(a) + length(b) - 1) # initialize: 7 possible totals (0 to 6 days)
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
```

I have an approach in Visual Basic that uses three for loops (one for each step and one for the output), but I hope I don't need explicit loops in R.

Since this seems to be a fairly standard question about process flow, the second part of my question is whether there are any libraries for modeling workflows, so that I am not building this from scratch.

---

An efficient way to perform this kind of operation is to use convolution:

```
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
```
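For comparison, the same distribution can be built without `convolve`, by taking the outer product of the two vectors and summing the joint probabilities by total duration (a loop-free sketch; `joint` and `days` are names introduced here, not from the original code):

```
a <- c(.1, .4, .5)
b <- c(.1, .2, .3, .3, .1)
joint <- outer(a, b)                                 # P(A = i days and B = j days)
days  <- outer(seq_along(a), seq_along(b), `+`) - 2  # total days for each cell, starting at 0
tapply(joint, days, sum)
#    0    1    2    3    4    5    6
# 0.01 0.06 0.16 0.25 0.28 0.19 0.05
```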

This is efficient both because it takes far less typing than computing each value individually, and because `convolve` is implemented efficiently (using the Fast Fourier Transform, FFT).

You can confirm that each of these values is correct using the formulas laid out in the question:

```
(expected <- c(a[1]*b[1],
               a[1]*b[2] + a[2]*b[1],
               a[1]*b[3] + a[2]*b[2] + a[3]*b[1],
               a[1]*b[4] + a[2]*b[3] + a[3]*b[2],
               a[1]*b[5] + a[2]*b[4] + a[3]*b[3],
               a[2]*b[5] + a[3]*b[4],
               a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
```

---

See the package `distr`. The choice of the term "multiply" is unfortunate, because the situation described is not one where the contributions to the probabilities are independent (where multiplication of probabilities would be the natural term to use). It is rather a kind of sequential addition of durations, and that is what the `distr` package provides as its interpretation of what `+` means when used as a symbolic operation on two discrete distributions.

```
library(distr)
A <- DiscreteDistribution(setNames(0:2, c("zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("zero2", "one2", "two2",
                                          "three2", "four2")), b)
?'operators-methods'  # "+" on two DiscreteDistribution objects is convolution
plot(A+B)
```

After a little poking around, I see that the actual numeric values can be found here:

```
A.then.B <- A + B
environment(A.then.B@d)$dx
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
```

It seems as though there ought to be a method for displaying the probabilities directly, and I am not a regular user of this fascinating package, so there may well be one. Read the vignette and demo code... which I have not done yet. Further noodling around convinces me that the right place to look is the package `distrDoc`, where the vignette is over 100 pages long. And it should not have taken any effort to find, since this advice is in the messages printed when the package is loaded... except that, in my defense, there were several screens of messages, so it was more tempting to dive straight into coding and the help pages.
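For what it's worth, the probabilities can also be read off without reaching into the environment, via the package's `d` (density) accessor and `support` method (a sketch assuming the current `distr` API):

```
library(distr)
A <- DiscreteDistribution(0:2, c(.1, .4, .5))
B <- DiscreteDistribution(0:4, c(.1, .2, .3, .3, .1))
AB <- A + B            # "+" performs the convolution
support(AB)            # 0 1 2 3 4 5 6
d(AB)(support(AB))     # 0.01 0.06 0.16 0.25 0.28 0.19 0.05
```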

---

I am not familiar with a dedicated package that exactly matches your example, but let me suggest a more general solution to this problem. You are looking for a way to estimate the distribution of the total duration of an n-step process (in your case n = 2), which is not always as easy to compute analytically as in your example. The approach I would use is simulation: draw, say, 10,000 observations from the underlying distributions and then compute the density of the simulated totals. Using your example, we can do the following:

```
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x,y))
# map uniform draws to days for A: P(0) = .1, P(1) = .4, P(2) = .5
z[x >= 0   & x < 0.1, a_days := 0]
z[x >= 0.1 & x < 0.5, a_days := 1]
z[x >= 0.5 & x <= 1,  a_days := 2]
# map uniform draws to days for B: P = .1, .2, .3, .3, .1 (note: use y, not x)
z[y >= 0   & y < 0.1, b_days := 0]
z[y >= 0.1 & y < 0.3, b_days := 1]
z[y >= 0.3 & y < 0.6, b_days := 2]
z[y >= 0.6 & y < 0.9, b_days := 3]
z[y >= 0.9 & y <= 1,  b_days := 4]
z[,total_days:=a_days+b_days]
hist(z[,total_days])
```
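The manual cut points (and their bug-proneness) can be avoided by drawing from the discrete distributions directly with `sample()` (a sketch; the simulation size `n` is arbitrary):

```
set.seed(42)   # for reproducibility
a <- c(.1, .4, .5)
b <- c(.1, .2, .3, .3, .1)
n <- 1e5
a_days <- sample(0:2, n, replace = TRUE, prob = a)
b_days <- sample(0:4, n, replace = TRUE, prob = b)
total  <- a_days + b_days
round(prop.table(table(total)), 2)   # approximates the exact convolution
```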

This gives a very good proxy for the density, and the approach also works if your second process is drawn from, say, an exponential distribution. In that case you would use `rexp` to generate `b_days` directly.
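If process B really were continuous, the same simulation pattern applies; in this sketch the rate 1/2 (a mean completion time of two days) is purely illustrative, not from the question:

```
set.seed(42)
n <- 1e5
a_days <- sample(0:2, n, replace = TRUE, prob = c(.1, .4, .5))
b_days <- rexp(n, rate = 1/2)   # hypothetical: exponential B, mean 2 days
total  <- a_days + b_days
plot(density(total), main = "Simulated total completion time")
```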

---