Computing Anderson-Darling Test Statistics for Continuous Distributions in R

First of all, I'm not sure whether this belongs on Cross Validated or Stack Overflow. Sorry if I posted this question on the wrong site.

I am comparing several datasets to an observational dataset using R. Each has about 10 million continuous float values (the length of the data vector is not exactly the same for each dataset).

I usually calculate the Kolmogorov-Smirnov statistic using the function ks.test() from the standard package stats, but now I am especially interested in the extreme values of the distributions. From what I understand, KS is largely insensitive to them. The same is true of the Kullback-Leibler divergence (feel free to correct me if I'm wrong).

On the other hand, the Anderson-Darling test is weighted to account for the extremes of the distributions. However, I have not been able to find a simple AD test implementation that works directly on two vectors as inputs (as stats::ks.test() does with a plain call like ks.test(obs.data, mod.data), where the two inputs are simple vectors), and I was not able to figure out how to adapt my data to any of the functions I tested.

I looked at the following functions:

  • cvm.test() from package dgof, with option type="A2": requires a distribution as second input, not a vector
  • ad.test() from package truncgof: requires a distribution as second input
  • ad.test() from package goftest: as above
  • ad.test() from package ADGofTest: as above
  • ad.test() from package kSamples: in this case it is not clear to me what the result represents and how I could normalize it, as it seems to be highly dependent on the number of samples
  • ad.test() from package nortest: only tests for normality
  • ADbootstrap.test() from package homtest: this seems to be completely different from the standard AD test

In short, none of the above can be used as simply as the standard function ks.test(), or as the Kullback-Leibler function KLdiv from the package flexmix (which accepts a matrix of density values).

How can I calculate the AD statistic between two distributions, each represented simply as a vector of continuous data, using R?


1 answer


I am not a statistics expert; I have been studying the AD test on my own and had the same question. After reading some articles, I now know how to interpret the results of ad.test() from kSamples.

The original AD test is designed to test whether a sample of numbers comes from a specific distribution. Therefore, to compare two (or more) samples with each other, we have to use the k-sample version of the test instead of the original one.

If you pass two vectors to ad.test() from the kSamples package:

library(kSamples)
x <- ad.test(c(1,2,3,4,5), c(11,22,33,44,55))

      

printing the result gives:



print(x)

Anderson-Darling k-sample test.

Number of samples:  2
Sample sizes:  5, 5
Number of ties: 0

Mean of  Anderson-Darling  Criterion: 1
Standard deviation of  Anderson-Darling  Criterion: 0.63786

T.AD = ( Anderson-Darling  Criterion - mean)/sigma

Null Hypothesis: All samples come from a common population.

              AD  T.AD  asympt. P-value
version 1: 3.913 4.566          0.00517
version 2: 4.010 4.726          0.00452

      

or,

x$ad

               AD   T.AD  asympt. P-value
version 1: 3.9127 4.5664        0.0051703
version 2: 4.0100 4.7260        0.0045199
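
The header of the printed output defines T.AD = (AD - mean)/sigma. As a quick sanity check, the version 1 row can be reproduced by hand from the numbers shown above. This is only a sketch using values copied from the output; the variable names are illustrative and not part of kSamples.

```r
# Values copied from the printed output above (version 1 row).
ad_stat <- 3.9127    # AD criterion, version 1
ad_mean <- 1         # mean of the AD criterion under the null (k - 1 with k = 2)
ad_sigma <- 0.63786  # standard deviation of the AD criterion

# Standardized statistic, as defined in the output header.
t_ad <- (ad_stat - ad_mean) / ad_sigma
print(round(t_ad, 4))  # 4.5664, matching the T.AD column for version 1
```

Because T.AD is centered and scaled, it is the quantity to compare across runs rather than the raw AD value.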

      

AD is the Anderson-Darling statistic, calculated according to the corresponding equations (see the reference article). T.AD is calculated by the formula (AD - (k-1)) / sigma, where k-1 is the mean of AD under the null hypothesis (the limiting distribution of AD under the null hypothesis is the (k-1)-fold convolution of the asymptotic distribution of the one-sample AD statistic) and sigma is the standard deviation of the AD statistic. The asympt. P-value is then the p-value we are looking for. As for the rows, version 1 is the k-sample AD test for continuous populations, and version 2 is the variant for discrete populations. So my guess is that if your data is continuous, you should take the p-value from the first row, and if it is discrete, the one from the second row.
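
For illustration, the version 1 statistic can also be computed in base R directly from two vectors. The following is a minimal sketch, assuming there are no ties in the pooled data (reasonable for continuous floats) and using the k-sample formula as I understand it from Scholz and Stephens, specialized to k = 2. The function name ad_two_sample is hypothetical and not part of any package. It returns only the raw statistic, not a p-value.

```r
# Two-sample Anderson-Darling statistic (version 1, no ties assumed).
# A2 = (1/N) * sum_i (1/n_i) * sum_{j=1}^{N-1} (N*M_ij - j*n_i)^2 / (j*(N-j))
# where M_ij is the number of values in sample i that are <= the j-th
# smallest value of the pooled sample.
ad_two_sample <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  N <- as.numeric(n1 + n2)            # numeric to avoid integer overflow

  # Sort the pooled sample and remember which values came from x.
  ord <- order(c(x, y))
  from_x <- c(rep(TRUE, n1), rep(FALSE, n2))[ord]

  j <- as.numeric(seq_len(N - 1))     # positions 1 .. N-1 in the pooled sample
  m1 <- cumsum(from_x)[seq_len(N - 1)]  # x values among the j smallest
  m2 <- j - m1                          # y values among the j smallest
  w <- j * (N - j)                      # tail weights

  (sum((N * m1 - j * n1)^2 / w) / n1 +
     sum((N * m2 - j * n2)^2 / w) / n2) / N
}

a2 <- ad_two_sample(c(1, 2, 3, 4, 5), c(11, 22, 33, 44, 55))
print(round(a2, 4))  # 3.9127, the version 1 AD value in the output above
```

The sort dominates the cost, so this should scale to vectors with millions of values, though that has not been benchmarked here. The weights 1 / (j * (N - j)) are largest near both ends of the pooled sample, which is what gives the statistic its sensitivity to the tails.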
