How do I get a detailed description of a table in R?
I used SAS for 6 years and have now moved to R. In SAS I used PROC CONTENTS to get a useful description of a table, its characteristics and data types.
Using str(tableName)
, I can see the type but not the position of the vector in the dataframe.
Using names(tableName)
, I can see the names and positions of the vectors, but not the type.
Using summary(tableName)
, I can see the quantiles or category counts, but not easily the type, and not the vector position.
Is there a way to get a single listing with: Name, vectorPosition, type, min, max, avg, med [..]?
You can use lapply
to call a function on each column of a data.frame and compute all the quantities you want inside that function.
summary_text <- function(d) {
  do.call(rbind, lapply( d, function(u)
    data.frame(
      Type    = class(u)[1],
      Min     = if(is.numeric(u)) min(   u, na.rm=TRUE) else NA,
      Mean    = if(is.numeric(u)) mean(  u, na.rm=TRUE) else NA,
      Median  = if(is.numeric(u)) median(u, na.rm=TRUE) else NA,
      Max     = if(is.numeric(u)) max(   u, na.rm=TRUE) else NA,
      Missing = sum(is.na(u))
    )
  ) )
}
summary_text(iris)
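The function above does not report the column position that the question asks for. A minimal sketch that adds it, assuming a hypothetical function name summary_table and an illustrative column layout (neither comes from the answer above):

```r
# Hypothetical variant of summary_text() that also reports name and position.
summary_table <- function(d) {
  # Helper: apply f to numeric columns only, return NA otherwise.
  num_stat <- function(u, f) {
    if (is.numeric(u)) as.numeric(f(u, na.rm = TRUE)) else NA_real_
  }
  data.frame(
    Position = seq_along(d),   # index of the column in the data frame
    Name     = names(d),
    Type     = vapply(d, function(u) class(u)[1], character(1)),
    Min      = vapply(d, function(u) num_stat(u, min), numeric(1)),
    Mean     = vapply(d, function(u) num_stat(u, mean), numeric(1)),
    Median   = vapply(d, function(u) num_stat(u, median), numeric(1)),
    Max      = vapply(d, function(u) num_stat(u, max), numeric(1)),
    Missing  = vapply(d, function(u) sum(is.na(u)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

res <- summary_table(iris)
print(res)
```

Each row describes one column of the input, so the result can be sorted or filtered like any other data frame.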
But I personally prefer to look at the data graphically: the following function displays a histogram and a quantile-quantile plot for each numeric variable, and a bar plot for each factor, all on one page. With 20 to 30 variables, it should still be usable.
summary_plot <- function(d, aspect=1) {
  # Split the screen: find the optimal number of columns
  # and rows to be as close as possible to the desired aspect ratio.
  n <- ncol(d)
  dx <- par()$din[1]
  dy <- par()$din[2]
  f <- function(u,v) {
    if( u*v >= n && (u-1)*v < n && u*(v-1) < n ) {
      abs(log((dx/u)/(dy/v)) - log(aspect))
    } else {
      NA
    }
  }
  f <- Vectorize(f)
  r <- outer( 1:n, 1:n, f )
  r <- which( r == min(r,na.rm=TRUE), arr.ind=TRUE )
  r <- r[1,2:1]   # r[1] = number of rows, r[2] = number of columns

  op <- par(mfrow=c(1,1),mar=c(2,2,2,2))
  plot.new()
  if( is.null( names(d) ) ) { names(d) <- 1:ncol(d) }
  ij <- matrix(seq_len(prod(r)), nrow=r[1], ncol=r[2], byrow=TRUE)
  for(k in seq_len(ncol(d))) {
    # Locate the grid cell of the k-th variable
    i <- which(ij==k, arr.ind=TRUE)[1]
    j <- which(ij==k, arr.ind=TRUE)[2]
    i <- r[1] - i + 1
    f <- c(j-1,j,i-1,i) / c(r[2], r[2], r[1], r[1] )
    par(fig=f, new=TRUE)
    if(is.numeric(d[,k])) {
      # Numeric variable: histogram with a small quantile plot inset
      hist(d[,k], las=1, col="grey", main=names(d)[k], xlab="", ylab="")
      o <- par(fig=c(
          f[1]*.4  + f[2]*.6,
          f[1]*.15 + f[2]*.85,
          f[3]*.4  + f[4]*.6,
          f[3]*.15 + f[4]*.85
        ),
        new=TRUE,
        mar=c(0,0,0,0)
      )
      qqnorm(d[,k],axes=FALSE,xlab="",ylab="",main="")
      qqline(d[,k])
      box()
      par(o)
    } else {
      # Non-numeric variable: horizontal bar plot of the counts
      o <- par(mar=c(2,5,2,2))
      barplot(table(d[,k]), horiz=TRUE, las=1, main=names(d)[k])
      par(o)
    }
  }
  par(op)
}
summary_plot(iris)
It sounds like you are looking for something like describe() from the Hmisc package. My recollection is that Frank Harrell (the package author) was a longtime SAS programmer who came to the R world quite early on. The style of summaries that describe() provides no doubt reflects that computational genealogy:
library(Hmisc)
describe(cars) # for example
cars
2 Variables 50 Observations
---------------------------------------------------------------------------------
speed
n missing unique Mean .05 .10 .25 .50 .75 .90
50 0 19 15.4 7.0 8.9 12.0 15.0 19.0 23.1
.95
24.0
4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
Frequency 2 2 1 1 3 2 4 4 4 3 2 3 4 3 5 1 1 4 1
% 4 4 2 2 6 4 8 8 8 6 4 6 8 6 10 2 2 8 2
---------------------------------------------------------------------------------
dist
n missing unique Mean .05 .10 .25 .50 .75 .90
50 0 35 42.98 10.00 15.80 26.00 36.00 56.00 80.40
.95
88.85
lowest : 2 4 10 14 16, highest: 84 85 92 93 120
---------------------------------------------------------------------------------
This is really quick and dirty, but if I understood you correctly, it is what you need.
As an example, I took the information returned by summary() and simply added the class and mode of each column of the data frame. I'm not very familiar with the table class in R, so the formatting is rather off.
df <- data.frame(
  a=1:5,
  b=rep(TRUE, 5),
  c=letters[1:5],
  # make column c a factor, as in the output below (the default before R 4.0)
  stringsAsFactors=TRUE
)
mySummary <- function(x, ...) {
  out <- NULL
  # Summary table of the whole data frame: one column per variable
  a <- summary(x, ...)
  for (ii in seq_len(ncol(x))) {
    temp <- list(
      c(paste("Class:", class(x[,ii])), paste("Mode:", mode(x[,ii])),
        c(a[,ii]))
    )
    names(temp) <- names(x)[ii]
    out <- c(out, temp)
  }
  out
}
> mySummary(df)
$a
"Class: integer" "Mode: numeric" "Min. :1 " "1st Qu.:2 "
"Median :3 " "Mean :3 " "3rd Qu.:4 " "Max. :5 "
$b
"Class: logical" "Mode: logical" "Mode:logical " "TRUE:5 "
"NA's:0 " NA NA NA
$c
"Class: factor" "Mode: numeric" "a:1 " "b:1 " "c:1 "
"d:1 " "e:1 " NA
You might want to look at how the summary() method for class data.frame is defined, and then customize it to suit your needs.
To find out which methods are defined for summary():
methods("summary")
> methods("summary")
[1] summary.aov summary.aovlist summary.aspell*
[4] summary.connection summary.data.frame summary.Date
[7] summary.default summary.ecdf* summary.factor
[10] summary.glm summary.infl summary.lm
[13] summary.loess* summary.manova summary.matrix
[16] summary.mlm summary.nls* summary.packageStatus*
[19] summary.PDF_Dictionary* summary.PDF_Stream* summary.POSIXct
[22] summary.POSIXlt summary.ppr* summary.prcomp*
[25] summary.princomp* summary.srcfile summary.srcref
[28] summary.stepfun summary.stl* summary.table
[31] summary.tukeysmooth*
Non-visible functions are asterisked
One way to see the code is to print the function by typing its name:
summary.data.frame
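Methods marked with an asterisk in the listing are not exported, so typing their name at the prompt does not find them. A short sketch using getS3method() and getAnywhere() from the utils package, which look a method up whether or not it is visible:

```r
# Retrieve the data.frame method for summary() by generic and class name.
f <- getS3method("summary", "data.frame")
print(is.function(f))

# getAnywhere() also finds non-exported (asterisked) methods such as summary.ecdf.
g <- getAnywhere("summary.ecdf")
print(g$where)     # where the object was found
print(g$objs[[1]]) # the function definition itself
```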