Using a hash or list for factor levels
I am dealing with a categorical variable retrieved from a database, and I want to use factors to keep the data "complete".
For example, I have a table that stores colors and their associated numeric IDs:
ID   | Color
-----+------
1    | Black
1805 | Red
3704 | White
So I would like to use a factor to store this information in a dataframe, for example:
Car Model | Color
----------+------
Civic     | Black
Accord    | White
Sentra    | Red
where the Color column is a factor whose underlying stored data is not strings but c(1, 3704, 1805), the IDs associated with each color.
I can create such a custom factor by modifying the levels attribute of a factor object.
Unfortunately, as you can see in this example, my IDs are not consecutive. In my application, I have 30 levels and the maximum ID for a level is ~9000. Since a factor's levels are stored in a vector indexed by the code, this means I am storing a vector of length ~9000 with only 30 elements actually in use.
Can a hash or list be used to accomplish this more efficiently? That is, if I used a hash as the factor's levels attribute, I could store all 30 elements at whatever indices I like without creating an array of size max(ID).
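A rough sketch of the sparse-levels setup described above, assuming the unused slots of the levels vector are padded with NA (the padding and the variable names are only illustrative):

```r
# Levels vector long enough to be indexed by the largest ID
lv <- rep(NA_character_, 3704)
lv[c(1, 1805, 3704)] <- c("Black", "Red", "White")

# Build the factor by hand: the stored integers are the IDs themselves
x <- structure(c(1L, 3704L, 1805L), levels = lv, class = "factor")

as.character(x)          # labels looked up by ID
length(levels(x))        # 3704 slots to hold 3 labels
sum(!is.na(levels(x)))   # only 3 of them are used
```

The wasted space grows with the largest ID rather than with the number of levels.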
Thanks in advance!
OK, I'm pretty sure you cannot change how factors work. A factor always has level ids that are the integers 1..n, where n is the number of levels.
...but you can easily use a translation vector to get to your color IDs:
# The translation vector...
colorIds <- c(Black=1,Red=1805,White=3704)
# Create a factor with the correct levels
# (but with level ids that are 1,2,3...)
f <- factor(c('Red','Black','Red','White'), levels=names(colorIds))
as.integer(f) # 2 1 2 3
# Translate level ids to your color ids
colorIds[f] # 1805 1 1805 3704
Technically, colorIds doesn't need to carry the color names, but it keeps everything in one place, since the names are used when creating the factor's levels. You want to specify the levels explicitly so that their numbering matches the translation vector even when the levels are not in alphabetical order (yours happen to be alphabetical).
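To illustrate why the explicit levels matter, here is a small sketch using a translation vector that is deliberately not in alphabetical order (the reordering is an assumption made for the example, not part of the original data):

```r
# Translation vector in non-alphabetical order
colorIds <- c(White = 3704, Black = 1, Red = 1805)
x <- c("Red", "Black", "White")

# Default levels are sorted alphabetically: Black, Red, White
f_default <- factor(x)
# Explicit levels follow the order of colorIds: White, Black, Red
f_explicit <- factor(x, levels = names(colorIds))

wrong <- unname(colorIds[f_default])    # positions do not line up
right <- unname(colorIds[f_explicit])   # positions match colorIds

# Indexing by name is safe regardless of the level order
by_name <- unname(colorIds[as.character(f_default)])
```

Here `wrong` comes out as 1, 3704, 1805, while `right` and `by_name` both give 1805, 1, 3704.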
EDIT: However, it is possible to create a class derived from factor that has the codes as an attribute. Let's call this glorious new class foo:
foo <- function(x = character(), levels, codes) {
f <- factor(x, levels)
attr(f, 'codes') <- codes
class(f) <- c('foo', class(f))
f
}
`[.foo` <- function(x, ...) {
y <- NextMethod('[')
attr(y, 'codes') <- attr(x, 'codes')
y
}
as.integer.foo <- function(x, ...) attr(x,'codes')[unclass(x)]
# Try it out
set.seed(42)
f <- foo(sample(LETTERS[1:5], 10, replace=TRUE), levels=LETTERS[1:5], codes=101:105)
d <- data.frame(i=11:15, f=f)
# Try subsetting it...
d2 <- d[2:5,]
# Gets the codes, not the level ids...
as.integer(d2$f) # e.g. 105 102 105 104 (exact values depend on the R version's sample() algorithm)
Then you could also fix print.foo etc.
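As one possible take on print.foo, here is a minimal sketch that shows each label next to its code. The output format is purely an assumption for illustration; foo and as.integer.foo are repeated from above so the block runs on its own.

```r
foo <- function(x = character(), levels, codes) {
  f <- factor(x, levels)
  attr(f, "codes") <- codes
  class(f) <- c("foo", class(f))
  f
}
as.integer.foo <- function(x, ...) attr(x, "codes")[unclass(x)]

# Hypothetical print method: "label (code)" for each element
print.foo <- function(x, ...) {
  out <- paste0(as.character(x), " (", as.integer(x), ")")
  print(noquote(out), ...)
  invisible(x)
}

f <- foo(c("B", "A", "C"), levels = c("A", "B", "C"), codes = 101:103)
print(f)
```

Other generics (format, summary, and so on) would need similar treatment if the codes should show up there too.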
Thinking about it, the only feature that "levels" has to implement in order to have a valid factor is the [ accessor. Thus, any object that implements the [ accessor can be viewed as a vector from the standpoint of any function that interfaces with it.
I looked at the hash class, but saw that it uses the normal R behavior (as seen in lists) of returning a slice of the original hash when a single bracket is used (and retrieving the actual value when a double bracket is used). However, when I overrode this using setMethod(), I was actually able to get the desired behavior.
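For reference, the single versus double bracket behavior mentioned above looks like this on a plain base R list (no hash package involved):

```r
l <- list(a = 1, b = 2)

slice <- l["a"]     # single bracket: a list of length 1 (a "slice")
value <- l[["a"]]   # double bracket: the stored value itself

is.list(slice)      # TRUE
is.list(value)      # FALSE
```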
library(hash)
setMethod(
'[' ,
signature( x="hash", i="ANY", j="missing", drop = "missing") ,
function(
x,i,j, ... ,
drop
) {
if (is.factor(i)){
#presumably trying to lookup the values associated with the ordered keys in this hash
toReturn <- NULL
for (k in make.keys(as.integer(i))){
toReturn <- c(toReturn, get(k, envir=x@.xData))
}
return(toReturn)
}
#default, just make keys and get from the environment
toReturn <- NULL
for (k in make.keys(i)){
toReturn <- c(toReturn, get(k, envir=x@.xData))
}
return(toReturn)
}
)
as.character.hash <- function(h){
as.character(values(h))
}
print.hash <- function(h){
print(as.character(h))
}
h <- hash(1:26, letters)
df <- data.frame(ID=1:26, letter=26:1, stringsAsFactors=FALSE)
attributes(df$letter)$class <- "factor"
attributes(df$letter)$levels <- h
> df
ID letter
1 1 z
2 2 y
3 3 x
4 4 w
5 5 v
6 6 u
7 7 t
8 8 s
9 9 r
10 10 q
11 11 p
12 12 o
13 13 n
14 14 m
15 15 l
16 16 k
17 17 j
18 18 i
19 19 h
20 20 g
21 21 f
22 22 e
23 23 d
24 24 c
25 25 b
26 26 a
> attributes(df$letter)$levels
<hash> containing 26 key-value pair(s).
1 : a
10 : j
11 : k
12 : l
13 : m
14 : n
15 : o
16 : p
17 : q
18 : r
19 : s
2 : b
20 : t
21 : u
22 : v
23 : w
24 : x
25 : y
26 : z
3 : c
4 : d
5 : e
6 : f
7 : g
8 : h
9 : i
>
> df[1,2]
[1] z
Levels: a j k l m n o p q r s b t u v w x y z c d e f g h i
> as.integer(df$letter)
[1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
[26] 1
Any feedback on this? As far as I can tell, everything works: the data frame prints correctly, and the underlying data stored in the actual data.frame is intact, so I don't feel like I'm in any danger. I could even add a new class to my package that simply implements this accessor, which would avoid taking on a dependency on the hash package.
Any feedback, or pointers to anything I'm overlooking, would be greatly appreciated.
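A minimal sketch of what such a dependency-free class might look like, assuming an S3 class backed by an environment. The names new_idmap and idmap are hypothetical, and the block only demonstrates the [ accessor; whether base R accepts a non-character levels attribute may depend on the R version.

```r
# Hypothetical constructor: stores labels in an environment keyed by ID
new_idmap <- function(ids, labels) {
  env <- new.env(hash = TRUE, parent = emptyenv())
  for (k in seq_along(ids)) {
    assign(as.character(as.integer(ids[k])), labels[k], envir = env)
  }
  structure(list(env = env), class = "idmap")
}

# Single-bracket accessor: returns the labels themselves, not a slice
`[.idmap` <- function(x, i, ...) {
  env <- unclass(x)$env
  keys <- as.character(as.integer(i))   # factors contribute their codes
  unname(vapply(keys, function(k) get(k, envir = env, inherits = FALSE),
                character(1)))
}

m <- new_idmap(c(1, 1805, 3704), c("Black", "Red", "White"))
m[c(1, 3704, 1805)]
```

Only the 30 or so real IDs are stored, so the size no longer depends on the largest ID.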