Hash- or list-backed factor levels

I am dealing with a categorical variable retrieved from a database, and I want to use factors to keep the data "complete".

For example, I have a table that stores colors and their associated numeric IDs:

   ID | Color
------+-------
    1 | Black
 1805 | Red
 3704 | White

So I would like to use a factor to store this information in a dataframe, for example:

Car Model | Color
----------+-------
Civic     | Black
Accord    | White
Sentra    | Red

where the Color column is a factor, and the underlying data is stored not as strings but as c(1, 3704, 1805), the IDs associated with each color.

I can achieve this effect by creating a custom factor, that is, by modifying the levels attribute of an object of class factor.

Unfortunately, as you can see in this example, my IDs are not consecutive. In my application I have 30 levels, and the maximum ID for a level is ~9000. Since the levels of a factor are stored in a vector indexed by the integer codes, this means I am storing a vector of length ~9000 with only 30 elements in use.
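To make the waste concrete, here is a minimal sketch using only the three colors from the table above. The names ids, sparse_levels and codes are made up for illustration; the sketch only builds the lookup vector and does not construct a factor.

```r
# IDs from the example table, keyed by color name.
ids <- c(Black = 1L, Red = 1805L, White = 3704L)

# A levels-style vector that can be indexed directly by ID
# must be as long as the largest ID.
sparse_levels <- rep(NA_character_, max(ids))
sparse_levels[ids] <- names(ids)

length(sparse_levels)        # 3704 slots allocated
sum(!is.na(sparse_levels))   # only 3 of them are used

# Looking up the labels for the stored IDs works, but is wasteful.
codes <- c(1L, 3704L, 1805L)
sparse_levels[codes]         # "Black" "White" "Red"
```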

Can a hash or a list be used to accomplish this more efficiently? That is, if I could use a hash as the levels attribute of the factor, I could store all 30 elements under whatever indices I like without creating an array of length max(ID).

Thanks in advance!



2 answers


OK, I'm pretty sure you cannot change how factors work. A factor always has level identifiers that are the whole numbers 1..n, where n is the number of levels.

... but you can easily use a translation vector to get to your color IDs:

# The translation vector...
colorIds <- c(Black=1,Red=1805,White=3704)

# Create a factor with the correct levels 
# (but with level ids that are 1,2,3...)
f <- factor(c('Red','Black','Red','White'), levels=names(colorIds))
as.integer(f) # 2 1 2 3

# Translate level ids to your color ids
colorIds[f] # 1805 1 1805 3704

      

Technically, colorIds does not need to carry the color names, but keeping everything in one place is convenient, since the names are used when creating the levels for the factor. You want to specify the levels explicitly so that their numbering matches the positions in colorIds even when the levels are not in alphabetical order (as yours happen to be).
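Why the explicit levels matter can be seen in a small sketch where the translation vector is deliberately not in alphabetical order. The variable names (f_default, f_explicit, wrong, right, by_name) are illustrative, not part of the answer above.

```r
# Same IDs as above, deliberately listed in non-alphabetical order.
colorIds <- c(White = 3704, Black = 1, Red = 1805)
x <- c('Red', 'Black', 'Red', 'White')

# Default levels are sorted alphabetically: Black, Red, White.
f_default <- factor(x)
# Explicit levels follow the order of colorIds: White, Black, Red.
f_explicit <- factor(x, levels = names(colorIds))

# Indexing by a factor uses its integer codes, i.e. positions.
wrong <- unname(colorIds[f_default])   # positions do not line up
right <- unname(colorIds[f_explicit])  # 1805 1 1805 3704

# Looking up by name instead of by position is robust either way.
by_name <- unname(colorIds[as.character(f_default)])
```

The name-based lookup in the last line is an alternative if you cannot control how the factor was created.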



EDIT. However, it is possible to create a class derived from factor that carries the codes as an attribute. Let's call this glorious new class foo:

foo <- function(x = character(), levels, codes) {
    f <- factor(x, levels)
    attr(f, 'codes') <- codes
    class(f) <- c('foo', class(f))
    f
}

`[.foo` <- function(x, ...) {
    y <- NextMethod('[')
    attr(y, 'codes') <- attr(x, 'codes')
    y
}

as.integer.foo <- function(x, ...) attr(x,'codes')[unclass(x)]

# Try it out
set.seed(42)
f <- foo(sample(LETTERS[1:5], 10, replace=TRUE), levels=LETTERS[1:5], codes=101:105)

d <- data.frame(i=11:15, f=f)

# Try subsetting it...
d2 <- d[2:5,]

# Gets the codes, not the level ids...
as.integer(d2$f) # 105 102 105 104 (from the original run; sample() output differs across R versions)

      

You can then go on to fix up print.foo etc.
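As a rough idea of what such a print method might look like, here is a sketch. The constructor is repeated so the block runs on its own; the output layout of print.foo is my own assumption, not something prescribed by the answer.

```r
# Constructor repeated from the answer so this block is self-contained.
foo <- function(x = character(), levels, codes) {
    f <- factor(x, levels)
    attr(f, 'codes') <- codes
    class(f) <- c('foo', class(f))
    f
}

# Hypothetical print method: show each label next to its external code.
print.foo <- function(x, ...) {
    codes <- attr(x, 'codes')
    idx <- unclass(x)                      # plain level ids 1..n
    print(paste0(levels(x)[idx], ' (', codes[idx], ')'), quote = FALSE)
    cat('Levels:', paste0(levels(x), '=', codes, collapse = ' '), '\n')
    invisible(x)
}

f <- foo(c('Red', 'Black', 'Red', 'White'),
         levels = c('Black', 'Red', 'White'),
         codes = c(1L, 1805L, 3704L))
print(f)
```

Other generics (format, summary, c, and so on) would need similar treatment if the codes attribute is to survive everywhere.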



With that in mind, the only thing a "levels" object has to implement in order to give a working factor is the [ accessor. Thus, any object that implements the [ accessor can be treated as a vector by any function that interacts with it.

I looked at the hash class, but saw that it follows the normal R behavior (as seen in lists): it returns a slice of the original hash when a single bracket is used, and retrieves the actual value only when a double bracket is used. However, once I overrode this with setMethod(), I was able to get the desired behavior.
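The single- versus double-bracket behavior referred to here can be seen with a plain base R list, without the hash package. The variable names are only for illustration.

```r
l <- list(a = 1, b = 'x')

slice <- l['a']     # single bracket: a list of length 1 containing a
value <- l[['a']]   # double bracket: the element itself

is.list(slice)      # TRUE
is.list(value)      # FALSE
```

For a levels lookup, the single bracket would have to return the values themselves, which is why the accessor is overridden below.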

library(hash)

setMethod( 
    '[' , 
    signature( x="hash", i="ANY", j="missing", drop = "missing") ,  
    function( 
        x,i,j, ... ,        
        drop
        ) {     

        if (is.factor(i)) {
            # a factor index: look up the values keyed by its integer codes
            toReturn <- NULL
            for (k in make.keys(as.integer(i))){
                toReturn <- c(toReturn, get(k, envir=x@.xData))
            }
            return(toReturn)
        }

        #default, just make keys and get from the environment
        toReturn <- NULL
        for (k in make.keys(i)){
            toReturn <- c(toReturn, get(k, envir=x@.xData))
        }
        return(toReturn)        
    }
    )

# S3 methods take (x, ...) to match the signatures of their generics
as.character.hash <- function(x, ...) {
    as.character(values(x))
}

print.hash <- function(x, ...) {
    print(as.character(x))
}

h <- hash(1:26, letters)

df <- data.frame(ID=1:26, letter=26:1, stringsAsFactors=FALSE)

attributes(df$letter)$class <- "factor"
attributes(df$letter)$levels <- h

>   df
   ID letter
1   1      z
2   2      y
3   3      x
4   4      w
5   5      v
6   6      u
7   7      t
8   8      s
9   9      r
10 10      q
11 11      p
12 12      o
13 13      n
14 14      m
15 15      l
16 16      k
17 17      j
18 18      i
19 19      h
20 20      g
21 21      f
22 22      e
23 23      d
24 24      c
25 25      b
26 26      a
>   attributes(df$letter)$levels
<hash> containing 26 key-value pair(s).
  1 : a
  10 : j
  11 : k
  12 : l
  13 : m
  14 : n
  15 : o
  16 : p
  17 : q
  18 : r
  19 : s
  2 : b
  20 : t
  21 : u
  22 : v
  23 : w
  24 : x
  25 : y
  26 : z
  3 : c
  4 : d
  5 : e
  6 : f
  7 : g
  8 : h
  9 : i
>
> df[1,2]
[1] z
Levels: a j k l m n o p q r s b t u v w x y z c d e f g h i
> as.integer(df$letter)
 [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
[26]  1

      



Any feedback on this? As far as I can tell, everything works. It prints correctly, and the underlying data stored in the actual data.frame is intact, so I don't feel I'm putting anything at risk. I could even add a small class to my package that simply implements this accessor, which would avoid taking on the hash package as a dependency.
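A dependency-free class of that kind might look roughly like the sketch below. The class name idmap and its constructor are hypothetical, and the sketch only demonstrates the [ lookup itself; it does not assume that every factor function will accept a non-character levels attribute.

```r
# Hypothetical S3 class 'idmap': an environment keyed by the ID as a string.
idmap <- function(ids, labels) {
    e <- new.env(hash = TRUE)
    for (k in seq_along(ids)) {
        assign(as.character(ids[k]), labels[k], envir = e)
    }
    structure(list(env = e), class = 'idmap')
}

# Single-bracket accessor that returns the stored values directly.
# as.integer() also turns a factor index into its integer codes.
`[.idmap` <- function(x, i, ...) {
    env <- unclass(x)$env
    keys <- as.character(as.integer(i))
    vapply(keys, function(k) get(k, envir = env, inherits = FALSE),
           character(1), USE.NAMES = FALSE)
}

m <- idmap(c(1L, 1805L, 3704L), c('Black', 'Red', 'White'))
m[c(1L, 3704L, 1805L)]   # "Black" "White" "Red"
```

Only 30 entries would be stored for 30 levels, regardless of how large the IDs are.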

Any feedback, or pointers to anything I have overlooked, would be greatly appreciated.
