Feature hashing in R to classify text

I'm trying to implement feature hashing in R to help me with a text classification problem, but I'm not sure I'm doing it the way it should be done. Part of my code is based on this post: Hash function to map integers to a given range? ...

My code:

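# Build a character matrix of n random words, each wlen characters long, laid out in ncol columns
random.data = function(n = 200, wlen = 40, ncol = 10){

  # One random word of n characters drawn from a-z and 0-9
  random.word = function(n){
    paste0(sample(c(letters, 0:9), n, TRUE), collapse = '')
  }
  matrix(replicate(n, random.word(wlen)), ncol = ncol)
}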

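feature_hash = function(doc, N){

  doc = as.matrix(doc)
  library(digest)

  # Last 5 hex digits of each word's MD5 digest, read as an integer, then reduced modulo N + 1 (values 0..N)
  idx = matrix(strtoi(substr(sapply(doc, digest), 28, 32), 16L) %% (N + 1), ncol = ncol(doc))
  # For each row, count the words that fall in each bucket 1..N (words with idx 0 are not counted)
  sapply(1:N, function(r) apply(idx, 1, function(v) sum(v == r)))
}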

set.seed(1)
doc = random.data(50, 16, 5)
feature_hash(doc, 3)

       [,1] [,2] [,3]
 [1,]    2    0    1
 [2,]    2    1    1
 [3,]    2    0    1
 [4,]    0    2    1
 [5,]    1    1    1
 [6,]    1    0    1
 [7,]    1    2    0
 [8,]    2    0    0
 [9,]    3    1    0
[10,]    2    1    0

      

So, I basically convert strings to integers using the last 5 hex digits of the MD5 hash returned by digest. Questions:

1 - Is there any package that can do this for me? I haven't found one.

2 - Is it a good idea to use digest as the hash function? If not, what can I do?
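
To make the bucket arithmetic concrete, here is a minimal sketch of the string-to-bucket step. It assumes the intent is N buckets numbered 1..N; the helper name to_bucket is hypothetical and not taken from any package. Two details may be worth checking: x %% (N + 1) yields values 0..N, so anything that lands on 0 is skipped by a loop over 1:N, whereas x %% N + 1 always lands in 1..N; and digest serializes its argument by default, so serialize = FALSE is needed if the hash should be computed from the characters of the string itself.

```r
library(digest)

# Hypothetical helper: map one string to a bucket in 1..N
to_bucket <- function(s, N) {
  h <- digest(s, algo = "md5", serialize = FALSE)  # 32 hex characters
  x <- strtoi(substr(h, 28, 32), 16L)              # last 5 hex digits as an integer
  x %% N + 1                                       # shift 0..(N - 1) into 1..N
}

# Compare the two modulo variants on a few integers, with N = 3
x <- 0:20
range(x %% (3 + 1))   # 0 3: bucket 0 falls outside 1:N
range(x %% 3 + 1)     # 1 3: every value lands in 1:N
```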

PS: I know I should check whether it works before posting, but my files are quite large and take a long time to process, so I think it's smarter to have someone point me in the right direction, because I'm sure the way I'm doing it is not right!

Thanks for any help!

1 answer


I don't know of any existing CRAN package for this.

However, I wrote a package for myself to do feature hashing. The source code is here: https://github.com/wush978/FeatureHashing (note that the API is different from your function).

In my case, I am using it to convert a data.frame to CSRMatrix, a customized sparse matrix defined in the package. I also implemented a helper function to convert CSRMatrix to Matrix::dgCMatrix. For text classification, I think a sparse matrix would be more appropriate.
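
As a rough illustration of why a sparse result fits here, the following sketch builds the hashed count matrix directly as a Matrix::dgCMatrix. It does not use the FeatureHashing package or its API; hash_bucket and feature_hash_sparse are hypothetical names, and the sketch assumes that the digest and Matrix packages are installed and that N buckets numbered 1..N are wanted.

```r
library(digest)
library(Matrix)

# Hypothetical helper (not part of FeatureHashing): map each token to a bucket in 1..N
hash_bucket <- function(tokens, N) {
  # serialize = FALSE hashes the raw string rather than its serialized R representation
  h <- vapply(tokens, digest, character(1), algo = "md5", serialize = FALSE,
              USE.NAMES = FALSE)
  # Last 5 hex digits as an integer, then shifted into 1..N
  strtoi(substr(h, 28, 32), 16L) %% N + 1
}

# Hypothetical helper: one row per document (row of doc), one column per hash bucket
feature_hash_sparse <- function(doc, N) {
  doc  <- as.matrix(doc)
  rows <- as.vector(row(doc))            # document index of every token
  cols <- hash_bucket(as.vector(doc), N) # bucket index of every token
  # Duplicated (i, j) pairs are summed by sparseMatrix, which yields bucket counts
  sparseMatrix(i = rows, j = cols, x = rep(1, length(rows)),
               dims = c(nrow(doc), N))
}
```

Under these assumptions, calling feature_hash_sparse(doc, 3) on the question's doc matrix would give a 10 x 3 sparse matrix in which every row sums to 5, the number of words per row, since no token is dropped. Only the non-zero counts are stored, which matters once N is large.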



If you want to give it a try, please check the test script here: https://github.com/wush978/FeatureHashing/blob/master/tests/test-conver-to-dgCMatrix.R

Note that I've only used it on Ubuntu, so I don't know whether it works on Windows or Mac. Please feel free to ask me questions about the package at https://github.com/wush978/FeatureHashing/issues
