How to deal with factors that have a lot of categories in partykit

I am using the partykit package and I get the following error message:

Error in matrix(0, nrow = mi, ncol = nl) : 
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range

      

I used the example provided in this article, which compares packages and how they handle a large number of categories.

The problem is that the split variable being used has too many categories. Inside the mob() function, a matrix with all possible partitions is created. This matrix alone has size p * (2^(p-1)-1), where p is the number of categories of the splitting variable. Depending on the available system resources (RAM, etc.), this error occurs for different values of p.
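To get a feel for how fast this grows, here is a rough sketch in base R. It assumes that one dimension of the matrix is 2^(p-1)-1 and that each cell is stored as an 8-byte double; both are assumptions made for illustration, not something read off the partykit source.

```r
# Number of categories to try (illustrative values)
p <- c(10, 20, 25, 30, 33)

# Number of binary partitions of p categories
rows <- 2^(p - 1) - 1

# Total cells in a p * (2^(p-1)-1) matrix, and approximate size in GiB
# assuming 8 bytes per cell (assumption: dense double storage)
cells <- p * rows
gib <- cells * 8 / 2^30

print(data.frame(p, rows, cells, GiB = round(gib, 2)))

# A matrix dimension must fit into R's integer range; beyond that,
# coercion yields NA, which matches the warning in the error message
too_large <- rows > .Machine$integer.max
print(too_large)
```

Memory usually runs out long before the integer limit is reached, which would explain why the error shows up at different p on different machines.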

The article proposes using the Gini criterion. I think that, given the purpose of the partykit package, the Gini criterion cannot be used here, because I do not have a classification problem with a target variable but a model specification problem.

So my question is, is there a way to find the partition for such cases, or a way to reduce the number of partitions to check?


1 answer


This trick of searching only k - 1 ordered splits rather than 2^(k-1) - 1 unordered partitions works only under certain circumstances, for example when the response can be ordered by its mean within each category. I've never looked at the underlying theory in sufficient detail, but it only works under certain assumptions, and I'm not sure whether these are spelled out well enough anywhere. You definitely need a univariate problem in the sense that only one underlying parameter (typically the mean) is optimized. Possibly, continuous differentiability of the objective function is also an issue, given the emphasis on Gini.
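As an illustration of the ordering trick in the simplest setting, here is a base R sketch with simulated data. It assumes a single numeric response and a squared-error criterion for one mean per group; this is a choice made for the example and is not what mob() or ctree() do internally. All names (f, y, rss) are hypothetical.

```r
set.seed(1)

# Simulated factor with a handful of levels and a numeric response
f <- factor(sample(letters[1:6], 200, replace = TRUE))
k <- nlevels(f)
y <- rnorm(200, mean = as.integer(f) %% 3)

# Residual sum of squares of a binary split; 'left' is a logical vector
rss <- function(left) {
  sum((y[left] - mean(y[left]))^2) + sum((y[!left] - mean(y[!left]))^2)
}

# Ordered search: sort levels by mean response, try the k - 1 cut points
lev <- names(sort(tapply(y, f, mean)))
ordered_rss <- sapply(seq_len(k - 1), function(i) rss(f %in% lev[1:i]))

# Exhaustive search: the first level always goes left (avoids mirrored
# duplicates), the other levels go left or right, right must be non-empty
others <- levels(f)[-1]
masks <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), k - 1)))
masks <- masks[rowSums(masks) < k - 1, , drop = FALSE]
exhaustive_rss <- apply(masks, 1, function(m) {
  rss(f %in% c(levels(f)[1], others[m]))
})

# Number of candidate splits in each search, and the best value found
print(c(ordered = length(ordered_rss), exhaustive = length(exhaustive_rss)))
print(c(ordered = min(ordered_rss), exhaustive = min(exhaustive_rss)))
```

In this one-parameter, squared-error case both searches reach the same minimum. With more than one parameter per group there is no single ordering of the levels to exploit.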

As mob() is probably most commonly applied in situations where you partition more than one parameter, I don't think this trick can be exploited there. Likewise, ctree() can easily be applied in situations with multivariate scores, even if the response variable is univariate (for example, to capture differences in location and scale).



I usually recommend breaking the factor with many levels down into smaller pieces. For example, if you have a factor for the ZIP code of an observation, you could instead use a factor for state / province, a numeric variable that encodes "size" (area or population), a factor that encodes rural vs. urban, and so on. This is of course additional work, but it usually also leads to more interpretable results.
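A minimal sketch of that recoding in base R. The data frames d and zip_info, their columns, and all values are hypothetical and made up for illustration; in practice the lookup table would come from an external source.

```r
# Hypothetical observations with a high-cardinality ZIP code factor
d <- data.frame(
  zip = factor(c("10001", "10002", "94105", "94110", "59001", "59002")),
  y = c(2.1, 1.9, 3.4, 3.6, 0.8, 1.1)
)

# Hypothetical lookup table describing each ZIP code (made-up values)
zip_info <- data.frame(
  zip = c("10001", "10002", "94105", "94110", "59001", "59002"),
  state = c("NY", "NY", "CA", "CA", "MT", "MT"),
  population = c(25000, 75000, 10000, 70000, 1000, 500),
  urban = c("urban", "urban", "urban", "urban", "rural", "rural")
)

# Replace the ZIP factor by a few coarser, more interpretable variables
d2 <- merge(d, zip_info, by = "zip")
d2$state <- factor(d2$state)
d2$urban <- factor(d2$urban)
d2$zip <- NULL

str(d2)

# The coarser variables can then serve as partitioning variables, e.g.
# partykit::lmtree(y ~ x | state + population + urban, data = d2)
# (x stands for a regressor that this toy data set does not contain)
```

The state factor has far fewer levels than the ZIP code, and the numeric population variable only needs ordered splits, so the number of candidate partitions stays small.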

Having said that, exploiting tricks like these, where they are available, is on our wish list for partykit. But this is not on our agenda ...
