Handling categorical features when building decision tree models

I used H2O to create classification models like GBM, DRF and DL. The dataset I have contains multiple categorical columns, and if I want to use them as features to build models, do I have to manually convert them to dummy variables? I also read that GBM can handle categorical variables internally; is that correct?

+3




2 answers


Yes, H2O is one of the few machine learning libraries that doesn't require the user to preprocess or one-hot encode (aka "dummy-encode") categorical variables themselves. As long as the column type is a "factor" (aka "enum") in your dataframe, H2O knows what to do automatically.
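
For example, here is a minimal sketch of the Python workflow. It assumes a reachable H2O cluster and a hypothetical CSV with a categorical column "color", a numeric column "age", and a response "label" (the file path and column names are made up for illustration):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

df = h2o.import_file("my_data.csv")   # hypothetical file
df["color"] = df["color"].asfactor()  # make the enum/factor type explicit
df["label"] = df["label"].asfactor()  # a factor response -> classification

gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(x=["color", "age"], y="label", training_frame=df)
```

No dummy columns are created by hand here; the GBM consumes the "color" enum directly.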



In particular, H2O allows the direct use of categorical variables in tree-based methods such as Random Forest or GBM. Tree-based algorithms can use categorical data natively, and this usually results in better performance than one-hot encoding. In GLM or Deep Learning, H2O will one-hot encode the categorical features automatically under the hood, so you don't have to do any preprocessing. If you want more control, you can choose the encoding type with the categorical_encoding argument.
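
Continuing the hypothetical example above, a sketch of overriding the default with that argument (categorical_encoding is a real H2O parameter; "one_hot_explicit" is one of its accepted values):

```python
from h2o.estimators import H2ODeepLearningEstimator

# Force explicit one-hot encoding instead of the default "auto" behavior:
dl = H2ODeepLearningEstimator(categorical_encoding="one_hot_explicit")
dl.train(x=["color", "age"], y="label", training_frame=df)
```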

+4




IMHO, the ability to handle categorical variables directly in tree algorithms is a huge advantage of H2O.



If you one-hot encode a categorical variable, you have effectively taken one variable and split it into multiple variables whose values are mostly 0 (i.e., sparse). As Erin stated, this makes the trees worse. This is because trees use "information gain" at each split. Sparse features (produced by one-hot encoding) yield less information gain per split and are therefore less useful than the intact categorical feature, as the sketch below illustrates.
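
Here is a minimal numeric sketch (plain Python, not H2O code) of why this happens, using entropy-based information gain on made-up data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from partitioning `labels` by each distinct value."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

color = ["red", "red", "blue", "blue", "green", "green"]
target = [1, 1, 0, 0, 1, 0]

# Splitting on the intact categorical separates the classes in one step...
print(information_gain(color, target))    # ~0.667

# ...while a single one-hot dummy (e.g. color == "blue") is mostly zeros
# and recovers only part of the signal per split.
is_blue = [int(c == "blue") for c in color]
print(information_gain(is_blue, target))  # ~0.459
```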

0

