TF-IDF: TF implementation

I am using a classification tool and experimenting with different variants of TF: two logarithmic ones (with the correction applied inside or outside the log), normalized, augmented ("padded"), and log average. They make a significant difference to my classifier's accuracy, up to 5%. What is strange is that I cannot tell in advance which variant will work best on a given dataset. Is there some work on this that I am missing, or can someone share their experience with these variants?
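For concreteness, here is a minimal sketch of the TF variants the question mentions, written as standalone functions. The function names and the `k=0.5` smoothing constant for the augmented variant are my own illustrative choices, not from any particular tool; the formulas follow the standard definitions of these weightings.

```python
import math

def tf_log_inside(count):
    # Logarithmic TF with the correction inside the log: log(1 + tf).
    return math.log(1 + count)

def tf_log_outside(count):
    # Logarithmic TF with the correction outside the log:
    # 1 + log(tf) for tf > 0, and 0 for absent terms.
    return 1 + math.log(count) if count > 0 else 0.0

def tf_augmented(count, max_count, k=0.5):
    # Augmented ("padded") TF: k + (1 - k) * tf / max_tf,
    # where max_tf is the highest raw count in the same document.
    return k + (1 - k) * count / max_count if count > 0 else 0.0

def tf_log_average(count, avg_count):
    # Log-average TF: (1 + log(tf)) / (1 + log(average tf in the document)).
    if count == 0:
        return 0.0
    return (1 + math.log(count)) / (1 + math.log(avg_count))
```

All four map the same raw count to a dampened weight; they differ mainly in how fast the weight grows and whether it is normalized against the rest of the document.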





2 answers


Basically, the increase in importance contributed by additional occurrences of a term in a document should diminish as the count grows. For example, "car" appearing twice in a document suggests the term is considerably more important than if it appeared only once. However, the difference between 20 occurrences and 19 should be much smaller.

What you are choosing among these different normalizations is how quickly the TF value saturates.
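This saturation is easy to see numerically: with a log-scaled TF, the marginal gain from one extra occurrence shrinks as the raw count grows. A small illustration, using the common 1 + log(tf) form:

```python
import math

def tf_log(count):
    # Log-scaled TF: 1 + log(tf) for tf > 0, 0 otherwise.
    return 1 + math.log(count) if count > 0 else 0.0

# Marginal gain of going from 1 to 2 occurrences vs. from 19 to 20.
gain_1_to_2 = tf_log(2) - tf_log(1)      # about 0.69
gain_19_to_20 = tf_log(20) - tf_log(19)  # about 0.05
```

The second occurrence adds more than ten times the weight that the twentieth does, which is exactly the behavior described above.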



You could try matching the weighting scheme to your data using corpus statistics, such as the average TF per document.
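As a sketch of that idea, here is a small stdlib-only helper that computes the average within-document term frequency over a tokenized corpus. The function name and the toy corpus are illustrative only:

```python
from collections import Counter

def avg_tf_per_doc(docs):
    # docs: list of documents, each a list of tokens.
    # Returns the mean over documents of (total tokens / distinct terms),
    # i.e. the average raw TF within a document.
    per_doc = []
    for tokens in docs:
        counts = Counter(tokens)
        per_doc.append(sum(counts.values()) / len(counts))
    return sum(per_doc) / len(per_doc)

docs = [["car", "car", "fast"], ["red", "car"]]
# First document: 3 tokens / 2 terms = 1.5; second: 2 / 2 = 1.0.
```

A corpus where this number is close to 1 (most terms occur once per document) leaves little room for the TF variants to differ, whereas high average counts make the choice of damping matter more.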





Indeed, it is very difficult to tell in advance which weighting scheme will work best. In general, there is no free lunch: the algorithm that works best on one dataset may be terrible on another. Moreover, these are not radically different options. TF-IDF embodies one specific intuition about classification and retrieval, and all of its variants are similar. The only way to tell is to experiment.



P.S. A note on methodology: when you say the difference is significant, have you done any statistical significance testing with cross-validation or random resampling? The differences you see may simply be due to chance.
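One simple way to check this, sketched below with the standard library only, is a paired permutation test over per-fold accuracies from the same cross-validation split. The function name and the procedure's details (number of permutations, two-sided statistic) are my own illustrative choices:

```python
import random

def paired_permutation_test(acc_a, acc_b, n_perm=10000, seed=0):
    # acc_a, acc_b: accuracies of two TF variants on the same CV folds.
    # Randomly flips the sign of each per-fold difference and counts how
    # often the permuted mean difference is at least as extreme as the
    # observed one; returns a two-sided p-value estimate.
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) / len(flipped) >= observed:
            hits += 1
    return hits / n_perm
```

If the resulting p-value is large, the observed accuracy gap between two TF variants is well within what fold-to-fold noise alone could produce.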









