Evaluating a Collaborative Filtering Algorithm Using a Test Set

In item-based collaborative filtering, we use the ratings of users similar to a given user to create recommendations. Research often suggests evaluating an algorithm with a held-out benchmark, for example 20% of the data for testing and 80% for training. However, what if all the ratings of a certain item end up in the holdout? Our training data will no longer contain this item, and it will never be recommended.

E.g. 5 users each rate 10 movies, one of which is "Titanic". We randomly hold out 20% of the data per user = 2 movies per user. What if "Titanic" lands in the test set for every user? It will never be recommended.
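For concreteness, here is a minimal Python sketch of that per-user random split; the toy data and variable names are hypothetical, but it shows how an item can disappear from the training set entirely.

```python
import random

# Toy data: 5 users who each rated the same 10 movies (hypothetical).
movies = ["Titanic", "Alien", "Amelie", "Brazil", "Casablanca",
          "Dune", "Fargo", "Gattaca", "Heat", "Ikiru"]
ratings = {f"user_{u}": set(movies) for u in range(5)}

train, test = {}, {}
for user, seen in ratings.items():
    held_out = set(random.sample(sorted(seen), 2))  # 20% of 10 ratings
    test[user] = held_out
    train[user] = seen - held_out

# If "Titanic" happens to be held out for every user, it disappears
# from the training data and can never be recommended.
items_in_training = set().union(*train.values())
print("Titanic in training data:", "Titanic" in items_in_training)
```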

2 answers


The evaluation methodology depends on the use case and the type of data. In some situations an evaluation with a random 80/20 split is not enough, e.g. when time plays an important role, as in session-based recommendations.

Assuming your use case can be evaluated this way, try not to base the score on just one random train/test split; go for N-fold cross-validation instead, in this case five-fold cross-validation with a 20% holdout per fold. The evaluation result is then aggregated over all folds. Going further, this whole experiment can be repeated several times to obtain the final result.
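As a rough illustration (not from the original answer), here is a minimal sketch of five-fold cross-validation over rating triples; evaluate() is a hypothetical placeholder for training and scoring your recommender on one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical ratings as (user_id, item_id, rating) triples.
ratings = np.array([
    (0, 1, 5.0), (0, 2, 3.0), (1, 1, 4.0), (1, 3, 2.0),
    (2, 2, 5.0), (2, 3, 4.0), (3, 1, 1.0), (4, 4, 5.0),
])

def evaluate(train_rows, test_rows):
    """Placeholder: train your recommender on train_rows and
    return its metric (e.g. MAP@k) on test_rows."""
    return 0.0  # assumption: replace with a real evaluation

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = [evaluate(ratings[tr], ratings[te]) for tr, te in kf.split(ratings)]

# Aggregate over all folds; repeating the whole experiment with different
# seeds and averaging again gives a more stable final number.
print("mean score over folds:", np.mean(scores))
```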

Check out these two projects:



Both can be helpful to you, at least in finding the right evaluation methodology.


The first answer is that this effect will be negligible if the performance metric is averaged correctly. For this I always use MAP@k, or mean average precision at k. It only measures the precision of your recommendations, but it does so in a way that the averages usually remain valid when some items are missing, unless you have very little data.
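A minimal sketch of MAP@k as described here, assuming hypothetical recommendation lists and holdout sets per user; users with no test items are skipped rather than counted as zero, which is what keeps the average valid when items are missing.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: precision at each rank where a hit occurs,
    averaged over min(len(relevant), k)."""
    if not relevant:
        return None  # user has no test items; skip instead of counting as 0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)

def map_at_k(recs_by_user, test_by_user, k=10):
    """MAP@k averaged only over users with at least one test item,
    so sparse holdouts do not silently drag the metric down."""
    aps = [average_precision_at_k(recs_by_user.get(u, []), rel, k)
           for u, rel in test_by_user.items() if rel]
    return sum(aps) / len(aps) if aps else 0.0

# Toy usage with hypothetical recommendation lists and holdout sets.
recs = {"alice": ["A", "B", "C"], "bob": ["C", "D"]}
test = {"alice": {"B"}, "bob": {"A", "C"}}
print(map_at_k(recs, test, k=3))
```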

As Bartlomiej Twardowski says, you can also do something like a k-fold test that evaluates across different splits and averages them. This is less prone to problems with small datasets, and you can still use MAP@k as your metric, since k-fold only addresses the partitioning problem.

We use MAP@k and split the data by date, so that the older 80% of users go into the training split and the newest 20% of users go into the probe/test split. This better mimics how a real-world recommender works, because new users often appear after your model has been built.
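A sketch of that kind of date-based split, assuming a hypothetical interaction log with a first-seen date per user; the oldest 80% of users go to training and the newest 20% to the probe/test set.

```python
import pandas as pd

# Hypothetical interaction log with a first-seen date per user.
events = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3", "u4", "u5"],
    "item": ["A",  "B",  "A",  "C",  "D",  "B",  "E"],
    "first_seen": pd.to_datetime(
        ["2023-01-05", "2023-01-05", "2023-02-10", "2023-03-01",
         "2023-03-01", "2023-04-20", "2023-05-15"]),
})

# Order users by when they first appeared, then put the oldest 80%
# into training and the newest 20% into the probe/test split.
first_seen = events.groupby("user")["first_seen"].min().sort_values()
cutoff = int(len(first_seen) * 0.8)
train_users = set(first_seen.index[:cutoff])
test_users = set(first_seen.index[cutoff:])

train = events[events["user"].isin(train_users)]
test = events[events["user"].isin(test_users)]
print(len(train_users), "train users,", len(test_users), "test users")
```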



By the way, don't forget that the "coverage" of recommendations is also associated with lift in conversions, so it is important to keep an eye on. As a cheap and not very rigorous proxy, we look at how many people in the holdout set receive recommendations at all. If you are comparing one setting to another, the proxy relates to how many people can get recommendations, but in both cases you must use exactly the same split.
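One possible way to compute that cheap coverage proxy, with hypothetical recommendation lists and a hypothetical holdout user list; both settings must be measured on exactly the same split.

```python
def coverage(recs_by_user, holdout_users):
    """Fraction of holdout users who receive at least one recommendation.
    A cheap proxy for coverage, not a substitute for measuring lift online."""
    served = sum(1 for u in holdout_users if recs_by_user.get(u))
    return served / len(holdout_users) if holdout_users else 0.0

# Hypothetical holdout: compare two settings on exactly the same split.
holdout = ["u1", "u2", "u3", "u4"]
setting_a = {"u1": ["A", "B"], "u2": ["C"], "u3": []}
setting_b = {"u1": ["A"], "u2": ["C"], "u3": ["D"], "u4": ["E"]}
print(coverage(setting_a, holdout), coverage(setting_b, holdout))
```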

BTW2: note that if you are using ratings from users similar to this user, that is user-based collaborative filtering. When you find items that are similar to some example items, from the point of view of the people who liked them, that is item-based. The difference is whether the item or the user is the query.

One final plug for a new algorithm we are using. In order to use both item and user properties (as well as item-set properties such as a shopping cart), we use CCO (Correlated Cross-Occurrence), which can take all user actions as input. In a blog post about this, we found a 20%+ MAP@k increase on a dataset collected from a movie-watching website when we used user "likes" as well as "dislikes" to predict likes. The algorithm is implemented in Apache Mahout, and a complete turnkey implementation is here.
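For intuition only, here is a simplified numpy sketch of the cross-occurrence idea, not Mahout's actual implementation, which additionally applies a log-likelihood-ratio test to keep only statistically significant co-occurrences; the matrices and data below are hypothetical.

```python
import numpy as np

# Toy user-by-item indicator matrices for two actions (hypothetical data):
# P = primary action we want to predict ("like"), S = secondary ("dislike").
P = np.array([[1, 0, 1, 0],    # rows: users, columns: items
              [1, 1, 0, 0],
              [0, 1, 1, 1]])
S = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0]])

# Co-occurrence of likes with likes, and cross-occurrence of likes with
# dislikes. CCO in Mahout filters these counts with an LLR test so that
# only correlated (anomalously frequent) co-occurrences are kept.
like_like = P.T @ P        # items x items
like_dislike = P.T @ S     # items x items: dislike behaviour predicting likes

# Score items for one user from their own like and dislike history.
user_likes = P[0]
user_dislikes = S[0]
scores = like_like @ user_likes + like_dislike @ user_dislikes
print("item scores for user 0:", scores)
```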
