Recently I've started reading "Mahout in Action" - a book about the Apache Mahout machine learning library. A few years ago I was involved in chess programming and came across machine learning algorithms for tuning chess opening books, so I thought it would be interesting to see what modern machine learning libraries have to offer.
The first part of the book is dedicated to making recommendations. I'm sure everyone has seen suggestions offered by online stores - "users who chose that product also liked...". Implementing this feature can be made much easier with tools like Mahout. Let's see.
For our trivial proof of concept let's assume we have a travel site which collects the ratings that users give to the countries they have visited. Based on these ratings the site suggests other countries that the user might enjoy visiting.
For the sake of simplicity we make a few assumptions:
- We have only 6 users so far - 3 women and 3 men.
- We have only 8 countries rated so far - all of them in the Americas or Europe.
- It so happens that the women seem to have a strong preference for European countries (higher ratings), while the men apparently enjoy American countries.
The assumptions above serve only one purpose - to make the problem and answers easy to comprehend for us at first glance. They are absolutely not required for Mahout to work properly.
User ID | User name | Country ID | Country name | Rating (1.0-5.0) |
1 | Albert | 101 | Albania | 2.0 |
1 | Albert | 102 | Costa Rica | 4.0 |
1 | Albert | 105 | France | 2.5 |
1 | Albert | 106 | Mexico | 3.5 |
2 | Caroline | 102 | Costa Rica | 1.5 |
2 | Caroline | 103 | Denmark | 4.5 |
2 | Caroline | 105 | France | 4.0 |
2 | Caroline | 107 | Poland | 5.0 |
2 | Caroline | 108 | USA | 2.5 |
3 | Jacob | 101 | Albania | 2.5 |
3 | Jacob | 103 | Denmark | 2.0 |
3 | Jacob | 104 | Guatemala | 5.0 |
3 | Jacob | 106 | Mexico | 4.5 |
3 | Jacob | 108 | USA | 4.5 |
4 | Joanne | 101 | Albania | 5.0 |
4 | Joanne | 102 | Costa Rica | 2.0 |
4 | Joanne | 103 | Denmark | 4.5 |
4 | Joanne | 107 | Poland | 4.5 |
5 | Paul | 101 | Albania | 2.5 |
5 | Paul | 102 | Costa Rica | 4.0 |
5 | Paul | 105 | France | 3.0 |
5 | Paul | 106 | Mexico | 4.0 |
5 | Paul | 107 | Poland | 2.5 |
5 | Paul | 108 | USA | 5.0 |
6 | Monica | 101 | Albania | 3.5 |
6 | Monica | 104 | Guatemala | 1.0 |
6 | Monica | 105 | France | 4.0 |
6 | Monica | 107 | Poland | 4.0 |
6 | Monica | 108 | USA | 3.0 |
Given the data above let's try to generate recommendations for Caroline and Jacob.
First we need the input data above represented in a format that Mahout understands. Our format of choice is <User ID>,<Item ID>,<Preference>:
1,101,2.0
1,102,4.0
1,105,2.5
1,106,3.5
2,102,1.5
2,103,4.5
2,105,4.0
2,107,5.0
2,108,2.5
3,101,2.5
3,103,2.0
3,104,5.0
3,106,4.5
3,108,4.5
4,101,5.0
4,102,2.0
4,103,4.5
4,107,4.5
5,101,2.5
5,102,4.0
5,105,3.0
5,106,4.0
5,107,2.5
5,108,5.0
6,101,3.5
6,104,1.0
6,105,4.0
6,107,4.0
6,108,3.0
A complete Java program using Mahout that would provide a recommendation for a given user looks as follows:
package pl.marekstrejczek.mahout;
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
public class TravelSiteRecommender {
    public static void main(final String[] args) throws IOException, TasteException {
        // Load the User ID,Item ID,Preference triples from the data file
        DataModel model = new FileDataModel(new File("travel.dat"));
        // Similarity metric between users and the neighborhood built on top of it
        UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Ask for the top 1 recommendation for user 2 (Caroline) and user 3 (Jacob)
        List<RecommendedItem> recommendationForCaroline = recommender.recommend(2, 1);
        List<RecommendedItem> recommendationForJacob = recommender.recommend(3, 1);
        System.out.println("Recommendation for Caroline: " + recommendationForCaroline);
        System.out.println("Recommendation for Jacob: " + recommendationForJacob);
    }
}
The output of this program is:
Recommendation for Caroline: [RecommendedItem[item:101, value:4.3082685]]
Recommendation for Jacob: [RecommendedItem[item:102, value:4.0]]
The simple program above recommends Albania to Caroline and Costa Rica to Jacob. Intuitively it makes sense:
- Other users who like the same countries as Caroline (Joanne, Monica) also like Albania.
- Therefore there's a good chance that Caroline would also enjoy Albania.
Similar reasoning applies to the recommendation for Jacob.
The recommender expects that Caroline would rate Albania at 4.3 and Jacob would rate Costa Rica at 4.0.
The program is able to process input data and give a reasonable recommendation - quite an impressive outcome from just a few lines of code. Well done Mahout!
Let's see what the code actually does:
DataModel model = new FileDataModel(new File("travel.dat"));
Loads input data stored as User ID,Item ID,Preference triples into memory. It's possible to customize the way input data is loaded if our set of preferences is not represented in this format. Input data can also be taken from a database.
UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
The recommender we use is a user-based one. It doesn't pay any attention to the attributes of items - it tries to find out what a user might like based on that user's preferences so far and on what other users with similar preferences also liked. UserSimilarity is an abstraction for the concept of "similarity" - there are many possible metrics and Mahout comes with ready implementations. EuclideanDistanceSimilarity defines the similarity between two users based on the Euclidean distance between their locations in n-dimensional space (where n is the number of items). The coordinates of each user's location are that user's preference values. Other UserSimilarity implementations include PearsonCorrelationSimilarity, SpearmanCorrelationSimilarity and TanimotoCoefficientSimilarity. All UserSimilarity implementations represent similarity as a number in the range [-1, 1] (from completely different to having identical preferences). Choosing the best implementation for a given problem is not easy and always involves some trial and error.
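To make the idea of distance-based similarity more concrete, here is a minimal plain-Java sketch that does not use Mahout at all. The class name, the method name and the exact formula 1 / (1 + d / sqrt(n)), where d is the Euclidean distance over the n countries both users rated, are my assumptions for illustration. They are not taken from Mahout's documentation, so the library's own implementation may differ in details.

```java
import java.util.Map;

/** Plain-Java sketch of a Euclidean-distance-based user similarity (not Mahout's code). */
class EuclideanSimilaritySketch {

    /**
     * Similarity over co-rated items only: 1 / (1 + d / sqrt(n)).
     * The normalization by sqrt(n) is an assumption made for this sketch.
     */
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sumOfSquares = 0.0;
        int common = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) { // only items rated by both users count
                double diff = entry.getValue() - other;
                sumOfSquares += diff * diff;
                common++;
            }
        }
        if (common == 0) {
            return Double.NaN; // no shared items, similarity is undefined
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares) / Math.sqrt(common));
    }

    public static void main(String[] args) {
        // Ratings copied from the table above (Map.of needs Java 9 or newer)
        Map<Long, Double> caroline = Map.of(102L, 1.5, 103L, 4.5, 105L, 4.0, 107L, 5.0, 108L, 2.5);
        Map<Long, Double> joanne = Map.of(101L, 5.0, 102L, 2.0, 103L, 4.5, 107L, 4.5);
        System.out.println("Caroline vs Joanne: " + similarity(caroline, joanne));
    }
}
```

For Caroline and Joanne, who share three rated countries (Costa Rica, Denmark and Poland), this sketch gives a similarity of roughly 0.71.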
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
UserNeighborhood is another abstraction present in Mahout. It determines which users should be considered when providing a recommendation. NearestNUserNeighborhood is an implementation which takes the N most similar users (where similarity is determined by the UserSimilarity described above). We take the 2 most similar users in this example - this is one of the parameters that can be tuned for best accuracy. Another UserNeighborhood implementation is ThresholdUserNeighborhood - the number of similar users is not fixed in this case. The threshold parameter tells how similar a user needs to be in order to be included in the process of looking for a recommendation. Any number of users that are similar enough is used for arriving at the recommendation.
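The difference between the two neighborhood strategies can be sketched in a few lines of plain Java. This is not Mahout code: the class and method names are made up for illustration, and the similarity values for Caroline's fellow users are illustrative numbers rather than values read from the library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java sketch of the two neighborhood strategies (hypothetical names, not Mahout's API). */
class NeighborhoodSketch {

    /** Fixed-size neighborhood: the n users with the highest similarity. */
    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        return similarities.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Variable-size neighborhood: every user at least as similar as the threshold. */
    static List<Long> aboveThreshold(Map<Long, Double> similarities, double threshold) {
        return similarities.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative similarities of the other users to Caroline (user ID -> similarity)
        Map<Long, Double> toCaroline = new LinkedHashMap<>();
        toCaroline.put(1L, 0.327); // Albert
        toCaroline.put(3L, 0.306); // Jacob
        toCaroline.put(4L, 0.710); // Joanne
        toCaroline.put(5L, 0.310); // Paul
        toCaroline.put(6L, 0.608); // Monica

        System.out.println("Nearest 2:       " + nearestN(toCaroline, 2));
        System.out.println("Threshold 0.32:  " + aboveThreshold(toCaroline, 0.32));
    }
}
```

With these numbers the fixed-size variant picks Joanne and Monica, while a threshold of 0.32 additionally lets Albert in. The neighborhood grows or shrinks with the threshold instead of always containing exactly N users.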
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
Finally, a Recommender is the actual engine which provides the results. This Recommender implementation is user-based. Another approach is to use item-based recommenders, which consider how similar items are to each other rather than how users are similar to other users. There are at least a few other approaches, but this post is too short to even mention them.
List<RecommendedItem> recommendationForCaroline = recommender.recommend(2, 1);
Here we ask the Recommender for the top 1 recommendation for the user with ID 2 (Caroline).
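The estimated rating that comes back with a recommendation does not have to be a black box. A common way to compute it in user-based recommenders is a similarity-weighted average of the neighbors' ratings, and the sketch below assumes exactly that. It is plain Java with hypothetical names, not Mahout's implementation, and the two similarity values (about 0.71 for Joanne and 0.61 for Monica) are assumed inputs.

```java
/** Plain-Java sketch: estimated rating as a similarity-weighted average (assumed approach). */
class WeightedEstimateSketch {

    /** Weighted average of the neighbors' ratings, using their similarities as weights. */
    static double estimate(double[] similarities, double[] ratings) {
        if (similarities.length != ratings.length) {
            throw new IllegalArgumentException("one similarity per rating is required");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            weightedSum += similarities[i] * ratings[i];
            totalWeight += similarities[i];
        }
        // Without any neighbor who rated the item there is nothing to estimate
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Caroline's two neighbors and their ratings for Albania (item 101)
        double[] similarities = {0.7101, 0.6077}; // Joanne, Monica (assumed values)
        double[] ratings = {5.0, 3.5};            // from the ratings table
        System.out.println("Estimated rating: " + estimate(similarities, ratings));
    }
}
```

Under these assumptions the estimate comes out at about 4.31, which is in line with the 4.308 that the program printed for Caroline and Albania.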
Let's see what changes when we use a different UserSimilarity and UserNeighborhood implementation:
package pl.marekstrejczek.mahout;
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
public class TravelSiteRecommender {
    public static void main(final String[] args) throws IOException, TasteException {
        DataModel model = new FileDataModel(new File("travel.dat"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // changed
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.7, similarity, model); // changed
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendationForCaroline = recommender.recommend(2, 1);
        List<RecommendedItem> recommendationForJacob = recommender.recommend(3, 1);
        System.out.println("Recommendation for Caroline: " + recommendationForCaroline);
        System.out.println("Recommendation for Jacob: " + recommendationForJacob);
    }
}
The output now is:
Recommendation for Caroline: [RecommendedItem[item:101, value:4.2789083]]
Recommendation for Jacob: [RecommendedItem[item:107, value:3.542936]]
So the recommendation for Caroline didn't change (even the expected preference value is almost the same), but for Jacob this time the suggestion is to visit Poland (although the expected rating is quite low - only 3.54 out of 5). This recommendation looks significantly worse than the previous one, which shows how important it is to choose appropriate implementations and tune parameters for a given problem.
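To get a feeling for why the Pearson-based setup behaves differently, here is a plain-Java sketch of the textbook Pearson correlation computed over the countries that both users rated. It does not call Mahout, the names are hypothetical, and Mahout's PearsonCorrelationSimilarity may differ in details, so treat the numbers as an illustration only.

```java
import java.util.Map;

/** Plain-Java sketch of Pearson correlation over co-rated items (not Mahout's code). */
class PearsonSketch {

    static double correlation(Map<Long, Double> a, Map<Long, Double> b) {
        // First pass: means over the items rated by both users
        double sumA = 0.0;
        double sumB = 0.0;
        int common = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                sumA += entry.getValue();
                sumB += other;
                common++;
            }
        }
        if (common < 2) {
            return Double.NaN; // correlation needs at least two shared items
        }
        double meanA = sumA / common;
        double meanB = sumB / common;

        // Second pass: covariance and variances of the centered ratings
        double covariance = 0.0;
        double varianceA = 0.0;
        double varianceB = 0.0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                double da = entry.getValue() - meanA;
                double db = other - meanB;
                covariance += da * db;
                varianceA += da * da;
                varianceB += db * db;
            }
        }
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        // Ratings copied from the table above (Map.of needs Java 9 or newer)
        Map<Long, Double> jacob = Map.of(101L, 2.5, 103L, 2.0, 104L, 5.0, 106L, 4.5, 108L, 4.5);
        Map<Long, Double> joanne = Map.of(101L, 5.0, 102L, 2.0, 103L, 4.5, 107L, 4.5);
        Map<Long, Double> paul = Map.of(101L, 2.5, 102L, 4.0, 105L, 3.0, 106L, 4.0, 107L, 2.5, 108L, 5.0);
        System.out.println("Jacob vs Joanne: " + correlation(jacob, joanne));
        System.out.println("Jacob vs Paul:   " + correlation(jacob, paul));
    }
}
```

In this sketch Jacob and Joanne come out with a correlation of 1.0, simply because they share only two rated countries (Albania and Denmark) and both rated Albania higher than Denmark. Two points always lie on a straight line. If Mahout's implementation behaves similarly, Joanne would pass the 0.7 threshold and end up in Jacob's neighborhood, which could explain why Poland shows up in his recommendation.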
This was just a brief introduction to recommendation capabilities of Mahout. For real production solutions one would certainly need to consider at least:
- which implementations and what parameter values to use for the most accurate results. Fortunately Mahout comes with facilities that help evaluate the quality of recommendations.
- how to enrich standard algorithms with domain-specific knowledge for better accuracy.
- performance and memory consumption of the solution.
For non-trivial data sets performance may become an issue - especially if recommendations are needed in real time, within a split second.
Even batch processing may turn out to be too slow - a solution can be to distribute computations among many nodes using Apache Hadoop. Mahout provides Job implementations that allow distributing computations using standard algorithms, like the ones described above. However, if data sets are too large to fit into a single machine's memory then a different approach is needed - redesigning algorithms to fit the MapReduce model and take full advantage of Hadoop facilities.
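Coming back to the first point on the list above: one simple way to compare implementations and parameter values is to hide some known ratings, let the recommender estimate them and measure how far off the estimates are on average. The sketch below shows only that last step in plain Java. It does not use Mahout's own evaluation classes, the names are hypothetical and the numbers are made up for illustration.

```java
/** Plain-Java sketch of an "average absolute difference" score (hypothetical names and data). */
class EvaluationSketch {

    /** Mean of |actual - estimated| over all held-out ratings; lower is better. */
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        if (actual.length == 0 || actual.length != estimated.length) {
            throw new IllegalArgumentException("need the same, non-zero number of values");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - estimated[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        // Made-up held-out ratings and the estimates some recommender produced for them
        double[] actual = {4.0, 2.5, 5.0};
        double[] estimated = {3.5, 3.0, 4.0};
        System.out.println("Average absolute difference: "
                + averageAbsoluteDifference(actual, estimated));
    }
}
```

A score of 0 would mean perfect estimates. Here the estimates are off by about 0.67 of a rating point on average, and running the same measurement for different similarity and neighborhood choices gives a basis for picking one.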
If you want to run the program then:
- save the Maven pom.xml file included below.
- save the TravelSiteRecommender class as src/main/java/pl/marekstrejczek/mahout/TravelSiteRecommender.java.
- save the preference data (comma-separated triples) as travel.dat in the directory you run Maven from.
- use Maven to run the code: mvn clean package exec:java
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>pl.marekstrejczek</groupId>
    <artifactId>mahout-exercises</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Mahout Exercises</name>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.2.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>java</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <mainClass>pl.marekstrejczek.mahout.TravelSiteRecommender</mainClass>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>0.7</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.5</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.0.11</version>
        </dependency>
    </dependencies>
</project>