Comparing Distance Formulas. Part I
Match.com, Singles.com and Netflix.com and countless other sites out there rely on comparing a group of people to answers users submit on a daily basis. Some like Match.com use a long questioner while others, such as Netflix.com, uses renting habits of a user to determine if a the user might like another product. In a nuttshell, How similar two people are depends on how they behave during their visit to the application.
What I’ll cover in this first part will be a brief intro into the overall plan. What algorithms will be covered and what data set I will use.
Overall plan
I was given the task of creating a simple but effective, match.com-like, compatibility application. The user would fill out a group of questions and based on the answers the application would present the users overall compatibility to other users in the system. The application was a success and to my surprise resisted the load that 3,000 concurrent users exerted on the application. After launch, I had a lingering question in the back of my mind. How good were those results? Could I have made a better engine?
The goal will be to compare 8 distance formulas and determine which formula produces the best results, which formula produces results efficiently for a web application with 500,000 users (small compared to other data sets out there).
Algorithm’s to Cover
Distance formulas will be the key focus in this piece. I will cover the more traditional distance formulas along with a few obscure ones. The list of algorithms is:
- Eucledian
- Manhattan
- Mahalantos
- Jacaard
- Hammin
- Pearson
- Sorensen
Of course I will touch on the benefits of each algorithm and the disadvantages that each algorithm contains. I will also have small working Ruby and PHP code to accompany this article.