Comparing Distance Formulas. Part I

Match.com, Singles.com and Netflix.com and countless other sites out there rely on comparing a group of people to answers users submit on a daily basis. Some like Match.com use a long questioner while others, such as Netflix.com, uses renting habits of a user to determine if a the user might like another product. In a nuttshell, How similar two people are depends on how they behave during their visit to the application.

What I’ll cover in this first part will be a brief intro into the overall plan. What algorithms will be covered and what data set I will use.

Overall plan
I was given the task of creating a simple but effective, match.com-like, compatibility application. The user would fill out a group of questions and based on the answers the application would present the users overall compatibility to other users in the system. The application was a success and to my surprise resisted the load that 3,000 concurrent users exerted on the application. After launch, I had a lingering question in the back of my mind. How good were those results? Could I have made a better engine?

The goal will be to compare 8 distance formulas and determine which formula produces the best results, which formula produces results efficiently for a web application with 500,000 users (small compared to other data sets out there).

Algorithm’s to Cover
Distance formulas will be the key focus in this piece. I will cover the more traditional distance formulas along with a few obscure ones. The list of algorithms is:

  1. Eucledian
  2. Manhattan
  3. Mahalantos
  4. Jacaard
  5. Hammin
  6. Pearson
  7. Sorensen

Of course I will touch on the benefits of each algorithm and the disadvantages that each algorithm contains. I will also have small working Ruby and PHP code to accompany this article.

Add a Comment

Your email address will not be published. Required fields are marked *