February 11, 2008

Comparing Distance Formulas. Part II

By Armando Padilla Technology 0 Comments

Before diving into the algorithms I would like to step back and take a look at the
data we will be analyzing with our examples.

Where did the data come from?
The data originated from the Netflix.com data mining challenge. The challenge was created by Netflix 2-3 years ago to allowed data mining junkies the ability to help the movie renting population find movies they might enjoy based on their current renting selection and similar users with similar renting habits. Of course every junkie needs an incentive so Netflix.com is offered $1,000,000 to anyone that could increase the recommendation engines success rate by 10%. The challenge
is ongoing and ends 2010. More information can be found here

The Data
Focusing on the data, the files contain a list of movies, roughly 17,000, along with 547 user ratings per movie. The format that Netflix has placed the data
into is:

1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
823519,3,2004-05-03
893988,3,2005-11-17
124105,4,2004-08-05
1248029,3,2004-04-22

Where 1 is the movie id. The movie id is a unique number which represents a movie title in their system. For example, the movie id shown above, 1, can be associated to the movie, The Abyss. Since the movie id is unique to one and only one movie we can safely assume that The Abyss will always be 1.

The following lines contain user rating information. Taking one line as an example, we can break up the data by commos. The first piece of the line is the unique id of a user, in this case a person using Netflix to rent a movie. Again, we can safely
assume that one id is associated to one and only one user in the system. As an
example. The id, 30878, is associated to the user, Armando Padilla.

The next item on the line is the rating that the user has given to the movie. In
the data that Netflix.com has used the ratings can be anywhere from 1 to 5, where 1 s a very low rating, indicating that the user hated the movie while 5 is a very high rating indicating that the user has extremely enjoyed watching the movie. Again with an example, the user 30878 has rated The Abyss with a 4, indicating that he extremely enjoyed it.

Finally the last item on the line is the date the user decided to rate the movie.
The format the date is in is, Year-month-dayofmonth. The final reading of a line
will be, “The user, Armando Padilla, with the id of 30878, rated the movie The Abyss with the id of 1 a 4 on December 26th 2006.

How are we using the data for our examples?
As mentioned before there are a little over 17,000 movies in the training data. The specific number of movies rated is 17,770 according the to “readme” file the tranining data is accompined with and the total number of users with movie ratings is 480,189.

For our example we will use all 17,770 movies as attributes for all 480,189 users in the system.

{ Online Notes }

Comparing Distance Formulas. Part II

About Author

armandop@dilla

Add a Comment

Related Posts

About Author

armandop@dilla

Add a Comment