Like-minded users and Dynamic Links
Intro
You begin to browse the online catalog and realize, “hey, what’s-his-face also bought this toy!”. If you’re like me, most of the time you’ll skim through an online retailer such as Amazon.com, starting at your initial search and then following all the recommended books, toys, gadgets, and gizmos that other users shopped for.
What will this short piece focus on?
Abstracting the idea behind a typical recommendation system and applying it to recommendations of links. “Huh?” Yeah, I said that too when I picked up this paper, From User Access Patterns to Dynamic Hypertext Linking, written back in 1996. The paper focuses on a method to recommend links to users based on their session information. But let’s start from the beginning.
What is a recommendation system and what’s it used for?
In the opening paragraph I put together a short example. You, or any user, visit Amazon.com and select a product; the product page comes up and a section is displayed under the product’s image. The section is usually titled “Users who bought this item also bought…”. This section is the “recommendation”, since the items are ones Amazon.com believes you will express interest in, based on products you previously viewed and are currently viewing. See, not so hard.
Amazon.com uses a neat method to recommend products. No, they can’t simply choose something at random and recommend it to you; that would serve no purpose. Amazon.com and other retail sites use recommendation algorithms, tools which allow these sites to intelligently recommend items to you. How is this done? In a nutshell: when you initially visit the site, Amazon.com begins to track what you’re looking at. Based on what you’ve already seen, a profile is created for you.
The profile normally contains a list of items you have clicked on (a click here means you’re interested in the item), and your list is compared with all the other profiles Amazon has collected. When a set of users with somewhat similar profiles is found, we look at what items those users have visited and choose an item from their lists that you have yet to visit. The idea behind it: “like minds think alike”, and if like minds think alike, like minds will buy alike too!
Back to the paper
So how is this related in any way to a Dynamic Link Generator? Easy: both are the same. The overall idea behind the paper is, “like minds browse alike”. In other words, if one user visits sections of a web site and another user visits somewhat similar pages, we can predict what the second user will view next based on this data. And if we can predict what the user will visit, we can display a link that might persuade the user to visit, and hopefully purchase, a product. Let’s take a simple example.
A user, Armando, goes to Amazon.com and visits the “Sports” category, starts to look around the “Sports” section, and then decides, “hey, this blows, let’s check out the electronics section”. Armando clicks on the “Electronics” link and then decides to go into the “Pet Goods” section. He buys a toy for Snoopy and then remembers he needs to buy something electronic to balance things out. He heads back to the electronics section.
The session’s web log for Armando would look something like this:
- GET /www.somestorehere.com/
- GET /www.somestorehere.com/electronics
- GET /www.somestorehere.com/petgoods
- GET /www.somestorehere.com/electronics
Given the logs above, we convert the log into a vector. How do we do this?
From Logs to Vectors
I hope you’re not confused yet. If you’re not confused, great! It’s about to get much easier. The authors propose that we use an ID system to track which pages a user visits. Each page on the site will have a unique ID. Again, another example.
Take a look at the table below and pay close attention to the ID each page has. Notice that each page’s ID is unique.
| Page | Page ID |
| --- | --- |
| index.php | 100 |
| electronics/index.php | 200 |
| petgoods/index.php | 300 |
We now have the IDs for each page. It’s time to create the vector. The vector for Armando will look like this:
<(100, x), (200, x), (300, x)>
A quick note: the paper does not focus on the order in which pages were accessed, so I won’t pay any attention to it either.
As you can see, the vector contains all the pages the user visited, with duplicates removed. You can also tell that each page comes with an x. You’re probably wondering what the x is all about.
What the x is all about (weights as interest)
The “x” is a way to tell how important the page is. How do the authors rate importance? By the number of times the user visited the page during the session. Can this be something else? Sure, we can use any method we believe rates a user’s interest in a page. Some people use the time spent on the page, while others literally ask users to click a 1–5 rating scale while on the page. But let’s stick to what the authors propose: page visits during the session.
Going back to the example, we look at Armando’s logs again and count the number of times he visited each section. We notice that he visited “index.php” once, “electronics/index.php” twice, and “petgoods/index.php” once. We take this data and place the number of times each page was accessed into the “x” value for its ID. Our final vector looks something like this:
<(100, 1), (200, 2), (300, 1)>
What do we do with this? Nothing! Trash it and call it a day. No, no, come back. The vector is used to group like-minded users based on those page IDs.
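The whole preprocessing step can be sketched in a few lines of Python (a sketch of my own, not code from the paper; the page-to-ID table is the one from the example above):

```python
# Turn raw session log lines into a weighted vector of
# (page-id, visit-count) pairs, as in the Armando example.
from collections import Counter

# Hypothetical page-to-ID table, mirroring the table above.
PAGE_IDS = {
    "/www.somestorehere.com/": 100,
    "/www.somestorehere.com/electronics": 200,
    "/www.somestorehere.com/petgoods": 300,
}

def session_to_vector(log_lines):
    """Count visits per page and map each page to its unique ID."""
    counts = Counter()
    for line in log_lines:
        path = line.split()[-1]          # "GET /path" -> "/path"
        if path in PAGE_IDS:
            counts[PAGE_IDS[path]] += 1  # duplicate visits become weights
    return sorted(counts.items())

armando_log = [
    "GET /www.somestorehere.com/",
    "GET /www.somestorehere.com/electronics",
    "GET /www.somestorehere.com/petgoods",
    "GET /www.somestorehere.com/electronics",
]
print(session_to_vector(armando_log))  # [(100, 1), (200, 2), (300, 1)]
```

Running it on Armando’s log reproduces the <(100, 1), (200, 2), (300, 1)> vector from above.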
Like-Minds Unite!
We now get into the meat and potatoes of the paper. The previous “stuff” was just that, “stuff”: it covered how to clean and format the data, or what the paper calls “preprocessing”. In this section we introduce what the paper uses to cluster the data, the “Leader Algorithm”.
Leader Algorithm
The leader algorithm takes a collection of vectors and begins with the first vector. With that vector in hand, it compares it against all the clusters already available; since there are no clusters yet, the algorithm creates a new cluster and moves on to the next vector. For each subsequent vector, if it resembles a cluster (measured by how close the user’s visited pages are to the visited pages of the cluster’s members), the vector is placed into that cluster; otherwise a new cluster is created. This continues until all vectors are analyzed. That’s it!
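Here is a rough sketch of that loop in Python. This is my own reading of the algorithm, not the paper’s code: I compare each new vector only against the first member of each cluster (the “leader”), a common simplification, and I use cosine similarity with a made-up threshold as the closeness measure.

```python
# Leader algorithm sketch: one pass over the session vectors,
# each represented as a dict of page-id -> visit weight.
import math

def cosine(a, b):
    """Cosine similarity between two sparse page-weight vectors."""
    dot = sum(a.get(i, 0) * b.get(i, 0) for i in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(vectors, threshold=0.5):
    clusters = []  # each cluster: {"leader": vec, "members": [vec, ...]}
    for vec in vectors:                 # a single pass over the data
        for cluster in clusters:
            if cosine(vec, cluster["leader"]) >= threshold:
                cluster["members"].append(vec)
                break
        else:                           # no cluster was close enough
            clusters.append({"leader": vec, "members": [vec]})
    return clusters

armando = {100: 1, 200: 2, 300: 1}
beth    = {101: 1, 103: 2, 106: 3, 100: 1}
print(len(leader_cluster([armando, beth])))  # 2 -> separate clusters
```

Armando and Beth share only page 100, so their similarity falls below the threshold and they end up in separate clusters, matching the walkthrough below.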
A brief note on the paper’s data sets: the authors used only specific users from the data they collected. They kept only users who accessed more than 3 pages per session, and they limit the size of each cluster to a maximum they do not specify.
Benefits
Faster than other clustering algorithms, since it makes only a single pass over the entire set of data.
Creating the Link
Now that we have our groups of people, let’s create a dynamic link. Let’s do this with another example, which builds on the earlier Armando example.
We know that Armando accessed the index page, the electronics section, and the pet goods section. His overall vector was <(100, 1), (200, 2), (300, 1)>. Since there are no clusters to compare him to, Armando is placed into his own cluster until another person visits the site.
Beth comes to the site; she visits a few pages, and after the third page the system kicks in. Her partial session log looks like this:
- GET /www.somestorehere.com/beautysupply/
- GET /www.somestorehere.com/womensclothing/
- GET /www.somestorehere.com/kidsclothing/
- GET /www.somestorehere.com/
- GET /www.somestorehere.com/kidsclothing/
- GET /www.somestorehere.com/beautysupply/
- GET /www.somestorehere.com/kidsclothing/
and her vector after preprocessing is: <(101, 1), (103, 2), (106, 3), (100, 1)>.
Since her vector bears no resemblance to Armando’s, she’s placed into a separate cluster.
Now Ambi comes to the site and begins to browse. After her third page view, her resulting partial session vector is: <(103, 1), (106, 1), (101, 1), (201, 4)>.
Since Ambi has a strong similarity to Beth in the pages she’s visited, we can choose one of Beth’s pages that Ambi has not visited, (100) for example, and present Ambi with a prominent link to that page.
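This last step can be sketched too. The sketch below is hypothetical: the paper works at the cluster level, while here I simply find the most similar earlier visitor (by cosine similarity, my choice of measure) and recommend the page of theirs the current user has not yet seen.

```python
# Pick a page to recommend: find the most similar earlier session
# vector and return an unvisited page from it, preferring the page
# that visitor weighted most heavily.
import math

def cosine(a, b):
    dot = sum(a.get(i, 0) * b.get(i, 0) for i in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(current, others):
    """Return a page-id to link, or None if nothing new to suggest."""
    best = max(others, key=lambda other: cosine(current, other))
    unseen = [page for page in best if page not in current]
    return max(unseen, key=lambda p: best[p]) if unseen else None

armando = {100: 1, 200: 2, 300: 1}
beth    = {101: 1, 103: 2, 106: 3, 100: 1}
ambi    = {103: 1, 106: 1, 101: 1, 201: 4}
print(recommend(ambi, [armando, beth]))  # 100 -> link to index.php
```

Ambi shares pages 101, 103, and 106 with Beth, so Beth is the closest match, and page 100 is the one page of Beth’s that Ambi has not visited.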
Summary
The benefits to a small mom-and-pop shop are apparent, since Amazon.com, I’m sure, has better algorithms than what was presented here. The paper is short but to the point. It shows how to capture the data, process it into usable vectors, and then cluster the data using the leader algorithm to recommend links to users.
Given that the paper was written in 1996, it shows a very simple method of creating dynamic links for users based on like-minded users (users with the same pattern of clicks), and it proposes an algorithm which, for its time, might have been “cutting-edge”.
Armando Padilla