| paul perry . |   blog   |   about   |   notes |
A Minimalist Implementation of Collaborative FilteringAutomated Collaborative Filtering (ACF) in SQL.Paul Perry, November 2000. Most papers in computer science describe how their author learned what someone else already knew.- Peter Landin * IntroductionBefore you blindly deploy a recommender system or build your own from scratch, you should evaluate whether your domain lends itself to k-nearest neighbor collaborative filtering and compare its performance against other approaches. I recently had such a task on my hands and needed a minimalist implementation of ACF as an algorithmic testbed for data analysis. Others have done this and have developed full applications to study various tradeoffs [Breeze], but building my own gives me more control to evolve other approaches and optimizations. Below is the smallest amount of code that could implement ACF using SQL as the easiest way to handle the data sets. I bet that new recommender architectures could be based on such a small codebase. The performance issue will be the caching properties of the database, but new in-memory databases could postpone the performance issue for a while. Note (Dec 2001): After writing this note I found DBLens Collaborative Filtering Research Engine by Jon Herlocker of the GroupLens team. Looks interesting but I haven't tried it. This paper assumes knowledge of ACF and delves right into the implementation details. If you download the code and find problems or make improvements, I would like to hear about it, thanks. The Basic ACF Algorithm1. Collect user ratings of items RI,p is user I's rating for item p2. Calculate distance D between users I and J Using the mean square difference method:
Where ||Items|| is the number of items user I and user J have rated in common. I make the assumption that missing ratings don't factor into the distance calculations. 3. Construct the Neighbor set for user I User J is a possible nearest neighbor of I iff DI,J ≤ Lwhere L is the maximum distance that two uses can be separated to be considered neighbors. 4. Calculate weight of each neighbor
5. Calculate the predicted rating of Items The predicted rating for item p is calculated as:
An Analytical WorkbenchBased on the schema described in the subsequent section, the steps for evaluating an ACF approach are:
Note: we can ignore the need to calculate the weighted averages of each neighbor as it doesn't affect the final predicted rating. SQL Schema RepresentationCREATE TABLE [dbo].[users] ( [user_id] [int] NOT NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[ratings] ( [user_id] [int] NOT NULL , [item_id] [int] NOT NULL , [rating] [int] NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[acf_neighbors] ( [user_id] [int] NOT NULL , [neighbor_id] [int] NOT NULL , [distance] [int] NULL , [common_items] [int] NULL ) ON [PRIMARY] GO ALTER TABLE [dbo].[acf_neighbors] WITH NOCHECK ADD CONSTRAINT [PK_acf_neighbors] PRIMARY KEY CLUSTERED ( [user_id] DESC , [neighbor_id] DESC ) ON [PRIMARY] GO ALTER TABLE [dbo].[users] WITH NOCHECK ADD CONSTRAINT [PK_users] PRIMARY KEY CLUSTERED ( [user_id] ) ON [PRIMARY] GO CREATE INDEX [user_id_ratings_idx] ON [dbo].[ratings]([user_id] DESC ) ON [PRIMARY] GO CREATE INDEX [item_id_ratings_idx] ON [dbo].[ratings]([item_id] DESC ) ON [PRIMARY] GO Store the predicted ratings in a separate table. CREATE TABLE [dbo].[pred_ratings] ( [user_id] [int] NOT NULL , [item_id] [int] NOT NULL , [rating] [int] NULL ) ON [PRIMARY] GO CREATE INDEX [user_id_pred_ratings_idx] ON [dbo].[pred_ratings]([user_id] ) ON [PRIMARY] GO CREATE INDEX [item_id_pred_ratings_idx] ON [dbo].[pred_ratings]([item_id]) ON [PRIMARY] GO Collecting the dataAfter collecting all the ratings in the ratings table, extract the unique user list. -- insert users -- insert into users (user_id) select distinct user_id from ratings order by user_id Looking at the dataAfter collecting user ratings on items you need to look at some of the data properties, even before looking at the ACF results. Flags should be raised if the average rating isn't centered on the rating scale and the rating distribution isn't normal. These would raise questions in the data collection process. All the code makes the assumption that we are working with a rating scale of 1-100. Average Rating per Item-- average rating select item_id, avg(rating) as average from ratings group by item_id order by avg(rating) desc compute avg(avg(rating)) Ratings distribution-- distribution of ratings select rating, count(*) as count from ratings group by rating order by rating Item ratings at the extreme ends of the scale-- which items occur more often at the extreme ends of the scale select item_id, count(*) as count into #extreme_ratings from ratings where rating > 80 or rating < 20 group by item_id select item_id, count(*) as high_count into #high_ratings from ratings where rating > 80 group by item_id select item_id, count(*) as low_count into #low_ratings from ratings where rating < 20 group by item_id select h.item_id, e.count, h.high_count, l.low_count from #extreme_ratings e, #high_ratings h, #low_ratings l where e.item_id = h.item_id and e.item_id = l.item_id order by e.count desc drop table #extreme_ratings drop table #high_ratings drop table #low_ratings Calculating Neighbors-- nearest neighbor computation insert into acf_neighbors (user_id, neighbor_id, distance, common_items) select u1.user_id, u2.user_id as neighbor_id, sum(SQUARE(u1.rating - u2.rating))/count(u1.user_id) as distance, count(u1.user_id) as common_items from ratings u1, ratings u2 where u1.user_id != u2.user_id and u1.item_id = u2.item_id group by u1.user_id, u2.user_id order by u1.user_id, distance desc But you need to calculate neighbors only based on the "known ratings" data, so use the following: -- nearest neighbor computation for testbed data insert into acf_neighbors (user_id, neighbor_id, distance, common_items) select u1.user_id, u2.user_id as neighbor_id, sum(SQUARE(u1.rating - u2.rating))/count(u1.user_id) as distance, count(u1.user_id) as common_items from ratings u1, ratings u2 where u1.user_id != u2.user_id and u1.item_id = u2.item_id and u1.rating > -1 and u2.rating > -1 group by u1.user_id, u2.user_id order by u1.user_id, distance desc Looking at the Neighbor dataDistance distribution-- distribution of mean squared differences select distance, count(*) as count from acf_neighbors group by distance order by distance Average number of neighborsIf the neighbor count is too low, then we don't have enough data, either because not enough users rated items in common, or we didn't have enough of a user population to rate items. -- average number of neighbors within a given distance select user_id, count(*) from acf_neighbors where distance < 500 group by user_id compute avg(count(*)), stdev(count(*)) Calculating Predictions-- insert items to predict insert into pred_ratings (user_id, item_id) select user_id, item_id from ratings where rating < 0 order by user_id, item_id We calculate predictions based on the neighbor set, which is defined by a fixed cutoff. The cutoff could be determined dynamically on a per user basis, but in this case it is fixed at one standard deviation (600 for my data).
-- calculate predictions
declare pred_cursor cursor
for select user_id, item_id from pred_ratings
declare @user int, @item int
open pred_cursor
fetch pred_cursor into @user, @item
while @@fetch_status != -1 begin
update pred_ratings
set rating =
(select SUM(r.rating * n.distance)/SUM(n.distance) as rating
from
ratings r,
acf_neighbors n
where
-- for every rating not yet predicted
n.user_id = @user
and r.item_id = @item
-- find the neighbors under a certain cutoff
-- in this case 600, you pick your own
and n.neighbor_id = r.user_id
and n.distance < 600)
where current of pred_cursor
fetch pred_cursor into @user, @item
end
close pred_cusor
deallocate pred_cursor
Finding the best items for a specific user
-- best items for user 52
select item_id, SUM(rating*n.distance)/SUM(distance) as quality,
count(52) as num_matches
from ratings r,
acf_neighbors n
where n.user_id = 52
and n.neighbor_id = r.user_id
and item_id not in (select item_id from ratings where user_id = 52)
group by item_id
order by quality
Evaluating PerformanceError Distribution-- distribution of errors in predictions select avg((a.rating * -1) - b.rating) as error into #avg_err from ratings a, pred_ratings b where a.user_id = b.user_id and a.item_id = b.item_id and b.rating is not null group by a.user_id select error, count(*) from #avg_err group by error order by error Standard error deviation-- standard deviation of errors select stdev((a.rating * -1) - b.rating) from ratings a, pred_ratings b where a.user_id = b.user_id and a.item_id = b.item_id and b.rating is not null group by a.user_id compute stdev(stdev((a.rating * -1) - b.rating)) Error distribution at the extremes-- distribution of extreme errors in predictions select avg((a.rating * -1) - b.rating) as error into #average_error from ratings a, pred_ratings b where a.user_id = b.user_id and a.item_id = b.item_id and b.rating is not null and a.rating > 80 or a.rating < 20 group by a.user_id select error, count(*) from #average_error group by error order by error Calculate Predictability-- predictability -- divide first result by second (and * 100) to get -- the % of items that could be predicted select count(*) from pred_ratings where rating is not null select count(*) from pred_ratings ReferencesUpendra Shardanand, Social Information Filtering for Music Recommendation, Master's Thesis, MIT Media Lab, Cambridge, MA, September 1994. Upendra Shardanand and Pattie Maes Social Information Filtering: Algorithms for Automating "Word of Mouth", CHI '95. Max Metral, MotorMouth: A Generic Engine for Large-Scale, Real-Time Automated Collaborative Filtering, Master's Thesis, MIT Media Lab, Cambridge, MA, June 1995. Yezdi Lashkari Feature-Guided Automated Collaborative Filtering, MIT Media Lab, Cambridge, MA, July 1995. AcknowledgementsThanks to Max Metral for showing me the way, and David Tiu for getting me started on the FGACF version. | ||||||||||||||||||
| © 2000-2007 Paul Perry |