paul perry .   blog   |   about   |   notes

A Minimalist Implementation of Collaborative Filtering

Automated Collaborative Filtering (ACF) in SQL.

Paul Perry, November 2000.

Most papers in computer science describe how their author learned what someone else already knew.- Peter Landin *

Introduction

Before you blindly deploy a recommender system or build your own from scratch, you should evaluate whether your domain lends itself to k-nearest neighbor collaborative filtering and compare its performance against other approaches.

I recently had such a task on my hands and needed a minimalist implementation of ACF as an algorithmic testbed for data analysis. Others have done this and have developed full applications to study various tradeoffs [Breeze], but building my own gives me more control to evolve other approaches and optimizations.

Below is the smallest amount of code that could implement ACF using SQL as the easiest way to handle the data sets.

I bet that new recommender architectures could be based on such a small codebase. The performance issue will be the caching properties of the database, but new in-memory databases could postpone the performance issue for a while.

Note (Dec 2001): After writing this note I found DBLens Collaborative Filtering Research Engine by Jon Herlocker of the GroupLens team. Looks interesting but I haven't tried it.

This paper assumes knowledge of ACF and delves right into the implementation details. If you download the code and find problems or make improvements, I would like to hear about it, thanks.

The Basic ACF Algorithm

1. Collect user ratings of items

RI,p is user I's rating for item p

2. Calculate distance D between users I and J

Using the mean square difference method:

    DI,J = ||Items||
å
p=1
(RI,p - RJ,p)2

||Items||

Where ||Items|| is the number of items user I and user J have rated in common. I make the assumption that missing ratings don't factor into the distance calculations.

3. Construct the Neighbor set for user I

User J is a possible nearest neighbor of I iff

    DI,JL

where L is the maximum distance that two uses can be separated to be considered neighbors.

4. Calculate weight of each neighbor

    WI,J = L - DI,J

L

5. Calculate the predicted rating of Items

The predicted rating for item p is calculated as:

    PI,p = ||NeighborsI||
å
J=1
WI,J × RJ,p

||NeighborsI||
å
J=1
WI,J

An Analytical Workbench

Based on the schema described in the subsequent section, the steps for evaluating an ACF approach are:

  1. Collect user ratings on items in the ratings table
  2. Populate the user and item tables from the ratings table after all the data has been collected
  3. Separate 20% of the ratings data as the 'predicted set' by making the rating negative (note that ratings of 0 will be problematic with this method, so change them to a -1)
  4. Populate the predicted_ratings table with the items to be predicted
  5. Calculate the neighbor set of each user based ONLY on the "known" ratings, i.e. no ratings in the predicted set
  6. Calculate the predictions for every item in the predicted_ratings table

Note: we can ignore the need to calculate the weighted averages of each neighbor as it doesn't affect the final predicted rating.

SQL Schema Representation


CREATE TABLE [dbo].[users] (
	[user_id] [int] NOT NULL 
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ratings] (
	[user_id] [int] NOT NULL ,
	[item_id] [int] NOT NULL ,
	[rating] [int] NULL 
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[acf_neighbors] (
	[user_id] [int] NOT NULL ,
	[neighbor_id] [int] NOT NULL ,
	[distance] [int] NULL ,
	[common_items] [int] NULL 
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[acf_neighbors] WITH NOCHECK ADD 
	CONSTRAINT [PK_acf_neighbors] PRIMARY KEY  CLUSTERED 
	(
		[user_id] DESC ,
		[neighbor_id] DESC 
	)  ON [PRIMARY] 
GO

ALTER TABLE [dbo].[users] WITH NOCHECK ADD 
	CONSTRAINT [PK_users] PRIMARY KEY  CLUSTERED 
	(
		[user_id]
	)  ON [PRIMARY] 
GO

 CREATE  INDEX [user_id_ratings_idx] ON [dbo].[ratings]([user_id] DESC ) ON [PRIMARY]
GO

 CREATE  INDEX [item_id_ratings_idx] ON [dbo].[ratings]([item_id] DESC ) ON [PRIMARY]
GO

Store the predicted ratings in a separate table.

CREATE TABLE [dbo].[pred_ratings] (
	[user_id] [int] NOT NULL ,
	[item_id] [int] NOT NULL ,
	[rating] [int] NULL 
) ON [PRIMARY]
GO

 CREATE  INDEX [user_id_pred_ratings_idx] ON [dbo].[pred_ratings]([user_id] ) ON [PRIMARY]
GO

 CREATE  INDEX [item_id_pred_ratings_idx] ON [dbo].[pred_ratings]([item_id]) ON [PRIMARY]
GO

Collecting the data

After collecting all the ratings in the ratings table, extract the unique user list.

-- insert users
-- 
insert into users (user_id)
select distinct user_id
from ratings
order by user_id

Looking at the data

After collecting user ratings on items you need to look at some of the data properties, even before looking at the ACF results. Flags should be raised if the average rating isn't centered on the rating scale and the rating distribution isn't normal. These would raise questions in the data collection process.

All the code makes the assumption that we are working with a rating scale of 1-100.

Average Rating per Item

-- average rating 
select item_id, avg(rating) as average
from ratings
group by item_id
order by avg(rating) desc
compute avg(avg(rating))

Ratings distribution

-- distribution of ratings
select rating, count(*) as count
from ratings
group by rating
order by rating

Item ratings at the extreme ends of the scale

-- which items occur more often at the extreme ends of the scale

select item_id, count(*) as count
into #extreme_ratings
from ratings
where rating > 80
or rating < 20
group by item_id

select item_id, count(*) as high_count
into #high_ratings
from ratings
where rating > 80
group by item_id

select item_id, count(*) as low_count
into #low_ratings
from ratings
where rating < 20
group by item_id

select h.item_id, e.count, h.high_count, l.low_count
from #extreme_ratings e, #high_ratings h, #low_ratings l
where e.item_id = h.item_id 
and e.item_id = l.item_id
order by e.count desc

drop table #extreme_ratings
drop table #high_ratings
drop table #low_ratings

Calculating Neighbors

-- nearest neighbor computation
insert into acf_neighbors (user_id, neighbor_id, distance, common_items)
select u1.user_id, u2.user_id as neighbor_id,
sum(SQUARE(u1.rating - u2.rating))/count(u1.user_id) as distance, 
count(u1.user_id) as common_items
from ratings u1, ratings u2
where u1.user_id != u2.user_id
and u1.item_id = u2.item_id
group by u1.user_id, u2.user_id
order by u1.user_id, distance desc

But you need to calculate neighbors only based on the "known ratings" data, so use the following:

-- nearest neighbor computation for testbed data
insert into acf_neighbors (user_id, neighbor_id, distance, common_items)
select u1.user_id, u2.user_id as neighbor_id,
sum(SQUARE(u1.rating - u2.rating))/count(u1.user_id) as distance, 
count(u1.user_id) as common_items
from ratings u1, ratings u2
where u1.user_id != u2.user_id
and u1.item_id = u2.item_id
and u1.rating > -1
and u2.rating > -1
group by u1.user_id, u2.user_id
order by u1.user_id, distance desc

Looking at the Neighbor data

Distance distribution

-- distribution of mean squared differences
select distance, count(*) as count
from acf_neighbors
group by distance
order by distance

Average number of neighbors

If the neighbor count is too low, then we don't have enough data, either because not enough users rated items in common, or we didn't have enough of a user population to rate items.

-- average number of neighbors within a given distance

select user_id, count(*) 
from acf_neighbors
where distance < 500
group by user_id
compute avg(count(*)), stdev(count(*))

Calculating Predictions

-- insert items to predict
insert into pred_ratings (user_id, item_id)
select user_id, item_id
from ratings
where rating < 0
order by user_id, item_id

We calculate predictions based on the neighbor set, which is defined by a fixed cutoff. The cutoff could be determined dynamically on a per user basis, but in this case it is fixed at one standard deviation (600 for my data).

-- calculate predictions

declare pred_cursor cursor
for select user_id, item_id from pred_ratings
declare @user int, @item int
open pred_cursor
fetch pred_cursor into @user, @item
while @@fetch_status != -1 begin
    update pred_ratings 
    set rating =
    (select SUM(r.rating * n.distance)/SUM(n.distance) as rating
    from 
       ratings r,
       acf_neighbors n
    where 
    -- for every rating not yet predicted
    n.user_id = @user
    and r.item_id = @item
    -- find the neighbors under a certain cutoff
    -- in this case 600, you pick your own
    and n.neighbor_id = r.user_id
    and n.distance < 600)
   where current of pred_cursor
fetch pred_cursor into @user, @item
end
close pred_cusor
deallocate pred_cursor

Finding the best items for a specific user

-- best items for user 52
select item_id, SUM(rating*n.distance)/SUM(distance) as quality,
count(52) as num_matches
from ratings r,
     acf_neighbors n
where n.user_id = 52
and n.neighbor_id = r.user_id
and item_id not in (select item_id from ratings where user_id = 52)
group by item_id
order by quality

Evaluating Performance

Error Distribution

-- distribution of errors in predictions

select avg((a.rating * -1) - b.rating) as error
into #avg_err
from ratings a, pred_ratings b
where a.user_id = b.user_id
and a.item_id = b.item_id
and b.rating is not null
group by a.user_id

select error, count(*)
from #avg_err
group by error
order by error

Standard error deviation

-- standard deviation of errors

select stdev((a.rating * -1) - b.rating) 
from ratings a, pred_ratings b
where a.user_id = b.user_id
and a.item_id = b.item_id
and b.rating is not null
group by a.user_id
compute stdev(stdev((a.rating * -1) - b.rating))

Error distribution at the extremes

-- distribution of extreme errors in predictions

select avg((a.rating * -1) - b.rating) as error
into #average_error
from ratings a, pred_ratings b
where a.user_id = b.user_id
and a.item_id = b.item_id
and b.rating is not null
and a.rating > 80 or a.rating < 20
group by a.user_id

select error, count(*)
from #average_error
group by error
order by error

Calculate Predictability

-- predictability
-- divide first result by second (and * 100) to get 
-- the % of items that could be predicted

select count(*)
from pred_ratings
where rating is not null

select count(*)
from pred_ratings

References

Upendra Shardanand, Social Information Filtering for Music Recommendation, Master's Thesis, MIT Media Lab, Cambridge, MA, September 1994.

Upendra Shardanand and Pattie Maes Social Information Filtering: Algorithms for Automating "Word of Mouth", CHI '95.

Max Metral, MotorMouth: A Generic Engine for Large-Scale, Real-Time Automated Collaborative Filtering, Master's Thesis, MIT Media Lab, Cambridge, MA, June 1995.

Yezdi Lashkari Feature-Guided Automated Collaborative Filtering, MIT Media Lab, Cambridge, MA, July 1995.

Acknowledgements

Thanks to Max Metral for showing me the way, and David Tiu for getting me started on the FGACF version.