Tutorial - Recommending Music with the last.fm 360K dataset.
This tutorial shows the major functionality of the implicit library by building a music recommender system using the last.fm 360K dataset.
Getting the Dataset
Implicit includes code to access several popular recommender datasets in the implicit.datasets module. The following code will download the lastfm dataset locally and load it into memory:
[1]:
from implicit.datasets.lastfm import get_lastfm
artists, users, artist_user_plays = get_lastfm()
artist_user_plays is a scipy sparse matrix, with each row corresponding to a different musician and each column corresponding to a different user. The non-zero entries in the artist_user_plays matrix contain the number of times that the user has played that artist. The artists and users variables are arrays of string labels for each row and column in the artist_user_plays matrix.
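As a quick, illustrative peek at the loaded data (not part of the original tutorial):

# rows are artists, columns are users; the label arrays line up with each axis
print(artist_user_plays.shape)   # (number of artists, number of users)
print(len(artists), len(users))  # matches the two dimensions above
print(artists[:3])               # first few artist names as strings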
The implicit library is solely focused on implicit feedback recommender systems - where we are given positive examples of what the user has interacted with, but aren't given the corresponding negative examples of what users aren't interested in. For this example we're shown the number of times that each user has played an artist in the dataset, and can infer that a high play count indicates that the user likes an artist. However, we can't infer that a user dislikes a band just because they have never played it.
Training a Model
Implicit provides implementations of several different algorithms for implicit feedback recommender systems. For this example we'll be looking at the AlternatingLeastSquares model that's based on the paper Collaborative Filtering for Implicit Feedback Datasets. This model aims to learn a binary target of whether each user has interacted with each item, but weights each binary interaction by a confidence value reflecting how confident we are in that user/item interaction. The implementation in implicit uses the values of a sparse matrix to represent the confidences, with the non-zero entries indicating which user/item interactions occurred.
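For reference, the objective minimized in the Hu et al. paper (in the paper's notation; implicit's exact parameterization may differ slightly) is

$$\min_{x_*, y_*} \sum_{u,i} c_{ui}\left(p_{ui} - x_u^\top y_i\right)^2 + \lambda\left(\sum_u \lVert x_u\rVert^2 + \sum_i \lVert y_i\rVert^2\right)$$

where $p_{ui}$ is 1 if user $u$ interacted with item $i$ and 0 otherwise, and the confidence is $c_{ui} = 1 + \alpha r_{ui}$ for a raw interaction count $r_{ui}$.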
The first step in using this model is transforming the raw play counts from the original dataset into values that can be used as confidences. We want repeated plays to carry more confidence in the model, but have this effect taper off as the number of repeated plays increases, to reduce the impact a single superfan has on the model. Likewise, we want to direct some of the confidence weight away from popular items. To do this we'll use a BM25 weighting scheme inspired by classic information retrieval:
[2]:
from implicit.nearest_neighbours import bm25_weight
# weight the matrix, both to reduce impact of users that have played the same artist thousands of times
# and to reduce the weight given to popular items
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)
# get the transpose, since most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()
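As an illustrative sanity check (not part of the original tutorial), the transpose should flip the matrix to a (user, artist) orientation:

# after transposing, rows are users and columns are artists
assert user_plays.shape == (len(users), len(artists))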
Once we have a weighted confidence matrix, we can use that to train an ALS model using implicit:
[3]:
from implicit.als import AlternatingLeastSquares
model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
model.fit(user_plays)
Fitting the model will run on any compatible NVIDIA GPU, or use all the available cores on your CPU if you don't have a GPU enabled. You can force either mode by setting the use_gpu flag on the constructor of the AlternatingLeastSquares model.
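Once fitted, the learned embeddings are exposed on the model. As an illustrative check (shapes assume the CPU implementation):

# each user and each artist gets a 64-dimensional latent factor vector
print(model.user_factors.shape)  # (number of users, factors)
print(model.item_factors.shape)  # (number of artists, factors)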
Making Recommendations
After training the model, you can make recommendations for either a single user or a batch of users with the .recommend function on the model:
[4]:
# Get recommendations for a single user
userid = 12345
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=False)
The .recommend call will compute the N best recommendations for each user in the input, returning the itemids in the ids array and the computed scores in the scores array. We can see which musicians are recommended for each user by looking up the ids in the artists array:
[5]:
# Use pandas to display the output in a table; pandas isn't otherwise a dependency of implicit
import numpy as np
import pandas as pd
pd.DataFrame({"artist": artists[ids], "score": scores, "already_liked": np.in1d(ids, user_plays[userid].indices)})
[5]:
|    | artist | score | already_liked |
|---:|--------|-------|---------------|
| 0 | mortiis | 1.056453 | True |
| 1 | puissance | 1.036747 | True |
| 2 | rome | 1.006126 | True |
| 3 | laibach | 1.003616 | False |
| 4 | the coffinshakers | 1.000682 | True |
| 5 | spiritual front | 0.980971 | False |
| 6 | karjalan sissit | 0.974622 | False |
| 7 | von thronstahl | 0.974596 | True |
| 8 | ordo rosarius equilibrio | 0.956722 | False |
| 9 | type o negative | 0.954947 | True |
The already_liked column there shows whether the user has already interacted with the item, and in this result most of the items being returned have already been interacted with by the user. We can remove these items from the result set with the filter_already_liked_items parameter - setting it to True will remove all of these items from the results. The user_plays[userid] parameter is used to look up which items each user has interacted with, and can be set to None if you aren't filtering the user's own likes or recalculating the user representation on the fly.
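For instance, the same call as above with the flag flipped (a minimal sketch):

# drop everything the user has already played from the results
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=True)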
There are also more filtering options in the filter_items and items parameters, as well as an option for recalculating the user representation on the fly with the recalculate_user parameter. See the API reference for more details.
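As a hedged sketch of those options (the itemid below is just the Beatles id used later in this tutorial):

# exclude specific items from the results
ids, scores = model.recommend(userid, user_plays[userid], filter_items=[252512])
# or recompute the user representation from the play data at query time
ids, scores = model.recommend(userid, user_plays[userid], recalculate_user=True)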
Recommending similar items
Each model in implicit also has the ability to show related items through the similar_items method. For instance, to get the related items for the Beatles:
[6]:
# get related items for the beatles (itemid = 252512)
ids, scores = model.similar_items(252512)
# display the results using pandas for nicer formatting
pd.DataFrame({"artist": artists[ids], "score": scores})
[6]:
|    | artist | score |
|---:|--------|-------|
| 0 | the beatles | 1.000000 |
| 1 | john lennon | 0.902621 |
| 2 | the beach boys | 0.875299 |
| 3 | the who | 0.874556 |
| 4 | the rolling stones | 0.871904 |
| 5 | bob dylan | 0.861967 |
| 6 | the kinks | 0.846969 |
| 7 | simon & garfunkel | 0.840297 |
| 8 | paul mccartney | 0.829233 |
| 9 | david bowie | 0.818386 |
Making batch recommendations
The .recommend, .similar_items and .similar_users calls all have the ability to generate batches of recommendations, in addition to calculating results for a single user or item. Passing an array of userids or itemids to these methods will trigger the batch versions, which return a 2D array of ids and scores, with each row in the output matrices corresponding to a value in the input. This will tend to be quite a bit more efficient than calling the method repeatedly, as implicit will use multiple threads on the CPU and achieve better device utilization on the GPU with larger batches.
[7]:
# Make recommendations for the first 1000 users in the dataset
userids = np.arange(1000)
ids, scores = model.recommend(userids, user_plays[userids])
ids, ids.shape
[7]:
(array([[161850, 107119, 150177, ..., 249560, 136336, 76757],
[128505, 189597, 71465, ..., 111764, 255779, 71225],
[186835, 167270, 142885, ..., 113686, 241312, 120981],
...,
[ 83885, 265625, 279139, ..., 202346, 43598, 264562],
[109930, 1560, 97970, ..., 116857, 236697, 33602],
[ 21090, 276679, 197984, ..., 272293, 185495, 22505]], dtype=int32),
(1000, 10))
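The same batching applies to the similarity calls. For example (illustrative; N defaults to 10):

# batch similar-items lookup: one row of related artists per input itemid
itemids = np.arange(100)
ids, scores = model.similar_items(itemids)
ids.shape  # (100, 10)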