Tutorial - Recommending Music with the last.fm 360K dataset.
This tutorial shows the major functionality of the implicit library by building a music recommender system using the last.fm 360K dataset.
Getting the Dataset
Implicit includes code to access several popular recommender datasets in the implicit.datasets module. The following code will download the lastfm dataset locally and load it into memory:
[1]:
from implicit.datasets.lastfm import get_lastfm
artists, users, artist_user_plays = get_lastfm()
artist_user_plays is a scipy sparse matrix, with each row corresponding to a different musician and each column corresponding to a different user. The non-zero entries in the artist_user_plays matrix contain the number of times that the user has played that artist. The artists and users variables are arrays of string labels for each row and column in the artist_user_plays matrix.
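As a quick, illustrative peek at the loaded data (not part of the original tutorial):

# rows are artists, columns are users; the label arrays line up with each axis
print(artist_user_plays.shape)   # (number of artists, number of users)
print(len(artists), len(users))  # matches the two dimensions above
print(artists[:3])               # first few artist names as strings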
The implicit library is solely focused on implicit feedback recommender systems - where we are given positive examples of what the user has interacted with, but aren't given the corresponding negative examples of what users aren't interested in. For this example we're shown the number of times that each user has played an artist in the dataset, and can infer that a high play count indicates that the user likes an artist. However, we can't infer that a user dislikes a band just because they have never played it.
Training a Model
Implicit provides implementations of several different algorithms for implicit feedback recommender systems. For this example we'll be looking at the AlternatingLeastSquares model that's based on the paper Collaborative Filtering for Implicit Feedback Datasets. This model aims to learn a binary target of whether each user has interacted with each item, but weights each binary interaction by a confidence value reflecting how confident we are in that user/item interaction. The implementation in implicit uses the values of a sparse matrix to represent the confidences, with the non-zero entries indicating which user/item interactions occurred.
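For reference, the objective minimized in the Hu et al. paper (in the paper's notation; implicit's exact parameterization may differ slightly) is

$$\min_{x_*, y_*} \sum_{u,i} c_{ui}\left(p_{ui} - x_u^\top y_i\right)^2 + \lambda\left(\sum_u \lVert x_u\rVert^2 + \sum_i \lVert y_i\rVert^2\right)$$

where $p_{ui}$ is 1 if user $u$ interacted with item $i$ and 0 otherwise, and the confidence is $c_{ui} = 1 + \alpha r_{ui}$ for a raw interaction count $r_{ui}$.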
The first step in using this model is transforming the raw play counts from the original dataset into values that can be used as confidences. We want repeated plays to carry more confidence in the model, but have this effect taper off as the number of repeated plays increases, to reduce the impact a single superfan has on the model. Likewise, we want to direct some of the confidence weight away from popular items. To do this we'll use a BM25 weighting scheme inspired by classic information retrieval:
[2]:
from implicit.nearest_neighbours import bm25_weight
# weight the matrix, both to reduce impact of users that have played the same artist thousands of times
# and to reduce the weight given to popular items
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)
# get the transpose, since most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()
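As an illustrative sanity check (not part of the original tutorial), the transpose should flip the matrix to a (user, artist) orientation:

# after transposing, rows are users and columns are artists
assert user_plays.shape == (len(users), len(artists))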
Once we have a weighted confidence matrix, we can use that to train an ALS model using implicit:
[3]:
from implicit.als import AlternatingLeastSquares
model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
model.fit(user_plays)
Fitting the model will run on any compatible NVIDIA GPU, or use all the available cores on your CPU if you don't have a GPU enabled. You can force either mode by setting the use_gpu flag on the constructor of the AlternatingLeastSquares model.
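Once fitted, the learned embeddings are exposed on the model. As an illustrative check (shapes assume the CPU implementation):

# each user and each artist gets a 64-dimensional latent factor vector
print(model.user_factors.shape)  # (number of users, factors)
print(model.item_factors.shape)  # (number of artists, factors)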
Making Recommendations
After training the model, you can make recommendations for either a single user or a batch of users with the .recommend function on the model:
[4]:
# Get recommendations for a single user
userid = 12345
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=False)
The .recommend call will compute the N best recommendations for each user in the input, returning the itemids in the ids array and the computed scores in the scores array. We can see which musicians are recommended for each user by looking up the ids in the artists array:
[5]:
# Use pandas to display the output in a table; pandas isn't otherwise a dependency of implicit
import numpy as np
import pandas as pd
pd.DataFrame({"artist": artists[ids], "score": scores, "already_liked": np.in1d(ids, user_plays[userid].indices)})
[5]:
|    | artist | score | already_liked |
|---:|--------|-------|---------------|
| 0 | mortiis | 1.056453 | True |
| 1 | puissance | 1.036747 | True |
| 2 | rome | 1.006126 | True |
| 3 | laibach | 1.003616 | False |
| 4 | the coffinshakers | 1.000682 | True |
| 5 | spiritual front | 0.980971 | False |
| 6 | karjalan sissit | 0.974622 | False |
| 7 | von thronstahl | 0.974596 | True |
| 8 | ordo rosarius equilibrio | 0.956722 | False |
| 9 | type o negative | 0.954947 | True |
The already_liked column there shows whether the user has already interacted with the item, and in this result most of the items being returned have already been interacted with by the user. We can remove these items from the result set with the filter_already_liked_items parameter - setting it to True will remove all of these items from the results. The user_plays[userid] parameter is used to look up which items each user has interacted with, and can be set to None if you aren't filtering the user's own likes or recalculating the user representation on the fly.
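For instance, the same call as above with the flag flipped (a minimal sketch):

# drop everything the user has already played from the results
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=True)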
There are also more filtering options in the filter_items and items parameters, as well as an option for recalculating the user representation on the fly with the recalculate_user parameter. See the API reference for more details.
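As a hedged sketch of those options (the itemid below is just the Beatles id used later in this tutorial):

# exclude specific items from the results
ids, scores = model.recommend(userid, user_plays[userid], filter_items=[252512])
# or recompute the user representation from the play data at query time
ids, scores = model.recommend(userid, user_plays[userid], recalculate_user=True)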
Recommending similar items
Each model in implicit also has the ability to show related items through the similar_items method. For instance, to get the related items for the Beatles:
[6]:
# get related items for the beatles (itemid = 252512)
ids, scores = model.similar_items(252512)
# display the results using pandas for nicer formatting
pd.DataFrame({"artist": artists[ids], "score": scores})
[6]:
|    | artist | score |
|---:|--------|-------|
| 0 | the beatles | 1.000000 |
| 1 | john lennon | 0.902621 |
| 2 | the beach boys | 0.875299 |
| 3 | the who | 0.874556 |
| 4 | the rolling stones | 0.871904 |
| 5 | bob dylan | 0.861967 |
| 6 | the kinks | 0.846969 |
| 7 | simon & garfunkel | 0.840297 |
| 8 | paul mccartney | 0.829233 |
| 9 | david bowie | 0.818386 |
Making batch recommendations
The .recommend, .similar_items and .similar_users calls all have the ability to generate batches of recommendations, in addition to calculating results for a single user or item. Passing an array of userids or itemids to these methods will trigger the batch versions, which return a 2D array of ids and scores, with each row in the output matrices corresponding to a value in the input. This will tend to be quite a bit more efficient than calling the method repeatedly, as implicit will use multiple threads on the CPU and achieve better device utilization on the GPU with larger batches.
[7]:
# Make recommendations for the first 1000 users in the dataset
userids = np.arange(1000)
ids, scores = model.recommend(userids, user_plays[userids])
ids, ids.shape
[7]:
(array([[161850, 107119, 150177, ..., 249560, 136336, 76757],
[128505, 189597, 71465, ..., 111764, 255779, 71225],
[186835, 167270, 142885, ..., 113686, 241312, 120981],
...,
[ 83885, 265625, 279139, ..., 202346, 43598, 264562],
[109930, 1560, 97970, ..., 116857, 236697, 33602],
[ 21090, 276679, 197984, ..., 272293, 185495, 22505]], dtype=int32),
(1000, 10))
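The same batching applies to the similarity calls. For example (illustrative; N defaults to 10):

# batch similar-items lookup: one row of related artists per input itemid
itemids = np.arange(100)
ids, scores = model.similar_items(itemids)
ids.shape  # (100, 10)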