{ "cells": [ { "cell_type": "markdown", "id": "341158a9-edf4-488e-ae11-22114df42f3d", "metadata": { "tags": [] }, "source": [ "# Tutorial - Recommending Music with the last.fm 360K dataset." ] }, { "cell_type": "markdown", "id": "369a7190-0d1d-47ff-9a54-b297c770c417", "metadata": {}, "source": [ "This tutorial shows the major functionality of the [implicit](https://github.com/benfred/implicit) library by building a music recommender system using the the [last.fm 360K dataset](http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html)." ] }, { "cell_type": "markdown", "id": "9b1ff9f0-71cc-420c-8f18-034f24f1164b", "metadata": {}, "source": [ "### Getting the Dataset" ] }, { "cell_type": "markdown", "id": "73d1ac7e-858d-4046-a8ce-327045d8edf0", "metadata": {}, "source": [ "Implicit includes code to access several different popular recommender datasets in the ```implicit.datasets``` module. The following code will both download the lastfm dataset locally, as well as load it up into memory:" ] }, { "cell_type": "code", "execution_count": 1, "id": "1934d72e-88de-4906-bf75-74dfb5255f42", "metadata": {}, "outputs": [], "source": [ "from implicit.datasets.lastfm import get_lastfm\n", "\n", "artists, users, artist_user_plays = get_lastfm()" ] }, { "cell_type": "markdown", "id": "fdbffd74-14fa-4678-befd-21029776cdf3", "metadata": {}, "source": [ "`artist_user_plays` is a scipy sparse matrix, with the each row corresponding to a different musician and each column corresponding to a different user. The non-zero entries in the `artist_user_plays` matrix contains the number of times that the user has played that artist. The `artists` and `users` variables are arrays of string labels for each row and column in the `artist_user_plays` matrix. \n", "\n", "The implicit library is solely focused on implicit feedback recommenders systems - where we are given positive examples of what the user has interacted with, but aren't given the corresponding negative examples of what users aren't interested in. For this example we're shown the number of times that the user has played an artist in the dataset and can infer that a high play count indicates that the user likes an artist. However we can't infer that just because the user hasn't played a band before that means the user doesn't like the band." ] }, { "cell_type": "markdown", "id": "6bc39fae-8d78-4e21-9754-08394b5819e1", "metadata": { "tags": [] }, "source": [ "### Training a Model" ] }, { "cell_type": "markdown", "id": "ba930088-8ba5-4899-9457-322b24152ff1", "metadata": {}, "source": [ "Implicit provides implementations of several different algorithms for implicit feedback recommender systems. For this example we'll be looking at the `AlternatingLeastSquares` model that's based off the paper [Collaborative Filtering for Implicit Feedback Datasets](http://yifanhu.net/PUB/cf.pdf). This model aims to learn a binary target of whether each user has interacted with each item - but weights each binary interaction by a confidence value of how confident we are in this user/item interaction. The implementation in implicit uses the values of a sparse matrix to represent the confidences, with the non zero entries representing whether or not the user has interacted with the item.\n", "\n", "The first step in using this model is going to be transforming the raw play counts from the original dataset into values that can be used as confidences. 
We want to give repeated plays more confidence in the model, but have this effect taper off as the number of repeated plays increases to reduce the impact a single superfan has on the model. Likewise, we want to direct some of the confidence weight away from popular items. To do this we'll use a [bm25](https://en.wikipedia.org/wiki/Okapi_BM25) weighting scheme inspired by classic information retrieval:" ] },
{ "cell_type": "code", "execution_count": 2, "id": "1ee95a00-867d-42b1-8f3b-3a3c552e32fc", "metadata": {}, "outputs": [], "source": [ "from implicit.nearest_neighbours import bm25_weight\n", "\n", "# weight the matrix, both to reduce the impact of users that have played the same artist thousands of times\n", "# and to reduce the weight given to popular items\n", "artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)\n", "\n", "# get the transpose since most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)\n", "user_plays = artist_user_plays.T.tocsr()" ] },
{ "cell_type": "markdown", "id": "569b2c60-211a-45a2-bc35-74f4ec1ad9ea", "metadata": {}, "source": [ "Once we have a weighted confidence matrix, we can use that to train an ALS model using implicit:" ] },
{ "cell_type": "code", "execution_count": 3, "id": "56c85d21-55e7-4f4d-9ba2-1d09679e50ae", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7d824141fc3b4bf096687375b99223e8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "  0%|          | 0/15 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from implicit.als import AlternatingLeastSquares\n", "\n", "# initialize and train an ALS model on the weighted play counts\n", "# (64 factors and these regularization/alpha values are a reasonable starting point for this dataset)\n", "model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)\n", "model.fit(user_plays)" ] },
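{ "cell_type": "markdown", "id": "making-recommendations-heading", "metadata": {}, "source": [ "### Making Recommendations\n", "\n", "After training the model we can ask it for recommendations for a single user by passing a userid and that user's row of the `user_plays` matrix to the `recommend` call. As a sketch of how this looks, the cell below picks an arbitrary user (id 12345 here) and asks for their top 10 artists, without filtering out artists they have already played:" ] },
{ "cell_type": "code", "execution_count": 4, "id": "making-recommendations-code", "metadata": {}, "outputs": [], "source": [ "# recommend the 10 highest scoring artists for a single user,\n", "# keeping artists they have already played in the results\n", "userid = 12345\n", "ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=False)" ] },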
\n", "" ], "text/plain": [ " artist score already_liked\n", "0 mortiis 1.056453 True\n", "1 puissance 1.036747 True\n", "2 rome 1.006126 True\n", "3 laibach 1.003616 False\n", "4 the coffinshakers 1.000682 True\n", "5 spiritual front 0.980971 False\n", "6 karjalan sissit 0.974622 False\n", "7 von thronstahl 0.974596 True\n", "8 ordo rosarius equilibrio 0.956722 False\n", "9 type o negative 0.954947 True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use pandas to display the output in a table, pandas isn't a dependency of implicit otherwise\n", "import numpy as np\n", "import pandas as pd\n", "pd.DataFrame({\"artist\": artists[ids], \"score\": scores, \"already_liked\": np.in1d(ids, user_plays[userid].indices)})" ] }, { "cell_type": "markdown", "id": "f9bceb44-5bd2-4d35-ba9a-bab244a1bedd", "metadata": {}, "source": [ "The `already_liked` column there shows if the user has interacted with the item already, and in this result most of the items being returned have already been interacted with by the user. We can remove these items from the result set with the `filter_already_liked_items` parameter - setting to `True` will remove all of these items from the results. The `user_plays[userid]` parameter is used to look up what items each user has interacted with, and can just be set to None if you aren't filtering the users own likes or recalculating the user representation on the fly.\n", "\n", "There are also more filtering options present in the `filter_items` parameter and `items` parameter, as well as options for recalculating the user representation on the fly with the `recalculate_user` parameter. See the API reference for more details." ] }, { "cell_type": "markdown", "id": "11bfb804-bc05-4cf5-b463-5be7b39b534f", "metadata": {}, "source": [ "### Recommending similar items\n", "\n", "Each model in implicit also has the ability to show related items through the `similar_items` method. For instance to get the related items for the Beatles:" ] }, { "cell_type": "code", "execution_count": 6, "id": "dba78592-9f70-447c-a3a8-786c2f51e67e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
artistscore
0the beatles1.000000
1john lennon0.902621
2the beach boys0.875299
3the who0.874556
4the rolling stones0.871904
5bob dylan0.861967
6the kinks0.846969
7simon & garfunkel0.840297
8paul mccartney0.829233
9david bowie0.818386
\n", "
" ], "text/plain": [ " artist score\n", "0 the beatles 1.000000\n", "1 john lennon 0.902621\n", "2 the beach boys 0.875299\n", "3 the who 0.874556\n", "4 the rolling stones 0.871904\n", "5 bob dylan 0.861967\n", "6 the kinks 0.846969\n", "7 simon & garfunkel 0.840297\n", "8 paul mccartney 0.829233\n", "9 david bowie 0.818386" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get related items for the beatles (itemid = 25512)\n", "ids, scores= model.similar_items(252512)\n", "\n", "# display the results using pandas for nicer formatting\n", "pd.DataFrame({\"artist\": artists[ids], \"score\": scores})" ] }, { "cell_type": "markdown", "id": "a1d4727c-ba8d-4f37-9aee-61446d4ef440", "metadata": {}, "source": [ "### Making batch recommendations\n", "\n", "The `.recommend`, `.similar_items` and `.similar_users` calls all have the ability to generate batches of recommendations - in addition to just calculating a single user or item. Passing an array of userids or itemids to these methods will trigger the batch methods, and return a 2D array of ids and scores - with each row in the output matrices corresponding to value in the input. This will tend to be quite a bit more efficient calling the method repeatedly, as implicit will use multiple threads on the CPU and achieve better device utilization on the GPU with larger batches. " ] }, { "cell_type": "code", "execution_count": 7, "id": "10924205-8df4-41e2-ac06-1f064d915442", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([[161850, 107119, 150177, ..., 249560, 136336, 76757],\n", " [128505, 189597, 71465, ..., 111764, 255779, 71225],\n", " [186835, 167270, 142885, ..., 113686, 241312, 120981],\n", " ...,\n", " [ 83885, 265625, 279139, ..., 202346, 43598, 264562],\n", " [109930, 1560, 97970, ..., 116857, 236697, 33602],\n", " [ 21090, 276679, 197984, ..., 272293, 185495, 22505]], dtype=int32),\n", " (1000, 10))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make recommendations for the first 1000 users in the dataset\n", "userids = np.arange(1000)\n", "ids, scores = model.recommend(userids, user_plays[userids])\n", "ids, ids.shape" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }