A recommendation system enables us to offer products or services to new users. Undoubtedly, this is essential for many online businesses, such as Netflix. That’s why today we will learn what a recommendation system is, which types exist and, of course, how to code them from scratch in R. Let’s get to it!
Besides, two systems could also be combined, creating hybrid models, as in the case of ensemble models in Machine Learning.
That being said, let’s see how to code a recommendation system from scratch in R. Let’s do it!
In order to code a recommendation system in R, the first thing that we need is data. In my case, I will not use the typical MovieLens dataset. Instead, I will use a book dataset, as I find it much more real. Anyway, if you want to try with another dataset, check out this page.
We will download the dataset we will use from here.
url = "http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip"
download.file(url, destfile = "data.zip")
dir.create("data")
unzip("data.zip", exdir = "data")

files = paste0("data/", list.files("data"))

ratings = read.csv(files[1], sep = ";")
books = read.csv(files[2], sep = ";")
users = read.csv(files[3], sep = ";")

rm(files, url)
Before coding anything, we will see what type of data we have in each dataset. This is very important, because the type of analysis we should do will depend on the data that we have (we will talk about this later in the post).

library(dplyr)
glimpse(books)

## Rows: 115,253
## Columns: 8
## $ ISBN                0195153448, 0002005018, 0060973129, 0374157065.
## $ Book.Title          Classical Mythology, Clara Callan, Decision in .
## $ Book.Author         Mark P. O. Morford, Richard Bruce Wright, Carlo.
## $ Year.Of.Publication 2002, 2001, 1991, 1999, 1999, 1991, 2000, 1993.
## $ Publisher           Oxford University Press, HarperFlamingo Canada.
## $ Image.URL.S         http://images.amazon.com/images/P/0195153448.01.
## $ Image.URL.M         http://images.amazon.com/images/P/0195153448.01.
## $ Image.URL.L         http://images.amazon.com/images/P/0195153448.01.
As you can see, we have 4 variables referring to some features of the book (Title, Author, Year of Publication and Publisher). We will use these variables to code the content-based recommendation system. Besides, we will use the images to show the recommendations the system is making.
Anyway, in order to generate more realistic data, we will include a new variable called ‘category’. This variable is assigned at random and indicates which of the following categories the book belongs to: Action and Adventure, Classic, Detective and Mystery, or Fantasy.
set.seed(1234)
categories = c("Action and Adventure", "Classic", "Detective and Mystery", "Fantasy")

books$category = sample(categories, nrow(books), replace = TRUE,
                        prob = c(0.25, 0.3, 0.25, 0.20))
books$category = as.factor(books$category)

rm(categories)
Besides, I will apply a small transformation: I will add the prefix ‘Isbn.’ to all the ISBNs and ‘User.’ to all the User-IDs. I do this because at some point we will construct matrices that have ISBNs or User-IDs as column or row names. As they all begin with a number, R would add an X before the column/row name. Adding these prefixes prevents that from happening.
books$ISBN = paste0("Isbn.", books$ISBN)
users$User.ID = paste0("User.", users$User.ID)
ratings$ISBN = paste0("Isbn.", ratings$ISBN)
ratings$User.ID = paste0("User.", ratings$User.ID)
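To see why the prefix helps, here is a minimal sketch of the renaming behaviour, using one ISBN from the data. It relies only on base R: `make.names()` is the function that `data.frame()` applies to column names by default.

```r
raw_isbn <- "0195153448"

# A name that starts with a digit is not syntactically valid, so R prepends "X"
fixed_raw <- make.names(raw_isbn)

# With the prefix, the name is already valid and stays untouched
prefixed <- paste0("Isbn.", raw_isbn)
fixed_prefixed <- make.names(prefixed)

print(fixed_raw)
print(fixed_prefixed)
```

The first call returns the ISBN with an X in front, while the prefixed version comes back unchanged, so the identifiers keep matching across `books`, `ratings` and any matrix built from them.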
On the other hand, we will see how the ratings of the books are distributed. This is very important for the collaborative recommendation systems that we will build with R.
library(ggplot2)

ratings %>%
  group_by(Book.Rating) %>%
  summarize(cases = n()) %>%
  ggplot(aes(Book.Rating, cases)) +
  geom_col() +
  theme_minimal() +
  scale_x_continuous(breaks = 0:10)

As we can see, there are a lot of zeros. This is a little weird, and it might indicate the absence of a rating (a person has read the book but has not rated it). Thus, we will keep only the non-zero ratings.
ratings = ratings[ratings$Book.Rating!= 0, ]
Now we can redo the same graph and better see how the ratings are distributed:

ratings %>%
  group_by(Book.Rating) %>%
  summarize(cases = n()) %>%
  ggplot(aes(Book.Rating, cases)) +
  geom_col() +
  theme_minimal() +
  scale_x_continuous(breaks = 0:10)

Finally, let’s see how many ratings each user has given:

ratings_sum = ratings %>%
  group_by(User.ID) %>%
  count()

summary(ratings_sum$n)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    1.000    1.000    1.000    5.319    3.000 1906.000
As we can see, 75% of users have given 3 ratings or fewer. We are going to remove the least active users (those with 4 ratings or fewer) to keep only the more significant ones and, thus, reduce computing needs:

user_index = ratings_sum$User.ID[ratings_sum$n > 4]

users = users[users$User.ID %in% user_index, ]
ratings = ratings[ratings$User.ID %in% user_index, ]
books = books[books$ISBN %in% ratings$ISBN, ]

rm(ratings_sum, user_index)
Now that we know our data, let’s create our recommendation system in R. Let’s get to it!
A content-based recommendation system uses product characteristics to find similar products.
As we have seen previously, in our case we have different characteristics of the books: book title, year, author, publisher and category. However, are all these characteristics relevant to the user?
The title of the book, for example, seems like it’s not a very good feature for a recommendation. Perhaps, if we had the description we could do an analysis of the text to find the keywords of the description. That would be something that could make sense, but the title… I don’t think so, so we drop it.
In my opinion, the year can also be misleading: I would not buy a book because it was published in the same year as one that I liked. Perhaps it is something that only makes sense once certain conditions are met… In any case, I have also removed it.
Therefore, we are left with the variables: Author, Publisher and Category. But… how can we know how similar two products are based on some categories? For that, we have to calculate distances.
The form of distance we can calculate will largely depend on the types of data we have. There are basically three options:
In our case, all the data is categorical. However, we will use the Gower distance, which also works for categorical data; this way you will also know how to proceed if numerical data (such as the year) is included. Let’s see how it works:

library(cluster)

books_distance = books[,c("ISBN","Book.Author","Publisher")]

# Convert variables to factors
books_distance[,1]

# Error: cannot allocate vector of size 49.5 Gb
As we can see, we cannot calculate distances. Why? Well, because this formula calculates the distance between all the elements. Therefore, the result is a matrix of n x n , where n is the number of different books that we have.
In our case we have 115246 different books, so we would have to build a 115246 x 115246 matrix. This is too heavy to work with locally. In fact, if we try to create it, we will see that it returns the same error:
matrix(ncol = 115246, nrow = 115246)
# Error: cannot allocate vector of size 49.5 Gb
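Where does the 49.5 Gb figure come from? Here is a quick back-of-the-envelope check. It assumes that R stores a logical value (such as the `NA` that fills an empty `matrix()`) in 4 bytes and a double in 8 bytes, and that R reports sizes in units of 1024^3 bytes.

```r
n <- 115246  # number of different books

# Full n x n matrix of logical NAs, as created by matrix(ncol = n, nrow = n)
cells <- n^2
gib_logical <- cells * 4 / 1024^3

# A distance object only stores the lower triangle, but as doubles
pairs <- n * (n - 1) / 2
gib_dist <- pairs * 8 / 1024^3

print(round(gib_logical, 1))
print(round(gib_dist, 1))
```

Both numbers round to 49.5, which is why the empty matrix and the distance calculation fail with the same message. Memory grows with the square of the number of books, so doubling the catalogue would need about four times as much.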
Here we run into one of the main problems with content-based recommendation systems: they are very difficult to scale and use with many products. This alone is already a serious obstacle to using them… Anyway, let’s see how they work.
To avoid this problem, we are going to take only a sample of 10,000 books (the first 10,000 rows of the dataset) to see how they work. In addition, we are going to give a weight to each variable, in such a way that two books are more alike because they are by the same author than because they are from the same publisher.
library(dplyr)

book_feature = books[1:10000,c("Book.Author","Publisher","category")]

# convert to factors
book_feature[,1]

##                 Isbn.0971880107 Isbn.0345402871 Isbn.0345417623 Isbn.0684823802
## Isbn.0971880107       0.0000000       1.0000000       1.0000000       1.0000000
## Isbn.0345402871       1.0000000       0.0000000       0.5714286       1.0000000
## Isbn.0345417623       1.0000000       0.5714286       0.0000000       1.0000000
## Isbn.0684823802       1.0000000       1.0000000       1.0000000       0.0000000
## Isbn.0375759778       0.7142857       1.0000000       1.0000000       1.0000000
## Isbn.0375406328       1.0000000       1.0000000       1.0000000       0.7142857
##                 Isbn.0375759778 Isbn.0375406328
## Isbn.0971880107       0.7142857       1.0000000
## Isbn.0345402871       1.0000000       1.0000000
## Isbn.0345417623       1.0000000       1.0000000
## Isbn.0684823802       1.0000000       0.7142857
## Isbn.0375759778       0.0000000       1.0000000
## Isbn.0375406328       1.0000000       0.0000000
As you can see, the values of the matrix range from 0 to 1, 0 being the lowest degree of dissimilarity and 1 the highest. Be careful, we are talking about dissimilarity, so if the value is 0, it means that those books are identical on every feature, while if it is 1, they have nothing in common.
Among the books that we have printed out, some have things in common: 0345402871 and 0345417623 are the most similar pair (dissimilarity 0.57), while 0971880107 is somewhat similar to 0375759778, and 0684823802 to 0375406328 (0.71 in both cases).
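To make those numbers less mysterious, here is a minimal sketch of a weighted Gower dissimilarity for purely categorical data, written in base R on four made-up books. The helper `gower_categorical` and the toy data are hypothetical, and the weights (2 for author, 0.5 for publisher, 1 for category) are an assumption: I picked them because they reproduce the 0.5714 and 0.7143 values in the matrix above. In practice `cluster::daisy()` with `metric = "gower"` and a `weights` argument should do the same job.

```r
# Four made-up books (hypothetical data)
toy_books <- data.frame(
  Book.Author = c("A", "A", "B", "C"),
  Publisher   = c("P", "Q", "Q", "R"),
  category    = c("X", "Y", "Y", "X"),
  row.names   = paste0("Isbn.", 1:4),
  stringsAsFactors = FALSE
)

# Assumed weights: sharing an author matters more than sharing a publisher
weights <- c(Book.Author = 2, Publisher = 0.5, category = 1)

# For categorical variables, Gower is the weighted share of features that differ
gower_categorical <- function(df, w) {
  n <- nrow(df)
  d <- matrix(0, n, n, dimnames = list(rownames(df), rownames(df)))
  for (k in names(w)) {
    mismatch <- outer(df[[k]], df[[k]], FUN = "!=")  # TRUE where the feature differs
    d <- d + w[[k]] * mismatch
  }
  d / sum(w)
}

d <- gower_categorical(toy_books, weights)
print(round(d, 4))
```

With these weights, two books that share publisher and category but not author get 2 / 3.5 = 0.5714, and two books that only share the category get 2.5 / 3.5 = 0.7143.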
Basically, under the hood this recommendation system recommends books that are:
Now that we have the distance between books and we know how this recommendation system works…. Let’s see which book we recommend to a user!
To get the recommendations for a user, we are going to need the books that the user has read and rated. After that, we can search for books that look like those books. In addition, we must bear in mind that we do not have all the books, but only a sample of them, since we have kept only 10,000 books.
Anyway, we are going to choose a user and keep the books they have read. We will apply the algorithm on these books:
user_id = "User.1167"

user_books = ratings %>%
  filter(User.ID == user_id & ISBN %in% books$ISBN[1:10000]) %>%
  arrange(desc(Book.Rating))

head(user_books, 10)

As we can see, the user has given 6 ratings, with scores of 10, 9, 8, 7 and 5 points. Using the ratings is important, since it allows us to weigh the recommendations we give: we will prioritize a book similar to one the user has scored with a 10 over a book similar to one scored with a 5. In addition, we will not recommend books whose weighted score is 0.
With this in mind, we are going to find the books that most resemble these books:
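The weighting idea can be sketched with a few made-up numbers. Everything here is hypothetical (the book names, the ratings and the dissimilarities); the only thing taken from the post is the rule weighted score = (1 - dissimilarity) x rating.

```r
# Ratings a hypothetical user gave to two books they have read
read_ratings <- c(read_A = 10, read_B = 5)

# Made-up dissimilarities between three candidate books and each read book
dissim <- matrix(c(0.2, 0.9,
                   0.8, 0.1,
                   1.0, 1.0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("cand_1", "cand_2", "cand_3"),
                                 names(read_ratings)))

# Multiply each column of similarities by the rating of that read book
weighted <- sweep(1 - dissim, 2, read_ratings, `*`)

# Keep the best score of each candidate and drop the ones that score 0
best_score <- apply(weighted, 1, max)
ranking <- sort(best_score[best_score > 0], decreasing = TRUE)

print(ranking)
```

Here `cand_1` wins because it is close to the book rated 10, even though `cand_2` is even closer to the book rated 5, and `cand_3` disappears because it has nothing in common with either book.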
library(tidyr)

books$ISBN = as.character(books$ISBN)
selected_books = user_books[, c("ISBN", "Book.Rating")]

recomendar = function(selected_books, dissimilarity_matrix, books, n_recommendations = 5) {

  selected_book_indexes = which(colnames(dissimilarity_matrix) %in% selected_books$ISBN)

  # One column of dissimilarities per book the user has read
  results = data.frame(dissimilarity_matrix[, selected_book_indexes, drop = FALSE],
                       recommended_book = row.names(dissimilarity_matrix),
                       stringsAsFactors = FALSE)

  recomendaciones = results %>%
    pivot_longer(cols = -recommended_book, names_to = "readed_book",
                 values_to = "dissimilarity") %>%
    # Attach the rating the user gave to the book they read
    left_join(selected_books, by = c("readed_book" = "ISBN")) %>%
    # Do not recommend books the user has already rated
    filter(!(recommended_book %in% selected_books$ISBN)) %>%
    filter(!is.na(Book.Rating)) %>%
    mutate(similarity = 1 - dissimilarity,
           weighted_score = similarity * Book.Rating) %>%
    filter(weighted_score > 0) %>%
    # Keep the best score of each candidate, then the overall top n
    group_by(recommended_book) %>%
    slice_max(weighted_score, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    slice_max(weighted_score, n = n_recommendations, with_ties = FALSE) %>%
    left_join(books, by = c("recommended_book" = "ISBN"))

  return(recomendaciones)
}

recomendaciones = recomendar(selected_books, dissimilarity, books)
recomendaciones

And with this… we have already created our content-based recommendation system! As you can see, we recommend several books to the user that are similar to one they have read (0060929596).
We are going to create a function that allows us to visualize it in an easier way:
library(jpeg)       # readJPEG
library(grid)       # rasterGrob
library(gridExtra)  # marrangeGrob

visualizar_recomendacion = function(recomendation, recommended_book, image, n_books = 5) {

  # Never try to plot more books than we have recommendations
  if (n_books > nrow(recomendation)) {
    n_books = nrow(recomendation)
  }

  plot = list()
  dir.create("content_recommended_images", showWarnings = FALSE)

  for (i in seq_len(n_books)) {
    # Download the cover image
    img = pull(recomendation[i, which(colnames(recomendation) == image)])
    name = paste0("content_recommended_images/", i, ".jpg")
    suppressMessages(
      download.file(as.character(img), destfile = name, mode = "wb")
    )
    # Store the image as a graphical object
    plot[[i]] = rasterGrob(readJPEG(name))
  }

  do.call(marrangeGrob, args = list(plot, ncol = n_books, nrow = 1, top = ""))
}

visualizar_recomendacion(recomendaciones, "recommended_book", "Image.URL.M")
Now, we are going to learn how to create a collaborative recommendation system. Let’s do it!
A collaborative recommendation system (or collaborative filter) uses the ratings that users have given to products in order to recommend books. More specifically, there are two main types of collaborative recommendation systems: item-based and user-based.
Now that we intuitively know how these two recommendation systems work, let’s code them in R!
To code both item-based and user-based collaborative recommendation system, we first need to create the User-Product matrix. We can do this easily with the pivot_wider function from tidyr .
user_item = ratings %>%
  top_n(10000) %>%
  pivot_wider(names_from = ISBN, values_from = Book.Rating) %>%
  as.data.frame()

row.names(user_item) = user_item$User.ID
user_item$User.ID = NULL
user_item = as.matrix(user_item)

user_item[1:5, 1:5]

##             Isbn.0060096195 Isbn.0142302198 Isbn.038076041X Isbn.0699854289
## User.276822              10              10              10              10
## User.276847              NA              NA              NA              NA
## User.276859              NA              NA              NA              NA
## User.276861              NA              NA              NA              NA
## User.276872              NA              NA              NA              NA
##             Isbn.0786817070
## User.276822              10
## User.276847              NA
## User.276859              NA
## User.276861              NA
## User.276872              NA

However, we can see that this has a problem: there are a lot of NAs. If you think about it, it is normal, since a user will only read a few of all the books that are available.
In any case, having many NAs is called sparsity . We can calculate the degree of sparsity as follows:
sum(is.na(user_item)) / ( ncol(user_item) * nrow(user_item) )
## [1] 0.9996276
As you can see, we have a very sparse matrix, since 99.96% of the cells lack data. This limits us quite a bit, since it is from this matrix that we must find the similarity between products.
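If the calculation feels abstract, here is a tiny sketch on a made-up 4 x 3 user-item matrix (the users, books and ratings are hypothetical). It also shows that `mean(is.na(x))` is a shorter way of writing the same ratio.

```r
# Made-up ratings: rows are users, columns are books, NA means "not rated"
toy <- matrix(c(10, NA, NA,
                NA, NA,  7,
                NA,  8, NA,
                NA, NA, NA),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("User.", 1:4), paste0("Isbn.", 1:3)))

# Same formula as in the post
sparsity_long <- sum(is.na(toy)) / (ncol(toy) * nrow(toy))

# Shorter equivalent: the mean of a logical vector is the share of TRUEs
sparsity_short <- mean(is.na(toy))

print(sparsity_long)
print(sparsity_short)
```

Nine of the twelve cells are empty, so both versions return 0.75.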
For this, there are different formulas: cosine similarity, Pearson’s correlation coefficient, Euclidean distance… In our case we will use the cosine similarity.
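The formula of the cosine similarity is as follows:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$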
We translate this formula into code:
cos_similarity = function(A, B) {
  num = sum(A * B, na.rm = TRUE)
  den = sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))
  result = num / den
  return(result)
}
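A few made-up vectors help to check that the function behaves as expected. The function is repeated here so the snippet runs on its own; the vectors are hypothetical.

```r
cos_similarity <- function(A, B) {
  num <- sum(A * B, na.rm = TRUE)
  den <- sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))
  num / den
}

# Same direction: similarity is 1, regardless of the scale of the ratings
same_direction <- cos_similarity(c(3, 4), c(6, 8))

# Nothing rated in common: similarity is 0
nothing_shared <- cos_similarity(c(5, NA), c(NA, 7))

# A missing rating only drops out of the sums, so it acts like a 0
with_missing <- cos_similarity(c(3, 4, NA), c(3, 4, 5))

print(c(same_direction, nothing_shared, with_missing))
```

In the last case the numerator is 25 and the denominator is 5 x sqrt(50), so the result is 1 / sqrt(2), roughly 0.707. Note that because of `na.rm = TRUE` a book that only one of the two users has rated still lowers the similarity, since it counts in that user's norm.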
Now that we have coded the cosine function, we can apply this function to all the items and thus obtain the Product-Product matrix.
However, it is not something that we are going to apply to all items, but only to the item from which we want to find similar products. Why? Because, once again, calculating the item-item matrix is computationally very demanding and would require a lot of time and memory.
So, we create a function to calculate the similarity only on the product id that we choose.
item_recommendation = function(book_id, rating_matrix = user_item, n_recommendations = 5) {

  book_index = which(colnames(rating_matrix) == book_id)

  similarity = apply(rating_matrix, 2, FUN = function(y)
    cos_similarity(rating_matrix[, book_index], y))

  recommendations = tibble(ISBN = names(similarity), similarity = similarity) %>%
    filter(ISBN != book_id) %>%
    top_n(n_recommendations, similarity) %>%
    arrange(desc(similarity))

  return(recommendations)
}

recom_cf_item = item_recommendation("Isbn.0446677450")
recom_cf_item
With this we have just coded our item-based recommendation system! Finally, we just need to apply the function that we have previously created so that it is seen in a much more visual way:
recom_cf_item = recom_cf_item %>%
  left_join(books, by = c("ISBN" = "ISBN"))

visualizar_recomendacion(recom_cf_item[!is.na(recom_cf_item$Book.Title), ],
                         "ISBN", "Image.URL.M")
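For small catalogues, the whole item-item matrix can be computed at once instead of one book at a time. The following is only a sketch on a made-up 3 x 3 matrix: it replaces the NAs with 0 (which is what `na.rm = TRUE` effectively does in `cos_similarity`) and uses `crossprod()` from base R. On the full dataset this would run into the same memory wall as before, so treat it as an illustration rather than a replacement.

```r
cos_similarity <- function(A, B) {
  num <- sum(A * B, na.rm = TRUE)
  den <- sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))
  num / den
}

# Made-up user-item matrix (hypothetical users, books and ratings)
toy <- matrix(c(10,  8, NA,
                 9, NA,  2,
                NA,  7,  3),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("User.", 1:3), paste0("Isbn.", 1:3)))

# Treat missing ratings as 0, then compute every column pair in one go
filled <- toy
filled[is.na(filled)] <- 0
norms <- sqrt(colSums(filled^2))
item_sim <- crossprod(filled) / outer(norms, norms)

# Same value as the pairwise function gives for books 1 and 2
pairwise <- cos_similarity(toy[, "Isbn.1"], toy[, "Isbn.2"])

print(round(item_sim, 3))
print(pairwise)
```

The diagonal is 1 (every book is identical to itself) and the off-diagonal cells match what `cos_similarity` returns for each pair of columns.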
To code a user-based collaborative recommendation system we will start from the User-Item matrix. The only difference is that, in this case, instead of calculating the distances at the column level, we will do it at the row level.

user_recommendation = function(user_id, user_item_matrix = user_item,
                               ratings_matrix = ratings, n_recommendations = 5,
                               threshold = 1, nearest_neighbors = 10) {

  user_index = which(rownames(user_item_matrix) == user_id)

  similarity = apply(user_item_matrix, 1, FUN = function(y)
    cos_similarity(user_item_matrix[user_index, ], y))

  similar_users = tibble(User.ID = names(similarity), similarity = similarity) %>%
    filter(User.ID != user_id) %>%
    arrange(desc(similarity)) %>%
    top_n(nearest_neighbors, similarity)

  readed_books_user = ratings_matrix$ISBN[ratings_matrix$User.ID == user_id]

  recommendations = ratings_matrix %>%
    filter(User.ID %in% similar_users$User.ID & !(ISBN %in% readed_books_user)) %>%
    group_by(ISBN) %>%
    summarise(count = n(), Book.Rating = mean(Book.Rating)) %>%
    filter(count > threshold) %>%
    arrange(desc(Book.Rating), desc(count)) %>%
    head(n_recommendations)

  return(recommendations)
}

recom_cf_user = user_recommendation("User.99", n_recommendations = 20)
recom_cf_user

We have just coded our user-based collaborative recommendation system in R. Basically, this system looks for similar people and finds the books that those people have rated highly but that the user has not read. Those are the books that we recommend to the user. Let’s see them!
recom_cf_user = recom_cf_user %>%
  left_join(books, by = c("ISBN" = "ISBN"))

visualizar_recomendacion(recom_cf_user[!is.na(recom_cf_user$Book.Title), ],
                         "ISBN", "Image.URL.M")
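To follow the logic step by step, here is a stripped-down sketch of the user-based idea in base R, on made-up ratings from three hypothetical users. Unlike the function above, it uses a single nearest neighbour and no threshold, so it is only meant to show the mechanics.

```r
cos_similarity <- function(A, B) {
  num <- sum(A * B, na.rm = TRUE)
  den <- sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))
  num / den
}

# Made-up ratings in long format (hypothetical users and books)
ratings_toy <- data.frame(
  User.ID     = c("u1", "u1", "u2", "u2", "u2", "u3", "u3"),
  ISBN        = c("b1", "b2", "b1", "b2", "b3", "b1", "b4"),
  Book.Rating = c(10, 8, 9, 8, 7, 2, 9)
)

# User-item matrix: NA where a user has not rated a book
m <- with(ratings_toy, tapply(Book.Rating, list(User.ID, ISBN), mean))

# Similarity between the target user and everybody else (row level)
target <- "u1"
sims <- apply(m, 1, function(y) cos_similarity(m[target, ], y))
sims <- sims[names(sims) != target]
neighbor <- names(which.max(sims))

# Books the neighbour has rated and the target user has not read
recommended <- colnames(m)[is.na(m[target, ]) & !is.na(m[neighbor, ])]

print(round(sims, 3))
print(neighbor)
print(recommended)
```

User `u2` rated the two shared books almost exactly like `u1`, so it becomes the nearest neighbour, and the only book `u2` has rated that `u1` has not read is `b3`.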
We have just coded three different types of recommendation systems. But, when should we use each one of them? Let’s see it!
To better understand when to use each of the three recommendation systems that we have learned to program, it is important to understand what each of the recommendation systems assumes.
On the one hand, the user-based recommendation system considers that the user may also like what others have liked. This system assumes that user tastes do not change over time: what I loved 1 year ago will continue to enchant me.
However, while this hypothesis may make sense when we talk about books or movies, in other cases, such as clothing, it will surely fail more. Therefore, although it is one of the most used recommendation systems, it is important to consider how users’ opinion of products changes over time before using this type of recommendation system.
On the other hand, the item-based recommendation system is more robust. The reason is that, once an item has accumulated many ratings, it is very difficult for its average rating to change.
Therefore, as a general rule, if we have more users than items and if the ratings do not change much over time, we would use an item-based recommendation system. Otherwise, we would use a user-based recommendation system.
Finally, regarding the content-based recommendation system… although it is not usually the best recommendation system, it could make sense when the items have many independent variables that affect the user’s rating.
In any case, and as we have seen when preparing it, it has many limitations, from the type of content it recommends to the computational capacity it requires.
And this is all for today! I hope you have found it interesting to learn how recommendation systems work. As always, if you have any suggestions do not hesitate to write to me on LinkedIn. See you next time!