Nagesh Singh Chauhan

How to build a Restaurant Recommendation Engine (part-2)

In part-1 of this article series, we saw how we could use simple correlational techniques to create a measure of similarity between restaurants based on their rating records.


In this article, we will learn how to build a collaborative filtering Restaurant Recommendation Engine based on users’ past experiences, using the k-NN machine learning algorithm.

Credits: theinfatuation.com


Collaborative filtering


According to recommender-systems.org, Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.

Credits: Wikipedia


The key idea here is that similar users share similar tastes, and that a user tends to like items similar to those they already like.

There are basically two categories of Collaborative filtering:

  • User-based: measures the similarity between the target user and other users

  • Item-based: measures the similarity between the items the target user rates or interacts with and other items

Typically, the workflow of a collaborative filtering system is:

  1. A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user’s interest in the corresponding domain.

  2. The system matches this user’s ratings against other users’ and finds the people with most “similar” tastes.

  3. Using those similar users, the system recommends items that they have rated highly but that this user has not yet rated (the absence of a rating is often interpreted as unfamiliarity with an item).
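The three steps above can be sketched on a toy ratings matrix. Everything below (users, items, ratings) is made up purely for illustration:

```python
import numpy as np

# Toy ratings: rows = users, columns = items; 0 = not yet rated.
ratings = np.array([
    [5, 4, 0, 1],   # alice
    [4, 5, 5, 1],   # bob (tastes similar to alice)
    [1, 1, 5, 4],   # carol
], dtype=float)

def cosine_sim(u, v):
    """Cosine similarity between two rating vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Step 2: find the user most similar to alice (user 0).
sims = [cosine_sim(ratings[0], ratings[i]) for i in range(1, 3)]
most_similar = 1 + int(np.argmax(sims))   # bob

# Step 3: recommend items bob rated highly that alice has not rated yet.
unrated_by_alice = ratings[0] == 0
recommend = np.where(unrated_by_alice & (ratings[most_similar] >= 4))[0]
print(recommend)  # -> [2]
```

Here alice’s ratings correlate with bob’s far more than with carol’s, so item 2 (which bob rated 5 and alice hasn’t tried) becomes the recommendation.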

What we are going to build next is an Item-based Collaborative filtering approach.


Let’s Build A Restaurant Recommender Engine

Credits: cryptorated.com


Now, we are going to use the collaborative filtering technique discussed above and build our engine using the k-NN machine learning algorithm.


K-NN or K-Nearest Neighbors: k-NN is one of many ML algorithms used in data mining and machine learning. It is a classifier algorithm where learning is based on the similarity of a data point to others. It remains one of the most widely used classification algorithms in the industry because of its simplicity and accuracy.


In K-NN, K is the number of nearest neighbors. The number of neighbors is the core deciding factor. K is generally an odd number if the number of classes is 2.

If you are not familiar with the k-NN algorithm I would suggest you go through this link before moving forward.
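For intuition, here is a minimal k-NN classification example on made-up 2-D points, using scikit-learn’s `KNeighborsClassifier`:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: two classes separated on the plane.
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# K = 3: a query point is labelled by majority vote of its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```

Each query point simply takes the majority label of its K closest training points; no model parameters are learned beforehand, which is why k-NN is called a lazy learner.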


We will use the Yelp dataset, which can be downloaded from Kaggle.

Credits: yelp.com


Yelp is a business directory service and crowd-sourced review forum, and a public company of the same name that is headquartered in San Francisco, California. The company develops, hosts and markets the Yelp.com website and the Yelp mobile app, which publish crowd-sourced reviews about businesses — Wikipedia.


In total, the dataset contains:

  1. 5,200,000 user reviews.

  2. Information on 174,000 businesses (we will focus only on the restaurant and food industry).

  3. Data spanning 11 metropolitan areas around the world.


Start by importing all the required libraries:

import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

The data consists of two files: yelp_review and yelp_business.

review_df = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/yelp_review.csv', encoding="latin-1")
review_df = review_df[['user_id', 'business_id', 'stars']]
review_df = review_df.rename(columns = {'stars':'usr_rating'})
review_df.dropna(inplace=True)

business_df = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/yelp_business.csv', encoding="latin-1")
business_df = business_df[['business_id', 'name', 'city', 'stars', 'review_count', 'categories']]
business_df = business_df.rename(columns = {'stars':'restaurant_rating'})
business_df.dropna(inplace=True)

review_df.head()

review_df

business_df.head()

business_df


Let us see the top 25 business categories listed on Yelp.

fig, ax = plt.subplots(figsize=[5,10])
sns.countplot(data=business_df[business_df['categories'].isin(
    business_df['categories'].value_counts().head(25).index)],
              y='categories', ax=ax)
plt.show()

Business listed in yelp.com


So we can clearly observe that the categories column contains all kinds of business categories, but only categories such as Restaurants/Food/Dining/Pizza matter to us, so we filter out the rest.

business_df = business_df[business_df['categories'].str.contains("Food|Coffee|Tea|Restaurants|Bakeries|Bars|Sports Bar|Pubs|Nightlife")]
business_df

Filtering business apart from Restaurants/Food


Perfect, now let us do some data exploration.

  1. Distribution of the restaurant rating

plt.figure(figsize=(12,4))
ax = sns.countplot(x='restaurant_rating', data=business_df)
plt.title('Distribution of Restaurant Rating');

Bar graph of restaurant rating distribution


2. Top 10 most reviewed Restaurants on Yelp.

business_df[['name', 'review_count', 'city', 'restaurant_rating']].sort_values(ascending=False, by="review_count")[0:10]

# Restaurant names that appear most often (i.e. the largest chains)
business_df['name'].value_counts().sort_values(ascending=False).head(10)
business_df['name'].value_counts().sort_values(ascending=False).head(10).plot(
    kind='pie', figsize=(10,6), title="Most Popular Cuisines", autopct='%1.2f%%')
plt.axis('equal')

3. Cities with most reviews and best ratings for their Restaurants

city_business_reviews = business_df[['city', 'review_count', 'restaurant_rating']].groupby(['city']).\
agg({'review_count': 'sum', 'restaurant_rating': 'mean'}).sort_values(by='review_count', ascending=False)
city_business_reviews['review_count'][0:20].plot(kind='bar', stacked=False, figsize=[10,10], \
                                                 colormap='winter')
plt.title('Top 20 cities by reviews')

Top 20 cities by reviews


Now, coming back to building our k-NN model: we group by name (restaurant name) and create a new column for the total rating count.

joined_restaurant_rating = pd.merge(business_df, review_df, on='business_id')
restaurant_ratingCount = (joined_restaurant_rating.
     groupby(by = ['name'])['restaurant_rating'].
     count().
     reset_index().
     rename(columns = {'restaurant_rating': 'totalRatingCount'})
     [['name', 'totalRatingCount']]
    )
restaurant_ratingCount.head()
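The groupby chain above can be checked on a tiny made-up frame (the restaurant names below are hypothetical):

```python
import pandas as pd

# Toy version of the merged ratings frame.
toy = pd.DataFrame({
    'name': ['A Cafe', 'A Cafe', 'B Diner'],
    'restaurant_rating': [4.0, 5.0, 3.0],
})

# Same pattern as the real pipeline: count ratings per restaurant name.
toy_count = (toy.groupby('name')['restaurant_rating']
                .count()
                .reset_index()
                .rename(columns={'restaurant_rating': 'totalRatingCount'}))
print(toy_count)
#      name  totalRatingCount
# 0  A Cafe                 2
# 1  B Diner                 1
```

Note that `.count()` counts rows per group, so `totalRatingCount` is the number of ratings a restaurant received, not a sum or an average of them.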

At this point, restaurant_ratingCount.shape shows we have 3,644,997 records.

Next, we combine the rating data with the total rating count data; this gives us what we need to identify popular restaurants and filter out lesser-known ones.

rating_with_totalRatingCount = joined_restaurant_rating.merge(restaurant_ratingCount, left_on = 'name', right_on = 'name', how = 'left')

rating_with_totalRatingCount


Next, let’s calculate the popularity_threshold: the number of votes received by a restaurant at the 90th percentile. The pandas library makes this trivial with the .quantile() method of a pandas Series:

popularity_threshold = rating_with_totalRatingCount['totalRatingCount'].quantile(0.90)
#output
2350.0

Now filter out the restaurants whose totalRatingCount is below the popularity_threshold.

rating_popular_rest = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_rest.shape

After filtering out less popular restaurants, we are left with 364,816 records.
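The quantile-based cut-off used above is easy to see on a toy Series (values are made up):

```python
import pandas as pd

counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 90th percentile: 90% of the values lie at or below this threshold
# (pandas linearly interpolates between 9 and 10 here, giving ~9.1).
threshold = counts.quantile(0.90)
print(threshold)

# Keep only the "popular" entries at or above the threshold.
popular = counts[counts >= threshold]
print(popular.tolist())  # -> [10]
```

On the real data, the same one-liner keeps only the top 10% most-rated restaurants, which both speeds up the model and avoids recommending restaurants with too few ratings to be trusted.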


I’ll filter the records down to the top 10 cities, just to avoid computation and memory-related interruptions. The top 10 cities I’m selecting are Las Vegas, Phoenix, Toronto, Scottsdale, Charlotte, Tempe, Chandler, Cleveland, Madison and Gilbert.

us_city_user_rating = rating_popular_rest[rating_popular_rest['city'].str.contains("Las Vegas|Phoenix|Toronto|Scottsdale|Charlotte|Tempe|Chandler|Cleveland|Madison|Gilbert")]

Next, we drop duplicate user/restaurant pairs and pivot the data into a matrix with restaurants as rows and users as columns.

us_city_user_rating = us_city_user_rating.drop_duplicates(['user_id', 'name'])
restaurant_features = us_city_user_rating.pivot(index = 'name', columns = 'user_id', values = 'restaurant_rating').fillna(0)

Transform the values (restaurant_rating) of the matrix dataframe into a SciPy sparse matrix for more efficient calculations.

restaurant_features_matrix = csr_matrix(restaurant_features.values)
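The pivot-then-sparsify step looks like this on a tiny made-up frame (restaurant names and user ids are hypothetical):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy user ratings.
toy = pd.DataFrame({
    'name':    ['A Cafe', 'A Cafe', 'B Diner'],
    'user_id': ['u1', 'u2', 'u1'],
    'restaurant_rating': [5.0, 4.0, 3.0],
})

# Restaurants as rows, users as columns; missing ratings become 0.
features = toy.pivot(index='name', columns='user_id',
                     values='restaurant_rating').fillna(0)

# Most entries in the real matrix are 0 (each user rates only a few
# restaurants), so CSR stores only the non-zero entries.
sparse = csr_matrix(features.values)
print(sparse.nnz, 'non-zero entries out of', sparse.shape[0] * sparse.shape[1])
# -> 3 non-zero entries out of 4
```

On the full Yelp data the density is tiny, so the CSR representation saves a large amount of memory and speeds up the distance computations that k-NN performs.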

To implement item-based collaborative filtering, k-NN is a perfect choice and also a very good baseline for recommender system development. k-NN is a non-parametric, lazy learning method: it memorizes the training data and makes inferences for new samples by finding their nearest neighbors, rather than fitting an explicit model.


K-NN does not make any assumptions on the underlying data distribution but it relies on item feature similarity. When k-NN makes inference about a restaurant, k-NN will calculate the “distance” between the target restaurant and every other restaurant in its dataset, then it ranks its distances and returns the top K nearest neighbor restaurants as the most similar restaurant recommendations.


Now our training data has very high dimensionality so the performance of k-NN model will suffer from the curse of dimensionality if it uses “Euclidean distance” in its objective function. Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant to the search query vector. Therefore, we will use cosine similarity for nearest neighbor search.
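The difference between the two metrics is easy to demonstrate with two made-up rating vectors that follow the same pattern at different scales:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Two users with identical taste but different rating scales.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

# Euclidean distance is large, but cosine distance (1 - similarity) is 0:
# cosine cares about the *pattern* of ratings, not their magnitude.
print(euclidean(a, b))  # -> ~3.74
print(cosine(a, b))     # -> 0.0
```

This is exactly the behavior we want here: a harsh rater and a generous rater who like the same restaurants should still be treated as similar.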


Finally, we fit the model.

knn_recomm = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
knn_recomm.fit(restaurant_features_matrix)

Now that we have trained our k-NN model, it’s time to test it. Since we have a huge number of records, let’s pick a random restaurant and, based on user reviews and interests, get recommendations for it.

randomChoice = np.random.choice(restaurant_features.shape[0])
# n_neighbors = 11: the nearest neighbor is the restaurant itself, leaving 10 recommendations
distances, indices = knn_recomm.kneighbors(restaurant_features.iloc[randomChoice].values.reshape(1, -1), n_neighbors = 11)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for Restaurant {0} on priority basis:\n'.format(restaurant_features.index[randomChoice]))
    else:
        print('{0}: {1}'.format(i, restaurant_features.index[indices.flatten()[i]]))
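The whole pipeline above can be exercised end-to-end on a toy matrix. The restaurant names and user ids below are made up for illustration:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy restaurant-by-user rating matrix.
features = pd.DataFrame(
    [[5, 4, 0],
     [4, 5, 0],
     [0, 1, 5]],
    index=['A Cafe', 'B Diner', 'C Grill'],
    columns=['u1', 'u2', 'u3'],
)

# Same setup as in the article: cosine metric, brute-force search.
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(features.values)

# Neighbours of 'A Cafe'; the first neighbour is the restaurant itself.
dist, idx = knn.kneighbors(features.loc[['A Cafe']].values, n_neighbors=2)
print(features.index[idx.flatten()[1]])  # -> B Diner
```

Because 'A Cafe' and 'B Diner' were rated similarly by the same users, they sit close together in cosine space, so one is recommended for the other.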

Let us execute the script again with a random restaurant choice:


Another example


Congratulations!!! We have successfully built a collaborative filtering based restaurant recommendation engine using the k-NN machine learning algorithm.


You can also find code on my Github. https://github.com/nageshsinghc4/Restaurant-recommendation-engine


Conclusion

That was really cool, wasn’t it? ;)

If we have enough data, the collaborative filtering technique provides a powerful way to recommend new items to users accurately. If you have clean, well-documented data about your items, you can achieve even better results with this technique.

Well, guys, this brings us to the end of the two-article series “How to build a Restaurant Recommendation Engine”. At this point, I think we are comfortable with all the basics of building a recommendation engine (simple recommenders, content-based recommenders, and collaborative filtering engines).

You can also try something different, like building a movie recommendation engine, a book recommendation engine, a product recommender, etc.

I hope you guys have liked reading this article, please share your suggestions/views/questions in the comment section.

You can also reach out to me on LinkedIn with any queries.

Thanks for reading !!!
