Nagesh Singh Chauhan

How to build a Restaurant Recommendation Engine (part-2)

In part-1 of this article series, we saw how we could use simple correlational techniques to create a measure of similarity between restaurants based on their rating records.


In this article, we will learn how to build a collaborative filtering Restaurant Recommendation Engine based on users’ past experiences, using the k-NN machine learning algorithm.

Credits: theinfatuation.com


Collaborative filtering


According to recommender-systems.org, Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.

Credits: Wikipedia


The key idea here is that similar users share similar tastes, and that a user tends to like items similar to those they already like.

There are basically two categories of Collaborative filtering:

  • User-based: measures the similarity between the target user and other users

  • Item-based: measures the similarity between the items the target user rates or interacts with and other items

Typically, the workflow of a collaborative filtering system is:

  1. A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user’s interest in the corresponding domain.

  2. The system matches this user’s ratings against other users’ and finds the people with most “similar” tastes.

  3. Using those similar users, the system recommends items that they have rated highly but that this user has not yet rated (the absence of a rating is often interpreted as unfamiliarity with an item).
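The three steps above can be sketched on a toy ratings matrix. Everything below (users, items, ratings) is made up purely for illustration:

```python
import numpy as np

# Toy ratings: rows = users, columns = items; 0 = not yet rated.
ratings = np.array([
    [5, 4, 0, 1],   # alice
    [4, 5, 5, 1],   # bob (tastes similar to alice)
    [1, 1, 5, 4],   # carol
], dtype=float)

def cosine_sim(u, v):
    """Cosine similarity between two rating vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Step 2: find the user most similar to alice (user 0).
sims = [cosine_sim(ratings[0], ratings[i]) for i in range(1, 3)]
most_similar = 1 + int(np.argmax(sims))   # bob

# Step 3: recommend items bob rated highly that alice has not rated yet.
unrated_by_alice = ratings[0] == 0
recommend = np.where(unrated_by_alice & (ratings[most_similar] >= 4))[0]
print(recommend)  # -> [2]
```

Here alice’s ratings correlate with bob’s far more than with carol’s, so item 2 (which bob rated 5 and alice hasn’t tried) becomes the recommendation.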

What we are going to build next is an Item-based Collaborative filtering approach.


Let’s Build A Restaurant Recommender Engine

Credits: cryptorated.com


Now, we are going to use the collaborative filtering technique discussed above and build our engine using the k-NN machine learning algorithm.


K-NN or K-Nearest Neighbors: k-NN is one of many ML algorithms used in data mining and machine learning. It is a classifier algorithm where learning is based on the similarity of a data point to others. It remains one of the most widely used classification algorithms in the industry because of its simplicity and accuracy.


In K-NN, K is the number of nearest neighbors. The number of neighbors is the core deciding factor. K is generally an odd number if the number of classes is 2.

If you are not familiar with the k-NN algorithm I would suggest you go through this link before moving forward.
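For intuition, here is a minimal k-NN classification example on made-up 2-D points, using scikit-learn’s `KNeighborsClassifier`:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: two classes separated on the plane.
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# K = 3: a query point is labelled by majority vote of its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```

Each query point simply takes the majority label of its K closest training points; no model parameters are learned beforehand, which is why k-NN is called a lazy learner.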


We will use the Yelp dataset, which can be downloaded from Kaggle.

Credits: yelp.com


Yelp is a business directory service and crowd-sourced review forum, and a public company of the same name that is headquartered in San Francisco, California. The company develops, hosts and markets the Yelp.com website and the Yelp mobile app, which publish crowd-sourced reviews about businesses — Wikipedia.


In total, the dataset contains:

  1. 5,200,000 user reviews.

  2. Information on 174,000 businesses (we will focus only on the restaurant and food industry).

  3. Data spanning 11 metropolitan areas around the world.


Start by importing all the required libraries:

import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

The data consists of two files: yelp_review and yelp_business.

review_df = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/yelp_review.csv', encoding="latin-1")
review_df = review_df[['user_id', 'business_id', 'stars']]
review_df = review_df.rename(columns = {'stars':'usr_rating'})
review_df.dropna(inplace=True)

business_df = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/yelp_business.csv', encoding="latin-1")
business_df = business_df[['business_id', 'name', 'city', 'stars', 'review_count', 'categories']]
business_df = business_df.rename(columns = {'stars':'restaurant_rating'})
business_df.dropna(inplace=True)

review_df.head()

review_df

business_df.head()

business_df


Let us see the top 25 business categories listed on Yelp.

fig, ax = plt.subplots(figsize=[5,10])
sns.countplot(data=business_df[business_df['categories'].isin(
    business_df['categories'].value_counts().head(25).index)],
              y='categories', ax=ax)
plt.show()

Business listed in yelp.com


So we can clearly observe that the categories column contains all kinds of business categories, but only categories such as Restaurants/Food/Dining/Pizza matter to us, so we filter out the rest.

business_df = business_df[business_df['categories'].str.contains("Food|Coffee|Tea|Restaurants|Bakeries|Bars|Sports Bar|Pubs|Nightlife")]
business_df

Filtering business apart from Restaurants/Food


Perfect, now let us do some data exploration.

  1. Distribution of the restaurant rating

plt.figure(figsize=(12,4))
ax = sns.countplot(x='restaurant_rating', data=business_df)
plt.title('Distribution of Restaurant Rating');

Bar graph of restaurant rating distribution


2. Top 10 most reviewed Restaurants on Yelp.

business_df[['name', 'review_count', 'city', 'restaurant_rating']].sort_values(ascending=False, by="review_count")[0:10]

# Restaurant names that appear most often (i.e. the largest chains)
business_df['name'].value_counts().sort_values(ascending=False).head(10)
business_df['name'].value_counts().sort_values(ascending=False).head(10).plot(
    kind='pie', figsize=(10,6), title="Most Popular Cuisines", autopct='%1.2f%%')
plt.axis('equal')

3. Cities with most reviews and best ratings for their Restaurants

city_business_reviews = business_df[['city', 'review_count', 'restaurant_rating']].groupby(['city']).\
agg({'review_count': 'sum', 'restaurant_rating': 'mean'}).sort_values(by='review_count', ascending=False)
city_business_reviews['review_count'][0:20].plot(kind='bar', stacked=False, figsize=[10,10], \
                                                 colormap='winter')
plt.title('Top 20 cities by reviews')

Top 20 cities by reviews


Now, coming back to building our k-NN model: we group by name (restaurant name) and create a new column for the total rating count.

joined_restaurant_rating = pd.merge(business_df, review_df, on='business_id')
restaurant_ratingCount = (joined_restaurant_rating.
     groupby(by = ['name'])['restaurant_rating'].
     count().
     reset_index().
     rename(columns = {'restaurant_rating': 'totalRatingCount'})
     [['name', 'totalRatingCount']]
    )
restaurant_ratingCount.head()
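The groupby chain above can be checked on a tiny made-up frame (the restaurant names below are hypothetical):

```python
import pandas as pd

# Toy version of the merged ratings frame.
toy = pd.DataFrame({
    'name': ['A Cafe', 'A Cafe', 'B Diner'],
    'restaurant_rating': [4.0, 5.0, 3.0],
})

# Same pattern as the real pipeline: count ratings per restaurant name.
toy_count = (toy.groupby('name')['restaurant_rating']
                .count()
                .reset_index()
                .rename(columns={'restaurant_rating': 'totalRatingCount'}))
print(toy_count)
#      name  totalRatingCount
# 0  A Cafe                 2
# 1  B Diner                 1
```

Note that `.count()` counts rows per group, so `totalRatingCount` is the number of ratings a restaurant received, not a sum or an average of them.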

At this point, restaurant_ratingCount.shape shows we have 3,644,997 records.

Next, we combine the rating data with the total rating count data; this gives us what we need to identify popular restaurants and filter out lesser-known ones.

rating_with_totalRatingCount = joined_restaurant_rating.merge(restaurant_ratingCount, left_on = 'name', right_on = 'name', how = 'left')

rating_with_totalRatingCount


Next, let’s calculate the popularity_threshold: the number of votes received by a restaurant at the 90th percentile. The pandas library makes this trivial with the .quantile() method of a pandas Series:

popularity_threshold = rating_with_totalRatingCount['totalRatingCount'].quantile(0.90)
#output
2350.0

Now filter out the restaurants whose totalRatingCount is below the popularity_threshold.

rating_popular_rest = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_rest.shape

After filtering out less popular restaurants, we are left with 364,816 records.
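The quantile-based cut-off used above is easy to see on a toy Series (values are made up):

```python
import pandas as pd

counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 90th percentile: 90% of the values lie at or below this threshold
# (pandas linearly interpolates between 9 and 10 here, giving ~9.1).
threshold = counts.quantile(0.90)
print(threshold)

# Keep only the "popular" entries at or above the threshold.
popular = counts[counts >= threshold]
print(popular.tolist())  # -> [10]
```

On the real data, the same one-liner keeps only the top 10% most-rated restaurants, which both speeds up the model and avoids recommending restaurants with too few ratings to be trusted.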


I’ll filter the records down to the top 10 cities, just to avoid computation and memory-related interruptions. The top 10 cities I’m selecting are Las Vegas, Phoenix, Toronto, Scottsdale, Charlotte, Tempe, Chandler, Cleveland, Madison and Gilbert.

us_city_user_rating = rating_popular_rest[rating_popular_rest['city'].str.contains("Las Vegas|Phoenix|Toronto|Scottsdale|Charlotte|Tempe|Chandler|Cleveland|Madison|Gilbert")]

Next, we drop duplicate user/restaurant pairs and pivot the data into a matrix with restaurants as rows and users as columns.

us_city_user_rating = us_city_user_rating.drop_duplicates(['user_id', 'name'])
restaurant_features = us_city_user_rating.pivot(index = 'name', columns = 'user_id', values = 'restaurant_rating').fillna(0)

Transform the values (restaurant_rating) of the matrix dataframe into a SciPy sparse matrix for more efficient calculations.

restaurant_features_matrix = csr_matrix(restaurant_features.values)
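The pivot-then-sparsify step looks like this on a tiny made-up frame (restaurant names and user ids are hypothetical):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy user ratings.
toy = pd.DataFrame({
    'name':    ['A Cafe', 'A Cafe', 'B Diner'],
    'user_id': ['u1', 'u2', 'u1'],
    'restaurant_rating': [5.0, 4.0, 3.0],
})

# Restaurants as rows, users as columns; missing ratings become 0.
features = toy.pivot(index='name', columns='user_id',
                     values='restaurant_rating').fillna(0)

# Most entries in the real matrix are 0 (each user rates only a few
# restaurants), so CSR stores only the non-zero entries.
sparse = csr_matrix(features.values)
print(sparse.nnz, 'non-zero entries out of', sparse.shape[0] * sparse.shape[1])
# -> 3 non-zero entries out of 4
```

On the full Yelp data the density is tiny, so the CSR representation saves a large amount of memory and speeds up the distance computations that k-NN performs.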

To implement item-based collaborative filtering, k-NN is a perfect choice and also a very good baseline for recommender system development. k-NN is a non-parametric, lazy learning method: it memorizes the training data and makes inferences for new samples by finding their nearest neighbors, rather than fitting an explicit model.


K-NN does not make any assumptions on the underlying data distribution but it relies on item feature similarity. When k-NN makes inference about a restaurant, k-NN will calculate the “distance” between the target restaurant and every other restaurant in its dataset, then it ranks its distances and returns the top K nearest neighbor restaurants as the most similar restaurant recommendations.


Now our training data has very high dimensionality so the performance of k-NN model will suffer from the curse of dimensionality if it uses “Euclidean distance” in its objective function. Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant to the search query vector. Therefore, we will use cosine similarity for nearest neighbor search.
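The difference between the two metrics is easy to demonstrate with two made-up rating vectors that follow the same pattern at different scales:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Two users with identical taste but different rating scales.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

# Euclidean distance is large, but cosine distance (1 - similarity) is 0:
# cosine cares about the *pattern* of ratings, not their magnitude.
print(euclidean(a, b))  # -> ~3.74
print(cosine(a, b))     # -> 0.0
```

This is exactly the behavior we want here: a harsh rater and a generous rater who like the same restaurants should still be treated as similar.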


Finally, we fit the model.

knn_recomm = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
knn_recomm.fit(restaurant_features_matrix)

Now that we have trained our k-NN model, it’s time to test it. Since we have a huge number of records, let’s pick a random restaurant and, based on user reviews and interests, get recommendations for it.

randomChoice = np.random.choice(restaurant_features.shape[0])
# n_neighbors = 11: the nearest neighbor is the restaurant itself, leaving 10 recommendations
distances, indices = knn_recomm.kneighbors(restaurant_features.iloc[randomChoice].values.reshape(1, -1), n_neighbors = 11)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for Restaurant {0} on priority basis:\n'.format(restaurant_features.index[randomChoice]))
    else:
        print('{0}: {1}'.format(i, restaurant_features.index[indices.flatten()[i]]))
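The whole pipeline above can be exercised end-to-end on a toy matrix. The restaurant names and user ids below are made up for illustration:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy restaurant-by-user rating matrix.
features = pd.DataFrame(
    [[5, 4, 0],
     [4, 5, 0],
     [0, 1, 5]],
    index=['A Cafe', 'B Diner', 'C Grill'],
    columns=['u1', 'u2', 'u3'],
)

# Same setup as in the article: cosine metric, brute-force search.
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(features.values)

# Neighbours of 'A Cafe'; the first neighbour is the restaurant itself.
dist, idx = knn.kneighbors(features.loc[['A Cafe']].values, n_neighbors=2)
print(features.index[idx.flatten()[1]])  # -> B Diner
```

Because 'A Cafe' and 'B Diner' were rated similarly by the same users, they sit close together in cosine space, so one is recommended for the other.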

Let us execute the script again with a random restaurant choice:


Another example


Congratulations!!! We have successfully built a collaborative filtering based restaurant recommendation engine using the k-NN machine learning algorithm.


You can also find code on my Github. https://github.com/nageshsinghc4/Restaurant-recommendation-engine


Conclusion

That was really cool, wasn’t it? ;)

If we have enough data, the collaborative filtering technique provides a powerful way to recommend new items to users accurately. If you have clean, well-documented data about your items, you can achieve even better results with this technique.

Well, guys, this brings us to the end of the two-article series “How to build a Restaurant Recommendation Engine”. At this point, I think we are comfortable with all the basics of building a recommendation engine (simple recommenders, content-based recommenders, and collaborative filtering engines).

You can also try something different, like building a movie recommendation engine, a book recommendation engine, a product recommender, etc.

I hope you guys have liked reading this article, please share your suggestions/views/questions in the comment section.

You can also reach out to me on LinkedIn with any queries.

Thanks for reading !!!
