Sommelier numérique

In which I recommend wines based on their bag of words.

Code available on GitHub.


A Digital Sommelier

You’re the owner of a restaurant with an enormous wine cellar, literally thousands of bottles. A famed vinophile has heard of you and wants your suggestions for wines they might try, given that they really enjoyed one of the wines in your cellar. There are thousands of wines, far more than your poor human sommelier can keep track of. What are you to do?

The standard approach to making recommendations from such a large library is a recommender system.

Recommender systems are among the most commercially important applications of machine learning — Amazon attributes 35% of its sales to recommendations, which is a whole lot of money. Building a good recommender is also more of an art than a science. The Netflix Prize, for example, led to big progress in predicting user ratings, but in practice Netflix doesn’t rely on predicted ratings alone.

The goal of the recommender we will develop is to offer up new wines that a vinophile might like, based on a wine we know they like. To find similar wines, we’ll mine the text of the descriptions in the dataset for words that are common enough to be useful, like “tart cherry” or “dry”, which come up a lot when describing wine, while excluding words that are too common to be informative, like “wine”. This will let us find wines in our large cellar with similar traits as candidates to recommend. We’ll then use a second ranking step to pick the best similarly priced wines from that longer list of candidates. This will help our harried sommelier find five new wines to offer our vinophile.

But first, we’ll start with the text mining.


Bottle-of-words

Bottle-of-words (more commonly known as bag-of-words, but we’re talking about wine, so…) is a natural language processing technique for finding similarities between texts. The collection of all the texts is called a corpus, although maybe in this case we should call it a cellar. The idea is to create a vector for each document in which each component corresponds to a word in the corpus’s vocabulary. So, for example, if our corpus were

it was the best of times 
it was the worst of times 
it was the age of wisdom
it was the age of foolishness
it was the epoch of belief
it was the epoch of incredulity

We have 13 unique words that become the vocabulary of the bag-of-words, represented as components of a vector

[it, was, the, best, of, times, worst, age, wisdom, foolishness, epoch, belief, incredulity]

and therefore, for example, the first sentence, "it was the best of times" would be encoded as

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

and the third sentence, "it was the age of wisdom", would be encoded as

[1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]

Bag-of-words uses the dot product of these vectors to find texts in the corpus which are similar. For example, the dot product of the first two sentences is $$(\textsf{it was the best of times}) \cdot (\textsf{it was the worst of times}) = 5,$$ while the dot product of the first and third is $$(\textsf{it was the best of times}) \cdot (\textsf{it was the age of wisdom}) = 4,$$ suggesting that the first two sentences are more similar than the first and third, which is obviously true.
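As a quick sanity check, here is a minimal sketch of this toy example in Python, using scikit-learn's CountVectorizer with its default settings (no stop-word removal) so the counts match the hand-computed vectors:

```python
# A minimal sketch of the toy bag-of-words example above.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix

pairwise = (X @ X.T).toarray()         # all pairwise dot products
print(pairwise[0, 1])                  # 5: "best of times" vs "worst of times"
print(pairwise[0, 2])                  # 4: "best of times" vs "age of wisdom"
```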

To prevent common words, such as “is” or “a”, from artificially bolstering the dot product, we frequently want to remove stop words. For wine in particular, I augmented the stop words with common words that don’t add much information to a wine review, words like “wine” and “bottle”, to clean the data.


Appearance

To build out our cellar, we start with the dataset scrubbed by Kaggle user zachthoutt from his Wine Enthusiast Kaggle dataset. There are a few files’ worth of data available; I package the 150k CSV file in the git repo for this project for ease of access. We won’t even use the full dataset, because of memory constraints in my analysis.

The dataset has a little over 150,000 reviews, with columns for country, description, designation, points, price, province, region_1, region_2, variety, and winery. Points is a rating on a 100-point scale, although the lowest score in the data is about 80. region_2 is only sometimes filled in, suggesting it may not be as useful a feature as region_1. Most of the information is in the description text, which contains such descriptions as:

'Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.'

- Carodorum Selección Especial Reserva

The first thing that jumps out is that words like “aromas” are going to come up a lot but contain very little actual information about the wine itself. The region and variety will also contain information on what the wine is like. Another thing to notice is that there are important multi-word phrases, called n-grams, like “chocolaty black fruits”. We’ll want to include those n-grams as features.

We also have the province, which is less descriptive than region_1; the winery, which probably won’t inform our decision much if we’re going for similar traits; the designation, which is what the recommender ultimately returns; the points, which tell us about quality; and the price. All good information to have. For this project I’ll use the descriptions, varieties, and regions of origin to build a list of candidate recommendations, and then a combination of price, rating, and a little random discovery to select the final recommendations.


In glass

Now that we know what the data looks like, it’s time to map the cellar into a bottle-of-words format for the recommender system. This involves taking the text of each description and mapping it to a (in this case, sparse) vector of the vocabulary, and then using one-hot encoding to create vectors for the variety and region.

I used the scikit-learn CountVectorizer tool to create the bottle-of-words out of my cellar of descriptions. In the notebook sommelier_numerique.ipynb I manually added words to an exclude_words list, as well as words to flatten because other words derive from them: for example, “acid” and “acidity” come up a lot in the descriptions, describe the same thing, and should be treated as the same word in the context of wine.

To exclude words that are too rare to be useful, I set a min_df value of .01%, so words have to appear in at least 5 descriptions to be included. To exclude words that are too common to be useful, I set a max_df of 90%, so words that appear in more than 90% of the descriptions get discarded. I also excluded English language stop-words, made sure that hyphenated words were counted as one word, and included up to 3-grams in the description. Finally, I removed a set of common but undescriptive words, like “wine”, to shrink the vocabulary a little. This leaves a vocabulary of a little over 40,000 1- to 3-grams describing the wine.
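For reference, here is a sketch of what those settings might look like in code; the exclude_words list is a hypothetical stand-in for the hand-curated list in the notebook, and df is assumed to be the DataFrame of reviews:

```python
# A sketch of the vectorizer settings described above; the exact values and the
# exclude_words list are stand-ins for what lives in sommelier_numerique.ipynb.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

exclude_words = ["wine", "bottle", "drink"]   # hypothetical subset of the real list

vectorizer = CountVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS) + exclude_words,  # English stop words plus wine boilerplate
    min_df=0.0001,                       # drop words rarer than 0.01% of descriptions
    max_df=0.90,                         # drop words in more than 90% of descriptions
    ngram_range=(1, 3),                  # keep 1-, 2-, and 3-grams
    token_pattern=r"(?u)\b\w[\w-]+\b",   # count hyphenated words as a single token
)
bottle_of_words = vectorizer.fit_transform(df["description"])
```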

The end result is that a description gets boiled down to something like

['juicy', 'fruit', 'ripe', 'delicious', 'white', 'pear', 'aromatic', 'structure', 'citrus', 'elegance', 'gorgeous', 'ranks', 'whites', 'opens', 'sublime', 'yellow', 'spring', 'flower', 'herb', 'orchard', 'scents', 'creamy', 'combines', 'peach', 'almond', 'savory', 'mineral', 'grace', 'lingering', 'elegance structure', 'yellow spring', 'spring flower', 'aromatic herb', 'orchard fruit', 'fruit scents', 'creamy delicious', 'juicy white', 'white peach', 'peach ripe', 'ripe pear', 'pear citrus', 'citrus white', 'white almond', 'savory mineral', 'orchard fruit scents', 'juicy white peach', 'white peach ripe', 'ripe pear citrus', 'citrus white almond']

with a reasonable distribution of vector lengths (the total number of words in each description’s vector):

The distribution of lengths of wine vectors computed by the total number of words in the vector.


Now that we have mapped the wine descriptions to wine vectors, we can start working on our actual recommender system.


In Mouth

The first thing we will want to do is find wines similar to the wine we are basing the recommendation on. This is so-called “content-based recommendation”: we recommend wines that are similar to a wine the vinophile liked. The alternative is “collaborative-based recommendation”, where we look for other vinophiles whose tastes are similar to our vinophile’s and make recommendations based on what they liked.

Conceptually, we’ll generate a long list of candidates by looking for similar wines and sorting on similarity. We compute similarity by taking the dot product of the base wine’s vector with every other wine in our cellar and sorting with the largest values first. We’ll then rank those candidates with a different scoring function, discussed later.
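As a rough sketch (the function and variable names here are illustrative, not taken from the repo), the candidate step might look like:

```python
# A sketch of content-based candidate generation: dot the base wine's vector
# against every wine vector and keep the most similar ones.
import numpy as np

def candidate_indices(wine_vectors, base_idx, n_candidates=100):
    # Dot product of the base wine with every wine in the cellar (sparse-friendly).
    scores = np.asarray((wine_vectors @ wine_vectors[base_idx].T).todense()).ravel()
    scores[base_idx] = -np.inf                        # never recommend the same bottle back
    return np.argsort(scores)[::-1][:n_candidates]    # largest dot products first
```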

But what about the terroir?

Up to now we’ve focused on the bottle-of-words for the descriptions. However, there is some information in the origin of the wine — Napa is very different from Granada, for example — and the variety — a Chardonnay is quite different from a Sangiovese. To account for this, we will extend the wine vector to include a one-hot encoding of all the varieties as well as the regions. This way, the wine vector has 40,146 components from the bottle-of-words, 540 components for the variety, and another 1,007 components from the region.

One thing to be careful about here is normalization. In a first attempt, I normalized the bottle-of-words vectors to unit length, so the dot products measure angles. One consequence was that the candidates were heavily tilted towards having the same region and variety — a Willamette Pinot Noir led to many Willamette Pinot Noir candidates, which is not necessarily what you want. This makes sense: with the bottle-of-words normalized, the most it could contribute to the dot product was a third of the maximum, while region and variety each got a third to themselves. We want to give the bottle-of-words the most weight, because it contains the most unique descriptive information about the wines, but we also want to include the variety and region, since those are also important.

In the end, I opted to leave the bottle-of-words un-normalized.
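Concretely, a sketch of assembling the combined vectors (with df and the bottle_of_words matrix from the earlier sketch standing in for the notebook's objects) might look like:

```python
# A sketch of the combined wine vector: raw (un-normalized) bottle-of-words
# counts stacked with one-hot encodings of variety and region.
from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
onehots = encoder.fit_transform(df[["variety", "region_1"]].fillna("unknown"))

# One row per wine: ~40k description components, ~540 varieties, ~1k regions.
wine_vectors = hstack([bottle_of_words, onehots]).tocsr()
```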

Putting it all together

If we now combine all these, we can generate a body of candidates to recommend. Let’s look at one example:

In this fruit-forward vintage, it's not surprising to find that what shines here are the bright, spicy and tart flavors of strawberries, raspberries and cherries. The Sangiovese was co-fermented with 7% Syrah, and aged in 30% new American oak. As the wine breathes open the flavors of oak and char come up, and the tannins show some extra grip.

- Ledger David winery Sangiovese, from Rogue Valley, Oregon

Plugging this in and sorting, our top ten candidates include a Santa Lucia Highlands Pinot described as

Fresh raspberries, red cherries and even blackberries arise on the juicy nose of this wine from Adam Lee, but it's lifted by wild mint, hummingbird sage and damp coyote scrub. The palate is very herbally spiced, with thyme, marjoram, black cardamom and charred pine, laid across dried strawberries, cranberries and sour cherries, all tied together by serious grip. Drink 2019–2025.

and a Columbia Valley red blend described as

Dark, dusty, strongly scented with barrel toast, coffee grounds and incense, this is the most substantial and complete version to date. Balancing cherry and plum fruit against the pretty barrel-infused tannins, it glides gracefully across the palate into a seamless finish. Saggi is the Long Shadows collaboration with Tuscany's Ambrogio and Giovanni Folonari. the blend in 2007 is 43% Sangiovese, 36% Cabernet Sauvignon and 21% Syrah. With each new vintage, the percentage of Sangiovese climbs, putting more Tuscany in this new world super-Tuscan wine. Dark, dusty, strongly scented with barrel toast, coffee grounds and incense, this is the most substantial and complete version to date. Balancing cherry and plum fruit against the pretty barrel-infused tannins, it glides gracefully across the palate into a seamless finish.

These seem to be pretty good candidates to recommend to someone who really enjoyed our Rogue Valley Sangiovese. Now that we have some candidates, it’s time to sort them in a way that makes sense as a recommendation to our vinophile.

To get an idea of how much coverage our vocabulary has, we can look at non-zero dot products between wine vectors. We’d like there to be a fair amount of coverage, but probably not universal coverage. This plot of a 5,000 wine subset of our full cellar, with black for non-zero dot products and white for zero, shows solid coverage with our bottle-of-words, so it’s reasonable to suspect we won’t have to worry about coming up with too few candidates to recommend.

Coverage of dot products for a subset of wines — each row and column is a wine, black shows a non-zero dot product between the wine vectors, white shows a zero dot product.
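That plot can be generated with a short sketch like the following, reusing the wine_vectors matrix from above:

```python
# A sketch of the coverage plot: pairwise dot products for a 5,000-wine
# subset, drawn black where non-zero and white where zero.
import matplotlib.pyplot as plt

subset = wine_vectors[:5000]
gram = (subset @ subset.T).toarray()      # 5,000 x 5,000 dot products

plt.imshow(gram > 0, cmap="gray_r", interpolation="nearest")  # non-zero -> black
plt.xlabel("wine index")
plt.ylabel("wine index")
plt.show()
```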



Finish

Now that we have a sorted list of candidates, we need to make a recommendation. There are two pieces of data we’ve left in the database: price and rating. It’s hard to include these in the candidate selection, because there we wanted to look only for wines with similar traits, not similar prices. So now we can sort the candidates assuming our vinophile wants the best wine possible at a price similar to the bottle they’re asking about. We will sort on a function that rewards high ratings and punishes prices that differ too much from the original bottle.

We’ll use as our function

$\textrm{sommelier} = \frac{\textrm{rating}}{100} - \left \{ 1 - \exp \left [ - \left ( \frac{\textrm{price} - \textrm{price}_0}{\textrm{price}_0} \right )^2 \right ] \right \}$

This rewards the highest ratings while penalizing wines whose price strays too far from that of the original bottle, without penalizing too harshly when the price is close.
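As a sketch, the scoring function and its use might look like the following; the candidate DataFrame here is hypothetical data purely for illustration:

```python
# A sketch of the sommelier scoring function defined above.
import numpy as np
import pandas as pd

def sommelier_score(rating, price, price_0):
    # rating is on the 100-point scale; price_0 is the price of the base bottle.
    price_penalty = 1.0 - np.exp(-((price - price_0) / price_0) ** 2)
    return rating / 100.0 - price_penalty

# Hypothetical candidates, just to illustrate the ranking.
candidates = pd.DataFrame({"points": [90, 93, 88], "price": [23.0, 60.0, 25.0]})
price_0 = 24.0  # price of the bottle the vinophile liked (hypothetical)

candidates["score"] = sommelier_score(candidates["points"], candidates["price"], price_0)
top_five = candidates.sort_values("score", ascending=False).head(5)
```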

Using this to sort a hundred candidates based on our Ledger David Sangiovese and keeping the top five, we find the top-ranked wine to be a South African Syrah from The Ridge winery, with a 90-point rating and a cost of $23, described as

'Ripe, rich and overall yummy, this Syrah offers bright notes of red plum and cherry laced with a touch of gaminess and soft foliage. The medium-weight mouthfeel boasts fine tannins and prominent acidity that provides a nice lift to the long, spicy finish.'

This doesn’t sound like a terrible recommendation at all.


Final Thoughts

There are a few things about this project that are worth discussing further, where I made some choices and could have made other choices.

Foremost is that, when computing candidates, I pre-computed the dot products of every wine vector with every other wine vector. I did this to produce the plot of the dot products and see how sparse the commonality was, but it also put a big memory restriction on the problem. In particular, a $150,000 \times 150,000$ symmetric matrix of 32-bit floats takes up 45 gigabytes of memory. More generally, a symmetric 32-bit float matrix of $N$ wine vectors takes up $N(N-1)/2 \times 4$ bytes ($N^2$ for the $N \times N$ matrix, minus $N$ for the diagonal since we don’t care how similar a wine is to itself, divided by $2$ because the matrix is symmetric, times $4$ bytes per 32-bit float). This overloaded my computer, so I cut down to a 50,000-wine subset, whose matrix takes up 5 gigabytes, which fits in the 8 GB of RAM on my laptop. I did this mostly for the visualization, and in practice one would not want to pre-compute everything.

One way around this would be to randomly select wines from the cellar without replacement, computing the dot product of each sampled wine with the base wine and keeping only those above some threshold of similarity, until we reach some large number of candidates. Then we would use our recommendation algorithm to select from that smaller set of candidates. This adds a little randomness to the process, which is good for helping our vinophile discover new bottles of wine. The other advantage of this approach is that it scales easily as our cellar gets bigger, since the random selection algorithm is $\mathcal{O}(N)$ instead of $\mathcal{O}(N^2)$.
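A sketch of that sampling approach, with the same illustrative names as before and an arbitrary similarity threshold:

```python
# A sketch of the O(N) alternative: sample wines at random without replacement,
# keep any whose dot product with the base wine clears a similarity threshold,
# and stop once we have enough candidates.
import numpy as np

def sampled_candidates(wine_vectors, base_idx, threshold=5.0, n_wanted=100, seed=None):
    rng = np.random.default_rng(seed)
    base = wine_vectors[base_idx]
    order = rng.permutation(wine_vectors.shape[0])   # random order = sampling without replacement
    keep = []
    for i in order:
        if i == base_idx:
            continue
        similarity = (wine_vectors[i] @ base.T).toarray()[0, 0]
        if similarity >= threshold:
            keep.append(i)
            if len(keep) >= n_wanted:
                break
    return keep
```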

Speaking of our vinophile, we did nothing with our recommendation based on their preferences, only a wine we knew they liked. In practice, if we were a big restaurant with a large database, we might generate candidates from wines liked by similar vinophiles, so-called collaborative-based recommendations. The trouble with this is that we may try recommending a Chardonnay after they bought a Pinot Noir, which may not make sense in context. In this way, we may have two recommenders — one which suggests wines similar vinophiles liked, and one which suggests wines similar to the wine the vinophile just liked. Alternatively, we could make our recommendation algorithm out of the Venn diagram of candidates selected using both the collaborative-based recommender (similar vinophiles) and the content-based recommender (what we just described). This would depend on the context of how we’re using our digital sommelier.
