The Measure of a Metric
Wherein we take a look at the various ways you can measure your business.
“Measure what can be measured, and make measurable that which is not so.”
- Antoine-Augustin Cournot and Thomas-Henri Martin, not Galileo Galilei
It’s extremely hard to operate any enterprise where you aren’t measuring how well you’re doing what you’re doing.
To take an extreme example, imagine a business that doesn't keep track of revenue, or a political campaign that doesn't do any polling.
This is why businesses started hiring analysts and keeping metrics. And they are crazy about them. Key performance indicators are meant to quantify productivity and move the needle on important business performance, or something. Basically, it's hard to know if you're doing the right thing if you aren't measuring whether the thing you did changed anything. And this is why businesses love metrics.
But metrics are tricky, and just because you’re measuring something doesn’t mean it’s the right thing to measure, or that you even know what it is you’re actually measuring. I wrote this blog post to discuss the nuances of metrics, to highlight some of the details you have to get right to make metrics useful.
For the sake of being concrete, let’s say you’re an e-commerce business about to deploy a Referral Program and you want to keep an eye on how it’s going. You want to know how many users are using it, and whether it’s driving up long term revenue. And you’re going to do this with the three kinds of business measurements: Metrics, Figures of Merit, and Scores.
tl;dr
Metrics are direct measurements of events for which you have data stored; they are the statistical measures you can move through product changes
Figures of Merit are quantities, usually tied to some sort of model, which are calculated using event data, and are often why you are trying to change a metric
Scores are meant to provide a summary at a glance of status, usually based on thresholds of what’s “normal” or not
Metrics measure things that have already happened; Figures of Merit (usually) forecast what is likely to happen; Scores describe if what is happening is expected
Primary sources — Metrics
Alright, your Product Manager hit the “Go” button on your Referral Program and things are happening. Users are sending referral links to their friends, and unique referral IDs are being generated left and right, and some of them are even getting used.
All this activity is populating rows in data tables, one row for each event, because your whole team has read about Tidy Data and follows its recommendations closely, so when it's time for your data scientists to show up, the scene isn't a mess.
One of the first things you’ll want to do is probably describe the data as it’s flowing in, so you can get an idea of how healthy the new Referral Program is. You can’t look at every row, because that’s just too microscopic, so you start summarizing the raw data produced by these events. New referrals generated per day and week, number of referrals used per day and week, percent of generated referrals used, referrals generated by user and the distribution around that (some users don’t bother, some users are nuts for it, some users are suspiciously extremely nuts for it), daily/weekly/week-over-week revenue from referrals.
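If you want a feel for what computing these looks like, here's a minimal sketch in pandas. The `referral_events` table and its column names are hypothetical stand-ins for whatever your event pipeline actually produces.

```python
import pandas as pd

# Hypothetical tidy event log: one row per event, with columns event_time
# (datetime), event_type ("referral_created" / "referral_used"),
# referral_id, and user_id. Swap in your real schema.
events = pd.read_parquet("referral_events.parquet")
events["event_date"] = events["event_time"].dt.date

created = events[events["event_type"] == "referral_created"]
used = events[events["event_type"] == "referral_used"]

# New referrals generated per day
daily_new_referrals = created.groupby("event_date").size()

# Percent of generated referrals ever used (assumes every used
# referral_id also appears in the created set)
pct_referrals_used = 100 * used["referral_id"].nunique() / created["referral_id"].nunique()

# Referrals generated per user, and the distribution around that
referrals_per_user = created.groupby("user_id").size()
print(referrals_per_user.describe())  # spot the suspiciously nuts users
```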
These are all what I’m going to call metrics.
Metrics offer a glance of how a product is performing by compiling and summarizing the data generated by events related to that product.
Something notable about all the Metrics I listed, and a distinction I want to make: Metrics are deterministic summaries of data that has been collected. This is to distinguish them from Figures of Merit, which I'll talk about below, in that Metrics make no assumptions or approximations; they describe the data as it is.
This means that "referred revenue" means "a new user showed up at your e-commerce website and bought something using the referral code, and this is the revenue from just those purchases". You measured a discrete event (making a purchase) that you know, with 100% certainty, involved another discrete event (receiving a referral). I'll make this distinction clearer when we get to Figures of Merit next.
For this reason — Metrics reflect a concrete thing that actually happened — Metrics should be the target to move for any experiments or product tests. Why? Because you don't have to infer anything to tell if the Metric moved. Now, whether moving that Metric is important or not is another issue, but at least you don't have to guess both whether the lever matters and whether the lever actually moved, or only appeared to move because of some modeling artifact.
Making a Good Metric
It is possible to build Metrics for everything, but there are good Metrics and bad Metrics, and which are which depends on your business case. Still, there are a few guidelines.
Metrics that measure rates tend to be more meaningful than aggregates. Nobody cares how much revenue your business has made since its inception. They're much more interested in how much revenue your business made last week/month/quarter/year. Sun Microsystems made a lot of net revenue over the course of its history, but its 2007-2008 revenue numbers weren't great.
Similarly, you want to know if these rates are meaningful for your business. If you had weekly sales revenue for your e-commerce site of \$1m/wk, that might be fantastic. If Jeff Bezos were informed Amazon made \$1m in sales revenue last week, the response would be a bit more muted. So it can be useful to look at relative change in week-over-week total revenue, for example, so that you understand if the number is going up or down or basically staying stable over time.
This leads me to draw a distinction between volume Metrics and scaled Metrics. A volume Metric would be Daily New Referrals, while a scaled Metric would be Daily New Referrals Per User. If Daily New Referrals goes up, is it because you have lots of new users, or are existing users making more referrals? Usually scaled Metrics tell you more about what a typical user is doing, while volume Metrics tell you what happens when all the users in aggregate start doing things. Which is more important depends on the Metric: you may want to increase Annual Revenue per User, but that's because your investors are really interested in Annual Revenue.
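Continuing the sketch from above, the volume/scaled distinction (and the week-over-week comparison) is only a couple of lines; `daily_active_users` is an assumed series from your own activity logs, indexed by the same dates.

```python
# Volume Metric: Daily New Referrals (daily_new_referrals, computed in the
# earlier sketch). Scaled Metric: Daily New Referrals Per Active User,
# where daily_active_users is an assumed series with a matching date index.
referrals_per_active_user = daily_new_referrals / daily_active_users

# Relative week-over-week change; assumes no missing dates in the index,
# so seven positions back really is the same day last week.
wow_change = daily_new_referrals.pct_change(periods=7)
```

If the volume Metric rises while the scaled Metric stays flat, the growth is coming from having more users, not from users referring more.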
Finally, a good Metric should tell you to do something if something changes. If Daily New Referrals were to precipitously drop, you might want to check in on the web page used to generate the referral links to make sure it's not misbehaving. The same goes if Daily Referred Purchases were to abruptly drop. Anomalies in volume Metrics like this can indicate that some key infrastructure is acting up, or that someone is doing something malicious.
The Golden Rule of Metrics
The most important thing is to know what a Metric is measuring and what it’s telling you about your product. Also, what it’s not measuring.
This seems obvious. Go ahead and chuckle. But I’ve heard of companies where four different teams had four different definitions of “Conversion Rate” for the same product. Imagine sorting out if things were going well or not in that environment.
For example, Daily Revenue Per User seems pretty straightforward. But it could mean a day's revenue divided by the total number of user accounts ever created, the total number of "active" user accounts (define "active", go on), the total number of recognized users who visited your site, or the total number of recognized users who logged in. Does it include the "Continue as Guest" purchases, or are those segmented out?
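To make that concrete, here's a toy calculation with entirely made-up numbers; each denominator is defensible, and each gives a different answer.

```python
# All figures are placeholders. The point: four defensible denominators,
# four different values of "Daily Revenue Per User".
revenue_today = 52_000.00

denominators = {
    "per account ever created": 410_000,
    "per 'active' account": 38_000,   # define "active", go on
    "per recognized visitor": 21_000,
    "per logged-in user": 9_500,
}

for name, denom in denominators.items():
    print(f"Daily Revenue Per User ({name}): ${revenue_today / denom:,.2f}")
```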
This is why I like descriptive names, even if they're a little clunky. Finished Purchases Per Product Search doesn't exactly roll off the tongue, but a Product Search is a definite event and a Finished Purchase is a definite event, so it's easy to define unambiguously. If you have tracking cookies, you can look at certain individual user behavior, but you have to understand that this only describes users who allow tracking cookies and aren't blocking them. Know the caveats with that: in early 2021 Apple changed their privacy policy, and I guarantee a lot of companies saw precipitous drops in iPhone traffic to their websites.
Coming up with unambiguously defined Metrics is hard, and understanding what they’re measuring and what they’re missing is also hard. But it’s the only way to understand what’s actually happening with your business.
A few examples to whet your appetite
For our Referral Program, we might keep track of a few Metrics to tell us if the product is performing as expected:
Daily New Referrals/Week-over-week: Tells us, in aggregate, how many referrals are going out on a given day, and how that's changed since the same day last week
Daily Referred Revenue/WoW
Referred Revenue Per Referral Purchase: Are referral purchases usually bigger or smaller than typical purchases? By how much? Useful to know!
New Referrals Created Per Active User: Oh, you’re trying a new layout to encourage users to send more referrals? Well, this is the Metric it should move.
Meritorious Figures
These summaries of raw data can only tell you so much, however. They tell you what has actually happened, which is great, but they only tell you what has actually happened, which means they're trailing indicators. They're also raw aggregates, meaning that while they can tell you what happened, or the distribution of what happened, they can't tell you exactly who, as a group, did the thing. They can't, because you can't directly measure market segmentation, or know with total certainty the customer lifetime value of a new customer.
This is what Figures of Merit are for.
Figures of Merit describe features of the business that come from models that make predictions about performance or behavior.
This is the distinction I want to make between Metrics and Figures of Merit. A Metric is a direct measurement of a discrete event. A Figure of Merit uses a collection of discrete events to either describe something that might be difficult to measure directly, or to predict future behavior at an individual or aggregate scale.
Consider the case of going viral. You can measure how many of your referrals are blowing up, but to really know how well the Referral Program is doing, you need to compute the number of new users signed up per user sending out referrals, aka the "viral coefficient".
This represents the borderlands of Metrics versus Figures of Merit — you can measure the number directly, but the reason you care about the number is based on a predictive model.
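As a sketch, assuming two hypothetical tidy tables — `referrals` with a `sender_id` and `week` column, and `signups` with a `new_user_id` and `week` column:

```python
import pandas as pd

def viral_coefficient(referrals: pd.DataFrame, signups: pd.DataFrame, week) -> float:
    """New users signed up per user who sent referrals in a given week."""
    senders = referrals.loc[referrals["week"] == week, "sender_id"].nunique()
    new_users = signups.loc[signups["week"] == week, "new_user_id"].nunique()
    return new_users / senders if senders else 0.0
```

The measurement itself is direct; the interest in it is not. A coefficient above 1 predicts self-sustaining, compounding growth, and that prediction is the model hiding inside the number.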
A Figure of Merit much further from this edge case might be Customer Lifetime Value. When a new user signs up, they aren’t going to fill out a form telling you how much they plan to spend with you. They probably have no idea themselves! But you want to know how much that new user is planning to spend so you have some idea how valuable it will be to retain them. Otherwise you could spend hundreds of dollars retaining someone who is going to spend tens of dollars on new goods.
Customer Lifetime Value is a Figure of Merit for a user, based on models predicting their behavior by trying to group them in with other users who have been around a while where you have a pretty good idea what the trajectory might look like. This can be based on cohort analysis, user segmentation, some sort of Markov Chain Monte Carlo analysis, etc. But it arises from a model.
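Here's a deliberately toy version of the cohort-analysis flavor, with hypothetical table and column names: predict a new user's value as the observed spend trajectory of mature users in the same segment.

```python
import pandas as pd

def cohort_clv(history: pd.DataFrame, segment: str, horizon_months: int = 24) -> float:
    """Toy CLV: mean cumulative spend of mature users in a segment.

    history has one row per (user_id, segment, months_since_signup, spend).
    """
    seg = history[history["segment"] == segment]

    # Only count users old enough to have a full horizon of history;
    # otherwise newer users with short histories drag the estimate
    # down (censoring).
    tenure = seg.groupby("user_id")["months_since_signup"].max()
    mature_ids = tenure[tenure >= horizon_months].index

    mature = seg[
        seg["user_id"].isin(mature_ids)
        & (seg["months_since_signup"] <= horizon_months)
    ]
    return mature.groupby("user_id")["spend"].sum().mean()
```

Even this toy version bakes in assumptions you should be able to state out loud: that the segment is predictive, that old cohorts resemble new ones, and that two years is the horizon you care about.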
Another example is Churn Probability. Especially if you aren’t a subscription service, you need to be really careful how you define “churn” in the first place, because as an online retailer people aren’t going to buy the same t-shirt once a month on the same day then cancel on you. Maybe the answer is a model for “Time Until Next Purchase” or “New Purchase Probability” as a function of time from today. Like with Metrics, you need to be very clear about what your Figure of Merit is predicting, and what assumptions went into that prediction.
Be careful what you test for
One thing to understand about Figures of Merit is that you don’t actually have a lever to move them around directly. They change because the individual Metrics that contribute to them changed, and therefore your lever is in the Metrics, not the Figures of Merit.
Therefore, you can say “I want to reduce Churn Probability” but Churn Probability is based on Most Recent User Login and 3 Month User Purchase Count or whatever the other features are that are measurable events. So to move Churn Probability you have to start with a hypothesis — “If I can get the User to make two purchases in the next three months, their Churn Probability will drop in half” — and then start testing whether you can affect the actual Metric that corresponds to user behavior — “I can get users in this segment to make two additional purchases in the next three months using this retention pipeline of incentives”.
It may seem like a subtle difference, but it's important to understand that you have little to no control over a Figure of Merit directly, but you do have control over the Metrics that measure the events that go into calculating it. This is where understanding what your Figure of Merit is sensitive to, and how it fits into your business picture, gets really, really important.
Making the most of Figures of Merit
What makes a good Figure of Merit? Much depends on how you use it, but before going to the effort of calculating a Figure of Merit you want to base decisions on, answer these questions three:
What activity does this Figure of Merit describe under what assumptions? (revenue growth, new users, etc.)
What Metrics describe the inputs of the Figure of Merit?
What levers can I pull to affect this Figure of Merit, and how would that change be reflected in the business?
An amuse-bouche of Figures of Merit
Our Referral Program probably has a few Figures of Merit that are worth defining clearly and tracking, such as:
Referral Virality Coefficient — given some number of users who sent referrals last week, how many new users signed up?
User Annual Value — How much money is this user going to be worth over the next year?
Order Return Probability — If you have a nice return policy, what are the odds a user’s order gets sent back? That costs money and is something you probably want to minimize, but is likely some big hairy model depending on purchase price, user segment, and purchase history.
Time To Next Purchase — How long until a user makes their next purchase, if ever? This will be a function of user history, recent behavior, and how long ago that was
Business at a Glance — Scores and other indicators
Now we arrive at Scores.
I really don’t like Scores.
Not because there’s anything intrinsically wrong with them, but because they’re so far removed from the individual events and measurements that comprise Metrics or Figures of Merit that interpreting them can get hairy. They generally involve value judgements, whether something is good or bad, but they are also quantitative, and this can combine to give objectivity vibes for something that’s actually quite subjective.
Here’s an example of what I mean: I heard tell of a company that had a User Health Score. It was some weird function of purchases in the previous three months, combined with user logins in the past week, plus some other stuff I can’t remember. It spit out a number between 0 and 100. It was untethered from any measurable user behavior, so it was impossible to validate or even interpret. It was implemented in SQL, and was un-debuggable, so of course it was discovered that the Health Score would go down by 15-20 points on Mondays, then slowly return to a baseline value by Wednesday.
The Platonic Ideal of a Health Score.
Good Scores can be useful, especially for people who aren’t super data-literate, or for dashboards where you just want to make sure things are acting fine. If you know the range of website views you expect on a given day, you can have a score that centers on 0, and when it hits +/- 1 the button turns yellow because now you have either abnormally high or abnormally low traffic, and at +/- 2 the button turns red because something is clearly cattywumpus. These scores are tied to confidence bounds. If you have a model describing how your expected revenue should progress through the day or week, you can score your revenue based on whether you’re within bounds, exceeding, or well off your pace for that day/week.
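As a sketch, with the baseline mean and standard deviation standing in for whatever model of "normal" you actually trust:

```python
def traffic_light(observed: float, expected_mean: float, expected_std: float) -> str:
    """Map a Metric to a status color via its distance from 'normal'."""
    z = (observed - expected_mean) / expected_std
    if abs(z) < 1:
        return "green"   # within normal variation; carry on
    if abs(z) < 2:
        return "yellow"  # abnormally high or low; worth a look
    return "red"         # something is clearly cattywumpus

# 41,000 views against an expected 50,000 +/- 4,000 is 2.25 sigma low: red
print(traffic_light(41_000, expected_mean=50_000, expected_std=4_000))
```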
A Score gives busy people who don't have the time or numeracy to dig into the numbers a quick, at-a-glance way to assess whether somebody needs to dig in and figure out what the hell is going on.
A good Score is connected to a good Metric or Figure of Merit, and is a value judgement on when something has gotten out of whack with what you expect. A good Score is something your Product Manager can check in the morning: it grabs their attention if something is off, so they can dig into the details, and otherwise lets them ignore the metrics because everything is green.
A bad Score leads to unclear next actions beyond investigating “why is the Score bad?”. A bad Score sends them on wild goose chases, or fails to narrow their focus. A bad Score gives the illusion of objectivity when actually someone made a bunch of subjective decisions that are now mathematically encoded in the Score.
So again, there’s nothing wrong with Scores as a concept, but because you’re making a value judgement, you need to be very careful that they are easy to interpret and easy to dig into to understand why they’re acting up if something changes. A good Score:
is tied to a specific business activity
is easily explained in terms of Metrics or Figures of Merit
is not taken literally; after all, it's a value judgement of what's "normal"
Caveat Metrica
Metrics are obviously important; it's very hard to run a business without knowing what's happening. But there are a few caveats to keep in mind as you go about measuring everything that happens in your business.
Pressing your thumb on the scale
Beware of Goodhart’s Law:
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
or, more colloquially,
Once a metric becomes a target, it ceases to be a good metric.
A couple of case studies illustrate the point:
You decide that the ideal way to make your Referral Program a rousing success is for users to send out as many invitations as possible, so you give them a discount just for sending out invitations. Suddenly, users are sending hundreds of invitations, and hardly any of them are converting, but you’re giving out discounts like candy.
In an effort to figure out who the most productive software developers are, a company starts measuring the number of lines of code they get approved in pull requests. This encourages devs to use the longest possible solution to any software problem to look “more productive”, when a shorter and more elegant solution would be easier to maintain.
As discussed by Edward Zitron here, in an effort to increase the amount of time users spend scrolling Google searches, Google made a series of changes that actually made the search quality worse and more frustrating.
None of the metrics in question are bad metrics, per se, but once you start trying to improve a metric by instituting targets, it is no longer telling you as much about the business, because you've incentivized manipulating the numbers.
But of course, you have to try to improve metrics somehow, or else what are you doing? And if you start moving the metrics around, can you still rely on them to tell you what you think they're telling you? Which leads us to…
Metrics Brain
There is a lot out there about data-driven decision making. And, generally speaking, organizations think with their gut a lot. But then there's the opposite end of that, where every decision is filtered through some metric, and no actual thought is given to what the metric is saying in a broader context. And you know what I think about abdicating thinking in favor of data.
This is when your organization begins suffering from Metrics Brain. This is the opposite of Goodhart's Law: the metrics lead your organization by the nose, without thought for what the metrics mean or what they're actually telling you or not telling you. Metrics and experiments cannot make decisions; they can only suggest the likely outcomes of those decisions under the constraints of the experiment. You have to provide the ideas you're testing, and you have to interpret mixed results, and you have to understand why you're doing what you're doing. When you forget this, and just follow what a single number tells you, you're suffering from Metrics Brain.
The cure to Metrics Brain is to take a step back and consider the broad context of a decision you’re making. What are the trade-offs? What are the likely outcomes? What happens if we make the “wrong” decision, and how will we even know it was wrong? Did we follow the right process in making the decision? Metrics can guide these discussions and ground the quantitative parts of it in numbers, but metrics are not a substitute for human intelligence.
You see Metrics Brain take over a lot with forecasts.
Pick your metric, pick your business
If you decide to optimize your online retailer for a very high Customer Lifetime Value, you are likely going to end up targeting higher-income customers with more expensive goods, because they can afford to buy them and rack up the high CLV. On the other hand, if you're more worried about raw Revenue and scaling isn't a concern, you'll likely end up targeting lower-income customers with less expensive goods, because you can make up for the lower CLV with pure volume.
Deciding which of those two metrics to prioritize can affect how you make decisions based on experiments. So be careful that your metrics reflect your business, and not that your business reflects your metrics. This is part of designing metrics you're okay with getting Goodhart-ed: make sure "metric is good" reinforces your overall strategy, and that the metrics you monitor flow from quantifying the success of your business objectives, not from wanting to measure one more thing for the sake of an extra slide in the quarterly earnings call.
Which brings me to the last thing…
Vanity metrics
Vanity metrics are metrics that make you feel good but provide no actionable information. The way you avoid them is to ask, every time you start working on some sort of new metric: if this metric changes, what will we do?
If the answer is a series of inquiries to determine what’s happening that leads to actionable responses that are worth the effort in isolating and correcting, it is not a vanity metric. If the answer is ¯\_(ツ)_/¯, it is a Vanity Metric.
Excelsior!
Metrics, Figures of Merit, and Scores all play vital roles in understanding how your business is performing. It’s extremely hard to run a business without Metrics, but it’s also hard to understand why Metrics are important without Figures of Merit, and it’s also hard to take in all the Metrics at once with business context without Scores. Understanding how they fit together, what they say, and what they can’t say, is important for getting the most out of the measures of your business.