The Five Phases of a Data Science Project
If you’re working with the Titanic dataset on Kaggle, like every other data scientist on earth has at one point or another, you’re asked to predict which passengers survived the sinking. There’s a cleaned CSV file with all the relevant data and a description of exactly the output format required to submit your model for scoring. The first thing you do after reading the prompt is to start messing with the dataset with some EDA, throw some ML models at it, and see what you can get. 95%! Great! Good job! You nailed a Kaggle problem.
But this shortchanges the biggest parts of a real data science problem.
The difference between a Kaggle problem and a real professional data scientist problem, which is the same difference between a homework problem and a research project, is that the Kaggle problem is nicely posed. You know (i) what question you’re trying to answer; (ii) what dataset you have available; (iii) the format of the answer; and (iv) what the final right answer is.
When you start off on a proper data science problem, you will have none of these. What you’ll probably have is something like “The sales team would like to do a better job filtering and prioritizing their leads.” This is a terrible Kaggle problem, but will be your bread and butter as a career data scientist. Most Kaggle projects take the sequential form of
Problem Prompt → Exploratory Data Analysis → Machine Learning code
with a little back and forth between exploration and ML coding when you discover you should have used deep neural networks instead of a random forest, or find that one of your features is basically useless.
What I’ve found is there are a few clear phases to a real data project — phases, not steps, because steps are sequential and you’ll find yourself doubling back kind of a lot when dealing with a problem with real value and depth — that you’ll go through to deliver a successful data project.
Framing the Right Problem
The first step in the whole process is to make sure you understand what the problem is, what data is available that’s relevant to the problem, what a final deliverable will look like, and how you’ll track whether it’s delivering what it should. This is the job the people posting a Kaggle challenge have already done for you; in most business settings, you’ll have to do it yourself.
Understanding the Problem
Let’s use the sales lead example to give this some specificity. You’ve been presented with a statement:
“The sales team would like to do a better job filtering and prioritizing their leads.”
Now, what does that mean? Are they chasing down a lot of bad leads and want to disqualify those? Or do they have a lot of good leads and want to make sure they get to the best ones first? If it’s the former, your question may be something like “Can we identify leads that are unlikely to convert out of a pool of fresh leads?” If it’s the latter, you need to ask what makes one lead better or more important than another — is it the expected value of the lead? The time it takes to convert the lead? The risk that it turns into a bad lead if it’s not handled right?
Data Quality
You’re going to need to sit down with someone from the sales team and unpack all this. You’re also going to need to take a look at the data available on your leads and see what you have. Are you getting a lot of information in consistent formats about the leads, like revenue or company size? Are you relying on a lot of notes written down by sales people? Where are you getting the leads from, and if it’s from multiple sources what data is consistent across sources? Are you able to tie leads to converted accounts so you can actually measure conversion in the first place?
These are all data quality issues, and will affect everything downstream in the project. They may even be bad enough (“We have a bunch of handwritten notes on post-it cards. Can you work with that?”) that the real project is more data infrastructure than lead qualification. It’s important to find these things out early on, because it may place limits on the type of problem you can solve.
Narrowing Scope & Framing the Deliverable
At this point, you’ve probably written nothing more than a really simple SQL query, just to take a look at what’s in the tables; mostly you’ve taken a lot of notes and talked to people to find out what’s important to them. You also need to determine two really important things: (1) who is going to use your outputs?; and (2) how are you going to measure success? For (1), this is going to affect the outputs of your model, because if you’ve been asked to predict the lifetime value of the leads in your pipeline, that’s very different from predicting the probability of success for each individual lead. For (2), this is a big difference from the Kaggle competition — it’s not enough to get really good performance on the test and validation sets; you need to make sure the model continues to do well once it’s deployed.
Let’s say, for the sake of specificity, that there are a ton of leads, with a very low conversion rate, but each converted lead is very high value. So you’re trying to find the Glengarry leads, and the sales team would like you to get rid of as many of the dead ends as possible while minimizing the number of good leads you’re throwing out. Let’s also say that your company does a stellar job keeping track of new customers, and ties any converted lead to the customer ID and you can clearly follow the entire lifecycle of a customer (not because this is likely, but because I don’t feel like having to write about that project). The question you are now trying to answer would be along the lines of:
“Can we use data on past lead performance to remove leads that are very unlikely to convert?”
Notice that I haven’t said anything about a hypothesis or anything like that, just a very simple question of whether a thing is possible. The reason for this is the iterative nature of a data science project — if you knew it was possible, you’d just do it; if you discover you can’t do it, you may want to revisit this section with an answer that you can provide (“Well, I can’t reduce the number of leads, but I actually have a very good model for lifetime value if a lead converts.”) and see if this is useful to the sales team. Depending on the problem and the culture at your work, you may want to go through two or three “if we can’t do that, does this add value?” back-up plans before proceeding. I personally would rather shoot for the ideal and discover what’s possible along the way than overly narrow the scope at the outset, but the alternative has benefits of its own (it prevents scope creep, keeps you out of rabbit holes, and helps define deliverables for stakeholders) that may outweigh the value of a more open-ended exploration. It just depends.
This sounds like a lot, and it is. This phase is the hard part of the data science project, right at the beginning before you really dive into a single bit of data. I could probably write a whole post just about how to transform a vague business request into a concrete data science project. This part likely involves several meetings with the stakeholders, and a lot of thought to make sure what you deliver brings value.
Which brings me to one last point — what are the dollar stakes in this project? If you can double the sales team’s conversion rate, that has a very clear dollar value you can assign, and at some point someone is going to ask you about it. Understanding ROI on a project can be just as important as understanding the project itself, because the project needs a justification for the resources you’re about to commit to it. If you think you could save a hundred hours of pointless effort per week with your lead qualification model, that’s probably worth pursuing. If it’s going to be one hour? Ehhhh… probably should prioritize something higher impact.
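Putting rough numbers on it helps, even at this early stage. Everything in the sketch below is a made-up assumption — the point is the arithmetic, not the specific figures:

```python
# Back-of-the-envelope ROI check -- every number here is an assumption.
hours_saved_per_week = 100     # time the sales team stops spending on dead-end leads
loaded_hourly_cost = 75        # assumed fully-loaded cost per salesperson-hour, in dollars
weeks_per_year = 48

annual_value = hours_saved_per_week * loaded_hourly_cost * weeks_per_year
print(f"Rough annual value: ${annual_value:,}")  # Rough annual value: $360,000
```

If that number dwarfs the cost of the time you’re about to spend building the model, great; if not, that’s your cue to reprioritize.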
Now we have our question and an idea of what we need to deliver. So now what’s next?
Exploratory Data Analysis & Feature Engineering
Now that you know what you’re trying to accomplish, that it’s worthwhile to engage with, and that there’s reliable data you can use both to measure the effect and to build good features for some sort of model, you should start looking at the data with an eye toward the problem that’s been posed.
The first thing to look at is the average of the key metric you’re looking to optimize over the full dataset. This gives you an idea of where you currently stand, and what you’ll need to do to improve. For our sales team, it’s probably conversion percentage on leads. Eventually you’ll want to compare whatever models you develop to dumb random chance — a null model that assigns the output values to each instance randomly from the distribution of the possible outputs — to make sure you’ve done statistically better than guessing, and to make sure there’s a there there with your models.
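As a minimal sketch, assuming a hypothetical export of the leads table with a boolean converted column, that baseline is a one-liner:

```python
import pandas as pd

# Hypothetical export of the leads table; "converted" is assumed to be True/False.
leads = pd.read_csv("leads.csv")

baseline_conversion_rate = leads["converted"].mean()
print(f"Baseline conversion rate: {baseline_conversion_rate:.1%}")
# This is the number any model has to demonstrably beat.
```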
There’s a ton of literature out there on EDA, feature engineering, and ditching outliers, so I won’t spend too much time here. Mostly there are three main goals in this phase: (i) understand the distribution of the input features, so you know whether certain regions will be under-represented and whether you’ll have to deal with an imbalanced dataset; (ii) identify which features are correlated, because these will be redundant and can likely be reduced using something like PCA or an autoencoder; and, if it’s a supervised learning problem, (iii) identify how the labels vary with the features.
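Here’s a rough sketch of checking (i) and (ii), again assuming that hypothetical leads table:

```python
import numpy as np
import pandas as pd

leads = pd.read_csv("leads.csv")  # hypothetical file, as above

# (i) How imbalanced is the label?
print(leads["converted"].value_counts(normalize=True))

# (ii) Which numeric features are strongly correlated, and therefore redundant?
corr = leads.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
print(upper.stack().sort_values(ascending=False).head(10))
```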
There’s some really great content on EDA and feature engineering from Google’s ML Guide or Machine Learning Mastery, among many, many others. One thing to beware of, which I did not see in either of these guides, is engineered features that are nearly constant across most of the data.
To run with our lead example, what if we want to look at the average first purchase value as a way of prioritizing leads? If the average first purchase value for new customers is basically fixed, then that’s a useful insight for forecasting (since now you only need to forecast volume of purchase) but a terrible feature for a sales lead conversion model (because it doesn’t differentiate the leads). Always be thinking about what you’re trying to accomplish and if a feature will help lead your model there.
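One quick way to catch near-constant features is to check how much of the data shares a single value; the 95% cutoff below is an arbitrary rule of thumb, not a standard:

```python
import pandas as pd

leads = pd.read_csv("leads.csv")  # hypothetical file, as above

# Flag features where one value dominates: they can't help the model tell leads apart.
for col in leads.select_dtypes("number").columns:
    top_share = leads[col].value_counts(normalize=True).iloc[0]
    if top_share > 0.95:
        print(f"{col}: {top_share:.0%} of rows share a single value")
```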
Sometimes at this point you start to find trends that point you toward the model you’ll want to apply, like whether K-means clustering or a random forest classifier makes more sense. Other times, depending on the bar for success, something as simple as a linear correlation between a feature and a label can wrap the project up. But if not, it’s time to…
Build a Machine Learning Model
Now we’re at the point where there are already a million articles written. I won’t spend too much time here, aside from injecting some rules from my own practical experience.
#1 — Always try at least three models,
and one of those models should be “what if random chance?” If your sales team converts leads at a 10% rate, you should have a model that randomly guesses “converted” 10% of the time, and you should test all your models against that. The null model is useful to make sure that there’s actually information in your features. This could be obvious during EDA — if you see a nice correlation between a feature and an output, your model should capture that and you should beat the null model. But sometimes it can be hard to visualize — think image processing or NLP projects, where the feature space is a super-high-dimensional manifold embedded in an ultra-high-dimensional space — and EDA won’t reveal a simple trend from 2D cross-correlation plots.
The reason you’ll want at least three models is that one is the null model, and the other two will either have different hyperparameters for the same model type — think neural networks with different hidden layers — or be totally different model types — think an SVM versus a random forest. Some model types perform better on certain kinds of problems, and it’s useful to look at not just which model works best, but also where each model fails and what it’s sensitive to.
You’ll be doing this comparison on the validation set, which you of course held out from the training and test data so you could pick which of your models performs best, right?
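Here’s a minimal sketch of that three-model comparison in scikit-learn. The synthetic data stands in for your lead features (about 10% of the synthetic leads “convert”), and F1 is just one reasonable yardstick, not the only choice:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lead data: roughly 10% of leads convert.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

models = {
    # Null model: guesses "converted" at the observed ~10% base rate.
    "null": DummyClassifier(strategy="stratified", random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:>13}: F1 = {f1_score(y_val, model.predict(X_val)):.3f}")
```

If the null model scores anywhere near the real models, that’s your cue to go back to EDA, or all the way back to the framing phase.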
#2 — If possible, use the metric you are trying to optimize as the loss function for your model.
Consider our sales lead qualification metric. A default classifier (“I put the leads into the convert or not convert buckets”) will use the accuracy
[# of accurately classified instances] / [# of instances]
as the loss function. But if, in framing your problem, you discover that the main goal is to reduce the time the sales team is spending on false leads, you may want to minimize something like
[False Positives] / ([False Positives] + [True Positives])
which corresponds to the fraction of time the sales team spends on false leads. If you know that false positives take less time than true positives to convert, you may want to weight these by the time each takes.
But you also need to look out for perverse incentives in the loss function — you can save a LOT of time by just excluding all of the leads. So if you do use a loss function like this, it’s good to pair it with competing metrics, so that gaming one shows up as a drop in the other. This is just something to think about, and will depend on each problem, but you can often get improved results out of a simpler model with a judicious choice of loss function.
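Here’s a sketch of wiring a metric like this into scikit-learn. Most models won’t accept it as the literal training loss, but you can use it to select models, hyperparameters, and thresholds; the function name and the recall pairing are my own choices, not a standard:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def wasted_effort(y_true, y_pred):
    """Fraction of pursued leads that are dead ends: FP / (FP + TP)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Note the perverse incentive: pursuing no leads at all scores a perfect 0,
    # which is why you'd pair this with something like recall on converted leads.
    return fp / (fp + tp) if (fp + tp) > 0 else 0.0

# greater_is_better=False flips the sign so scikit-learn's model selection minimizes it,
# e.g. GridSearchCV(model, param_grid, scoring=wasted_effort_scorer).
wasted_effort_scorer = make_scorer(wasted_effort, greater_is_better=False)
```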
#3 — Pick the simplest models that solve your problem.
There are a few reasons for this, all of them having to do with what happens to the model after it’s built.
The business reason is that you’re going to want to understand what the model is sensitive to, and that can be hairier to suss out with a deep neural network than with logistic regression. The deep neural network may be slightly more accurate, but an analysis of the model’s sensitivity may rely on some esoteric combination of input features that is very hard to translate into business decisions about, say, where to target ads. If you have a simple model with human-understandable features as inputs, you can use something like scikit-learn’s permutation feature importance to identify which features matter most — say, leads that came through Advert Campaign A converted much better than those that came through Advert Campaign B. That’s much harder to do if your features are some convoluted manifold, and you should avoid relying on such models until the simple ones reach their limits.
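Here’s roughly what that looks like; the synthetic data is a stand-in for your lead features:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice these would be your lead features.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the validation score drops.
result = permutation_importance(model, X_val, y_val, n_repeats=20, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```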
From a deployment perspective, bigger models are more costly to deploy than smaller models, both in terms of the dollar cost of the computing time to run the model and the computing time itself. How often is the model going to be called, and what’s the execution time? If we’re talking about a second to run once a day, you’re good. If it takes a few seconds to execute and you want to run it every fifteen seconds, you might run into some issues. Similarly with retraining the model — what time window do you have to retrain the model, how often is it run, and how long does it take to train? CNNs are enormous models that can take a long time to train, while a relatively shallow neural network with ReLU activation functions can train a lot faster.
#4 — Don’t be afraid to pivot.
When you get into it, your lead qualification model might be trained on data that’s riddled with whatever heuristics the sales team was already using before you got called in, and for strategic reasons you may want to mix things up. If you’re working with an already-existing pool of data, that data is going to reflect any previous models, heuristics, or rules that influenced how it was produced. If the sales team previously ranked the importance of leads by the size of the company, there’s a very good chance you’ll rediscover that in your lead conversion model, and it may have nothing to do with the actual chance of a random lead converting and everything to do with the sales team’s level of effort to convert that specific lead. If you have a referral program and the referred customers go into the lead data, that’s going to skew things as well.
Maybe you want to pivot from thinking about lead qualification as a classification problem to thinking of it as a recommender system, which would introduce a higher level of serendipity and diversify the sorts of leads being kicked up. You may even have to revisit the first phase and reframe the whole problem with the stakeholders based on what you find. That’s better than accomplishing nothing, but don’t just come back and be all “Model didn’t work ¯\_(ツ)_/¯”. Come back with “here’s what I tried, here’s why it didn’t work, and here’s what we could do instead to deliver something similar.” Involve your stakeholders in the process, and help them learn how you think about things in the same way that you’re learning how they think about things. In the long run, that helps with communication and can make these parts of the project run faster and smoother.
Deployment & Monitoring
Engineering Deployment
Not much to say here — you’ll work with your data engineer or leverage existing infrastructure to deploy the model to whatever resources your company uses. One thing to keep in mind is to write at least one integration test for the model: given a small, fixed dataset and a fixed random number seed, does the model produce the same output every time? Otherwise, once the model is in the wild, it can be difficult to tell whether something has changed unless the change is dramatic. The maxim that “untested code is broken code” doesn’t just apply to the dev team.
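A minimal sketch of that kind of test, using a synthetic dataset and an off-the-shelf model as stand-ins for your real pipeline and a frozen fixture file:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def test_model_output_is_reproducible():
    """Same small dataset + same seed should produce identical predictions every run."""
    X, y = make_classification(n_samples=200, n_features=5, random_state=7)
    preds_a = RandomForestClassifier(random_state=42).fit(X, y).predict_proba(X)
    preds_b = RandomForestClassifier(random_state=42).fit(X, y).predict_proba(X)
    np.testing.assert_allclose(preds_a, preds_b)
```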
Monitoring
Alright, you’ve deployed your model! It’s doing things! People are using it!
You’re done, right?
Haha, nope.
Your ML model is an internal product, and like any product you probably want a dashboard exposing metrics and performance to make sure the product is healthy. The obvious metrics to monitor are whatever you used for the loss function and whatever KPI you’re trying to improve. In our lead qualification example, that’s False Positives / (False Positives + True Positives), along with the conversion percentage and perhaps some other metrics like time spent on bad leads (which could be hard to measure) or changes in the demographic make-up of the leads that are converting. You’ll want to watch whether these drift over time, since that can indicate a change in the distribution of data being fed to the model, and you may want to retrain if things drift too far.
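As a sketch, assuming you log every scored lead along with its eventual outcome (the file and column names here are made up, and the three-sigma band is a deliberately crude drift check):

```python
import pandas as pd

# Hypothetical log of scored leads: one row per lead, with the score date and
# whether the lead eventually converted.
scored = pd.read_csv("scored_leads.csv", parse_dates=["scored_at"])

weekly_conversion = scored.set_index("scored_at").resample("W")["converted"].mean()

# Crude drift check: flag weeks far outside the range of an early reference window.
reference = weekly_conversion.iloc[:8]
low, high = reference.mean() - 3 * reference.std(), reference.mean() + 3 * reference.std()
print(weekly_conversion[(weekly_conversion < low) | (weekly_conversion > high)])
```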
This is also a good chance to run an A/B test and see if your model is actually responsible for improving performance. You already have a hypothesis based on your model’s performance on the validation set — say you were able to exclude half of the false leads without eliminating a single converted lead; that gives you an estimate for what the sales team’s new conversion rate should be. So run an A/B test and compare the conversion rate from leads qualified by your model against however it was being done before.
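Once the test has run for a while, a simple two-proportion z-test is usually enough to tell whether the difference is real. The counts below are made up for illustration:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcomes after a few weeks of the A/B test.
conversions = [38, 61]     # control (old process), treatment (model-filtered leads)
leads_worked = [400, 400]  # leads handled in each arm

z_stat, p_value = proportions_ztest(count=conversions, nobs=leads_worked)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in conversion rate isn't just noise.
```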
This can validate that you’ve been able to do something, but models built on historical data will probably not behave exactly as expected once deployed. Digging into why it’s failing to perform as expected can be useful for the next step, which is…
Post-mortem and Iteration
At this point, you want to take a look at what you learned and how you might apply it going forward. Did you uncover some data quality issues that you could work around but want to clean up to make future projects run smoother? Did you find unexpected correlations in the data? Did your model perform as expected and, since it won’t, where is it falling down? In the process of doing this project, did you see some stuff that’s just boilerplate, and can be streamlined by building a small library to leverage on future projects? How would you want to improve your model going forward, and is it worth improving at this point? Did you discover anything unexpected that might inform business strategy?
Part of the job as a data scientist is to identify the real problem you want to solve, and usually new insights come from solving a different problem and having a moment where you went “…huh.”. That’s why you’re a data scientist.