Sales forecasting is an essential part of running a business.

In a typical scenario, sales managers meet with their teams weekly to determine which of their sales opportunities are likely to close before the end of the quarter. Their estimates are propagated up the management chain, aggregated, and presented to the company stakeholders.

Sales forecasting mistakes can be very costly. Investors punish stocks of companies that miss their earnings forecasts.

Sales forecasting mistakes are also very common. The forecasting process is heavily influenced by fallible human nature.

The same fallible humans are also responsible for managing data in the CRM system. Consequently, forecasts that rely exclusively on CRM data tend to be unreliable.

Fortunately, there is light at the end of this tunnel.

Enterprise selling is going digital, and digital sales engagements provide troves of “clean” data that can be used to significantly increase the accuracy of sales forecasting.

I illustrate this point by building a machine learning model that produces a fairly accurate sales forecast based on nothing but the history of recent sales meetings.

A complete Python notebook for this project is available here.

**If you would like to test ideas presented in this article at your organization, check out Morebell Revenue Intelligence and sign up for a free consultation.**

## Synthesizing Data

In a recent article, I demonstrated how one could use the history of sales meetings to reconstruct the shape of a sales process. Here I do the opposite. I begin with the definition of a sales process and use it to generate a log of sales meetings.

The process definition consists of a list of tuples. In each tuple, the first element is the subject of a meeting. The second element is the probability of transitioning to the next meeting. The third element is the average number of days that elapse between this meeting and the previous one.

Based on this definition, with a bit of stochastic simulation, we can produce two data sets: Opportunities and Meetings.
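The simulation can be sketched roughly as follows. This is a minimal illustration rather than the notebook's code: the stage names, probabilities, and day counts in `PROCESS` are hypothetical, as are the function name `simulate`, the exponential distribution of gaps between meetings, and the rule that an opportunity is won once it reaches the final meeting.

```python
import random
from datetime import date, timedelta

# Hypothetical process definition: (meeting subject, probability of
# transitioning to the next meeting, average days since the previous meeting).
PROCESS = [
    ("Discovery", 0.6, 0),
    ("Demo", 0.5, 14),
    ("Proposal", 0.7, 21),
    ("Negotiation", 0.8, 14),
    ("Closing", 1.0, 7),
]


def simulate(n_opportunities, start=date(2023, 1, 2), seed=0):
    """Generate two lists of dicts: opportunities and their meetings."""
    rng = random.Random(seed)
    opportunities, meetings = [], []
    for opp_id in range(n_opportunities):
        # Assumption: opportunities start at random over roughly one year.
        day = start + timedelta(days=rng.randrange(365))
        won = False
        for i, (subject, p_next, avg_gap) in enumerate(PROCESS):
            if avg_gap > 0:
                # Assumption: exponential gaps with the stated mean, at least one day.
                gap = max(1, round(rng.expovariate(1 / avg_gap)))
                day += timedelta(days=gap)
            meetings.append(
                {"opportunity_id": opp_id, "subject": subject, "date": day}
            )
            if i == len(PROCESS) - 1:
                won = True  # the final meeting was reached
            elif rng.random() > p_next:
                break  # the opportunity stalls here and is lost
        # Assumption: the opportunity closes on the date of its last meeting.
        opportunities.append(
            {"opportunity_id": opp_id, "won": won, "close_date": day}
        )
    return opportunities, meetings
```

Calling `simulate(1000)` returns the two data sets as plain lists of dictionaries, which can be loaded into data frames if desired. Fixing the seed keeps the synthetic log reproducible between runs.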

## Prediction Engineering

To train a machine learning model that can predict the future, we must produce a training data set that shows, with the benefit of our present knowledge, what the correct forecast would have been at each point in the past.

For instance, we know now that opportunity A was closed won during quarter Q. While A was active, our sales team met every Monday to decide if A should be included in the forecast.

To train our model, we will produce a table with a row for A for each forecast. These rows will be labeled with 1’s in all forecasts that happened in quarter Q and with 0’s in all forecasts that were produced in quarters before Q.

We also know that opportunity B was closed lost. Consequently, all forecast entries for B will be labeled with 0’s, that is, B should never have appeared in a forecast.

The forecast dates in this table are the so-called cutoff dates. When learning to predict a forecast decision for A on a certain date, our algorithm will only be allowed to use information about sales meetings that occurred before that date.
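The labeling rule can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function names `quarter` and `forecast_labels` are hypothetical, quarters are calendar quarters, and an opportunity is treated as active from its first meeting until its close date.

```python
from datetime import date, timedelta


def quarter(d):
    """Return (year, quarter number) for a date, using calendar quarters."""
    return d.year, (d.month - 1) // 3 + 1


def forecast_labels(opportunity_id, first_meeting, close_date, won):
    """Build one labeled row per Monday forecast while the opportunity is active."""
    rows = []
    # First Monday on or after the first meeting (Monday has weekday() == 0).
    cutoff = first_meeting + timedelta(days=(7 - first_meeting.weekday()) % 7)
    while cutoff <= close_date:
        # Label 1 only if the deal was won in the same quarter as the forecast.
        label = int(won and quarter(cutoff) == quarter(close_date))
        rows.append(
            {"opportunity_id": opportunity_id, "cutoff": cutoff, "label": label}
        )
        cutoff += timedelta(days=7)
    return rows


# Opportunity A is won in the second quarter; opportunity B is lost.
rows_a = forecast_labels("A", date(2023, 2, 15), date(2023, 4, 20), won=True)
rows_b = forecast_labels("B", date(2023, 2, 15), date(2023, 4, 20), won=False)
```

In this example, the rows for A carry 0’s for the Monday forecasts in the first quarter and 1’s for those in the second quarter, while every row for B carries a 0.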

## Feature Engineering

Feature engineering is the process of transforming and formatting historical data in a way that helps the machine to learn a mapping function from historical data to predictions.

A feature is a measurable property of the entity whose state we would like to predict. For instance, the dollar value of a sales opportunity is a feature. In general, the more informative features we can come up with, the more accurate our predictions are likely to be.

Feature engineering has traditionally been a very labor-intensive process. A team of data scientists could easily spend several weeks or even months preparing features for solving a single prediction problem.

Thanks to the latest advances in automated machine learning, this process can now be effectively automated.

For this project, I used the featuretools library that implements the Deep Feature Synthesis algorithm developed by researchers at MIT.

The library automatically traverses relationships between data sets and combines data into a single feature matrix by applying aggregation functions, such as count, time_since_last, time_since_first, and many others.

In a huge boost to productivity, the library takes the cutoff dates into account when calculating the aggregate values.
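To make the role of the cutoff dates concrete, here is a hand-written sketch of what a cutoff-aware aggregation computes for one opportunity. It deliberately does not use the featuretools API. The function name `meeting_features`, the dictionary layout of the meetings, and the choice to express time differences in days are all assumptions made for illustration.

```python
from datetime import date


def meeting_features(meetings, opportunity_id, cutoff):
    """Aggregate the meetings of one opportunity held strictly before the cutoff."""
    seen = sorted(
        m["date"]
        for m in meetings
        if m["opportunity_id"] == opportunity_id and m["date"] < cutoff
    )
    if not seen:
        # Nothing was known about this opportunity at the cutoff date.
        return {"count": 0, "time_since_first": None, "time_since_last": None}
    return {
        "count": len(seen),
        "time_since_first": (cutoff - seen[0]).days,
        "time_since_last": (cutoff - seen[-1]).days,
    }


# Hypothetical meeting log for two opportunities.
meetings = [
    {"opportunity_id": 1, "date": date(2023, 1, 10)},
    {"opportunity_id": 1, "date": date(2023, 1, 24)},
    {"opportunity_id": 1, "date": date(2023, 2, 7)},
    {"opportunity_id": 2, "date": date(2023, 1, 12)},
]

features = meeting_features(meetings, 1, cutoff=date(2023, 1, 30))
```

The meeting on February 7 is excluded because it happened after the cutoff, so the model never sees information that was unavailable when the forecast was made. Writing this filter by hand for every feature and every cutoff date is exactly the work that the library takes over.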

## Baseline Modeling

Now that we have a feature matrix labeled with predictions, we can use it to train and evaluate a machine learning model.

Determining whether a sales opportunity can be included in a quarterly forecast is a typical binary classification problem.

Logistic regression is one of the machine learning algorithms that can be used to solve this class of problems.

Unfortunately, as the ROC curve and the confusion matrix demonstrate, the logistic regression model trained on our data is barely usable. It misses 75% of successful opportunities.
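For readers less familiar with confusion matrices, the sketch below shows how such a figure is derived. Missing 75% of successful opportunities corresponds to a recall of 0.25. The labels and predictions are toy values made up to reproduce that ratio, not the article's data, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, predicted in zip(y_true, y_pred):
        if truth and predicted:
            counts["tp"] += 1
        elif truth and not predicted:
            counts["fn"] += 1
        elif not truth and predicted:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


def recall(cm):
    """Share of successful opportunities that the model identifies."""
    return cm["tp"] / (cm["tp"] + cm["fn"])


def specificity(cm):
    """Share of failed opportunities that the model identifies."""
    return cm["tn"] / (cm["tn"] + cm["fp"])


# Toy data: 4 successful and 6 failed opportunities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
missed_share = 1 - recall(cm)  # share of successful opportunities missed
```

Here the model finds only one of the four successful opportunities, so `missed_share` is 0.75. The same two functions describe any binary classifier: `recall` covers the successful opportunities and `specificity` covers the failed ones.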

At this point, we could try tuning the logistic regression model. We could also try using a different binary classification algorithm.

The search space of possible algorithm and hyperparameter combinations is truly immense. Doing this search manually could take a very long time.

Fortunately, AutoML can do this search for us.

## AutoML Modeling

With the help of an AutoML library, such as evalml, it takes just a few lines of code to find and tune the best machine learning pipeline for our binary classification problem.

Compared with the baseline logistic regression model, the ROC curve and the confusion matrix produced by the best pipeline demonstrate remarkable improvement.

The model correctly identifies nearly 80% of successful and 97% of failed opportunities.

## Next Steps

Based on the results presented here, our sales forecasting model will likely beat the majority of human experts. It is remarkable that we obtained these results with a single data set that is readily available at every organization: the history of sales meetings.

The performance of the model can likely be improved by including additional information in the feature matrix:

- Dates of future meetings that are already scheduled in the calendar
- Calendar properties of the meeting dates: weekdays, quarters, etc.
- Configurations of products and services offered in each opportunity
- Geographic locations of sales teams and their prospects
- Names and roles of people involved in each meeting
- The log of customer engagement online and in digital sales rooms
- The sentiment scores of sales interactions
- The customers’ past purchasing history
- The industry sector, company size, profitability, etc. of the customer account