Predictive Modeling

Predictive Modeling Definition

Predictive Modeling is a statistical technique that uses probability and data mining to predict the outcome of an unknown or future event.

[Image: predictive modeling diagram showing the various processes involved. Source: Plutora]


What is Predictive Modeling?

Predictive modeling, a tool used in predictive analytics, refers to the process of using mathematical and computational methods to develop predictive models that examine current and historical datasets for underlying patterns and calculate the probability of an outcome. The predictive modeling process starts with data collection, then a statistical model is formulated, predictions are made, and the model is revised as new data becomes available.

Predictive models are generally categorized as either parametric or nonparametric. Within these two camps are several different varieties of predictive analytics models, including Ordinary Least Squares, Generalized Linear Models, Logistic Regression, Random Forests, Decision Trees, Neural Networks, and Multivariate Adaptive Regression Splines.
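The parametric/nonparametric distinction can be seen in code. The following is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic dataset is purely illustrative. A parametric model (ordinary least squares) assumes a fixed functional form, while a nonparametric model (a random forest) learns the shape of the relationship from the data:

```python
# Illustrative sketch: a parametric model (OLS) vs. a nonparametric model
# (random forest) fit to the same nonlinear synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=300)

# OLS assumes y is a linear function of X (a fixed parametric form).
ols = LinearRegression().fit(X, y)

# The random forest makes no such assumption about the functional form.
forest = RandomForestRegressor(random_state=42).fit(X, y)

ols_score = ols.score(X, y)        # in-sample R^2 of the linear fit
forest_score = forest.score(X, y)  # in-sample R^2 of the forest
```

Because the underlying relationship here is nonlinear, the nonparametric model fits it far more closely than the linear parametric one.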

Dr. Max Kuhn, Director of Non-Clinical Statistics at Pfizer Global R&D, and Dr. Kjell Johnson, co-founder of Arbor Analytics and former Director of Statistics at Pfizer Global R&D, published a popular and extensive text on the practice of predictive data modeling in their 2013 book Applied Predictive Modeling. Kuhn and Johnson provide intuitive explanations of the process of building, visualizing, testing, and comparing predictive models in R, a programming language and free software environment for statistical computing, graphics, and data science.

What are Predictive Modeling Techniques?

In determining how to choose a predictive model, data scientists perform data sampling in order to analyze a representative subset of data points from which the appropriate predictive model can be developed. Some popular predictive modeling examples include:

  • Logistic regression: a statistical analysis method that estimates the probability of a binary outcome by fitting prior observations of a data set to a logistic function 
  • Decision trees: a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label
  • Time series analysis: methods for visualizing and analyzing time series data in order to extract meaningful statistics and identify trends
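As a concrete illustration of the first technique above, here is a minimal logistic regression sketch using scikit-learn; the feature values (hours studied) and pass/fail labels are made-up example data:

```python
# Hedged sketch: fitting a logistic regression classifier to tiny
# hypothetical data (hours studied -> pass/fail).
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]  # hours studied
y = [0, 0, 0, 0, 1, 1, 1, 1]                                   # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(X, y)

# Probability of passing for a new, unseen observation.
prob_pass = model.predict_proba([[2.25]])[0][1]
```

The fitted model outputs a probability between 0 and 1, which is what distinguishes logistic regression from a plain linear fit.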

How to Make a Predictive Model

Regardless of the types of predictive models in place, the process of predictive model deployment follows the same steps:

  • Clean up data by treating missing data and eliminating outliers
  • Determine whether parametric or nonparametric predictive modeling is most effective
  • Preprocess the data into a format appropriate for the modeling algorithm
  • Specify a subset of data to be used for training the model
  • Train model parameters from the training dataset
  • Conduct predictive model performance monitoring tests to assess model efficacy
  • Validate predictive modeling accuracy on data not used for calibrating the model
  • Deploy the model for prediction
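The steps above can be sketched end to end in a few lines. This is a minimal illustration assuming scikit-learn and NumPy; the synthetic dataset, injected missing value, and choice of a decision tree are all assumptions for the example:

```python
# Minimal sketch of the deployment steps above on toy data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic dataset: 200 rows, 3 features, binary label.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[0, 0] = np.nan                       # inject one missing value

# Step 1: clean up -- drop rows containing missing values.
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# Steps 2-4: choose a (nonparametric) model and hold out data for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 5: train model parameters on the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Steps 6-7: validate accuracy on data not used for calibrating the model.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

After validation, the final step would be serializing the trained model and serving it for prediction.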

How to Evaluate a Predictive Model

A popular technique to employ in predictive model validation and evaluation is cross-validation. Datasets are split at random into training datasets, test datasets, and validation datasets. Training data is used to build the model, then the trained model is run against test data to evaluate performance, and the validation dataset ensures a neutral estimation of predictive model accuracy. 

Each time a subset of the historical data is used as test data, the remaining subsets are used as training data. As testing continues, each subset takes a turn as the test set, with the rest serving as training data, until every subset has been used once as a test set. This allows every data point in a historical dataset to be used for both testing and training, which makes for a less random and more thorough method of evaluating data and testing model accuracy.
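The rotation described above is k-fold cross-validation, which scikit-learn implements directly. A minimal sketch on a synthetic dataset (the data and the choice of logistic regression are illustrative only):

```python
# Hedged sketch of 5-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Each of the 5 folds is used exactly once as the test set while the
# remaining 4 folds form the training set, so every data point serves
# both roles across the 5 runs.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
```

The result is one accuracy score per fold; their mean and spread give a more stable estimate of model accuracy than a single train/test split.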

What is Predictive Modeling Used For?

Predictive modeling, often associated with meteorology, is leveraged throughout a wide variety of disciplines. Some popular predictive modeling applications that utilize customer prediction models and CRM (Customer Relationship Management) predictive modeling include: 

  • Predictive modeling in healthcare: identify the highest-risk patients in poor health who will benefit most from intervention; identify patients most likely to skip appointments without notice; predict clinical outcomes and inform clinical trial designs; predict product safety and optimize dosing; and gain insights from patterns in patient data in order to develop effective campaigns. 
  • Predictive modeling in insurance: turn data collected by insurers into actionable insights for pricing and risk selection purposes; identify customers at risk of cancellation; identify risk of fraud and outlier claims; triage claims and anticipate the insured’s needs, ultimately improving satisfaction and optimizing budget management; and identify potential markets.
  • Financial prediction models: predictive modeling in the investment management industry has proven unreliable on its own in such notable instances as the financial crisis of 2007-2008, when bond ratings, which predict the risk of default from historical macroeconomic data and borrower variables, failed badly in the market for mortgage-backed Collateralized Debt Obligations (CDOs).
  • GIS predictive modeling: describe the environmental factors that constrain and influence where events occur by spatially correlating those factors with the geospatial locations of historical events.

Forecasting vs Predictive Modeling

Forecasting refers to the process of predicting future events based on analysis of trends in past and present data, whereas predictive modeling is based on probability and data mining. In regression terms, prediction pertains to in-sample observations: predicted values are calculated for the observations in the sample used to estimate the regression. Forecasting pertains to out-of-sample observations: forecasts are made for dates beyond the data used to estimate the regression, so the actual values of the forecasted variable are not in the sample used to fit the model.
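The in-sample versus out-of-sample distinction can be made concrete with a trivial regression on a time index. This is a hedged sketch on noiseless synthetic data, assuming scikit-learn:

```python
# Sketch: in-sample prediction vs. out-of-sample forecast on a linear trend.
import numpy as np
from sklearn.linear_model import LinearRegression

t = np.arange(10).reshape(-1, 1)   # observed periods 0..9 (the sample)
y = 2.0 * t.ravel() + 1.0          # a noiseless linear trend: y = 2t + 1

model = LinearRegression().fit(t, y)

in_sample = model.predict([[5]])[0]    # prediction: period 5 is in the sample
forecast = model.predict([[12]])[0]    # forecast: period 12 lies beyond it
```

For the in-sample point the actual value of y is known and can be compared directly; for period 12 no actual value exists in the estimation sample, which is what makes it a forecast.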

Explanatory Modeling vs Predictive Modeling

Explanatory modeling refers to the application of statistical models to data for the purpose of testing causal hypotheses on theoretical constructs. The goal of explanatory modeling is to establish causal relationships by identifying variables that have a statistically and scientifically significant relationship with an outcome.

While predictive modeling addresses what might happen, explanatory modeling addresses what can be done about it, focusing on variables the user can control for the purposes of potential intervention. Explanatory modeling is the dominant statistical approach in empirical research in Information Systems (IS) and typically relies on models in the generalized linear models (GLM) family, whereas predictive analytics models and methods rely on more powerful, algorithmic, non-linear techniques.

While prediction and explanation play different roles, both are vital in developing and testing theories.

Predictive Analytics vs Predictive Modeling

The terms “Predictive Modeling,” “Predictive Analytics,” and “Machine Learning” may sometimes be used interchangeably due to their largely overlapping fields and similar objectives; however, there are some differentiating factors, such as their practical applications. Data analytics predictive modeling is a tool leveraged in predictive analytics and is used throughout a range of industries, including meteorology, archaeology, automobile insurance, and algorithmic trading. When deployed commercially, predictive modeling is often referred to as predictive analytics.

Does HEAVY.AI Offer a Predictive Modeling Solution?

Predictive modeling is a solution to the data discovery challenge posed by the continuously expanding data deluge in big data management systems. HEAVY.AI's Data Science Platform provides an always-on dashboard for monitoring the health of ML models, in which the user can visualize predictions alongside actual outcomes and see how predictions diverge from real life.