The Crucial Role of Preprocessing in Revenue Forecasting: Lessons from Larson and Overton | The Center for Analytics and Data Science

By Sarah Larson and Michael Overton

Revenue forecasting is an essential component of financial planning for state and local governments. Forecasts directly influence budgeting, policy decisions, and overall economic stability. Traditionally, debates around the “best” forecasting techniques have revolved around comparing classical statistical methods with modern machine learning techniques. However, new insights from Sarah E. Larson and Michael Overton’s (2024) research challenge this narrative, emphasizing a less-discussed yet critical aspect: data preprocessing.

Shifting the Focus: Why Preprocessing Matters More

The study explored various revenue forecasting methods for sales tax revenues, comparing traditional techniques like ARIMA and exponential smoothing with advanced machine learning approaches such as k-nearest neighbors (KKNN) and extreme gradient boosting (XGBOOST). Surprisingly, the research found that the choice of forecasting model, while important, had less impact on accuracy than how the data was prepared beforehand.

Preprocessing refers to cleaning and preparing data prior to analysis. It includes steps such as detrending, seasonal adjustment, and transformations such as the logarithm or the inverse hyperbolic sine (IHS). These steps address inherent challenges in time-series data, such as seasonality and long-term trends, which can bias forecasts if left unaddressed.
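
To make the two transformations concrete, here is a minimal Python sketch. It is an illustration rather than the code used in the study: the receipts figures are hypothetical, and the helper names ihs and inverse_ihs are our own. The IHS is computed as ln(x + sqrt(x^2 + 1)), which Python exposes as math.asinh.

```python
import math


def ihs(x: float) -> float:
    """Inverse hyperbolic sine: ln(x + sqrt(x^2 + 1)).

    Unlike the logarithm, it is defined at zero and for negative values.
    """
    return math.asinh(x)


def inverse_ihs(y: float) -> float:
    """Undo the IHS transform to return to the original scale."""
    return math.sinh(y)


# Hypothetical monthly sales tax receipts in dollars (illustrative only).
receipts = [0.0, 1200.0, 15000.0, 980000.0]

# IHS handles the zero month without any special treatment.
ihs_values = [ihs(r) for r in receipts]

# The logarithm is undefined at zero, so that month has to be skipped here.
log_values = [math.log(r) for r in receipts if r > 0]

# Forecasts made on the transformed scale are mapped back with the inverse.
recovered = [inverse_ihs(v) for v in ihs_values]
```

For large values the IHS behaves almost exactly like ln(2x), so it compresses large observations much as a logarithm does while still accepting months with zero revenue.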

Key Findings: Consistency Across Time Intervals

Larson and Overton analyzed over 16 years of monthly, quarterly, and annual sales tax data from Texas cities. The findings underscored that:

Preprocessing Steps Drive Accuracy: Transformations like IHS and logarithms consistently improved accuracy across all forecasting intervals. These methods normalized the data and reduced the influence of outliers.
Time-Series Characteristics Matter: Adjustments for seasonality and trends significantly enhanced model performance, particularly for monthly and quarterly forecasts.
Model Selection Is Secondary: While machine learning models like KKNN and XGBOOST excelled in certain contexts, benchmark methods, such as the Drift and Seasonal Naïve models, often performed comparably after robust preprocessing.
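
For readers unfamiliar with the two benchmark methods named above, the following Python sketch shows their textbook definitions. It is a simplified illustration, not the study's implementation; the function names and the quarterly figures are hypothetical. The Drift method extends the straight line between the first and last observations, and the Seasonal Naïve method repeats the most recent full seasonal cycle.

```python
def drift_forecast(series, horizon):
    """Drift method: extend the line through the first and last observations."""
    if len(series) < 2:
        raise ValueError("drift needs at least two observations")
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + h * slope for h in range(1, horizon + 1)]


def seasonal_naive_forecast(series, season_length, horizon):
    """Seasonal naive: each forecast equals the last observed value
    from the same season (for example, the same quarter last year)."""
    if len(series) < season_length:
        raise ValueError("need at least one full seasonal cycle")
    last_cycle = series[-season_length:]
    return [last_cycle[h % season_length] for h in range(horizon)]


# Hypothetical quarterly sales tax receipts (illustrative only).
quarterly = [100, 120, 90, 150, 110, 130, 100, 160]

next_year_drift = drift_forecast(quarterly, horizon=4)
next_year_seasonal = seasonal_naive_forecast(quarterly, season_length=4, horizon=4)
```

Both benchmarks need no parameter tuning, which is part of why their performance after preprocessing is a useful yardstick for more complex models.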

Machine Learning vs. Traditional Methods

The promise of machine learning often lies in its ability to uncover complex patterns in large datasets. Yet, Larson and Overton’s findings reveal that machine learning models are not a magic solution. Their accuracy can be hindered if the input data lacks proper preprocessing, leading to overfitting or errors from unaddressed trends and seasonality.

Interestingly, when applied to well-preprocessed data, traditional models often matched, and in some cases exceeded, the performance of advanced algorithms. This result calls into question the rush to adopt sophisticated tools without first mastering fundamental data preparation practices.

Practical Implications for Forecasters

For practitioners and policymakers, the message is clear: the focus should shift from “which model to use” to “how to preprocess the data effectively.” Key recommendations include:

Prioritize Normalization: Methods like IHS and logarithmic transformations should be routine, especially for data with high variability.
Account for Trends and Seasonality: Detrending and seasonal adjustments should be standard practices for time-series analysis.
Match Techniques to Time Intervals: Different preprocessing steps and models work better for monthly, quarterly, or yearly forecasts. Tailoring the approach to the data interval tends to yield better results.
Be Cautious with Inflation Adjustments: While important, inflation adjustments do not always enhance accuracy and should be applied judiciously.
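
As a rough illustration of the adjustments recommended above, the Python sketch below shows one simple way to carry out each. These are assumptions chosen for brevity, not the specific procedures used in the study: detrending by subtracting a least-squares line, seasonal adjustment by subtracting each season's average, and inflation adjustment by deflating with a price index. All function names and figures are hypothetical.

```python
from statistics import mean


def linear_detrend(series):
    """Subtract a least-squares straight line fitted against time."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = mean(series)
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


def remove_seasonal_means(series, season_length):
    """Subtract the average of each season (e.g. each quarter) from its values."""
    seasonal_means = [mean(series[s::season_length]) for s in range(season_length)]
    return [y - seasonal_means[t % season_length] for t, y in enumerate(series)]


def deflate(nominal, price_index, base=100.0):
    """Convert nominal amounts to real amounts using a price index."""
    return [value * base / index for value, index in zip(nominal, price_index)]


# Hypothetical quarterly receipts and price index (illustrative only).
receipts = [100.0, 120.0, 90.0, 150.0, 110.0, 130.0, 100.0, 160.0]
index = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0]

real_receipts = deflate(receipts, index)
detrended = linear_detrend(real_receipts)
adjusted = remove_seasonal_means(detrended, season_length=4)
```

Because inflation adjustment does not always improve accuracy, the deflate step is best treated as optional and compared against a run without it.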

The Road Ahead: Rethinking Forecasting Research

Larson and Overton’s study challenges entrenched assumptions in the field. It suggests that future research should explore the interplay between preprocessing and forecasting methods more deeply rather than fixating solely on model innovation. Additionally, replicating this study in other states or revenue contexts could further validate its findings and refine best practices.

References

Larson, S., & Overton, M. (2024). Modeling Approach Matters, But Not as Much as Preprocessing: Comparison of Machine Learning and Traditional Revenue Forecasting Techniques. Public Finance Journal, 1(1), 29–48. https://doi.org/10.59469/pfj.2024.8