The Crucial Role of Preprocessing in Revenue Forecasting: Lessons from Larson and Overton

By Sarah Larson and Michael Overton

Revenue forecasting is an essential component of financial planning for state and local governments. Forecasts directly influence budgeting, policy decisions, and overall economic stability. Traditionally, debates around the “best” forecasting techniques have revolved around comparing classical statistical methods with modern machine learning approaches. However, new insights from Sarah E. Larson and Michael Overton’s (2024) research challenge this narrative, emphasizing a less-discussed yet critical factor: data preprocessing.

Shifting the Focus: Why Preprocessing Matters More

The study explored various methods for forecasting sales tax revenues, comparing traditional techniques like ARIMA and exponential smoothing with advanced machine learning approaches such as k-nearest neighbors (KKNN) and extreme gradient boosting (XGBoost). Surprisingly, the research found that the choice of forecasting model, while important, had less impact on accuracy than how the data was prepared beforehand.

Preprocessing refers to cleaning and preparing data prior to analysis and includes steps like detrending, seasonal adjustment, and transformations such as logarithms or the inverse hyperbolic sine (IHS). These steps address inherent challenges in time-series data, such as seasonality and long-term trends, that will bias forecasts if left unaddressed.
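To make these transformations concrete, here is a minimal Python sketch with illustrative values (not the authors’ code or data). The IHS, asinh(y) = ln(y + sqrt(y^2 + 1)), behaves like a log for large values but, unlike the log, is defined at zero:

```python
import numpy as np

# Hypothetical monthly sales tax revenues (thousands of dollars);
# illustrative values only, not data from the study.
revenue = np.array([120.0, 135.5, 0.0, 98.2, 210.7, 185.3])

# Log transform: compresses large values but is undefined at zero,
# so zero-revenue months must be dropped or shifted.
log_revenue = np.log(revenue[revenue > 0])

# Inverse hyperbolic sine (IHS): similar compression, but handles
# zeros (and negatives) without any adjustment.
ihs_revenue = np.arcsinh(revenue)

print(ihs_revenue)
```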

Key Findings: Consistency Across Time Intervals

Larson and Overton analyzed over 16 years of monthly, quarterly, and annual sales tax data from Texas cities. The findings underscored that:

  1. Preprocessing Steps Drive Accuracy: Transformations like IHS and logarithms consistently improved accuracy across all forecasting intervals. These methods normalized the data and reduced the influence of outliers.
  2. Time-Series Characteristics Matter: Adjustments for seasonality and trends significantly enhanced model performance, particularly for monthly and quarterly forecasts.
  3. Model Selection Is Secondary: While machine learning models like KKNN and XGBoost excelled in certain contexts, benchmark methods, such as the Drift and Seasonal Naïve models, often performed comparably after robust preprocessing.

Machine Learning vs. Traditional Methods

The promise of machine learning often lies in its ability to uncover complex patterns in large datasets. Yet, Larson and Overton’s findings reveal that machine learning models are not a magic solution. Their accuracy can be hindered if the input data lacks proper preprocessing, leading to overfitting or errors from unaddressed trends and seasonality.

Interestingly, when applied to well-preprocessed data, traditional models matched or exceeded the performance of advanced algorithms. This result calls into question the rush to adopt sophisticated tools without first mastering fundamental data preparation practices.

Practical Implications for Forecasters

For practitioners and policymakers, the message is clear: the focus should shift from “which model to use” to “how to preprocess the data effectively.” Key recommendations include:

  1. Prioritize Normalization: Methods like IHS and logarithmic transformations should be routine, especially for data with high variability.
  2. Account for Trends and Seasonality: Detrending and seasonal adjustments should be standard practice for time-series analysis (see the sketch after this list).
  3. Match Techniques to Time Intervals: Different preprocessing steps and models work better for monthly, quarterly, or yearly forecasts. Tailoring the approach to the data interval ensures better results.
  4. Be Cautious with Inflation Adjustments: While important, inflation adjustments do not always enhance accuracy and should be applied judiciously.
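A minimal sketch of recommendations 1 and 2, using statsmodels’ STL decomposition on a made-up quarterly series. This is one common way to seasonally adjust and detrend, not necessarily the procedure used in the study:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Made-up quarterly revenue series with a trend and a seasonal
# pattern; stand-in data, not the Texas sales tax series.
rng = np.random.default_rng(0)
idx = pd.date_range("2008-01-01", periods=64, freq="QS")
trend = np.linspace(100, 180, 64)
seasonal = np.tile([10.0, -5.0, 0.0, 15.0], 16)
y = pd.Series(trend + seasonal + rng.normal(0, 3, 64), index=idx)

# STL splits the series into trend, seasonal, and remainder parts.
res = STL(y, period=4).fit()
adjusted = y - res.seasonal        # seasonally adjusted series
detrended = adjusted - res.trend   # remove the long-run trend too
print(detrended.head())
```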

The Road Ahead: Rethinking Forecasting Research

Larson and Overton’s study challenges entrenched assumptions in the field. It suggests that future research should explore the interplay between preprocessing and forecasting methods more deeply rather than fixating solely on model innovation. Additionally, replicating this study in other states or revenue contexts could further validate its findings and refine best practices.

References

Larson, S., & Overton, M. (2024). Modeling Approach Matters, But Not as Much as Preprocessing: Comparison of Machine Learning and Traditional Revenue Forecasting Techniques. Public Finance Journal, 1(1), 29–48. https://doi.org/10.59469/pfj.2024.8

What Goes into an Election Model?

By: Kevin Reuning and Adrian Dooley

In the digital age, predicting the outcome of U.S. elections has become both an art and a science, captivating millions of Americans every election cycle. Although political scientists have a long history of attempting to predict and explain elections, Nate Silver’s FiveThirtyEight blog brought these models to many more Americans in 2008. That year, Silver predicted an Obama victory in the popular vote by 6.1 percentage points and a final Electoral College tally of around 350 votes, both just below Obama’s actual margin of 7.3 percentage points and 365 electoral votes. Silver’s website also garnered over 5 million visits on election day.[1]

By our count, there are at least nine of these forecasting models, including models developed by The Economist and ABC News. Each of these models combines a wide array of data to generate a specific quantitative prediction for its respective forecast (note that the way these predictions are displayed varies). In this blog we outline how these different forecast models work, the types of information they use, and the assumptions they make (implicitly or explicitly).

It is worth noting what forecast models are not. Forecast models do not include polling aggregators like those run by The New York Times and Real Clear Politics. Unlike forecast models, polling aggregators are designed to provide a snapshot of the election as it stands; they do not explicitly project forward who is likely to win. Forecast models are also different from expert intuition models such as Keys to the White House, Sabato’s Crystal Ball, and the Cook Political Report. The output of these expert models varies significantly, with some focusing on state-level predictions and others on overall election-level predictions. The inputs to these models also vary significantly and are often not explicitly stated. Instead, an expert uses different cues to identify who they think will win based on past experience. To clarify the landscape of election forecasting, it is essential to distinguish between the different types of models available. Table 1 outlines the primary categories, detailing their methodologies and providing examples of each.

Table 1: Types of Election Models

| Type | Description | Examples |
| --- | --- | --- |
| Forecasting | Utilizes a combination of polling data and other quantitative metrics to produce specific predictions on election outcomes. These models often incorporate statistical techniques to adjust for biases and trends. | The Economist, 538, Decision Desk HQ |
| Polling Aggregation | Aggregates multiple polls to provide a current snapshot of the race without projecting future outcomes. These aggregators typically average poll results to show the present state of voter preferences. | New York Times, Real Clear Polling |
| Expert Intuition | Relies on the qualitative and quantitative insights of political analysts and experts. These models depend on the experts’ experience and judgment to assess various indicators and predict likely winners. | Keys to the White House, Sabato’s Crystal Ball, Cook Political Report |

What is in a forecasting model?

Polling

To better understand election forecasts, we collected information from a variety of prominent forecasts; the full list of models examined is in Table 2. All of these models incorporate polling data, but a critical initial decision is determining which polls to include or exclude. Surprisingly, most election models lack explicit criteria for poll inclusion. FiveThirtyEight provides the most comprehensive rules, requiring methodological transparency and that pollsters meet ethical standards as outlined by the American Association for Public Opinion Research (AAPOR). Silver Bulletin uses similar rules (although it has a slightly different list of “banned” pollsters). In addition, several (JHK and possibly The Economist) rely on FiveThirtyEight’s database of polls. In contrast, Race to the White House and Decision Desk HQ both maintain their own polling databases with very little transparency about how polls are added. Overall, though, it appears that in many cases the polling data used is the same across the different forecasts.

Although the forecasts start with much of the same polling data, they immediately begin to diverge in how they deal with variation across pollsters. Polls can vary in many ways, including the size of the sample, their modality (phones with a live caller, phones with automated messages, online, etc.), how they invite individuals to participate (phone, text, online panel), decisions made about how to weight the sample, and the funding of the poll. Most forecasts attempt to correct polls for these effects by estimating models of the polls to calculate the bias induced by different polling decisions. After corrections are made, the polls are averaged, with several forecasts (FiveThirtyEight, Silver Bulletin, and JHK) weighting each poll as a function of sample size, recency, and FiveThirtyEight pollster rating. One interesting exception is Decision Desk HQ, which simply fits a cubic spline to the polls.
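As a rough illustration of that weighting idea, here is a hedged Python sketch. The weighting function below (square root of sample size, exponential decay with age) is our assumption for illustration, not any forecast’s actual formula; real models also fold in pollster ratings and house-effect corrections:

```python
import numpy as np

# Hypothetical polls: (candidate share in %, sample size, days old).
polls = [(48.2, 1200, 2), (46.5, 800, 5), (49.1, 2000, 10)]

def poll_weight(n, days_old, half_life=14):
    # Assumed scheme: weight grows with sample size and halves
    # every `half_life` days of age.
    return np.sqrt(n) * 0.5 ** (days_old / half_life)

weights = np.array([poll_weight(n, d) for _, n, d in polls])
shares = np.array([s for s, _, _ in polls])
average = np.average(shares, weights=weights)
print(f"weighted polling average: {average:.1f}%")
```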

Fundamentals

Nearly every forecasting model incorporates non-polling data, referred to as “fundamentals,” which aim to capture the underlying factors influencing the election. These can include a wide array of data, such as economic indicators, historical partisan lean, and consumer outlook. There is significant variation in both how transparent models are about their fundamentals and which fundamentals they include. For example, FiveThirtyEight includes 11 different economic variables and a handful of political variables at the state or national level. Race to the White House, in contrast, includes 10 economic variables that often overlap with FiveThirtyEight’s: both include the University of Michigan’s consumer sentiment index and inflation as measured by the CPI. There are also small differences: Race to the White House includes the NASDAQ Composite Index while FiveThirtyEight includes the S&P 500, and the two use slightly different measures of income. There are bigger differences in political variables. FiveThirtyEight mainly focuses on historic partisan lean while also accounting for specifics of the candidates (incumbency and home-state advantages). Race to the White House, though, also pulls in fundraising data and information from special elections.

Many of the remaining models are not as transparent about the data they include. The Economist provides important details on the process used to select variables at the national level,[2] which they say identifies a set of variables similar to Alan Abramowitz’s Time for Change model (which includes presidential approval, change in real GNP, and an indicator for incumbency). Other models often only note that they have created an “economic index,” though they frequently go into deep detail on assumptions about incumbency. It isn’t clear to what degree different measures of economic indicators matter, although The Economist’s approach does demonstrate that there are principled ways to build these models of fundamentals.
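Footnote 2 describes a selection procedure along these lines. Here is a minimal sketch with made-up data standing in for the fundamentals (the variable counts and values are invented; this is not The Economist’s actual code):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut

# Made-up national-level fundamentals: rows = past elections,
# columns = candidate economic/political indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(18, 8))
y = 50 + 2 * X[:, 0] + rng.normal(0, 1, 18)  # vote shares (invented)

# Elastic net with leave-one-out CV picks the amount of shrinkage;
# variables whose coefficients survive are the chosen fundamentals.
model = ElasticNetCV(l1_ratio=0.5, cv=LeaveOneOut()).fit(X, y)
kept = np.flatnonzero(model.coef_ != 0)
print("selected variable indices:", kept)
```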

State Correlation

Another important component of forecasts is the set of state-level estimates. As should be clear, the US presidential election is not decided by the popular vote but by electors tied to specific states. This means the national popular vote is only somewhat informative about who will win (see Hillary Clinton in 2016 and Al Gore in 2000). State-level estimates are therefore necessary, but state-level polling is significantly less common than national polling. In response, most forecasts attempt to connect polling from one state to polling from other states (as well as to national polls). They do this by incorporating state-level correlations either explicitly in their model of polls or in their simulations. Most forecasts use ancillary data to estimate similarities across states. For example, Decision Desk HQ uses a “comprehensive set of demographic, geographic, and political variables” to identify 10 regions. In their final simulations, each region has its own error, which moves state-level polls in the same direction for states in that region. Others, like The Economist, directly incorporate state-level relationships into their model. Like Decision Desk HQ, though, The Economist relies on a set of demographic and political factors to measure how similar two states are.
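A stylized simulation helps show why these correlations matter. In the sketch below the states, regions, margins, and error sizes are all invented, but the structure (a shared national error, plus a regional error, plus state-level noise) mirrors the approach described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented state polling margins (Dem minus Rep, in points) and an
# assumed region label for each state.
state_margin = {"WI": 1.0, "MI": 2.0, "PA": 0.5, "GA": -1.0, "AZ": -0.5}
state_region = {"WI": "midwest", "MI": "midwest", "PA": "midwest",
                "GA": "sunbelt", "AZ": "sunbelt"}

n_sims, results = 10_000, {s: 0 for s in state_margin}
for _ in range(n_sims):
    national = rng.normal(0, 2.0)  # error shared by every state
    regional = {r: rng.normal(0, 1.5) for r in {"midwest", "sunbelt"}}
    for s, m in state_margin.items():
        # Regional errors push all states in a region the same way.
        draw = m + national + regional[state_region[s]] + rng.normal(0, 2.0)
        results[s] += draw > 0

for s in results:
    print(f"{s}: Dem wins in {results[s] / n_sims:.1%} of simulations")
```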

Table 2: Forecast Models

| Publisher | Poll Sources | Economic Fundamentals | State Correlation |
| --- | --- | --- | --- |
| 24cast | Relies on 538 | Yes | No |
| Cnalysis | Relies on JHK average | No | ? |
| Decision Desk HQ | No explicit guidelines | No | Yes |
| FiveThirtyEight | Specific requirements | Yes | Yes |
| JHK Forecast | Relies on 538 | Yes | Yes |
| Princeton Election Consortium | Relies on 538 | No | No |
| Race to the WH | No explicit guidelines | Yes | No |
| Silver Bulletin | Specific requirements | ? | Yes |
| The Economist | No explicit guidelines, possibly relies on 538 | Yes | Yes |

Conclusion

Election forecasting models have grown in both complexity and profitability over the past two decades. This growth presents a paradox: as models become more sophisticated, the need for transparency increases in order to understand the underlying assumptions and methodologies. Currently, FiveThirtyEight is the only model among those discussed that openly shares a significant portion of its data, model structure, and outputs. The reason is likely the second trend: other models, like Silver Bulletin, Decision Desk HQ, and The Economist, operate behind paywalls, which, while understandable for funding purposes, hinder comparative analysis and public understanding. Given that these paywalls are unlikely to fall, it would be a positive step for the people behind these models to commit to making their work public after the election. This is a big ask. It is likely that some of these models will be proven wrong, but we can learn from them only if their details are made public.


[1] https://www.nytimes.com/2008/11/10/business/media/10silver.html

[2] They estimate an elastic-net regularization model using leave-one-out-cross-validation to identify the optimal amount of shrinkage.

Software Engineering Needs Philosophy Majors

Artificial intelligence is expected to take over coding. This means that engineers with expertise in English, psychology, and philosophy are highly valued for their critical thinking and communication skills. (Cantor 2024)

In the Harvard Business Review, Marco Argenti, Chief Information Officer at Goldman Sachs, made this argument. Argenti told his daughter that if she wanted to be an engineer, she needed to focus on learning philosophy. He wrote, “Coming from an engineer, that might seem counterintuitive, but the ability to develop crisp mental models around the problems you want to solve and understanding the why before you start working on the how is an increasingly critical skillset, especially in the age of AI.”

AI will start taking over the grunt work of coding, and software engineers will need to be able to prove that they are still needed. Knowing how to ask the right questions and think outside the box will be important skills for the upcoming generation of software engineers, and creativity is one of the most important skills they can have.

Cantor, Melissa. “Goldman Wants Philosophy Majors.” LinkedIn News, 22 Apr. 2024, http://www.linkedin.com/news/story/goldman-wants-philosophy-majors-6692170/.

Concluding DataFest 2024


The Center for Analytics and Data Science is happy to announce the close of this DataFest season. DataFest’24 was made possible by our sponsors, Benchmark Gensuite and Fifth Third Bank.


Overall, 80 participants from six different schools competed this year. We would like to thank the students who attended from the following schools:

  • Miami University
  • Bowling Green State University
  • College of Wooster
  • Xavier University
  • University of Cincinnati
  • Capital University

Winning Teams

Teams were ranked using a score-based system. The winning teams were:

  • Top Score: Data-Holic Redhawks
  • Best Insight: Drake’s DataFesters
  • Best Visualization: The Derivative Falcons

Data Visualization (Without Coding)

This is the second bootcamp in the Center for Analytics and Data Science Faculty Fellow Bootcamp Series.


Join Shenyue Jia, Center for Analytics and Data Science Faculty Fellow and assistant professor of Geography, as she leads an activity-based bootcamp about data visualization in various platforms and the usage of GitHub for non-coders.

Over a three-day bootcamp about data visualization and GitHub, Jia will help the audience take their knowledge of data visualization and apply it across multiple platforms, such as Excel, Google Sheets, DataWrapper, and Tableau. She will also teach the audience how to use GitHub to maintain a project portfolio.

  • 4/18: Data visualization in Excel, Google Sheets, and DataWrapper. Part one of a GitHub mini-lesson for non-coders
  • 4/25: Part two of a GitHub mini-lesson for non-coders. Beginner’s Guide to Tableau.
  • 5/2: Beginner’s Guide to Tableau continues. Uploading the bootcamp project to GitHub.

This entry in the CADS Faculty Fellow Bootcamp Series presumes basic knowledge of data visualization. Due to the interactive elements of this bootcamp, please bring your laptop.


The Center for Analytics and Data Science is proud to be able to bring unique views into the arena of data science through its Faculty Fellow program.  Thanks to the wide variety of talent offered by these gifted academics, CADS is able to provide examples of data science principles as they apply to the research of an array of disciplines.  We thank all of our Faculty Fellows for their hard work and willingness to share.

If you have a topic that you would like to see covered as part of the Faculty Fellows Bootcamp Series, or any other questions, please contact the Center for Analytics and Data Science at cads@miamioh.edu.

Population Bias In ML Based Medical Research

Unequal outcomes in medical research have been an ongoing issue, but a new study indicates that machine learning may not be an automatic solution to this problem. (Conev et al. 2024)

A team of researchers from Rice University in Houston, Texas has recently published a study examining how using a biased dataset in a machine learning model can result in disparities in immunotherapy treatment across income levels and geographic populations.

In an analysis of available datasets, the team found that these datasets were “biased toward the countries with higher income levels.” Several solutions are suggested, including a conscious effort to expand data collection to under-represented geographic populations, as well as creating models that train on the characteristics of each individual patient.

Conev, A., Fasoulis, R., Hall-Swan, S., Ferreira, R., & Kavraki, L. (2024). “HLAEquity: Examining biases in pan-allele peptide-HLA binding predictors.” iScience, 27(1). https://www.sciencedirect.com/science/article/pii/S2589004223026901

DataFest@Miami ’24 Registration Opens

DataFest returns to Miami University from April 5th – April 7th for a data-packed 48 hours.


DataFest, now in its eighth year at Miami, brings together teams of 3 – 5 analysis-minded undergraduates as they compete to extract a narrative from real-world datasets. These datasets are provided in cooperation with the American Statistical Association as part of the broader international event.

This year’s DataFest will find teams working in the new McVey Data Science Building, taking advantage of its numerous open-concept study spaces as they condense their insights into a short presentation. Along the way, students will have the opportunity to bounce ideas off a group of “roving consultants” – subject matter experts who volunteer their time so that students can leverage the benefit of real-world experience.

All of this leads to Sunday afternoon, when teams will showcase their understanding of the data by presenting to a group of expert judges. After deliberation, three teams will be chosen as winners across a variety of categories.

New this year, the Center for Analytics and Data Science will be hosting an information session on February 26th. The session is intended for students who have never participated in DataFest, and we welcome any undergraduate student with questions about how this year’s competition might differ from years past.

Social Network Analysis in R

The Center for Analytics and Data Science Faculty Fellow Bootcamp Series is kicking off for Spring Semester.


Join Kevin Reuning, Center for Analytics and Data Science Faculty Fellow and assistant professor in Political Science, as he leads an activity-driven exploration of the (sometimes hidden) connections that link us as a society.

Over three consecutive Wednesday evenings in McVey 168, Reuning will help the audience take their knowledge of R and data analysis as it pertains to more traditional data sets and apply it to the interconnected webs at the foundation of social network modelling.

  • March 6th: Introduction to network terminology and data
  • March 13th: Visualizing networks
  • March 20th: Calculating basic network statistics

This entry in the CADS Faculty Fellow Bootcamp Series presumes at least some level of R proficiency and a working knowledge of basic data analysis principles. Due to the interactive nature of the exploration, please bring your laptop.

The Center for Analytics and Data Science is proud to be able to bring unique views into the arena of data science through its Faculty Fellow program.  Thanks to the wide variety of talent offered by these gifted academics, CADS is able to provide examples of data science principles as they apply to the research of an array of disciplines.  We thank all of our Faculty Fellows for their hard work and willingness to share.

If you have a topic that you would like to see covered as part of the Faculty Fellows Bootcamp Series, or any other questions, please contact the Center for Analytics and Data Science at cads@miamioh.edu.


Coming to an Excel Near You…

Python. Yes, the ever-popular Python programming language is set to be featured in Excel in the near future. Microsoft is partnering with Anaconda, a leading enterprise-grade Python repository. Python in Excel leverages the Anaconda distribution for Python running in Azure, which includes the most popular Python libraries, such as pandas for data manipulation, statsmodels for advanced statistical modeling, and Matplotlib and seaborn for data visualization. For now, Python in Excel is only available in public preview for those in the Microsoft 365 Insiders program Beta Channel. The feature will roll out to Excel for Windows first, starting with build 16818, and then to the other platforms at a later date. Soon, though, you will be able to do advanced data analysis in the familiar Excel environment by accessing Python directly from the Excel ribbon. No setup or installation will be required. Read more about Python in Excel.
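Based on Microsoft’s preview announcement, a cell formula starts with =PY( and an xl() helper pulls worksheet ranges into pandas. The sketch below is a guess at what a simple use might look like; the range and its contents are hypothetical:

```python
# Typed inside an Excel cell after entering =PY(
# xl() pulls a worksheet range into a pandas DataFrame; here, a
# hypothetical sales table in A1:C100 with a header row.
df = xl("A1:C100", headers=True)

# The returned value spills back into the grid as an Excel object.
df.describe()
```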

Can ChatGPT help us in Data Science?

By now you have probably heard about or tried ChatGPT. ChatGPT is a Natural Language Processing “chatbot” developed by OpenAI and released to the public in November of 2022 for testing. ChatGPT is currently free to use on the web for anyone with an OpenAI account, which is also currently free.

In a nutshell, you ask ChatGPT a question in a “conversational way” and it can use “the whole Internet” to help it provide a response – although its knowledge cutoff is currently 2021 (ask it about an event that occurred or a celebrity that passed away in the last year and you’ll see what I mean). The release of NLP chatbots like this is huge in a lot of ways, and scary in a lot of ways. Like… if AI like this can do your job in less time for less cost, is your job in danger? I’ve been hearing that one for years, and have somehow been able to stay under the radar.  😉

Anyway, how can ChatGPT be of use in analytics and data science? Well, let’s ask!

Q. How can ChatGPT help a data analyst?

A.  “ChatGPT can assist a data analyst in several ways, such as:

  1. Natural Language Processing (NLP) tasks, such as text summarization and sentiment analysis.
  2. Generating human-like text based on the data provided, such as automatically generating reports or descriptions of the data.
  3. Helping to clean and prepare data for analysis by identifying and extracting relevant information from unstructured text data.
  4. Automating repetitive tasks such as data entry or data cleaning.
  5. Answering questions about the data, by training on specific data sets, allowing the data analyst to query the model and receive insights on the data.

Overall, ChatGPT can save time and effort for a data analyst by automating certain tasks and providing insights and information quickly.”

I input the question above into ChatGPT and the answer you see was provided in a matter of seconds – I didn’t write that answer. Interestingly enough, ChatGPT will not give you a duplicate answer if you input the same question twice. In fact, there is a “Regenerate Response” button at the bottom of the page which provides a new response in case you didn’t like the first one. Let’s try that…

A2.  “ChatGPT can assist a data analyst in a number of ways, including:

  1. Data cleaning and preprocessing: ChatGPT can be used to automate tasks such as removing outliers, filling in missing values, and transforming data into a format that is ready for analysis.
  2. Text summarization: ChatGPT can be used to summarize large amounts of text data, such as customer feedback or social media posts, making it easier for analysts to identify key trends and patterns.
  3. Natural language processing: ChatGPT can be used to extract insights from unstructured text data by identifying entities, sentiment, and topics.
  4. Generating reports and visualizations: ChatGPT can be used to generate automated reports and visualizations based on the data analyzed, making it easier for analysts to communicate their findings to stakeholders.
  5. Automating repetitive tasks: ChatGPT can automate repetitive tasks such as data entry and data validation, freeing up data analysts to focus on more complex and important tasks.”

And you can ask ChatGPT to elaborate or give a more specific example of any part of the response you get. You can keep tweaking or rephrasing your question until you get the response that is most helpful or relevant. 

ChatGPT has a number of applications in writing code as well. You can ask ChatGPT to write the code needed to accomplish a task in virtually any programming language, or even in common applications like Excel. For example, if you’re not familiar with all of the functions in Excel (and who is?), you can ask ChatGPT to write the formula or function needed to accomplish the task. You may need to input the coordinates (rows and columns) the formula or function should act upon, and then you can likely just copy and paste the code into the formula bar or cells in Excel. Or if you are trying to automate a task, you can ask ChatGPT to write an Excel macro, then copy and paste the Visual Basic code into Excel – [Alt+F11] on Windows or [Opt+F11] on Mac. There are also applications for learning to code: let’s say you are reasonably fluent in R but are trying to learn Python. You can input some code in R and ask ChatGPT to give you the equivalent code in Python, or vice versa. And ChatGPT may ask clarifying questions to help debug the code. And this is just the tip o’ the iceberg, as they say. There are limitations and things to watch out for. You can find these and more information at the OpenAI > ChatGPT website. Very cool, try it while it’s still free!
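As a concrete, hypothetical example of that R-to-Python use case, you might paste a line of R and get back something like the pandas equivalent below. The exchange is invented, not actual ChatGPT output:

```python
# Prompt: “Translate this R code to Python.”
# R input:  aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

import pandas as pd

# Hypothetical local copy of the classic mtcars dataset.
mtcars = pd.read_csv("mtcars.csv")

# pandas equivalent: mean mpg within each cylinder count.
mean_mpg_by_cyl = mtcars.groupby("cyl")["mpg"].mean()
print(mean_mpg_by_cyl)
```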