
What Goes into an Election Model?

By: Kevin Reuning and Adrian Dooley

In the digital age, predicting the outcome of U.S. elections has become both an art and a science, captivating millions of Americans every election cycle. Although political scientists have a long history of attempting to predict and explain elections, Nate Silver’s FiveThirtyEight blog brought these models to many more Americans in 2008. That year, Silver predicted an Obama popular-vote victory of 6.1 percentage points and a final electoral college total of around 350, just below Obama’s actual margin of 7.3 percentage points and 365 electoral college votes. Silver’s website also garnered over 5 million visits on election day.[1]

By our count, there are at least seven of these forecasting models, including models developed by The Economist and ABC News. Each combines a wide array of data to generate a specific quantitative prediction (note that how these predictions are displayed varies). In this blog we outline how these different forecast models work, the type of information they use, and the assumptions they make (implicitly or explicitly).

It is worth noting what forecast models are not. Forecast models do not include polling aggregators like those run by The New York Times and Real Clear Politics. Unlike forecast models, polling aggregators are designed to provide a snapshot of the election as it stands; they do not explicitly project forward who is likely to win. Forecast models are also different from expert intuition models such as Keys to the White House, Sabato’s Crystal Ball, and Cook Political Report. The output of these expert models varies significantly, with some focusing on state-level predictions and others on overall election-level predictions. The inputs to these models also vary significantly and are often not explicitly stated. Instead, an expert uses different cues to identify who they think will win based on past experience. To clarify the landscape of election forecasting, it is essential to distinguish between the different types of models available. Table 1 outlines the primary categories, detailing their methodologies and providing examples of each.

Table 1: Types of Election Models

| Type | Description | Examples |
| --- | --- | --- |
| Forecasting | Utilizes a combination of polling data and other quantitative metrics to produce specific predictions on election outcomes. These models often incorporate statistical techniques to adjust for biases and trends. | The Economist, 538, Decision Desk HQ |
| Polling Aggregation | Aggregates multiple polls to provide a current snapshot of the race without projecting future outcomes. These aggregators typically average poll results to show the present state of voter preferences. | New York Times, Real Clear Polling |
| Expert Intuition | Relies on the qualitative and quantitative insights of political analysts and experts. These models depend on the experts’ experience and judgment to assess various indicators and predict likely winners. | Keys to the White House, Sabato’s Crystal Ball, Cook Political Report |

What is in a forecasting model?

Polling

To better understand election forecasts we collected information from a variety of prominent forecasts; the full list of models examined is in Table 2. All of these models incorporate polling data, but a critical initial decision is determining which polls to include or exclude. Surprisingly, most election models lack explicit criteria for poll inclusion. FiveThirtyEight provides the most comprehensive rules, requiring methodological transparency and that pollsters meet ethical standards as outlined by the American Association for Public Opinion Research (AAPOR). Silver Bulletin uses similar rules (although it has a slightly different list of “banned” pollsters). In addition, several (JHK and possibly The Economist) rely on FiveThirtyEight’s database of polls. In contrast, Race to the White House and Decision Desk HQ both maintain their own polling databases with very little transparency about how polls are added. Overall, though, it appears that in many cases the polling data used is the same across the different forecasts.

Although the forecasts start with much of the same polling data, they immediately begin to diverge in how they deal with variation across polls. Polls can vary in many ways, including the size of the sample, their modality (phones with a live caller, phones with automated messages, online, etc.), how they invite individuals to participate (phone, text, online panel), decisions made about how to weight the sample, and the funding of the poll. Most forecasts attempt to correct for these effects by estimating models of the polls to calculate the bias induced by different polling decisions. After corrections are made, the polls are averaged, with several forecasts (FiveThirtyEight, Silver Bulletin, and JHK) weighting each poll as a function of sample size, recency, and FiveThirtyEight’s pollster rating. One interesting exception is Decision Desk HQ, which simply fits a cubic spline to the polls.
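To make the weighting idea concrete, here is a minimal Python sketch of a recency- and sample-size-weighted polling average. The half-life, square-root weighting, and poll values are illustrative assumptions for demonstration, not the actual formulas used by FiveThirtyEight or any other forecast.

```python
# Illustrative sketch of a recency- and sample-size-weighted polling average.
# The decay rate, sample-size weighting, and poll values are assumptions for
# demonstration; they are not any forecast's actual formulas.
import numpy as np
import pandas as pd

polls = pd.DataFrame({
    "pollster": ["A", "B", "C", "D"],
    "margin":   [2.0, -1.0, 3.5, 0.5],   # Dem minus Rep, in points
    "n":        [800, 1200, 600, 2000],  # sample size
    "days_old": [1, 5, 10, 3],           # days since the field period ended
})

# Weight grows with sample size (with diminishing returns) and decays with age.
half_life = 14  # assumed half-life in days
polls["weight"] = np.sqrt(polls["n"]) * 0.5 ** (polls["days_old"] / half_life)

weighted_margin = np.average(polls["margin"], weights=polls["weight"])
print(f"Weighted polling average: {weighted_margin:+.2f} points")
```

A spline-based approach like Decision Desk HQ’s would instead fit a smooth curve through the polls over time and read the current estimate off the most recent point of that curve.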

Fundamentals

Nearly every forecasting model incorporates non-polling data, referred to as ‘fundamentals,’ which aim to capture the underlying factors influencing the election. These can include a wide array of data such as economic indicators, historical partisan lean, and consumer outlook. There is significant variation in both how transparent models are about their fundamentals and which fundamentals they include. For example, FiveThirtyEight includes 11 different economic variables and a handful of political variables at the state or national level. Race to the White House, in contrast, includes 10 economic variables that often overlap with FiveThirtyEight’s. For example, both include the University of Michigan’s consumer sentiment index and inflation measured by the CPI. There are also small differences: Race to the White House includes the NASDAQ Composite Index while FiveThirtyEight uses the S&P 500, and the two use slightly different measures of income. The differences are bigger for political variables. FiveThirtyEight mainly focuses on historic partisan lean while also accounting for specifics of the candidate (incumbency and home-state advantages). Race to the White House also pulls in fundraising data and information from special elections.

Many of the remaining models are not as transparent about the data they include. The Economist provides important details on the process used to select its national-level variables,[2] which it says yields a set of variables similar to Alan Abramowitz’s Time for Change model (which includes presidential approval, change in real GNP, and an indicator for incumbency). Other models often state only that they have created an “economic index,” though they frequently go into deep detail on assumptions about incumbency. It isn’t clear to what degree the choice of economic measures matters, although The Economist’s approach does demonstrate that there are principled ways to build these models of fundamentals.
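To illustrate the kind of procedure footnote [2] describes, here is a small sketch of elastic-net variable selection with leave-one-out cross-validation using scikit-learn. The fundamentals and election outcomes below are synthetic placeholders, not The Economist’s actual data or specification.

```python
# Sketch of elastic-net variable selection with leave-one-out cross-validation,
# in the spirit of the approach described in footnote [2].
# All data below are synthetic placeholders, not real historical fundamentals.
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_elections = 19  # roughly the number of presidential elections since 1948

X = pd.DataFrame({
    "approval":       rng.normal(45, 10, n_elections),    # presidential approval
    "gdp_growth":     rng.normal(2.5, 2.0, n_elections),  # real growth
    "incumbent_runs": rng.integers(0, 2, n_elections),    # incumbent on ballot
    "consumer_sent":  rng.normal(85, 12, n_elections),    # consumer sentiment
})
# Synthetic "incumbent-party vote share" loosely driven by two of the variables
y = 48 + 0.15 * (X["approval"] - 45) + 0.8 * X["gdp_growth"] + rng.normal(0, 2, n_elections)

X_std = StandardScaler().fit_transform(X)

# Leave-one-out CV picks the penalty (amount of shrinkage); l1_ratio mixes lasso/ridge.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=LeaveOneOut()).fit(X_std, y)

for name, coef in zip(X.columns, model.coef_):
    print(f"{name:>15}: {coef:+.2f}")  # variables shrunk toward zero contribute little
```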

State Correlation

Another important component of forecasts is the set of state-level estimates. As should be clear, the US presidential election is not decided by the popular vote but instead by electors tied to specific states. This means that the national popular vote is only somewhat informative about who will win (see Hillary Clinton in 2016 and Al Gore in 2000). State-level estimates are therefore necessary, but state-level polling is significantly less common than national polling. In response, most forecasts attempt to connect polling in one state to polling in other states (as well as to national polls). They do this by incorporating state-level correlations either explicitly in their model of the polls or in their simulations. Most forecasts use ancillary data to estimate similarities across states. For example, Decision Desk HQ uses a “comprehensive set of demographic, geographic, and political variables” to identify 10 regions. In its final simulations, each region has its own error term, which moves state-level polls in the same direction for states in that region. Others, like The Economist, directly incorporate state-level relationships into their model. Like Decision Desk HQ, though, The Economist relies on a set of demographic and political factors to measure how similar two states are.
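A minimal sketch of how a shared regional error can move same-region states together in simulation is shown below. The states, regions, polling margins, and error sizes are invented for illustration and do not come from any of these forecasts.

```python
# Illustrative simulation of correlated state-level polling errors: states in
# the same region share a common error draw on top of their own state-level
# noise. All states, regions, margins, and error magnitudes are made up.
import numpy as np

rng = np.random.default_rng(42)

states  = ["PA", "MI", "WI", "GA", "NC", "AZ"]
regions = {"PA": "Rust Belt", "MI": "Rust Belt", "WI": "Rust Belt",
           "GA": "Sun Belt",  "NC": "Sun Belt",  "AZ": "Sun Belt"}
poll_margin = {"PA": 1.0, "MI": 1.5, "WI": 0.8, "GA": -0.5, "NC": -1.0, "AZ": 0.2}

n_sims, regional_sd, state_sd = 10_000, 2.5, 2.0
dem_wins = {s: 0 for s in states}

for _ in range(n_sims):
    # One shared error per region, plus independent state-level noise
    regional_err = {r: rng.normal(0, regional_sd) for r in set(regions.values())}
    for s in states:
        simulated = poll_margin[s] + regional_err[regions[s]] + rng.normal(0, state_sd)
        if simulated > 0:
            dem_wins[s] += 1

for s in states:
    print(f"{s}: Dem win probability {dem_wins[s] / n_sims:.0%}")
```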

Table 2: Forecast Models

| Publisher | Poll Sources | Economic Fundamentals | State Correlation |
| --- | --- | --- | --- |
| 24cast | Relies on 538 | Yes | No |
| Cnalysis | Relies on JHK average | No | ? |
| Decision Desk HQ | No explicit guidelines | No | Yes |
| FiveThirtyEight | Specific requirements | Yes | Yes |
| JHK Forecast | Relies on 538 | Yes | Yes |
| Princeton Election Consortium | Relies on 538 | No | No |
| Race to the WH | No explicit guidelines | Yes | No |
| Silver Bulletin | Specific requirements | ? | Yes |
| The Economist | No explicit guidelines, possibly relies on 538 | Yes | Yes |

Conclusion

Election forecasting models have grown in both complexity and profitability over the past two decades. This growth presents a paradox: as models become more sophisticated, the transparency needed to understand their underlying assumptions and methodologies becomes more important. Currently, FiveThirtyEight is the only model among those discussed that openly shares a significant portion of its data, model structure, and outputs. The reason others do not is likely the second trend: profitability. Models like Silver Bulletin, Decision Desk HQ, and The Economist operate behind paywalls, which, while understandable for funding purposes, hinder comparative analysis and public understanding. Given that these paywalls are unlikely to fall, it would be a positive step for the people behind these models to commit to making their work public after the election. This is a big ask. It is likely that some of these models will be proven wrong, but we can learn from them only if their details are made public.


[1] https://www.nytimes.com/2008/11/10/business/media/10silver.html

[2] They estimate an elastic-net regularization model using leave-one-out-cross-validation to identify the optimal amount of shrinkage.

[Image: the Python group in the Excel ribbon]

Coming to an Excel Near You…

Python. Yes, the ever-popular Python programming language is set to be featured in Excel in the near future. Microsoft is partnering with Anaconda, a leading enterprise-grade Python repository. Python in Excel leverages the Anaconda distribution for Python running in Azure, which includes the most popular Python libraries such as pandas for data manipulation, statsmodels for advanced statistical modeling, and Matplotlib and seaborn for data visualization. For now, Python in Excel is only available in public preview for those in the Microsoft 365 Insiders program Beta Channel. The feature will roll out to Excel for Windows first, starting with build 16818, and then to the other platforms at a later date. Soon, though, you will be able to do advanced data analysis in the familiar Excel environment by accessing Python directly from the Excel ribbon, with no setup or installation required. Read more about Python in Excel.
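As a taste of what those bundled libraries allow, here is a small, self-contained pandas/statsmodels/seaborn sketch of the kind of analysis you could run once the feature reaches your build. The data frame below is synthetic; inside Excel the data would instead come from a worksheet range through the integration itself.

```python
# A small example of the analysis the bundled libraries (pandas, statsmodels,
# seaborn) enable. The DataFrame is a synthetic stand-in for a worksheet range.
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

sales = pd.DataFrame({
    "ad_spend": [10, 15, 20, 25, 30, 35, 40],
    "revenue":  [52, 60, 71, 80, 94, 101, 115],
})

print(sales.describe())                                 # quick descriptive statistics
fit = smf.ols("revenue ~ ad_spend", data=sales).fit()   # simple regression model
print(fit.params)
sns.regplot(x="ad_spend", y="revenue", data=sales)      # scatter plot with fitted line
```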

Can ChatGPT help us in Data Science?

By now you have probably heard about or tried ChatGPT. ChatGPT is a Natural Language Processing “chatbot,” developed by OpenAI and released to the public in November of 2022 for testing. ChatGPT is currently free to use on the web for anyone with an OpenAI account, which is also currently free.

In a nutshell, you ask ChatGPT a question in a “conversational way” and it can use “the whole Internet” to help it provide a response – although its knowledge cutoff is currently 2021 (ask it about an event that occurred or a celebrity that passed away in the last year and you’ll see what I mean). The release of NLP chatbots like this is huge in a lot of ways, and scary in a lot of ways. Like… if AI like this can do your job in less time for less cost, is your job in danger? I’ve been hearing that one for years, and have somehow been able to stay under the radar.  😉

Anyway, how can ChatGPT be of use in analytics and data science? Well, let’s ask!

Q. How can ChatGPT help a data analyst?

A.  “ChatGPT can assist a data analyst in several ways, such as:

  1. Natural Language Processing (NLP) tasks, such as text summarization and sentiment analysis.
  2. Generating human-like text based on the data provided, such as automatically generating reports or descriptions of the data.
  3. Helping to clean and prepare data for analysis by identifying and extracting relevant information from unstructured text data.
  4. Automating repetitive tasks such as data entry or data cleaning.
  5. Answering questions about the data, by training on specific data sets, allowing the data analyst to query the model and receive insights on the data.

Overall, ChatGPT can save time and effort for a data analyst by automating certain tasks and providing insights and information quickly.”

I input the question above into ChatGPT and the answer you see was provided in a matter of seconds – I didn’t write that answer. Interestingly enough, ChatGPT will not give you a duplicate answer if you input the same question twice. In fact, there is a “Regenerate Response” button at the bottom of the page which provides a new response in case you didn’t like the first one. Let’s try that…

A2.  “ChatGPT can assist a data analyst in a number of ways, including:

  1. Data cleaning and preprocessing: ChatGPT can be used to automate tasks such as removing outliers, filling in missing values, and transforming data into a format that is ready for analysis.
  2. Text summarization: ChatGPT can be used to summarize large amounts of text data, such as customer feedback or social media posts, making it easier for analysts to identify key trends and patterns.
  3. Natural language processing: ChatGPT can be used to extract insights from unstructured text data by identifying entities, sentiment, and topics.
  4. Generating reports and visualizations: ChatGPT can be used to generate automated reports and visualizations based on the data analyzed, making it easier for analysts to communicate their findings to stakeholders.
  5. Automating repetitive tasks: ChatGPT can automate repetitive tasks such as data entry and data validation, freeing up data analysts to focus on more complex and important tasks.”

And you can ask ChatGPT to elaborate or give a more specific example of any part of the response you get. You can keep tweaking or rephrasing your question until you get the response that is most helpful or relevant. 

ChatGPT has a number of applications in writing code as well. You can ask ChatGPT to write the code needed to accomplish a task in virtually any programming language, or even in common applications like Excel. For example, if you’re not familiar with all of the functions in Excel (and who is?), you can ask ChatGPT to write the formula or function needed to accomplish a task. You may need to specify the coordinates (rows and columns) the formula or function should act upon, and then you can likely just copy and paste the result into the formula bar or cells in Excel. Or if you are trying to automate a task, you can ask ChatGPT to write an Excel macro, then copy/paste the Visual Basic code into Excel – [Alt+F11] Win or [Opt+F11] Mac. There are also applications for learning to code: let’s say you are reasonably fluent in R but are trying to learn Python. You can input some code in R and ask ChatGPT to give you the equivalent code in Python, or vice-versa (see the sketch below). And ChatGPT may ask clarifying questions to help debug the code. And this is just the tip o’ the iceberg, as they say. There are limitations and things to watch out for. You can find these and more information at the OpenAI > ChatGPT website. Very cool, try it while it’s still free!
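As an illustration of the R-to-Python use case, here is the sort of translation you might ask for, with the original R shown as a comment and a pandas equivalent below it. Both snippets are invented examples, not actual ChatGPT output.

```python
# Example of an R-to-Python translation you might request from ChatGPT.
# Original R (dplyr):
#   library(dplyr)
#   mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
# A pandas equivalent:
import pandas as pd

mtcars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "mpg": [26.0, 30.4, 21.0, 19.2, 15.8, 10.4],
})

avg_mpg = (mtcars.groupby("cyl", as_index=False)["mpg"]
                 .mean()
                 .rename(columns={"mpg": "avg_mpg"}))
print(avg_mpg)
```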