Insights from winning ASA’s DataExpo

When I was a prospective student touring colleges, I assumed research always involved test tubes and lab rats. However, as I reflect on the research project I completed this year, I am happy to say I did not spend any time in a lab. Instead, all I needed was my computer to participate in the DataExpo, a data science competition sponsored by the American Statistical Association (ASA).

Each year, students are given a government dataset and guiding questions, and then are expected to develop their own research project and present findings to judges at a statistics conference in the summer. This year, participants were asked to analyze data from the Global Historical Climatology Network, which contains weather records for the entire world since the 18th century. My project examining the relationship between public perception of climate change and county-level weather trends in the United States won first place in the competition. I learned a lot about data-driven research projects through the DataExpo and thought I would share some of my insights from the process.

  • Set boundaries to avoid becoming overwhelmed by an open-ended project.

My original idea for this project was to look at the social impact of climate change around the world. This is clearly very different from my final research question regarding public perception of climate change in the United States. Why did my research question change so drastically? It was important to narrow the scope of my analysis to something specific and manageable.

In order to analyze my original question, I would have needed to define social impact as it relates to climate change, find data that related to my definition of social impact, and then develop a method for quantifying social impact based on the data I had gathered. All of this would have been in addition to a similar process of quantifying climate change around the world with the Global Historical Climatology Network (GHCN) data. For a full-time college student with about six months to complete the project, this was not feasible.

Rewriting my research question not only made my life easier, but also helped me create a better and more compelling story. For example, narrowing the scope of the analysis to the United States over the past 50 years made sense from both a storytelling and a data quality perspective. The United States was the best-represented country in the dataset, and measurements over the past 50 years are more accurate than those over the past 100 years. I also knew the audience would consist almost entirely of people from the United States, and many middle-aged and older viewers would have been alive for most if not all of the period of analysis.

  • Recognize that performing analysis is a long and iterative process.

Even after I narrowed the scope of the analysis, the data processing for this project presented a significant challenge. I initially had 50 datasets (one per year), each with one row per weather station per day. I needed to manipulate this data to get a new dataset with one row per county, where the columns captured county-level temperature change over the past 50 years. Recognizing there was a lot of work to be done, my advisor Dr. Tom Fisher and I broke the process down into a series of smaller steps. We transformed the data to get one row per station per year, and then to get one row per county per year. Finally, we measured the change in each county over time to obtain a dataset with one row per county.
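The three reshaping steps above can be sketched in Python with pandas. This is only an illustration under stated assumptions, not the code used for the project: the column names (`station_id`, `year`, `tavg`, `county_fips`), the station-to-county lookup table, and the choice to measure "change" as the slope of a straight-line fit are all hypothetical.

```python
import numpy as np
import pandas as pd


def station_days_to_county_trend(daily: pd.DataFrame,
                                 station_county: pd.DataFrame) -> pd.DataFrame:
    """Collapse station-day rows into one row per county.

    Assumed (hypothetical) inputs:
      daily:          columns station_id, year, tavg (one row per station per day)
      station_county: columns station_id, county_fips (one row per station)
    """
    # Step 1: one row per station per year (mean of the daily readings).
    station_year = daily.groupby(
        ["station_id", "year"], as_index=False
    )["tavg"].mean()

    # Step 2: attach a county to each station, then one row per county per year.
    county_year = (
        station_year.merge(station_county, on="station_id", how="inner")
        .groupby(["county_fips", "year"], as_index=False)["tavg"]
        .mean()
    )

    # Step 3: one row per county. Here "change over time" is assumed to be
    # the slope of a least-squares line through the yearly means.
    rows = []
    for fips, grp in county_year.groupby("county_fips"):
        if len(grp) >= 2:
            years = grp["year"].to_numpy(dtype=float)
            temps = grp["tavg"].to_numpy(dtype=float)
            slope = np.polyfit(years, temps, 1)[0]
        else:
            slope = np.nan  # a trend needs at least two years of data
        rows.append({"county_fips": fips, "tavg_change_per_year": slope})
    return pd.DataFrame(rows)
```

Keeping each step as its own intermediate table makes it easier to rerun only the part that changes, for example when a new measure such as precipitation is added later.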

A lot of this project involved thinking about the next step forward from where we were standing. Over the course of many months, we incrementally changed the dataset to obtain the final product. Some steps were repeated multiple times, and some steps were simplified or reduced to fit better with our end goal. For example, we originally processed the data for the last 120 years, and then decided to use only the last 50. Of all the summary statistics we calculated, only two turned out to be useful. I also reran the analysis late in the process to include precipitation measures. It can be frustrating to feel like you’ve wasted your time on unused data processing or analysis, but it’s all part of the journey towards the final product.

  • Statistics is all about the story.

I think one of the most frustrating parts of any data project is that the parts you dedicate the most time to are generally not what you share with the audience. Instead, the success of the project is judged on its results and how you deliver them. After processing the climate data, I joined the final results to global warming survey data collected by Yale in 2019. Using the full dataset, I built a series of maps and graphs to examine the relationship between observed climate change and public perception of climate change. Unfortunately, there wasn’t a strong correlation between the two sets of variables.
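For readers who want to try a similar join, here is a minimal pandas sketch. It assumes both tables share a county identifier and uses Pearson correlation; the column names `county_fips`, `tavg_change_per_year`, and `pct_believe` are hypothetical and do not come from the Yale survey files.

```python
import pandas as pd


def correlate_trend_with_belief(trends: pd.DataFrame,
                                survey: pd.DataFrame,
                                key: str = "county_fips",
                                trend_col: str = "tavg_change_per_year",
                                belief_col: str = "pct_believe") -> float:
    """Join county climate trends to county survey results and correlate them.

    All column names are assumptions made for this illustration.
    """
    # Keep only counties present in both tables; validate guards against
    # accidentally duplicated county rows on either side of the join.
    merged = trends.merge(survey, on=key, how="inner", validate="one_to_one")
    # Series.corr defaults to the Pearson correlation coefficient.
    return merged[trend_col].corr(merged[belief_col])
```

A value near zero from a function like this corresponds to the weak relationship described above, while values near 1 or -1 would indicate a strong linear relationship.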

While my research question was answered, the end of the story just wasn’t satisfying. So rather than end the story there, I looked into several demographic factors to see if any of them were highly correlated with belief in climate change. Unsurprisingly, political ideology (represented by data from the 2016 presidential election) had the strongest correlation with belief in climate change. This extra step created a much better ending to my story. Rather than ending with a disappointing lack of correlation, I was able to construct a narrative about how Americans are guided more by political ideology and belief than by empirical data, a suggestion that is especially relevant today in a world where public opinion doesn’t always align with scientific findings.
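Comparing several candidate factors against one outcome can be done in a few lines. The sketch below ranks columns by the absolute value of their Pearson correlation with a target column; the function name and the column names passed to it are hypothetical, and this is not necessarily how the comparison was done in the project.

```python
import pandas as pd


def rank_factors_by_correlation(df: pd.DataFrame,
                                target: str,
                                factors: list[str]) -> pd.Series:
    """Return each factor's correlation with the target, strongest first."""
    # Correlate every candidate column with the target column.
    corrs = df[factors].corrwith(df[target])
    # Order by strength of relationship, keeping the sign in the output.
    order = corrs.abs().sort_values(ascending=False).index
    return corrs.reindex(order)
```

Sorting on the absolute value keeps strongly negative relationships near the top, since a factor can be informative whichever direction it points.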

Conclusion

I want to close with my biggest life lesson from this project, which is to always say yes to opportunities for practical experience in a field you’re passionate about. When I first joined the DataExpo team as an observer my sophomore year, I struggled to complete basic coding tasks. However, my experience shadowing that year not only led to my successes with this year’s project, but it also resulted in the opportunity to work more closely with members of the statistics faculty and students, in addition to getting me involved with CADS. I really appreciate Dr. Tom Fisher, Dr. Karsten Maurer, Matthew Snyder, Alison Tuiyott and Ben Schweitzer for making the DataExpo such a great experience. All of these opportunities have greatly improved my technical skills and prepared me for life after college. As I look back on my time at Miami, I will always be grateful for these experiences and for all of the people who helped me along the way.

About the Author

Lydia Carter is a senior at Miami University majoring in Statistics and Analytics. She has interned for CADS since Fall 2019.