The Center for Analytics and Data Science is happy to announce a close to this DataFest season. DataFest’24 was made possible by our sponsors Benchmark Gensuite and Fifth Third Banking.
Overall there were 80 participants from six different schools who competed this year. We would like to thank students for attending from the following schools:
Miami University
BGSU
College of Wooster
Xavier University
University of Cincinnati
Capital University
Winning Teams
Teams were ranked using a score-based system. The winning teams were:
DataFest, now in its eighth year at Miami, brings together teams of 3 – 5 analysis-minded undergraduates as they compete to extract a narrative from real-world datasets. These datasets are provided in cooperation with the American Statistical Association as part of the broader international event.
This year’s DataFest will find teams working in the new McVey Data Science Building, taking advantage of its numerous open-concept study spaces as they condense their insights into a short presentation. Along the way, students will have the opportunity to bounce ideas off a group of “roving consultants” – subject matter experts who volunteer their time so that students can leverage the benefit of real-world experience.
All of this leads to Sunday afternoon, when teams will showcase their understanding of the data by presenting to a group of expert judges. After deliberation, three teams will be chosen as winners across a variety of categories.
New this year, the Center for Analytics and Data Science will be hosting an information session on February 26th. The session is intended for students who have never participated in DataFest, but we welcome any undergraduate student with questions about how this year’s competition might differ from years past.
To be completely honest, when I hear “data science,” the first thing that comes to mind is complex data sets being manipulated by very smart people with very smart computers, which isn’t a bad thing! However, there are many other applications and uses for data science and analytics that attract different people with different interests. One application that is particularly interesting to me, and hopefully to you as well, is the use of analytics and data science in the sport industry.
My name is Kade Peterson and I am a sophomore marketing major and sport management minor. I recently started working with CADS as a marketing intern this fall semester to gain marketing experience and understand how Miami University uses data science around campus. I saw an opportunity to connect what I have been learning through CADS with my love of sports to investigate the use of data science in athletics. I had a conversation with my roommate, who works for the university’s baseball team as a student manager. He told me about the different data collection methods they have for the baseball team, as well as the processes they go through to analyze this data and present it to the rest of the organization to recommend action and show what is effective for the team.
After having this discussion, I wanted to further investigate the use of data analytics in different areas of the sport industry. I was drawn towards researching the use of sports analytics in golf because that is one of the sports I am most passionate about. One of the biggest stories of the year in the golfing community was the development of a much larger, much longer-hitting Bryson DeChambeau. He has earned himself the nickname of The Mad Scientist because of his devotion to tracking data and the mechanics of the body. Upon returning to competition after the COVID-19 pandemic, Bryson had added 30 pounds to his frame from daily workouts and strength training. He has become a much larger player, which has resulted in him hitting the ball a LOT farther. He is averaging 330 yards off the tee this season, which is up almost 40 yards from his previous season’s average. This massive increase has put him in first place for average driving distance on tour, nearly 10 yards ahead of his closest competitor (Source).
Following the return to competition, there were lots of questions surrounding DeChambeau and whether this change was worth it. He silenced all of those questions this year, placing in the top 10 in 9 of the 17 events he played. He also earned his first major title by winning the U.S. Open this year (a major is one of the 4 biggest tournaments each year: the U.S. Open, The Open, The Masters, and the PGA Championship). Bryson DeChambeau is a breath of fresh air in a sport that has long been stuck in its ways and lacking diversity. He is a self-described nerd in the game of golf and has turned towards data in a game that is generally feel-based. When promoting the launch of an app that helped players pick the best golf ball for themselves, Bryson said, “the data analytics aspect of golf has helped me understand, from a percentage standpoint, where to hit shots, how to play a course, what clubs to use based on conditions, etc…” (Source). Bryson has turned to data analytics to try to give himself an advantage over his competition. Combined with his physical transformation, this has made him one of the most talked-about players on tour.
We are at a point where nearly every sport being played has some sort of data tracking and analytics aspect involved with it. The reach of data analytics in sports has increased over time. For example, The University of Connecticut just hosted a Sports Analytics Symposium with over 300 participants, including students at the undergraduate and high school level. The purpose of this conference is to provide information to students who are just beginning to work with sports analytics, or who do not know much about it but know they are interested in sports and math. The symposium had four keynote speakers, including Brian MacDonald, the director of sports analytics at ESPN (Source). Universities are starting to offer more sports analytics opportunities to students. Miami University gives students the ability to earn a sports analytics certificate as well as a sports analytics minor. The crossroads of data analytics and sports has become more prominent in the sport industry, and schools are moving to accommodate this industry shift.
In conclusion, the sport industry is usually not the first thing that comes to mind when discussing data analytics. However, sports analytics is a quickly growing segment of the industry and offers a unique application of these tools. This shows how widely data analytics has spread and how it impacts so many different industries in the world.
When I was a prospective student touring colleges, I assumed
research always involved test tubes and lab rats. However, as I reflect on the
research project I completed this year, I am happy to say I did not spend any
time in a lab. Instead, all I needed was my computer to participate in the DataExpo,
a data science competition sponsored by the American Statistical Association
(ASA).
Each year, students are given a government dataset and
guiding questions, and then are expected to develop their own research project
and present findings to judges at a statistics conference in the summer. This
year, participants were asked to analyze data from the Global Historical
Climatology Network, which contains weather records for the entire world since the
18th century. My project examining the relationship between public
perception of climate change and county-level weather trends in the United
States won first place in the competition. I learned a lot about data-driven
research projects through the DataExpo and thought I would share some of my
insights from the process.
Set boundaries to avoid becoming overwhelmed by an open-ended project.
My original idea for this project was to look at the social
impact of climate change around the world. This is clearly very different from
my final research question regarding public perception of climate change in the
United States. Why did my research question change so drastically? It was
important to narrow the scope of my analysis to something specific and
manageable.
In order to analyze my original question, I would have needed
to define social impact as it relates to climate change, find data that related
to my definition of social impact, and then develop a method for quantifying
social impact based on the data I had gathered. All of this would have been in
addition to a similar process of quantifying climate change around the world
with the GHCN data. As a full-time college student with about 6 months to
complete the project, this was not feasible.
Rewriting my research question not only made my life
easier, but it helped me create a better and more compelling story. For
example, narrowing the scope of the analysis to the United States in the past
50 years was compelling from both a storytelling and data quality perspective.
The United States was the best represented country in the dataset, and
measurements are more accurate over the past 50 years as opposed to the past
100 years. I also knew the audience would almost entirely consist of people
from the United States, and many middle-aged and older viewers would have been
alive for most if not all of the period of analysis.
Recognize that performing analysis is a long and iterative process.
Even after I narrowed the scope of the analysis, the data
processing for this project presented a significant challenge. I initially had
50 datasets (one per year), each with one row per weather station, per day. I
needed to manipulate this data to get a new dataset with one row per county
where the columns captured county-level temperature change over the past 50
years. Recognizing there was a lot of
work to be done, my advisor Dr. Tom Fisher and I broke the process down into a series
of smaller steps. We transformed the data to get one row per station per year,
and then to get one row per county per year. Finally, we measured the change in
each county over time to obtain a dataset with one row per county.
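The sequence of reshaping steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the project: the record layout, the station and county names, and the use of a least-squares slope as the measure of "change over time" are all hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily records: (station_id, county, year, day_of_year, temp_c).
# The real GHCN files are far larger and have a different layout.
daily = [
    ("S1", "Butler", 2000, 1, 10.0),
    ("S1", "Butler", 2000, 2, 12.0),
    ("S2", "Butler", 2000, 1, 14.0),
    ("S1", "Butler", 2001, 1, 12.0),
    ("S2", "Butler", 2001, 1, 14.0),
    ("S1", "Butler", 2002, 1, 13.0),
    ("S2", "Butler", 2002, 1, 15.0),
]

def station_year_means(rows):
    """Step 1: collapse station-day rows to one value per station per year."""
    groups = defaultdict(list)
    for station, county, year, _day, temp in rows:
        groups[(station, county, year)].append(temp)
    return {key: mean(vals) for key, vals in groups.items()}

def county_year_means(station_year):
    """Step 2: average the stations in each county to one value per county per year."""
    groups = defaultdict(list)
    for (_station, county, year), temp in station_year.items():
        groups[(county, year)].append(temp)
    return {key: mean(vals) for key, vals in groups.items()}

def county_trends(county_year):
    """Step 3: one row per county, here the least-squares slope in degrees per year.

    The slope is an assumed stand-in for whatever change measure was actually used.
    """
    series = defaultdict(list)
    for (county, year), temp in county_year.items():
        series[county].append((year, temp))
    trends = {}
    for county, points in series.items():
        if len(points) < 2:
            continue  # a trend needs at least two years of data
        y_bar = mean(y for y, _ in points)
        t_bar = mean(t for _, t in points)
        denom = sum((y - y_bar) ** 2 for y, _ in points)
        trends[county] = sum((y - y_bar) * (t - t_bar) for y, t in points) / denom
    return trends

county_year = county_year_means(station_year_means(daily))
trends = county_trends(county_year)
print(trends)
```

Keeping each step as its own function mirrors the incremental approach: any single stage can be rerun or replaced without touching the others.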
A lot of this project involved thinking about the next step
forward from where we were standing. Over the course of many months, we
incrementally changed the dataset to obtain the final product. Some steps were
repeated multiple times, and some steps were simplified or reduced to fit
better with our end goal. For example, we originally processed the data for the
last 120 years, and then only decided to use the last 50. Of all the summary
statistics we calculated, only two of them turned out to be useful. I also
reran the analysis to include precipitation measures late in the analysis
process. It can be frustrating to feel like you’ve wasted your time on unused
data processing or analysis, but it’s all part of the journey towards the final
product.
Statistics is all about the story.
I think one of the most frustrating parts of any data
project is that the parts of the project you dedicate the most time to are
generally not what you share with the audience. Instead, the success of the
project is judged based on the results of your project and how you deliver them. After processing the climate data, I joined
the final results to global warming survey data collected by Yale in 2019. Using the full dataset, I built a series of maps
and graphs to examine the relationship between observed climate change and
public perception of climate change. Unfortunately, there wasn’t a strong
correlation between the two sets of variables.
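As a rough illustration of the join-then-correlate step, the sketch below matches two county-keyed tables and computes a Pearson correlation. The county labels and numbers are invented for the example, and Pearson's r is an assumption; the post does not say which correlation measure was used.

```python
from math import sqrt

# Hypothetical county-level inputs keyed by a county identifier.
warming = {"A": 0.2, "B": 0.4, "C": 0.6, "D": 0.8}     # temperature change
belief = {"A": 60.0, "B": 70.0, "C": 65.0, "E": 55.0}  # survey percentage

def inner_join(left, right):
    """Keep only counties present in both tables, like an inner join."""
    keys = sorted(left.keys() & right.keys())
    return [(k, left[k], right[k]) for k in keys]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

joined = inner_join(warming, belief)
r = pearson([w for _, w, _ in joined], [b for _, _, b in joined])
print(joined, r)
```

Counties that appear in only one table (D and E here) drop out of the join, so it is worth checking how many rows survive before reading anything into the correlation.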
While my research question was answered, the end of the
story just wasn’t satisfying. So rather than end the story there, I looked into
several demographic factors to see if any of them were highly correlated with
belief in climate change. Unsurprisingly, political ideology (represented by
data from the 2016 presidential election) had the strongest correlation with
belief in climate change. This extra step created a much better ending to my
story. Rather than ending with a disappointing lack of correlation, I was able
to construct a narrative about how Americans are more guided by political
ideology and belief than empirical data, a suggestion that is especially
relevant today in a world where public opinion doesn’t always align with
scientific findings.
Conclusion
I want to close with my biggest life lesson from this project, which is to always say yes to opportunities for practical experience in a field you’re passionate about. When I first joined the DataExpo team as an observer my sophomore year, I struggled to complete basic coding tasks. However, my experience shadowing that year not only led to my successes with this year’s project, but it also resulted in the opportunity to work more closely with members of the statistics faculty and students, in addition to getting me involved with CADS. I really appreciate Dr. Tom Fisher, Dr. Karsten Maurer, Matthew Snyder, Alison Tuiyott and Ben Schweitzer for making the DataExpo such a great experience. All of these opportunities have greatly improved my technical skills and prepared me for life after college. As I look back on my time at Miami, I will always be grateful for these experiences and for all of the people who helped me along the way.
My first experience as a CADS intern was a typical one. I
worked with two other students and a faculty advisor on a project for a
corporate client. The project followed the expected process: an
introduction to the problem, lots and lots of industry research, applying
analytical solutions to said problem, and then a final recommendation and
presentation to the client. Once finished, I was excited and looking forward to
a similar experience the following semester. However, before the semester
ended, my team’s faculty advisor hinted that my skills might be put to the test
next semester on a project with the chemistry department. Little did I know
that this opportunity would teach me more about chemistry, analytics, data
science, and the intersection of them, than I could have ever imagined.
For a little background, I am a current senior at Miami
University where I am studying finance and business analytics. I was introduced
to CADS and knew this was something I wanted to be involved with. It was an
opportunity to apply the skills and knowledge I had gained in the classroom,
while developing new ones, to fun and interesting projects. When one of my
professors, Dr. Weese, mentioned she wanted me to be involved in a project for
Miami University’s chemistry department, I was immediately intrigued. Never did
I imagine I could apply my skills to a problem faced by my university’s
chemists. That is, until our analytics team met with the chemistry team and we
all realized the amount of untapped potential this partnership held. This
partnership consisted of undergraduate students, graduate students, PhD
candidates, professors, and even a department head from the Chemistry and
Information Systems & Analytics departments at Miami University.
When you think about it, much of the typical chemist’s work
is repetitive and manual. Compounds are researched, tested, and experimented
with all by hand for the most part. Computers and robots can automate some of
this if you have enough resources, but the point is that most every part of
this process normally has to be done by hand, either a human’s or a robot’s.
The advent of machine learning and artificial intelligence has already
transformed many industries by eliminating, or at the minimum reducing, many of
these tedious tasks. Thanks to Dr. Zishuo “Toby” Cheng and his curious mind,
the question “Why can’t we apply machine learning to our beta lactamase
inhibitor research?” was posed. What I loved most about this proposition is
that nobody had tried anything exactly like it before. There was every reason
for this partnership to work; we had the data, the smarts, and the desire, just
nothing to go off of. However, this wasn’t an issue or disadvantage at all;
instead, it forced us to think outside the box and think of every possible way
to do something and see what worked and, many times, what didn’t. Not that
having something to model after is ever bad, but it’s just human tendency to
latch on to what was done before as the correct way. In our work, just about
everything we did was “right”, only because there wasn’t anything to prove
otherwise.
After several months being on the job, I think it is safe to
say this partnership has been a huge success. By throwing numerous data science
and analytical methods at the problem, we were able to narrow the search
space of unknown compounds from over 70,000 to just 3,000. When you consider
how in a normal situation every one of these 70,000 compounds would have to be
tested, it quickly becomes clear how important this feat was. No longer do
you have to take a complete shot in the dark and hope you find a good compound;
instead, you are able to look through only the compounds that have the highest
probability of being successful per our models. Pending the results of the high
throughput screening of these 3,000 compounds, we could eventually apply our
analyses and models to a database of millions of unknown compounds.
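The idea of screening only the most promising compounds can be sketched as a simple top-k selection over model scores. Everything here is hypothetical: the compound IDs, the probabilities, and the cutoff are made up, and the post does not describe how the team's models actually scored or ranked compounds.

```python
import heapq

# Hypothetical model output: predicted probability that each compound
# is a good inhibitor.
scores = {
    "cmpd-001": 0.91,
    "cmpd-002": 0.12,
    "cmpd-003": 0.67,
    "cmpd-004": 0.05,
    "cmpd-005": 0.88,
}

def shortlist(predicted, k):
    """Return the k compound IDs with the highest predicted probability."""
    return heapq.nlargest(k, predicted, key=predicted.get)

top = shortlist(scores, 2)
print(top)
```

At the scale described above, the same call would take a 70,000-entry table and a k of 3,000, leaving only the shortlist for lab screening.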
It was in moments like these that the partnership really shone.
As analytics students with no background in chemistry more advanced than high
school chemistry, all of our results meant little to nothing to us. However, with
our knowledge of what the numbers were showing and the chemists’ knowledge of
what the numbers represented, we were able to uncover some incredible insights.
For example, we strategically employed models with some form of
interpretability that gave insight into what features of a compound make a good
beta lactamase inhibitor. A couple of the most important variables made sense
and were already well known as important features to the chemists. However,
there were several features of good inhibitors according to our models that had
never been considered before. The chemists determined these features still made
logical sense, but simply had not been seen in past research. Although it
isn’t the discovery of the next greatest beta lactamase inhibitor yet, it is
insights like these that confirm we are on the right track and give a glimpse
into the incredible potential for interdisciplinary teams like ours.
What’s next? Our team will continue to explore better methods for supporting the chemists’ research of beta lactamase inhibitors, hopefully leading to further insights into these important compounds. On a much larger scale, I hope to see many more partnerships like this one arise around Miami University. I can imagine successful partnerships with areas all over Miami. Thanks to CADS, it isn’t a matter of if these partnerships will happen; it’s simply a matter of when.