Monthly Archives: October 2020

Insights from winning ASA’s DataExpo

When I was a prospective student touring colleges, I assumed research always involved test tubes and lab rats. However, as I reflect on the research project I completed this year, I am happy to say I did not spend any time in a lab. Instead, all I needed was my computer to participate in the DataExpo, a data science competition sponsored by the American Statistical Association (ASA).

Each year, students are given a government dataset and guiding questions, and then are expected to develop their own research project and present findings to judges at a statistics conference in the summer. This year, participants were asked to analyze data from the Global Historical Climatology Network, which contains weather records for the entire world since the 18th century. My project examining the relationship between public perception of climate change and county-level weather trends in the United States won first place in the competition. I learned a lot about data-driven research projects through the DataExpo and thought I would share some of my insights from the process.

  • Set boundaries to avoid becoming overwhelmed by an open-ended project.

My original idea for this project was to look at the social impact of climate change around the world. This is clearly very different from my final research question regarding public perception of climate change in the United States. Why did my research question change so drastically? It was important to narrow the scope of my analysis to something specific and manageable.

In order to analyze my original question, I would have needed to define social impact as it relates to climate change, find data that fit my definition of social impact, and then develop a method for quantifying social impact based on the data I had gathered. All of this would have been in addition to a similar process of quantifying climate change around the world with the GHCN data. For a full-time college student with about 6 months to complete the project, this was not feasible.

Rewriting my research question not only made my life easier, but it helped me create a better and more compelling story. For example, narrowing the scope of the analysis to the United States in the past 50 years made sense from both a storytelling and a data quality perspective. The United States was the best represented country in the dataset, and measurements are more accurate over the past 50 years as opposed to the past 100 years. I also knew the audience would almost entirely consist of people from the United States, and many middle-aged and older viewers would have been alive for most if not all of the period of analysis.

  • Recognize that performing analysis is a long and iterative process.

Even after I narrowed the scope of the analysis, the data processing for this project presented a significant challenge. I initially had 50 datasets (one per year), each with one row per weather station, per day. I needed to manipulate this data to get a new dataset with one row per county where the columns captured county-level temperature change over the past 50 years. Recognizing there was a lot of work to be done, my advisor, Dr. Tom Fisher, and I broke the process down into a series of smaller steps. We transformed the data to get one row per station per year, and then to get one row per county per year. Finally, we measured the change in each county over time to obtain a dataset with one row per county.
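For readers who like to see the shape of a pipeline in code, here is a minimal Python sketch of the three reshaping steps described above. It is not the code from the project: the station IDs, county name, and temperatures are invented, the helper names `average_by` and `county_change` are hypothetical, and measuring change as "last year minus first year" is an assumption (a fitted trend line would be another reasonable choice).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily records: (station_id, county, year, daily mean temperature).
daily = [
    ("S1", "Butler", 1970, 10.0), ("S1", "Butler", 1970, 12.0),
    ("S2", "Butler", 1970, 11.0), ("S2", "Butler", 1970, 13.0),
    ("S1", "Butler", 2019, 12.0), ("S1", "Butler", 2019, 14.0),
    ("S2", "Butler", 2019, 13.0), ("S2", "Butler", 2019, 15.0),
]

def average_by(rows, key_fn, value_fn):
    """Group rows by key_fn(row) and average value_fn(row) within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(value_fn(row))
    return {key: mean(values) for key, values in groups.items()}

# Step 1: one row per station per year, keyed by (station, county, year).
station_year = average_by(daily, lambda r: (r[0], r[1], r[2]), lambda r: r[3])

# Step 2: one row per county per year, keyed by (county, year).
county_year = average_by(
    station_year.items(),
    lambda item: (item[0][1], item[0][2]),
    lambda item: item[1],
)

def county_change(county_year):
    """Step 3: one row per county, as latest-year value minus earliest-year value."""
    by_county = defaultdict(dict)
    for (county, year), temp in county_year.items():
        by_county[county][year] = temp
    return {c: years[max(years)] - years[min(years)] for c, years in by_county.items()}

change = county_change(county_year)
print(change)
```

Keeping each step as its own small table makes it cheap to rerun just one stage when the plan changes, which happened often in this project.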

A lot of this project involved thinking about the next step forward from where we were standing. Over the course of many months, we incrementally changed the dataset to obtain the final product. Some steps were repeated multiple times, and some steps were simplified or reduced to fit better with our end goal. For example, we originally processed the data for the last 120 years, and then decided to use only the last 50. Of all the summary statistics we calculated, only two turned out to be useful. I also reran the analysis late in the process to include precipitation measures. It can be frustrating to feel like you’ve wasted your time on unused data processing or analysis, but it’s all part of the journey towards the final product.

  • Statistics is all about the story.

I think one of the most frustrating parts of any data project is that the parts of the project you dedicate the most time to are generally not what you share with the audience. Instead, the success of the project is judged on its results and how you deliver them. After processing the climate data, I joined the final results to global warming survey data collected by Yale in 2019. Using the full dataset, I built a series of maps and graphs to examine the relationship between observed climate change and public perception of climate change. Unfortunately, there wasn’t a strong correlation between the two sets of variables.

While my research question was answered, the end of the story just wasn’t satisfying. So rather than end the story there, I looked into several demographic factors to see if any of them were highly correlated with belief in climate change. Unsurprisingly, political ideology (represented by data from the 2016 presidential election) had the strongest correlation with belief in climate change. This extra step created a much better ending to my story. Rather than ending with a disappointing lack of correlation, I was able to construct a narrative about how Americans are more guided by political ideology and belief than empirical data, a suggestion that is especially relevant today in a world where public opinion doesn’t always align with scientific findings.
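The join-then-correlate step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the county labels and every number below are invented toy values rather than results from the project, the variable names are hypothetical, and Pearson correlation is assumed as the measure since the post does not say which one was used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented toy values keyed by county; not results from the project.
temp_change = {"A": 0.4, "B": 1.1, "C": 0.7, "D": 0.9}
belief_pct = {"A": 70.0, "B": 55.0, "C": 62.0, "D": 66.0, "E": 50.0}

# Inner join: keep only counties that appear in both tables.
joined = {c: (temp_change[c], belief_pct[c]) for c in temp_change if c in belief_pct}

r = pearson([t for t, _ in joined.values()], [b for _, b in joined.values()])
print(f"{len(joined)} counties joined, r = {r:.2f}")
```

Swapping `temp_change` for a demographic column, such as a vote share, would repeat the same comparison for each candidate factor.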

Conclusion

I want to close with my biggest life lesson from this project, which is to always say yes to opportunities for practical experience in a field you’re passionate about. When I first joined the DataExpo team as an observer my sophomore year, I struggled to complete basic coding tasks. However, my experience shadowing that year not only led to my successes with this year’s project, but it also resulted in the opportunity to work more closely with members of the statistics faculty and students, in addition to getting me involved with CADS. I really appreciate Dr. Tom Fisher, Dr. Karsten Maurer, Matthew Snyder, Alison Tuiyott and Ben Schweitzer for making the DataExpo such a great experience. All of these opportunities have greatly improved my technical skills and prepared me for life after college. As I look back on my time at Miami, I will always be grateful for these experiences and for all of the people who helped me along the way.

About the Author

Lydia Carter is a senior at Miami University majoring in Statistics and Analytics. She has interned for CADS since Fall 2019.

Chemists and Analytics, A Surprising but Fruitful Partnership

My first experience as a CADS intern was a typical one. I worked with two other students and a faculty advisor on a project for a corporate client. The project followed the expected process: an introduction to the problem, lots and lots of industry research, applying analytical solutions to said problem, and then a final recommendation and presentation to the client. Once finished, I was excited and looking forward to a similar experience the following semester. However, before the semester ended, my team’s faculty advisor hinted that my skills might be put to the test next semester on a project with the chemistry department. Little did I know that this opportunity would teach me more about chemistry, analytics, data science, and the intersection of them than I could have ever imagined.

For a little background, I am a current senior at Miami University, where I am studying finance and business analytics. When I was introduced to CADS, I knew it was something I wanted to be involved with. It was an opportunity to apply the skills and knowledge I had gained in the classroom, and to develop new ones, on fun and interesting projects. When one of my professors, Dr. Weese, mentioned she wanted me to be involved in a project for Miami University’s chemistry department, I was immediately intrigued. I had never imagined I could apply my skills to a problem faced by my university’s chemists. That changed when our analytics team met with the chemistry team and we all realized how much untapped potential this partnership held. The partnership consisted of undergraduate students, graduate students, PhD candidates, professors, and even a department head from the Chemistry and Information Systems & Analytics departments at Miami University.

When you think about it, much of the typical chemist’s work is repetitive and manual. Compounds are researched, tested, and experimented with by hand for the most part. Computers and robots can automate some of this if you have enough resources, but the point is that almost every part of this process normally has to be done by hand, either a human’s or a robot’s. The advent of machine learning and artificial intelligence has already transformed many industries by eliminating, or at least reducing, many of these tedious tasks. Thanks to Dr. Zishuo “Toby” Cheng and his curious mind, the question “Why can’t we apply machine learning to our beta lactamase inhibitor research?” was posed. What I loved most about this proposition is that nobody had tried anything exactly like it before. There was every reason for this partnership to work; we had the data, the smarts, and the desire, just nothing to go off of. However, this wasn’t a disadvantage at all; instead, it forced us to think outside the box, try every possible approach, and see what worked and, many times, what didn’t. Not that having something to model after is ever bad, but it’s a human tendency to latch on to what was done before as the correct way. In our work, just about everything we did was “right”, only because there wasn’t anything to prove otherwise.

After several months on the job, I think it is safe to say this partnership has been a huge success. By throwing numerous data science and analytical methods at the problem, we were able to narrow the search space of unknown compounds from over 70,000 to just 3,000. When you consider that in a normal situation every one of these 70,000 compounds would have to be tested, it quickly becomes clear how important this feat was. No longer do you have to take a complete shot in the dark and hope you find a good compound; instead, you are able to look through only the compounds that have the highest probability of being successful per our models. Pending the results of the high-throughput screening of these 3,000 compounds, we could eventually apply our analyses and models to a database of millions of unknown compounds.
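To make the idea of "only the compounds with the highest probability" concrete, here is a small Python sketch of shortlisting by model score. It is not our team's actual method: the compound IDs and probabilities are invented, the function name `shortlist` is hypothetical, and a simple "keep the top k" cutoff is an assumption about how such a shortlist could be drawn.

```python
import heapq

def shortlist(scores, k):
    """Return the k compound IDs with the highest predicted probability, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [compound_id for compound_id, _ in best]

# Invented model outputs: compound ID -> predicted probability of being a good inhibitor.
scores = {"cmpd-001": 0.12, "cmpd-002": 0.91, "cmpd-003": 0.47, "cmpd-004": 0.78}

top = shortlist(scores, 2)
print(top)
```

With real data, `scores` would hold one entry per unscreened compound and `k` would be set by how many compounds the lab can afford to screen.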

It was at times like these that the partnership really shone. As analytics students with no background in chemistry more advanced than high school chemistry, all of our results meant little to nothing to us. However, with our knowledge of what the numbers were showing and the chemists’ knowledge of what the numbers represented, we were able to uncover some incredible insights. For example, we strategically employed models with some form of interpretability that gave insight into what features of a compound make a good beta lactamase inhibitor. A couple of the most important variables made sense and were already well known to the chemists as important features. However, there were several features of good inhibitors according to our models that had never been considered before. The chemists determined these features still made logical sense, but were simply things not seen in past research. Although it isn’t the discovery of the next greatest beta lactamase inhibitor yet, it is insights like these that validate that we are on the right track and give a glimpse into the incredible potential for interdisciplinary teams like ours.

What’s next? For our team, we will continue to explore better methods for supporting the chemists’ research of beta lactamase inhibitors, hopefully leading to further insights into these important compounds. On a much larger scale, I hope to see many more partnerships like this one arise around Miami University. I can imagine successful partnerships with areas all over Miami. Thanks to CADS, it isn’t a matter of if these partnerships will happen, but simply a matter of when.

About the Author

Mitch Fairweather is a Miami University senior studying Finance and Business Analytics.

From Miami into the workforce… and my experiences along the way

Hi! My name is Sophie Armor. I graduated from Miami in May 2020 with a B.S. in Finance with minors in business analytics and entrepreneurship. Upon graduation, I joined Fifth Third Bank as an associate data scientist in the Decision Science Group (DSG) at the bank.

My Miami Experience 

I moved to Oxford from just an hour away in Cincinnati. Both my mom and my older sister attended Miami, so I was pretty familiar and very excited to eat lots of skippers in four years. In terms of my education, I was always very into math and was drawn to studying finance but had an interest in the growing field of sports analytics that drew me to immediately declare business analytics as one of my minors. 

I started interning at CADS in the first semester of my junior year after taking ISA 291 with Dr. Weese. That first semester, I was able to work on an experiential learning project with a bank. The purpose of the project was to perform a segmentation to identify differentiable, actionable and accessible customer segments. We were then to develop business strategies based on the insights.

We applied factor analysis to the survey data to identify key questions for our analysis before clustering. Our team had the tremendous opportunity to present our analysis and insights to several individuals from the industry team. I would definitely say this played a big part in getting me to where I am today! After this project, I got to work on two more challenging projects with groups of amazing students that allowed me to gain valuable experience with workplace teams.

Getting into certain classes at Miami can be hard… scheduling is stressful. There are several courses I took that I found to be extremely valuable. First off are ISA 291: Applied Regression Analysis and ISA 491: Data Mining. My current role includes conducting advanced analytics by utilizing predictive analytics, machine learning and optimization to deliver insights or develop analytical solutions to achieve business objectives. This is exactly what I learned in class and am able to apply at work. Getting the opportunity to deliver results, especially meaningful results, is crucial. I would also recommend saving code as applicable. At Miami, I primarily used R, while I am now using Python – learning it as I go and spending time on DataCamp. I will say that knowing R was extremely helpful and (mostly) transferable to Python.

My entrepreneurship classes really challenged me and made me grow as a student. ESP 252 is a class I believe everyone should take. The course is centered around leadership. Overall, the courses at Farmer paired with my three semesters as a CADS intern put me on a great path for discovering what I wanted to do with my post grad life. 

Beginning a Career in Data Science 

Like many others, I was unsure what I wanted and what life would be like after graduation. I spent the summer between junior and senior year in Chicago at W.W. Grainger in corporate finance, so not exactly the most comparable experience for my current role, but it was exactly what I needed in an internship. 

I was nervous, intimidated, excited and full of emotion as my start date got closer and closer this past July. My role as an associate data scientist is to assist in making data-driven decisions (using data, statistical analysis, algorithms) throughout the bank. The projects that the Decision Science Group works on provide value and recommendations to support business growth. Am I qualified to take part in decisions that may change how the bank operates or markets to customers? I don’t know.

As a new employee in the data science world, my days consist of working with others on my team to identify a problem with core questions to be answered, creating an analytical plan and then performing the analysis with check-ins with others throughout. The variety of projects that my team works on is very cool! I am learning each day and the Decision Science Group has a great environment for working together while getting any questions answered along the way. 

Working from home brought a new challenge as I transitioned into the working life as a new employee. I had to learn to manage my time as I work from my apartment. There is a lot of independence and a need for self-motivation. I have always put an emphasis on building relationships with those I meet and work with. This was both a challenge and opportunity in my new role as I begin to connect with people both in the Decision Science Group and other departments at the bank. 

Some Advice to Leave You With

Lastly, I want to give some advice (if anyone wants advice from a 22-year-old). 

  • Gain experience wherever you can 
  • Ask questions
  • Network!! 

Love and Honor,

Sophie Armor, Class of 2020