In a world increasingly governed by data, it is critical that data be applied properly. As data-driven methods gain popularity for driving public policy, business operations, hiring, basic science, medical decisions, and virtually every other aspect of our lives, it’s important that people understand the foundations of data science and apply them appropriately. Otherwise, bias, spurious correlations, and other statistical landmines can distort results, with potentially grave human and societal consequences.
Joint Computer Science and Statistics Professor Rebecca Willett helps neuroscientists, physicians, astronomers, climate researchers, and even farmers avoid these missteps and maximize the discovery potential of data. Through a combination of fundamental research and diverse interdisciplinary collaboration, Willett has extended the practice of data science into new fields and toward deeper insights. After faculty positions at Duke University and the University of Wisconsin, she joined UChicago in summer 2018 to both continue her work and help build new data science research and education initiatives.
“I was really excited that the university was investing in data science in a large-scale way,” Willett said. “I thought there was a lot of opportunity for growth and building, and I was enthusiastic to play a central role in those efforts.”
Shortly after her arrival at UChicago, Willett led a multi-university team awarded an NSF grant to find new “El Niño”-like weather patterns, applying data science to growing quantities of climate measurements for improved seasonal forecasting. The project is emblematic of her research portfolio, which features many projects where her group develops new fundamental methodology and theory inspired by challenges faced by domain scientists with difficult, data-intensive questions, including image analysis, signal processing, and machine learning for prediction and optimization.
For example, she recently helped a neuroscience group develop algorithms to segment and parameterize images of neural tissue, in order to test a new method for controlling the growth of stem cells. In another project, she helped develop the image processing methods within a smartphone app farmers can use to quickly measure corn kernels for dairy cow feed and make critical, on-the-fly decisions about harvesting methods to improve cow nutrition.
Other projects help researchers deal with high-dimensional data, where there is an abundance of features associated with each data point. For example, data science methods help avoid false conclusions when linking medical and genetic data, where the sheer scale of possible connections can create misleading correlations.
“A pervasive theme is how to draw reliable conclusions from data,” Willett said, “especially when data are high-dimensional. For instance, we record vast quantities of data about each patient’s health history, including test results, treatments, demographic information, family history, imaging data, genetic information, and physician notes. Such a large number of features makes it difficult to tease out risk factors for health conditions that were previously unrecognized. It becomes even more challenging as we strive to ensure methods are robust to errors in health records, lab tests that were never conducted, or treatments that were never tried.”
“In general, mitigating the challenges associated with high-dimensional data is a key research thrust in data science, and relies upon developing novel geometric representations of data and incorporating physical models as much as possible.”
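The pitfall described above can be illustrated with a small simulation (a hypothetical sketch, not drawn from Willett’s own work): when thousands of features are measured on only a few dozen subjects, some feature will correlate strongly with the outcome purely by chance, even when every feature is statistically independent of it.

```python
import numpy as np

# Hypothetical illustration: many features, few samples, no real signal.
rng = np.random.default_rng(0)
n_samples, n_features = 50, 10_000

X = rng.standard_normal((n_samples, n_features))  # e.g., genetic markers
y = rng.standard_normal(n_samples)                # outcome, independent of X

# Pearson correlation of each feature with the outcome, via z-scores.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corrs = Xc.T @ yc / n_samples

# The strongest observed correlation is far from zero, even though
# no feature has any real relationship with the outcome.
print(f"max |correlation| among {n_features} null features: "
      f"{np.abs(corrs).max():.2f}")
```

This is why naive screening of features against an outcome, without accounting for the number of comparisons, can surface “risk factors” that are nothing but noise.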
Willett’s co-appointment reflects the combination of skills needed to address these questions in a practical way. While many of the methods used to analyze data are steeped in statistics, computer science helps her understand how effective methods can be computed in reasonable time, and what could potentially go wrong.
“For some projects, I have developed novel software and tools that practitioners or researchers in other fields can use on their data,” Willett said. “For others, I have examined methods that are already in use and developed theory to better characterize whether these methods are reasonable, whether there are some pitfalls we should be aware of, and where there may be room for improvement.”
That philosophy aligns with UChicago initiatives to launch new programming and collaborations in data science that combine efforts from the Departments of Statistics and Computer Science, including a course debuting this fall co-taught by department chairs Dan Nicolae and Michael Franklin.
“The fact that CS and Stats work so well together and the unified vision of the two departments means a lot,” Willett said. “I think that’s going to allow us to establish ourselves as a world-class machine learning and data science group.”