By Rob Mitchum // May 6, 2014
As computers grow more and more powerful, they become more useful as collaborators for scientists studying a diverse range of topics. Computers are good at repetitious tasks that would drive a human researcher mad, accomplishing tasks that are akin to not just finding a needle in a haystack, but painstakingly cataloging each piece of hay along the way. It may be a stretch to call this work “intelligence,” but as computational methods get better at classifying, analyzing, and predicting information, the lines between human and automated insight blur.
To highlight these advances and their relevance to disciplines ranging from science to the humanities, the Knowledge Lab and the Computation Institute started the Computational Intelligence talks, which will continue through the rest of 2014. The series fits snugly within the Knowledge Lab’s mission to apply the latest computational techniques, such as text mining and machine learning, to the rapidly growing stores of digital information in order to better understand the history and future of knowledge creation. As Knowledge Lab director James Evans wrote in Science last year, algorithmic “robots” may be the scientists of the future, and Computational Intelligence speakers will preview these powerful methods.
The first two Computational Intelligence talks, held at the CI’s UChicago location last week, provided glimpses of what is newly possible with these advances. Dashun Wang of IBM talked about predicting the future: how many citations a new research paper will accumulate, and more broadly forecasting the success of a scientist’s career. Yuening Hu of the University of Maryland discussed how to organize the past, using topic modeling methods to make sense of huge document archives.
A Crystal Ball for Citations
In his talk on quantifying long-term scientific impact, Wang drew a contrast between the complex mathematical models used to predict natural events, such as hurricanes, and the largely “gut feeling” methods used in hiring for jobs. While scientists have figured out the behavior of severe storms well enough to reasonably predict their path, the art of predicting who will have a successful career from their past accomplishments remains an inexact science. In science, even singling out which journal articles will go on to be important to their field remains difficult, with crude measures such as journal “impact factor” failing to predict the impact of individual papers.
So Wang set out to build a mathematical model that could describe the dynamics of how often papers are cited over time, to try to find factors that could be used in predictions. When he charted the citation rates of papers from the archives of the American Physical Society and Web of Science, he found a noisy “mess,” with papers following a wide variety of citation patterns. Some articles enjoyed an early burst of citations before falling off to almost nothing, while others (some “truly outstanding exceptional papers,” Wang said) were largely ignored for years before experiencing a late burst of recognition.
Despite this variability, Wang designed a predictive model for future citations using just three factors: preferential attachment (more visible papers tend to be cited more), aging (a decline in citations over time), and novelty — how different the paper is from other papers. Given five years of citations for a paper, the model could predict with high accuracy its long-term future impact. The projections also held good news for scientists worried about rejections from the top journals, Wang said — as time goes on, the difference between journals grows smaller and smaller for a paper of a given fitness.
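To make the interplay of those three factors concrete, here is a toy sketch in Python. It is not the model presented in the talk: the log-normal aging curve, the multiplicative way the factors are combined, and the parameters `mu`, `sigma`, and `m` are all assumptions chosen for illustration.

```python
import math

def aging(t, mu=1.0, sigma=1.0):
    """Assumed log-normal aging curve: attention rises, peaks, then fades."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))

def simulate_citations(novelty, years=30, m=30.0, steps_per_year=100):
    """Euler-integrate dc/dt = novelty * (c + m) * aging(t).

    (c + m) is the preferential-attachment term: papers that already have
    citations attract more. m is a hypothetical baseline visibility.
    Returns cumulative citations at the end of each year.
    """
    dt = 1.0 / steps_per_year
    c = 0.0
    yearly = []
    for step in range(1, years * steps_per_year + 1):
        t = step * dt
        c += novelty * (c + m) * aging(t) * dt
        if step % steps_per_year == 0:
            yearly.append(c)
    return yearly

if __name__ == "__main__":
    for novelty in (0.5, 1.0, 2.0):
        curve = simulate_citations(novelty)
        print(novelty, round(curve[4], 1), round(curve[-1], 1))
```

In this sketch, a paper with higher `novelty` ends up with more citations, and every curve flattens out once the aging term has decayed, which is the kind of saturation that makes long-term totals predictable from the first few years.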
As for his own future, Wang hopes to expand his findings to predicting the overall career path of scientists and other professionals, studying the impact of moving between institutions on a scientist’s success, and examining the dynamics of virality in social media. While Wang didn’t claim there would ever be a 100% success rate for such predictions, he compared the work to advances that have raised the probability of success in fields such as oil drilling.
“We may never find the recipe for success, but we may be able to give you some theories backed with a real data set that give meaning to data to help you improve,” Wang said. “For innovators and entrepreneurs, we can give you a more predictable success rate.”
Tandem Text-Mining By Humans and Computers
While Wang ended with an example of how computers may help humans, Yuening Hu’s talk largely focused on how human feedback can improve a popular computational method: topic modeling. Commonly used for machine learning and text mining, topic modeling analyzes a large corpus of text and organizes the words into themed “topics” based on how often they appear. So if a document contains a high frequency of words such as “insurance,” “hospitals,” “doctors,” and “mortality,” topic modeling would likely classify it as being about health care.
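The health care example can be sketched in a few lines of Python. This is a simplification: real topic models learn the word groups from the corpus itself, whereas the `TOPICS` dictionary here is hand-written and hypothetical, and the "weather" topic is added only for contrast.

```python
import re
from collections import Counter

# Hypothetical, hand-written topics; a real topic model would infer these.
TOPICS = {
    "health care": {"insurance", "hospitals", "doctors", "mortality"},
    "weather": {"hurricanes", "storms", "forecast", "rainfall"},
}

def score_topics(text, topics=TOPICS):
    """Return, per topic, the fraction of the text's words that belong to it."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(words.values()) or 1  # avoid dividing by zero on empty text
    return {
        name: sum(words[w] for w in vocab) / total
        for name, vocab in topics.items()
    }

def best_topic(text, topics=TOPICS):
    """Pick the topic whose words appear most frequently in the text."""
    scores = score_topics(text, topics)
    return max(scores, key=scores.get)

if __name__ == "__main__":
    doc = "Doctors and hospitals negotiate with insurance firms as mortality falls."
    print(best_topic(doc))
```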
What sounds relatively straightforward for one document gets much more complicated when applied to the huge torrent of data published on Twitter or Facebook every second. In these environments, topic modeling works pretty well as a way of sorting out the messages or documents that a researcher is interested in for later analysis. Still, there are flaws, and the pure statistical methods that computer algorithms use to determine topics don’t always make decisions that would be obvious to humans, leading to some illogical topics.
So Hu developed a method called “interactive topic modeling,” which builds human input into the topic modeling process. After the topic modeling algorithm sorts a text, a human user has a chance to review the topics it created and add or subtract words as they see fit, through a simple user interface. That input is then taken into account for another round of topic modeling, leading to another round of human review, and so on, until the user is satisfied with the results.
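The review-and-refit cycle can be outlined as a simple loop. The Python sketch below shows only the control flow, under stated assumptions: `fit_topics` and `review` are hypothetical placeholders for the topic modeling algorithm and the user interface, and how the feedback is actually fed back into the model is not described here.

```python
def apply_feedback(topics, feedback):
    """Return a copy of the topics with the user's additions and removals applied."""
    updated = {name: set(words) for name, words in topics.items()}
    for name, (add, remove) in feedback.items():
        updated.setdefault(name, set())
        updated[name] |= set(add)
        updated[name] -= set(remove)
    return updated

def interactive_loop(fit_topics, documents, review, max_rounds=10):
    """Alternate model fitting and human review until the user has no changes.

    fit_topics(documents, constraints) -> {topic name: set of words}
    review(topics) -> {topic name: (words_to_add, words_to_remove)},
                      or an empty dict when the user is satisfied.
    Both callables are hypothetical stand-ins.
    """
    constraints = {}
    topics = fit_topics(documents, constraints)
    for _ in range(max_rounds):
        feedback = review(topics)
        if not feedback:
            break  # user is satisfied with the current topics
        constraints = apply_feedback(topics, feedback)
        topics = fit_topics(documents, constraints)
    return topics
```

The `max_rounds` cap is a design choice of this sketch, so that the loop ends even if the reviewer keeps requesting changes.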
In early studies, Hu and colleagues found slight improvements to categorization accuracy with the addition of human input, and they met their user interface goals of creating a system that was simple, flexible, fast, and smart. Applications of this approach could help create better translation algorithms, as users fluent in multiple languages can verify whether topic modeling is appropriately matching words and phrases from two languages to create better, more nuanced translations where other methods fail.
For us humans, the success of Hu’s method is perhaps reassuring that we still have some role to play in science. But as computers get better and better at mimicking intelligence, thinking the way their users do, more and more of these time-consuming tasks can be outsourced to our CPU lab assistants, freeing up humans to make discoveries even faster.
[Image by Alejandro Zorrilal Cruz, via Wikimedia Commons]