By Rob Mitchum // May 23, 2014
Much has been made of the potential of large and complex datasets to revolutionize scientific discovery in life sciences, social science, and the humanities. But what about the ability of data and data-driven tools to disrupt the practice of science itself? Despite the sophisticated methods increasingly used by researchers of all disciplines, many processes critical to scientific progress remain archaic, slow, and paper-based. Given outdated systems for how scientists find new research collaborators, how funding agencies and foundations select and track grantees, and how scientific findings are published and distributed, there are massive opportunities for data to accelerate and improve how science is done around the world.
The promise of these advances fueled the discussion at Information, Interaction, and Influence, a two-day workshop at the University of Chicago co-organized by the Computation Institute and research technology company Digital Science. Researchers, entrepreneurs, software developers, foundation representatives, administrators, and IT professionals came together at Ida Noyes Hall for talks, panels, and discussion about harnessing novel data capabilities to remove common scientific obstacles and accelerate the pace of research and discovery.
“The practice of how knowledge is created has not been studied rigorously, in part because we didn’t have the data,” said Ian Foster, director of the Computation Institute, in his opening remarks. “Increasingly, interactions used to generate knowledge and advance science are digitally mediated, making it more possible to treat science itself as an object of study.”
Towards Open and Reproducible Research
It’s not just scientists adding urgency to this mission. Policymakers and the general public demand more open science, with scientists sharing not just research results, but also the data and software used for projects. In 2013, the White House released a memo and executive order directing federal funding agencies such as the NIH, NSF, and DOE to facilitate public access to data and publications generated by their funded projects.
That’s one of the factors placing science at the “tipping point” of transparency, public participation, and digitization, said Victoria Stodden, an assistant professor of statistics at Columbia University, in her keynote talk. While more scientists are publishing the data behind journal articles, it remains a minority practice, even in quantitative fields. To fully enshrine computational science as a third branch of the scientific method alongside theory and experimentation, it must develop standards comparable to those of the much older branches, Stodden said.
To do so means creating an environment of truly reproducible research, releasing not just data but also the software and algorithms used to find knowledge within that information. Stodden encouraged scientists to look at the open source movement in computer software and media for guidance on how to disseminate data and software without the complications introduced by traditional copyright. These new practices won’t just help other scientists, she said, but should also empower people outside of the traditional boundaries of science.
“Becoming digital is giving us greater transparency, which has obvious implications for improving our own scholarly discourse, but it also opens this up to a much larger group than academics going to conferences,” Stodden said. “We have this notion of crowdsourcing in science, public engagement in science, which are extremely exciting. We have the opportunity to open that pipeline that I think we’re just cusping on.”
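As a concrete, purely illustrative sketch of the kind of provenance a reproducible release might carry (the file names and manifest fields here are assumptions, not any standard Stodden prescribed), a small script can fingerprint a released dataset and record the computing environment alongside the results:

```python
import hashlib
import json
import platform
import sys

def write_manifest(data_path, manifest_path):
    """Record the exact input data and environment behind a result,
    so others can verify they are re-running the same analysis."""
    with open(data_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "data_file": data_path,
        "data_sha256": digest,                      # fingerprint of the released dataset
        "python_version": sys.version.split()[0],   # interpreter used for the analysis
        "platform": platform.platform(),            # OS the results were produced on
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

# Example: fingerprint a small dataset released alongside a paper.
with open("results.csv", "w") as f:
    f.write("x,y\n1,2\n")
manifest = write_manifest("results.csv", "MANIFEST.json")
print(manifest["data_sha256"][:8])
```

Publishing such a manifest next to the data and code lets a reader confirm, byte for byte, that they hold the same inputs the authors used.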
The Potential and Obstacles of Data-Driven Science
At the center of all these changes in science are the scientists themselves. In a morning panel on the second day of the workshop, a collection of Computation Institute researchers discussed how they use data in their work, and the many different disciplines data-driven techniques now touch upon. Researchers in medicine, genomics, climate science, social science, and developers of tools for high-performance computing and data management all spoke about the transformative changes that data has brought to their work.
But beyond the promise of Big Data and supercomputing lie deeply ingrained obstacles in the process of science that remain unresolved. Samuel Volchenboum, director of the Center for Research Informatics and a CI fellow, spoke about the critical importance of governance — establishing policies on the storage, delivery, and use of data — in creating a clinical data warehouse. CI senior fellow Mike Wilde, of the Swift parallel scripting project, echoed Victoria Stodden’s keynote in emphasizing the importance of a “model of computation,” standardizing how computation is used in research for easier reproducibility and expansion.
Cultural barriers also prevent the full utilization of data in science. James Evans, CI Senior Fellow and Director of Knowledge Lab, talked about negotiations with book and journal publishers over how to conduct research on their corpus of text without violating copyright and other legal concerns. Evans said it was important to establish Knowledge Lab as a neutral “Switzerland,” whose research adds value back to publishers instead of hurting their business model.
Alison Brizius, executive director of the Center for Robust Decision-Making on Climate and Energy Policy (RDCEP), described the struggle of multidisciplinary research in a transitional time when different fields have different perspectives on data.
“People in multiple disciplines with multiple cultures have different perspectives on what data sharing and access mean, what a model is, and might have incentives that are very misaligned towards keeping their data proprietary,” Brizius said. “A lot of the potential for our research is built on sharing taking place, so how can we align those incentives and get people to use these tools and share their work?”
Solving Problems Through Software
The market for building those solutions is just starting to grow. Small startup companies, many founded by academics or former scientists, race to provide the killer application that will realize the potential of digitization, open science, research impact metrics, and other data-driven advances. Workshop co-organizer Digital Science invests in the most promising of these companies, and brought a collection of their founders to the meeting to talk about their experience and their products.
Many of the companies designed their product around giving researchers, funders, and institutions better information about the impact of research and better tools for managing publications. Altmetric, a company founded by bioinformatician Euan Adie, expands impact measures beyond citation rate to include newer forms of scientific communication, such as blog posts, social media discussions, news articles, and government policy documents. Figshare provides publication services for researchers and institutions to post and share data sets, while ReadCube gives scientists new and better ways to manage the massive torrent of journal articles.
On the other side, companies such as UberResearch, Symplectic, and Wellspring Worldwide help funders and institutions track where grant money is going, for easier reporting and tracking of national and global research trends. A panel featuring several representatives from foundations funding scientific research emphasized their need for better metrics to measure the impact of their grants and to make smarter decisions about who to fund.
Even companies with a more external focus — such as the data-driven patient health-monitoring service of Qualia Health, founded by UChicago emergency physician David Beiser — must fold themselves into the data ecosystem of science and institutions. Beiser’s idea was to use the data streams generated by remote sensors and mobile computing to construct a new view of health and disease. But making his model reality required time-consuming efforts to obtain data that hospitals are not used to releasing, find alternative sources of funding in the entrepreneurial world instead of federal agencies, and justify the time taken away from his day job.
“I told my division head, this is really something you have never heard of, but I can assure you this will move our project forward. I don’t know this is going to necessarily directly result in grant funding, but I do know we are starting to create new knowledge. These data streams coming in through our platform now are data we haven’t had before as clinicians,” Beiser said. “This data will change the way we practice medicine.”
Smarter Profiles and Serendipitous Science
One deceptively simple way that research institutions can open themselves up to the scientific community and beyond is new, smarter profile systems. The traditional online university directory is built on a static template offering little more than a photo, a brief biography, and a list of publications. But using tools such as VIVO and Harvard’s Profiles, university IT departments are building deeper research networks that are customizable, searchable, and capable of delivering advanced metrics to scientists and administrators.
These new platforms are more than just online phonebooks, said members of a panel on research networking moderated by CI research scientist and conference co-organizer Tanu Malik, but can help facilitate new, serendipitous collaborations within and between institutions.
“The maximum number of research opportunities are possible when we can maximize the number of people discovering or engaging with our research,” said Simon Porter of the University of Melbourne, which recently launched its Find an Expert website built with VIVO.
Similar profile systems at Harvard and UCSF (and UChicago, whose system is in its early stages) have given researchers the ability to list new types of information on their page, such as social media and videos, or non-publication achievements such as slides from talks, grants, cases argued before the Supreme Court, or asteroid discoveries. They also use network-building techniques to show where an individual researcher fits within the broader institutional ecosystem, based on shared research areas.
Administrators can also use these researcher webs to identify institutional strengths and weaknesses, pull together cross-departmental initiatives, and make informed decisions about promotions and tenure. In a subsequent panel on “the administrative perspective,” representatives from the University of Illinois at Chicago and UChicago talked about how profiles and other efforts to automate and digitize research information make their work easier — even as they discover new challenges for the management and preservation of large datasets. Elisabeth Long, Associate University Librarian for Digital Services at UChicago, talked about the library’s role in this new world where archiving and curating goes far beyond collecting journal articles.
“We are very interested in thinking in really broad terms about scholarly output,” Long said. “Beyond formal publications, from monograph to journal article to data to software to talks to websites. While there are systems and solutions that capture part of that, what we’re really looking at is something to solve these problems for a much broader set of material, and how we can preserve it long term.”
[Photos by Caitlin Trasande/Digital Science]