By Rob Mitchum // April 23, 2014
The importance of moving data can’t be overestimated. In a time of increasingly far-flung scientific collaborations, advanced instruments, and rapidly ballooning datasets, getting data from point A to point B can be the primary brake on the speed of discovery. But once the data reaches its destination, there’s still much to be done, and many of those tasks can be accelerated using cloud-based cyberinfrastructure: sharing data securely, analyzing it, even publishing datasets for other scientists to use and build upon.
That expanded vision of research data management was the narrative that threaded through GlobusWorld 2014 and its keynote talk, delivered by Globus co-founder and Computation Institute director Ian Foster. The Globus Online service (now simply “Globus”) is already responsible for the movement of over 43 petabytes in 3 billion files by over 15,000 users. But the project is now building upon that success to target some of these further data challenges. From expanding the data transfer services to support campus and even national cyberinfrastructures to building cloud platforms for specific disciplines, such as genomics, Globus is broadening its scope.
“These are things that I think really exemplify what we want to support: allowing a brilliant scientist in their own lab without their own computing infrastructure to perform transformative work in a really accelerated manner,” Foster said.
One exciting area of expanded Globus use is on academic campuses: some form of Globus services is now used on at least 85 U.S. campuses, Foster said. Some institutions, such as Indiana University, Michigan State University, the University of Exeter, and the National Energy Research Scientific Computing Center (NERSC), have signed up for the new Globus Provider plan, which allows them to distribute endpoints and sharing subscriptions to campus users. The enhanced transfer and sharing capabilities allow campus IT services to help their users comply with policies for archiving and working on research data that were previously difficult to satisfy.
“We’ve got to the point where a couple of years ago this was a new technology that people were exploring, but at this point it’s become a best practice,” Foster said.
Another current culture shift in science is the push for publishing an experiment’s entire dataset alongside the traditional journal article describing its results. Such open data would allow for increased transparency, faster replication of the findings by other scientists, and the opportunity for other researchers to discover the data and build upon the results. But just as moving large datasets between computers is not a trivial matter, finding a reliable and simple method for publishing experimental data can be difficult, slowing acceptance of this new approach.
“I believe strongly that making it possible for people to move their data around and automate sharing is a very important step towards data accessibility,” Foster said. “If it’s easy to move your data somewhere where other people can access it, you’re more likely to do it.”
To address this need, Globus will soon be launching data publication services, which Foster first announced using a very funny video. The service will help automate the open publication and preservation of research data, and make it easier for users to search, browse, and access datasets that they might be interested in using. A prototype demonstration, by Kyle Chard and Ben Blaiszik, showed how an experiment’s data can be shared and approved in about 10 minutes. The data publication services will begin beta testing this summer for early volunteer users, Foster said.
“We believe that by adopting this approach we can provide very powerful capabilities to many more people, at a price point that is affordable…and that can be sustainable over the long term,” Foster said.