By Rob Mitchum // April 22, 2015
One hundred petabytes is a lot of data. To put the number in perspective, consider that all of the written works of humanity to date add up to only an estimated 50 petabytes. Yet sometime in the next few months, CI research center Globus will surpass 100 PB of scientific data transferred.
At Globusworld 2015, held April 14-15 at Argonne National Laboratory, this milestone was celebrated with a contest to predict the exact time (to the microsecond) at which the 100th petabyte would be transferred. But the conference was primarily focused on the future of Globus beyond that big accomplishment, showcasing the many innovative features built atop this perpetually accelerating data flow.
New tools for data sharing, data publication, and platform-as-a-service capabilities took center stage in CI Director Ian Foster’s keynote talk and tutorials from CI Deputy Director Steve Tuecke. Representatives from national agencies such as Compute Canada and the National Center for Atmospheric Research (NCAR), schools such as New York University and the University of California at Santa Cruz, and companies such as General Atomics spoke about the multitude of benefits Globus brings to their institutions. And roundtable discussions about XSEDE 2.0, campus IT, and data publication services connected Globus team members and users to discuss the most effective strategies for research data going forward.
This broad spread of activity remained under the umbrella of Globus’ core mission for 21st century research data management, described by Foster as a commitment to “provide affordable, advanced capabilities to all researchers, delivering sustainable services that aggregate and federate existing resources.” By making it easier for researchers and research institutions to work with ever-growing pools of data, Globus frees up scientists to focus on their science and more rapidly achieve discoveries.
[Watch: Ian Foster’s Globusworld 2015 keynote address, “Managing Data Throughout the Research Lifecycle using Globus”]
The demand for that vision was amply demonstrated by some recent examples of heavy Globus usage. Earlier this year, scientists working on the Sloan Digital Sky Survey made 400,000 data transfers in two months — a new Globus record. In the year since the last Globusworld, Compute Canada — Canada’s national platform for advanced research computing — has transferred some 62 million files and 700 terabytes through Globus. And new, massive projects with the National Science Foundation’s Jetstream research cloud or the Materials Data Facility promise to further increase the traffic, potentially making these numbers seem quaint by comparison in the coming years.
But as flashy as those transfer stats may be, they only touch the surface of research data management. After removing common obstructions from the river of data flow, Globus has concentrated on building tributaries that make working with that data easier and more effective. Data sharing capabilities, for instance, which allow data owners to specify who can access or contribute to a particular data repository, free up far-flung collaborations to build and work upon their datasets without concerns about security or curation quality.
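Under the hood, sharing of this kind amounts to attaching access rules to a shared endpoint. As a rough sketch of what granting a collaborator access looks like (the field names follow the “access” document format in the public Globus Transfer API; the identity ID and path below are invented examples, not real values):

```python
import json

def make_share_rule(user_identity_id, path, can_write=False):
    """Build an access-rule document of the kind the Globus Transfer API
    accepts when a data owner grants a collaborator read (or read/write)
    access to a folder on a shared endpoint. The arguments passed in by
    the caller are hypothetical examples."""
    return {
        "DATA_TYPE": "access",
        "principal_type": "identity",      # could also be a group
        "principal": user_identity_id,     # the collaborator's Globus identity
        "path": path,                      # folder on the shared endpoint
        "permissions": "rw" if can_write else "r",
    }

# Example: give a (made-up) collaborator read-only access to a dataset folder.
rule = make_share_rule("ae341a98-0000-0000-0000-000000000000", "/projects/survey-data/")
print(json.dumps(rule, indent=2))
```

In practice a data owner would submit a document like this through the Globus web interface or its API rather than by hand; the point is that access control lives with the data owner, not with each institution’s IT staff.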
Globus data publication is the next evolution of these services, allowing researchers to open up select data to the public, with a stable DOI or URL identifier, so that scientists worldwide can discover and build upon their work. The GA release of Globus data publication services — demonstrated at 28:24 of the keynote — will be crucial for creating a more open, shared library of science, where researchers can access data as easily as they currently access journal articles, contribute their own new findings, and forge new collaborations.
In some places, Globus will accomplish this vision from behind the curtain. A pilot program with NCAR will use Globus to dramatically simplify the process by which scientists can find and download climate datasets from their research data archive, creating a sort of “shopping cart” for data. Jetstream’s goal of helping more researchers utilize high-performance computing incorporates Globus for transporting data to and from virtual machines built in their cloud, making it feel as similar to local computing as possible to remove barriers to entry.
But even on (relatively) smaller-scale projects, Globus is providing essential connective tissue. Troy Axthelm, from the Advanced Research Computing Center at the University of Wyoming, spoke about a project with the University of Utah to model all 288,000 square kilometers of the Upper Colorado River Basin watershed. Neither school alone could provide the compute power or storage needed to pull this massive simulation off, but Globus has connected their resources to make it possible.
“We have a very small team, so Globus means a lot to us,” Axthelm said. “They offer services to users that are beyond what we can offer ourselves. It makes the whole project practical.”