By Rob Mitchum // April 30, 2018
Most of us are now comfortable with cloud computing, enough to often take it for granted. Whether it’s saving our photos in cloud storage, accessing our email from multiple devices, or streaming a high-definition video on the bus, moving data to and from a distant computing center has become second nature.
But for science, embracing the cloud is far more complicated. Scientific data is typically larger, more complex, and more sensitive than consumer data, and researchers use it in experimental and advanced ways, applying deep analysis and machine learning methods to extract new discoveries.
For two decades, Globus has taken on this challenge by regularly introducing new services that make it easier for scientists to manage their data. From the early days of grid computing in the 90s to the cloud computing of today, Globus has reduced the peskiest barriers for data-driven research, including transfer, search, authentication, and analysis.
“A lot of people think of Globus as just file transfer, but there’s a lot more to it,” said co-founder Steve Tuecke at the 2018 edition of GlobusWorld, the group’s annual user meeting in Chicago. “Our mission is to increase the efficiency and effectiveness of researchers engaged in data-driven science and scholarship through sustainable software.”
So while the meeting in late April happened to coincide with an important file transfer milestone — 400 petabytes of data moved between Globus endpoints via the service since 2010 — many of the talks, tutorials, and user stories focused instead on what comes after data reaches its destination. Whether it’s enabling the discovery of promising new materials, helping coordinate multi-site research projects in neuroscience and molecular biology, or facilitating campus- and country-wide storage networks, Globus is increasingly a critical behind-the-scenes partner in some of today’s most exciting science.
Many of the new and coming-soon features announced by Tuecke deepened Globus data publication services, which allow scientists to publish their datasets, control who can access the information, and use a specialized search platform to discover new data sources. The multiplicative power of these features was demonstrated through the Materials Data Facility (MDF), a project funded by the National Institute of Standards and Technology where materials scientists can publish and share research data.
Currently holding dozens of datasets from 150 authors and 29 institutions, the MDF has “started to enable new science,” said Ian Foster, Globus co-founder and Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. By combining these data sources and applying machine learning techniques, researchers have discovered new forms of metallic glass and added new measurements to already-completed molecular simulations. MDF has also provided a testbed for a new data science service being developed with Globus and the Argonne Leadership Computing Facility (ALCF) for researchers at Argonne National Laboratory.
These projects represent Foster’s concept of “data turbines,” where old datasets are revived and used to produce new knowledge.
“There is a lot of data out there — millions of systems hold research data, and a growing number of these are Globus storage endpoints,” Foster said. “At the moment, we mostly help people move and share their data, but we would like to add value to that data in other ways. Data turbines extract value from data, help you search it, and perform inference on it. Every Globus endpoint will become something active rather than something that sits there passive.”
Elsewhere, Globus makes science go faster simply by easing the friction of working with large — and constantly growing — datasets. Speakers from Stanford, Harvard, and the University of Michigan described how new scientific instruments such as high-throughput genome sequencers and electron microscopes are flooding laboratories with data and creating new demand for research computing centers and data storage. Many of these institutions, as well as multi-site collaborations and university libraries, increasingly steer their users to Globus for its ease of use and its ability to handle terabyte-scale transfers.
These campus deployments of Globus may someday scale up to an even bigger data system: the Open Storage Network, a proposal presented by Alex Szalay of Johns Hopkins in the meeting’s guest keynote. To resolve the impasse between ever-growing volumes of academic data and cost-prohibitive commercial cloud storage, the project aims to create a national distributed storage system spread across hundreds of universities connected by high-speed Internet2 infrastructure.
With a common system for academic data in place, researchers would be free to focus instead on analytics and preservation, and an ecosystem of shared data services could be built on top of the network. If that sounds like a familiar message at GlobusWorld, it’s no accident; “Globus is at the heart of the system,” Szalay said. As science works towards a future where managing research data is as easy as managing family photos, Globus is providing many of the vital organs.
[Image: Middelgrunden wind turbine off the Øresund Strait, Denmark. Photo by Kim Hansen.]