The CI @ SC14: Discovery Engines, Exascale, Cloud Computing, and More

By Rob Mitchum // November 14, 2014

Next week, the world’s experts on high-performance computing, networking, storage and analysis will gather in New Orleans for SC14, the 2014 edition of the international supercomputing conference. Several Computation Institute researchers will be there, presenting papers, participating in panels, leading workshops and tutorials, and speaking at the Department of Energy booth. Below are some highlighted CI events, covering exascale computing, data services for campuses and national laboratories, the mapping of microbial life, experimental cloud computing, and much more.

Talks

Ian Foster, CI Director
Discovery Engines for Big Data: Accelerating discovery in basic energy sciences
Tuesday, 1:00 p.m., DOE Booth #1939

The 17 national laboratories of the U.S. Department of Energy are important hubs for science, studying everything from clean energy and battery technology to the universe and subatomic particles. The massive scale of these projects and the powerful instruments used to carry out the research also make the national labs major hubs for scientific data, creating, analyzing, importing, and exporting petabytes of information each day. To ensure that this growing flood of data does not cause a traffic jam that slows the national pace of science, researchers from the Computation Institute (CI), the Mathematics and Computer Science (MCS) division, and the Argonne Leadership Computing Facility (ALCF) are building and applying a new “data fabric” for seamless and shareable access to data.

Argonne’s Discovery Engines for Big Data project seeks to enable new research modalities based on the integration of advanced computing with experiments at DOE facilities. The infrastructure includes the Petrel online data store, Globus research data management services, the supercomputing resources of the ALCF, and the parallel scripting language Swift. The work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.

Early users of the system include Argonne’s Advanced Photon Source (APS), a powerful x-ray facility used by thousands of scientists around the world to examine the elementary structures of materials important for medicine, engineering, and other fields. With Petrel, data collected by a visiting scientist at the APS can easily be moved back to the scientist’s home institution or shared with outside collaborators. Researchers can also use Globus data publication and discovery services to permanently store datasets and, if desired, make them public and discoverable for the scientific community.
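For readers curious what such a transfer looks like in code, below is a minimal sketch using the present-day Globus Python SDK (globus_sdk), which postdates this article; the endpoint UUIDs, paths, and access token are placeholders rather than real APS or Petrel identifiers.

```python
import globus_sdk

# All identifiers below are placeholders: real endpoint UUIDs, paths,
# and an OAuth2 access token would come from the user's Globus account.
TRANSFER_TOKEN = "REPLACE_WITH_TRANSFER_ACCESS_TOKEN"
PETREL_ENDPOINT = "REPLACE_WITH_PETREL_ENDPOINT_UUID"
HOME_ENDPOINT = "REPLACE_WITH_HOME_INSTITUTION_ENDPOINT_UUID"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe a recursive transfer of one beamline dataset from Petrel
# back to the visiting scientist's home storage system.
tdata = globus_sdk.TransferData(
    tc, PETREL_ENDPOINT, HOME_ENDPOINT, label="APS dataset to home institution"
)
tdata.add_item("/aps/scan_0042/", "/data/scans/scan_0042/", recursive=True)

# Submit the transfer; Globus manages retries and integrity checking.
task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```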

Rick Stevens, CI Senior Fellow
Mapping the Protein Universe: A Big Data Approach to Integrated Analysis of Microbial Genomes and Environmental Samples
Tuesday, 11:30 a.m., DOE Booth #1939

Microbial life on Earth is so diverse that an estimated 10^30 organisms exist, encoding perhaps 10^34 proteins. Millions of protein families may exist, with a seemingly unlimited number of novel sequences. However, as few as 1% of these organisms can be cultured in a laboratory. Apart from what we know from cultivated microbes, our understanding of microorganisms comes from sequencing DNA extracted directly from environmental samples. Yet we often do not have direct knowledge of which sequence came from which organism.

In a collaboration of five U.S. national laboratories, we are building tools to improve our ability to search and mine the collection of environmental proteins and to relate those proteins to ones we can study in the context of their full genome. We are building a complete mapping of the proteins seen in environmental samples to proteins that occur in sequenced genomes. This enables us to identify proteins in environmental samples that are closely related to families of proteins of interest to DOE applications, and to calibrate the methods for building these maps. We also determine rules that govern the co-occurrence of protein families and protein clusters within sequenced organisms, using those patterns to gain biological insight into the nature of the organisms in environmental samples and their communities. Finally, we demonstrate how to scale data storage, data management, and computational analysis methods for a future that will contain many millions of isolate genomes and environmental samples, as well as the many billions or trillions of proteins that can be identified from these datasets.

By comparing the sequences from the environment to known sequences from isolate genomes, we can expand our knowledge about the millions of organisms that can’t yet be grown in a laboratory, and even improve our ability to culture some of them. Ultimately, this could enable the discovery of proteins in the environment that hold the key to science and engineering problems at DOE (e.g., biofuels, energy production, novel chemistry, novel structures).
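As a rough, purely illustrative sketch of the co-occurrence analysis described above (the genome and family names below are invented, not drawn from the project's data), one can count how often pairs of protein families appear together in the same sequenced genome:

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each sequenced genome mapped to the set of
# protein families detected in it.
genomes = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam3"},
    "genome_C": {"fam2", "fam3", "fam4"},
}

# Count how often each pair of families occurs in the same genome.
cooccurrence = Counter()
for families in genomes.values():
    for pair in combinations(sorted(families), 2):
        cooccurrence[pair] += 1

# Family pairs seen together across many genomes hint at functional
# linkage, which helps interpret which families turn up together in
# environmental samples.
for pair, count in cooccurrence.most_common(3):
    print(pair, count)
```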

Demos/Tutorials

Michael Wilde, CI Senior Fellow
Swift Parallel Scripting for Supercomputers and Clouds
Tuesday, 10:00 a.m. and 3:00 p.m., DOE Booth #1939

The Swift parallel programming language allows users to perform large-scale simultaneous runs of simulations and data analyses more efficiently and with less user effort. Users write what look like ordinary serial scripts; Swift automatically spreads the work expressed in those scripts across as many parallel CPUs as the user has available. Swift efficiently automates critical functions that are hard, costly, and unproductive to implement manually:

  • implicit parallelization using functional dataflow
  • data transport and distribution of work across diverse systems
  • failure recovery and error handling

Swift is both portable and fast. It provides a uniform way to run scientific and engineering application workflows on resources ranging from multicore PCs to clouds and supercomputers. On supercomputers, Swift has achieved rates of over a billion tasks per second. Swift is used in applications in materials science, chemistry, biology, earth systems science, power grid modeling, and simulations in architectural design and urban planning.
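Swift code itself is not shown here, but the Python sketch below (with a made-up simulate/analyze pair of functions) illustrates the dataflow pattern Swift automates: many independent tasks fanned out in parallel, with a dependent step that waits for all of their results. In Swift, the same pattern is written as serial-looking code and parallelized implicitly.

```python
from concurrent.futures import ProcessPoolExecutor

def simulate(seed):
    # Stand-in for an external simulation program invoked once per task.
    return sum((seed * k) % 7 for k in range(1_000))

def analyze(results):
    # Stand-in for a post-processing step that needs every result.
    return max(results)

if __name__ == "__main__":
    # Explicit parallel fan-out; in Swift the equivalent loop looks
    # serial, and the runtime infers that iterations are independent.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, range(100)))
    print("best score:", analyze(results))
```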

Pavan Balaji, CI Fellow
ARGO: An Exascale Operating System and Runtime
Wednesday, 11:00 a.m. and 12:00 p.m., DOE Booth #1939

Argo is a new exascale operating system and runtime system designed to support extreme-scale scientific computation. It is built on a new, agile, modular architecture that supports both global optimization and local control. It aims to efficiently leverage new chip and interconnect technologies while addressing the new modalities, programming environments, and workflows expected at exascale. It is designed from the ground up to run future high-performance computing applications at extreme scales. At the heart of the project are four key innovations: dynamic reconfiguration of node resources in response to workload, support for massive concurrency, a hierarchical framework for power and fault management, and a “beacon” mechanism that allows resource managers and optimizers to communicate with and control the platform. These innovations will result in an open-source prototype system that runs on several architectures and is expected to form the basis of production exascale systems deployed in the 2018–2020 timeframe.

Steve Tuecke, Vas Vasiliadis, Raj Kettimuthu
Enhanced Campus Bridging via a Campus Data Service Using Globus and the Science DMZ
Monday, 1:30 p.m., Room 394

Existing campus data services are limited in their reach and utility due, in part, to unreliable tools and a wide variety of storage systems with sub-optimal user interfaces. An increasingly common solution to campus bridging comprises Globus operating within the Science DMZ, enabling reliable, secure file transfer and sharing, while optimizing use of existing high-speed network connections and campus identity infrastructures. Attendees will be introduced to Globus and have the opportunity for hands-on interaction installing and configuring the basic components of a campus data service. We will also describe how newly developed Globus services for public cloud storage integration and metadata management may be used as the basis for a campus publication system that meets an increasingly common need at many campus libraries.

The tutorial will help participants answer these questions: What services can I offer to researchers for managing large datasets more efficiently? How can I integrate these services into existing campus computing infrastructure? What role can the public cloud play (and how does a service like Globus facilitate its integration)? How should such services be delivered to minimize the impact on my infrastructure? What issues should I expect to face (e.g., security) and how should I address them?
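As one concrete (and hypothetical) illustration of the sharing capability covered in the tutorial, the snippet below uses the present-day Globus Python SDK, which postdates this article, to grant a collaborator read access to a project directory on a shared endpoint; the UUIDs, path, and token are placeholders.

```python
import globus_sdk

# Placeholder values; a real deployment would take these from the
# campus endpoint configuration and the collaborator's Globus identity.
TRANSFER_TOKEN = "REPLACE_WITH_TRANSFER_ACCESS_TOKEN"
SHARED_ENDPOINT = "REPLACE_WITH_SHARED_ENDPOINT_UUID"
COLLABORATOR_ID = "REPLACE_WITH_COLLABORATOR_IDENTITY_UUID"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Grant read-only access to a single project directory -- the core of
# Globus file sharing on a Science DMZ data transfer node.
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": COLLABORATOR_ID,
    "path": "/projects/dataset_001/",
    "permissions": "r",
}
tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)
```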

Papers/Panels/Workshops

Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE 2)
Sunday, 9:00 a.m., Room 275-77
CI ORGANIZER: Daniel S. Katz

Workshop Title: 3rd Workshop on Extreme-Scale Programming Tools
Paper: Case Studies in Dataflow Composition of Scalable High Performance Applications
Monday, 2:40 p.m., Room 297
CI PRESENTERS: Daniel S. Katz, Michael Wilde, Ian T. Foster

Experimental Infrastructures for Open Cloud Research
Tuesday, 12:15 p.m., Room 388-90
CI PRESENTER: Kate Keahey

Visualization Showcase: Investigating Flow-Structure Interactions in Cerebral Aneurysms
Wednesday, 10:30 a.m., Room 386-87
CI PRESENTERS: Joseph A. Insley, Michael E. Papka

The Open Science Data Cloud and PIRE Fellowships: Handling Scientific Datasets
Wednesday, 12:15 p.m., Room 297
CI PRESENTER: Robert Grossman

Can We Avoid Building An Exascale “Stunt” Machine?
Wednesday, 1:30 p.m., Room 383-85
CI PRESENTER: Rick Stevens

Experiences in Delivering a Campus Data Service Using Globus
Wednesday, 5:30 p.m., Room 298-99
CI PRESENTERS: Ian Foster, Steve Tuecke, Rachana Ananthakrishnan

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
Thursday, 2:00 p.m., Room 393-94-95
CI CO-AUTHOR: Daniel S. Katz

High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
Thursday, 3:30 p.m., Room 391-92
CI PRESENTERS: Tom Peterka, Dmitriy Morozov, Carolyn Phillips

Return of HPC Survivor: Outwit, Outlast, Outcompute
Friday, 8:30 a.m., Room 383-85
CI PRESENTER: Pete Beckman