
eScience Reports, Day 1

By Rob Mitchum // October 10, 2012

The 2012 IEEE International Conference on eScience is taking place in Chicago this year, and we’ll be there Wednesday through Friday to report on talks about the latest in computational research. We’ll update the blog throughout the conference (subject to wifi and electrical outlet availability), and will tweet from the talks @Comp_Inst using the hashtag #eSci12.

Paving Future Cities with Open Data (Panel 2:00 – 3:00)

As the Earth’s population increases, the world is urbanizing at an accelerating rate. Currently, half of the people on the planet live in cities, but that number is expected to grow to 70 percent in the coming decades. Booming populations in China and India have driven rapid urban development at a rate unprecedented in human history. Simultaneously, existing cities are releasing more data about their infrastructure than ever before, on everything from crime to public transit performance to snow plow geotracking.

So now is the perfect time for computational scientists to get involved with designing and building better cities, and that was the topic of a panel moderated by Computation Institute Senior Fellow Charlie Catlett. With representatives from IBM and Chicago City Hall and a co-founder of EveryBlock, the panel brought together experts who have already started digging into city data to talk about both the potential of that data and the precautions it demands.

For a local example, Chicago’s Chief Technology Officer John Tolva talked about Mayor Rahm Emanuel’s initiative to release more city data to the public through sites such as the Chicago Data Portal. The city has also taken steps to convert some of that data into useful web applications, such as tracking the result of a call to the city’s non-emergency 311 center, as well as applications that are more purely entertaining, like the city’s snow plow tracker. Maybe even more important was the city’s admission that it doesn’t need to be the only one creating these apps; the city has organized “Hackathons” for web developers to design their own tools and release them to the public. Apps that remind Chicagoans to move their cars on street sweeping days or tell them when the next bus will arrive have proven immensely popular.
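Much of that app-building boils down to pulling records from the city’s open data portal over a simple web API. Below is a minimal sketch in Python of the pattern, assuming a Socrata-style JSON endpoint like the one the Chicago Data Portal exposes; the dataset ID and field names are placeholders for illustration, not a real dataset.

```python
# Minimal sketch: pulling recent 311 service requests from a Socrata-style
# open data portal. The dataset ID and field names below are placeholders --
# look up the real ones on the portal before running.
import requests

PORTAL = "https://data.cityofchicago.org/resource"
DATASET_ID = "xxxx-xxxx"  # hypothetical dataset identifier

def fetch_recent_requests(limit=25):
    """Return the most recent 311 service requests as a list of dicts."""
    url = f"{PORTAL}/{DATASET_ID}.json"
    params = {"$limit": limit, "$order": "created_date DESC"}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for record in fetch_recent_requests(5):
        print(record.get("sr_type"), record.get("status"))
```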

But Dan O’Neil, the executive director of the Smart Chicago Collaborative, said that merely releasing the data is not enough. An infrastructure for civic innovation needs to be built, combining public and private data to find inefficiencies that the market can capitalize on to drive a city’s economic development. O’Neil used the example of commercial real estate, where information from the government and private companies could be mashed up to turn the current “hunch-oriented” process of selecting the best site for a new business into a more scientific procedure. John Ricketts of IBM’s “Smarter Cities” program emphasized the need for cultural knowledge above and beyond the raw data, citing a collaboration with the city of Beijing in which residential CO2 sensors are designed to alert neighbors, who can check in and help, rather than emergency services when an alarm goes off.

Catlett’s vision for the role of computation in cities takes a broader view, combining models on environment and climate with the flood of open city data to build complex simulations for city planning. Catlett used the example of Lakeside, a proposed development on Chicago’s South Side which will span 600 acres of residential and commercial buildings. Urban planners know how to design a 60-acre development, Catlett said, but when you increase the size of a development tenfold, “we’ve gone beyond the scale of human experience.”

Helping “The 95%” Discover Computation (Keynote 1:00 – 2:00)

As scientific conferences go, eScience is modestly sized, with the attendees all able to fit into one big room for keynote talks. Over the course of the week, these few hundred experts will share the latest in computational research methods and ideas with each other, reporting back from the frontiers of the field while pushing them farther outward. But before heads got too big about the grand march of science, the afternoon keynote speaker, Gregory Wilson from Software Carpentry, was there to remind those in attendance that they were only a minority of a minority of those who have truly integrated computation into the way they perform science.

Wilson drew attention to the 95 Percent — the vast majority of scientists who lack even basic computational skills. Forget writing code, cloud computing, and version control; these are the scientists who struggle to even install basic programs downloaded from the internet. Many of the cutting-edge computational methods discussed at this conference may be exciting developments for those in the know, Wilson said, but they’re completely irrelevant to scientists who still handle their data using brute force “pencil and paper” methods because they don’t have the basic computational skills to boost their efficiency. His chosen metaphor was a new CT scanner arriving in a South American country where millions of people lacked clean water — a shiny toy is nice, but doesn’t do much for people with basic needs.

“Would you rather see your best ideas left on the shelf because most of the people can’t reach them?” Wilson asked his audience. “Or would you rather raise a generation who can all do the things we think are exciting?”

Wilson’s advice is to target graduate students at the start of their scientific careers, teach them via short (2-day to 2-week) workshops, and focus on practical skills instead of theory. It’s an approach he’s refined with Software Carpentry, which provides volunteers to teach various computation workshops around the world. Assessments of Software Carpentry workshops have found that as many as two-thirds of the attendees experience a significant impact on their productivity, shaving as much as 20 percent off the time it takes to perform routine analysis tasks (and making life a lot easier for their tech support departments). Successfully spreading the gospel of computation to all young scientists could create a six-fold increase in the number of scientists who can avoid karoshi, Wilson joked — the Japanese term for “death by overwork.”

Making Sense of Ecological Data (Ecology Session 10:30 – 12:00)

Like many fields in the biological sciences, ecology is suddenly dealing with an abundance of data that often presents difficult organizational challenges. Many different organizations and scientists are collecting data on ecological systems using a variety of instruments, surveys and sensors that may differ dramatically from researcher to researcher. These data sources then stream into databases that may also be uniquely designed by the group, presenting significant technical hurdles for comparing or integrating this data across researchers and extracting the maximum value.

In the morning session on ecology, three scientists offered examples of how computation can extract sense out of this data chaos. Whether the data was collected by government agencies, academic research collaborations, or citizen scientists, computational methods can reshape messy or esoteric databases into resources useful to the scientific community and laypeople.

For example, Deborah McGuinness described a project that grew from a suggestion by one of her computer science students at Rensselaer Polytechnic Institute. The student talked about how babies in her hometown were becoming sick from contaminated tapwater, and wanted to create a resource where a concerned citizen could look up polluted water sources near their home and find out details about the contaminants, their health effects (on humans and wildlife), and relevant regulations. In order to build this tool, called SemantAqua, the students had to semantically organize data from the EPA, the USGS, and organizations that collect data about wildlife into a unified system. The result was successful enough to attract the interest of government resource managers, and the team is currently working to expand the tool into SemantEco, which will include air pollution data as well.
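SemantAqua takes a semantic web approach, but the heart of the problem is familiar: records from different agencies arrive with different field names and have to be reconciled before they can be checked against regulations. The sketch below illustrates that step with plain Python dictionaries rather than the project’s actual ontologies; the agency field names, records, and regulatory limit are all invented for illustration.

```python
# Minimal sketch of the kind of integration behind a tool like SemantAqua:
# measurements from different agencies, each with its own field names, are
# normalized into one schema and checked against regulatory limits.
# All field names, limits, and records here are illustrative, not real data.

EPA_RECORD = {"site": "Well 12", "analyte": "arsenic", "value_ug_per_L": 14.0}
USGS_RECORD = {"station": "Well 12", "param": "arsenic", "result": 9.5}

# Hypothetical regulatory limits (micrograms per liter).
LIMITS = {"arsenic": 10.0}

def normalize_epa(rec):
    return {"site": rec["site"], "contaminant": rec["analyte"], "value": rec["value_ug_per_L"]}

def normalize_usgs(rec):
    return {"site": rec["station"], "contaminant": rec["param"], "value": rec["result"]}

def exceedances(records):
    """Yield normalized records whose value is above the regulatory limit."""
    for rec in records:
        limit = LIMITS.get(rec["contaminant"])
        if limit is not None and rec["value"] > limit:
            yield rec

unified = [normalize_epa(EPA_RECORD), normalize_usgs(USGS_RECORD)]
for hit in exceedances(unified):
    print(f"{hit['site']}: {hit['contaminant']} at {hit['value']} exceeds limit {LIMITS[hit['contaminant']]}")
```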

Even within a single academic field, the multitude of data collection methods can pose a problem. Social ecologists study the governance of shared natural resources such as forests or fisheries to determine what regulatory system best preserves a resource’s sustainability. But each research group within this field studies systems using its own surveys, with its own sets of questions about the resource, the people who use it, and the agencies that govern it. Scott Jensen from Indiana University talked about his efforts to map the data from one such long-running project, IFRI, to a shared research framework proposed by Nobel laureate Elinor Ostrom. If successful, the program could take data collected under any research group’s methodology and convert it to a common architecture, making the data more accessible to other scientists in the field.
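In practice, that kind of conversion often begins as a mapping from one project’s survey fields onto the shared framework’s variables. Here is a minimal sketch of the pattern in Python; the field names on both sides and the sample record are invented for illustration and are not the actual IFRI or Ostrom framework variables.

```python
# Minimal sketch of schema mapping in the spirit of the IFRI-to-shared-framework
# effort described above. All field names here are invented for illustration.

# Hypothetical mapping from one project's survey fields to common framework variables.
FIELD_MAP = {
    "forest_size_ha": "resource.size",
    "num_user_groups": "users.group_count",
    "harvest_rules_written": "governance.rules_formalized",
}

def to_common_schema(survey_record, field_map=FIELD_MAP):
    """Translate a project-specific survey record into the shared framework.

    Fields without a mapping are kept under 'unmapped' so nothing is silently lost.
    """
    common, unmapped = {}, {}
    for field, value in survey_record.items():
        target = field_map.get(field)
        if target:
            common[target] = value
        else:
            unmapped[field] = value
    common["unmapped"] = unmapped
    return common

record = {"forest_size_ha": 320, "num_user_groups": 4, "local_name": "Mau East"}
print(to_common_schema(record))
```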

A different data problem is presented by projects where citizens volunteer their own time to collect and submit ecological data. For 10 years, the eBird project has asked birdwatchers to submit data via the internet about the birds they have observed, in order to create a network monitoring the distribution of bird species. But as you might imagine, the quality of these observations varies based upon the expertise of the person making the submission. Currently, the project uses a quality control system where expert birders set a range of accepted time periods for observing a particular species in a given area, and observations that fall outside of that time range must be manually reviewed. But Jun Yu of Oregon State University described a computational data quality filter that uses machine learning techniques to estimate the expertise of the observer, accepting more unusual observations from expert observers while flagging those from the “novices” for further review. In a case study, the computational filter reduced the number of flagged observations by 43% and found 52% more invalid observations.
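Yu’s system uses its own model; purely as a rough illustration of the general idea, the sketch below trains an off-the-shelf classifier to score observer expertise from a few simple features and flags an unusual sighting only when that score is low. The features, labels, and threshold are invented for illustration and are not the eBird team’s actual filter.

```python
# Minimal sketch of an expertise-aware quality filter in the spirit of the
# approach described above (not the authors' actual model). A classifier is
# trained to score observer expertise, and unusual sightings are flagged for
# review only when the observer's expertise score is low.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Per-observer features: [checklists submitted, species reported, years active]
X_train = np.array([[500, 320, 10], [20, 15, 1], [300, 250, 6], [5, 4, 1]])
y_train = np.array([1, 0, 1, 0])  # 1 = expert, 0 = novice (hypothetical labels)

expertise_model = RandomForestClassifier(n_estimators=100, random_state=0)
expertise_model.fit(X_train, y_train)

def needs_review(observer_features, sighting_is_unusual, threshold=0.5):
    """Flag an unusual sighting for manual review unless the observer looks expert."""
    if not sighting_is_unusual:
        return False
    expert_prob = expertise_model.predict_proba([observer_features])[0][1]
    return expert_prob < threshold

print(needs_review([450, 300, 8], sighting_is_unusual=True))  # expected False: accepted
print(needs_review([10, 8, 1], sighting_is_unusual=True))     # expected True: flagged
```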

Hubs for Science (Keynote 8:30 – 10:00)

Most people know about Moore’s law, the accurate 1965 prediction by Intel co-founder Gordon E. Moore that the number of transistors that fit into a circuit doubles every two years or so. Less well known is the early open source software that made Moore’s law — and the faster, cheaper computers we enjoy because of it — possible for computer engineers. SPICE, a circuit simulation tool developed by Berkeley researchers in the early 1970s, was originally intended as a teaching tool, and was transported by its creators from campus to campus on unwieldy magnetic tape. As the program was improved upon by its users, it eventually became the industry standard for designing new computer chips, enabling the technological advances that made today’s laptops exponentially more powerful than the room-sized supercomputers of the past.

Gerhard Klimeck of Purdue University, this morning’s keynote speaker, used this anecdote to illustrate the potential power of sharing simulation tools for education and research. Klimeck’s group has developed a free service called nanoHUB, an online community where researchers doing science and engineering at the nanoscale can openly share simulation programs and information resources. In ten years, the site has grown to a user base of some 12,000 people using more than 260 simulation tools, with an order of magnitude more using the tutorials and educational materials on the site.

The key to nurturing this environment, Klimeck said, was making it easier for scientists to share their simulation tools over the internet, in a fashion that was user-friendly for the students and researchers on the other side. Traditionally, researchers have written code for a customer base of one, themselves, and the process of making those tools accessible to other scientists can take years. By providing a software development infrastructure that allows researchers to upload simulation tools and give them a user-friendly interface in days to weeks, nanoHUB has made the path to sharing these tools much smoother, Klimeck said. Researchers can even constantly update their tools after they have been shared, fine-tuning their accuracy and incorporating suggestions from the community of nanoHUB users.
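nanoHUB has its own tool-publishing framework, so the following is only a generic illustration of the underlying pattern rather than that framework’s API: a research code declares its inputs once, and the hub can then render a form, validate user input, and run the tool on the user’s behalf. Every name and parameter here is hypothetical.

```python
# Generic illustration of the pattern: a research code exposed through a
# declared parameter interface so a hub can render a form and run it for users.
# This is not nanoHUB's actual tool-publishing framework, just a sketch.

# Declared inputs: a hub front end could render these as a web form.
PARAMETERS = {
    "length_nm": {"type": float, "default": 10.0, "label": "Wire length (nm)"},
    "temperature_K": {"type": float, "default": 300.0, "label": "Temperature (K)"},
}

def run_simulation(length_nm, temperature_K):
    """Stand-in for the researcher's existing simulation code."""
    # A toy calculation in place of the real physics.
    return {"resistance_ohm": 100.0 * length_nm / temperature_K}

def run_from_form(form_values):
    """Validate form input against the declared parameters, then run the tool."""
    kwargs = {}
    for name, spec in PARAMETERS.items():
        raw = form_values.get(name, spec["default"])
        kwargs[name] = spec["type"](raw)
    return run_simulation(**kwargs)

print(run_from_form({"length_nm": "25", "temperature_K": "77"}))
```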

The result is a virtual toybox of nanoscale simulation tools that is used almost equally by researchers and students. More than 760 engineering and chemistry courses at nearly 200 institutions have incorporated nanoHUB tools, Klimeck said, and 857 research papers have cited tools from the site. This dual use replicates the example of SPICE, as open source software is adapted by both educational and research communities, and ends up accelerating the rate of science through both branches. The model is promising enough that Klimeck’s team has also started HUBzero.org, which gives researchers in other fields the tools to start their own hub community – an opportunity that has been taken up by scientists studying pharmaceuticals, earthquakes, and cancer care, for example. Klimeck also hopes to integrate the sharing of experimental data from published papers into the hub structure, paving the way for new, more interactive scientific publications whose readers can replicate the experiments for themselves.

“These are the papers of the future, that publish both the data and/or the tools,” Klimeck said. “That’s where the world is moving and we have a platform that supports that.”