These are some of the projects I have worked on or am currently working on.
Creating portable software for supercomputers requires a software layer that translates abstract application invocations (e.g., “I want to run application x on n nodes with parameters a, b, and c”) into the specifics of the queuing system installed on the clusters/supercomputers that said portable software is meant to run on. Most projects end up having to write such a layer from scratch. However, this is really difficult to get right, since relatively tight access control to supercomputers makes testing on a wide range of systems difficult. The end result is a less-than-ideal use of manpower.
Within the ExaWorks project, I advocated for creating a library that solves this problem. I also wrote most of the language-agnostic specification (loosely based on the Java CoG Abstraction Library) and portions of the Python reference implementation. This is not the first library of its kind, so the question of why other libraries are not being maintained or have not achieved widespread usage is a valid one, and so is the question of why this library should be any better. One reason is that PSI/J is designed to be exclusively a user-side library, so users can patch and update it as needed rather than depending on an OS-level installation. Another is that PSI/J provides a continuous testing system that allows users interested in using it on specific clusters to easily deploy nightly tests on those clusters and automatically send the results back to PSI/J. Are these, together with the rest of the design and governance aspects of the project, sufficient to make a difference? The project is still in its infancy and only time will tell…
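To give a flavor of the idea, this is roughly what submitting a job looks like through the Python reference implementation (a minimal sketch; the executor name, paths, and node count are placeholders, and error handling is omitted):

```python
from psij import Job, JobSpec, JobExecutor, ResourceSpecV1

# Ask PSI/J for an executor matching the local queuing system
# ("slurm" is just an example; "local", "lsf", "pbs", etc. are other options).
executor = JobExecutor.get_instance("slurm")

# Describe the job abstractly: what to run, with what arguments, on how many nodes.
spec = JobSpec(
    executable="/path/to/application_x",
    arguments=["a", "b", "c"],
    resources=ResourceSpecV1(node_count=4),
)

job = Job(spec)
executor.submit(job)   # translated into the scheduler-specific submission behind the scenes
job.wait()             # block until the queuing system reports completion
```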
I wrote this during my PhD while working on a phenomenology problem. The basic idea is to test some new theoretical model against reality, reality here being represented by various measurements done at particle accelerators. Unfortunately, most of the data is published in the form of a plot in a paper or, sometimes, a separately downloadable plot. In order to extract the actual numeric information, one has to use so-called “plot digitizers”, which are tools where you move your mouse to where you think a data point is, click, and the tool records the coordinates of your click. More advanced tools attempt to automatically detect where the data points might be, but it is hard to write a general algorithm that can interpret an arbitrary plot and understand where the data is. However, I noticed that, in some cases, data was available as Encapsulated PostScript (EPS) files, and most EPS files in High Energy Physics are produced by ROOT. PostScript is a programming language, and an EPS file is a set of instructions on how to draw things like lines, circles, etc. It turns out that when you look at EPS files produced by ROOT, it is relatively easy to spot the coordinates of the lines and other objects being drawn. Furthermore, EPS is a vector format, which, unlike raster image formats (PNG, JPEG), has arbitrary resolution (or precision), so the drawing primitives found in EPS files can potentially have the precision of the original data. I ended up writing a fairly thorough PostScript interpreter and an interface to select graphics elements and extract coordinates from them. This is available for free.
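The actual tool is a PostScript interpreter, but the core observation can be illustrated with a much cruder sketch that simply scans an EPS file for coordinate/operator pairs (the operator names and abbreviations matched below are assumptions for illustration; ROOT defines its own shorthands, and a real extractor must also apply the current transformation matrix):

```python
import re

# Rough illustration: collect the coordinates of "x y moveto"/"x y lineto"
# style drawing operators from an EPS file.
OP_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+(moveto|lineto|m|l)\b")

def extract_points(eps_path):
    points = []
    with open(eps_path, "r", errors="ignore") as f:
        for line in f:
            for x, y, _op in OP_RE.findall(line):
                points.append((float(x), float(y)))
    return points

# The page coordinates still need to be mapped to data coordinates,
# e.g., by picking two known axis tick positions and interpolating.
```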
WholeTale is a multi-institutional project meant to enable reproducible science by serving as a platform that links published code and published data in a runnable “tale”. These tales can, in turn, be exported to data/code repositories where they can receive citable identifiers (e.g., a DOI).
Tales can be run locally on the WholeTale platform. However, tales can potentially be linked to rather large amounts of data that cannot fit simultaneously on the resources available to the platform, so a data management system that can fetch and cache data on demand is needed. I designed and implemented such a system for WholeTale. I also created the prototype design and backend implementation for versioning of tales, as well as for the tracking of “reproducible runs”, which are invocations of tale code that are meant to preserve all the details of the context in which the invocations occur and, in principle, lead to… reproducible results.
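The WholeTale backend is considerably more involved, but the basic behavior of an on-demand data layer can be sketched as “fetch on first access, serve from cache afterwards” (the cache path, function names, and use of plain HTTP below are hypothetical; eviction of cold entries is omitted):

```python
import os
import shutil
import urllib.request

CACHE_DIR = "/var/cache/wt-data"   # hypothetical cache location

def fetch(uri, name):
    """Return a local path for a remote object, downloading it only once.

    Later requests for the same object are served from the cache, so a tale
    can reference far more data than fits on the node at any one time.
    """
    local_path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        with urllib.request.urlopen(uri) as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return local_path
```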
This is not related to the similarly named product from Apple (although, 10 or so years after the release of that Swift, their main page still links to the Swift/K project). Swift/K is a parallel scripting language initially designed by Yong Zhao, then a PhD student at the University of Chicago. Yong and I refined the language and wrote a prototype implementation of it, which was a rather simple translation into a set of calls to a library built in a parallel language that I had written before. Swift/K allows one to see files as variables and application invocations as function invocations, while it takes care of managing data movement, resource allocation, parallelization, etc. For example, one could write r = f(g(a), h(b)); to mean “invoke the application represented by g on the file represented by a and, in parallel, invoke the application represented by h on the file represented by b; then invoke the application represented by f on the resulting output files from the two invocations and, finally, store the result in a file represented by r”.
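What that one line expresses implicitly corresponds roughly to the explicit bookkeeping below, written in Python with concurrent.futures purely as an illustration (the application wrappers and file names are placeholders, not Swift/K code):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder wrappers: each would invoke an external application on a file
# and return the path of the output file it produced.
def g(path):
    return path + ".g.out"

def h(path):
    return path + ".h.out"

def f(path1, path2):
    return "r.out"

with ThreadPoolExecutor() as pool:
    fut_g = pool.submit(g, "a.dat")   # run g on a ...
    fut_h = pool.submit(h, "b.dat")   # ... and h on b in parallel
    # f starts only once both inputs exist; Swift/K derives this ordering
    # automatically from the data dependencies in "r = f(g(a), h(b));".
    r = f(fut_g.result(), fut_h.result())
```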
Swift/K could manage hundreds of thousands of concurrent invocations. On top of Coasters, it could scale to tens of thousands of compute nodes (which was mostly the limit of what was available at the time). It inspired a couple of other similar projects: Swift/T and Parsl.
The Coaster system is a pilot job system, which allows one to run many heterogeneous jobs within one or more batch scheduler jobs. The main reason for using a pilot job system is to efficiently run short jobs (jobs that take seconds or minutes to complete) on supercomputers. Coasters has a throughput in the range of a few hundred to a few thousand jobs per second. It uses a custom protocol that allows a single secure network connection to efficiently support an unlimited number of concurrent conversations, including file and job data. Files can be staged directly to compute node local storage, entirely bypassing shared filesystems, which tend to be inefficient when accessed concurrently from thousands of compute nodes.
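The essence of the pilot job idea (not Coasters' actual protocol, which is considerably more involved) is a worker that runs inside a single ordinary batch job and pulls many short tasks from a service; a bare-bones sketch, with a local queue standing in for the network connection and the task list invented for illustration:

```python
import queue
import subprocess

# In a real pilot system the worker connects back to a service over the
# network; here a local queue stands in for that connection.
tasks = queue.Queue()
for i in range(1000):
    tasks.put(["/bin/echo", f"short job {i}"])

def worker():
    """Runs inside one batch-scheduler allocation and executes many short
    tasks back to back, avoiding a scheduler round-trip per task."""
    while True:
        try:
            cmd = tasks.get_nowait()
        except queue.Empty:
            return
        subprocess.run(cmd, check=False)

worker()
```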
This is an event display for the CMS detector at CERN that uses input files designed for the iSpy event display. It was written in JavaScript before the widespread availability of WebGL, so it uses software rendering and a simplified detector model. It was later rewritten by Tom McCauley with WebGL support. The rewritten version is now used by the CMS Open Data Portal, allowing anybody with a web browser to load and see events from the CMS detector.
Karajan is a parallel language implemented on top of a lightweight threading core, which allows it to scale to millions of concurrent threads as well as use threads pervasively without significant penalties. It is based on the idea of structured concurrency, which can be seen as the parallel-programming counterpart of structured programming. Parallel constructs have well-defined semantics that minimize nondeterminism. For example, one can write y = f(parallel(g(), h())), which means that the invocations of g and h are performed in parallel, yet f always receives its arguments in syntactic order, regardless of whether the evaluation of g completes before that of h.
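A loose present-day analogue of that construct is Python's asyncio.gather, which also runs its arguments concurrently while delivering their results in syntactic order (a sketch for illustration only; Karajan's threading model and syntax are its own):

```python
import asyncio
import random

async def g():
    await asyncio.sleep(random.random())   # finishes at an unpredictable time
    return "g-result"

async def h():
    await asyncio.sleep(random.random())
    return "h-result"

def f(x, y):
    return (x, y)

async def main():
    # g() and h() run concurrently, but gather returns their results in the
    # order they were written, so f sees deterministic argument order.
    gx, hy = await asyncio.gather(g(), h())
    print(f(gx, hy))   # always ('g-result', 'h-result')

asyncio.run(main())
```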
The Java CoG Kit Abstraction Library abstracts access to resources such as queuing systems, middleware libraries, and file transfer libraries. It was designed by Gregor von Laszewski and Kaizar Amin (who was a PhD student working with Gregor at the time). I implemented portions of it initially and later ended up maintaining the library and using it in Swift/K.