Building a next-generation trace analysis tool


Distributed computing has been a game-changer in meeting large-scale computing needs, both in industry and academia. It's used in a wildly diverse range of applications, from handling millions of web requests per second to accelerating large-scale scientific simulations. Much of today's computing would not be possible without massive amounts of parallelism and concurrency.

Unfortunately, optimizing the performance of large-scale parallel applications remains a major challenge: how do we identify crucial bottlenecks in mission-critical code? Unoptimized code and inefficient use of system resources increase the completion time of high-performance workloads such as scientific simulations or AI model training, and can similarly hurt the throughput and latency of web servers.

The key to addressing this is tracing: capturing events over an application's lifetime. However, productively analyzing traces from massively parallel programs running on numerous processors is becoming increasingly difficult.

At UMD's Parallel Software and Systems Group, my team and I are developing Pipit, a Python library that simplifies trace analysis for developers. Pipit converts raw trace data into a standardized Pandas DataFrame, enabling users to perform analyses such as detecting load imbalance or identifying runtime patterns using Pipit's built-in analysis functions.
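Because the trace lands in an ordinary DataFrame, analyses like load imbalance reduce to standard Pandas operations. Here is a minimal sketch of the idea; the column names and synthetic data are illustrative, not Pipit's actual schema or API.

```python
import pandas as pd

# Hypothetical events table in the shape a trace reader might produce:
# one row per call, with assumed columns for process and exclusive time.
events = pd.DataFrame({
    "Process": [0, 0, 1, 1],
    "Name": ["MPI_Allreduce"] * 4,
    "Time (ns)": [100, 300, 100, 700],  # synthetic exclusive times
})

# Load imbalance for a function: how far the slowest process
# deviates from the mean across all processes.
per_process = events.groupby("Process")["Time (ns)"].sum()
imbalance = per_process.max() / per_process.mean()
print(round(imbalance, 2))  # process 1 spent 800 ns vs. a 600 ns mean -> 1.33
```

A ratio near 1.0 means work is evenly spread; larger values flag processes that dominate the runtime.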

While existing tools like Vampir or Nsight Systems rely on a particular trace format and are limited to graphical interfaces, Pipit goes a step further by giving users direct access to the DataFrame. This allows for a great degree of extensibility: users can write their own readers or application- and domain-specific analysis functions, and automate workflows with Python scripts and Jupyter notebooks.
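A custom reader, for instance, only needs to map a trace file into the expected DataFrame shape. The toy format and column names below are my own illustration, not Pipit's actual reader interface.

```python
import io
import pandas as pd

# A minimal custom reader: parse a toy CSV trace format into a
# DataFrame with normalized, analysis-friendly column names.
raw = io.StringIO(
    "timestamp,process,event,name\n"
    "10,0,Enter,main\n"
    "90,0,Leave,main\n"
)

def read_toy_trace(f):
    df = pd.read_csv(f)
    # Rename columns to the names the analysis functions expect.
    return df.rename(columns={
        "timestamp": "Timestamp (ns)",
        "process": "Process",
        "event": "Event Type",
        "name": "Name",
    })

events = read_toy_trace(raw)
print(events["Event Type"].tolist())  # ['Enter', 'Leave']
```

Once the data is in this shape, every downstream analysis works unchanged regardless of the original trace format.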

Pipit is still a work in progress. We are expanding support for GPU traces from programming models like CUDA, OpenMP, and OpenACC, as well as implementing new algorithms, like critical path detection.

My current research focuses on Pipit's scalability for processing large trace data by exploiting distributed memory and parallel processing. Scalability brings many challenges: many trace analysis operations are stateful and cannot be easily parallelized.
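To see why statefulness gets in the way, consider matching function entry and exit events, a building block of most trace analyses. This simplified sketch (event names and columns are assumed, not Pipit's internals) shows that each "Leave" depends on a call stack built from every prior event, so events within a process cannot be handled independently.

```python
# Matching Enter/Leave events for a single process requires a call
# stack carried across the whole event sequence -- the stateful part
# that makes naive data-parallel splitting of a trace incorrect.
events = [
    # (timestamp, event type, function name), in timestamp order
    (10, "Enter", "main"),
    (20, "Enter", "compute"),
    (80, "Leave", "compute"),
    (90, "Leave", "main"),
]

stack, durations = [], {}
for ts, kind, name in events:
    if kind == "Enter":
        stack.append((name, ts))            # state carried forward
    else:
        enter_name, enter_ts = stack.pop()  # depends on all prior events
        durations[enter_name] = ts - enter_ts

print(durations)  # {'compute': 60, 'main': 80}
```

Splitting this loop naively across workers would break the stack in the middle of a call, which is why parallelizing such operations requires care at partition boundaries.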

My team and I are planning to present this library at various conferences. We hope that Pipit will prove greatly beneficial in addressing the performance challenges crucial to modern computing environments.