PhD, Postdoctoral and Internship opportunities
I am looking for excellent students at various levels (postdocs, graduates, and undergraduates) to work with me on the projects listed below. In general, a strong background in computer science (algorithms, architectures, operating systems), a good command of C/C++/Python, and experience in software development are required. Most of our research is done on bleeding-edge experimental testbeds and the largest supercomputers in the world, such as Aurora. Please contact me for more information.
Current
- DataStates (DOE ASCR) is a data model in which users do not interact with a data service directly to read/write datasets but rather tag datasets with properties expressing hints, constraints, and persistency semantics. Tagging automatically adds snapshots (called data states) into the lineage, a history recording the evolution of all snapshots, using an optimal I/O plan. Such an approach has several advantages: (1) it eliminates the need to deal with complex heterogeneous storage stacks at large scale, shifting the focus to the meaning and properties of the data instead; (2) it provides an incentive to collaborate more and to verify and understand the results more thoroughly by sharing and analyzing intermediate results; (3) it encourages the development of new algorithms and ideas that reuse and revisit intermediate and historical data frequently. Such capabilities are particularly important to facilitate quicker advances at the intersection of HPC, artificial intelligence, and big data analytics.
- VLCC-States (NSF) is a checkpointing framework based on the concept of composable state providers, which are associated with distributed data structures and transformations. It extends VELOC with additional capabilities as follows: (1) composable providers of intermediate states, which hide the complexity of capturing and assembling checkpoints of distributed data structures and their transformations across different modules and programming languages while optimizing their layout to eliminate redundancies, reduce sizes, and improve performance; (2) multi-level co-optimized caching and prefetching, which enable scalable management of the life-cycle of checkpoints for interleavings of capture and reuse operations on heterogeneous storage stacks under concurrency; (3) specialized checkpointing tools for large AI models, notably an integration with PyTorch and DeepSpeed that lets users transparently take advantage of high-performance and scalable checkpointing for large AI models (LLMs, Transformers) using a familiar API.
- AI4S-TT (DOE ASCR) aims to design and develop a memory- and compute-efficient pre-training framework for transformer models based on the idea of applying low-rank compression techniques. To this end, we are exploring three novel directions: (1) theoretical foundation and novel optimization of low-rank techniques for large-scale transformer model pre-training; (2) numerical methods to speed up tensorized pre-training, including variations of auto-differentiation that predict model spikes; (3) performance and scalability considerations, including how to adapt 3D parallelism (tensor/pipeline/data) and distributed GPU memory management for low-rank techniques.
- RECUP (DOE ASCR) is a comprehensive reproducibility framework that explores how to build a novel and scalable data management system for capturing, fusing, storing, and organizing the rich and multi-modal information necessary for reproducibility of hybrid workflows at scale. It targets three aspects: (1) task metadata, which describes the dependencies (e.g., workflow DAGs) between tasks and dynamic decisions (e.g., task order); (2) performance metadata, which describes runtimes and performance counters at fine granularity for each task; (3) intermediate results. Using a combination of metadata and intermediate results, RECUP enables a comprehensive reproducibility analysis: users can compare multiple repeated runs to check the performance and correctness of intermediate workflow stages, identify points of divergence, and determine their root causes.
- Diaspora (DOE ASCR) is a set of resilience-enabling services for science from HPC to edge computing. It aims to create resilient scientific applications across integrated computing infrastructures. The project is developing a system that will allow scientists to quickly and accurately share information about data, application, and resource status to meet a broad set of resilience needs so that researchers can better manage and overcome potential disruptions in the future. To accomplish this, Diaspora is creating a hierarchical event fabric, developing resilience services, and evaluating these new capabilities in scientific applications. 
- DTIO (DOE ASCR) aims to build a unified I/O framework that enables interoperability across the fragmented storage ecosystems of the HPC, Big Data Analytics, and AI communities. It targets high performance and scalability on HPC systems that include a heterogeneous storage stack (node-local GPU/host memory, SSDs, parallel file systems, object stores). To this end, it addresses aspects such as online conversion between data formats, direct streaming between data sources and data sinks, and intent-driven translation between different data access paradigms.
Past
- BRAID (DOE ASCR) is a policy-driven automation framework for flows used to collect, analyze, organize, and learn from data. It proposes new methods for processing data from next-generation science experiments efficiently and reliably in a continuum of computing environments, while satisfying requirements for rapid response, high reconstruction fidelity, data enhancement, data preservation, and machine learning capabilities. In this context, my team focuses on providing scalable data management techniques for flows used to train DNN models. Specifically, we focus on two aspects: (1) how to efficiently cache and augment training samples across a large number of nodes, such that the training pipeline is not bottlenecked by data ingestion; (2) how to efficiently enhance data streams with historic access to representative training samples, which is needed by continual learning approaches that update DNN models in real time.
  - Collaborators: Ian Foster, Justin Wozniak, Zhenchun Liu, Tekin Bicer, Eliu Huerta (Argonne National Laboratory, USA), Kyle Chard (University of Chicago, USA)
- Triple Convergence (DOE ASCR) aims to address the requirements of modern workflows that combine HPC, Big Data, and machine learning tasks at scale. Specifically, we focus on three aspects: (1) Robustness: how to make workflow tasks robust to failures based on their nature (e.g., HPC tasks need to use checkpoint-restart, but machine learning tasks may take advantage of relaxed state recovery that avoids rollback); (2) Reconfigurability: how to enable workflow tasks to adapt to changing conditions dynamically in order to facilitate elasticity (e.g., dynamically add/remove resources to tasks), better resource sharing, and flexible communication channels/data buffers between them; (3) Reproducibility: how to determine which provenance information and runtime decisions are essential and need to be captured during runtime in order to replay a workflow task under potentially different configurations without affecting the results.
  - Collaborators: Tom Peterka, Orcun Yildiz (Argonne National Laboratory, USA), Dmitriy Morozov, Arnur Nigmetov (Lawrence Berkeley National Laboratory, USA)
- VeloC: Very Low Overhead Checkpointing System (DOE ECP) is a multi-level checkpoint-restart runtime for HPC supercomputing infrastructures and large-scale data centers sponsored by ECP (Exascale Computing Project). It aims to deliver high performance and scalability for complex heterogeneous storage hierarchies without sacrificing ease of use and flexibility. Checkpoint-restart is primarily used as a fault-tolerance mechanism for tightly coupled HPC applications but is also essential in many administrative use cases: suspend-resume, migration, debugging. Furthermore, many applications naturally return to previous states as part of the computational model (e.g., adjoint computations, neural networks), which can be performed efficiently using checkpoint-restart.
  - Collaborators: Franck Cappello, Sheng Di (Argonne National Laboratory, USA); Kathryn Mohror, Adam Moody, Gregory Kosinovski (Lawrence Livermore National Laboratory, USA).
  - Status: VeloC is openly available under the MIT license here. Positions available!
- HP-CDS: High Performance Collaborative Distributed Storage (IBM) is an experimental storage prototype specifically designed to deliver high throughput with low resource utilization at scale for data-intensive distributed applications that exhibit non-trivial I/O patterns or irregularity due to multi-tenancy. It is centered around the idea of organizing the storage elements in a decentralized peer-to-peer network that constantly exchanges information about locally observed content and I/O access patterns in order to discover global trends that can be exploited through collaboration, such as dynamic prefetching of data blocks from peers with similar access patterns, on-the-fly de-duplication and dissemination of hot data, and automated system-level storage elasticity.
  - Collaborators: Andrzej Kochut, Alexei Karve (IBM Research USA); Kate Keahey (Argonne National Laboratory, USA); Pierre Riteau (University of Chicago, USA)
  - Status: HP-CDS was integrated into the OpenStack ecosystem and used within IBM.
- BlobSeer (ANR) is a large-scale distributed data storage service that is centered around the idea of using versioning at both the data and metadata level to deliver high throughput under concurrency. BlobSeer has demonstrated its effectiveness in several contexts, including big data analytics based on Hadoop MapReduce, scalable checkpoint-restart for HPC applications, and virtual disk dissemination, snapshotting, and live block migration in large-scale IaaS clouds. BlobSeer became a main research direction of the KerData team, with numerous projects and PhD theses centered around it.
  - Collaborators: Gabriel Antoniu, Luc Bouge (INRIA, France) and many other current and former members of the KerData team.
  - Status: BlobSeer is openly available under LGPL here.