PhD, Postdoctoral and Internship opportunities
I am looking for excellent researchers and students at various levels (postdocs, graduates, and undergraduates) to work with me on the projects listed below. In general, a strong background in computer science, good C/C++/Python knowledge, and experience in software development are required. Most of our research is done on bleeding-edge experimental testbeds and some of the largest supercomputers in the world, such as Aurora. Please contact me for more information.
Current openings: Postdoctoral Appointee - DataStates (Requisition Number: 413015)
DataStates is a data model in which users do not interact with a data service directly to read/write datasets, but rather tag datasets with properties expressing hints, constraints, and persistency semantics. These tags automatically add snapshots (called data states) into the lineage, a history recording the evolution of all snapshots using an optimal I/O plan. Such an approach has several advantages: (1) it eliminates the need to deal with complex heterogeneous storage stacks at large scale, shifting the focus to the meaning and properties of the data instead; (2) it brings an incentive to collaborate more and to verify and understand results more thoroughly by sharing and analyzing intermediate results; (3) it encourages the development of new algorithms and ideas that reuse and revisit intermediate and historical data frequently. Such capabilities are particularly important to facilitate quicker advances at the intersection of HPC, artificial intelligence, and big data analytics.
Status: Positions available!
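The tagging idea above can be illustrated with a minimal sketch. The `Lineage` class, its `capture`/`find` methods, and the property names are hypothetical stand-ins, not the actual DataStates API: the point is that users attach properties to a dataset and a snapshot lands in an append-only history that can be revisited later.

```python
import time

class Lineage:
    """Toy lineage: every tagged capture appends an immutable snapshot.
    (Hypothetical sketch; not the real DataStates interface.)"""
    def __init__(self):
        self._snapshots = []  # ordered history of all data states

    def capture(self, data, **properties):
        """Tag a dataset with properties; a snapshot is added to the lineage."""
        snap = {"version": len(self._snapshots),
                "time": time.time(),
                "properties": properties,
                "data": data}
        self._snapshots.append(snap)
        return snap["version"]

    def find(self, **query):
        """Revisit history: return versions whose properties match the query."""
        return [s["version"] for s in self._snapshots
                if all(s["properties"].get(k) == v for k, v in query.items())]

# Usage: tag intermediate results instead of writing files explicitly.
lineage = Lineage()
v0 = lineage.capture([1.0, 2.0], persistence="durable", stage="preprocess")
v1 = lineage.capture([1.5, 2.5], persistence="scratch", stage="train")
print(lineage.find(stage="train"))  # → [1]
```

In a real system the runtime, not the user, would decide where and how each snapshot is persisted based on the attached properties; the sketch only shows the user-facing shift from explicit I/O to tagging.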
BRAID is a policy-driven automation framework for flows used to collect, analyze, organize, and learn from data. It proposes new methods for processing data from next-generation science experiments efficiently and reliably across a continuum of computing environments, while satisfying requirements for rapid response, high reconstruction fidelity, data enhancement, data preservation, and machine learning capabilities. In this context, my team focuses on providing scalable data management techniques for flows used to train DNN models. Specifically, we focus on two aspects: (1) how to efficiently cache and augment training samples across a large number of nodes, such that the training pipeline is not bottlenecked by data ingestion; (2) how to efficiently enhance data streams with historic access to representative training samples, which is needed by continual learning approaches that update DNN models in real time.
Collaborators: Ian Foster, Justin Wozniak, Zhenchun Liu, Tekin Bicer, Eliu Huerta (Argonne National Laboratory, USA), Kyle Chard (University of Chicago, USA)
Status: Positions available!
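Aspect (1) above can be sketched with a toy node-local cache. The `ShardedSampleCache` class and its deterministic hash-based sharding are illustrative assumptions, not the project's design: they merely show how each node can cache a disjoint subset of the training samples so that repeated epochs avoid re-reading from shared storage.

```python
import hashlib

class ShardedSampleCache:
    """Toy per-node sample cache: samples are sharded deterministically
    across nodes so each training process caches a disjoint subset.
    (Illustrative sketch, not the BRAID implementation.)"""
    def __init__(self, node_id, num_nodes, load_fn):
        self.node_id, self.num_nodes = node_id, num_nodes
        self.load_fn = load_fn          # fallback: fetch from shared storage
        self._cache = {}

    def owner(self, sample_id):
        """Deterministic owner node for a sample, via a stable hash."""
        h = int(hashlib.md5(str(sample_id).encode()).hexdigest(), 16)
        return h % self.num_nodes

    def get(self, sample_id):
        if self.owner(sample_id) == self.node_id and sample_id in self._cache:
            return self._cache[sample_id]         # hit: no I/O needed
        sample = self.load_fn(sample_id)          # miss: go to shared storage
        if self.owner(sample_id) == self.node_id:
            self._cache[sample_id] = sample       # cache only the owned shard
        return sample

loads = []
def slow_load(i):
    loads.append(i)                               # count shared-storage reads
    return i * 10

cache = ShardedSampleCache(node_id=0, num_nodes=1, load_fn=slow_load)
cache.get(7); cache.get(7)                        # second access is a hit
print(len(loads))  # → 1
```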
Triple Convergence aims to address the requirements of modern workflows that combine HPC, Big Data, and machine learning tasks at scale. Specifically, we focus on three aspects: (1) Robustness: how to make workflow tasks robust to failures based on their nature (e.g., HPC tasks need to use checkpoint-restart, but machine learning tasks may take advantage of relaxed state recovery that avoids rollback); (2) Reconfigurability: how to enable workflow tasks to adapt to changing conditions dynamically in order to facilitate elasticity (e.g., dynamically add/remove resources to tasks), better resource sharing, and flexible communication channels/data buffers between them; (3) Reproducibility: understanding what provenance information and runtime decisions are essential and need to be captured at runtime in order to replay a workflow task under potentially different configurations without affecting the results.
Collaborators: Tom Peterka, Orcun Yildiz (Argonne National Laboratory, USA), Dmitriy Morozov, Arnur Nigmetov (Lawrence Berkeley National Laboratory, USA)
Status: Positions available!
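The robustness distinction in aspect (1) can be made concrete with a small numerical toy. The `run` function and the Newton-iteration workload are my own assumptions for illustration: an HPC-style task restores an exact checkpoint after a failure (rollback), while an ML-style task may resume from a stale state without rollback, because many learning-style iterations are self-correcting and still converge.

```python
def newton_sqrt_step(x, a=2.0):
    """One step of Newton's iteration for sqrt(a); self-correcting."""
    return 0.5 * (x + a / x)

def run(strategy, steps=30, fail_at=10):
    """Iterate; inject one failure and recover according to the strategy.
    'rollback' restores the exact periodic checkpoint (HPC-style);
    'relaxed' resumes from a stale value without rollback (ML-style)."""
    x, checkpoint = 10.0, 10.0
    for step in range(steps):
        if step == fail_at:
            x = checkpoint if strategy == "rollback" else 7.0  # stale state
        x = newton_sqrt_step(x)
        if step % 5 == 0:
            checkpoint = x  # periodic checkpoint
    return x

# Both strategies converge to sqrt(2); relaxed recovery skipped the rollback.
print(round(run("rollback"), 6), round(run("relaxed"), 6))
```

The self-correcting property is what makes relaxed recovery safe here; for a tightly coupled HPC solver whose state must be bitwise consistent across ranks, only the rollback path is valid, which is exactly why recovery must be matched to the nature of the task.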
VeloC (Very Low Overhead Checkpointing System) is a multi-level checkpoint-restart runtime for HPC supercomputing infrastructures and large-scale data centers, sponsored by ECP (Exascale Computing Project). It aims to deliver high performance and scalability for complex heterogeneous storage hierarchies without sacrificing ease of use and flexibility. Checkpoint-restart is primarily used as a fault-tolerance mechanism for tightly coupled HPC applications, but is essential in many other administrative use cases: suspend-resume, migration, and debugging. Furthermore, many applications naturally return to previous states as part of the computational model (e.g., adjoint computations, neural networks), which can be performed efficiently using checkpoint-restart.
Collaborators: Franck Cappello, Sheng Di (Argonne National Laboratory, USA); Kathryn Mohror, Adam Moody, Gregory Kosinovski (Lawrence Livermore National Laboratory, USA).
Status: VeloC is openly available under the MIT license here. Positions available!
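The multi-level idea can be sketched in miniature. The `MultiLevelCheckpointer` class below is an illustrative toy, not the actual VeloC API: checkpoints are first written to fast node-local storage and only periodically flushed to slower shared storage, and restart prefers the freshest surviving copy.

```python
import shutil, tempfile
from pathlib import Path

class MultiLevelCheckpointer:
    """Toy two-level scheme in the spirit of multi-level checkpointing:
    fast node-local writes, periodic flushes to a slower shared level.
    (Illustrative only; not the actual VeloC interface.)"""
    def __init__(self, local_dir, shared_dir, flush_every=2):
        self.local, self.shared = Path(local_dir), Path(shared_dir)
        self.flush_every = flush_every

    def checkpoint(self, name, version, payload: bytes):
        f = self.local / f"{name}.{version}.ckpt"
        f.write_bytes(payload)                      # level 1: fast local write
        if version % self.flush_every == 0:
            shutil.copy(f, self.shared / f.name)    # level 2: flush to shared FS

    def restart(self, name):
        """Prefer the freshest local copy; fall back to shared storage."""
        for d in (self.local, self.shared):
            ckpts = sorted(d.glob(f"{name}.*.ckpt"))
            if ckpts:
                return ckpts[-1].read_bytes()
        return None

local, shared = tempfile.mkdtemp(), tempfile.mkdtemp()
mlc = MultiLevelCheckpointer(local, shared)
for v in range(3):
    mlc.checkpoint("app", v, f"state-{v}".encode())
shutil.rmtree(local); Path(local).mkdir()           # simulate node failure
print(mlc.restart("app"))  # → b'state-2' (recovered from the shared level)
```

The design point the toy captures: most checkpoints cost only a fast local write, while the slower shared level guards against node loss; a real runtime like VeloC additionally performs the flushes asynchronously and adds levels such as partner replication and erasure coding.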
HP-CDS (High Performance Collaborative Distributed Storage) is an experimental storage prototype specifically designed to deliver high throughput with low resource utilization at scale for data-intensive distributed applications that exhibit non-trivial I/O patterns or irregularity due to multi-tenancy. It is centered around the idea of organizing the storage elements in a decentralized peer-to-peer network whose peers constantly exchange information about locally observed content and I/O access patterns in order to discover global trends that can be exploited through collaboration, such as dynamic prefetching of data blocks from peers with similar access patterns, on-the-fly de-duplication and dissemination of hot data, and automated system-level storage elasticity.
Collaborators: Andrzej Kochut, Alexei Karve (IBM Research USA); Kate Keahey (Argonne National Laboratory, USA); Pierre Riteau (University of Chicago, USA)
Status: HP-CDS was integrated into the OpenStack ecosystem and used within IBM.
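The de-duplication aspect mentioned above reduces to content addressing, which a few lines can demonstrate. The `DedupStore` class is an illustrative toy, not the HP-CDS implementation: blocks are keyed by their content hash, so identical blocks written by different tenants (e.g., virtual disks cloned from the same image) are stored only once.

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store: identical blocks from different
    tenants are stored once. (Sketch of on-the-fly de-duplication;
    not the HP-CDS implementation.)"""
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}      # content digest -> block bytes (stored once)
        self.objects = {}     # object name -> ordered list of digests

    def put(self, name, data: bytes):
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # de-duplication happens here
            recipe.append(digest)
        self.objects[name] = recipe

    def get(self, name):
        """Reassemble an object from its block recipe."""
        return b"".join(self.blocks[d] for d in self.objects[name])

store = DedupStore()
store.put("vm1", b"AAAABBBBCCCC")
store.put("vm2", b"AAAABBBBDDDD")   # shares two blocks with vm1
print(len(store.blocks))  # → 4 unique blocks instead of 6
```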
BlobSeer is a large-scale distributed data storage service centered around the idea of using versioning at both the data and metadata levels to deliver high throughput under concurrency. BlobSeer has demonstrated its effectiveness in several contexts, including big data analytics based on Hadoop MapReduce, scalable checkpoint-restart for HPC applications, and virtual disk dissemination, snapshotting, and live block migration in large-scale IaaS clouds. BlobSeer became a main research direction of the KerData team, with numerous projects and PhD theses centered around it.
Collaborators: Gabriel Antoniu, Luc Bouge (INRIA, France) and many other current and former members of the KerData team.
Status: BlobSeer is openly available under LGPL here.
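The versioning principle behind BlobSeer can be sketched compactly. The `VersionedBlob` class below is an illustrative toy (not BlobSeer's actual interface), with writes assumed chunk-aligned for brevity: a write never mutates a published version; it publishes new metadata that shares unchanged chunks with older versions, so concurrent readers always see a consistent snapshot while writers proceed without locking.

```python
class VersionedBlob:
    """Toy versioned blob: immutable chunks (data level) plus per-version
    chunk maps (metadata level) that share unchanged chunks across versions.
    (Sketch of the versioning idea; not BlobSeer's actual interface.)"""
    def __init__(self):
        self.chunks = {}       # chunk_id -> bytes, never mutated once written
        self.versions = []     # version -> ordered list of chunk_ids

    def write(self, chunk_index, data_chunks):
        """Publish a new version; writes are chunk-aligned for simplicity."""
        base = list(self.versions[-1]) if self.versions else []
        for j, chunk in enumerate(data_chunks):
            i = chunk_index + j
            base += [None] * (i + 1 - len(base))    # grow the chunk map
            cid = len(self.chunks)
            self.chunks[cid] = chunk                # new chunk, old ones intact
            base[i] = cid
        self.versions.append(base)
        return len(self.versions) - 1

    def read(self, version):
        """Readers see an immutable snapshot of the requested version."""
        return b"".join(self.chunks[c] for c in self.versions[version])

blob = VersionedBlob()
v0 = blob.write(0, [b"AAAA", b"BBBB"])
v1 = blob.write(1, [b"bbbb"])        # overwrites chunk 1 in a NEW version
print(blob.read(v0), blob.read(v1))  # → b'AAAABBBB' b'AAAAbbbb'
```

Because `v0` remains readable after the overwrite, readers and writers never conflict; this is the property that lets a versioning-based design sustain high throughput under concurrency.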