Publications

  1. Nicolae, B. 2020. DataStates: Towards Lightweight Data Models for Deep Learning. SMC’20: The 2020 Smoky Mountains Computational Sciences and Engineering Conference (Nashville, United States, 2020).
    Details
    Keywords: deep learning, state preservation, clone, model reuse
    Abstract: A key emerging pattern in deep learning applications is the need to capture intermediate DNN model snapshots and preserve or clone them to explore a large number of alternative training and/or inference paths. However, with increasing model complexity and new training approaches that mix data, model, pipeline and layer-wise parallelism, this pattern is challenging to address in a scalable and efficient manner. To this end, this position paper advocates for rethinking how to represent and manipulate DNN learning models. It relies on a broader notion of data states, a collection of annotated, potentially distributed data sets (tensors in the case of DNN models) that AI applications can capture at key moments during the runtime and revisit/reuse later. Instead of explicitly interacting with the storage layer (e.g., writing to a file), users can "tag" DNN models at key moments during runtime with metadata that expresses attributes and persistency/movement semantics. A high-performance runtime is then responsible for interpreting the metadata and performing the necessary actions in the background, while offering a rich interface to find data states of interest. Using this approach has benefits at several levels: new capabilities, performance portability, high performance and scalability.
    BibTeX
    @inproceedings{DataStates20,
      title = {{DataStates: Towards Lightweight Data Models for Deep Learning}},
      author = {Nicolae, Bogdan},
      url = {https://hal.inria.fr/hal-02941295},
      booktitle = {{SMC'20: The 2020 Smoky Mountains Computational Sciences and Engineering Conference}},
      address = {Nashville, United States},
      year = {2020},
      keywords = {deep learning, state preservation, clone, model reuse}
    }

  2. Maurya, A., Nicolae, B., Guliani, I. and Rafique, M.M. 2020. CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters. DS-RT’20: The 24th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (Prague, Czech Republic, 2020), 167–174.
    Details
    Keywords: high performance computing, job scheduling, checkpointing strategies
    Abstract: The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components in this ecosystem in a variety of scenarios, such as steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by supercomputing infrastructures needs to be complemented with support to schedule opportunistic on-demand analytics jobs, leading to the problem of efficient preemption of batch jobs with minimum loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-offs arising between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and loss of progress due to preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze the aforementioned trade-off and take decisions in near real-time. Compared with other state-of-the-art approaches using traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution, while achieving high performance and scalability.
    BibTeX
    @inproceedings{CoSim20,
      title = {{CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters}},
      author = {Maurya, Avinash and Nicolae, Bogdan and Guliani, Ishan and Rafique, M Mustafa},
      url = {https://hal.inria.fr/hal-02925237},
      booktitle = {{DS-RT'20: The 24th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications}},
      address = {Prague, Czech Republic},
      year = {2020},
      pages = {167-174},
      keywords = {high performance computing, job scheduling, checkpointing strategies}
    }

  3. Nicolae, B., Wozniak, J.M., Dorier, M. and Cappello, F. 2020. DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training. CLUSTER’20: The 2020 IEEE International Conference on Cluster Computing (Kobe, Japan, 2020).
    Details
    Keywords: deep learning, data-parallel training, layer-wise parallelism, state cloning and replication, large-scale AI
    Abstract: Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest to improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches are making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.
    BibTeX
    @inproceedings{DeepClone20,
      title = {{DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training}},
      author = {Nicolae, Bogdan and Wozniak, Justin M and Dorier, Matthieu and Cappello, Franck},
      url = {https://hal.archives-ouvertes.fr/hal-02914545},
      booktitle = {{CLUSTER'20: The 2020 IEEE International Conference on Cluster Computing}},
      address = {Kobe, Japan},
      year = {2020},
      keywords = {deep learning, data-parallel training, layer-wise parallelism, state cloning and replication, large-scale AI}
    }

  4. Dey, T., Sato, K., Nicolae, B., Guo, J., Domke, J., Yu, W., Cappello, F. and Mohror, K. 2020. Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning. HPS’20: The 2020 IEEE International Workshop on High-Performance Storage (New Orleans, USA, 2020).
    Details
    Keywords: high performance computing, checkpoint-restart, machine learning optimization
    Abstract: With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multi-level checkpointing efficiently, it is important to optimize checkpoint/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve the performance in optimizing checkpoint configurations.
    BibTeX
    @inproceedings{MLCkpt20,
      title = {Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning},
      url = {https://hal.archives-ouvertes.fr/hal-02914478},
      booktitle = {HPS'20: The 2020 IEEE International Workshop on High-Performance Storage},
      author = {Dey, Tonmoy and Sato, Kento and Nicolae, Bogdan and Guo, Jian and Domke, Jens and Yu, Weikuan and Cappello, Franck and Mohror, Kathryn},
      address = {New Orleans, USA},
      doi = {10.1109/IPDPSW50202.2020.00174},
      year = {2020},
      keywords = {high performance computing, checkpoint-restart, machine learning optimization}
    }

  5. Nicolae, B., Li, J., Wozniak, J., Bosilca, G., Dorier, M. and Cappello, F. 2020. DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models. CCGrid’20: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (Melbourne, Australia, 2020), 172–181.
    Details
    Keywords: deep learning, checkpointing, state preservation, multi-level data persistence, fine-grain asynchronous I/O
    Abstract: In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
    BibTeX
    @inproceedings{DeepFreeze20,
      title = {DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models},
      year = {2020},
      author = {Nicolae, Bogdan and Li, Jiali and Wozniak, Justin and Bosilca, George and Dorier, Matthieu and Cappello, Franck},
      booktitle = {CCGrid'20: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing},
      address = {Melbourne, Australia},
      pages = {172-181},
      doi = {10.1109/CCGrid49817.2020.00-76},
      url = {https://hal.inria.fr/hal-02543977},
      keywords = {deep learning, checkpointing, state preservation, multi-level data persistence, fine-grain asynchronous I/O}
    }

  6. Nicolae, B., Moody, A., Gonsiorowski, E., Mohror, K. and Cappello, F. 2019. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. IPDPS’19: The 2019 IEEE International Parallel and Distributed Processing Symposium (Rio de Janeiro, Brazil, 2019), 911–920.
    Details
    Keywords: parallel I/O, checkpoint-restart, immutable data, adaptive multilevel asynchronous I/O
    Abstract: Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
    BibTeX
    @inproceedings{VeloCIPDPS19,
      title = {VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale},
      year = {2019},
      author = {Nicolae, Bogdan and Moody, Adam and Gonsiorowski, Elsa and Mohror, Kathryn and Cappello, Franck},
      booktitle = {IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium},
      pages = {911-920},
      address = {Rio de Janeiro, Brazil},
      doi = {10.1109/IPDPS.2019.00099},
      url = {https://hal.inria.fr/hal-02184203},
      keywords = {parallel I/O, checkpoint-restart, immutable data, adaptive multilevel asynchronous I/O}
    }

  7. Tseng, S.-M., Nicolae, B., Bosilca, G., Jeannot, E., Chandramowlishwaran, A. and Cappello, F. 2019. Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring. EuroPar’19: 25th International European Conference on Parallel and Distributed Systems (Goettingen, Germany, 2019), 47–60.
    Details
    Keywords: Work stealing, Prediction of resource utilization, Timeseries forecasting, Network monitoring, Online learning
    Abstract: Stealing network bandwidth helps a variety of HPC runtimes and services to run additional operations in the background without negatively affecting the applications. A key ingredient to make this possible is an accurate prediction of the future network utilization, enabling the runtime to plan the background operations in advance, such as to avoid competing with the application for network bandwidth. In this paper, we propose a portable deep learning predictor that only uses the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not require an initial training phase; it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state of the art on two representative applications.
    BibTeX
    @inproceedings{NetPredEuroPar19,
      title = {Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring},
      year = {2019},
      author = {Tseng, Shu-Mei and Nicolae, Bogdan and Bosilca, George and Jeannot, Emmanuel and Chandramowlishwaran, Aparna and Cappello, Franck},
      booktitle = {EuroPar'19: 25th International European Conference on Parallel and Distributed Systems},
      pages = {47-60},
      address = {Goettingen, Germany},
      doi = {10.1007/978-3-030-29400-7_4},
      url = {https://hal.inria.fr/hal-02184204},
      keywords = {Work stealing, Prediction of resource utilization, Timeseries forecasting, Network monitoring, Online learning}
    }

  8. Liang, X., Di, S., Li, S., Tao, D., Nicolae, B., Chen, Z. and Cappello, F. 2019. Significantly Improving Lossy Compression Quality Based on an Optimized Hybrid Prediction Model. SC ’19: 32nd International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, USA, 2019), 1–26.
    Details
    Keywords: Error-Bounded Lossy Compression, Rate Distortion, Data Dumping/Loading, Compression Performance
    Abstract: With the ever-increasing volumes of data produced by today’s large-scale scientific simulations, error-bounded lossy compression techniques have become critical: not only can they significantly reduce the data size but they also can retain high data fidelity for post-analysis. In this paper, we design a strategy to improve the compression quality significantly based on an optimized, hybrid prediction model. Our contribution is fourfold. (1) We propose a novel, transform-based predictor and optimize its compression quality. (2) We significantly improve the coefficient-encoding efficiency for the data-fitting predictor. (3) We propose an adaptive framework that can select the best-fit predictor accurately for different datasets. (4) We evaluate our solution and several existing state-of-the-art lossy compressors by running real-world applications on a supercomputer with 8,192 cores. Experiments show that our adaptive compressor can improve the compression ratio by 112∼165% compared with the second-best compressor. The parallel I/O performance is improved by about 100% because of the significantly reduced data size. The total I/O time is reduced by up to 60X with our compressor compared with the original I/O time.
    BibTeX
    @inproceedings{SZ-SC19,
      title = {Significantly Improving Lossy Compression Quality Based on an Optimized Hybrid Prediction Model},
      year = {2019},
      author = {Liang, Xin and Di, Sheng and Li, Sihuan and Tao, Dingwen and Nicolae, Bogdan and Chen, Zizhong and Cappello, Franck},
      booktitle = {SC '19: 32nd International Conference for High Performance Computing, Networking, Storage and Analysis},
      pages = {1-26},
      address = {Denver, USA},
      doi = {10.1145/3295500.3356193},
      url = {http://tao.cs.ua.edu/paper/SC19-HybridModel.pdf},
      keywords = {Error-Bounded Lossy Compression, Rate Distortion, Data Dumping/Loading, Compression Performance}
    }

  9. Liang, X., Di, S., Tao, D., Li, S., Nicolae, B., Chen, Z. and Cappello, F. 2019. Improving Performance of Data Dumping with Lossy Compression for Scientific Simulation. CLUSTER’19: IEEE International Conference on Cluster Computing (Albuquerque, USA, 2019), 1–11.
    Details
    Keywords: lossy compression, efficient data flush, parallel file systems
    Abstract: Because of the ever-increasing data being produced by today’s high performance computing (HPC) scientific simulations, I/O performance is becoming a significant bottleneck for their executions. An efficient error-controlled lossy compressor is a promising solution to significantly reduce data writing time for scientific simulations running on supercomputers. In this paper, we explore how to optimize the data dumping performance for scientific simulation by leveraging error-bounded lossy compression techniques. The contributions of the paper are threefold. (1) We propose a novel I/O performance profiling model that can effectively represent the I/O performance with different execution scales and data sizes, and optimize the estimation accuracy of data dumping performance using the least-squares method. (2) We develop an adaptive lossy compression framework that can select the best-fit compressor (between two leading lossy compressors, SZ and ZFP) with optimized parameter settings with respect to overall data dumping performance. (3) We evaluate our adaptive lossy compression framework with up to 32k cores on a supercomputer equipped with fast I/O systems, using real-world scientific simulation datasets. Experiments show that our solution almost always brings the data dumping performance to the optimal level, with very accurate selection of the best-fit lossy compressor and settings. The data dumping performance can be improved by up to 27% at different scales.
    BibTeX
    @inproceedings{SZ-CLUSTER19,
      title = {Improving Performance of Data Dumping with Lossy Compression for Scientific Simulation},
      year = {2019},
      author = {Liang, Xin and Di, Sheng and Tao, Dingwen and Li, Sihuan and Nicolae, Bogdan and Chen, Zizhong and Cappello, Franck},
      booktitle = {CLUSTER'19: IEEE International Conference on Cluster Computing},
      pages = {1-11},
      address = {Albuquerque, USA},
      doi = {10.1109/CLUSTER.2019.8891037},
      url = {http://tao.cs.ua.edu/paper/CLUSTER19-IOAwareLossy.pdf},
      keywords = {lossy compression, efficient data flush, parallel file systems}
    }

  10. Nicolae, B., Riteau, P., Zhen, Z. and Keahey, K. 2019. Transparent Throughput Elasticity for Modern Cloud Storage: An Adaptive Block-Level Caching Proposal. Applying Integration Techniques and Methods in Distributed Systems and Technologies. IGI Global. 156–191.
    Details
    Keywords: cloud computing, storage elasticity, adaptive I/O throughput, block-level caching
    Abstract: Storage elasticity on the cloud is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. In this chapter, the authors explore how to transparently boost the I/O bandwidth during peak utilization to deliver high performance without over-provisioning storage resources. The proposal relies on the idea of leveraging short-lived virtual disks with better (and more expensive) performance characteristics to act during peaks as a caching layer for the persistent virtual disks where the application data is stored during runtime. The authors show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Second, they introduce a corresponding performance and cost prediction methodology. They demonstrate the benefits of their proposal both for micro-benchmarks and for two real-life applications using large-scale experiments, and conclude with a discussion on how these techniques can be generalized for the increasingly complex landscape of modern cloud storage.
    BibTeX
    @incollection{ElasticStoreIGI19,
      title = {Transparent Throughput Elasticity for Modern Cloud Storage: An Adaptive Block-Level Caching Proposal},
      year = {2019},
      author = {Nicolae, Bogdan and Riteau, Pierre and Zhen, Zhuo and Keahey, Kate},
      booktitle = {Applying Integration Techniques and Methods in Distributed Systems and Technologies},
      pages = {156-191},
      publisher = {IGI Global},
      isbn = {9781522582953},
      doi = {10.4018/978-1-5225-8295-3.ch007},
      keywords = {cloud computing, storage elasticity, adaptive I/O throughput, block-level caching}
    }

  11. Caino-Lores, S., Carretero, J., Nicolae, B., Yildiz, O. and Peterka, T. 2019. Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY. IEEE Access. 7 (2019), 156929–156955. DOI:https://doi.org/10.1109/ACCESS.2019.2949836.
    Details
    Keywords: big data, HPC, convergence, data model, Spark, DIY
    Abstract: Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the architecture proposed. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Later, Spark-DIY is analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment.
    BibTeX
    @article{SparkDIYIEEE19,
      title = {Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY},
      year = {2019},
      author = {Caino-Lores, Silvina and Carretero, Jesus and Nicolae, Bogdan and Yildiz, Orcun and Peterka, Tom},
      journal = {IEEE Access},
      volume = {7},
      pages = {156929--156955},
      doi = {10.1109/ACCESS.2019.2949836},
      url = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8884083},
      keywords = {big data, HPC, convergence, data model, Spark, DIY}
    }

  12. Li, J., Nicolae, B., Wozniak, J. and Bosilca, G. 2019. Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training. MLHPC’19: The 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (in conjunction with SC’19) (Denver, USA, 2019), 1–8.
    Details
    Keywords: deep learning, behavior analysis, Tensorflow, data-parallel learning, tensor parallelism
    Abstract: In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as the number of nodes, number of workers, threads per node, and batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity, and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction and weight updates mitigates the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.
    BibTeX
    @inproceedings{DLAsyncMLHPC19,
      title = {Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training},
      year = {2019},
      author = {Li, Jiali and Nicolae, Bogdan and Wozniak, Justin and Bosilca, George},
      booktitle = {MLHPC'19: The 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (in conjunction with SC'19)},
      pages = {1-8},
      address = {Denver, USA},
      doi = {10.1109/MLHPC49564.2019.00006},
      url = {https://hal.inria.fr/hal-02570148},
      keywords = {deep learning, behavior analysis, Tensorflow, data-parallel learning, tensor parallelism}
    }

  13. Nicolae, B., Cappello, F., Moody, A., Gonsiorowski, E. and Mohror, K. 2018. VeloC: Very Low Overhead Checkpointing System. SC ’18: 31st International Conference for High Performance Computing, Networking, Storage and Analysis (Dallas, USA, 2018).
    Details
    Keywords: HPC, resilience, checkpoint-restart
    Abstract: Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that result in poor scalability and performance. As modern HPC infrastructures continue to evolve, there is a growing gap between compute capacity and I/O capabilities. Furthermore, the storage hierarchy is becoming increasingly heterogeneous: in addition to parallel file systems, it comprises burst buffers, key-value stores, deep memory hierarchies at node level, etc. In this context, the state of the art is insufficient to deal with the diversity of vendor APIs, performance and persistency characteristics. This poster proposes VeloC, a low-overhead checkpointing system specifically designed to address the checkpointing needs of future exascale HPC systems. VeloC offers a simple API at user level, while employing an advanced multi-level resilience strategy that transparently optimizes the performance and scalability of checkpointing by leveraging heterogeneous storage.
    BibTeX
    @inproceedings{VeloC-SC18,
      author = {Nicolae, Bogdan and Cappello, Franck and Moody, Adam and Gonsiorowski, Elsa and Mohror, Kathryn},
      title = {VeloC: Very Low Overhead Checkpointing System},
      booktitle = {{SC '18: 31st International Conference for High Performance Computing, Networking, Storage and Analysis}},
      year = {2018},
      address = {Dallas, USA},
      url = {https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post230s2-file3.pdf},
      note = {Poster Session},
      keywords = {HPC, resilience, checkpoint-restart}
    }

  14. Clemente-Castello, F.J., Nicolae, B., Mayo, R. and Fernandez, J.C. 2018. Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting. IEEE Transactions on Parallel and Distributed Systems. 29, 8 (2018), 1794–1807. DOI:https://doi.org/10.1109/TPDS.2018.2802932.
    Details
    Keywords: cloud computing, hybrid cloud, bursting, MapReduce
    Abstract: Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) can be a cost-effective way to deal with the increasing complexity of big data analytics, especially for iterative applications. However, the low throughput, high latency network link between the on-premise and off-premise resources (“weak link”) makes maintaining scalability difficult. While several data locality techniques have been designed for big data bursting on hybrid clouds, their effectiveness is difficult to estimate in advance. Yet such estimations are critical, because they help users decide whether the extra pay-as-you-go cost incurred by using the off-premise resources justifies the runtime speed-up. To this end, the current paper presents a performance model and methodology to estimate the runtime of iterative MapReduce applications in a hybrid cloud-bursting scenario. The paper focuses on the overhead incurred by the weak link at fine granularity, for both the map and the reduce phases. This approach enables high estimation accuracy, as demonstrated by extensive experiments at scale using a mix of real-world iterative MapReduce applications from standard big data benchmarking suites that cover a broad spectrum of data patterns. Not only are the produced estimations accurate in absolute terms compared with experimental results, but they are also up to an order of magnitude more accurate than applying state-of-the-art estimation approaches originally designed for single-site MapReduce deployments.
    BibTeX
    @article{MapRedTPDS18,
      title = {Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting},
      year = {2018},
      author = {Clemente-Castello, Francisco J and Nicolae, Bogdan and Mayo, Rafael and Fernandez, Juan Carlos},
      journal = {IEEE Transactions on Parallel and Distributed Systems},
      volume = {29},
      number = {8},
      pages = {1794-1807},
      doi = {10.1109/TPDS.2018.2802932},
      url = {https://hal.archives-ouvertes.fr/hal-01999033/en},
      keywords = {cloud computing, hybrid cloud, bursting, MapReduce}
    }

  15. [15]Caino-Lores, S., Carretero, J., Nicolae, B., Yildiz, O. and Peterka, T. 2018. Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models. BDCAT’18: 5th IEEE/ACM International Conference on Big Data Computing Applications and Technologies (Zurich, Switzerland, 2018), 1–10. 
    Details
    Keywords: big data, Spark, high performance computing, convergence
    Abstract: Today’s scientific applications are increasingly relying on a variety of data sources, storage facilities, and computing infrastructures, and there is a growing demand for data analysis and visualization for these applications. In this context, exploiting Big Data frameworks for scientific computing is an opportunity to incorporate high-level libraries, platforms, and algorithms for machine learning, graph processing, and streaming; inherit their data awareness and fault-tolerance; and increase productivity. Nevertheless, limitations exist when Big Data platforms are integrated with an HPC environment, namely poor scalability, severe memory overhead, and huge development effort. This paper focuses on a popular Big Data framework -Apache Spark- and proposes an architecture to support the integration of highly scalable MPI block-based data models and communication patterns with a map-reduce-based programming model. The resulting platform preserves the data abstraction and programming interface of Spark, without conducting any changes in the framework, but allows the user to delegate operations to the MPI layer. The evaluation of our prototype shows that our approach integrates Spark and MPI efficiently at scale, so end users can take advantage of the productivity facilitated by the rich ecosystem of high-level Big Data tools and libraries based on Spark, without compromising efficiency and scalability.
    BibTex
    @inproceedings{SparkDIY18,
      title = {Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models},
      year = {2018},
      author = {Caino-Lores, Silvina and Carretero, Jesus and Nicolae, Bogdan and Yildiz, Orcun and Peterka, Tom},
      booktitle = {BDCAT'18: 5th IEEE/ACM International Conference on Big Data Computing Applications and Technologies},
      pages = {1-10},
      address = {Zurich, Switzerland},
      doi = {10.1109/BDCAT.2018.00010},
      keywords = {big data, Spark, high performance computing, convergence}
    }

  16. [16]Marcu, O.-C., Costan, A., Antoniu, G., Perez-Hernandez, M.S., Nicolae, B., Tudoran, R. and Bortoli, S. 2018. KerA: Scalable Data Ingestion for Stream Processing. ICDCS’18: 38th IEEE International Conference on Distributed Computing Systems (Vienna, Austria, 2018), 1480–1485.   
    Details
    Keywords: big data, stream computing, data ingestion
    Abstract: Big Data applications are increasingly moving from batch-oriented execution models to stream-based models that enable them to extract valuable insights close to real-time. To support this model, an essential part of the streaming processing pipeline is data ingestion, i.e., the collection of data from various sources (sensors, NoSQL stores, filesystems, etc.) and their delivery for processing. Data ingestion needs to support high throughput, low latency and must scale to a large number of both data producers and consumers. Since the overall performance of the whole stream processing pipeline is limited by that of the ingestion phase, it is critical to satisfy these performance goals. However, state-of-art data ingestion systems such as Apache Kafka build on static stream partitioning and offset-based record access, trading performance for design simplicity. In this paper we propose KerA, a data ingestion framework that alleviates the limitations of the state of the art thanks to a dynamic partitioning scheme and to lightweight indexing, thereby improving throughput, latency and scalability. Experimental evaluations show that KerA outperforms Kafka up to 4x for ingestion throughput and up to 5x for the overall stream processing throughput. Furthermore, they show that KerA is capable of delivering data fast enough to saturate the big data engine acting as the consumer.
    BibTex
    @inproceedings{KeraICDCS18,
      title = {KerA: Scalable Data Ingestion for Stream Processing},
      year = {2018},
      author = {Marcu, Ovidiu-Cristian and Costan, Alexandru and Antoniu, Gabriel and Perez-Hernandez, Maria S and Nicolae, Bogdan and Tudoran, Radu and Bortoli, Stefano},
      booktitle = {ICDCS'18: 38th IEEE International Conference on Distributed Computing Systems},
      pages = {1480-1485},
      address = {Vienna, Austria},
      doi = {10.1109/ICDCS.2018.00152},
      url = {https://hal.archives-ouvertes.fr/hal-01773799/en},
      keywords = {big data, stream computing, data ingestion}
    }

  17. [17]Nicolae, B., Costa, C., Misale, C., Katrinis, K. and Park, Y. 2017. Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics. IEEE Transactions on Parallel and Distributed Systems. 28, 6 (2017), 1663–1674. DOI:https://doi.org/10.1109/TPDS.2016.2627558.   
    Details
    Keywords: elastic buffering, big data analytics, data shuffling, memory-efficient I/O, Spark
    Abstract: Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows: up to 40% better performance with 50% less memory utilization for buffering and excellent weak scalability.
    BibTex
    @article{ShuffleTPDS16,
      title = {Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics},
      year = {2017},
      author = {Nicolae, Bogdan and Costa, Carlos and Misale, Claudia and Katrinis, Kostas and Park, Yoonho},
      journal = {IEEE Transactions on Parallel and Distributed Systems},
      volume = {28},
      number = {6},
      pages = {1663-1674},
      doi = {10.1109/TPDS.2016.2627558},
      url = {https://hal.archives-ouvertes.fr/hal-01531374v1/en},
      keywords = {elastic buffering, big data analytics, data shuffling, memory-efficient I/O, Spark}
    }

  18. [18]Clemente-Castello, F.J., Nicolae, B., Rafique, M.M., Mayo, R. and Fernandez, J.C. 2017. Evaluation of Data Locality Strategies for Hybrid Cloud Bursting of Iterative MapReduce. CCGrid’17: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Madrid, Spain, 2017), 181–185.
    Details
    Keywords: hybrid cloud, big data analytics, data locality, data management, scheduling, MapReduce, iterative
    Abstract: Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) is a popular and cost-effective way to deal with the increasing complexity of big data analytics. It is particularly promising for iterative MapReduce applications that reuse massive amounts of input data at each iteration, which compensates for the high overhead and cost of concurrent data transfers from the on-premise to the off-premise VMs over a weak inter-site link that is of limited capacity. In this paper we study how to combine various MapReduce data locality techniques designed for hybrid cloud bursting in order to achieve scalability for iterative MapReduce applications in a cost-effective fashion. This is a non-trivial problem due to the complex interaction between the data movements over the weak link and the scheduling of computational tasks that have to adapt to the shifting data distribution. We show that using the right combination of techniques, iterative MapReduce applications can scale well in a hybrid cloud bursting scenario and even come close to the scalability observed in single sites.
    BibTex
    @inproceedings{HybridCCGrid17,
      title = {Evaluation of Data Locality Strategies for Hybrid Cloud Bursting of Iterative MapReduce},
      year = {2017},
      author = {Clemente-Castello, Francisco J and Nicolae, Bogdan and Rafique, M Mustafa and Mayo, Rafael and Fernandez, Juan Carlos},
      booktitle = {CCGrid’17: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing},
      pages = {181-185},
      address = {Madrid, Spain},
      doi = {10.1109/CCGRID.2017.96},
      url = {https://hal.inria.fr/hal-01469991},
      keywords = {hybrid cloud, big data analytics, data locality, data management, scheduling, MapReduce, iterative}
    }

  19. [19]Marcu, O.-C., Costan, A., Antoniu, G., Perez-Hernandez, M.S., Tudoran, R., Bortoli, S. and Nicolae, B. 2017. Towards a unified storage and ingestion architecture for stream processing. BigData’17: 2017 IEEE International Conference on Big Data (Boston, USA, 2017), 2402–2407.   
    Details
    Keywords: Big Data, Streaming, Storage, Ingestion, Unified Architecture
    Abstract: Big Data applications are rapidly moving from a batch-oriented execution model to a streaming execution model in order to extract value from the data in real-time. However, processing live data alone is often not enough: in many cases, such applications need to combine the live data with previously archived data to increase the quality of the extracted insights. Current streaming-oriented runtimes and middlewares are not flexible enough to deal with this trend, as they address ingestion (collection and pre-processing of data streams) and persistent storage (archival of intermediate results) using separate services. This separation often leads to I/O redundancy (e.g., write data twice to disk or transfer data twice over the network) and interference (e.g., I/O bottlenecks when collecting data streams and writing archival data simultaneously). In this position paper, we argue for a unified ingestion and storage architecture for streaming data that addresses the aforementioned challenge. We identify a set of constraints and benefits for such a unified model, while highlighting the important architectural aspects required to implement it in real life. Based on these aspects, we briefly sketch our plan for future work that develops the position defended in this paper.
    BibTex
    @inproceedings{KeraBigData17,
      title = {Towards a unified storage and ingestion architecture for stream processing},
      year = {2017},
      author = {Marcu, Ovidiu-Cristian and Costan, Alexandru and Antoniu, Gabriel and Perez-Hernandez, Maria S and Tudoran, Radu and Bortoli, Stefano and Nicolae, Bogdan},
      booktitle = {BigData'17: 2017 IEEE International Conference on Big Data},
      pages = {2402-2407},
      address = {Boston, USA},
      doi = {10.1109/BigData.2017.8258196},
      url = {https://hal.inria.fr/hal-01649207/},
      keywords = {Big Data, Streaming, Storage, Ingestion, Unified Architecture}
    }

  20. [20]Marcu, O.-C., Tudoran, R., Nicolae, B., Costan, A., Antoniu, G. and Perez-Hernandez, M.S. 2017. Exploring Shared State in Key-Value Store for Window-Based Multi-Pattern Streaming Analytics. EBDMA’17: 1st Workshop on the Integration of Extreme Scale Computing and Big Data Management and Analytics (Madrid, Spain, 2017), 1044–1052.   
    Details
    Keywords: Big Data, sliding-window aggregations, memory deduplication, Apache Flink, streaming analytics
    Abstract: We are now witnessing an unprecedented growth of data that needs to be processed at always increasing rates in order to extract valuable insights. Big Data streaming analytics tools have been developed to cope with the online dimension of data processing: they enable real-time handling of live data sources by means of stateful aggregations (operators). Current state-of-art frameworks (e.g. Apache Flink [1]) enable each operator to work in isolation by creating data copies, at the expense of increased memory utilization. In this paper, we explore the feasibility of deduplication techniques to address the challenge of reducing memory footprint for window-based stream processing without significant impact on performance. We design a deduplication method specifically for window-based operators that rely on key-value stores to hold a shared state. We experiment with a synthetically generated workload while considering several deduplication scenarios and based on the results, we identify several potential areas of improvement. Our key finding is that more fine-grained interactions between streaming engines and (key-value) stores need to be designed in order to better respond to scenarios that have to overcome memory scarcity.
    BibTex
    @inproceedings{KeraEBMA17,
      title = {Exploring Shared State in Key-Value Store for Window-Based Multi-Pattern Streaming Analytics},
      year = {2017},
      author = {Marcu, Ovidiu-Cristian and Tudoran, Radu and Nicolae, Bogdan and Costan, Alexandru and Antoniu, Gabriel and Perez-Hernandez, Maria S},
      booktitle = {EBDMA'17: 1st Workshop on the Integration of Extreme Scale Computing and Big Data Management and Analytics},
      pages = {1044-1052},
      address = {Madrid, Spain},
      doi = {10.1109/CCGRID.2017.126},
      url = {https://hal.inria.fr/hal-01530744},
      keywords = {Big Data, sliding-window aggregations, memory deduplication, Apache Flink, streaming analytics}
    }

  21. [21]Nicolae, B., Costa, C., Misale, C., Katrinis, K. and Park, Y. 2016. Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics. CCGrid’16: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Cartagena, Colombia, 2016), 409–412.   
    Details
    Keywords: elastic buffering, Big data analytics, data shuffling, memory-efficient I/O, Spark
    Abstract: Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance and scalability so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub-optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.
    BibTex
    @inproceedings{ShuffleCCGrid16,
      title = {Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics},
      year = {2016},
      author = {Nicolae, Bogdan and Costa, Carlos and Misale, Claudia and Katrinis, Kostas and Park, Yoonho},
      booktitle = {CCGrid’16: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing},
      pages = {409-412},
      address = {Cartagena, Colombia},
      doi = {10.1109/CCGrid.2016.85},
      url = {http://ieeexplore.ieee.org/iel7/7510545/7515592/07515716.pdf},
      note = {Short Paper},
      keywords = {elastic buffering, Big data analytics, data shuffling, memory-efficient I/O, Spark}
    }

  22. [22]Clemente-Castello, F.J., Nicolae, B., Mayo, R., Fernandez, J.C. and Rafique, M. 2016. On Exploiting Data Locality for Iterative MapReduce Applications in Hybrid Clouds. BDCAT’16: 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (Shanghai, China, 2016), 118–122.
    Details
    Keywords: hybrid cloud, bursting, big data analytics, iterative, MapReduce, data locality, data management, scheduling
    Abstract: Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the capacity during peak utilization), has made significant impact especially for big data analytics, where the explosion of data sizes and increasingly complex computations frequently leads to insufficient local data center capacity. Cloud bursting however introduces a major challenge to runtime systems due to the limited throughput and high latency of data transfers between on-premise and off-premise resources (weak link). This issue and how to address it is not well understood. We contribute with a comprehensive study on what challenges arise in this context, what potential strategies can be applied to address them and what best practices can be leveraged in real-life. Specifically, we focus our study on iterative MapReduce applications, which are a class of large-scale data intensive applications particularly popular on hybrid clouds. In this context, we study how data locality can be leveraged over the weak link both from the storage layer perspective (when and how to move it off-premise) and from the scheduling perspective (when to compute off-premise). We conclude with a brief discussion on how to set up an experimental framework suitable to study the effectiveness of our proposal in future work.
    BibTex
    @inproceedings{HybridMapRed16,
      title = {On Exploiting Data Locality for Iterative MapReduce Applications in Hybrid Clouds},
      year = {2016},
      author = {Clemente-Castello, Francisco J and Nicolae, Bogdan and Mayo, Rafael and Fernandez, Juan Carlos and Rafique, Mustafa},
      booktitle = {BDCAT'16: 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies},
      pages = {118-122},
      address = {Shanghai, China},
      doi = {10.1145/3006299.3006329},
      url = {https://hal.archives-ouvertes.fr/hal-01476052v1/en},
      keywords = {hybrid cloud, bursting, big data analytics, iterative, MapReduce, data locality, data management, scheduling}
    }

  23. [23]Nicolae, B., Kochut, A. and Karve, A. 2016. Towards Scalable On-Demand Collective Data Access in IaaS Clouds: An Adaptive Collaborative Content Exchange Proposal. Journal of Parallel and Distributed Computing. 87, (2016), 67–79. DOI:https://doi.org/10.1016/j.jpdc.2015.09.006.   
    Details
    Keywords: IaaS, scalable content dissemination, collective I/O, on-demand data access, high throughput, collaborative I/O, adaptive prefetching
    Abstract: A critical feature of IaaS cloud computing is the ability to quickly disseminate the content of a shared dataset at large scale. In this context, a common pattern is collective read, i.e., accessing the same VM image or dataset from a large number of VM instances concurrently. Several approaches deal with this pattern either by means of pre-broadcast before access or on-demand concurrent access to the repository where the image or dataset is stored. We propose a different solution using a hybrid strategy that augments on-demand access with a collaborative scheme in which the VMs leverage similarities between their access pattern in order to anticipate future read accesses and exchange chunks between themselves in order to reduce contention to the remote repository. Large scale experiments show significant improvement over conventional approaches from multiple perspectives: completion time, sustained read throughput, fairness of I/O read operations and bandwidth utilization.
    BibTex
    @article{ACEStoreJPDC16,
      title = {Towards Scalable On-Demand Collective Data Access in IaaS Clouds: An Adaptive Collaborative Content Exchange Proposal},
      year = {2016},
      author = {Nicolae, Bogdan and Kochut, Andrzej and Karve, Alexei},
      journal = {Journal of Parallel and Distributed Computing},
      volume = {87},
      pages = {67-79},
      doi = {10.1016/j.jpdc.2015.09.006},
      url = {https://hal.inria.fr/hal-01355213/en},
      keywords = {IaaS, scalable content dissemination, collective I/O, on-demand data access, high throughput, collaborative I/O, adaptive prefetching}
    }

  24. [24]Tudoran, R., Nicolae, B. and Brasche, G. 2016. Data Multiverse: The Uncertainty Challenge of Future Big Data Analytics. IKC’16: 2nd International KEYSTONE Conference (Cluj-Napoca, Romania, 2016), 17–22.
    Details
    Keywords: big data analytics, large scale data processing, data access model, data uncertainty, approximate computing
    Abstract: With the explosion of data sizes, extracting valuable insight out of big data becomes increasingly difficult. New challenges begin to emerge that complement traditional, long-standing challenges related to building scalable infrastructure and runtime systems that can deliver the desired level of performance and resource efficiency. This vision paper focuses on one such challenge, which we refer to as the analytics uncertainty: with so much data available from so many sources, it is difficult to anticipate what the data can be useful for, if at all. As a consequence, it is difficult to anticipate what data processing algorithms and methods are the most appropriate to extract value and insight. In this context, we contribute with a study on current big data analytics state-of-art, the use cases where the analytics uncertainty is emerging as a problem and future research directions to address them.
    BibTex
    @inproceedings{DBLP:conf/cost/TudoranNB16,
      title = {Data Multiverse: The Uncertainty Challenge of Future Big Data Analytics},
      year = {2016},
      author = {Tudoran, Radu and Nicolae, Bogdan and Brasche, Gotz},
      booktitle = {IKC'16: 2nd International KEYSTONE Conference},
      pages = {17-22},
      address = {Cluj-Napoca, Romania},
      doi = {10.1007/978-3-319-53640-8_2},
      url = {https://hal.archives-ouvertes.fr/hal-01480509v1/en},
      keywords = {big data analytics, large scale data processing, data access model, data uncertainty, approximate computing}
    }

  25. [25]Nicolae, B. 2015. Techniques to improve the scalability of collective checkpointing at large scale. HPCS’15: The 2015 International Conference on High Performance Computing and Simulation (Amsterdam, The Netherlands, 2015), 660–661.   
    Details
    Keywords: checkpointing, checkpoint restart, redundancy, scalability, data resilience, high performance computing, adaptive I/O, collective I/O, deduplication
    Abstract: Scientific and data-intensive computing have matured over the last couple of years in all fields of science and industry. Their rapid increase in complexity and scale has prompted ongoing efforts dedicated to reach exascale infrastructure capability by the end of the decade. However, advances in this context are not homogeneous: I/O capabilities in terms of networking and storage are lagging behind computational power and are often considered a major limitation that persists even at petascale [1]. A particularly difficult challenge in this context is the collective I/O access pattern (which we henceforth refer to as collective checkpointing), where all processes dump large amounts of related data simultaneously to persistent storage. This pattern is often exhibited by large-scale, bulk-synchronous applications in a variety of circumstances, e.g., when they use checkpoint-restart fault tolerance techniques to save intermediate computational states at regular time intervals [2] or when intermediate, globally synchronized results are needed during the lifetime of the computation (e.g. to understand how a simulation progresses during key phases). Under such circumstances, a decoupled storage system (e.g. a parallel file system such as GPFS [3] or a specialized storage system such as BlobSeer [4]) does not provide sufficient I/O bandwidth to handle the explosion of data sizes: for example, Jones et al. [5] predict dump times on the order of several hours. In order to overcome the I/O bandwidth limitation, one potential solution is to equip the compute nodes with local storage (i.e., HDDs, SSDs, NVMs, etc.) or use I/O forwarding nodes. Using this approach, a large part of the data can be dumped locally, which completely avoids the need to consume and compete for the I/O bandwidth of a decoupled storage system.
However, this is not without drawbacks: the local storage devices or I/O forwarding nodes are prone to failures and as such the data they hold is volatile. Thus, a popular approach in practice is to wait until the local dump has finished, then let the application continue while the checkpoints are in turn dumped to a parallel file system in the background. Such a straightforward solution can be effective at hiding the overhead incurred due to I/O bandwidth limitations, but this is not necessarily the case: it may happen that there is not enough time to fully flush everything to the parallel file system before the next collective checkpoint request is issued. In fact, this is a likely scenario with growing scale, as the failure rate increases, which introduces the need to checkpoint at smaller intervals in order to compensate for this effect. Furthermore, a smaller checkpoint interval also means local dumps are more frequent and as such their overhead becomes significant itself.
    BibTex
    @inproceedings{CollCkptHPCS15,
      title = {Techniques to improve the scalability of collective checkpointing at large scale},
      year = {2015},
      author = {Nicolae, Bogdan},
      booktitle = {HPCS’15: The 2015 International Conference on High Performance Computing and Simulation},
      pages = {660-661},
      address = {Amsterdam, The Netherlands},
      doi = {10.1109/HPCSim.2015.7237113},
      url = {http://ieeexplore.ieee.org/document/7237113/},
      keywords = {checkpointing, checkpoint restart, redundancy, scalability, data resilience, high performance computing, adaptive I/O, collective I/O, deduplication}
    }

  26. [26]Roman, R.-I., Nicolae, B., Costan, A. and Antoniu, G. 2015. Understanding Spark Performance in Hybrid and Multi-Site Clouds. BDAC-15: 6th International Workshop on Big Data Analytics: Challenges and Opportunities (Austin, USA, 2015). 
    Details
    Keywords: big data, Spark, hybrid cloud, network bottleneck
    Abstract: Recently, hybrid multi-site big data analytics (that combines on-premise with off-premise resources) has gained increasing popularity as a tool to process large amounts of data on-demand, without additional capital investment to increase the size of a single datacenter. However, making the most out of hybrid setups for big data analytics is challenging because on-premise resources can communicate with off-premise resources at significantly lower throughput and higher latency. Understanding the impact of this aspect is not trivial, especially in the context of modern big data analytics frameworks that introduce complex communication patterns and are optimized to overlap communication with computation in order to hide data transfer latencies. This paper contributes with a work-in-progress study that aims to identify and explain this impact in relationship to the known behavior on a single cloud. To this end, it analyses a representative big data workload on a hybrid Spark setup. Unlike previous experience that emphasized low end-impact of network communications in Spark, we found significant overhead in the shuffle phase when the bandwidth between the on-premise and off-premise resources is sufficiently small.
    BibTex
    @inproceedings{HybridMapRedBDAC15,
      title = {Understanding Spark Performance in Hybrid and Multi-Site Clouds},
      year = {2015},
      author = {Roman, Roxana-Ioana and Nicolae, Bogdan and Costan, Alexandru and Antoniu, Gabriel},
      booktitle = {BDAC-15: 6th International Workshop on Big Data Analytics: Challenges and Opportunities},
      address = {Austin, USA},
      url = {https://hal.inria.fr/hal-01239140/en/},
      keywords = {big data, Spark, hybrid cloud, network bottleneck}
    }

  27. [27]Nicolae, B., Riteau, P. and Keahey, K. 2015. Towards Transparent Throughput Elasticity for IaaS Cloud Storage: Exploring the Benefits of Adaptive Block-Level Caching. International Journal of Distributed Systems and Technologies. 6, 4 (2015), 21–44. DOI:https://doi.org/10.4018/IJDST.2015100102.   
    Details
    Keywords: IaaS, cloud computing, storage elasticity, adaptive I/O, virtual disk, block-level caching, performance prediction, cost prediction, modeling
    Abstract: Storage elasticity on IaaS clouds is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. This paper provides a transparent solution that automatically boosts I/O bandwidth during peaks for underlying virtual disks, effectively avoiding over-provisioning without performance loss. Our proposal relies on the idea of leveraging short-lived virtual disks of better performance characteristics (and thus more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored. We show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Furthermore, we introduce a performance and cost prediction methodology that can be used both independently to estimate in advance what trade-off between performance and cost is possible, as well as an optimization technique that enables better cache size selection to meet the desired performance level with minimal cost. We demonstrate the benefits of our proposal both for microbenchmarks and for two real-life applications using large-scale experiments.
    BibTex
    @article{ElasticBWIJDST15,
      title = {Towards Transparent Throughput Elasticity for IaaS Cloud Storage: Exploring the Benefits of Adaptive Block-Level Caching},
      year = {2015},
      author = {Nicolae, Bogdan and Riteau, Pierre and Keahey, Kate},
      journal = {International Journal of Distributed Systems and Technologies},
      volume = {6},
      number = {4},
      pages = {21-44},
      doi = {10.4018/IJDST.2015100102},
      url = {https://hal.inria.fr/hal-01199464/en/},
      keywords = {IaaS, cloud computing, storage elasticity, adaptive I/O, virtual disk, block-level caching, performance prediction, cost prediction, modeling}
    }

  28. [28]Clemente-Castello, F.J., Nicolae, B., Katrinis, K., Rafique, M.M., Mayo, R., Fernandez, J.C. and Loreti, D. 2015. Enabling Big Data Analytics in the Hybrid Cloud Using Iterative MapReduce. UCC’15: 8th IEEE/ACM International Conference on Utility and Cloud Computing (Limassol, Cyprus, 2015), 290–299.   
    Details
    Keywords: hybrid cloud, big data analytics, iterative, MapReduce, data locality, performance prediction
    Abstract: The cloud computing model has seen tremendous commercial success through its materialization via two prominent models to date, namely public and private cloud. Recently, a third model combining the former two service models as on-/off-premise resources has been receiving significant market traction: hybrid cloud. While state-of-the-art techniques exist that address workload performance prediction and efficient workload execution over hybrid cloud setups, support for data-intensive workloads - including Big Data Analytics - in such environments is still nascent. This paper addresses this gap by taking on the challenge of bursting over hybrid clouds for the benefit of accelerating iterative MapReduce applications. We first specify the challenges associated with data locality and data movement in such setups. Subsequently, we propose a novel technique to address the locality issue, without requiring changes to the MapReduce framework or the underlying storage layer. In addition, we contribute with a performance prediction methodology that combines modeling with micro-benchmarks to estimate completion time for iterative MapReduce applications, which enables users to estimate cost-to-solution before committing extra resources from public clouds. We show through experimentation in a dual-OpenStack hybrid cloud setup that our solutions manage to bring substantial improvement with predictable cost control for two real-life iterative MapReduce applications: large-scale machine learning and text analysis.
    BibTex
    @inproceedings{HybridMapRed15,
      title = {Enabling Big Data Analytics in the Hybrid Cloud Using Iterative MapReduce},
      year = {2015},
      author = {Clemente-Castello, Francisco J and Nicolae, Bogdan and Katrinis, Kostas and Rafique, M Mustafa and Mayo, Rafael and Fernandez, Juan Carlos and Loreti, Daniela},
      booktitle = {UCC'15: 8th IEEE/ACM International Conference on Utility and Cloud Computing},
      pages = {290-299},
      address = {Limassol, Cyprus},
      doi = {10.1109/UCC.2015.47},
      url = {https://hal.inria.fr/hal-01207186/en},
      keywords = {hybrid cloud, big data analytics, iterative, MapReduce, data locality, performance prediction}
    }

  29. [29]Kochut, A., Karve, A. and Nicolae, B. 2015. Towards Efficient On-demand VM Provisioning: Study of VM Runtime I/O Access Patterns to Shared Image Content. IM’15: 13th IFIP/IEEE International Symposium on Integrated Network Management (Ottawa, Canada, 2015), 321–329.   
    Details
    Keywords: cloud computing, IaaS, content similarity, deduplication, correlations, I/O access pattern, virtual disk
    Abstract: IaaS clouds are becoming a standard way of providing elastic compute capacity at an affordable cost. To achieve that, the VM provisioning system has to optimally allocate I/O and compute resources. One of the significant optimization opportunities is to leverage content similarity across VM images. While many studies have been devoted to de-duplication of VM images, this paper is, to the best of our knowledge, the first to comprehensively study the relationship between the VM image similarity structure and the runtime I/O access patterns. Our study focuses on block-level similarity and I/O access patterns, revealing correlations between common content of different images and application-level access semantics. Furthermore, it also zooms in on several runtime I/O access pattern aspects, such as the similarity of the sequences in which common content is accessed. The results show a strong tendency for access pattern locality within common content clusters across VM images, regardless of the rest of the composition. Furthermore, they reveal a strong tendency for read accesses to refer to the same subsets of blocks within the common content clusters, while preserving the same ordering. These results provide important insights that can be used to optimize on-demand VM image content delivery under concurrency.
    BibTex
    @inproceedings{VMRuntimeIO-IM15,
      title = {Towards Efficient On-demand VM Provisioning: Study of VM Runtime I/O Access Patterns to Shared Image Content},
      year = {2015},
      author = {Kochut, Andrzej and Karve, Alexei and Nicolae, Bogdan},
      booktitle = {IM'15: 13th IFIP/IEEE International Symposium on Integrated Network Management},
      pages = {321-329},
      address = {Ottawa, Canada},
      doi = {10.1109/INM.2015.7140307},
      url = {https://hal.inria.fr/hal-01138689/en},
      keywords = {cloud computing, IaaS, content similarity, deduplication, correlations, I/O access pattern, virtual disk}
    }

  30. [30]Nicolae, B. 2015. Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead. IPDPS ’15: 29th IEEE International Parallel and Distributed Processing Symposium (Hyderabad, India, 2015), 1023–1032.   
    Details
    Keywords: scalable I/O, checkpoint restart, checkpointing, data replication, deduplication, collective I/O, redundancy, data resilience, high availability
    Abstract: Dumping large amounts of related data simultaneously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable resilience and high availability. However, replication introduces overhead, both in terms of network traffic necessary to distribute replicas, as well as extra storage space requirements. To reduce this overhead, state-of-the-art techniques often apply redundancy elimination (e.g. compression or deduplication) before replication, ignoring the natural redundancy that is already present. By contrast, this paper proposes a novel scheme that treats redundancy elimination and replication as a single co-optimized phase: remotely duplicated data is detected and directly leveraged to maintain a desired replication factor by keeping only as many replicas as needed and adding more if necessary. In this context, we introduce a series of high performance algorithms specifically designed to operate under tight and controllable constraints at large scale. We present how this idea can be leveraged in practice and demonstrate its viability for two real-life HPC applications.
    BibTex
    @inproceedings{DedupRepIPDPS15,
      title = {Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead},
      year = {2015},
      author = {Nicolae, Bogdan},
      booktitle = {IPDPS '15: 29th IEEE International Parallel and Distributed Processing Symposium},
      pages = {1023-1032},
      address = {Hyderabad, India},
      doi = {10.1109/IPDPS.2015.82},
      url = {https://hal.inria.fr/hal-01115700/en},
      keywords = {scalable I/O, checkpoint restart, checkpointing, data replication, deduplication, collective I/O, redundancy, data resilience, high availability}
    }

  31. [31]Nicolae, B., Karve, A. and Kochut, A. 2015. Discovering and Leveraging Content Similarity to Optimize Collective On-Demand Data Access to IaaS Cloud Storage. CCGrid’15: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Shenzhen, China, 2015), 211–220.   
    Details
    Keywords: collective I/O, content similarity, deduplication, cloud storage, on-demand data access
    Abstract: A critical feature of IaaS cloud computing is the ability to quickly disseminate the content of a shared dataset at large scale. In this context, a common pattern is collective on-demand read, i.e., accessing the same VM image or dataset from a large number of VM instances concurrently. There are various techniques that avoid I/O contention to the storage service where the dataset is located without relying on pre-broadcast. Most such techniques employ peer-to-peer collaborative behavior where the VM instances exchange information about the content that was accessed during runtime, such that it becomes possible to fetch the missing data pieces directly from each other rather than from the storage system. However, such techniques are often limited within a group that performs a collective read. In light of high data redundancy on large IaaS data centers and multiple users that simultaneously run VM instance groups that perform collective reads, an important opportunity arises: enabling unrelated VM instances belonging to different groups to collaborate and exchange common data in order to further reduce the I/O pressure on the storage system. This paper deals with the challenges posed by such a solution, which prompt the need for novel techniques to efficiently detect and leverage common data pieces across groups. To this end, we introduce a low-overhead fingerprint-based approach that we evaluate and demonstrate to be efficient in practice for a representative scenario on dozens of nodes and a variety of group configurations.
    BibTex
    @inproceedings{CollDedupCCGrid15,
      title = {Discovering and Leveraging Content Similarity to Optimize Collective On-Demand Data Access to IaaS Cloud Storage},
      year = {2015},
      author = {Nicolae, Bogdan and Karve, Alexei and Kochut, Andrzej},
      booktitle = {CCGrid'15: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing},
      pages = {211-220},
      address = {Shenzhen, China},
      doi = {10.1109/CCGrid.2015.156},
      url = {https://hal.inria.fr/hal-01138684/en},
      keywords = {collective I/O, content similarity, deduplication, cloud storage, on-demand data access}
    }

  32. [32]Nicolae, B., Riteau, P. and Keahey, K. 2014. Bursting the Cloud Data Bubble: Towards Transparent Storage Elasticity in IaaS Clouds. IPDPS ’14: Proc. 28th IEEE International Parallel and Distributed Processing Symposium (Phoenix, USA, 2014), 135–144.   
    Details
    Keywords: adaptive I/O, cloud computing, elastic storage, utilization prediction
    Abstract: Storage elasticity on IaaS clouds is an important feature for data-intensive workloads: storage requirements can vary greatly during application runtime, making worst-case over-provisioning a poor choice that leads to unnecessarily tied-up storage and extra costs for the user. While the ability to adapt dynamically to storage requirements is thus attractive, how to implement it is not well understood. Current approaches simply rely on users to attach and detach virtual disks to the virtual machine (VM) instances and then manage them manually, thus greatly increasing application complexity while reducing cost efficiency. Unlike such approaches, this paper aims to provide a transparent solution that presents a unified storage space to the VM in the form of a regular POSIX file system that hides the details of attaching and detaching virtual disks by handling those actions transparently based on dynamic application requirements. The main difficulty in this context is to understand the intent of the application and regulate the available storage in order to avoid running out of space while minimizing the performance overhead of doing so. To this end, we propose a storage space prediction scheme that analyzes multiple system parameters and dynamically adapts monitoring based on the intensity of the I/O in order to get as close as possible to the real usage. We show the value of our proposal over static worst-case over-provisioning and simpler elastic schemes that rely on a reactive model to attach and detach virtual disks, using both synthetic benchmarks and real-life data-intensive applications. Our experiments demonstrate that we can reduce storage waste/cost by 30-40% with only 2-5% performance overhead.
    BibTex
    @inproceedings{ElasticStoreIPDPS14,
      title = {Bursting the Cloud Data Bubble: Towards Transparent Storage Elasticity in IaaS Clouds},
      year = {2014},
      author = {Nicolae, Bogdan and Riteau, Pierre and Keahey, Kate},
      booktitle = {IPDPS '14: Proc. 28th IEEE International Parallel and Distributed Processing Symposium},
      pages = {135-144},
      address = {Phoenix, USA},
      doi = {10.1109/IPDPS.2014.25},
      url = {https://hal.inria.fr/hal-00947599/en},
      keywords = {adaptive I/O, cloud computing, elastic storage, utilization prediction}
    }

  33. [33]Nicolae, B., Riteau, P. and Keahey, K. 2014. Transparent Throughput Elasticity for IaaS Cloud Storage Using Guest-Side Block-Level Caching. UCC’14: 7th IEEE/ACM International Conference on Utility and Cloud Computing (London, UK, 2014), 186–195.   
    Details
    Keywords: adaptive I/O, block-level caching, cloud computing, elastic storage, virtual disk, utilization prediction
    Abstract: Storage elasticity on IaaS clouds is a crucial feature in the age of data-intensive computing. However, the traditional provisioning model of leveraging virtual disks of fixed capacity and performance characteristics has limited ability to match the increasingly dynamic nature of I/O application requirements. This mismatch is particularly problematic in the context of scientific applications that interleave periods of I/O inactivity with I/O intensive bursts. In this context, overprovisioning for best performance during peaks leads to significant extra costs because of unnecessarily tied-up resources, while any other trade-off leads to performance loss. This paper provides a transparent solution that automatically boosts I/O bandwidth during peaks for underlying virtual disks, effectively avoiding overprovisioning without performance loss. Our proposal relies on the idea of leveraging short-lived virtual disks of better performance characteristics (and thus more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored. We show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. We demonstrate the benefits of our proposal both for microbenchmarks and for two real-life applications using large-scale experiments.
    BibTex
    @inproceedings{ElasticBW-UCC14,
      title = {Transparent Throughput Elasticity for IaaS Cloud Storage Using Guest-Side Block-Level Caching},
      year = {2014},
      author = {Nicolae, Bogdan and Riteau, Pierre and Keahey, Kate},
      booktitle = {UCC'14: 7th IEEE/ACM International Conference on Utility and Cloud Computing},
      pages = {186-195},
      address = {London, UK},
      doi = {10.1109/UCC.2014.27},
      url = {https://hal.inria.fr/hal-01070227/en},
      keywords = {adaptive I/O, block-level caching, cloud computing, elastic storage, virtual disk, utilization prediction}
    }

  34. [34]Petcu, D., Gonzalez-Velez, H., Nicolae, B., Garcia-Gomez, J.M., Fuster-Garcia, E. and Sheridan, C. 2014. Next Generation HPC Clouds: A View for Large-Scale Scientific and Data-Intensive Applications. DIHC’14: The 2nd Workshop on Dependability and Interoperability in Heterogeneous Clouds (Porto, Portugal, 2014), 26–37.   
    Details
    Keywords: cloud storage, data analytics, heterogeneous clouds, high performance computing
    Abstract: In spite of the rapid growth of Infrastructure-as-a-Service offers, support to run data-intensive and scientific applications at large scale is still limited. On the user side, existing features and programming models are insufficiently developed to express an application in such a way that it can benefit from an elastic infrastructure that dynamically adapts to the requirements, which often leads to unnecessary over-provisioning and extra costs. On the provider side, key performance and scalability issues arise when having to deal with large groups of tightly coupled virtualized resources needed by such applications, which is especially challenging considering the multi-tenant dimension, where sharing of physical resources introduces interference both inside and across large virtual machine deployments. This paper contributes with a holistic vision that imagines a tight integration between programming models, runtime middlewares and the virtualization infrastructure in order to provide a framework that transparently handles allocation and utilization of heterogeneous resources while dealing with performance and elasticity issues.
    BibTex
    @inproceedings{CHAMP-DIHC14,
      title = {Next Generation HPC Clouds: A View for Large-Scale Scientific and Data-Intensive Applications},
      year = {2014},
      author = {Petcu, Dana and Gonzalez-Velez, Horacio and Nicolae, Bogdan and Garcia-Gomez, Juan Miguel and Fuster-Garcia, Elies and Sheridan, Craig},
      booktitle = {DIHC'14: The 2nd Workshop on Dependability and Interoperability in Heterogeneous Clouds},
      pages = {26-37},
      address = {Porto, Portugal},
      doi = {10.1007/978-3-319-14313-2_3},
      url = {http://link.springer.com/chapter/10.1007%2F978-3-319-14313-2_3},
      keywords = {cloud storage, data analytics, heterogeneous clouds, high performance computing}
    }

  35. [35]Ene, S., Nicolae, B., Costan, A. and Antoniu, G. 2014. To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload. DataCloud ’14: The 5th International Workshop on Data-Intensive Computing in the Clouds (New Orleans, USA, 2014), 9–16.   
    Details
    Keywords: big data, data management, incremental processing, MapReduce
    Abstract: Research on cloud-based Big Data analytics has focused so far on optimizing the performance and cost-effectiveness of the computations, while largely neglecting an important aspect: users need to upload massive datasets on clouds for their computations. This paper studies the problem of running MapReduce applications when considering the simultaneous optimization of performance and cost of both the data upload and its corresponding computation taken together. We analyze the feasibility of incremental MapReduce approaches to advance the computation as much as possible during the data upload by using already transferred data to calculate intermediate results. Our key finding shows that overlapping the transfer time with as many incremental computations as possible is not always efficient: a better solution is to wait for enough data to fill the computational capacity of the MapReduce cluster. Results show significant performance and cost reduction compared with state-of-the-art solutions that leverage incremental computations in a naive fashion.
    BibTex
    @inproceedings{OverlapMapRed14,
      title = {To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload},
      year = {2014},
      author = {Ene, Stefan and Nicolae, Bogdan and Costan, Alexandru and Antoniu, Gabriel},
      booktitle = {DataCloud '14: The 5th International Workshop on Data-Intensive Computing in the Clouds},
      pages = {9-16},
      address = {New Orleans, USA},
      doi = {10.1109/DataCloud.2014.7},
      url = {https://hal.inria.fr/hal-01094609/en},
      keywords = {big data, data management, incremental processing, MapReduce}
    }

  36. [36]Nicolae, B., Lemarinier, P. and Meneghin, M. 2014. Leveraging Naturally Distributed Data Redundancy to Optimize Collective Replication. SC ’14: 27th International Conference for High Performance Computing, Networking, Storage and Analysis (New Orleans, USA, 2014). 
    Details
    Keywords: high performance computing, data resilience, high availability, replication, deduplication, collective I/O, redundancy, fault tolerance
    Abstract: Dumping large amounts of related data simultaneously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable resilience and high availability. However, replication introduces overhead, both in terms of network traffic necessary to distribute replicas, as well as extra storage space requirements. To reduce this overhead, state-of-the-art techniques often apply redundancy elimination (e.g. compression or de-duplication) before replication, ignoring the natural redundancy that is already present. By contrast, this paper proposes a novel scheme that treats redundancy elimination and replication as a single co-optimized phase: remotely duplicated data is detected and directly leveraged to maintain a desired replication factor by keeping only as many replicas as needed and adding more if necessary. In this context, we introduce a series of high performance algorithms specifically designed to operate under tight and controllable constraints at large scale. We present how this idea can be leveraged in practice and demonstrate its viability for two real-life HPC applications.
    BibTex
    @inproceedings{CollDedupRep-SC14,
      title = {Leveraging Naturally Distributed Data Redundancy to Optimize Collective Replication},
      year = {2014},
      author = {Nicolae, Bogdan and Lemarinier, Pierre and Meneghin, Massimiliano},
      booktitle = {SC '14: 27th International Conference for High Performance Computing, Networking, Storage and Analysis},
      address = {New Orleans, USA},
      url = {http://sc14.supercomputing.org/sites/all/themes/sc14/files/archive/tech_poster/poster_files/post286s2-file3.pdf},
      keywords = {high performance computing, data resilience, high availability, replication, deduplication, collective I/O, redundancy, fault tolerance}
    }

  37. [37]Nicolae, B. 2013. Understanding Vertical Scalability of I/O Virtualization for MapReduce Workloads: Challenges and Opportunities. BigDataCloud ’13: 2nd Workshop on Big Data Management in Clouds (held in conjunction with EuroPar’13) (Aachen, Germany, 2013).   
    Details
    Keywords: I/O virtualization, big data, vertical I/O scalability, IaaS, cloud computing
    Abstract: As the explosion of data sizes continues to push the limits of our abilities to efficiently store and process big data, next generation big data systems face multiple challenges. One such important challenge relates to the limited scalability of I/O, a determining factor in the overall performance of big data applications. Although paradigms like MapReduce have long been used to take advantage of local disks and avoid data movements over the network as much as possible, with increasing core count per node, local storage comes under increasing I/O pressure itself and prompts the need to equip nodes with multiple disks. However, given the rising need to virtualize large datacenters in order to provide a more flexible allocation and consolidation of physical resources (transforming them into public or private/hybrid clouds), the following questions arise: is it possible to take advantage of multiple local disks at virtual machine (VM) level in order to speed up big data analytics? If so, what are the best practices to achieve a high virtualized aggregated I/O throughput? This paper aims to answer these questions in the context of I/O intensive MapReduce workloads: it analyzes and characterizes their behavior under different virtualization scenarios in order to propose best practices for current approaches and speculate on future areas of improvement.
    BibTex
    @inproceedings{VIOBigDataCloud13,
      title = {Understanding Vertical Scalability of I/O Virtualization for MapReduce Workloads: Challenges and Opportunities},
      year = {2013},
      author = {Nicolae, Bogdan},
      booktitle = {BigDataCloud '13: 2nd Workshop on Big Data Management in Clouds (held in conjunction with EuroPar'13)},
      address = {Aachen, Germany},
      doi = {10.1007/978-3-642-54420-0_1},
      url = {https://hal.inria.fr/hal-00856877/en},
      keywords = {I/O virtualization, big data, vertical I/O scalability, IaaS, cloud computing}
    }

  38. [38]Antoniu, G. et al. 2013. Scalable Data Management for Map-Reduce-based Data-Intensive Applications: A View for Cloud and Hybrid Infrastructures. International Journal of Cloud Computing. 2, (2013), 150–170. DOI:https://doi.org/10.1504/IJCC.2013.055265.   
    Details
    Keywords: MapReduce, cloud computing, desktop grids, hybrid infrastructures, bioinformatics, task scheduling, fault tolerance, scalable data management, data-intensive, scalable storage, massive data, concurrency control, volatility.
    Abstract: As map-reduce emerges as a leading programming paradigm for data-intensive computing, today’s frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper, we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing map-reduce frameworks, in order to achieve scalable, concurrency-optimised, fault-tolerant map-reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
    BibTex
    @article{MapRedIJCC13,
      title = {Scalable Data Management for Map-Reduce-based Data-Intensive Applications: A View for Cloud and Hybrid Infrastructures},
      year = {2013},
      author = {Antoniu, Gabriel and Bigot, Julien and Blanchet, Cristophe and Bouge, Luc and Briant, Francois and Cappello, Franck and Costan, Alexandru and Desprez, Frederic and Fedak, Gilles and Gault, Sylvain and Keahey, Kate and Nicolae, Bogdan and Perez, Christian and Simonet, Anthony and Suter, Frederic and Tang, Bing and Terreux, Raphael},
      journal = {International Journal of Cloud Computing},
      volume = {2},
      pages = {150-170},
      doi = {10.1504/IJCC.2013.055265},
      url = {https://hal.inria.fr/hal-00684866/en},
      keywords = {MapReduce, cloud computing, desktop grids, hybrid infrastructures, bioinformatics, task scheduling, fault tolerance, scalable data management, data-intensive, scalable storage, massive data, concurrency control, volatility.}
    }

  39. [39]Nicolae, B. 2013. Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal. IPDPS ’13: The 27th IEEE International Parallel and Distributed Processing Symposium (Boston, USA, 2013), 19–28.   
    Details
    Keywords: I/O load balancing, checkpoint restart, deduplication, fault tolerance, high performance computing, checkpointing
    Abstract: With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes with a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless of whether they are on the same or a different node). We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such a way that it minimizes I/O bottlenecks. Large scale experiments show significant reduction of storage space consumption and performance overhead compared to several state-of-the-art approaches, both in synthetic benchmarks and for a real-life high performance computing application.
    BibTex
    @inproceedings{AICkptIPDPS13,
      title = {Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal},
      year = {2013},
      author = {Nicolae, Bogdan},
      booktitle = {IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium},
      pages = {19-28},
      address = {Boston, USA},
      doi = {10.1109/IPDPS.2013.14},
      url = {https://hal.inria.fr/hal-00781532/en},
      keywords = {I/O load balancing, checkpoint restart, deduplication, fault tolerance, high performance computing, checkpointing}
    }

  40. [40]Nicolae, B. and Cappello, F. 2013. BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds. Journal of Parallel and Distributed Computing. 73, 5 (2013), 698–711. DOI:https://doi.org/10.1016/j.jpdc.2013.01.013.   
    Details
    Keywords: checkpoint restart, high performance computing, IaaS, cloud computing, snapshotting, fault tolerance, file system rollback, virtual disk
    Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
    BibTex
    @article{BlobCRJPDC13,
      title = {BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds},
      year = {2013},
      author = {Nicolae, Bogdan and Cappello, Franck},
      journal = {Journal of Parallel and Distributed Computing},
      volume = {73},
      number = {5},
      pages = {698-711},
      doi = {10.1016/j.jpdc.2013.01.013},
      url = {https://hal.inria.fr/hal-00857964/en},
      keywords = {checkpoint restart, high performance computing, IaaS, cloud computing, snapshotting, fault tolerance, file system rollback, virtual disk},
      issn = {07437315}
    }

  41. [41]Nicolae, B. and Cappello, F. 2013. AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing. HPDC ’13: 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing (New York, USA, 2013), 155–166.   
    Details
    Keywords: scientific computing, high performance computing, cloud computing, fault tolerance, checkpoint restart, checkpointing, adaptive I/O
    Abstract: With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, thus making a restart from scratch unfeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead on the application due to checkpointing, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we start from the assumption that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed to stable storage. Large scale experiments show up to 60% improvement when compared to state-of-the-art checkpointing approaches, all achievable with an extra memory requirement of less than 5% of the total application memory.
    BibTex
    @inproceedings{AICkptHPDC13,
      title = {AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing},
      year = {2013},
      author = {Nicolae, Bogdan and Cappello, Franck},
      booktitle = {HPDC ’13: 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing},
      pages = {155-166},
      address = {New York, USA},
      doi = {10.1145/2462902.2462918},
      url = {https://hal.archives-ouvertes.fr/hal-00809847/en},
      keywords = {scientific computing, high performance computing, cloud computing, fault tolerance, checkpoint restart, checkpointing, adaptive I/O}
    }

  42. [42]Nicolae, B. and Rafique, M. 2013. Leveraging Collaborative Content Exchange for On-Demand VM Multi-Deployments in IaaS Clouds. Euro-Par ’13: 19th International Euro-Par Conference on Parallel Processing (Aachen, Germany, 2013).   
    Details
    Keywords: IaaS, cloud computing, multi-deployment, VM provisioning, collaborative content exchange
    Abstract: A critical feature of IaaS cloud computing is the ability to deploy, boot and terminate large groups of interdependent VMs very quickly, which enables users to efficiently exploit the on-demand nature and elasticity of clouds even for large-scale deployments. A common pattern in this context is multi-deployment, i.e., using the same VM image template to instantiate a large number of VMs in parallel. A difficult trade-off arises in this context: access the content of the template on demand but slowly due to I/O bottlenecks, or pre-broadcast the full contents of the template on the local storage of the hosting nodes to avoid such bottlenecks. Unlike previous approaches that are biased towards either of the extremes, we propose a scheme that augments on-demand access through a collaborative scheme in which the VMs aim to leverage the similarity of access patterns in order to anticipate future accesses and exchange chunks between themselves in an attempt to reduce contention to the remote storage where the VM image template is stored. Large scale experiments show improvements in read throughput between 30%-40% compared to on-demand access schemes that perform in isolation.
    BibTex
    @inproceedings{AStoreVD-EUROPAR13,
      title = {Leveraging Collaborative Content Exchange for On-Demand VM Multi-Deployments in IaaS Clouds},
      year = {2013},
      author = {Nicolae, Bogdan and Rafique, Mustafa},
      booktitle = {Euro-Par ’13: 19th International Euro-Par Conference on Parallel Processing},
      address = {Aachen, Germany},
      doi = {10.1007/978-3-642-40047-6_32},
      url = {https://hal.inria.fr/hal-00835432/en},
      keywords = {IaaS, cloud computing, multi-deployment, VM provisioning, collaborative content exchange}
    }

  43. [43]Antoniu, G. et al. 2012. Towards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures. ICACON ’12: 1st International IBM Cloud Academy Conference (Research Triangle Park, USA, 2012). 
    Details
    Keywords: MapReduce, cloud computing, data-intensive computing, hybrid infrastructures, BlobSeer, BitDew, Nimbus, HLCM, Grid’5000
    Abstract: As Map-Reduce emerges as a leading programming paradigm for data-intensive computing, today’s frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing Map-Reduce frameworks, in order to achieve scalable, concurrency-optimized, fault-tolerant Map-Reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
    BibTex
    @inproceedings{ANRMapRed-ICACON12,
      title = {Towards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures},
      year = {2012},
      author = {Antoniu, Gabriel and Bigot, Julien and Blanchet, Christophe and Bouge, Luc and Briant, Francois and Cappello, Franck and Costan, Alexandru and Desprez, Frederic and Fedak, Gilles and Gault, Sylvain and Keahey, Kate and Nicolae, Bogdan and Perez, Christian and Simonet, Anthony and Suter, Frederic and Tang, Bing and Terreux, Raphael},
      booktitle = {ICACON ’12: 1st International IBM Cloud Academy Conference},
      address = {Research Triangle Park, USA},
      url = {http://hal.inria.fr/hal-00684866/en},
      keywords = {MapReduce, cloud computing, data-intensive computing, hybrid infrastructures, BlobSeer, BitDew, Nimbus, HLCM, Grid'5000}
    }

  44. [44]Gomez, L.B., Nicolae, B., Maruyama, N., Cappello, F. and Matsuoka, S. 2012. Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds. Euro-Par ’12: 18th International Euro-Par Conference on Parallel Processing (Rhodes, Greece, 2012).   
    Details
    Keywords: Cloud computing, IaaS, storage systems, virtual disk, erasure codes, Reed Solomon
    Abstract: With increasing interest among mainstream users to run HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to the acquisition and maintenance of expensive hardware, often out of the financial capabilities of such users. Also, one of the critical needs of HPC applications is an efficient, scalable and persistent storage. Unfortunately, storage options proposed by cloud providers are not standardized and typically use a different access model. In this context, the local disks on the compute nodes can be used to save large data sets such as the data generated by Checkpoint-Restart (CR). This local storage offers high throughput and scalability but it needs to be combined with persistency techniques, such as block replication or erasure codes. One of the main challenges that such techniques face is to minimize the overhead of performance and I/O resource utilization (i.e., storage space and bandwidth), while at the same time guaranteeing high reliability of the saved data. This paper introduces a novel persistency technique that leverages Reed-Solomon (RS) encoding to save data in a reliable fashion. Compared to traditional approaches that rely on block replication, we demonstrate about 50% higher throughput while reducing network bandwidth and storage utilization by a factor of 2 for the same targeted reliability level. This is achieved both by modeling and real life experimentation on hundreds of nodes.
    BibTex
    @inproceedings{RSEnc-EUROPAR12,
      title = {Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds},
      year = {2012},
      author = {Gomez, Leonardo Bautista and Nicolae, Bogdan and Maruyama, Naoya and Cappello, Franck and Matsuoka, Satoshi},
      booktitle = {Euro-Par ’12: 18th International Euro-Par Conference on Parallel Processing},
      address = {Rhodes, Greece},
      doi = {10.1007/978-3-642-32820-6_32},
      url = {http://hal.inria.fr/hal-00703119/en},
      keywords = {Cloud computing, IaaS, storage systems, virtual disk, erasure codes, Reed Solomon}
    }

  45. [45]Tran, V.-T., Nicolae, B. and Antoniu, G. 2012. Towards scalable array-oriented active storage: the Pyramid approach. SIGOPS Oper. Syst. Rev. 46, 1 (2012), 19–25. DOI:https://doi.org/10.1145/2146382.2146387.   
    Details
    Keywords: large scale data management, multi-dimensional I/O, concurrency control, parallel array processing, versioning
    Abstract: The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges. Experimental evaluation demonstrates substantial scalability improvements brought by Pyramid with respect to state-of-the-art approaches both in weak and strong scaling scenarios, with gains of 100% to 150%.
    BibTex
    @article{Pyramid-OSR12,
      title = {Towards scalable array-oriented active storage: the Pyramid approach},
      year = {2012},
      author = {Tran, Viet-Trung and Nicolae, Bogdan and Antoniu, Gabriel},
      journal = {SIGOPS Oper. Syst. Rev.},
      volume = {46},
      number = {1},
      pages = {19-25},
      doi = {10.1145/2146382.2146387},
      url = {https://hal.inria.fr/hal-00640900/en},
      keywords = {large scale data management, multi-dimensional I/O, concurrency control, parallel array processing, versioning},
      publisher = {ACM},
      address = {New York, NY, USA},
      issn = {0163-5980}
    }

  46. [46]Nicolae, B. and Cappello, F. 2012. A hybrid local storage transfer scheme for live migration of I/O intensive workloads. HPDC ’12: 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (Delft, The Netherlands, 2012), 85–96.
    Details
    Keywords: virtualization, live migration, block migration, local storage transfer, I/O intensive workloads, IaaS, cloud computing, data intensive applications
    Abstract: Live migration of virtual machines (VMs) is a key feature of virtualization that is extensively leveraged in IaaS cloud environments: it is the basic building block of several important features, such as load balancing, pro-active fault tolerance, power management, online maintenance, etc. While most live migration efforts concentrate on how to transfer the memory from source to destination during the migration process, comparatively little attention has been devoted to the transfer of storage. This problem is gaining increasing importance: due to performance reasons, virtual machines that run large-scale, data-intensive applications tend to rely on local storage, which poses a difficult challenge on live migration: it needs to handle storage transfer in addition to memory transfer. This paper proposes a memory-migration independent approach that addresses this challenge. It relies on a hybrid active push / prioritized prefetch strategy, which makes it highly resilient to rapid changes of disk state exhibited by I/O intensive workloads. At the same time, it is minimally intrusive in order to ensure a maximum of portability with a wide range of hypervisors. Large scale experiments that involve multiple simultaneous migrations of both synthetic benchmarks and a real scientific application show improvements of up to 10x faster migration time, 10x less bandwidth consumption and 8x less performance degradation over state-of-the-art.
    BibTex
    @inproceedings{LiveMigrHPDC12,
      title = {A hybrid local storage transfer scheme for live migration of I/O intensive workloads},
      year = {2012},
      author = {Nicolae, Bogdan and Cappello, Franck},
      booktitle = {HPDC ’12: 21st International ACM Symposium on High-Performance Parallel and Distributed Computing},
      pages = {85-96},
      address = {Delft, The Netherlands},
      doi = {10.1145/2287076.2287088},
      url = {https://hal.inria.fr/hal-00686654/en/},
      keywords = {virtualization, live migration, block migration, local storage transfer, I/O intensive workloads, IaaS, cloud computing, data intensive applications}
    }

  47. [47]Nicolae, B. 2011. On the Benefits of Transparent Compression for Cost-Effective Cloud Data Storage. Transactions on Large-Scale Data- and Knowledge-Centered Systems. 3, 3 (2011), 167–184. DOI:https://doi.org/10.1007/978-3-642-23074-5.   
    Details
    Keywords: IaaS, cloud computing, scalable storage, high throughput, compression
    Abstract: Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring computational resources: it allows users to deploy virtual machines (VMs) at large scale and pay only for the resources that were actually used throughout the runtime of the VMs. This new model raises new challenges in the design and development of IaaS middleware: excessive storage costs associated with both user data and VM images might make the cloud less attractive, especially for users that need to manipulate huge data sets and a large number of VM images. Storage costs result not only from storage space utilization, but also from bandwidth consumption: in typical deployments, a large number of data transfers between the VMs and the persistent storage are performed, all under high performance requirements. This paper evaluates the trade-off resulting from transparently applying data compression to conserve storage space and bandwidth at the cost of slight computational overhead. We aim at reducing the storage space and bandwidth needs with minimal impact on data access performance. Our solution builds on BlobSeer, a distributed data management service specifically designed to sustain a high throughput for concurrent accesses to huge data sequences that are distributed at large scale. Extensive experiments demonstrate that our approach achieves large reductions (at least 40%) of bandwidth and storage space utilization, while still attaining high performance levels that even surpass the original (no compression) performance levels in several data-intensive scenarios.
    BibTex
    @article{Compression11,
      title = {On the Benefits of Transparent Compression for Cost-Effective Cloud Data Storage},
      year = {2011},
      author = {Nicolae, Bogdan},
      journal = {Transactions on Large-Scale Data- and Knowledge-Centered Systems},
      volume = {3},
      number = {3},
      pages = {167-184},
      doi = {10.1007/978-3-642-23074-5},
      url = {https://hal.inria.fr/inria-00613583},
      keywords = {IaaS, cloud computing, scalable storage, high throughput, compression},
      publisher = {Springer Berlin / Heidelberg}
    }

  48. [48]Nicolae, B., Antoniu, G., Bouge, L., Moise, D. and Carpen-Amarie, A. 2011. BlobSeer: Next-generation data management for large scale infrastructures. J. Parallel Distrib. Comput. 71, 2 (2011), 169–184. DOI:https://doi.org/10.1016/j.jpdc.2010.08.004.   
    Details
    Keywords: scalable storage, data management, high throughput, versioning, decentralized metadata, concurrency control, data model, BlobSeer
    Abstract: As data volumes increase at a high speed in more and more application fields of science, engineering, information services, etc., the challenges posed by data-intensive computing gain an increasing importance. The emergence of highly scalable infrastructures, e.g. for cloud computing and for petascale computing and beyond, introduces additional issues for which scalable data management becomes an immediate need. This paper brings several contributions. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, we highlight the potentially large benefits of using versioning in this context. Second, based on these principles, we propose a set of versioning algorithms, both for data and metadata, that enable a high throughput under concurrency. Finally, we implement and evaluate these algorithms in the BlobSeer prototype, which we integrate as a storage backend in the Hadoop MapReduce framework. We perform extensive microbenchmarks as well as experiments with real MapReduce applications: they demonstrate that applying the principles defended in our approach brings substantial benefits to data intensive applications.
    BibTex
    @article{BlobSeerJPDC11,
      title = {BlobSeer: Next-generation data management for large scale infrastructures},
      year = {2011},
      author = {Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc and Moise, Diana and Carpen-Amarie, Alexandra},
      journal = {J. Parallel Distrib. Comput.},
      volume = {71},
      number = {2},
      pages = {169-184},
      doi = {10.1016/j.jpdc.2010.08.004},
      url = {https://hal.inria.fr/inria-00511414/en/},
      keywords = {scalable storage, data management, high throughput, versioning, decentralized metadata, concurrency control, data model, BlobSeer},
      publisher = {Academic Press, Inc.},
      address = {Orlando, FL, USA},
      issn = {0743-7315}
    }

  49. [49]Nicolae, B., Bresnahan, J., Keahey, K. and Antoniu, G. 2011. Going Back and Forth: Efficient Multideployment and Multisnapshotting on Clouds. HPDC ’11: 20th International ACM Symposium on High-Performance Parallel and Distributed Computing (San José, USA, 2011), 147–158.   
    Details
    Keywords: Nimbus, Grid’5000, cloud computing, BlobSeer, VM storage, IaaS, multi-snapshotting, multi-deployment, large scale provisioning
    Abstract: Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring resources by introducing a simple change: allowing users to lease computational resources from the cloud provider’s datacenter for a short time by deploying virtual machines (VMs) on those resources. This new model raises new challenges in the design and development of IaaS middleware. One of those challenges is the need to deploy a large number (hundreds or even thousands) of VM instances simultaneously. Once the VM instances are deployed, another challenge is to simultaneously take a snapshot of many images and transfer them to persistent storage to support management tasks, such as suspend-resume and migration. With datacenters growing at a fast rate and configurations becoming heterogeneous, it is important to enable efficient concurrent deployment and snapshotting that is at the same time hypervisor independent and ensures a maximum of compatibility with different configurations. This paper addresses these challenges by proposing a virtual file system specifically optimized for virtual machine image storage. It is based on a lazy transfer scheme coupled with object-versioning that demonstrates excellent performance in terms of consumed resources: execution time, network traffic and storage space. Experiments on hundreds of nodes demonstrate performance improvements in concurrent VM deployments ranging from a factor of 2 up to 25 over state-of-art, with storage and bandwidth utilization reduction of as much as 90%, while at the same time keeping comparable snapshotting performance, which comes with the added benefit of high portability.
    BibTex
    @inproceedings{BlobSeerHPDC11,
      title = {Going Back and Forth: Efficient Multideployment and Multisnapshotting on Clouds},
      year = {2011},
      author = {Nicolae, Bogdan and Bresnahan, John and Keahey, Kate and Antoniu, Gabriel},
      booktitle = {HPDC ’11: 20th International ACM Symposium on High-Performance Parallel and Distributed Computing},
      pages = {147-158},
      address = {San José, USA},
      doi = {10.1145/1996130.1996152},
      url = {https://hal.inria.fr/inria-00570682/en},
      keywords = {Nimbus, Grid'5000, cloud computing, BlobSeer, VM storage, IaaS, multi-snapshotting, multi-deployment, large scale provisioning}
    }

  50. [50]Nicolae, B. and Cappello, F. 2011. BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots. SC ’11: 24th International Conference for High Performance Computing, Networking, Storage and Analysis (Seattle, USA, 2011), 34–1.   
    Details
    Keywords: IaaS, cloud computing, large scale multi-deployment, checkpoint restart, fault tolerance, virtual disk snapshots, BlobSeer
    Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce a framework that combines checkpoint-restart protocols at guest level with virtual machine (VM) disk-image multi-snapshotting and multi-deployment at host level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.
    BibTex
    @inproceedings{BlobCR-SC11,
      title = {BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots},
      year = {2011},
      author = {Nicolae, Bogdan and Cappello, Franck},
      booktitle = {SC ’11: 24th International Conference for High Performance Computing, Networking, Storage and Analysis},
      pages = {34-1},
      address = {Seattle, USA},
      doi = {10.1145/2063384.2063429},
      url = {http://hal.inria.fr/inria-00601865/en/},
      keywords = {IaaS, cloud computing, large scale multi-deployment, checkpoint restart, fault tolerance, virtual disk snapshots, BlobSeer}
    }

  51. [51]Nicolae, B., Cappello, F. and Antoniu, G. 2011. Optimizing multi-deployment on clouds by means of self-adaptive prefetching. Euro-Par ’11: 17th International Euro-Par Conference on Parallel Processing (Bordeaux, France, 2011), 503–513.   
    Details
    Keywords: IaaS, cloud computing, large scale multi-deployment, provisioning, adaptive I/O
    Abstract: With Infrastructure-as-a-Service (IaaS) cloud economics getting increasingly complex and dynamic, resource costs can vary greatly over short periods of time. Therefore, a critical issue is the ability to deploy, boot and terminate VMs very quickly, which enables cloud users to exploit elasticity to find the optimal trade-off between the computational needs (number of resources, usage time) and budget constraints. This paper proposes an adaptive prefetching mechanism aiming to reduce the time required to simultaneously boot a large number of VM instances on clouds from the same initial VM image (multi-deployment). Our proposal does not require any foreknowledge of the exact access pattern. It dynamically adapts to it at run time, enabling the slower instances to learn from the experience of the faster ones. Since all booting instances typically access only a small part of the virtual image along almost the same pattern, the required data can be pre-fetched in the background. Large scale experiments under concurrency on hundreds of nodes show that introducing such a prefetching mechanism can achieve a speed-up of up to 35% when compared to simple on-demand fetching.
    BibTex
    @inproceedings{BlobVM-EUROPAR11,
      title = {Optimizing multi-deployment on clouds by means of self-adaptive prefetching},
      year = {2011},
      author = {Nicolae, Bogdan and Cappello, Franck and Antoniu, Gabriel},
      booktitle = {Euro-Par ’11: 17th International Euro-Par Conference on Parallel Processing},
      pages = {503-513},
      address = {Bordeaux, France},
      doi = {10.1007/978-3-642-23400-2_46},
      url = {http://hal.inria.fr/inria-00594406/en/},
      keywords = {IaaS, cloud computing, large scale multi-deployment, provisioning, adaptive I/O}
    }

  52. [52]Tran, V.-T., Nicolae, B., Antoniu, G. and Bouge, L. 2011. Efficient support for MPI-I/O atomicity based on versioning. CCGRID ’11: 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (Newport Beach, USA, 2011), 514–523.   
    Details
    Keywords: large scale, storage, MPI-IO, atomicity, non-contiguous I/O, versioning
    Abstract: We consider the challenge of building data management systems that meet an important requirement of today’s data-intensive HPC applications: to provide a high I/O throughput while supporting highly concurrent data accesses. In this context, many applications rely on MPI-IO and require atomic, non-contiguous I/O operations that concurrently access shared data. In most existing implementations the atomicity requirement is often implemented through locking-based schemes, which have proven inefficient, especially for non-contiguous I/O. We claim that using a versioning-enabled storage backend has the potential to avoid the expensive synchronization exhibited by locking-based schemes, and is thus much more efficient. We describe a prototype implementation on top of ROMIO along this idea, and report on promising experimental results with standard MPI-IO benchmarks specifically designed to evaluate the performance of non-contiguous, overlapped I/O accesses under MPI atomicity guarantees.
    BibTex
    @inproceedings{BlobSeerCCGRID11,
      title = {Efficient support for MPI-I/O atomicity based on versioning},
      year = {2011},
      author = {Tran, Viet-Trung and Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc},
      booktitle = {CCGRID ’11: 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing},
      pages = {514-523},
      address = {Newport Beach, USA},
      doi = {10.1109/CCGrid.2011.60},
      url = {http://hal.inria.fr/inria-00565358/en/},
      keywords = {large scale, storage, MPI-IO, atomicity, non-contiguous I/O, versioning}
    }

  53. [53]Tran, V.-T., Nicolae, B., Antoniu, G. and Bouge, L. 2011. Pyramid: A large-scale array-oriented active storage system. LADIS ’11: Proceedings of the 5th Workshop on Large-Scale Distributed Systems and Middleware (Newport Beach, USA, 2011). 
    Details
    Keywords: large scale data management, multi-dimensional I/O, concurrency control, parallel array processing, versioning
    Abstract: The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
    BibTex
    @inproceedings{Pyramid-LADIS11,
      title = {Pyramid: A large-scale array-oriented active storage system},
      year = {2011},
      author = {Tran, Viet-Trung and Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc},
      booktitle = {LADIS ’11: Proceedings of the 5th Workshop on Large-Scale Distributed Systems and Middleware},
      address = {Newport Beach, USA},
      url = {https://hal.inria.fr/inria-00627665/en},
      keywords = {large scale data management, multi-dimensional I/O, concurrency control, parallel array processing, versioning}
    }

  54. [54]Suciu, A., Nicolae, B., Antoniu, G., Istvan, Z. and Szakats, I. 2010. Gathering Entropy at Large Scale with HAVEGE and BlobSeer. Automat. Comput. Appl. Math. 19, (2010), 3–11. 
    Details
    Keywords: random number generation, large scale, high throughput, high entropy, Blobseer, HAVEGE
    Abstract: Large sequences of random information are the foundation for a large class of applications: security, online gambling games, large scale Monte-Carlo simulations, etc. Many such applications are distributed and run on large-scale infrastructures such as clouds and grids. In this context, the random generator plays a crucial role: it needs to achieve a high entropy, a high throughput and last but not least a high degree of security. Several ways to generate high-entropy random information securely exist. For example, HAVEGE generates random information by gathering entropy from internal processor states of the machine where it is running alongside the user applications. These internal states are inherently volatile and impossible to tamper with in a controlled fashion by the applications running on the machine. A centralized approach however does not scale to the high throughput requirement in a large scale setting. In order to do so, the output of several such instances needs to be combined into a single output stream. While this certainly has a good potential to solve the high throughput requirement, the way the outputs of the instances are combined in a single stream becomes a new weak link that can negatively impact all three requirements and therefore has to be addressed properly. In this paper we propose a distributed random number generator that efficiently addresses the aforementioned issue. We introduce a series of mechanisms to preserve a high entropy and degree of security for the combined output result and implement them on top of BlobSeer, a data storage service specifically designed to offer a high throughput in large-scale deployments even under heavy access concurrency. Large-scale experiments were performed on the G5K testbed and demonstrate substantial benefits for our approach.
    BibTex
    @article{BlobRand10,
      title = {Gathering Entropy at Large Scale with HAVEGE and BlobSeer},
      year = {2010},
      author = {Suciu, Alin and Nicolae, Bogdan and Antoniu, Gabriel and Istvan, Zsolt and Szakats, Istvan},
      journal = {Automat. Comput. Appl. Math.},
      volume = {19},
      pages = {3-11},
      url = {https://hal.inria.fr/hal-00803430/en},
      keywords = {random number generation, large scale, high throughput, high entropy, Blobseer, HAVEGE}
    }

  55. [55]Montes, J., Nicolae, B., Antoniu, G., Sanchez, A. and Perez, M. 2010. Using Global Behavior Modeling to Improve QoS in Cloud Data Storage Services. CloudCom ’10: Proc. 2nd IEEE International Conference on Cloud Computing Technology and Science (Indianapolis, USA, 2010), 304–311.   
    Details
    Keywords: QoS, cloud computing, data storage, behavioral modeling, throughput stabilization, GloBeM, BlobSeer, MapReduce
    Abstract: The cloud computing model aims to make large-scale data-intensive computing affordable even for users with limited financial resources, who cannot invest in the expensive infrastructures necessary to run such workloads. In this context, MapReduce is emerging as a highly scalable programming paradigm that enables high-throughput data-intensive processing as a cloud service. Its performance is highly dependent on the underlying storage service, responsible for efficiently supporting massively parallel data accesses by guaranteeing a high throughput under heavy access concurrency. In this context, quality of service plays a crucial role: the storage service needs to sustain a stable throughput for each individual access, in addition to achieving a high aggregated throughput under concurrency. In this paper we propose a technique to address this problem using component monitoring, application-side feedback and behavior pattern analysis to automatically infer useful knowledge about the causes of poor quality of service and provide an easy way to reason about potential improvements. We apply our proposal to BlobSeer, a representative data storage service specifically designed to achieve high aggregated throughputs, and show through extensive experimentation substantial improvements in the stability of individual data read accesses under MapReduce workloads.
    BibTex
    @inproceedings{BlobSeerCloudCom,
      title = {Using Global Behavior Modeling to Improve QoS in Cloud Data Storage Services},
      year = {2010},
      author = {Montes, Jesus and Nicolae, Bogdan and Antoniu, Gabriel and Sanchez, Alberto and Perez, Maria},
      booktitle = {CloudCom '10: Proc. 2nd IEEE International Conference on Cloud Computing Technology and Science},
      pages = {304--311},
      address = {Indianapolis, USA},
      doi = {10.1109/CloudCom.2010.33},
      url = {https://hal.inria.fr/inria-00527650v1},
      keywords = {QoS, cloud computing, data storage, behavioral modeling, throughput stabilization, GloBeM, BlobSeer, MapReduce}
    }

  56. [56]Nicolae, B. 2010. High Throughput Data-Compression for Cloud Storage. Globe ’10: Proc. 3rd International Conference on Data Management in Grid and P2P Systems (Bilbao, Spain, 2010), 1–12.   
    Details
    Keywords: cloud computing, distributed data storage, high throughput, adaptive I/O, data intensive applications
    Abstract: As data volumes processed by large-scale distributed data-intensive applications grow at high speed, an increasing I/O pressure is put on the underlying storage service, which is responsible for data management. One particularly difficult challenge that the storage service has to deal with is sustaining a high I/O throughput in spite of heavy access concurrency to massive data. In order to do so, massively parallel data transfers need to be performed, which invariably lead to high bandwidth utilization. With the emergence of cloud computing, data-intensive applications become attractive to a wide public that does not have the resources to maintain expensive large-scale distributed infrastructures to run such applications. In this context, minimizing storage space and bandwidth utilization is highly relevant, as these resources are paid for according to consumption. This paper evaluates the trade-off resulting from transparently applying data compression to conserve storage space and bandwidth at the cost of a slight computational overhead. We aim to reduce storage space and bandwidth needs with minimal impact on I/O throughput under heavy access concurrency. Our solution builds on BlobSeer, a highly parallel distributed data management service specifically designed to enable reading, writing and appending huge data sequences that are fragmented and distributed at a large scale. We demonstrate the benefits of our approach through extensive experiments on the Grid’5000 testbed.
    BibTex
    @inproceedings{BlobSeerCompression,
      title = {High Throughput Data-Compression for Cloud Storage},
      year = {2010},
      author = {Nicolae, Bogdan},
      booktitle = {Globe '10: Proc. 3rd International Conference on Data Management in Grid and P2P Systems},
      pages = {1--12},
      address = {Bilbao, Spain},
      doi = {10.1007/978-3-642-15108-8_1},
      url = {https://hal.inria.fr/inria-00490541},
      keywords = {cloud computing, distributed data storage, high throughput, adaptive I/O, data intensive applications}
    }
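The adaptive trade-off explored in this paper can be illustrated with a minimal sketch (illustrative only, not the BlobSeer API; the function names are invented): a chunk is stored compressed only when compression actually shrinks it enough to justify the CPU cost, so incompressible data never pays the overhead twice.

```python
import zlib

def write_chunk(chunk: bytes, min_ratio: float = 0.9):
    """Transparently compress a chunk before storage/transfer, but fall
    back to the raw bytes when the savings are below min_ratio."""
    packed = zlib.compress(chunk)
    if len(packed) < min_ratio * len(chunk):
        return packed, True      # compressible: save space and bandwidth
    return chunk, False          # incompressible: avoid decompression cost

def read_chunk(stored: bytes, compressed: bool) -> bytes:
    """Reverse the write-side decision transparently for the reader."""
    return zlib.decompress(stored) if compressed else stored

# Repetitive data compresses well and is stored packed; the round trip
# is lossless either way.
text = b"BlobSeer " * 4096
stored, flag = write_chunk(text)
assert flag and read_chunk(stored, flag) == text
```

In the paper's setting the sampling decision is made per chunk on the client side, which is why the computational overhead stays off the critical path of incompressible workloads.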

  57. [57]Nicolae, B. 2010. BlobSeer: Efficient Data Management for Data-Intensive Applications Distributed at Large-Scale. IPDPS ’10: Proc. 24th IEEE International Symposium on Parallel and Distributed Processing: Workshops and PhD Forum (Atlanta, USA, 2010), 1–4.
    Details
    Keywords: data intensive applications, large scale, distributed data storage, high throughput, heavy access concurrency, versioning, efficient concurrency control, data striping, distributed metadata management
    Abstract: Large-scale data-intensive applications are a class of applications that acquire and maintain massive datasets, while performing distributed computations on these datasets. In this context, a key factor is the storage service responsible for data management, as it has to efficiently deal with massively parallel data access in order to ensure scalability and performance for the whole system. This PhD thesis proposes BlobSeer, a data management service specifically designed to address the needs of large-scale data-intensive applications. Three key design factors: data striping, distributed metadata management and versioning-based concurrency control enable BlobSeer not only to provide efficient support for features commonly used to exploit data-level parallelism, but also to enable exploring a set of new features that can be leveraged to further improve parallel data access. Extensive experiments, both in scale and scope, on the Grid’5000 testbed demonstrate clear benefits of using BlobSeer as the underlying storage for a variety of scenarios: data-intensive grid applications, grid file systems, MapReduce datacenters, desktop grids. Further work targets providing efficient storage solutions for cloud computing as well.
    BibTex
    @inproceedings{BlobSeerPhdForum,
      title = {BlobSeer: Efficient Data Management for Data-Intensive Applications Distributed at Large-Scale},
      year = {2010},
      author = {Nicolae, Bogdan},
      booktitle = {IPDPS '10: Proc. 24th IEEE International Symposium on Parallel and Distributed Processing: Workshops and PhD Forum},
      pages = {1--4},
      address = {Atlanta, USA},
      doi = {10.1109/IPDPSW.2010.5470802},
      url = {https://hal.inria.fr/inria-00457809/en/},
      note = {Best Poster Award},
      keywords = {data intensive applications, large scale, distributed data storage, high throughput, heavy access concurrency, versioning, efficient concurrency control, data striping, distributed metadata management}
    }

  58. [58]Nicolae, B. 2010. BlobSeer: Towards Efficient Data Storage Management for Large-Scale, Distributed Systems. University of Rennes 1. 
    Details
    Keywords: large scale data storage, cloud storage, versioning, decentralized metadata management, high throughput, heavy access concurrency
    Abstract: With data volumes increasing at a high rate and the emergence of highly scalable infrastructures (cloud computing, petascale computing), distributed management of data becomes a crucial issue that faces many challenges. This thesis brings several contributions in order to address such challenges. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, it highlights the potentially large benefits of using versioning in this context. Second, based on these principles, it introduces a series of distributed data and metadata management algorithms that enable a high throughput under concurrency. Third, it shows how to efficiently implement these algorithms in practice, dealing with key issues such as high-performance parallel transfers, efficient maintenance of distributed data structures, fault tolerance, etc. These results are used to build BlobSeer, an experimental prototype that is used to demonstrate both the theoretical benefits of the approach in synthetic benchmarks and the practical benefits in real-life, applicative scenarios: as a storage backend for MapReduce applications, as a storage backend for deployment and snapshotting of virtual machine images in clouds, and as a quality-of-service enabled data storage service for cloud applications. Extensive experimentation on the Grid’5000 testbed shows that BlobSeer remains scalable and sustains a high throughput even under heavy access concurrency, outperforming several state-of-the-art approaches by a large margin.
    BibTex
    @phdthesis{BlobSeerThesis,
      title = {BlobSeer: Towards Efficient Data Storage Management for Large-Scale, Distributed Systems},
      year = {2010},
      author = {Nicolae, Bogdan},
      school = {University of Rennes 1},
      address = {Rennes, France},
      url = {https://hal.inria.fr/tel-00552271/en},
      keywords = {large scale data storage, cloud storage, versioning, decentralized metadata management, high throughput, heavy access concurrency},
      month = nov
    }

  59. [59]Nicolae, B., Moise, D., Antoniu, G., Bouge, L. and Dorier, M. 2010. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications. IPDPS ’10: Proc. 24th IEEE International Parallel and Distributed Processing Symposium (Atlanta, USA, 2010), 1–12.   
    Details
    Keywords: large-scale distributed computing, data-intensive, MapReduce, distributed file systems, high throughput, heavy access concurrency, Hadoop, BlobSeer
    Abstract: Hadoop is a software framework supporting the Map-Reduce programming model. It relies on the Hadoop Distributed File System (HDFS) as its primary storage system. The efficiency of HDFS is crucial for the performance of Map-Reduce applications. We substitute the original HDFS layer of Hadoop with a new, concurrency-optimized data storage layer based on the BlobSeer data management service. Thereby, the efficiency of Hadoop is significantly improved for data-intensive Map-Reduce applications, which naturally exhibit a high degree of data access concurrency. Moreover, BlobSeer’s features (built-in versioning, its support for concurrent append operations) open the possibility for Hadoop to further extend its functionalities. We report on extensive experiments conducted on the Grid’5000 testbed. The results illustrate the benefits of our approach over the original HDFS-based implementation of Hadoop.
    BibTex
    @inproceedings{BlobSeerMapRed,
      title = {BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications},
      year = {2010},
      author = {Nicolae, Bogdan and Moise, Diana and Antoniu, Gabriel and Bouge, Luc and Dorier, Matthieu},
      booktitle = {IPDPS '10: Proc. 24th IEEE International Parallel and Distributed Processing Symposium},
      pages = {1--12},
      address = {Atlanta, USA},
      doi = {10.1109/IPDPS.2010.5470433},
      url = {https://hal.inria.fr/inria-00456801/en},
      keywords = {large-scale distributed computing, data-intensive, MapReduce, distributed file systems, high throughput, heavy access concurrency, Hadoop, BlobSeer}
    }

  60. [60]Nicolae, B., Antoniu, G. and Bouge, L. 2009. BlobSeer: How to Enable Efficient Versioning for Large Object Storage under Heavy Access Concurrency. EDBT/ICDT ’09 Workshops (Saint-Petersburg, Russia, 2009), 18–25.   
    Details
    Keywords: large scale data storage, concurrency control, versioning, decentralized metadata
    Abstract: To accommodate the needs of large-scale distributed P2P systems, scalable data management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This paper addresses the problem of efficiently storing and accessing very large binary data objects (blobs). It proposes an efficient versioning scheme allowing a large number of clients to concurrently read, write and append data to huge blobs that are fragmented and distributed at a very large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme, based on a distributed segment tree built on top of a Distributed Hash Table (DHT). Our approach has been implemented and experimented within our BlobSeer prototype on the Grid’5000 testbed, using up to 175 nodes.
    BibTex
    @inproceedings{BlobSeerDAMAP,
      title = {BlobSeer: How to Enable Efficient Versioning for Large Object Storage under Heavy Access Concurrency},
      year = {2009},
      author = {Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc},
      booktitle = {EDBT/ICDT '09 Workshops},
      pages = {18--25},
      address = {Saint-Petersburg, Russia},
      doi = {10.1145/1698790.1698796},
      url = {https://hal.inria.fr/inria-00382354v1},
      keywords = {large scale data storage, concurrency control, versioning, decentralized metadata}
    }
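The versioning idea behind this metadata scheme can be sketched as a toy persistent segment tree (illustrative only: in the paper the tree nodes are distributed over a DHT, and all names here are invented). A write publishes a new root that shares all untouched subtrees with older snapshots, so concurrent readers of an old version never see the writer's changes.

```python
class Node:
    """Segment-tree node covering chunk indices [lo, hi); leaves hold
    (chunk_index, chunk_version) pairs."""
    def __init__(self, lo, hi, left=None, right=None, chunk=None):
        self.lo, self.hi = lo, hi
        self.left, self.right = left, right
        self.chunk = chunk

def build(lo, hi, versions):
    if hi - lo == 1:
        return Node(lo, hi, chunk=(lo, versions[lo]))
    mid = (lo + hi) // 2
    return Node(lo, hi, build(lo, mid, versions), build(mid, hi, versions))

def update(node, idx, version):
    """Publish a write: fresh nodes on the root-to-leaf path only,
    everything else is shared with the previous snapshot."""
    if node.hi - node.lo == 1:
        return Node(node.lo, node.hi, chunk=(idx, version))
    mid = (node.lo + node.hi) // 2
    if idx < mid:
        return Node(node.lo, node.hi, update(node.left, idx, version), node.right)
    return Node(node.lo, node.hi, node.left, update(node.right, idx, version))

def lookup(node, lo, hi):
    """Collect the (chunk, version) pairs covering [lo, hi) in one snapshot."""
    if hi <= node.lo or node.hi <= lo:
        return []
    if node.chunk is not None:
        return [node.chunk]
    return lookup(node.left, lo, hi) + lookup(node.right, lo, hi)

v1 = build(0, 4, [1, 1, 1, 1])   # snapshot 1: four chunks, all at version 1
v2 = update(v1, 2, 2)            # a writer publishes chunk 2, version 2
assert lookup(v1, 0, 4) == [(0, 1), (1, 1), (2, 1), (3, 1)]  # v1 untouched
assert lookup(v2, 1, 3) == [(1, 1), (2, 2)]                  # v2 sees the write
```

Because readers traverse an immutable snapshot, no locking of the blob is needed; this is the lock-free property the BlobSeer papers rely on.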

  61. [61]Nicolae, B., Antoniu, G. and Bouge, L. 2009. Enabling High Data Throughput in Desktop Grids Through Decentralized Data and Metadata Management: The BlobSeer Approach. Euro-Par ’09: Proc. 15th International Euro-Par Conference on Parallel Processing (Delft, The Netherlands, 2009), 404–416.
    Details
    Keywords: desktop grids, distributed metadata management, data intensive applications, large data size, heavy access concurrency, high speed writes
    Abstract: Whereas traditional Desktop Grids rely on centralized servers for data management, some recent progress has been made to enable distributed, large input data, using peer-to-peer (P2P) protocols and Content Distribution Networks (CDN). We take a step further and propose a generic, yet efficient data storage which enables the use of Desktop Grids for applications with high output data requirements, where the access grain and the access patterns may be random. Our solution builds on a blob management service enabling a large number of concurrent clients to efficiently read/write and append huge data that are fragmented and distributed at a large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme using a distributed segment tree built on top of a Distributed Hash Table (DHT). The proposed approach has been implemented and its benefits have successfully been demonstrated within our BlobSeer prototype on the Grid’5000 testbed.
    BibTex
    @inproceedings{BlobSeerEuroPar09,
      title = {Enabling High Data Throughput in Desktop Grids Through Decentralized Data and Metadata Management: The BlobSeer Approach},
      year = {2009},
      author = {Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc},
      booktitle = {Euro-Par '09: Proc. 15th International Euro-Par Conference on Parallel Processing},
      pages = {404--416},
      address = {Delft, The Netherlands},
      doi = {10.1007/978-3-642-03869-3_40},
      url = {https://hal.inria.fr/inria-00410956v2},
      keywords = {desktop grids, distributed metadata management, data intensive applications, large data size, heavy access concurrency, high speed writes}
    }

  62. [62]Tran, V.-T., Antoniu, G., Nicolae, B. and Bouge, L. 2009. Towards A Grid File System Based On A Large-Scale BLOB Management Service. Grids, P2P and Service Computing (Delft, The Netherlands, 2009), 7–19.   
    Details
    Keywords: data intensive applications, large scale, distributed data storage, high throughput, heavy access concurrency, versioning, efficient concurrency control, data striping, distributed metadata management
    Abstract: Large-scale data-intensive applications are a class of applications that acquire and maintain massive datasets, while performing distributed computations on these datasets. In this context, a key factor is the storage service responsible for data management, as it has to efficiently deal with massively parallel data access in order to ensure scalability and performance for the whole system. This PhD thesis proposes BlobSeer, a data management service specifically designed to address the needs of large-scale data-intensive applications. Three key design factors: data striping, distributed metadata management and versioning-based concurrency control enable BlobSeer not only to provide efficient support for features commonly used to exploit data-level parallelism, but also to enable exploring a set of new features that can be leveraged to further improve parallel data access. Extensive experiments, both in scale and scope, on the Grid’5000 testbed demonstrate clear benefits of using BlobSeer as the underlying storage for a variety of scenarios: data-intensive grid applications, grid file systems, MapReduce datacenters, desktop grids. Further work targets providing efficient storage solutions for cloud computing as well.
    BibTex
    @inproceedings{BlobSeerGfarm,
      title = {Towards A Grid File System Based On A Large-Scale BLOB Management Service},
      year = {2009},
      author = {Tran, Viet-Trung and Antoniu, Gabriel and Nicolae, Bogdan and Bouge, Luc},
      booktitle = {Grids, P2P and Service Computing},
      pages = {7--19},
      address = {Delft, The Netherlands},
      doi = {10.1109/IPDPSW.2010.5470802},
      url = {https://hal.inria.fr/inria-00457809v1},
      keywords = {data intensive applications, large scale, distributed data storage, high throughput, heavy access concurrency, versioning, efficient concurrency control, data striping, distributed metadata management}
    }

  63. [63]Nicolae, B., Antoniu, G. and Bouge, L. 2008. Enabling lock-free concurrent fine-grain access to massive distributed data: Application to supernovae detection. Cluster ’08: Proc. IEEE International Conference on Cluster Computing: Poster Session (Tsukuba, Japan, 2008), 310–315.
    Details
    Keywords: large scale data management, object storage, huge file, versioning, heavy access concurrency
    Abstract: We consider the problem of efficiently managing massive data in a large-scale distributed environment. We consider data strings with sizes on the order of terabytes, shared and accessed by concurrent clients. On each individual access, a segment of a string, on the order of megabytes, is read or modified. Our goal is to provide the clients with efficient fine-grain access to the data string as concurrently as possible, without locking the string itself. This issue is crucial in the context of applications in the fields of astronomy, databases, data mining and multimedia. We illustrate these requirements with the case of an application for searching supernovae. Our solution relies on distributed, RAM-based data storage, while leveraging a DHT-based, parallel metadata management scheme. The proposed architecture and algorithms have been validated through a software prototype and evaluated in a cluster environment.
    BibTex
    @inproceedings{BlobSeerCLUSTER,
      title = {Enabling lock-free concurrent fine-grain access to massive distributed data: Application to supernovae detection},
      year = {2008},
      author = {Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc},
      booktitle = {Cluster '08: Proc. IEEE International Conference on Cluster Computing: Poster Session},
      pages = {310--315},
      address = {Tsukuba, Japan},
      doi = {10.1109/CLUSTR.2008.4663787},
      url = {https://hal.inria.fr/inria-00329698},
      keywords = {large scale data management, object storage, huge file, versioning, heavy access concurrency}
    }

  64. [64]Nicolae, B., Antoniu, G. and Bouge, L. 2008. Distributed Management of Massive Data: An Efficient Fine-Grain Data Access Scheme. VECPAR ’08: Proc. 8th International Meeting on High Performance Computing for Computational Science (Toulouse, France, 2008), 532–543.
    Details
    Keywords: high performance distributed computing, large scale data sharing, distributed data management, lock-free, fine grain access
    Abstract: This paper addresses the problem of efficiently storing and accessing massive data blocks in a large-scale distributed environment, while providing efficient fine-grain access to data subsets. This issue is crucial in the context of applications in the field of databases, data mining and multimedia. We propose a data sharing service based on distributed, RAM-based storage of data, while leveraging a DHT-based, natively parallel metadata management scheme. As opposed to the most commonly used grid storage infrastructures that provide mechanisms for explicit data localization and transfer, we provide a transparent access model, where data are accessed through global identifiers. Our proposal has been validated through a prototype implementation whose preliminary evaluation provides promising results.
    BibTex
    @inproceedings{BlobSeerVECPAR,
      title = {Distributed Management of Massive Data: An Efficient Fine-Grain Data Access Scheme},
      year = {2008},
      author = {Nicolae, Bogdan and Antoniu, Gabriel and Bouge, Luc},
      booktitle = {VECPAR '08: Proc. 8th International Meeting on High Performance Computing for Computational Science},
      pages = {532--543},
      address = {Toulouse, France},
      doi = {10.1007/978-3-540-92859-1_47},
      url = {https://hal.inria.fr/inria-00323248v1},
      keywords = {high performance distributed computing, large scale data sharing, distributed data management, lock-free, fine grain access}
    }