Studying Recommender Systems to Enhance Distributed Computing Schedulers


Author(s): Demoulin, Henri Maxime
Contributor(s)

Lee, Benjamin C.

Date(s)

2016

Abstract

Distributed computing frameworks belong to a class of programming models that allow developers to launch workloads on large clusters of machines. Due to the dramatic increase in the volume of data gathered by ubiquitous computing devices, data analytics workloads have become a common case among distributed computing applications, making Data Science an entire field of Computer Science. We argue that a data scientist's concerns lie in three main components: a dataset, a sequence of operations they wish to apply to this dataset, and any constraints related to their work (performance, QoS, budget, etc.). However, it is extremely difficult to perform data science without domain expertise. One needs to select the right amount and type of resources, pick a framework, and configure it. Moreover, users often run their applications in shared environments governed by schedulers that expect them to specify their resource needs precisely. Because the cited frameworks are inherently distributed and concurrent, monitoring and profiling are hard, high-dimensional problems that prevent users from making the right configuration choices and determining the right amount of resources they need. Paradoxically, the system gathers a large amount of monitoring data at runtime, which remains unused. In the ideal abstraction we envision for data scientists, the system is adaptive, able to exploit monitoring data to learn about workloads and to turn user requests into a tailored execution context. In this work, we study different techniques that have been used to make steps toward such system awareness, and explore a new way to do so by implementing machine learning techniques to recommend a specific subset of system configurations for Apache Spark applications. Furthermore, we present an in-depth study of Apache Spark executor configuration, which highlights the complexity of choosing the best one for a given workload.

Identifier

http://hdl.handle.net/10161/12364

Keywords #Computer science
Type

Thesis