2 results for distributed curation
at Duke University
Abstract:
Distributed Computing frameworks belong to a class of programming models that allow developers to
launch workloads on large clusters of machines. Due to the dramatic increase in the volume of
data gathered by ubiquitous computing devices, data analytic workloads have become a common
case among distributed computing applications, making Data Science an entire field of
Computer Science. We argue that a data scientist's concerns lie in three main components: a dataset,
a sequence of operations they wish to apply to this dataset, and any constraints they may have
related to their work (performance, QoS, budget, etc.). However, it is extremely
difficult to perform data science without domain expertise. One needs to select the right amount
and type of resources, pick a framework, and configure it. Also, users often run their
applications in shared environments governed by schedulers that expect them to specify their resource
needs precisely. Because the cited frameworks are distributed and concurrent by nature, monitoring and
profiling are hard, high-dimensional problems that keep users from making the right
configuration choices and determining the right amount of resources they need. Paradoxically, the
system gathers a large amount of monitoring data at runtime, which remains unused.
In the ideal abstraction we envision for data scientists, the system is adaptive, able to exploit
monitoring data to learn about workloads, and process user requests into a tailored execution
context. In this work, we study different techniques that have been used to make steps toward
such system awareness, and explore a new way to do so by implementing machine learning
techniques to recommend a specific subset of system configurations for Apache Spark applications.
Furthermore, we present an in-depth study of Apache Spark executor configuration, which highlights
the complexity of choosing the best configuration for a given workload.
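To give a rough sense of why executor configuration is hard to choose by hand, the sketch below enumerates candidate executor layouts for a small cluster. It is only an illustration under stated assumptions: the function name `executor_configs` and the cluster sizes are hypothetical, the three property names are standard Spark settings, and the sketch ignores memory overhead, driver placement, and cores left idle by uneven division.

```python
def executor_configs(nodes, cores_per_node, mem_gb_per_node):
    """Enumerate simple executor layouts for a homogeneous cluster.

    Hypothetical helper: every executor on a node gets the same number of
    cores and an equal share of the node's memory. Overheads are ignored.
    """
    configs = []
    for cores in range(1, cores_per_node + 1):
        executors_per_node = cores_per_node // cores
        mem_gb = mem_gb_per_node // executors_per_node
        if mem_gb < 1:
            continue  # not enough memory to give each executor 1 GB
        configs.append({
            "spark.executor.cores": str(cores),
            "spark.executor.instances": str(executors_per_node * nodes),
            "spark.executor.memory": f"{mem_gb}g",
        })
    return configs


# Assumed example cluster: 4 nodes, 8 cores and 32 GB each.
for conf in executor_configs(nodes=4, cores_per_node=8, mem_gb_per_node=32):
    print(conf)
```

Even this simplified space offers several layouts for one small cluster, and the best one depends on the workload, which is the difficulty the study above points at.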
Abstract:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm to address these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to the usual competitors.
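The four steps of message described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a lasso fitted by plain proximal gradient (ISTA) as the per-subset selector, ordinary least squares for the per-subset refit, and a tie rule that keeps a feature when its median inclusion is at least 0.5. The names `lasso_ista` and `message`, and the parameters `m` and `lam`, are hypothetical. The loops run sequentially here; in a distributed setting each subset would be handled by its own worker.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=500):
    """Lasso for (1/2n)||y - Xb||^2 + lam*||b||_1 via proximal gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b


def message(X, y, m, lam):
    """Sketch of median selection subset aggregation over m row subsets."""
    n, p = X.shape
    subsets = np.array_split(np.arange(n), m)

    # Step 1: feature selection on each subset (assumed selector: lasso).
    inclusion = np.array([lasso_ista(X[idx], y[idx], lam) != 0 for idx in subsets])

    # Step 2: median inclusion index across subsets (ties count as selected).
    selected = np.median(inclusion.astype(float), axis=0) >= 0.5

    # Steps 3 and 4: refit the selected features on each subset, then average.
    coef = np.zeros(p)
    if selected.any():
        fits = [np.linalg.lstsq(X[idx][:, selected], y[idx], rcond=None)[0]
                for idx in subsets]
        coef[selected] = np.mean(fits, axis=0)
    return selected, coef
```

Only the boolean inclusion vectors and the short coefficient vectors need to be exchanged between subsets, which is consistent with the low-communication property claimed above.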
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. I show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
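The partition, decorrelate, and fit flow described above might look like the sketch below. The abstract does not spell out the decorrelation step, so the transform used here, F = sqrt(p) (X X^T + r I)^(-1/2) applied to both the design matrix and the response, is an assumption, as are the names `decorrelate`, `lasso_ista`, and `deco` and the parameters `lam` and `r`. The per-group fitter is a plain ISTA lasso chosen only to keep the example self-contained; any high-dimensional method could take its place.

```python
import numpy as np


def decorrelate(X, y, r=1e-6):
    """Assumed decorrelation: premultiply by sqrt(p) * (X X^T + r I)^(-1/2)."""
    n, p = X.shape
    evals, evecs = np.linalg.eigh(X @ X.T + r * np.eye(n))
    F = np.sqrt(p) * (evecs / np.sqrt(evals)) @ evecs.T  # inverse matrix square root
    return F @ X, F @ y


def lasso_ista(X, y, lam, n_iter=500):
    """Lasso for (1/2n)||y - Xb||^2 + lam*||b||_1 via proximal gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b


def deco(X, y, m, lam, r=1e-6):
    """Sketch: decorrelate once, split the features into m groups, fit each group."""
    n, p = X.shape
    X_t, y_t = decorrelate(X, y, r)
    coef = np.zeros(p)
    for cols in np.array_split(np.arange(p), m):  # one group per worker
        coef[cols] = lasso_ista(X_t[:, cols], y_t, lam)
    return coef
```

With more features than samples and a small `r`, the transformed design satisfies X_t X_t^T ≈ p I, so each worker can fit its own column group against the same transformed response without further communication.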
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework, DEME (DECO-message), that leverages both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.
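The block structure described above can be illustrated with a small sketch that covers only the index bookkeeping: rows are split first, then the columns of each row cube. It does not perform decorrelation, fitting, or the reverse-order aggregation, and the function name `partition_blocks` and its parameters are hypothetical.

```python
import numpy as np


def partition_blocks(X, row_parts, col_parts):
    """Split X into a row_parts x col_parts grid of contiguous blocks."""
    row_idx = np.array_split(np.arange(X.shape[0]), row_parts)
    col_idx = np.array_split(np.arange(X.shape[1]), col_parts)
    # Outer list: row cubes (sample space); inner list: feature groups.
    return [[X[np.ix_(rows, cols)] for cols in col_idx] for rows in row_idx]


X = np.arange(24).reshape(6, 4)
blocks = partition_blocks(X, row_parts=3, col_parts=2)
print(blocks[0][0].shape)  # each block here is 2 x 2
```

Reassembling the grid with `np.block(blocks)` returns the original matrix, which confirms that the blocks tile the data without overlap or loss.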