14 results for scalable
at Duke University
Abstract:
The difluoromethyl-allo-threonyl hydroxamate-based compound LPC-058 is a potent inhibitor of UDP-3-O-(R-3-hydroxymyristoyl)-N-acetylglucosamine deacetylase (LpxC) in Gram-negative bacteria. A scalable synthesis of this compound is described. The key step in the synthetic sequence is a transition metal/base-catalyzed aldol reaction of methyl isocyanoacetate and difluoroacetone, giving rise to 4-(methoxycarbonyl)-5,5-disubstituted 2-oxazoline. A simple NMR-based determination of enantiomeric purity is also described.
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true when analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantitative technology, can be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Therefore, efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate Gaussian vector; with this assumption, the model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions of these methods, developed to address challenges in big data. The new methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data, and estimation of population structure from genotype data.
Abstract:
Uncertainty quantification (UQ) is both an old and a new concept. The current novelty lies in the interaction and synthesis of mathematical models, computer experiments, statistics, field/real experiments, and probability theory, with a particular emphasis on large-scale simulations by computer models. The challenges come not only from the complexity of the scientific questions but also from the size of the information. The focus of this thesis is to provide statistical models that are scalable to the massive data produced in computer experiments and real experiments, through fast and robust statistical inference.
Chapter 2 provides a practical approach for simultaneously emulating/approximating a massive number of functions, with an application to hazard quantification for the Soufrière Hills volcano on the island of Montserrat. Chapter 3 discusses another problem with massive data, in which the number of observations of a function is large. An exact algorithm that is linear in time is developed for the problem of interpolating methylation levels. Chapter 4 and Chapter 5 both concern robust inference for the models. Chapter 4 provides new robustness criteria for parameter estimation, and several inference methods are shown to satisfy them. Chapter 5 develops a new prior that satisfies additional criteria and is therefore proposed for use in practice.
Abstract:
We demonstrate a scalable approach to addressing multiple atomic qubits for use in quantum information processing. Individually trapped 87Rb atoms in a linear array are selectively manipulated with a single laser guided by a microelectromechanical beam steering system. Single qubit oscillations are shown on multiple sites at frequencies of ≃3.5 MHz with negligible crosstalk to neighboring sites. Switching times between the central atom and its closest neighbor were measured to be 6-7 μs while moving between the central atom and an atom two trap sites away took 10-14 μs. © 2010 American Institute of Physics.
Abstract:
We developed a ratiometric method capable of estimating total hemoglobin concentration from optically measured diffuse reflectance spectra. The three isosbestic wavelength ratio pairs that best correlated to total hemoglobin concentration independent of saturation and scattering were 545/390, 452/390, and 529/390 nm. These wavelength pairs were selected using forward Monte Carlo simulations which were used to extract hemoglobin concentration from experimental phantom measurements. Linear regression coefficients from the simulated data were directly applied to the phantom data, by calibrating for instrument throughput using a single phantom. Phantoms with variable scattering and hemoglobin saturation were tested with two different instruments, and the average percent errors between the expected and ratiometrically-extracted hemoglobin concentration were as low as 6.3%. A correlation of r = 0.88 between hemoglobin concentration extracted using the 529/390 nm isosbestic ratio and a scalable inverse Monte Carlo model was achieved for in vivo dysplastic cervical measurements (hemoglobin concentrations have been shown to be diagnostic for the detection of cervical pre-cancer by our group). These results indicate that use of such a simple ratiometric method has the potential to be used in clinical applications where tissue hemoglobin concentrations need to be rapidly quantified in vivo.
Abstract:
BACKGROUND: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. RESULTS: We have developed a CUDA-based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. CONCLUSIONS: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
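To illustrate why permutation resampling is embarrassingly parallel, here is a minimal CPU-only sketch in Python. It is not permGPU's code. The function name `permutation_p_value`, the difference-in-means statistic, and the add-one correction are assumptions chosen for illustration; the abstract does not list its six test statistics.

```python
import random
from statistics import mean


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    `values` holds one measurement per sample (e.g. one gene's expression);
    `labels` holds a binary trait (0 or 1) for the same samples.
    """
    rng = random.Random(seed)

    def stat(lbls):
        group1 = [v for v, l in zip(values, lbls) if l == 1]
        group0 = [v for v, l in zip(values, lbls) if l == 0]
        return abs(mean(group1) - mean(group0))

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    # Each iteration depends only on its own shuffle, so the iterations
    # (and, in a microarray study, the genes) could run independently.
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Because no iteration reads another iteration's result, the loop body maps naturally onto independent workers, which is the property a GPU implementation can exploit.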
Abstract:
The computational detection of regulatory elements in DNA is a difficult but important problem impacting our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to utilize cross-species conservation for this task have been hampered both by evolutionary changes of functional sites and poor performance of general-purpose alignment programs when applied to non-coding sequence. We describe a new and flexible framework for modeling binding site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. As the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation is scalable in terms of numbers of species and sequence lengths and can produce alignments and binding-site predictions with accuracy rivaling or exceeding current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components on extensive simulations of realistic sequence data and apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites. Different models and modeling assumptions can be easily specified, thus providing an invaluable tool for the exploration of biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.
Abstract:
An enterprise information system (EIS) is an integrated data-applications platform characterized by diverse, heterogeneous, and distributed data sources. For many enterprises, a number of business processes still depend heavily on static rule-based methods and extensive human expertise. Enterprises are faced with the need for optimizing operation scheduling, improving resource utilization, discovering useful knowledge, and making data-driven decisions.
This thesis research is focused on real-time optimization and knowledge discovery that addresses workflow optimization, resource allocation, as well as data-driven predictions of process-execution times, order fulfillment, and enterprise service-level performance. In contrast to prior work on data analytics techniques for enterprise performance optimization, the emphasis here is on realizing scalable and real-time enterprise intelligence based on a combination of heterogeneous system simulation, combinatorial optimization, machine-learning algorithms, and statistical methods.
On-demand digital-print service is a representative enterprise requiring a powerful EIS. We use real-life data from Reischling Press, Inc. (RPI), a digital-print-service provider (PSP), to evaluate our optimization algorithms.
In order to handle the increase in volume and diversity of demands, we first present a high-performance, scalable, and real-time production scheduling algorithm for production automation based on an incremental genetic algorithm (IGA). The objective of this algorithm is to optimize the order dispatching sequence and balance resource utilization. Compared to prior work, this solution is scalable for a high volume of orders and it provides fast scheduling solutions for orders that require complex fulfillment procedures. Experimental results highlight its potential benefit in reducing production inefficiencies and enhancing the productivity of an enterprise.
We next discuss analysis and prediction of different attributes involved in hierarchical components of an enterprise. We start from a study of the fundamental processes related to real-time prediction. Our process-execution time and process status prediction models integrate statistical methods with machine-learning algorithms. In addition to improving prediction accuracy compared to stand-alone machine-learning algorithms, they also provide a probabilistic estimate of the predicted status. An order generally consists of multiple series and parallel processes. We next introduce an order-fulfillment prediction model that combines advantages of multiple classification models by incorporating flexible decision-integration mechanisms. Experimental results show that adopting due dates recommended by the model can significantly reduce enterprise late-delivery ratio. Finally, we investigate service-level attributes that reflect the overall performance of an enterprise. We analyze and decompose time-series data into different components according to their hierarchical periodic nature, perform correlation analysis, and develop univariate prediction models for each component as well as multivariate models for correlated components. Predictions for the original time series are aggregated from the predictions of its components. In addition to a significant increase in mid-term prediction accuracy, this distributed modeling strategy also improves short-term time-series prediction accuracy.
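The decompose-predict-aggregate idea for time series can be sketched as follows. This is an illustrative toy, not the thesis's models. The two-level split (per-period level plus within-period deviation), the drift and phase-mean predictors, and all function names are assumptions, and the multivariate models for correlated components are not shown.

```python
def decompose(series, period):
    """Split a series into per-period levels and within-period deviations."""
    n_full = len(series) // period
    blocks = [series[i * period:(i + 1) * period] for i in range(n_full)]
    levels = [sum(block) / period for block in blocks]
    deviations = [[x - lvl for x in block] for block, lvl in zip(blocks, levels)]
    return levels, deviations


def forecast_level(levels):
    """Univariate model for the level component: last level plus mean drift."""
    drift = (levels[-1] - levels[0]) / (len(levels) - 1)
    return levels[-1] + drift


def forecast_deviation(deviations):
    """Univariate model for the periodic component: mean deviation per phase."""
    n = len(deviations)
    return [sum(d[k] for d in deviations) / n for k in range(len(deviations[0]))]


def forecast_next_period(series, period):
    """Aggregate the component forecasts into a forecast for the next period."""
    levels, deviations = decompose(series, period)
    level = forecast_level(levels)
    return [level + dev for dev in forecast_deviation(deviations)]


# Synthetic daily series: linear growth plus a fixed weekly pattern (made-up data).
pattern = [5.0, 3.0, 0.0, -2.0, -4.0, 1.0, 4.0]
history = [100 + 0.5 * t + pattern[t % 7] for t in range(28)]
predicted = forecast_next_period(history, period=7)
actual = [100 + 0.5 * t + pattern[t % 7] for t in range(28, 35)]
```

The sketch needs at least two complete periods of history. On this noise-free synthetic series the aggregated forecast reproduces the next week, which only shows that the components recombine consistently; it says nothing about accuracy on real enterprise data.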
In summary, this thesis research has led to a set of characterization, optimization, and prediction tools for an EIS to derive insightful knowledge from data and use them as guidance for production management. It is expected to provide solutions for enterprises to increase reconfigurability, accomplish more automated procedures, and obtain data-driven recommendations or effective decisions.
Abstract:
BACKGROUND/AIMS: The obesity epidemic has spread to young adults, and obesity is a significant risk factor for cardiovascular disease. The prominence and increasing functionality of mobile phones may provide an opportunity to deliver longitudinal and scalable weight management interventions in young adults. The aim of this article is to describe the design and development of the intervention tested in the Cell Phone Intervention for You study and to highlight the importance of the adaptive intervention design that made it possible. The Cell Phone Intervention for You study was a National Heart, Lung, and Blood Institute-sponsored, controlled, 24-month randomized clinical trial comparing two active interventions to a usual-care control group. Participants were 365 overweight or obese (body mass index ≥ 25 kg/m²) young adults. METHODS: Both active interventions were designed based on social cognitive theory and incorporated techniques for behavioral self-management and motivational enhancement. Initial intervention development occurred during a 1-year formative phase utilizing focus groups and iterative, participatory design. During the intervention testing, adaptive intervention design, where an intervention is updated or extended throughout a trial while assuring the delivery of exactly the same intervention to each cohort, was employed. The adaptive intervention design strategy distributed technical work and allowed introduction of novel components in phases intended to help promote and sustain participant engagement. Adaptive intervention design was made possible by exploiting the mobile phone's remote data capabilities so that adoption of particular application components could be continuously monitored and components subsequently added or updated remotely.
RESULTS: The cell phone intervention was delivered almost entirely via cell phone and was always-present, proactive, and interactive, providing passive and active reminders, frequent opportunities for knowledge dissemination, and multiple tools for self-tracking and receiving tailored feedback. The intervention changed over 2 years to promote and sustain engagement. The personal coaching intervention, alternatively, was primarily personal coaching with trained coaches based on a proven intervention, enhanced with a mobile application, but where all interactions with the technology were participant-initiated. CONCLUSION: The complexity and length of the technology-based randomized clinical trial created challenges in engagement and technology adaptation, which were generally discovered using novel remote monitoring technology and addressed using the adaptive intervention design. Investigators should plan to develop tools and procedures that explicitly support continuous remote monitoring of interventions to support adaptive intervention design in long-term, technology-based studies, as well as developing the interventions themselves.
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even after the huge increases in n seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
Abstract:
Graphene, first isolated in 2004 and the subject of the 2010 Nobel Prize in physics, has generated a tremendous amount of research interest in recent years due to its incredible mechanical and electrical properties. However, difficulties in large-scale production and low as-prepared surface area have hindered commercial applications. In this dissertation, a new material is described incorporating the superior electrical properties of graphene edge planes into the high surface area framework of carbon nanotube forests using a scalable and reproducible technology.
The objectives of this research were to investigate the growth parameters and mechanisms of a graphene-carbon nanotube hybrid nanomaterial termed “graphenated carbon nanotubes” (g-CNTs), examine the applicability of g-CNT materials for applications in electrochemical capacitors (supercapacitors) and cold-cathode field emission sources, and determine materials characteristics responsible for the superior performance of g-CNTs in these applications. The growth kinetics of multi-walled carbon nanotubes (MWNTs), grown by plasma-enhanced chemical vapor deposition (PECVD), was studied in order to understand the fundamental mechanisms governing the PECVD reaction process. Activation energies and diffusivities were determined for key reaction steps and a growth model was developed in response to these findings. Differences in the reaction kinetics between CNTs grown on single-crystal silicon and polysilicon were studied to aid in the incorporation of CNTs into microelectromechanical systems (MEMS) devices. To understand processing-property relationships for g-CNT materials, a Design of Experiments (DOE) analysis was performed for the purpose of determining the importance of various input parameters on the growth of g-CNTs, finding that varying temperature alone allows the resultant material to transition from CNTs to g-CNTs and finally carbon nanosheets (CNSs): vertically oriented sheets of few-layered graphene. In addition, a phenomenological model was developed for g-CNTs. By studying variations of graphene-CNT hybrid nanomaterials by Raman spectroscopy, a linear trend was discovered between their mean crystallite size and electrochemical capacitance. Finally, a new method for the calculation of nanomaterial surface area, more accurate than the standard BET technique, was created based on atomic layer deposition (ALD) of titanium oxide (TiO2).
Abstract:
Secure Access For Everyone (SAFE) is an integrated system for managing trust
using a logic-based declarative language. Logical trust systems authorize each
request by constructing a proof from a context, that is, a set of authenticated logic
statements representing credentials and policies issued by various principals
in a networked system. A key barrier to practical use of logical trust systems
is the problem of managing proof contexts: identifying, validating, and
assembling the credentials and policies that are relevant to each trust
decision.
SAFE addresses this challenge by (i) proposing a distributed authenticated data
repository for storing the credentials and policies; and (ii) introducing a
programmable credential discovery and assembly layer that generates the
appropriate tailored context for a given request. The authenticated data
repository is built upon a scalable key-value store with its contents named by
secure identifiers and certified by the issuing principal. The SAFE language
provides scripting primitives to generate and organize logic sets representing
credentials and policies, materialize the logic sets as certificates, and link
them to reflect delegation patterns in the application. The authorizer fetches
the logic sets on demand, then validates and caches them locally for further
use. Upon each request, the authorizer constructs the tailored proof context
and provides it to the SAFE inference engine for certified validation.
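The fetch, validate, cache, and assemble flow described above can be sketched in Python. This is a hedged toy, not SAFE's implementation or language. The `Store` and `Authorizer` classes, the use of a SHA-256 content digest as the secure identifier, and the statement strings are all hypothetical. Issuer signatures and the logic inference step are not modeled.

```python
import hashlib
import json


class Store:
    """In-memory stand-in for the key-value store (assumption)."""

    def __init__(self):
        self.data = {}

    def put(self, issuer, statements, links=()):
        record = {"issuer": issuer, "statements": list(statements),
                  "links": list(links)}
        body = json.dumps(record, sort_keys=True).encode()
        # Assumed naming scheme: the key is a digest of the stored bytes.
        key = hashlib.sha256(body).hexdigest()
        self.data[key] = body
        return key

    def get(self, key):
        return self.data[key]


class Authorizer:
    """Fetches linked logic sets on demand, validates and caches them."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.fetches = 0

    def fetch(self, key):
        if key not in self.cache:
            body = self.store.get(key)
            self.fetches += 1
            if hashlib.sha256(body).hexdigest() != key:
                raise ValueError("stored content does not match its identifier")
            self.cache[key] = json.loads(body)
        return self.cache[key]

    def context(self, root):
        """Collect statements from `root` and every set reachable via links."""
        seen, stack, statements = set(), [root], []
        while stack:
            key = stack.pop()
            if key in seen:
                continue
            seen.add(key)
            record = self.fetch(key)
            statements.extend(record["statements"])
            stack.extend(record["links"])
        return statements
```

A request would start from the requester's credential set and follow links back through delegations to the governing policy; the resulting statement list plays the role of the tailored proof context that is handed to the inference step.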
Delegation-driven credential linking with certified data distribution provides
flexible and dynamic policy control enabling security and trust infrastructure
to be agile, while addressing the perennial problems related to today's
certificate infrastructure: automated credential discovery, scalable
revocation, and issuing credentials without relying on centralized authority.
We envision SAFE as a new foundation for building secure network systems. We
used SAFE to build secure services based on case studies drawn from practice:
(i) a secure name service resolver similar to DNS that resolves a name across
multi-domain federated systems; (ii) a secure proxy shim to delegate access
control decisions in a key-value store; (iii) an authorization module for a
networked infrastructure-as-a-service system with a federated trust structure
(NSF GENI initiative); and (iv) a secure cooperative data analytics service
that adheres to individual secrecy constraints while disclosing the data. We
present empirical evaluation based on these case studies and demonstrate that
SAFE supports a wide range of applications with low overhead.
A New Method for Modeling Free Surface Flows and Fluid-structure Interaction with Ocean Applications
Abstract:
The computational modeling of ocean waves and ocean-faring devices poses numerous challenges. Among these are the need to stably and accurately represent both the fluid-fluid interface between water and air as well as the fluid-structure interfaces arising between solid devices and one or more fluids. As techniques are developed to stably and accurately balance the interactions between fluid and structural solvers at these boundaries, a similarly pressing challenge is the development of algorithms that are massively scalable and capable of performing large-scale three-dimensional simulations on reasonable time scales. This dissertation introduces two separate methods for approaching this problem, with the first focusing on the development of sophisticated fluid-fluid interface representations and the second focusing primarily on scalability and extensibility to higher-order methods.
We begin by introducing the narrow-band gradient-augmented level set method (GALSM) for incompressible multiphase Navier-Stokes flow. This is the first use of the high-order GALSM for a fluid flow application, and its reliability and accuracy in modeling ocean environments is tested extensively. The method demonstrates numerous advantages over the traditional level set method, among these a heightened conservation of fluid volume and the representation of subgrid structures.
Next, we present a finite-volume algorithm for solving the incompressible Euler equations in two and three dimensions in the presence of a flow-driven free surface and a dynamic rigid body. In this development, the chief concerns are efficiency, scalability, and extensibility (to higher-order and truly conservative methods). These priorities informed a number of important choices: The air phase is substituted by a pressure boundary condition in order to greatly reduce the size of the computational domain, a cut-cell finite-volume approach is chosen in order to minimize fluid volume loss and open the door to higher-order methods, and adaptive mesh refinement (AMR) is employed to focus computational effort and make large-scale 3D simulations possible. This algorithm is shown to produce robust and accurate results that are well-suited for the study of ocean waves and the development of wave energy conversion (WEC) devices.
Abstract:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
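The four steps of the message algorithm (select per subset, take the median inclusion, refit per subset, average) can be sketched in pure Python. This is an illustration under stated assumptions, not the author's implementation. Hard-thresholding a full least-squares fit stands in for the regularized or Bayesian selector, and `solve`, `ols`, `message_fit`, the threshold, and the synthetic data are all hypothetical.

```python
import random
from statistics import median


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ols(X, y, cols):
    """Least-squares coefficients of y on the chosen columns (no intercept)."""
    cols = list(cols)
    A = [[sum(row[i] * row[j] for row in X) for j in cols] for i in cols]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in cols]
    return solve(A, b)


def message_fit(X, y, n_subsets, threshold=0.5):
    p = len(X[0])
    # Horizontal (sample-space) partition into subsets.
    parts = [(X[k::n_subsets], y[k::n_subsets]) for k in range(n_subsets)]
    # Step 1: feature selection on each subset (stand-in selector).
    inclusion = [[1 if abs(c) > threshold else 0 for c in ols(Xs, ys, range(p))]
                 for Xs, ys in parts]
    # Step 2: median inclusion index across subsets (ties are excluded).
    selected = [j for j in range(p)
                if median(vec[j] for vec in inclusion) > 0.5]
    # Step 3: refit the selected features on each subset.
    fits = [ols(Xs, ys, selected) for Xs, ys in parts]
    # Step 4: average the per-subset estimates.
    coef = [0.0] * p
    for pos, j in enumerate(selected):
        coef[j] = sum(f[pos] for f in fits) / n_subsets
    return selected, coef


# Synthetic data (made up): only features 0 and 2 carry signal.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(300)]
y = [2 * row[0] - 3 * row[2] + 0.1 * rng.gauss(0, 1) for row in X]
selected, coef = message_fit(X, y, n_subsets=3)
```

Steps 1 and 3 run independently per subset, and only the inclusion vectors and the coefficient vectors cross subset boundaries, which is where the low communication cost comes from.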
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework, DEME (DECO-message), which leverages both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each of a size that can be stored and fitted on a single computer, so that the blocks can be processed in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.