855 results for Large-scale databases
Abstract:
We propose a randomized algorithm for large-scale SVM learning which solves the problem by iterating over random subsets of the data. Crucial to the scalability of the algorithm is the size of the chosen subsets. In the context of text classification we show that, by using ideas from random projections, a sample size of O(log n) can be used to obtain a solution which is close to the optimal with high probability. Experiments on synthetic and real-life datasets demonstrate that the algorithm scales up SVM learners without loss in accuracy.
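A minimal sketch of the subset-sampling idea, assuming scikit-learn's LinearSVC as the base learner; the constant in the O(log n) sample size and the keep-the-best-model loop are illustrative choices, not the authors' exact procedure:

```python
import numpy as np
from sklearn.svm import LinearSVC

def random_subset_svm(X, y, n_iters=20, rng=None):
    """Illustrative randomized SVM: repeatedly fit on small random
    subsets whose size grows only logarithmically with n."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    m = max(64, int(50 * np.log(n)))   # O(log n) sample size; constant is a guess
    best, best_acc = None, -np.inf
    for _ in range(n_iters):
        idx = rng.choice(n, size=min(m, n), replace=False)
        clf = LinearSVC().fit(X[idx], y[idx])
        acc = clf.score(X, y)          # validate on the full set (or a holdout)
        if acc > best_acc:
            best, best_acc = clf, acc
    return best
```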
Abstract:
In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general-purpose solvers. Another contribution is to pose the problem of focused crawling as a large-scale Ordinal Regression problem and solve it using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing focused crawling as an Ordinal Regression problem avoids the need for a negative class and a topic hierarchy, which are the main drawbacks of existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state of the art.
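The abstract identifies the formulation only as an SOCP with a single SOC constraint; the toy cvxpy program below illustrates that problem class with placeholder data, not the actual CB-OR formulation:

```python
import cvxpy as cp
import numpy as np

# Toy instance of the problem class: a linear objective with one
# second-order cone (SOC) constraint, ||A x + b||_2 <= c^T x + d.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
c, d = rng.standard_normal(3), 5.0
f = rng.standard_normal(3)

x = cp.Variable(3)
prob = cp.Problem(cp.Minimize(f @ x),
                  [cp.SOC(c @ x + d, A @ x + b)])
prob.solve()
print(x.value)
```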
Abstract:
Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, in which the number of processors on which an application executes can be changed during execution, can exploit their malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live migration and rescheduling, and runtime decisions for dynamically selecting fault tolerance actions at different points of application execution to maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.
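The following schematic suggests the kind of runtime decision such a framework makes; the cost fields and the expected-cost rule are invented placeholders, not AdFT's actual cost models:

```python
from dataclasses import dataclass

@dataclass
class State:
    ckpt_cost: float        # time to write a checkpoint (s)
    mig_cost: float         # time to live-migrate off a suspect node (s)
    resched_cost: float     # time to shrink/re-balance the malleable job (s)
    work_since_ckpt: float  # compute time at risk if a failure hits now (s)
    failure_risk: float     # probability a failure hits before the next decision

def choose_action(s: State) -> str:
    """Pick the fault tolerance action with the lowest expected cost:
    action overhead plus risk-weighted lost work (placeholder models)."""
    costs = {
        "none":       s.failure_risk * s.work_since_ckpt,  # risk losing unsaved work
        "checkpoint": s.ckpt_cost,                         # pay overhead, lost work ~ 0
        "migrate":    s.mig_cost,                          # move off the suspect node
        "reschedule": s.resched_cost,                      # shrink the malleable job
    }
    return min(costs, key=costs.get)
```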
Abstract:
We report the large-scale synthesis of millimetre-long buckled multiwalled carbon nanotubes by one-step pyrolysis. The current-carrying capability of a highly buckled region is shown to be higher than that of a less buckled region.
Abstract:
Critical applications like cyclone tracking and earthquake modeling require simultaneous high-performance simulations and online visualization for timely analysis. Faster simulations and simultaneous visualization enable scientists to provide real-time guidance to decision makers. In this work, we have developed an integrated user-driven and automated steering framework that simultaneously performs numerical simulations and efficient online remote visualization of critical weather applications in resource-constrained environments. It considers application dynamics, like the criticality of the application, and resource dynamics, like the storage space, network bandwidth and available number of processors, to adapt various application and resource parameters such as simulation resolution, simulation rate and the frequency of visualization. We formulate the problem of finding an optimal set of simulation parameters as a linear programming problem. This leads to a 30% higher simulation rate and 25-50% lower storage consumption than a naive greedy approach. The framework also gives the user control over application parameters like the region of interest and simulation resolution. We have also devised an adaptive algorithm to reduce the lag between the simulation and visualization times. Using experiments with different network bandwidths, we find that our adaptive algorithm is able to reduce lag as well as visualize the most representative frames.
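An illustrative linear program in the spirit of this formulation, with made-up coefficients: maximize the simulation rate subject to storage and bandwidth budgets, using scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Decision variables: x = [simulation_rate, visualization_frequency].
# Maximize the simulation rate (linprog minimizes, so negate it),
# subject to invented storage and network-bandwidth budgets.
c = np.array([-1.0, -0.1])          # reward simulation rate, mildly reward viz
A_ub = np.array([[2.0, 5.0],        # GB of storage consumed per unit of each
                 [1.0, 8.0]])       # Mbps of bandwidth consumed per unit
b_ub = np.array([100.0, 40.0])      # storage and bandwidth budgets
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 10), (0, 5)])
print(res.x)                        # chosen simulation rate and viz frequency
```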
Abstract:
Parabolized stability equation (PSE) models are being developed to predict the evolution of low-frequency, large-scale wavepacket structures and their radiated sound in high-speed turbulent round jets. Linear PSE wavepacket models were previously shown to be in reasonably good agreement with the amplitude envelope and phase measured using a microphone array placed just outside the jet shear layer [1,2]. Here we show that they are also in very good agreement with hot-wire measurements at the jet center line in the potential core, for a different set of experiments [3]. When used as a model source for an acoustic analogy, the predicted far-field noise radiation is in reasonably good agreement with microphone measurements for aft angles, where contributions from large-scale structures dominate the acoustic field. Nonlinear PSE is then employed to determine the relative importance of mode interactions on the wavepackets. A series of nonlinear computations with randomized initial conditions is used to obtain bounds for the evolution of the modes in the natural turbulent jet flow. Nonlinearity is found to have a very limited impact on the evolution of the wavepackets for St >= 0.3. Finally, the nonlinear mechanism for the generation of a low-frequency mode as the difference-frequency mode [4,5] of two forced frequencies is investigated in the scope of the high-Reynolds-number jets considered in this paper.
Abstract:
Chebyshev-inequality-based convex relaxations of Chance-Constrained Programs (CCPs) are shown to be useful for learning classifiers on massive datasets. In particular, an algorithm that integrates efficient clustering procedures and CCP approaches for computing classifiers on large datasets is proposed. The key idea is to identify high-density regions, or clusters, from the individual class conditional densities and then use a CCP formulation to learn a classifier on the clusters. The CCP formulation ensures that most of the data points in a cluster are correctly classified by employing a Chebyshev-inequality-based convex relaxation. This relaxation is heavily dependent on second-order statistics. However, this formulation, and in general such relaxations that depend on second-order moments, are susceptible to moment estimation errors. One contribution of the paper is to propose several formulations that are robust to such errors. In particular, a generic way of making such formulations robust to moment estimation errors is illustrated using two novel confidence sets. An important contribution is to show that when either of the confidence sets is employed, for the special case of a spherical normal distribution of clusters, the robust variant of the formulation can be posed as a second-order cone program. Empirical results show that the robust formulations achieve accuracies comparable to those obtained with the true moments, even when the moment estimates are erroneous. The results also illustrate the benefits of employing the proposed methodology for robust classification of large-scale datasets.
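A sketch of the Chebyshev-based relaxation: the multivariate Chebyshev bound turns the chance constraint P(y(w.x + b) >= 0) >= eta, for a cluster with mean mu and covariance Sigma, into the SOC constraint y(w.mu + b) >= kappa * ||Sigma^(1/2) w|| with kappa = sqrt(eta/(1-eta)). The cvxpy program below encodes this; the margin-style objective is a placeholder, not the paper's exact formulation:

```python
import cvxpy as cp
import numpy as np

def chebyshev_classifier(mus, sigmas, labels, eta=0.9):
    """Find (w, b) so that, by the multivariate Chebyshev bound, each
    cluster (mean mu, covariance Sigma, label y in {-1,+1}) is correctly
    classified with probability >= eta."""
    d = len(mus[0])
    kappa = np.sqrt(eta / (1.0 - eta))
    w, b = cp.Variable(d), cp.Variable()
    cons = []
    for mu, Sig, y in zip(mus, sigmas, labels):
        S = np.linalg.cholesky(Sig)          # Sigma = S S^T, so ||S^T w|| = sqrt(w^T Sigma w)
        cons.append(cp.SOC((y * (w @ mu + b)) / kappa, S.T @ w))
    prob = cp.Problem(cp.Minimize(cp.norm(w, 2)), cons)
    prob.solve()
    return w.value, b.value
```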
Abstract:
Daily rainfall datasets for 10 years (1998-2007) of the Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) version 6 and the India Meteorological Department (IMD) gridded rain gauge analysis have been compared over the Indian landmass, at both large and small spatial scales. On the larger spatial scale, the pattern correlation between the two datasets on daily scales during individual years of the study period ranges from 0.4 to 0.7. The correlation improves significantly (~0.9) when the study is confined to specific wet and dry spells, each of about 5-8 days. Wavelet analysis of intraseasonal oscillations (ISO) of the southwest monsoon rainfall shows the percentage contributions of the two major modes (30-50 days and 10-20 days) to range between ~30-40% and ~5-10%, respectively, across the years. Analysis of inter-annual variability shows that the satellite data underestimate seasonal rainfall by ~110 mm during the southwest monsoon and overestimate it by ~150 mm during the northeast monsoon season. At high spatio-temporal scales, viz., the 1° x 1° grid, TMPA data do not correspond to the ground truth. We propose a new analysis procedure to assess the minimum spatial scale at which the two datasets are compatible with each other. This is done by studying the contribution to the total seasonal rainfall from different rainfall-rate windows (at 1 mm intervals) on different spatial scales (at the daily time scale). The compatibility spatial scale is seen to be beyond a 5° x 5° average spatial scale over the Indian landmass. This will help decide the usability of TMPA products, if averaged at appropriate spatial scales, for specific process studies, e.g., at cloud, meso or synoptic scales.
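A numpy sketch of the proposed compatibility-scale procedure: bin daily rain rates into 1 mm windows, compute each window's contribution to the seasonal total, and repeat at coarser block-averaged grids; the `trmm` and `imd` arrays in the usage comment are placeholders for the two gridded datasets:

```python
import numpy as np

def rate_window_contributions(daily_rain, bin_edges):
    """Fraction of total seasonal rainfall contributed by each
    rain-rate window (e.g., 1 mm/day bins)."""
    r = daily_rain.ravel()
    contrib, _ = np.histogram(r, bins=bin_edges, weights=r)
    return contrib / r.sum()

def coarsen(field, k):
    """Average a (time, ny, nx) field onto k x k blocks (coarser grid)."""
    t, ny, nx = field.shape
    f = field[:, :ny // k * k, :nx // k * k]
    return f.reshape(t, ny // k, k, nx // k, k).mean(axis=(2, 4))

# Compare the two datasets' contribution curves at successively coarser
# scales; the compatibility scale is where the curves begin to agree.
bins = np.arange(0, 101, 1.0)                       # 1 mm/day windows
# c_trmm = rate_window_contributions(coarsen(trmm, k), bins)
# c_imd  = rate_window_contributions(coarsen(imd,  k), bins)
```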
Abstract:
Mycobacterium tuberculosis owes its high pathogenic potential to its ability to evade host immune responses and thrive inside the macrophage. The outcome of infection is largely determined by the cellular response, which comprises a multitude of molecular events. The complexity and inter-relatedness of these processes make it essential to adopt systems approaches to study them. In this work, we construct a comprehensive network of infection-related processes in a human macrophage comprising 1888 proteins and 14,016 interactions. We then compute response networks based on available gene expression profiles corresponding to states of health, disease and drug treatment. We use a novel formulation for mining response networks that identifies the highest-activity paths in the cell. These highest-activity paths provide mechanistic insights into pathogenesis and the response to treatment. The approach used here serves as a generic framework for mining dynamic changes in genome-scale protein interaction networks.
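One common way to mine such highest-activity paths (a sketch, not necessarily the authors' exact formulation) is to score edges by expression-derived node activities and take shortest paths under -log(activity) costs, e.g., with networkx:

```python
import math
import networkx as nx

def highest_activity_path(G, activity, source, target):
    """Score each edge by the activity of its endpoints and find the
    path maximizing the product of edge activities, via a shortest path
    on -log(activity) costs. `activity` maps protein -> expression-
    derived weight in (0, 1]."""
    H = nx.Graph()
    for u, v in G.edges():
        a = math.sqrt(activity[u] * activity[v])   # geometric-mean edge activity
        H.add_edge(u, v, cost=-math.log(max(a, 1e-12)))
    return nx.shortest_path(H, source, target, weight="cost")
```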
Abstract:
In this paper, we propose low-complexity algorithms based on Monte Carlo sampling for signal detection and channel estimation on the uplink in large-scale multiuser multiple-input-multiple-output (MIMO) systems with tens to hundreds of antennas at the base station (BS) and a similar number of uplink users. A BS receiver that employs a novel mixed sampling technique (which makes a probabilistic choice between Gibbs sampling and random uniform sampling in each coordinate update) for detection and a Gibbs-sampling-based method for channel estimation is proposed. The algorithm proposed for detection alleviates the stalling problem encountered at high signal-to-noise ratios (SNRs) in conventional Gibbs-sampling-based detection and achieves near-optimal performance in large systems with M-ary quadrature amplitude modulation (M-QAM). A novel ingredient in the detection algorithm that is responsible for achieving near-optimal performance at low complexity is the joint use of a mixed Gibbs sampling (MGS) strategy coupled with a multiple restart (MR) strategy with an efficient restart criterion. Near-optimal detection performance is demonstrated for a large number of BS antennas and users (e.g., 64 and 128 BS antennas and users). The proposed Gibbs-sampling-based channel estimation algorithm refines an initial estimate of the channel obtained during the pilot phase through iterations with the proposed MGS-based detection during the data phase. In time-division duplex systems where channel reciprocity holds, these channel estimates can be used for multiuser MIMO precoding on the downlink. The proposed receiver is shown to achieve good performance and scale well for large dimensions.
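A numpy sketch of the mixed Gibbs sampling (MGS) coordinate update, written for BPSK for brevity (the paper treats M-QAM); the mixing probability q = 1/K and the iteration budget are illustrative choices:

```python
import numpy as np

def mixed_gibbs_detect(y, H, sigma2, n_iter=200, q=None, rng=None):
    """Mixed Gibbs sampling detection for BPSK: each coordinate update
    is a Gibbs draw with probability 1-q and a uniform random draw with
    probability q, which breaks the high-SNR stalling of plain Gibbs."""
    rng = np.random.default_rng(rng)
    K = H.shape[1]
    q = 1.0 / K if q is None else q
    x = rng.choice([-1.0, 1.0], size=K)
    best, best_cost = x.copy(), np.sum((y - H @ x) ** 2)
    for _ in range(n_iter):
        for i in range(K):
            if rng.random() < q:                       # random (uniform) move
                x[i] = rng.choice([-1.0, 1.0])
            else:                                      # Gibbs move
                xp, xm = x.copy(), x.copy()
                xp[i], xm[i] = 1.0, -1.0
                dp = np.sum((y - H @ xp) ** 2)
                dm = np.sum((y - H @ xm) ** 2)
                p_plus = 1.0 / (1.0 + np.exp((dp - dm) / (2 * sigma2)))
                x[i] = 1.0 if rng.random() < p_plus else -1.0
            cost = np.sum((y - H @ x) ** 2)
            if cost < best_cost:                       # track the best visit
                best, best_cost = x.copy(), cost
    return best
```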
Abstract:
In this paper, we propose a low-complexity algorithm based on the Markov chain Monte Carlo (MCMC) technique for signal detection on the uplink in large-scale multiuser multiple-input multiple-output (MIMO) systems with tens to hundreds of antennas at the base station (BS) and a similar number of uplink users. The algorithm employs a randomized sampling method (which makes a probabilistic choice between Gibbs sampling and random sampling in each iteration) for detection. The proposed algorithm alleviates the stalling problem encountered at high SNRs in the conventional MCMC algorithm and achieves near-optimal performance in large systems with M-QAM. A novel ingredient in the algorithm that is responsible for achieving near-optimal performance at low complexity is the joint use of a randomized MCMC (R-MCMC) strategy coupled with a multiple restart strategy with an efficient restart criterion. Near-optimal detection performance is demonstrated for large numbers of BS antennas and users (e.g., 64, 128, 256 BS antennas/users).
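The multiple-restart layer can be sketched around the coordinate sampler shown after the previous abstract; the stopping rule below (accept once the residual is near the noise floor) is an illustrative stand-in for the paper's restart criterion:

```python
import numpy as np

def detect_with_restarts(y, H, sigma2, max_restarts=10, rng=None):
    """Multiple-restart (MR) layer: rerun the randomized sampler from
    fresh random initializations and keep the best solution found.
    The noise-floor stopping rule is a plausible placeholder, not the
    paper's exact restart criterion."""
    rng = np.random.default_rng(rng)
    n = len(y)
    best, best_cost = None, np.inf
    for _ in range(max_restarts):
        x = mixed_gibbs_detect(y, H, sigma2, rng=rng)  # sampler sketched above
        cost = np.sum((y - H @ x) ** 2)
        if cost < best_cost:
            best, best_cost = x, cost
        if best_cost < n * sigma2:                     # residual near noise floor
            break
    return best
```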
Abstract:
Elastic net regularizers have shown much promise in designing sparse classifiers for linear classification. In this work, we propose an alternating optimization approach to solve the dual problems of elastic-net-regularized linear Support Vector Machines (SVMs) and logistic regression (LR). One of the sub-problems turns out to be a simple projection. The other sub-problem can be solved using dual coordinate descent methods developed for non-sparse L2-regularized linear SVMs and LR, without altering their iteration complexity and convergence properties. Experiments on very large datasets indicate that the proposed dual coordinate descent-projection (DCD-P) methods are fast and achieve comparable generalization performance after the first pass through the data, with extremely sparse models.
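A sketch of the dual coordinate descent building block (the standard update for the L1-loss, L2-regularized SVM dual) that such methods reuse; the alternating projection step that enforces elastic-net sparsity is deliberately not shown:

```python
import numpy as np

def dcd_svm(X, y, C=1.0, epochs=10, rng=None):
    """Dual coordinate descent for the L2-regularized hinge-loss SVM dual:
    min_a (1/2)||sum_i a_i y_i x_i||^2 - sum_i a_i,  s.t. 0 <= a_i <= C.
    This is the non-sparse subproblem solver the DCD-P approach builds on."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    a, w = np.zeros(n), np.zeros(d)
    Qii = np.einsum("ij,ij->i", X, X)      # diagonal of the Gram matrix
    for _ in range(epochs):
        for i in rng.permutation(n):
            if Qii[i] == 0:
                continue
            G = y[i] * (w @ X[i]) - 1.0    # gradient of the dual at coordinate i
            a_new = min(max(a[i] - G / Qii[i], 0.0), C)
            w += (a_new - a[i]) * y[i] * X[i]
            a[i] = a_new
    return w
```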
Abstract:
In this paper, we propose a multiple-input multiple-output (MIMO) receiver algorithm that exploits the channel hardening that occurs in large MIMO channels. Channel hardening refers to the phenomenon where the off-diagonal terms of the H^T H matrix become increasingly weaker compared to the diagonal terms as the size of the channel gain matrix H increases. Specifically, we propose a message passing detection (MPD) algorithm which works with the real-valued matched filtered received vector (whose signal term becomes H^T H x, where x is the transmitted vector), and uses a Gaussian approximation on the off-diagonal terms of the H^T H matrix. We also propose a simple estimation scheme which directly obtains an estimate of H^T H (instead of an estimate of H), which is used as an effective channel estimate in the MPD algorithm. We refer to this receiver as the channel hardening-exploiting message passing (CHEMP) receiver. The proposed CHEMP receiver achieves very good performance in large-scale MIMO systems (e.g., in systems with 16 to 128 uplink users and 128 base station antennas). For the considered large MIMO settings, the complexity of the proposed MPD algorithm is almost the same as or less than that of minimum mean square error (MMSE) detection, because the MPD algorithm does not require a matrix inversion. It also achieves significantly better performance than MMSE and other message passing detection algorithms that use an MMSE estimate of H. Further, we design optimized irregular low density parity check (LDPC) codes specific to the considered large MIMO channel and the CHEMP receiver through EXIT chart matching. The LDPC codes thus obtained achieve improved coded bit error rate performance compared to off-the-shelf irregular LDPC codes.
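A compact numpy sketch of the matched-filter message passing idea for BPSK: form z = H^T y and J = H^T H, treat the interference from the weak off-diagonal entries of J as Gaussian, and iterate; the damping value and scheduling details of the actual CHEMP receiver are omitted:

```python
import numpy as np

def chemp_mpd_bpsk(y, H, sigma2, n_iter=20, damp=0.3):
    """Channel-hardening-exploiting message passing, BPSK sketch.
    Model: z_i = J_ii x_i + sum_{j != i} J_ij x_j + noise, with the
    interference term approximated as Gaussian using the current
    posterior means and variances of the other symbols."""
    J = H.T @ H
    z = H.T @ y
    dJ = np.diag(J)
    off = J - np.diag(dJ)                 # weak off-diagonal interference terms
    mean = np.zeros(J.shape[0])           # posterior means E[x_j], x_j in {-1,+1}
    var = np.ones(J.shape[0])             # posterior variances Var[x_j]
    for _ in range(n_iter):
        mu = off @ mean                               # interference mean
        s2 = (off ** 2) @ var + sigma2 * dJ           # interference + noise variance
        llr = 2.0 * dJ * (z - mu) / np.maximum(s2, 1e-12)
        mean = (1 - damp) * np.tanh(llr / 2.0) + damp * mean
        var = 1.0 - mean ** 2
    return np.sign(mean)                  # hard BPSK decisions
```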