851 results for Large scale graph processing
Abstract:
In many real-world prediction problems the output is a structured object such as a sequence, a tree, or a graph. Such problems range from natural language processing to computational biology and computer vision, and have been tackled using algorithms referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large-margin classifiers such as support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. The use of additional heuristics makes the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state-of-the-art methods such as the cutting-plane method and the stochastic gradient descent (SGD) method. Further, SDM reaches steady-state generalization performance faster than the SGD method. The proposed SDM is thus a useful alternative for large-scale structured output learning.
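The idea of sweeping over the training set and updating the dual variables of one example at a time can be illustrated on a much simpler problem. The sketch below is not the SDM for structural SVMs described in the abstract; it swaps in plain dual coordinate descent for a binary linear SVM without a bias term, where each example owns a single dual variable. The function name `dual_coordinate_descent`, the penalty `C`, and the number of passes are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_coordinate_descent(X, y, C=1.0, passes=20):
    """Binary linear SVM (hinge loss, no bias term) trained in the dual.

    Each example i has one dual variable alpha[i] constrained to [0, C].
    The primal weights are kept in sync as w = sum_i alpha[i] * y[i] * X[i].
    """
    alpha = [0.0] * len(X)
    w = [0.0] * len(X[0])
    for _ in range(passes):                      # repeated passes over the data
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            q_ii = dot(x_i, x_i)
            if q_ii == 0.0:
                continue                         # a zero vector cannot change w
            gradient = y_i * dot(w, x_i) - 1.0   # dual gradient for example i
            new_alpha = min(max(alpha[i] - gradient / q_ii, 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w = [w_j + delta * y_i * x_j for w_j, x_j in zip(w, x_i)]
                alpha[i] = new_alpha
    return w, alpha
```

In the structured setting each example carries many dual variables (one per candidate output), so the per-example step becomes a small optimization problem of its own rather than the closed-form update used here.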
Abstract:
Large-eddy simulation (LES) has emerged as a promising tool for simulating turbulent flows in general and, in recent years, has also been applied to particle-laden turbulence with some success (Kassinos et al., 2007). The motion of inertial particles is much more complicated than that of fluid elements, and therefore LES of turbulent flow laden with inertial particles encounters new challenges. In conventional LES, only large-scale eddies are explicitly resolved, and the effects of unresolved, small or subgrid-scale (SGS) eddies on the large-scale eddies are modeled. The SGS turbulent flow field is not available. The effects of the SGS turbulent velocity field on particle motion have been studied by Wang and Squires (1996), Armenio et al. (1999), Yamamoto et al. (2001), Shotorban and Mashayek (2006a,b), Fede and Simonin (2006), Berrouk et al. (2007), Bini and Jones (2008), and Pozorski and Apte (2009), amongst others. One contemporary method to include the effects of SGS eddies on inertial particle motion is to introduce a stochastic differential equation (SDE), that is, a Langevin stochastic equation, to model the SGS fluid velocity seen by inertial particles (Fede et al., 2006; Shotorban and Mashayek, 2006a; Shotorban and Mashayek, 2006b; Berrouk et al., 2007; Bini and Jones, 2008; Pozorski and Apte, 2009). However, the accuracy of such a Langevin equation model depends primarily on the prescription of the SGS fluid velocity autocorrelation time seen by an inertial particle, or the inertial particle–SGS eddy interaction timescale (denoted by $\delta T_{Lp}$), and on a second model constant in the diffusion term which controls the intensity of the random force received by an inertial particle (denoted by $C_0$, see Eq. (7)). From the theoretical point of view, $\delta T_{Lp}$ differs significantly from the Lagrangian fluid velocity correlation time (Reeks, 1977; Wang and Stock, 1993), and this carries the essential nonlinearity in the statistical modeling of particle motion.
$\delta T_{Lp}$ and $C_0$ may depend on the filter width and particle Stokes number even for a given turbulent flow. In previous studies, $\delta T_{Lp}$ is modeled either by the fluid SGS Lagrangian timescale (Fede et al., 2006; Shotorban and Mashayek, 2006b; Pozorski and Apte, 2009; Bini and Jones, 2008) or by a simple extension of the timescale obtained from the full flow field (Berrouk et al., 2007). In this work, we study the subtle and non-monotonic dependence of $\delta T_{Lp}$ on the filter width and particle Stokes number using a flow field obtained from Direct Numerical Simulation (DNS). We then propose an empirical closure model for $\delta T_{Lp}$. Finally, the model is validated against LES of particle-laden turbulence in predicting single-particle statistics such as particle kinetic energy. As a first step, we consider the particle motion under the one-way coupling assumption in isotropic turbulent flow and neglect the gravitational settling effect. The one-way coupling assumption is only valid for low particle mass loading.
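A Langevin model of this kind can be sketched in its simplest scalar form. The abstract refers to Eq. (7) of the paper, which is not reproduced in this listing, so the block below assumes a generic Ornstein-Uhlenbeck equation, du = -(u / t_corr) dt + sqrt(diffusion) dW, integrated with the Euler-Maruyama scheme. The names `t_corr` (standing in for $\delta T_{Lp}$) and `diffusion` (a placeholder for the diffusion-term coefficient that $C_0$ scales) are hypothetical, as are any numerical values passed in.

```python
import math
import random


def langevin_step(u, dt, t_corr, diffusion, rng):
    """One Euler-Maruyama step of du = -(u / t_corr) dt + sqrt(diffusion) dW."""
    drift = -u / t_corr * dt
    noise = math.sqrt(diffusion * dt) * rng.gauss(0.0, 1.0)
    return u + drift + noise


def simulate(u0, dt, n_steps, t_corr, diffusion, seed=0):
    """Return the list of velocities u_0, u_1, ..., u_n along one sample path."""
    rng = random.Random(seed)   # seeded so that a run is reproducible
    u = u0
    path = [u]
    for _ in range(n_steps):
        u = langevin_step(u, dt, t_corr, diffusion, rng)
        path.append(u)
    return path
```

With `diffusion` set to zero the velocity simply relaxes towards zero over the timescale `t_corr`; a non-zero value adds the random forcing, which is why the choice of both quantities matters for the modeled particle statistics.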
Abstract:
Genome-wide association studies (GWAS) have identified several low-penetrance susceptibility alleles in chronic lymphocytic leukemia (CLL). Nevertheless, these studies rarely examine regions implicated in non-coding molecules such as microRNAs (miRNAs). Abnormalities in miRNAs, such as altered expression patterns and mutations, have been described in CLL, suggesting their involvement in the development of the disease. Genetic variations in miRNAs can affect levels of miRNA expression if present in pre-miRNAs and in miRNA biogenesis genes, or alter miRNA function if present in both target mRNA and miRNA sequences. Therefore, the present study aimed to evaluate whether polymorphisms in pre-miRNAs and/or miRNA processing genes contribute to predisposition to CLL. A total of 91 SNPs in 107 CLL patients and 350 cancer-free controls were successfully analyzed using TaqMan Open Array technology. We found nine statistically significant associations with CLL risk after FDR correction, seven in miRNA processing genes (rs3805500 and rs6877842 in DROSHA, rs1057035 in DICER1, rs17676986 in SND1, rs9611280 in TNRC6B, rs784567 in TRBP and rs11866002 in CNOT1) and two in pre-miRNAs (rs11614913 in miR196a2 and rs2114358 in miR1206). These findings suggest that polymorphisms in genes involved in the miRNA biogenesis pathway, as well as in pre-miRNAs, contribute to the risk of CLL. Large-scale studies are needed to validate the current findings.
Abstract:
The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.
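The combination of structural propagation and property similarity in a greedy search can be sketched as follows. This is not the SiGMa algorithm itself, whose scoring and candidate generation are not given in the abstract; it is a minimal illustration that assumes a set of seed matches, a hypothetical `similarity` function on entity pairs, and a score equal to the number of already-matched neighbour pairs plus that similarity.

```python
import heapq


def greedy_align(graph_a, graph_b, seeds, similarity, threshold=0.5):
    """Greedily extend seed matches between two graphs.

    graph_a, graph_b: dict mapping each entity to a set of neighbours.
    seeds: dict of trusted initial matches (entity in A -> entity in B).
    similarity: function (a, b) -> float comparing entity properties.
    """
    matched_a = dict(seeds)                       # entity in A -> entity in B
    matched_b = {b: a for a, b in matched_a.items()}
    heap = []                                     # max-heap via negated scores

    def score(a, b):
        # Structural part: neighbours of a whose match is a neighbour of b.
        shared = sum(1 for n in graph_a[a] if matched_a.get(n) in graph_b[b])
        return shared + similarity(a, b)

    def push_neighbours(a, b):
        # Only neighbours of an accepted match become new candidates.
        for na in graph_a[a]:
            for nb in graph_b[b]:
                if na not in matched_a and nb not in matched_b:
                    heapq.heappush(heap, (-score(na, nb), na, nb))

    for a, b in seeds.items():
        push_neighbours(a, b)

    while heap:
        neg_score, a, b = heapq.heappop(heap)
        if a in matched_a or b in matched_b:
            continue                              # one side was matched meanwhile
        if -neg_score < threshold:
            break                                 # remaining candidates score lower
        matched_a[a] = b
        matched_b[b] = a
        push_neighbours(a, b)
    return matched_a
```

Scores are computed when a candidate is pushed and are not refreshed afterwards, which keeps the sketch short; a fuller implementation would re-score candidates as new matches arrive. Restricting candidates to neighbours of accepted matches is what keeps the search local.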
Abstract:
This thesis elaborates on the problem of preprocessing a large graph so that single-pair shortest-path queries can be answered quickly at runtime. Computing shortest paths is a well-studied problem, but exact algorithms do not scale well to huge real-world graphs in applications that require very short response times. The focus is on approximate methods for distance estimation, in particular landmark-based distance indexing. This approach involves choosing some nodes as landmarks and computing (offline), for each node in the graph, its embedding, i.e., the vector of its distances from all the landmarks. At runtime, when the distance between a pair of nodes is queried, it can be quickly estimated by combining the embeddings of the two nodes. Choosing optimal landmarks is shown to be hard, and thus heuristic solutions are employed. Given a budget of memory for the index, which translates directly into a budget of landmarks, different landmark selection strategies can yield dramatically different results in terms of accuracy. A number of simple methods that scale well to large graphs are therefore developed and experimentally compared. The simplest methods choose central nodes of the graph, while the more elaborate ones select central nodes that are also far away from one another. The efficiency of the techniques presented in this thesis is tested experimentally using five different real-world graphs with millions of edges; for a given accuracy, they require as much as 250 times less space than the current approach, which selects landmarks at random. Finally, they are applied to two important problems arising naturally in large-scale graphs, namely social search and community detection.
Abstract:
We study the problem of preprocessing a large graph so that point-to-point shortest-path queries can be answered very fast. Computing shortest paths is a well studied problem, but exact algorithms do not scale to huge graphs encountered on the web, social networks, and other applications. In this paper we focus on approximate methods for distance estimation, in particular using landmark-based distance indexing. This approach involves selecting a subset of nodes as landmarks and computing (offline) the distances from each node in the graph to those landmarks. At runtime, when the distance between a pair of nodes is needed, we can estimate it quickly by combining the precomputed distances of the two nodes to the landmarks. We prove that selecting the optimal set of landmarks is an NP-hard problem, and thus heuristic solutions need to be employed. Given a budget of memory for the index, which translates directly into a budget of landmarks, different landmark selection strategies can yield dramatically different results in terms of accuracy. A number of simple methods that scale well to large graphs are therefore developed and experimentally compared. The simplest methods choose central nodes of the graph, while the more elaborate ones select central nodes that are also far away from one another. The efficiency of the suggested techniques is tested experimentally using five different real world graphs with millions of edges; for a given accuracy, they require as much as 250 times less space than the current approach in the literature which considers selecting landmarks at random. Finally, we study applications of our method in two problems arising naturally in large-scale networks, namely, social search and community detection.
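The offline and runtime phases described above can be sketched for an unweighted, undirected, connected graph. The abstract does not say how the precomputed distances are combined, so the block assumes the usual triangle-inequality bounds: the true distance d(s, t) lies between max |d(s, l) - d(l, t)| and min (d(s, l) + d(l, t)) over the landmarks l. Picking the highest-degree nodes is used here as one simple notion of "central"; the function names are illustrative.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def top_degree_landmarks(graph, k):
    """Select the k nodes with the most neighbours as landmarks."""
    return sorted(graph, key=lambda n: len(graph[n]), reverse=True)[:k]


def build_index(graph, landmarks):
    """Offline phase: one BFS per landmark, stored as landmark -> distances."""
    return {l: bfs_distances(graph, l) for l in landmarks}


def estimate_bounds(index, s, t):
    """Runtime phase: (lower, upper) bounds on d(s, t) from the index alone.

    Assumes every landmark can reach both s and t (connected graph).
    """
    lower = max(abs(d[s] - d[t]) for d in index.values())
    upper = min(d[s] + d[t] for d in index.values())
    return lower, upper
```

The index costs one distance per node per landmark, which is why the memory budget translates directly into a budget of landmarks. The upper bound is exact whenever some landmark lies on a shortest path between the two query nodes, which is one reason central landmarks tend to help.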
Abstract:
Computer egress simulation has the potential to be used in large-scale incidents to provide live advice to incident commanders. While there are many considerations which must be taken into account when applying such models to live incidents, one of the first concerns the computational speed of simulations. No matter how important the insight provided by the simulation, numerical hindsight will not prove useful to an incident commander. Thus, for this type of application to be useful, it is essential that the simulation can be run many times faster than real time. Parallel processing is a method of reducing run times for very large computational simulations by distributing the workload amongst a number of CPUs. In this paper we examine the development of a parallel version of the buildingEXODUS software. The parallel strategy implemented is based on a systematic partitioning of the problem domain into an arbitrary number of sub-domains. Each sub-domain is computed on a separate processor and runs its own copy of the EXODUS code. The software has been designed to work on typical office-based networked PCs but will also function on a Windows-based cluster. Two evaluation scenarios using the parallel implementation of EXODUS are described: a large open area and a 50-story high-rise building scenario. Speed-ups of up to 3.7 are achieved using up to six computers, with the high-rise building evacuation simulation achieving run times 6.4 times faster than real time.
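The reported figures can be put in context with the standard definitions of speed-up (serial run time divided by parallel run time) and parallel efficiency (speed-up divided by the number of processors). The helper names below are illustrative, and pairing the 3.7 speed-up with six machines is an assumption, since the abstract only says "up to" for both numbers.

```python
def speedup(serial_time, parallel_time):
    """How many times faster the parallel run is than the serial run."""
    return serial_time / parallel_time


def parallel_efficiency(speedup_value, n_processors):
    """Fraction of ideal linear scaling actually achieved (1.0 is ideal)."""
    return speedup_value / n_processors


# Assumed pairing of the abstract's figures: speed-up 3.7 on six computers.
efficiency = parallel_efficiency(3.7, 6)
print(f"efficiency = {efficiency:.2f}")
```

Under that assumption the efficiency is roughly 0.62, meaning that a bit under two fifths of the ideal six-fold gain is lost, for example to communication between sub-domains or to load imbalance.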
Abstract:
Modeling dynamical systems represents an important application class covering a wide range of disciplines including but not limited to biology, chemistry, finance, national security, and health care. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. EpiSimdemics is such an application simulating epidemic diffusion in extremely large and realistic social contact networks. It implements a graph-based system that captures dynamics among co-evolving entities. This paper presents an implementation of EpiSimdemics in Charm++ that enables future research by social, biological and computational scientists at unprecedented data and system scales. We present new methods for application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6, specifically NCSA's Blue Waters system.
Abstract:
Field-programmable gate array (FPGA) devices boast abundant resources with which custom accelerator components for signal, image and data processing may be realised; however, realising high-performance, low-cost accelerators currently demands manual register transfer level design. Software-programmable 'soft' processors have been proposed as a way to reduce this design burden, but they are unable to support performance and cost comparable to custom circuits. This paper proposes a new soft processing approach for FPGA which promises to overcome this barrier. A high-performance, fine-grained streaming processor, known as a Streaming Accelerator Element, is proposed which realises accelerators as large-scale custom multicore networks. By adopting a streaming execution approach with advanced program control and memory addressing capabilities, typical program inefficiencies can be almost completely eliminated to enable performance and cost which are unprecedented amongst software-programmable solutions. When used to realise accelerators for fast Fourier transform, motion estimation, matrix multiplication and Sobel edge detection, the proposed architecture is shown to enable real-time performance, with performance and cost comparable to hand-crafted custom circuit accelerators and up to two orders of magnitude beyond existing soft processors.
Abstract:
Complex networks have recently attracted a significant amount of research attention due to their ability to model real-world phenomena. One important problem often encountered is limiting diffusive processes that spread over the network, for example mitigating the spread of a pandemic disease or a computer virus. A number of problem formulations have been proposed that aim to solve such problems based on desired network characteristics, such as maintaining the largest network component after node removal. The recently formulated critical node detection problem aims to remove a small subset of vertices from the network such that the residual network has minimum pairwise connectivity. Unfortunately, the problem is NP-hard and the number of constraints is cubic in the number of vertices, making very large-scale problems impossible to solve with traditional mathematical programming techniques. Even many approximation strategies, such as dynamic programming and evolutionary algorithms, are unusable for networks that contain thousands to millions of vertices. A computationally efficient and simple approach is required in such circumstances, but none currently exists. In this thesis, such an algorithm is proposed. The methodology is based on a depth-first search traversal of the network and a specially designed ranking function that considers information local to each vertex. Due to the variety of network structures, a number of characteristics must be taken into consideration and combined into a single rank that measures the utility of removing each vertex. Since removing vertices sequentially changes the network structure, an efficient post-processing algorithm is also proposed to quickly re-rank vertices. Experiments on a range of common complex network models with varying numbers of vertices are considered, in addition to real-world networks.
The proposed algorithm, DFSH, is shown to be highly competitive and often outperforms existing strategies such as Google PageRank for minimizing pairwise connectivity.
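The objective being minimized can be made concrete. Pairwise connectivity counts the vertex pairs that are still joined by a path, so a residual network whose components have sizes s_1, s_2, ... scores the sum of s_i (s_i - 1) / 2. The sketch below computes this with an iterative depth-first search and tries every single-vertex removal by brute force; it illustrates the objective only and is not the DFSH ranking function, whose details the abstract does not give. All function names are illustrative.

```python
def component_sizes(graph, removed=frozenset()):
    """Sizes of the connected components left after deleting `removed`."""
    seen = set(removed)
    sizes = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:                      # iterative depth-first search
            u = stack.pop()
            size += 1
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sizes


def pairwise_connectivity(graph, removed=frozenset()):
    """Number of vertex pairs still connected by a path."""
    return sum(s * (s - 1) // 2 for s in component_sizes(graph, removed))


def best_single_removal(graph):
    """Vertex whose removal alone minimizes pairwise connectivity."""
    return min(graph, key=lambda v: pairwise_connectivity(graph, {v}))
```

Evaluating every candidate this way costs one full traversal per vertex, which is exactly the kind of expense that motivates a cheap, locally computed rank on large networks.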
Abstract:
Biological systems exhibit rich and complex behavior through the orchestrated interplay of a large array of components. It is hypothesized that separable subsystems with some degree of functional autonomy exist; deciphering their independent behavior and functionality would greatly facilitate understanding the system as a whole. Discovering and analyzing such subsystems are hence pivotal problems in the quest to gain a quantitative understanding of complex biological systems. In this work, using approaches from machine learning, physics and graph theory, methods for the identification and analysis of such subsystems were developed. A novel methodology, based on a recent machine learning algorithm known as non-negative matrix factorization (NMF), was developed to discover such subsystems in a set of large-scale gene expression data. This set of subsystems was then used to predict functional relationships between genes, and this approach was shown to score significantly higher than conventional methods when benchmarked against existing databases. Moreover, a mathematical treatment was developed to analyze simple network subsystems based only on their topology (independent of particular parameter values). Application to a problem of experimental interest demonstrated the need for extensions to the conventional model to fully explain the experimental data. Finally, the notion of a subsystem was evaluated from a topological perspective. A number of different protein networks were examined to analyze their topological properties with respect to separability, seeking to find separable subsystems. These networks were shown to exhibit separability in a nonintuitive fashion, while the separable subsystems were of strong biological significance. It was demonstrated that the separability property found was not due to incomplete or biased data, but is likely to reflect biological structure.
Abstract:
Constructing biodiversity richness maps from Environmental Niche Models (ENMs) of thousands of species is time-consuming. A separate species occurrence data pre-processing phase enables the experimenter to control test AUC score variance due to species dataset size. Besides removing duplicate occurrences and points with missing environmental data, we discuss the need for coordinate precision, wide dispersion, temporal and synonymity filters. After species data filtering, the final task of a pre-processing phase should be the automatic generation of species occurrence datasets which can then be directly 'plugged in' to the ENM. A software application capable of carrying out all these tasks would be a valuable time-saver, particularly for large-scale biodiversity studies.
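The two basic cleaning steps named above, dropping duplicate occurrences and dropping points with missing environmental data, can be sketched as a single pass over the records. The record layout (`species`, `lat`, `lon`, `env`) is hypothetical, and treating two points as duplicates when species and rounded coordinates coincide is an assumption; the abstract does not define how its coordinate precision filter works.

```python
def filter_occurrences(records, precision=2):
    """Keep the first occurrence per (species, rounded lat, rounded lon).

    records: list of dicts with keys "species", "lat", "lon" and "env",
    where "env" maps environmental variable names to values.
    Records with missing coordinates or any missing environmental value
    are dropped.
    """
    seen = set()
    kept = []
    for rec in records:
        if rec.get("lat") is None or rec.get("lon") is None:
            continue                                  # unusable location
        env = rec.get("env")
        if not env or any(value is None for value in env.values()):
            continue                                  # missing environmental data
        key = (rec["species"],
               round(rec["lat"], precision),
               round(rec["lon"], precision))
        if key in seen:
            continue                                  # duplicate occurrence
        seen.add(key)
        kept.append(rec)
    return kept
```

The output of such a step could then be grouped by species and written out as one occurrence dataset per species, which is the hand-off to the modeling stage that the abstract describes.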
Abstract:
Background: Expression microarrays are increasingly used to obtain large-scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and to analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrates that intra-array variability is small (only around 2% of the mean log signal), while inter-array variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units (6% of mean). The common practice of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflects the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identify an underlying structure which reflects some of the key biological variables which define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years and by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming, normalising and visualising transcriptomic array data from the Agilent expression platform.
The analysis is used to obtain quantitative estimates of the SD arising from experimental (non biological) intra- and interarray variability, and for a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further more extensive studies of the systems biology of eukaryotic cells.
Abstract:
Sensible heat fluxes (QH) are determined using scintillometry and eddy covariance over a suburban area. Two large aperture scintillometers provide spatially integrated fluxes across path lengths of 2.8 km and 5.5 km over Swindon, UK. The shorter scintillometer path spans newly built residential areas and has an approximate source area of 2-4 km², whilst the long path extends from the rural outskirts to the town centre and has a source area of around 5-10 km². These large-scale heat fluxes are compared with local-scale eddy covariance measurements. Clear seasonal trends are revealed by the long duration of this dataset, and variability in monthly QH is related to the meteorological conditions. At shorter time scales the response of QH to solar radiation often gives rise to close agreement between the measurements, but during times of rapidly changing cloud cover spatial differences in the net radiation (Q*) coincide with greater differences between heat fluxes. For clear days QH lags Q*, thus the ratio of QH to Q* increases throughout the day. In summer the observed energy partitioning is related to the vegetation fraction through use of a footprint model. The results demonstrate the value of scintillometry for integrating surface heterogeneity and offer improved understanding of the influence of anthropogenic materials on surface-atmosphere interactions.
Abstract:
The discovery that astrocytes participate as active elements in glutamatergic tripartite synapses (functional units composed of two neurons and one astrocyte) has led to the construction of models of cognitive functioning in the human brain, focusing on associative learning, sensory integration, conscious processing and memory formation/retrieval. We have modelled human cognitive functions by means of an ensemble of functional units (tripartite synapses) connected by gap junctions that link distributed astrocytes, allowing the formation of intra- and intercellular calcium waves that putatively mediate large-scale cognitive information processing. The model contains a diagram of molecular mechanisms present in tripartite synapses and helps to explain the physiological bases of cognitive functions. It can potentially be expanded to explain emotional functions and psychiatric phenomena. © MSM 2011.