955 results for Large datasets
Abstract:
GraphChi is the first reported disk-based graph engine that can efficiently handle billion-scale graphs on a single PC. It can execute several advanced data mining, graph mining and machine learning algorithms on very large graphs. Using the novel technique of parallel sliding windows (PSW) to load subgraphs from disk into memory for vertex and edge updates, it achieves processing performance close to, and sometimes better than, that of mainstream distributed graph engines. The GraphChi authors noted, however, that memory is not effectively utilized for large datasets, which leads to suboptimal computational performance. In this paper, motivated by the concept of 'pin' from TurboGraph and 'ghost' from GraphLab, we propose a new memory utilization mode for GraphChi, called Part-in-memory mode, to improve the performance of GraphChi algorithms. The main idea is to pin a fixed part of the data in memory for the whole computation. Part-in-memory mode was implemented with only about 40 additional lines of code on top of the original GraphChi engine. Extensive experiments were performed with large real datasets (including the Twitter graph with 1.4 billion edges). The preliminary results show that the Part-in-memory memory management approach reduces GraphChi running time by up to 60% for the PageRank algorithm. Interestingly, a larger portion of data pinned in memory does not always lead to better performance when the whole dataset cannot fit in memory; there exists an optimal portion of data to keep in memory for the best computational performance.
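A minimal sketch of the pinning idea described in this abstract, written against a hypothetical shard API (load, write_back, vertices are illustrative names, not GraphChi's actual code): a fixed fraction of shards stays resident in memory for the whole run, while the remaining shards are streamed from disk each execution interval, as in the usual parallel-sliding-windows loop.

    # Hypothetical sketch of Part-in-memory execution; not the GraphChi implementation.
    def run_part_in_memory(shards, pinned_fraction, n_iterations, update_vertex):
        n_pinned = int(len(shards) * pinned_fraction)
        pinned = {s.id: s.load() for s in shards[:n_pinned]}   # stays in RAM for the whole run

        for _ in range(n_iterations):
            for shard in shards:
                if shard.id in pinned:
                    subgraph = pinned[shard.id]                 # no disk I/O for pinned shards
                else:
                    subgraph = shard.load()                     # streamed from disk per interval
                for vertex in subgraph.vertices():
                    update_vertex(vertex)                       # e.g. a PageRank update
                if shard.id not in pinned:
                    shard.write_back(subgraph)                  # unpinned shards are written back and evicted

The observation in the abstract corresponds to tuning pinned_fraction: pinning more shards saves I/O only until the remaining working set no longer fits, after which performance degrades again.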
Abstract:
The seminal multiple view stereo benchmark evaluations from Middlebury and by Strecha et al. have played a major role in propelling the development of multi-view stereopsis methodology. Although seminal, these benchmark datasets are limited in scope, with few reference scenes. Here, we take these works a step further by proposing a new multi-view stereo dataset that is an order of magnitude larger in the number of scenes, with a significant increase in diversity. Specifically, we propose a dataset containing 80 scenes of large variability. Each scene consists of 49 or 64 accurate camera positions and reference structured light scans, all acquired by a 6-axis industrial robot. For this dataset we propose an extension of the evaluation protocol from the Middlebury evaluation, reflecting the more complex geometry of some of our scenes. The proposed dataset is used to evaluate the state-of-the-art multi-view stereo algorithms of Tola et al., Campbell et al. and Furukawa et al. We hereby demonstrate the usability of the dataset as well as gain insight into the workings and challenges of multi-view stereopsis. Through these experiments we empirically validate some of the central hypotheses of multi-view stereopsis, as well as determine and reaffirm some of the central challenges.
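A minimal sketch of a Middlebury-style accuracy/completeness evaluation of the kind this protocol extends, assuming both the reconstruction and the structured-light reference are given as Nx3 point clouds; the metric definitions and the tolerance parameter here are illustrative, not the exact protocol proposed in the paper.

    import numpy as np
    from scipy.spatial import cKDTree

    def accuracy_completeness(reconstruction, reference, tau=2.0):
        # Accuracy: how close reconstructed points lie to the reference scan.
        d_rec_to_ref, _ = cKDTree(reference).query(reconstruction)
        # Completeness: how much of the reference is covered by the reconstruction.
        d_ref_to_rec, _ = cKDTree(reconstruction).query(reference)
        accuracy = np.median(d_rec_to_ref)            # often reported as a distance percentile
        completeness = np.mean(d_ref_to_rec < tau)    # fraction of reference points within tolerance tau
        return accuracy, completeness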
Abstract:
Objective: Vomiting in pregnancy is a common condition affecting 80% of pregnant women. Hyperemesis is at one end of the spectrum, seen in 0.5–2% of the pregnant population. Known factors such as nulliparity, younger age and high body mass index are associated with an increased risk of this condition in the first trimester. Late pregnancy complications attributable to hyperemesis, the pathogenesis of which is poorly understood, have not been studied in large population-based studies in the United Kingdom. The objective of this study was to determine a plausible association between hyperemesis and pregnancy complications, such as pregnancy-related hypertension, gestational diabetes and liver problems in pregnancy, and the rates of elective (ElCS) and emergency caesarean section (EmCS). Methods: Using a database based on the ICD-10 classification, anonymised data on admissions to a large multi-ethnic hospital in Manchester, UK between 2000 and 2012 were examined. Notwithstanding the obvious limitations of hospital database-based research, this large volume of data allows powerful studies of disease trends and complications. Results: Between 2000 and 2012, 156 507 women aged 45 or under were admitted to hospital. Of these, 1111 women were coded for hyperemesis (0.4%). A greater proportion of women with hyperemesis than without hyperemesis were coded for hypertensive disorders in pregnancy such as pregnancy-induced hypertension, pre-eclampsia and eclampsia (2.7% vs 1.5%; P=0.001). The proportion of gestational diabetes and liver disorders in pregnancy was similar for both groups (diabetes: 0.5% vs. 0.4%; P=0.945, liver disorders: 0.2% vs. 0.1%; P=0.662). Hyperemesis patients had a higher proportion of elective and emergency caesarean sections compared with the non-hyperemesis group (ElCS: 3.3% vs. 2%; P=0.002, EmCS: 5% vs. 3%; P=0.00). Conclusions: There was a higher rate of emergency and elective caesarean section in women with hyperemesis, which could reflect the higher prevalence of pregnancy-related hypertensive disorders (but not diabetes or liver disorders) in this group. The factors contributing to the higher prevalence of hypertensive disorders are not known, but these findings lead us to question whether there is a similar pathogenesis in the development of both conditions and hence whether further study in this area is warranted.
Abstract:
Modern geographical databases, which are at the core of geographic information systems (GIS), store a rich set of aspatial attributes in addition to geographic data. Typically, aspatial information comes in textual and numeric format. Retrieving information constrained on both spatial and aspatial data from geodatabases gives GIS users the ability to perform more interesting spatial analyses and allows applications to support composite location-aware searches; for example, in a real estate database: “Find the nearest homes for sale to my current location that have a backyard and whose prices are between $50,000 and $80,000”. Efficient processing of such queries requires combined indexing strategies for multiple types of data. Existing spatial query engines commonly apply a two-filter approach (a spatial filter followed by a non-spatial filter, or vice versa), which can incur large performance overheads. More recently, the amount of geolocation data in databases has grown rapidly, due in part to advances in geolocation technologies (e.g., GPS-enabled smartphones) that allow users to associate location data with objects or events. This poses potential data ingestion challenges of large data volumes for practical GIS databases. In this dissertation, we first show how indexing spatial data with R-trees (a typical data pre-processing task) can be scaled in MapReduce, a widely-adopted parallel programming model for data-intensive problems. The evaluation of our algorithms in a Hadoop cluster showed close to linear scalability in building R-tree indexes. Subsequently, we develop efficient algorithms for processing spatial queries with aspatial conditions. Novel techniques for simultaneously indexing spatial data together with textual and numeric data are developed to that end. Experimental evaluations with real-world, large spatial datasets measured query response times within the sub-second range for most cases, and up to a few seconds for a small number of cases, which is reasonable for interactive applications. Overall, these results show that the MapReduce parallel model is suitable for indexing tasks in spatial databases, and that an adequate combination of spatial and aspatial attribute indexes can attain acceptable response times for interactive spatial queries with constraints on aspatial data.
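A minimal sketch of the composite query from the example above, implemented with the two-filter baseline the abstract contrasts against: a spatial index returns nearest candidates and an aspatial predicate is applied afterwards. The rtree package, the data layout, and the candidate over-fetch factor are illustrative assumptions, not the dissertation's combined-index system.

    from rtree import index

    homes = [
        # (id, (x, y), aspatial attributes)
        (1, (10.0, 20.0), {"price": 65000, "backyard": True}),
        (2, (10.2, 20.1), {"price": 95000, "backyard": True}),
        (3, (10.1, 19.9), {"price": 55000, "backyard": False}),
    ]

    idx = index.Index()
    for hid, (x, y), _ in homes:
        idx.insert(hid, (x, y, x, y))               # points stored as degenerate boxes

    def nearest_matching(location, k, predicate):
        attrs = {hid: a for hid, _, a in homes}
        # Spatial filter first, then the aspatial filter: the two-filter approach,
        # which may over-fetch candidates and incur the overhead noted above.
        candidates = idx.nearest((*location, *location), num_results=k * 10)
        return [hid for hid in candidates if predicate(attrs[hid])][:k]

    result = nearest_matching((10.0, 20.0), k=1,
                              predicate=lambda a: a["backyard"] and 50000 <= a["price"] <= 80000)

A combined spatial/aspatial index, as developed in the dissertation, avoids this over-fetching by pruning on both kinds of constraints during the index traversal itself.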
Abstract:
With the exponentially increasing demands on and uses of GIS data visualization systems, such as urban planning, environment and climate change monitoring, weather simulation, hydrographic gauging and so forth, research, applications and technology for geospatial vector and raster data visualization have become prevalent. However, we observe that current web GIS techniques are only suitable for static vector and raster data with no dynamically overlaid layers. While it is desirable to enable visual exploration of large-scale dynamic vector and raster geospatial data in a web environment, improving the performance between backend datasets and the vector and raster applications remains a challenging technical issue. This dissertation addresses these challenging and largely unimplemented areas: how to provide a large-scale dynamic vector and raster data visualization service with dynamically overlaid layers, accessible from various client devices through a standard web browser, and how to make the large-scale dynamic vector and raster data visualization service as fast as the static one. To accomplish this, a large-scale dynamic vector and raster data visualization geographic information system based on parallel map tiling, together with a comprehensive performance improvement solution, is proposed, designed and implemented. The solution includes: quadtree-based indexing and parallel map tiling, the Legend String, vector data visualization with dynamic layer overlaying, vector data time series visualization, an algorithm for vector data rendering, an algorithm for raster data re-projection, an algorithm for eliminating superfluous levels of detail, an algorithm for vector data gridding and re-grouping, and cluster-server-side vector and raster data caching.
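A minimal sketch of quadtree-based tile addressing of the kind used for parallel map tiling: each zoom level quadruples the tile grid, and a tile's quadkey encodes its ancestry so tiles can be generated, cached and served independently. Names and conventions here are illustrative, not the dissertation's implementation.

    def tile_to_quadkey(tx, ty, zoom):
        # Interleave the bits of the tile coordinates, one base-4 digit per zoom level.
        digits = []
        for level in range(zoom, 0, -1):
            mask = 1 << (level - 1)
            digit = 0
            if tx & mask:
                digit += 1
            if ty & mask:
                digit += 2
            digits.append(str(digit))
        return "".join(digits)

    # Tile (3, 5) at zoom level 3 -> quadkey "213": three characters, one per level,
    # and every prefix names the ancestor tile at a coarser zoom.
    print(tile_to_quadkey(3, 5, 3))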
Abstract:
Large-extent vegetation datasets that co-occur with long-term hydrology data provide new ways to develop biologically meaningful hydrologic variables and to determine plant community responses to hydrology. We analyzed the suitability of different hydrological variables to predict vegetation in two water conservation areas (WCAs) in the Florida Everglades, USA, and developed metrics to define realized hydrologic optima and tolerances. Using vegetation data spatially co-located with long-term hydrological records, we evaluated seven variables describing water depth, hydroperiod length, and number of wet/dry events; each variable was tested for 2-, 4- and 10-year intervals for Julian annual averages and environmentally-defined hydrologic intervals. Maximum length and maximum water depth during the wet period calculated for environmentally-defined hydrologic intervals over a 4-year period were the best predictors of vegetation type. Proportional abundance of vegetation types along hydrological gradients indicated that communities had different realized optima and tolerances across WCAs. Although in both WCAs the trees/shrubs class was on the drier/shallower end of hydrological gradients while slough communities occupied the wetter/deeper end, the distribution of Cladium, Typha, wet prairie and Salix communities, which were intermediate for most hydrological variables, varied in proportional abundance along hydrologic gradients between WCAs, indicating that realized optima and tolerances are context-dependent.
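A minimal sketch of two of the hydrologic predictors discussed above, computed from a daily water-depth series at one site: the maximum length of a continuous wet period and the maximum depth reached during wet periods (days with depth above a wetness threshold count as "wet"). The threshold and the treatment of the record window are illustrative simplifications, not the study's exact variable definitions.

    import numpy as np

    def wet_period_metrics(daily_depth_cm, wet_threshold_cm=0.0):
        depth = np.asarray(daily_depth_cm, dtype=float)
        max_len, max_depth, run = 0, 0.0, 0
        for d in depth:
            if d > wet_threshold_cm:        # a "wet" day
                run += 1
                max_len = max(max_len, run)
                max_depth = max(max_depth, d)
            else:                           # dry day breaks the wet run
                run = 0
        return max_len, max_depth           # days, cm

    # Example: a 4-year daily record would pass roughly 1461 values here.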
Abstract:
Social media classification problems have drawn more and more attention in the past few years. With the rapid development of the Internet and the popularity of computers, there is an astronomical amount of information in social networks (social media platforms). The datasets are generally large scale and are often corrupted by noise. The presence of noise in the training set has a strong impact on the performance of supervised learning (classification) techniques. A budget-driven One-class SVM approach is presented in this thesis that is suitable for large-scale social media data classification. Our approach is based on an existing online One-class SVM learning algorithm, referred to as the STOCS (Self-Tuning One-Class SVM) algorithm. To justify our choice, we first analyze the noise-resilient ability of STOCS using synthetic data. The experiments suggest that STOCS is more robust against label noise than several other existing approaches. Next, to handle the big-data classification problem for social media, we introduce several budget-driven features, which allow the algorithm to be trained within limited time and under limited memory requirements. In addition, the resulting algorithm can be easily adapted to changes in dynamic data with minimal computational cost. Compared with two state-of-the-art approaches, Lib-Linear and kNN, our approach is shown to be competitive, with lower memory and time requirements.
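A minimal sketch of the budget-driven idea described above: an online one-class learner keeps at most a fixed number of support vectors, so training time and memory stay bounded on large social-media streams. The update rule and the removal criterion (drop the support vector with the smallest coefficient) are illustrative choices, not the STOCS algorithm or the thesis implementation.

    import numpy as np

    class BudgetedOneClass:
        def __init__(self, budget, learning_rate=0.1, nu=0.1):
            self.budget = budget        # hard cap on the number of support vectors
            self.lr = learning_rate
            self.nu = nu                # acceptance threshold on the decision value
            self.sv = []                # support vectors
            self.alpha = []             # their coefficients

        def _score(self, x):
            # RBF-kernel decision value against the current support set.
            return sum(a * np.exp(-np.linalg.norm(np.asarray(x) - s) ** 2)
                       for a, s in zip(self.alpha, self.sv))

        def partial_fit(self, x):
            if self._score(x) < self.nu:          # x falls outside the learned region
                self.sv.append(np.asarray(x, dtype=float))
                self.alpha.append(self.lr)
                if len(self.sv) > self.budget:    # enforce the budget
                    drop = int(np.argmin(self.alpha))
                    del self.sv[drop], self.alpha[drop]
            return self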
Abstract:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel to each subset using regularized regression or Bayesian variable selection methods, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
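A minimal sketch of the {\em message} steps as described above, using scikit-learn's Lasso as a stand-in for the regularized-regression selection step (a simplifying assumption; the Bayesian variant, tuning and parallel execution are omitted): select features on each sample-space subset, take the median of the inclusion indicators, refit on the selected features within each subset, and average the coefficients.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    def message(subsets, alpha=0.1):
        # subsets: list of (X, y) pairs, one per sample-space partition.
        # Step 1: feature selection on each subset (embarrassingly parallel).
        inclusion = np.array([(Lasso(alpha=alpha).fit(X, y).coef_ != 0).astype(int)
                              for X, y in subsets])
        # Step 2: median inclusion index -- keep a feature if at least half
        # of the subsets selected it.
        selected = np.median(inclusion, axis=0) >= 0.5
        # Step 3: estimate coefficients for the selected features on each
        # subset, then average the estimates.
        fits = [LinearRegression().fit(X[:, selected], y).coef_ for X, y in subsets]
        coef = np.zeros(subsets[0][0].shape[1])
        coef[selected] = np.mean(fits, axis=0)
        return coef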
While sample space partitioning is useful in handling datasets with large sample sizes, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
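A minimal sketch of the decorrelate-then-fit-in-parallel idea behind {\em DECO}, under simplifying assumptions (ridge-regularised decorrelation, plain Lasso on each feature block, no refitting stage); scaling constants, tuning and the distributed execution follow the spirit rather than the letter of the thesis.

    import numpy as np
    from scipy.linalg import sqrtm
    from sklearn.linear_model import Lasso

    def deco(X, y, n_blocks, r=1.0, alpha=0.1):
        n, p = X.shape
        # Decorrelation step: whiten the row space so that feature blocks
        # handed to different workers are approximately uncorrelated.
        F = np.real(sqrtm(np.linalg.inv(X @ X.T / p + r * np.eye(n))))
        X_tilde, y_tilde = F @ X, F @ y
        # Partition the features and fit each block independently
        # (each block would run on its own worker in a real deployment).
        coef = np.zeros(p)
        for block in np.array_split(np.arange(p), n_blocks):
            coef[block] = Lasso(alpha=alpha).fit(X_tilde[:, block], y_tilde).coef_
        return coef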
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithms. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partitions the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each of a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the {\em DECO} and {\em message} algorithms in reverse order to produce the final output. The whole framework is extremely scalable.
Abstract:
The MAREDAT atlas covers 11 types of plankton, ranging in size from bacteria to jellyfish. Together, these plankton groups determine the health and productivity of the global ocean and play a vital role in the global carbon cycle. Working within a uniform and consistent spatial and depth grid (map) of the global ocean, the researchers compiled thousands to tens of thousands of data points to identify regions of plankton abundance and scarcity as well as areas of data abundance and scarcity. At many of the grid points, the MAREDAT team accomplished the difficult conversion from abundance (numbers of organisms) to biomass (carbon mass of organisms). The MAREDAT atlas provides an unprecedented global data set for ecological and biochemical analysis and modeling, as well as a clear mandate for compiling additional existing data and for focusing future data-gathering efforts on key groups in key areas of the ocean. The present data set presents depth-integrated values of diazotroph abundance and biomass, computed from a collection of source data sets.
Abstract:
The MAREDAT atlas covers 11 types of plankton, ranging in size from bacteria to jellyfish. Together, these plankton groups determine the health and productivity of the global ocean and play a vital role in the global carbon cycle. Working within a uniform and consistent spatial and depth grid (map) of the global ocean, the researchers compiled thousands to tens of thousands of data points to identify regions of plankton abundance and scarcity as well as areas of data abundance and scarcity. At many of the grid points, the MAREDAT team accomplished the difficult conversion from abundance (numbers of organisms) to biomass (carbon mass of organisms). The MAREDAT atlas provides an unprecedented global data set for ecological and biochemical analysis and modeling, as well as a clear mandate for compiling additional existing data and for focusing future data-gathering efforts on key groups in key areas of the ocean. The present data set presents depth-integrated values of diazotroph nitrogen fixation rates, computed from a collection of source data sets.
Abstract:
Oceanic flood basalts are poorly understood, short-term expressions of highly increased heat flux and mass flow within the convecting mantle. The uniqueness of the Caribbean Large Igneous Province (CLIP, 92-74 Ma) with respect to other Cretaceous oceanic plateaus is its extensive sub-aerial exposures, providing an excellent basis to investigate the temporal and compositional relationships within a starting plume head. We present major element, trace element and initial Sr-Nd-Pb isotope compositions of 40 extrusive rocks from the Caribbean Plateau, including onland sections in Costa Rica, Colombia and Curaçao as well as DSDP Sites in the Central Caribbean. Even though the lavas were erupted over an area of ~3x10^6 km^2, the majority have strikingly uniform incompatible element patterns (La/Yb=0.96+/-0.16, n=64 out of 79 samples, 2sigma) and initial Nd-Pb isotopic compositions (e.g. 143Nd/144Nd(i)=0.51291+/-3, epsilon-Nd(i)=7.3+/-0.6, 206Pb/204Pb(i)=18.86+/-0.12, n=54 out of 66, 2sigma). Lavas with endmember compositions have only been sampled at the DSDP Sites, Gorgona Island (Colombia) and the 65-60 Ma accreted Quepos and Osa igneous complexes (Costa Rica) of the subsequent hotspot track. Despite the relatively uniform composition of most lavas, linear correlations exist between isotope ratios and between isotope and highly incompatible trace element ratios. The Sr-Nd-Pb isotope and trace element signatures of the chemically enriched lavas are compatible with derivation from recycled oceanic crust, while the depleted lavas are derived from a highly residual source. This source could represent either oceanic lithospheric mantle left after ocean crust formation or gabbros with interlayered ultramafic cumulates of the lower oceanic crust. 3He/4He ratios in olivines of enriched picrites at Quepos are ~12 times higher than the atmospheric ratio, suggesting that the enriched component may have once resided in the lower mantle. Evaluation of the Sm-Nd and U-Pb isotope systematics on isochron diagrams suggests that the age of separation of enriched and depleted components from the depleted MORB source mantle could have been <=500 Ma before CLIP formation, which is interpreted to reflect the recycling time of the CLIP source. Mantle plume heads may provide a mechanism for transporting large volumes of possibly young recycled oceanic lithosphere residing in the lower mantle back into the shallow MORB source mantle.
Abstract:
Introduction: Quantitative and accurate measurements of fat and muscle in the body are important for the prevention and diagnosis of diseases related to obesity and muscle degeneration. Manually segmenting muscle and fat compartments in MR body-images is laborious and time-consuming, hindering implementation in large cohorts. In the present study, the feasibility and success-rate of a Dixon-based MR scan followed by an intensity-normalised, non-rigid, multi-atlas based segmentation was investigated in a cohort of 3,000 subjects. Materials and Methods: 3,000 participants in the in-depth phenotyping arm of the UK Biobank imaging study underwent a comprehensive MR examination. All subjects were scanned using a 1.5 T MR-scanner with the dual-echo Dixon Vibe protocol, covering neck to knees. Subjects were scanned with six slabs in supine position, without localizer. Automated body composition analysis was performed using the AMRA Profiler™ system to segment and quantify visceral adipose tissue (VAT), abdominal subcutaneous adipose tissue (ASAT) and thigh muscles. Technical quality assurance was performed and a standard set of acceptance/rejection criteria was established. Descriptive statistics were calculated for all volume measurements and quality assurance metrics. Results: Of the 3,000 subjects, 2,995 (99.83%) were analysable for body fat, 2,828 (94.27%) were analysable when body fat and one thigh were included, and 2,775 (92.50%) were fully analysable for body fat and both thigh muscles. The main reasons datasets could not be analysed were missing slabs in the acquisition or patients positioned such that large parts of the volume were outside the field-of-view. Discussion and Conclusions: In conclusion, this study showed that the rapid UK Biobank MR-protocol was well tolerated by most subjects and sufficiently robust to achieve a very high success-rate for body composition analysis. This research has been conducted using the UK Biobank Resource.
Abstract:
Recent progress in the technology for single unit recordings has given the neuroscientific community the opportunity to record the spiking activity of large neuronal populations. At the same pace, statistical and mathematical tools were developed to deal with high-dimensional datasets typical of such recordings. A major line of research investigates the functional role of subsets of neurons with significant co-firing behavior: the Hebbian cell assemblies. Here we review three linear methods for the detection of cell assemblies in large neuronal populations that rely on principal and independent component analysis. Based on their performance in spike train simulations, we propose a modified framework that incorporates multiple features of these previous methods. We apply the new framework to actual single unit recordings and show the existence of cell assemblies in the rat hippocampus, which typically oscillate at theta frequencies and couple to different phases of the underlying field rhythm.
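A minimal sketch of the PCA/ICA assembly-detection approach the abstract reviews: z-score binned spike counts, count how many principal components of the neuron correlation matrix exceed the Marchenko-Pastur bound (candidate assemblies), then extract assembly weight vectors with ICA. Bin size, thresholding and post-processing choices are illustrative, not the paper's full framework.

    import numpy as np
    from sklearn.decomposition import FastICA

    def detect_assemblies(spike_counts):
        # spike_counts: array of shape (n_neurons, n_time_bins)
        n_neurons, n_bins = spike_counts.shape
        z = (spike_counts - spike_counts.mean(axis=1, keepdims=True)) \
            / spike_counts.std(axis=1, keepdims=True)
        corr = np.corrcoef(z)
        eigvals = np.linalg.eigvalsh(corr)
        # Marchenko-Pastur upper bound for a random correlation matrix:
        # eigenvalues above it indicate significant co-firing structure.
        lambda_max = (1 + np.sqrt(n_neurons / n_bins)) ** 2
        n_assemblies = int(np.sum(eigvals > lambda_max))
        if n_assemblies == 0:
            return np.empty((0, n_neurons))
        # ICA on the z-scored activity recovers one weight vector per assembly.
        ica = FastICA(n_components=n_assemblies)
        ica.fit(z.T)                      # samples are time bins, features are neurons
        return ica.components_            # shape: (n_assemblies, n_neurons)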
Abstract:
In today's fast-paced and interconnected digital world, the data generated by an increasing number of applications is being modeled as dynamic graphs. The graph structure encodes relationships among data items, while the structural changes to the graphs as well as the continuous stream of information produced by the entities in these graphs make them dynamic in nature. Examples include social networks where users post status updates, images, videos, etc.; phone call networks where nodes may send text messages or place phone calls; road traffic networks where the traffic behavior of the road segments changes constantly, and so on. There is a tremendous value in storing, managing, and analyzing such dynamic graphs and deriving meaningful insights in real-time. However, a majority of the work in graph analytics assumes a static setting, and there is a lack of systematic study of the various dynamic scenarios, the complexity they impose on the analysis tasks, and the challenges in building efficient systems that can support such tasks at a large scale. In this dissertation, I design a unified streaming graph data management framework, and develop prototype systems to support increasingly complex tasks on dynamic graphs. In the first part, I focus on the management and querying of distributed graph data. I develop a hybrid replication policy that monitors the read-write frequencies of the nodes to decide dynamically what data to replicate, and whether to do eager or lazy replication in order to minimize network communication and support low-latency querying. In the second part, I study parallel execution of continuous neighborhood-driven aggregates, where each node aggregates the information generated in its neighborhoods. I build my system around the notion of an aggregation overlay graph, a pre-compiled data structure that enables sharing of partial aggregates across different queries, and also allows partial pre-computation of the aggregates to minimize the query latencies and increase throughput. Finally, I extend the framework to support continuous detection and analysis of activity-based subgraphs, where subgraphs could be specified using both graph structure as well as activity conditions on the nodes. The query specification tasks in my system are expressed using a set of active structural primitives, which allows the query evaluator to use a set of novel optimization techniques, thereby achieving high throughput. Overall, in this dissertation, I define and investigate a set of novel tasks on dynamic graphs, design scalable optimization techniques, build prototype systems, and show the effectiveness of the proposed techniques through extensive evaluation using large-scale real and synthetic datasets.
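A minimal sketch of a read/write-frequency-driven replication decision of the kind described in the first part of the dissertation: nodes read far more often than they are written are replicated eagerly (updates pushed on write), write-heavy nodes are replicated lazily (fresh values pulled on read), and cold nodes are not replicated at all. Thresholds, policy names and the monitoring interface are illustrative assumptions, not the dissertation's actual system.

    from collections import defaultdict

    class ReplicationMonitor:
        def __init__(self, eager_ratio=4.0, min_reads=10):
            self.reads = defaultdict(int)
            self.writes = defaultdict(int)
            self.eager_ratio = eager_ratio     # read/write ratio above which eager replication pays off
            self.min_reads = min_reads         # below this, the node is too cold to replicate

        def record_read(self, node_id):
            self.reads[node_id] += 1

        def record_write(self, node_id):
            self.writes[node_id] += 1

        def policy(self, node_id):
            r, w = self.reads[node_id], self.writes[node_id]
            if r < self.min_reads:
                return "none"                  # no replica: remote reads go to the owner
            if r >= self.eager_ratio * max(w, 1):
                return "eager"                 # push updates to replicas on every write
            return "lazy"                      # pull a fresh value on remote read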
Abstract:
A large SAV bed in upper Chesapeake Bay has experienced several abrupt shifts over the past half-century, beginning with near-complete loss after a record-breaking flood in 1972, followed by an unexpected, rapid resurgence in the early 2000s, then partial decline in 2011 following another major flood event. Together, these trends and events provide a unique opportunity to study a recovering SAV ecosystem from several different perspectives. First, I analyzed and synthesized existing time series datasets to make inferences about what factors prompted the recovery. Next, I analyzed existing datasets, together with field samples and a simple hydrodynamic model, to investigate mechanisms of SAV bed loss and resilience to storm events. Finally, I conducted field deployments and experiments to explore how the bed affects internal physical and biogeochemical processes and what implications those effects have for the dynamics of the system. I found that modest reductions in nutrient loading, coupled with several consecutive dry years, likely facilitated the SAV resurgence. Furthermore, positive feedback processes may have played a role in the sudden nature of the recovery because they could have reinforced the state of the bed before and after the abrupt shift. I also found that scour and poor water clarity associated with sediment deposition during the 2011 flood event were mechanisms of plant loss. However, interactions between the bed, water flow, and waves served as mechanisms of resilience because these processes created favorable growing conditions (i.e., clear water, low flow velocities) in the inner core of the bed. Finally, I found that interactions between physical and biogeochemical processes led to low nutrient concentrations inside the bed relative to outside the bed, which created conditions that precluded algal growth and reinforced vascular plant dominance. This work demonstrates that positive feedbacks play a central role in SAV resilience to both chronic eutrophication and acute storm events. Furthermore, I show that analysis of long-term ecological monitoring data, together with field measurements and experiments, can be an effective approach for understanding the mechanisms underlying ecosystem dynamics.