920 results for Large Data


Relevance: 60.00%

Abstract:

This chapter draws on a large data set of children's work samples collected as part of a five-year school reform project in a community of high poverty. One component of the data set is a corpus of more than 2000 writing samples collected annually from students across eight grade levels (Prep to Year 7) over four years of the project (2009-2013). The chapter uses a selection of these texts to consider the insights available to teachers and schools through a simple process of collecting and assessing writing samples produced by children over time. The focus is on what samples of writing might enable us to know and understand about learning and teaching this important dimension of literacy in current classrooms.

Relevance: 60.00%

Abstract:

This work focuses on the role of macroseismology in the assessment of seismicity and probabilistic seismic hazard in Northern Europe. The main type of data under consideration is the set of macroseismic observations available for a given earthquake. The macroseismic questionnaires used to collect earthquake observations from local residents since the late 1800s constitute a special part of the seismological heritage of the region. Information on the earthquakes felt on the coasts of the Gulf of Bothnia between 31 March and 2 April 1883 and on 28 July 1888 was retrieved from contemporary Finnish and Swedish newspapers, while the earthquake of 4 November 1898 (GMT) is an example of an early systematic macroseismic survey in the region. A data set of more than 1200 macroseismic questionnaires is available for the earthquake in Central Finland on 16 November 1931.

Basic macroseismic investigations, including the preparation of new intensity data point (IDP) maps, were conducted for these earthquakes, and previously disregarded usable observations were found in the press. The improved collection of IDPs for the 1888 earthquake shows that this event was a rare occurrence in the area: in contrast to earlier notions, it was felt on both sides of the Gulf of Bothnia. The data on the earthquake of 4 November 1898 (GMT) were augmented with historical background information discovered in various archives and libraries. This earthquake was of some concern to the authorities, because extra fire inspections were conducted in at least three towns, namely Tornio, Haparanda and Piteå, located in the centre of the area of perceptibility. The event thus posed the indirect hazard of fire, although its magnitude of around 4.6 was minor on a global scale. The distribution of slightly damaging intensities was larger than previously outlined, which may have resulted from amplification of the ground shaking in the soft soils of the coast and river valleys where most of the population was found.

The large data set of the 1931 earthquake provided an opportunity to apply statistical methods and to assess methodologies for dealing with macroseismic intensity. The data set was evaluated using correspondence analysis, and different approaches, such as gridding, were tested to estimate the macroseismic field from intensity values distributed irregularly in space. In general, the characteristics of intensity warrant careful consideration; a more pervasive perception of intensity as an ordinal quantity affected by uncertainties is advocated.

A parametric earthquake catalogue comprising entries from both the macroseismic and the instrumental era was used for probabilistic seismic hazard assessment. The parametric-historic methodology was applied to estimate seismic hazard at a given site in Finland and to prepare a seismic hazard map for Northern Europe. The interpretation of these results is an important issue, because the recurrence times of damaging earthquakes may well exceed thousands of years in an intraplate setting such as Northern Europe. This application may therefore be seen as an example of short-term hazard assessment.
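
A minimal sketch of the gridding step mentioned above, assuming the intensity data points (IDPs) are available as longitude/latitude/intensity triples; cell medians are used to respect the ordinal nature of intensity, and the cell size is illustrative rather than taken from the thesis:

    import numpy as np

    def grid_intensities(lons, lats, intensities, cell_deg=0.5):
        """Bin irregularly spaced intensity data points (IDPs) into a regular
        lon/lat grid and return the median intensity of each non-empty cell."""
        lon_edges = np.arange(lons.min(), lons.max() + cell_deg, cell_deg)
        lat_edges = np.arange(lats.min(), lats.max() + cell_deg, cell_deg)
        ix = np.digitize(lons, lon_edges)
        iy = np.digitize(lats, lat_edges)
        cells = {}
        for i, j, val in zip(ix, iy, intensities):
            cells.setdefault((i, j), []).append(val)
        # The median, not the mean, is used because intensity is an ordinal quantity.
        return {cell: float(np.median(vals)) for cell, vals in cells.items()}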

Relevance: 60.00%

Abstract:

Core Vector Machine (CVM) is suitable for efficient large-scale pattern classification. In this paper, we propose a method for improving the performance of CVM with the Gaussian kernel that is insensitive to the ordering of patterns belonging to different classes within the data set. The method employs selective-sampling-based training of CVM using a novel kernel-based scalable hierarchical clustering algorithm. Empirical studies on synthetic and real-world data sets show that the proposed strategy performs well on large data sets.
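
CVM is built on reformulating kernel SVM training as a minimum enclosing ball (MEB) problem solved over a small core set. The following is a minimal Euclidean sketch of that core-set idea, in the style of a Badoiu-Clarkson iteration, and not the kernelised CVM solver itself:

    import numpy as np

    def core_set_meb(points, eps=0.1):
        """Badoiu-Clarkson style (1+eps)-approximate minimum enclosing ball.
        Returns the approximate centre, radius and the core-set indices."""
        n_iter = int(np.ceil(1.0 / eps ** 2))
        centre = points[0].astype(float)
        core = [0]
        for t in range(1, n_iter + 1):
            # The farthest point from the current centre joins the core set.
            dists = np.linalg.norm(points - centre, axis=1)
            far = int(np.argmax(dists))
            core.append(far)
            # Move the centre a shrinking step towards the farthest point.
            centre += (points[far] - centre) / (t + 1)
        radius = float(np.linalg.norm(points - centre, axis=1).max())
        return centre, radius, sorted(set(core))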

Relevance: 60.00%

Abstract:

This paper discusses a method for scaling SVM with the Gaussian kernel to handle large data sets by using a selective sampling strategy for the training set. It employs a scalable hierarchical clustering algorithm to construct cluster indexing structures over the training data in the kernel-induced feature space. These structures are then used for selective sampling of the training data, imparting scalability to the SVM training process. Empirical studies on real-world data sets show that the proposed strategy performs well on large data sets.
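
A minimal sketch of the selective-sampling idea under stated assumptions: scikit-learn's MiniBatchKMeans stands in for the paper's kernel-based scalable hierarchical clustering, points from clusters whose centres lie close to the opposite class are kept in full, and interior clusters contribute only a small random sample before an RBF-kernel SVM is trained. All parameter values are illustrative:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.svm import SVC

    def selective_sample(X, y, n_clusters=50, keep_frac=0.1):
        """Cluster each class, keep all points of clusters near the other class,
        and only a small random sample from interior clusters."""
        rng = np.random.default_rng(0)
        centres, labels, keep = {}, {}, []
        for c in np.unique(y):
            km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(X[y == c])
            centres[c], labels[c] = km.cluster_centers_, km.labels_
        for c in np.unique(y):
            other = np.vstack([centres[o] for o in centres if o != c])
            idx_c = np.where(y == c)[0]
            # Distance from each cluster centre of class c to the nearest opposite-class centre.
            d = np.min(np.linalg.norm(centres[c][:, None, :] - other[None, :, :], axis=2), axis=1)
            boundary = set(np.argsort(d)[: n_clusters // 4])  # closest quarter of clusters
            for k in range(n_clusters):
                members = idx_c[labels[c] == k]
                if len(members) == 0:
                    continue
                if k in boundary:
                    keep.extend(members)  # boundary clusters are kept fully
                else:
                    size = max(1, int(keep_frac * len(members)))
                    keep.extend(rng.choice(members, size, replace=False))
        return np.array(sorted(keep))

    # idx = selective_sample(X, y); clf = SVC(kernel="rbf").fit(X[idx], y[idx])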

Relevance: 60.00%

Abstract:

Metabolomics is a rapidly growing research field that studies the response of biological systems to environmental factors, disease states and genetic modifications. It aims at measuring the complete set of endogenous metabolites, i.e. the metabolome, in a biological sample such as plasma or cells. Because metabolites are the intermediates and end products of biochemical reactions, metabolite compositions and levels in biological samples can provide a wealth of information on ongoing processes in a living system. Due to the complexity of the metabolome, metabolomic analysis poses a challenge to analytical chemistry. Adequate sample preparation is critical to accurate and reproducible analysis, and the analytical techniques must have high resolution and sensitivity to allow detection of as many metabolites as possible. Furthermore, as the information contained in the metabolome is immense, the data sets collected in metabolomic studies are very large, and efficient data processing and multivariate data analysis methods are needed to extract the relevant information.

In the research presented in this thesis, metabolomics was used to study mechanisms of polymeric gene delivery to retinal pigment epithelial (RPE) cells. The aim of the study was to detect differences in metabolomic fingerprints between transfected cells and non-transfected controls, and thereafter to identify the metabolites responsible for the discrimination. The plasmid pCMV-β was introduced into RPE cells using the vector polyethyleneimine (PEI). The samples were analyzed using high-performance liquid chromatography (HPLC) and ultra-performance liquid chromatography (UPLC) coupled to a triple quadrupole (QqQ) mass spectrometer (MS). The software MZmine was used for raw data processing, and principal component analysis (PCA) was used for statistical data analysis.

The results revealed differences in metabolomic fingerprints between transfected cells and non-transfected controls. However, reliable fingerprinting data could not be obtained because of low analysis repeatability, and therefore no attempts were made to identify the metabolites responsible for discrimination between sample groups. Repeatability and accuracy of analyses can be improved by protocol optimization; in this study, however, optimization of the analytical methods was hindered by the very small number of samples available for analysis. In conclusion, this study demonstrates that obtaining reliable fingerprinting data is technically demanding, and the protocols need to be thoroughly optimized in order to gain information on the mechanisms of gene delivery.
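
A minimal sketch of the fingerprinting step, assuming the MZmine output has been exported as a peak-intensity matrix with one row per sample and one column per aligned feature; scikit-learn stands in here for the statistical tools actually used in the study:

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def pca_fingerprint(peak_matrix, n_components=2):
        """Autoscale feature intensities and project samples onto the first
        principal components to look for separation between sample groups."""
        scaled = StandardScaler().fit_transform(peak_matrix)
        pca = PCA(n_components=n_components)
        scores = pca.fit_transform(scaled)
        return scores, pca.explained_variance_ratio_

    # scores, var = pca_fingerprint(intensities)  # plot scores[:, 0] vs scores[:, 1]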

Relevance: 60.00%

Abstract:

Structure comparison tools can be used to align related protein structures to identify structurally conserved and variable regions and to infer functional and evolutionary relationships. While the conserved regions often superimpose well, the variable regions appear non-superimposable. Differences between homologous protein structures are thought to reflect the structural plasticity required to accommodate sequence divergence during evolution. One kind of difference between the 3-D structures of homologous proteins is rigid-body displacement; a glaring example is equivalent regions in α-helical conformation that adopt different spatial orientations and therefore superimpose poorly in a rigid-body superimposition, even though they may contain local similarity. Moreover, due to the high spatial deviation in such variable regions, one-to-one correspondence at the residue level cannot be determined accurately. Another kind of difference is conformational variability, the most common example being loops that are topologically equivalent in two homologues but adopt different conformations. In the current study, we present a refined view of the "structurally variable" regions, which may contain local similarity obscured in a global alignment of homologous protein structures. As a structural alphabet can describe local protein structure precisely, the Protein Blocks approach was used: conformational similarity was identified in a substantial number of "variable" regions in a large data set of protein structural alignments, and optimal residue-residue equivalences could be achieved on the basis of Protein Blocks, leading to improved local alignments. Through an example, we also demonstrate how additional information on local backbone structure provided by Protein Blocks can aid in the comparative modelling of a loop region. In addition, our approach can enhance the understanding of sequence-structure relationships; this is illustrated through examples where the equivalent regions in homologous protein structures share sequence similarity to varying extents but do not preserve local structure.
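
A minimal sketch, assuming each structure has already been encoded as a string over the Protein Block alphabet and that the structural alignment provides residue correspondences; a plain identity fraction is used for scoring here rather than the substitution-based scoring of the study:

    def pb_local_similarity(pb_a, pb_b):
        """Fraction of aligned positions sharing the same Protein Block letter.
        Gaps ('-') in either aligned, encoded string are ignored."""
        pairs = [(x, y) for x, y in zip(pb_a, pb_b) if x != '-' and y != '-']
        if not pairs:
            return 0.0
        return sum(x == y for x, y in pairs) / len(pairs)

    # Example: a 'variable' region that still shares local backbone structure.
    # pb_local_similarity("mmmnopacd", "mmmnopghi")  -> 6/9, about 0.67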

Relevance: 60.00%

Abstract:

Estimates of predicate selectivities by database query optimizers often differ significantly from those actually encountered during query execution, leading to poor plan choices and inflated response times. In this paper, we investigate mitigating this problem by replacing selectivity-error-sensitive plan choices with alternative plans that provide robust performance. Our approach is based on the recent observation that even the complex and dense "plan diagrams" associated with industrial-strength optimizers can be efficiently reduced to "anorexic" equivalents featuring only a few plans, without materially impacting query processing quality. Extensive experimentation with a rich set of TPC-H and TPC-DS-based query templates in a variety of database environments indicates that plan diagram reduction typically retains plans that are substantially resistant to selectivity errors on the base relations. However, it can sometimes also be severely counter-productive, with the replacements performing much worse. We address this problem through a generalized mathematical characterization of plan cost behavior over the parameter space, which lends itself to efficient criteria for deciding when it is safe to reduce. Our strategies are fully non-invasive and have been implemented in the Picasso optimizer visualization tool.
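
A minimal sketch of the kind of robustness test that underlies such replacement, under the assumption that the optimizer can cost a given plan at arbitrary selectivity points; the function name cost_plan, the perturbation set and the threshold lam are illustrative, not the paper's actual criteria:

    def safe_to_replace(cost_plan, original, replacement, estimated_sel,
                        perturbations, lam=0.2):
        """Accept 'replacement' only if it never costs more than (1 + lam) times
        'original' at the estimated selectivity and at all perturbed selectivities.
        cost_plan(plan, selectivities) is assumed to call an optimizer interface
        that can cost an arbitrary plan at a given selectivity point."""
        for sel in [estimated_sel] + list(perturbations):
            if cost_plan(replacement, sel) > (1.0 + lam) * cost_plan(original, sel):
                return False
        return True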

Relevance: 60.00%

Abstract:

Given a parametrized n-dimensional SQL query template and a choice of query optimizer, a plan diagram is a color-coded pictorial enumeration of the execution plan choices of the optimizer over the query parameter space. These diagrams have proved to be a powerful metaphor for the analysis and redesign of modern optimizers, and are gaining currency in diverse industrial and academic institutions. However, their utility is adversely impacted by the impractically large computational overheads incurred when standard brute-force exhaustive approaches are used for producing fine-grained diagrams on high-dimensional query templates. In this paper, we investigate strategies for efficiently producing close approximations to complex plan diagrams. Our techniques are customized to the features available in the optimizer's API, ranging from the generic optimizers that provide only the optimal plan for a query, to those that also support costing of sub-optimal plans and enumerating rank-ordered lists of plans. The techniques collectively feature both random and grid sampling, as well as inference techniques based on nearest-neighbor classifiers, parametric query optimization and plan cost monotonicity. Extensive experimentation with a representative set of TPC-H and TPC-DS-based query templates on industrial-strength optimizers indicates that our techniques are capable of delivering 90% accurate diagrams while incurring less than 15% of the computational overheads of the exhaustive approach. In fact, for full-featured optimizers, we can guarantee zero error with less than 10% overheads. These approximation techniques have been implemented in the publicly available Picasso optimizer visualization tool.
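
A minimal sketch of one generic approximation strategy consistent with the description above: invoke the optimizer only at a random sample of grid points and infer the plan at the remaining points with a 1-nearest-neighbour classifier. The callable optimize_at (returning a plan identifier for given selectivities) and the grid resolution are placeholders:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def approximate_plan_diagram(optimize_at, resolution=100, sample_frac=0.1, seed=0):
        """Approximate a 2-D plan diagram: call the optimizer on a random sample of
        grid points and fill in the rest by 1-nearest-neighbour inference."""
        rng = np.random.default_rng(seed)
        grid = np.array([(x, y) for x in np.linspace(0, 1, resolution)
                                for y in np.linspace(0, 1, resolution)])
        sample_idx = rng.choice(len(grid), int(sample_frac * len(grid)), replace=False)
        sampled_plans = [optimize_at(*grid[i]) for i in sample_idx]  # plan identifiers
        knn = KNeighborsClassifier(n_neighbors=1).fit(grid[sample_idx], sampled_plans)
        return knn.predict(grid).reshape(resolution, resolution)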

Relevance: 60.00%

Abstract:

A "plan diagram" is a pictorial enumeration of the execution plan choices of a database query optimizer over the relational selectivity space. We have shown recently that, for industrial-strength database engines, these diagrams are often remarkably complex and dense, with a large number of plans covering the space. However, they can often be reduced to much simpler pictures, featuring significantly fewer plans, without materially affecting the query processing quality. Plan reduction has useful implications for the design and usage of query optimizers, including quantifying redundancy in the plan search space, enhancing useability of parametric query optimization, identifying error-resistant and least-expected-cost plans, and minimizing the overheads of multi-plan approaches. We investigate here the plan reduction issue from theoretical, statistical and empirical perspectives. Our analysis shows that optimal plan reduction, w.r.t. minimizing the number of plans, is an NP-hard problem in general, and remains so even for a storage-constrained variant. We then present a greedy reduction algorithm with tight and optimal performance guarantees, whose complexity scales linearly with the number of plans in the diagram for a given resolution. Next, we devise fast estimators for locating the best tradeoff between the reduction in plan cardinality and the impact on query processing quality. Finally, extensive experimentation with a suite of multi-dimensional TPCH-based query templates on industrial-strength optimizers demonstrates that complex plan diagrams easily reduce to "anorexic" (small absolute number of plans) levels incurring only marginal increases in the estimated query processing costs.

Relevance: 60.00%

Abstract:

Even though several techniques have been proposed in the literature for achieving multiclass classification using the Support Vector Machine (SVM), the scalability of these approaches to large data sets still needs much exploration. Core Vector Machine (CVM) is a technique for scaling up a two-class SVM to handle large data sets. In this paper we propose a Multiclass Core Vector Machine (MCVM): we formulate the multiclass SVM problem as a Quadratic Programming (QP) problem defining an SVM with vector-valued output, and this QP problem is then solved using the CVM technique to achieve scalability to large data sets. Experiments with several large synthetic and real-world data sets show that the proposed MCVM technique gives generalization performance comparable to that of SVM at a much lower computational expense. Further, MCVM is observed to scale well with the size of the data set.
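
As an illustration of the vector-valued-output idea (one weight vector per class, prediction by argmax), here is a sketch using scikit-learn's Crammer-Singer multiclass linear SVM as a stand-in; it is not the MCVM solver, which additionally uses the CVM machinery for scalability, and the data set is synthetic:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Vector-valued output: one weight vector per class, prediction = argmax_c w_c . x.
    X, y = make_classification(n_samples=2000, n_features=20, n_classes=4,
                               n_informative=10, random_state=0)
    clf = LinearSVC(multi_class="crammer_singer").fit(X, y)
    W = clf.coef_                                   # shape: (n_classes, n_features)
    pred = np.argmax(X @ W.T + clf.intercept_, axis=1)
    assert (pred == clf.predict(X)).all()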

Relevance: 60.00%

Abstract:

The Reeb graph of a scalar function tracks the evolution of the topology of its level sets. This paper describes a fast algorithm to compute the Reeb graph of a piecewise-linear (PL) function defined over manifolds and non-manifolds. The key idea in the proposed approach is to maximally leverage the efficient contour tree algorithm to compute the Reeb graph. The algorithm proceeds by dividing the input into a set of subvolumes that have loop-free Reeb graphs, using the join tree of the scalar function, and then computes the Reeb graph by combining the contour trees of all the subvolumes. Since the key ingredient of this method is a series of union-find operations, the algorithm is fast in practice. Experimental results demonstrate that it outperforms current generic algorithms by up to two orders of magnitude, and performs on par with algorithms that cater to restricted classes of input. The algorithm also extends to handle large data sets that do not fit in memory.
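
Since union-find is named above as the key ingredient, here is a minimal disjoint-set implementation with path compression and union by rank, offered as a generic building block rather than the paper's Reeb graph algorithm itself:

    class UnionFind:
        """Disjoint-set forest with path compression and union by rank."""
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return False
            if self.rank[ra] < self.rank[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1
            return True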

Relevance: 60.00%

Abstract:

In many real-world prediction problems the output is a structured object such as a sequence, a tree or a graph. Such problems range from natural language processing to computational biology and computer vision, and have been tackled using algorithms referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large margin classifiers like support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. The use of additional heuristics makes the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state-of-the-art methods such as the cutting-plane method and stochastic gradient descent (SGD). Further, SDM reaches steady-state generalization performance faster than SGD. The proposed SDM is thus a useful alternative for large-scale structured output learning.
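
The flavour of such sequential dual updates can be illustrated on the much simpler case of a binary linear SVM with L1 loss; the sketch below is plain dual coordinate descent for that case (in the spirit of LIBLINEAR-style solvers), not the structural SVM method proposed in the paper:

    import numpy as np

    def dual_coordinate_descent_svm(X, y, C=1.0, epochs=10, seed=0):
        """Binary linear SVM (L1 loss) trained by sequential dual updates:
        one dual variable alpha_i is optimised at a time while w is kept in sync.
        Labels y must be in {-1, +1}."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        alpha = np.zeros(n)
        w = np.zeros(d)
        q = np.einsum("ij,ij->i", X, X)       # squared norms, diagonal of the Gram matrix
        for _ in range(epochs):
            for i in rng.permutation(n):      # repeated passes over the training set
                if q[i] == 0:
                    continue
                grad = y[i] * w.dot(X[i]) - 1.0
                new_alpha = min(max(alpha[i] - grad / q[i], 0.0), C)
                w += (new_alpha - alpha[i]) * y[i] * X[i]
                alpha[i] = new_alpha
        return w

    # predictions: np.sign(X_test @ w)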

Relevance: 60.00%

Abstract:

There are many applications, such as software for processing customer records in telecom or patient records in hospitals, and email software accessing a single message in a mailbox, that require access to a single record in a database consisting of millions of records. A basic feature of these applications is that they need to access data sets which are very large but simple. Cloud computing provides the computing requirements for this new generation of applications involving very large data sets, which cannot be handled efficiently using traditional computing infrastructure. In this paper, we describe the storage services provided by three well-known cloud service providers and compare their features, with a view to characterizing the storage requirements of very large data sets; we hope this will act as a catalyst for the design of storage services for very large data sets in the future. We also give a brief overview of other kinds of storage that have emerged in the recent past for cloud computing.
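
A minimal sketch of the single-record access pattern described above against a cloud object store, using Amazon S3 via boto3 as one assumed concrete example of such a storage service; the bucket and key names are placeholders:

    import boto3

    def fetch_record(bucket, key):
        """Fetch a single record, stored as one object per record, by its key."""
        s3 = boto3.client("s3")
        response = s3.get_object(Bucket=bucket, Key=key)
        return response["Body"].read()

    # record = fetch_record("customer-records", "telecom/customers/42.json")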

Relevance: 60.00%

Abstract:

The contour tree is a topological abstraction of a scalar field that captures the evolution in level set connectivity. It is an effective representation for visual exploration and analysis of scientific data. We describe a work-efficient, output-sensitive, and scalable parallel algorithm for computing the contour tree of a scalar field defined on a domain that is represented using either an unstructured mesh or a structured grid. A hybrid implementation of the algorithm using the GPU and multi-core CPU can compute the contour tree of an input containing 16 million vertices in less than ten seconds with a speedup factor of up to 13. Experiments based on an implementation in a multi-core CPU environment show near-linear speedup for large data sets.
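
One classic sequential building block behind contour tree computation is the join tree sweep: process vertices in decreasing function value and merge superlevel-set components with union-find. The sketch below shows that sequential step on a graph given by an edge list; it is not the parallel GPU/multi-core algorithm described above:

    def join_tree(values, edges):
        """Compute join tree arcs of a scalar function on a graph.
        values: sequence of vertex function values; edges: list of (u, v) pairs.
        Vertices are swept in decreasing order; union-find tracks superlevel-set
        components, and an arc is recorded whenever a vertex extends or merges them."""
        n = len(values)
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)

        # Total order: break ties in function value by vertex index.
        order = sorted(range(n), key=lambda i: (-values[i], i))
        rank_of = {v: r for r, v in enumerate(order)}
        lowest = {}                   # component root -> lowest vertex seen so far
        arcs = []
        for v in order:
            lowest[v] = v
            for u in adj[v]:
                if rank_of[u] < rank_of[v]:       # u already swept (higher value)
                    ru, rv = find(u), find(v)
                    if ru != rv:
                        arcs.append((lowest[ru], v))  # join-tree arc ending at v
                        parent[ru] = rv
                        lowest[rv] = v
        return arcs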

Relevance: 60.00%

Abstract:

The maximum entropy approach to classification is very well studied in applied statistics and machine learning, and almost all the methods that exist in the literature are discriminative in nature. In this paper, we introduce a generative maximum entropy classification method with feature selection for high-dimensional data such as text data sets. To tackle the curse of dimensionality of large data sets, we employ the conditional independence assumption (naive Bayes) and perform feature selection simultaneously, by enforcing "maximum discrimination" between the estimated class-conditional densities. For two-class problems, the proposed method uses the Jeffreys (J) divergence to discriminate the class-conditional densities. To extend the method to the multi-class case, we propose a new approach based on a multi-distribution divergence: we replace the Jeffreys divergence with the Jensen-Shannon (JS) divergence to discriminate the conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence (JS(GM)) based on the AM-GM inequality, and we show that the resulting divergence is a natural generalization of the Jeffreys divergence to the multiple-distribution case. On the theoretical side, we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using the J-divergence emerges naturally in binary classification. The performance of the proposed algorithms is demonstrated in a comparative study on high-dimensional text and gene-expression data sets, which shows that our methods scale well with dimensionality.
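
A minimal sketch of multi-distribution Jensen-Shannon scoring for feature selection, assuming the class-conditional density of each feature has been estimated as a discrete histogram; this is the plain JS divergence with uniform weights, not the modified JS(GM) variant used in the paper, and hists_for_feature is a hypothetical helper returning one histogram per class:

    import numpy as np

    def entropy(p):
        """Shannon entropy (in bits) of a discrete distribution."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def js_divergence(distributions):
        """Multi-distribution Jensen-Shannon divergence with uniform weights:
        JS(P1, ..., Pk) = H(mean of Pi) - mean of H(Pi)."""
        P = np.asarray(distributions, dtype=float)
        P = P / P.sum(axis=1, keepdims=True)    # normalise each class-conditional histogram
        return entropy(P.mean(axis=0)) - np.mean([entropy(p) for p in P])

    # Rank features by how well their class-conditional histograms discriminate:
    # scores = [js_divergence(hists_for_feature(j)) for j in range(n_features)]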