982 results for datasets


Relevance: 20.00%

Abstract:

The tree index structure is a traditional method for searching for similar data in large datasets. It relies on the assumption that most sub-trees are pruned during the search, so the number of page accesses is reduced. However, time-series datasets generally have very high dimensionality and, because of the so-called dimensionality curse, pruning effectiveness drops in high dimensions. Consequently, the tree index structure is not a suitable method for time-series datasets. In this paper, we propose a two-phase (filtering and refinement) method for searching time-series datasets. In the filtering step, a quantized time-series representation is used to construct a compact file, which is scanned to filter out irrelevant series; the resulting small set of candidates is passed to the second step for refinement. In this step, we introduce an effective index compression method named grid-based datawise dimensionality reduction (DRR), which attempts to preserve the characteristics of the time series. An experimental comparison with existing techniques demonstrates the utility of our approach.
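
The filter-and-refine idea can be illustrated with a minimal sketch (an illustration of the generic two-phase scheme only; the piecewise-mean quantization below is an assumption, not the paper's grid-based DRR): a coarse, quantized version of every series is scanned first, and only candidates that survive this cheap test are compared exactly.

# Minimal filter-and-refine sketch for similarity search over time series.
# The quantization scheme is assumed, not taken from the paper.
import numpy as np

def quantize(series, segments=8):
    """Coarse representation: mean of each of `segments` roughly equal pieces."""
    return np.array([chunk.mean() for chunk in np.array_split(series, segments)])

def filter_and_refine(query, dataset, radius, segments=8):
    q_coarse = quantize(query, segments)
    seg_len = len(query) / segments
    hits = []
    for series in dataset:
        # Filtering step: cheap distance on the compact representation
        # (a lower bound of the Euclidean distance for equal-length segments).
        coarse = quantize(series, segments)
        if np.sqrt(seg_len * np.sum((coarse - q_coarse) ** 2)) > radius:
            continue                      # pruned without an exact comparison
        # Refinement step: exact distance on the few remaining candidates.
        if np.linalg.norm(np.asarray(series) - np.asarray(query)) <= radius:
            hits.append(series)
    return hits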

Relevance: 20.00%

Abstract:

Microarray data provides quantitative information about the transcription profile of cells. To analyze microarray datasets, machine learning methodology has increasingly attracted bioinformatics researchers, and several machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets have extremely high dimensionality, so traditional machine learning methods cannot be applied effectively and efficiently. This paper proposes a robust algorithm that finds rule groups to classify gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form rule groups that identify classes such as cancerous and normal tissues, our algorithm guarantees finding the best k dimensions (genes), i.e. those most discriminative for separating samples of different classes, to form rule groups for the classification of expression datasets. Our experiments show that the rule groups obtained by our algorithm achieve higher accuracy than other classification approaches.
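
As a rough illustration of picking discriminative genes (a simple ranking heuristic with assumed names and scoring, not the guaranteed best-k search described above), one can score every gene by how well it separates the two classes and keep the k highest-scoring ones:

# Illustrative top-k discriminative gene selection by a two-sample score.
# This is a ranking heuristic, not the paper's guaranteed best-k algorithm.
import numpy as np

def top_k_genes(expression, labels, k=10):
    """expression: (samples x genes) matrix; labels: binary array (0 = normal, 1 = cancer)."""
    a, b = expression[labels == 0], expression[labels == 1]
    # Score each gene by the absolute difference of class means, scaled by
    # the summed class standard deviations (a t-statistic-like score).
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-9)
    return np.argsort(score)[::-1][:k]    # indices of the k highest-scoring genes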

Relevance: 20.00%

Abstract:

Most real-world datasets are, to a certain degree, skewed. When they are also large, they become one of the hardest challenges in data analysis. More importantly, we cannot ignore such datasets, as they arise frequently in a wide variety of applications. Regardless of the analysis performed, its effectiveness can often be improved if the characteristics of the dataset are known in advance. In this paper, we propose a novel technique for preprocessing such datasets to obtain this insight. Our work is inspired by the resonance phenomenon, in which similar objects resonate in response to a given response function. The key analytic result of our work is the data terrain, which exposes properties of the dataset that enable effective and efficient analysis. We demonstrate our work in the context of various real-world problems and, in doing so, establish it as a tool for preprocessing data before applying computationally expensive algorithms.

Relevance: 20.00%

Abstract:

Microarray data provides quantitative information about the transcription profile of cells. To analyse microarray datasets, machine learning methodology has increasingly attracted bioinformatics researchers, and several machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets have extremely high dimensionality, so traditional machine learning methods cannot be applied effectively and efficiently. This paper proposes a robust algorithm that finds rule groups to classify gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form rule groups that identify classes such as cancerous and normal tissues, our algorithm guarantees finding the best k dimensions (genes) to form rule groups for the classification of expression datasets. Our experiments show that the rule groups obtained by our algorithm achieve higher accuracy than other classification approaches.

Relevance: 20.00%

Abstract:

This paper introduces a new type of discriminative subgraph pattern, called the breaker emerging subgraph pattern, by introducing three constraints and two new concepts: base and breaker. A breaker emerging subgraph pattern consists of three subpatterns: a constrained emerging subgraph pattern, a set of bases and a set of breakers. An efficient approach is proposed for the discovery of top-k breaker emerging subgraph patterns from graph datasets. Experimental results show that the approach is capable of efficiently discovering top-k breaker emerging subgraph patterns from given datasets and is more efficient than two previous methods for mining discriminative subgraph patterns. The discovered top-k breaker emerging subgraph patterns are more informative, more discriminative, more accurate and more compact than the minimal distinguishing subgraph patterns. The top-k breaker emerging patterns are more useful for substructure analysis, such as molecular fragment analysis. © 2009, Australian Computer Society, Inc.

Relevance: 20.00%

Abstract:

Background: Efficient and reliable surveillance and notification systems are vital for monitoring public health and disease outbreaks. However, most surveillance and notification systems are affected by a degree of underestimation (UE), and therefore uncertainty surrounds the 'true' incidence of disease, affecting morbidity and mortality rates. Surveillance systems fail to capture cases at two distinct levels of the surveillance pyramid: in the community, since not all cases seek healthcare (under-ascertainment), and at the healthcare level, through failure to adequately report symptomatic cases that have sought medical advice (underreporting). There are several methods to estimate the extent of under-ascertainment and underreporting. Methods: Within the context of the ECDC-funded Burden of Communicable Diseases in Europe (BCoDE) project, an extensive literature review was conducted to identify studies that estimate ascertainment or reporting rates for salmonellosis and campylobacteriosis in European Union Member States (MS) plus the European Free Trade Area (EFTA) countries Iceland, Norway and Switzerland, and four other OECD countries (USA, Canada, Australia and Japan). Multiplication factors (MFs), a measure of the magnitude of underestimation, were taken directly from the literature or derived (where the proportion of underestimated, under-ascertained, or underreported cases was known) and compared for the two pathogens. Results: MFs varied between and within diseases and countries, indicating a need to carefully select the most appropriate MFs and the methods for calculating them. The most appropriate MFs are often disease-, country-, age-, and sex-specific. Conclusions: When routine data are used to make decisions on resource allocation or to estimate epidemiological parameters in populations, it becomes important to understand when, where and to what extent these data represent the true picture of disease, and in some instances (such as priority setting) it is necessary to adjust for underestimation. MFs can be used to adjust notification and surveillance data to provide more realistic estimates of incidence. © 2014 Gibbons et al.; licensee BioMed Central Ltd.
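
The adjustment the conclusions refer to is simple scaling; a minimal worked sketch, with purely illustrative numbers rather than figures from the review:

# Minimal sketch of using a multiplication factor (MF) to adjust
# notification data; the numbers below are illustrative assumptions,
# not estimates from the BCoDE review.
def adjust_incidence(notified_cases, multiplication_factor):
    """Estimated 'true' incidence = notified cases scaled by the MF."""
    return notified_cases * multiplication_factor

# Example: 1,000 notified campylobacteriosis cases with an MF of 10
# would correspond to an estimated 10,000 cases in the community.
print(adjust_incidence(1_000, 10))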

Relevance: 20.00%

Abstract:

In big data analysis, frequent itemset mining plays a key role in mining associations, correlations and causality. Because some traditional frequent itemset mining algorithms are unable to handle massive small-file datasets effectively, suffering from high memory cost, high I/O overhead and low computing performance, we propose a novel parallel frequent itemset mining algorithm based on the FP-Growth algorithm and discuss its applications in this paper. First, we introduce a small-files processing strategy for massive small-file datasets to compensate for Hadoop's low read-write speed and low processing efficiency on such data. Moreover, we use MapReduce to redesign the FP-Growth algorithm for parallel computing, thereby improving the overall performance of frequent itemset mining. Finally, we apply the proposed algorithm to association analysis of data from the national college entrance examination and admission process in China. The experimental results show that the proposed algorithm is feasible, achieves good speedup and higher mining efficiency, and meets the practical requirements of frequent itemset mining for massive small-file datasets. © 2014 ISSN 2185-2766.
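
As a simplified illustration of the MapReduce decomposition behind such parallel mining (counting only single items and pairs, not the paper's redesigned FP-Growth; the function names are assumptions):

# Mappers emit candidate itemsets per transaction block; reducers sum the
# partial counts and apply the minimum-support threshold.
from collections import Counter
from itertools import combinations

def map_block(transactions):
    """Mapper: count items and item pairs within one block of transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        counts.update((i,) for i in items)
        counts.update(combinations(items, 2))
    return counts

def reduce_counts(partial_counts, min_support):
    """Reducer: merge partial counts and keep itemsets meeting min_support."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return {itemset: n for itemset, n in total.items() if n >= min_support}

# Example: two blocks of transactions processed independently, then merged.
blocks = [[["a", "b", "c"], ["a", "c"]], [["a", "b"], ["b", "c"], ["a", "c"]]]
print(reduce_counts([map_block(b) for b in blocks], min_support=3))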

Relevance: 20.00%

Abstract:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevance: 20.00%

Abstract:

Zones of mixing between shallow groundwaters of different composition were unravelled by two-way regionalized classification, a technique based on correspondence analysis (CA), cluster analysis (ClA) and discriminant analysis (DA), aided by gridding, map-overlay and contouring tools. The shallow groundwaters are from a granitoid plutonite in the Fundão region (central Portugal). Correspondence analysis detected three natural clusters in the working dataset: 1, weathering; 2, domestic effluents; 3, fertilizers. Cluster analysis produced an alternative distribution of the samples among the three clusters. Group memberships obtained by correspondence analysis and by cluster analysis were optimized by discriminant analysis, and the gridded memberships were coded as follows: codes 1, 2 or 3 were used when classification by correspondence analysis and cluster analysis produced the same result; code 0 when the grid node was first assigned to cluster 1 and then to cluster 2, or vice versa (mixing between weathering and effluents); and code 4 in the other cases (mixing between agriculture and the other influences). Code-3 areas were systematically surrounded by code-4 areas, an observation attributed to hydrodynamic dispersion. Accordingly, the extent of code-4 areas in two orthogonal directions was assumed to be proportional to the longitudinal and transverse dispersivities of local soils. The results (0.7-16.8 and 0.4-4.3 m, respectively) are acceptable at the macroscopic scale. The ratios between longitudinal and transverse dispersivities (1.2-11.1) are also in agreement with results obtained in other studies.
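
The grid-node coding scheme described above can be sketched directly (a minimal illustration; the function and argument names are assumptions):

# Compare the (DA-optimized) memberships from correspondence analysis and
# cluster analysis at each grid node and assign the mixing code.
def code_node(ca_cluster, cla_cluster):
    """ca_cluster, cla_cluster: labels 1 (weathering), 2 (effluents),
    3 (fertilizers) assigned to the same grid node by the two methods."""
    if ca_cluster == cla_cluster:
        return ca_cluster              # codes 1, 2 or 3: both methods agree
    if {ca_cluster, cla_cluster} == {1, 2}:
        return 0                       # mixing between weathering and effluents
    return 4                           # mixing involving the fertilizer cluster

# Example: a node assigned to cluster 3 by CA but cluster 1 by ClA -> code 4.
print(code_node(3, 1))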

Relevance: 20.00%

Abstract:

Traditional pattern recognition techniques cannot handle the classification of large datasets with both efficiency and effectiveness. In this context, the Optimum-Path Forest (OPF) classifier was recently introduced, aiming to achieve high recognition rates at low computational cost. Although OPF is much faster than Support Vector Machines for training, it is slightly slower for classification. In this paper, we present the Efficient OPF (EOPF), an enhanced and faster version of the traditional OPF, and validate it for the automatic recognition of white matter and gray matter in magnetic resonance images of the human brain. © 2010 IEEE.
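
For context, the classification step that EOPF accelerates can be sketched generically, assuming a training set whose optimum-path costs have already been computed; this is a textbook-style illustration of OPF classification with the f_max path-cost function, not the EOPF speed-up itself:

# Generic sketch of OPF classification (not the EOPF optimization):
# a test sample receives the label of the training sample s that
# minimizes max(cost[s], distance(test, s)); trained costs are assumed given.
import numpy as np

def opf_classify(test_sample, train_samples, train_labels, train_costs):
    dists = np.linalg.norm(train_samples - test_sample, axis=1)
    path_costs = np.maximum(train_costs, dists)   # f_max path-cost function
    return train_labels[np.argmin(path_costs)]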

Relevance: 20.00%

Abstract:

In this work, a new approach for supervised pattern recognition is presented that improves the learning algorithm of the Optimum-Path Forest (OPF) classifier, centered on the detection and elimination of outliers in the training set. Outliers are identified by a penalty computed for each training sample from the number of false positive and false negative classifications imputable to it. This approach enhances the accuracy of OPF while still gaining in classification time, at the expense of a slight increase in training time. © 2010 Springer-Verlag.
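
A rough sketch of the penalty-and-prune idea (the penalty weights, threshold and bookkeeping of "imputable" errors are assumptions, not the paper's exact formulation):

# Each training sample accumulates a penalty for the false positive (FP)
# and false negative (FN) classifications imputable to it during an
# evaluation pass; high-penalty samples are removed before retraining.
def compute_penalties(imputed_fp, imputed_fn, w_fp=1.0, w_fn=1.0):
    """imputed_fp / imputed_fn: per-sample counts of FP / FN imputable to it."""
    return [w_fp * fp + w_fn * fn for fp, fn in zip(imputed_fp, imputed_fn)]

def prune_training_set(samples, penalties, max_penalty):
    """Drop training samples whose accumulated penalty exceeds the threshold."""
    return [s for s, p in zip(samples, penalties) if p <= max_penalty]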

Relevance: 20.00%

Abstract:

The main purpose of this work is to report the presence of spurious discontinuities in the pattern of diurnal variation of sea level pressure in three reanalysis datasets: from the National Centers for Environmental Prediction (NCEP) and the National Center for Atmospheric Research (R1), from NCEP and the Department of Energy (R2), and from the European Centre for Medium-Range Weather Forecasts (ERA-40). Such discontinuities can be connected to the major changes in the global observing system that have occurred throughout the reanalysis years. In R1, the period richest in discontinuities is 1956-1958, coinciding with the start of the modern radiosonde observation network. The rapid increase in the density of surface-based observations from 1967 also had an important impact on both R1 and ERA-40, with the larger impact on R1. The reanalyses show discontinuities in the 1970s related to the assimilation of radiances measured by the Vertical Temperature Profile Radiometer and the TIROS-N Operational Vertical Sounders onboard satellites. In ERA-40, which additionally assimilated Special Sensor Microwave/Imager data, there are discontinuities in 1987-1989. R1 also presents further discontinuities in 1988-1993, likely connected to the replacement/introduction of NOAA-series satellites with different biases and to the volcanic eruption of Mount Pinatubo in June 1991, which is known to have severely affected measurements of infrared radiances for several years. The discontinuities in 1996-1998 might be partially connected to a change in radiosonde type, from VIZ-B to VIZ-B2. R2, which covers only the satellite era (from 1979 onwards), shows discontinuities mainly in 1992, 1996-1997 and 2001. The discontinuities in 1992 and 2001 might have been caused by changes in the satellite measurements, and those in 1996-1997 by changes in the land-based observation network. © 2012 Springer-Verlag.