903 resultados para Data selection
Resumo:
In the current Information Age, data production and processing demands are ever increasing. This has motivated the appearance of large-scale distributed information. This phenomenon also applies to Pattern Recognition so that classic and common algorithms, such as the k-Nearest Neighbour, are unable to be used. To improve the efficiency of this classifier, Prototype Selection (PS) strategies can be used. Nevertheless, current PS algorithms were not designed to deal with distributed data, and their performance is therefore unknown under these conditions. This work is devoted to carrying out an experimental study on a simulated framework in which PS strategies can be compared under classical conditions as well as those expected in distributed scenarios. Our results report a general behaviour that is degraded as conditions approach to more realistic scenarios. However, our experiments also show that some methods are able to achieve a fairly similar performance to that of the non-distributed scenario. Thus, although there is a clear need for developing specific PS methodologies and algorithms for tackling these situations, those that reported a higher robustness against such conditions may be good candidates from which to start.
Resumo:
Mode of access: Internet.
Resumo:
"Contributed to the Federal Information Processing Standards Task Group 15 - Computer Systems Security" -t.p.
Resumo:
Non-technical losses (NTL) identification and prediction are important tasks for many utilities. Data from customer information system (CIS) can be used for NTL analysis. However, in order to accurately and efficiently perform NTL analysis, the original data from CIS need to be pre-processed before any detailed NTL analysis can be carried out. In this paper, we propose a feature selection based method for CIS data pre-processing in order to extract the most relevant information for further analysis such as clustering and classifications. By removing irrelevant and redundant features, feature selection is an essential step in data mining process in finding optimal subset of features to improve the quality of result by giving faster time processing, higher accuracy and simpler results with fewer features. Detailed feature selection analysis is presented in the paper. Both time-domain and load shape data are compared based on the accuracy, consistency and statistical dependencies between features.
Resumo:
Data visualization algorithms and feature selection techniques are both widely used in bioinformatics but as distinct analytical approaches. Until now there has been no method of measuring feature saliency while training a data visualization model. We derive a generative topographic mapping (GTM) based data visualization approach which estimates feature saliency simultaneously with the training of the visualization model. The approach not only provides a better projection by modeling irrelevant features with a separate noise model but also gives feature saliency values which help the user to assess the significance of each feature. We compare the quality of projection obtained using the new approach with the projections from traditional GTM and self-organizing maps (SOM) algorithms. The results obtained on a synthetic and a real-life chemoinformatics dataset demonstrate that the proposed approach successfully identifies feature significance and provides coherent (compact) projections. © 2006 IEEE.
Resumo:
We address the important bioinformatics problem of predicting protein function from a protein's primary sequence. We consider the functional classification of G-Protein-Coupled Receptors (GPCRs), whose functions are specified in a class hierarchy. We tackle this task using a novel top-down hierarchical classification system where, for each node in the class hierarchy, the predictor attributes to be used in that node and the classifier to be applied to the selected attributes are chosen in a data-driven manner. Compared with a previous hierarchical classification system selecting classifiers only, our new system significantly reduced processing time without significantly sacrificing predictive accuracy.
Resumo:
The main objective of the project is to enhance the already effective health-monitoring system (HUMS) for helicopters by analysing structural vibrations to recognise different flight conditions directly from sensor information. The goal of this paper is to develop a new method to select those sensors and frequency bands that are best for detecting changes in flight conditions. We projected frequency information to a 2-dimensional space in order to visualise flight-condition transitions using the Generative Topographic Mapping (GTM) and a variant which supports simultaneous feature selection. We created an objective measure of the separation between different flight conditions in the visualisation space by calculating the Kullback-Leibler (KL) divergence between Gaussian mixture models (GMMs) fitted to each class: the higher the KL-divergence, the better the interclass separation. To find the optimal combination of sensors, they were considered in pairs, triples and groups of four sensors. The sensor triples provided the best result in terms of KL-divergence. We also found that the use of a variational training algorithm for the GMMs gave more reliable results.
Resumo:
This paper suggests a data envelopment analysis (DEA) model for selecting the most efficient alternative in advanced manufacturing technology in the presence of both cardinal and ordinal data. The paper explains the problem of using an iterative method for finding the most efficient alternative and proposes a new DEA model without the need of solving a series of LPs. A numerical example illustrates the model, and an application in technology selection with multi-inputs/multi-outputs shows the usefulness of the proposed approach. © 2012 Springer-Verlag London Limited.
Resumo:
Developers of interactive software are confronted by an increasing variety of software tools to help engineer the interactive aspects of software applications. Not only do these tools fall into different categories in terms of functionality, but within each category there is a growing number of competing tools with similar, although not identical, features. Choice of user interface development tool (UIDT) is therefore becoming increasingly complex.
Resumo:
Negative-ion mode electrospray ionization, ESI(-), with Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) was coupled to a Partial Least Squares (PLS) regression and variable selection methods to estimate the total acid number (TAN) of Brazilian crude oil samples. Generally, ESI(-)-FT-ICR mass spectra present a power of resolution of ca. 500,000 and a mass accuracy less than 1 ppm, producing a data matrix containing over 5700 variables per sample. These variables correspond to heteroatom-containing species detected as deprotonated molecules, [M - H](-) ions, which are identified primarily as naphthenic acids, phenols and carbazole analog species. The TAN values for all samples ranged from 0.06 to 3.61 mg of KOH g(-1). To facilitate the spectral interpretation, three methods of variable selection were studied: variable importance in the projection (VIP), interval partial least squares (iPLS) and elimination of uninformative variables (UVE). The UVE method seems to be more appropriate for selecting important variables, reducing the dimension of the variables to 183 and producing a root mean square error of prediction of 0.32 mg of KOH g(-1). By reducing the size of the data, it was possible to relate the selected variables with their corresponding molecular formulas, thus identifying the main chemical species responsible for the TAN values.
Resumo:
The genera Cochliomyia and Chrysomya contain both obligate and saprophagous flies, which allows the comparison of different feeding habits between closely related species. Among the different strategies for comparing these habits is the use of qPCR to investigate the expression levels of candidate genes involved in feeding behavior. To ensure an accurate measure of the levels of gene expression, it is necessary to normalize the amount of the target gene with the amount of a reference gene having a stable expression across the compared species. Since there is no universal gene that can be used as a reference in functional studies, candidate genes for qPCR data normalization were selected and validated in three Calliphoridae (Diptera) species, Cochliomyia hominivorax Coquerel, Cochliomyia macellaria Fabricius, and Chrysomya albiceps Wiedemann . The expression stability of six genes ( Actin, Gapdh, Rp49, Rps17, α -tubulin, and GstD1) was evaluated among species within the same life stage and between life stages within each species. The expression levels of Actin, Gapdh, and Rp49 were the most stable among the selected genes. These genes can be used as reliable reference genes for functional studies in Calliphoridae using similar experimental settings.
Resumo:
Background: Considering the broad variation in the expression of housekeeping genes among tissues and experimental situations, studies using quantitative RT-PCR require strict definition of adequate endogenous controls. For glioblastoma, the most common type of tumor in the central nervous system, there was no previous report regarding this issue. Results: Here we show that amongst seven frequently used housekeeping genes TBP and HPRT1 are adequate references for glioblastoma gene expression analysis. Evaluation of the expression levels of 12 target genes utilizing different endogenous controls revealed that the normalization method applied might introduce errors in the estimation of relative quantities. Genes presenting expression levels which do not significantly differ between tumor and normal tissues can be considered either increased or decreased if unsuitable reference genes are applied. Most importantly, genes showing significant differences in expression levels between tumor and normal tissues can be missed. We also demonstrated that the Holliday Junction Recognizing Protein, a novel DNA repair protein over expressed in lung cancer, is extremely over-expressed in glioblastoma, with a median change of about 134 fold. Conclusion: Altogether, our data show the relevance of previous validation of candidate control genes for each experimental model and indicate TBP plus HPRT1 as suitable references for studies on glioblastoma gene expression.
Resumo:
A variety of factors influence prey selection by predators. Because Barn Owls (Tyto alba) and Burrowing Owls (Athene cunicularia) differ in size and foraging tactics, we expected differential predation on small mammal prey. We hypothesized that the Barn Owl, all active predator, would prey on smaller and younger individuals than the Burrowing Owl, a sit-and-wait predator. We used pellet analyses to evaluate selection of small mammals by the two owls in relation to prey), species, age, and size at the Ecological Station of Itirapina, state of Sao Paulo, in southeastern Brazil. Small mammals constituted most of the prey individuals and biomass in the diet of Barn Owls. Although Burrowing Owls consumed a wider range of taxa, small mammals represented one-third of all biomass consumed. With respect. to small mammals, Barn Owls foraged selectively relative to prey species, size, and age. Burrowing Owls foraged opportunistically relative to prey species, but selectively relative to prey size and age. Barn Owls selected smaller and younger (juvenile and subadult) individuals of the delicate vesper mouse (Calomys tener) and Burrowing Owls preyed more oil larger and older (subadult only) individuals. morphology and behavior of both prey and predators may explain this differential predation. Our data suggest that the active predator feeds oil smaller and younger prey, and the sit-and-wait predator took relatively larger and older prey.
Resumo:
Background: The malaria parasite Plasmodium falciparum exhibits abundant genetic diversity, and this diversity is key to its success as a pathogen. Previous efforts to study genetic diversity in P. falciparum have begun to elucidate the demographic history of the species, as well as patterns of population structure and patterns of linkage disequilibrium within its genome. Such studies will be greatly enhanced by new genomic tools and recent large-scale efforts to map genomic variation. To that end, we have developed a high throughput single nucleotide polymorphism (SNP) genotyping platform for P. falciparum. Results: Using an Affymetrix 3,000 SNP assay array, we found roughly half the assays (1,638) yielded high quality, 100% accurate genotyping calls for both major and minor SNP alleles. Genotype data from 76 global isolates confirm significant genetic differentiation among continental populations and varying levels of SNP diversity and linkage disequilibrium according to geographic location and local epidemiological factors. We further discovered that nonsynonymous and silent (synonymous or noncoding) SNPs differ with respect to within-population diversity, interpopulation differentiation, and the degree to which allele frequencies are correlated between populations. Conclusions: The distinct population profile of nonsynonymous variants indicates that natural selection has a significant influence on genomic diversity in P. falciparum, and that many of these changes may reflect functional variants deserving of follow-up study. Our analysis demonstrates the potential for new high-throughput genotyping technologies to enhance studies of population structure, natural selection, and ultimately enable genome-wide association studies in P. falciparum to find genes underlying key phenotypic traits.
Resumo:
Background: Feature selection is a pattern recognition approach to choose important variables according to some criteria in order to distinguish or explain certain phenomena (i.e., for dimensionality reduction). There are many genomic and proteomic applications that rely on feature selection to answer questions such as selecting signature genes which are informative about some biological state, e. g., normal tissues and several types of cancer; or inferring a prediction network among elements such as genes, proteins and external stimuli. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions have been proposed, although it is difficult to point the best solution for each application. Results: The intent of this work is to provide an open-source multiplataform graphical environment for bioinformatics problems, which supports many feature selection algorithms, criterion functions and graphic visualization tools such as scatterplots, parallel coordinates and graphs. A feature selection approach for growing genetic networks from seed genes ( targets or predictors) is also implemented in the system. Conclusion: The proposed feature selection environment allows data analysis using several algorithms, criterion functions and graphic visualization tools. Our experiments have shown the software effectiveness in two distinct types of biological problems. Besides, the environment can be used in different pattern recognition applications, although the main concern regards bioinformatics tasks.