921 resultados para Document classification,Naive Bayes classifier,Verb-object pairs


Relevância:

30.00% 30.00%

Publicador:

Resumo:

Statistical computing when input/output is driven by a Graphical User Interface is considered. A proposal is made for automatic control ofcomputational flow to ensure that only strictly required computationsare actually carried on. The computational flow is modeled by a directed graph for implementation in any object-oriented programming language with symbolic manipulation capabilities. A complete implementation example is presented to compute and display frequency based piecewise linear density estimators such as histograms or frequency polygons.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We present a simple randomized procedure for the prediction of a binary sequence. The algorithm uses ideas from recent developments of the theory of the prediction of individual sequences. We show that if thesequence is a realization of a stationary and ergodic random process then the average number of mistakes converges, almost surely, to that of the optimum, given by the Bayes predictor.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The classical binary classification problem is investigatedwhen it is known in advance that the posterior probability function(or regression function) belongs to some class of functions. We introduceand analyze a method which effectively exploits this knowledge. The methodis based on minimizing the empirical risk over a carefully selected``skeleton'' of the class of regression functions. The skeleton is acovering of the class based on a data--dependent metric, especiallyfitted for classification. A new scale--sensitive dimension isintroduced which is more useful for the studied classification problemthan other, previously defined, dimension measures. This fact isdemonstrated by performance bounds for the skeleton estimate in termsof the new dimension.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this paper, we propose two active learning algorithms for semiautomatic definition of training samples in remote sensing image classification. Based on predefined heuristics, the classifier ranks the unlabeled pixels and automatically chooses those that are considered the most valuable for its improvement. Once the pixels have been selected, the analyst labels them manually and the process is iterated. Starting with a small and nonoptimal training set, the model itself builds the optimal set of samples which minimizes the classification error. We have applied the proposed algorithms to a variety of remote sensing data, including very high resolution and hyperspectral images, using support vector machines. Experimental results confirm the consistency of the methods. The required number of training samples can be reduced to 10% using the methods proposed, reaching the same level of accuracy as larger data sets. A comparison with a state-of-the-art active learning method, margin sampling, is provided, highlighting advantages of the methods proposed. The effect of spatial resolution and separability of the classes on the quality of the selection of pixels is also discussed.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper discusses the analysis of cases in which the inclusion or exclusion of a particular suspect, as a possible contributor to a DNA mixture, depends on the value of a variable (the number of contributors) that cannot be determined with certainty. It offers alternative ways to deal with such cases, including sensitivity analysis and object-oriented Bayesian networks, that separate uncertainty about the inclusion of the suspect from uncertainty about other variables. The paper presents a case study in which the value of DNA evidence varies radically depending on the number of contributors to a DNA mixture: if there are two contributors, the suspect is excluded; if there are three or more, the suspect is included; but the number of contributors cannot be determined with certainty. It shows how an object-oriented Bayesian network can accommodate and integrate varying perspectives on the unknown variable and how it can reduce the potential for bias by directing attention to relevant considerations and distinguishing different sources of uncertainty. It also discusses the challenge of presenting such evidence to lay audiences.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

BACKGROUND: Several studies have established Glioblastoma Multiforme (GBM) prognostic and predictive models based on age and Karnofsky Performance Status (KPS), while very few studies evaluated the prognostic and predictive significance of preoperative MR-imaging. However, to date, there is no simple preoperative GBM classification that also correlates with a highly prognostic genomic signature. Thus, we present for the first time a biologically relevant, and clinically applicable tumor Volume, patient Age, and KPS (VAK) GBM classification that can easily and non-invasively be determined upon patient admission. METHODS: We quantitatively analyzed the volumes of 78 GBM patient MRIs present in The Cancer Imaging Archive (TCIA) corresponding to patients in The Cancer Genome Atlas (TCGA) with VAK annotation. The variables were then combined using a simple 3-point scoring system to form the VAK classification. A validation set (N = 64) from both the TCGA and Rembrandt databases was used to confirm the classification. Transcription factor and genomic correlations were performed using the gene pattern suite and Ingenuity Pathway Analysis. RESULTS: VAK-A and VAK-B classes showed significant median survival differences in discovery (P = 0.007) and validation sets (P = 0.008). VAK-A is significantly associated with P53 activation, while VAK-B shows significant P53 inhibition. Furthermore, a molecular gene signature comprised of a total of 25 genes and microRNAs was significantly associated with the classes and predicted survival in an independent validation set (P = 0.001). A favorable MGMT promoter methylation status resulted in a 10.5 months additional survival benefit for VAK-A compared to VAK-B patients. CONCLUSIONS: The non-invasively determined VAK classification with its implication of VAK-specific molecular regulatory networks, can serve as a very robust initial prognostic tool, clinical trial selection criteria, and important step toward the refinement of genomics-based personalized therapy for GBM patients.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Among the types of remote sensing acquisitions, optical images are certainly one of the most widely relied upon data sources for Earth observation. They provide detailed measurements of the electromagnetic radiation reflected or emitted by each pixel in the scene. Through a process termed supervised land-cover classification, this allows to automatically yet accurately distinguish objects at the surface of our planet. In this respect, when producing a land-cover map of the surveyed area, the availability of training examples representative of each thematic class is crucial for the success of the classification procedure. However, in real applications, due to several constraints on the sample collection process, labeled pixels are usually scarce. When analyzing an image for which those key samples are unavailable, a viable solution consists in resorting to the ground truth data of other previously acquired images. This option is attractive but several factors such as atmospheric, ground and acquisition conditions can cause radiometric differences between the images, hindering therefore the transfer of knowledge from one image to another. The goal of this Thesis is to supply remote sensing image analysts with suitable processing techniques to ensure a robust portability of the classification models across different images. The ultimate purpose is to map the land-cover classes over large spatial and temporal extents with minimal ground information. To overcome, or simply quantify, the observed shifts in the statistical distribution of the spectra of the materials, we study four approaches issued from the field of machine learning. First, we propose a strategy to intelligently sample the image of interest to collect the labels only in correspondence of the most useful pixels. This iterative routine is based on a constant evaluation of the pertinence to the new image of the initial training data actually belonging to a different image. Second, an approach to reduce the radiometric differences among the images by projecting the respective pixels in a common new data space is presented. We analyze a kernel-based feature extraction framework suited for such problems, showing that, after this relative normalization, the cross-image generalization abilities of a classifier are highly increased. Third, we test a new data-driven measure of distance between probability distributions to assess the distortions caused by differences in the acquisition geometry affecting series of multi-angle images. Also, we gauge the portability of classification models through the sequences. In both exercises, the efficacy of classic physically- and statistically-based normalization methods is discussed. Finally, we explore a new family of approaches based on sparse representations of the samples to reciprocally convert the data space of two images. The projection function bridging the images allows a synthesis of new pixels with more similar characteristics ultimately facilitating the land-cover mapping across images.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

BACKGROUND: The majority of Haemosporida species infect birds or reptiles, but many important genera, including Plasmodium, infect mammals. Dipteran vectors shared by avian, reptilian and mammalian Haemosporida, suggest multiple invasions of Mammalia during haemosporidian evolution; yet, phylogenetic analyses have detected only a single invasion event. Until now, several important mammal-infecting genera have been absent in these analyses. This study focuses on the evolutionary origin of Polychromophilus, a unique malaria genus that only infects bats (Microchiroptera) and is transmitted by bat flies (Nycteribiidae). METHODS: Two species of Polychromophilus were obtained from wild bats caught in Switzerland. These were molecularly characterized using four genes (asl, clpc, coI, cytb) from the three different genomes (nucleus, apicoplast, mitochondrion). These data were then combined with data of 60 taxa of Haemosporida available in GenBank. Bayesian inference, maximum likelihood and a range of rooting methods were used to test specific hypotheses concerning the phylogenetic relationships between Polychromophilus and the other haemosporidian genera. RESULTS: The Polychromophilus melanipherus and Polychromophilus murinus samples show genetically distinct patterns and group according to species. The Bayesian tree topology suggests that the monophyletic clade of Polychromophilus falls within the avian/saurian clade of Plasmodium and directed hypothesis testing confirms the Plasmodium origin. CONCLUSION: Polychromophilus' ancestor was most likely a bird- or reptile-infecting Plasmodium before it switched to bats. The invasion of mammals as hosts has, therefore, not been a unique event in the evolutionary history of Haemosporida, despite the suspected costs of adapting to a new host. This was, moreover, accompanied by a switch in dipteran host.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Some populations of Pogonomyrmex harvester ants comprise genetically differentiated pairs of interbreeding lineages. Queens mate with males of their own and of the alternate lineage and produce pure-lineage offspring which develop into queens and inter-lineage offspring which develop into workers. Here we tested whether such genetic caste determination is associated with costs in terms of the ability to optimally allocate resources to the production of queens and workers. During the stage of colony founding, when only workers are produced, queens laid a high proportion of pure-lineage eggs but the large majority of these eggs failed to develop. As a consequence, the number of offspring produced by incipient colonies decreased linearly with the proportion of pure-lineage eggs laid by queens. Moreover, queens of the lineage most commonly represented in a given mating flight produced more pure-lineage eggs, in line with the view that they mate randomly with the two types of males and indiscriminately use their sperm. Altogether these results predict frequency-dependent selection on pairs of lineages because queens of the more common lineage will produce more pure-lineage eggs and their colonies be less successful during the stage of colony founding, which may be an important force maintaining the coexistence of pairs of lineages within populations.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Soil surveys are the main source of spatial information on soils and have a range of different applications, mainly in agriculture. The continuity of this activity has however been severely compromised, mainly due to a lack of governmental funding. The purpose of this study was to evaluate the feasibility of two different classifiers (artificial neural networks and a maximum likelihood algorithm) in the prediction of soil classes in the northwest of the state of Rio de Janeiro. Terrain attributes such as elevation, slope, aspect, plan curvature and compound topographic index (CTI) and indices of clay minerals, iron oxide and Normalized Difference Vegetation Index (NDVI), derived from Landsat 7 ETM+ sensor imagery, were used as discriminating variables. The two classifiers were trained and validated for each soil class using 300 and 150 samples respectively, representing the characteristics of these classes in terms of the discriminating variables. According to the statistical tests, the accuracy of the classifier based on artificial neural networks (ANNs) was greater than of the classic Maximum Likelihood Classifier (MLC). Comparing the results with 126 points of reference showed that the resulting ANN map (73.81 %) was superior to the MLC map (57.94 %). The main errors when using the two classifiers were caused by: a) the geological heterogeneity of the area coupled with problems related to the geological map; b) the depth of lithic contact and/or rock exposure, and c) problems with the environmental correlation model used due to the polygenetic nature of the soils. This study confirms that the use of terrain attributes together with remote sensing data by an ANN approach can be a tool to facilitate soil mapping in Brazil, primarily due to the availability of low-cost remote sensing data and the ease by which terrain attributes can be obtained.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Résumé Suite aux recentes avancées technologiques, les archives d'images digitales ont connu une croissance qualitative et quantitative sans précédent. Malgré les énormes possibilités qu'elles offrent, ces avancées posent de nouvelles questions quant au traitement des masses de données saisies. Cette question est à la base de cette Thèse: les problèmes de traitement d'information digitale à très haute résolution spatiale et/ou spectrale y sont considérés en recourant à des approches d'apprentissage statistique, les méthodes à noyau. Cette Thèse étudie des problèmes de classification d'images, c'est à dire de catégorisation de pixels en un nombre réduit de classes refletant les propriétés spectrales et contextuelles des objets qu'elles représentent. L'accent est mis sur l'efficience des algorithmes, ainsi que sur leur simplicité, de manière à augmenter leur potentiel d'implementation pour les utilisateurs. De plus, le défi de cette Thèse est de rester proche des problèmes concrets des utilisateurs d'images satellite sans pour autant perdre de vue l'intéret des méthodes proposées pour le milieu du machine learning dont elles sont issues. En ce sens, ce travail joue la carte de la transdisciplinarité en maintenant un lien fort entre les deux sciences dans tous les développements proposés. Quatre modèles sont proposés: le premier répond au problème de la haute dimensionalité et de la redondance des données par un modèle optimisant les performances en classification en s'adaptant aux particularités de l'image. Ceci est rendu possible par un système de ranking des variables (les bandes) qui est optimisé en même temps que le modèle de base: ce faisant, seules les variables importantes pour résoudre le problème sont utilisées par le classifieur. Le manque d'information étiquétée et l'incertitude quant à sa pertinence pour le problème sont à la source des deux modèles suivants, basés respectivement sur l'apprentissage actif et les méthodes semi-supervisées: le premier permet d'améliorer la qualité d'un ensemble d'entraînement par interaction directe entre l'utilisateur et la machine, alors que le deuxième utilise les pixels non étiquetés pour améliorer la description des données disponibles et la robustesse du modèle. Enfin, le dernier modèle proposé considère la question plus théorique de la structure entre les outputs: l'intègration de cette source d'information, jusqu'à présent jamais considérée en télédétection, ouvre des nouveaux défis de recherche. Advanced kernel methods for remote sensing image classification Devis Tuia Institut de Géomatique et d'Analyse du Risque September 2009 Abstract The technical developments in recent years have brought the quantity and quality of digital information to an unprecedented level, as enormous archives of satellite images are available to the users. However, even if these advances open more and more possibilities in the use of digital imagery, they also rise several problems of storage and treatment. The latter is considered in this Thesis: the processing of very high spatial and spectral resolution images is treated with approaches based on data-driven algorithms relying on kernel methods. In particular, the problem of image classification, i.e. the categorization of the image's pixels into a reduced number of classes reflecting spectral and contextual properties, is studied through the different models presented. The accent is put on algorithmic efficiency and the simplicity of the approaches proposed, to avoid too complex models that would not be used by users. The major challenge of the Thesis is to remain close to concrete remote sensing problems, without losing the methodological interest from the machine learning viewpoint: in this sense, this work aims at building a bridge between the machine learning and remote sensing communities and all the models proposed have been developed keeping in mind the need for such a synergy. Four models are proposed: first, an adaptive model learning the relevant image features has been proposed to solve the problem of high dimensionality and collinearity of the image features. This model provides automatically an accurate classifier and a ranking of the relevance of the single features. The scarcity and unreliability of labeled. information were the common root of the second and third models proposed: when confronted to such problems, the user can either construct the labeled set iteratively by direct interaction with the machine or use the unlabeled data to increase robustness and quality of the description of data. Both solutions have been explored resulting into two methodological contributions, based respectively on active learning and semisupervised learning. Finally, the more theoretical issue of structured outputs has been considered in the last model, which, by integrating outputs similarity into a model, opens new challenges and opportunities for remote sensing image processing.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper extends previous research [1] on the use of multivariate continuous data in comparative handwriting examinations, notably for gender classification. A database has been constructed by analyzing the contour shape of loop characters of type a and d by means of Fourier analysis, which allows characters to be described in a global way by a set of variables (e.g., Fourier descriptors). Sample handwritings were collected from right- and left-handed female and male writers. The results reported in this paper provide further arguments in support of the view that investigative settings in forensic science represent an area of application for which the Bayesian approach offers a logical framework. In particular, the Bayes factor is computed for settings that focus on inference of gender and handedness of the author of an incriminated handwritten text. An emphasis is placed on comparing the efficiency for investigative purposes of characters a and d.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This document Classifications and Pay Plans is produced by the State of Iowa Executive Branch, Department of Administrative Services. Informational document about the pay plan codes and classification codes, how to use them.