978 results for Multi-class steganalysis
Abstract:
This study shows the possibilities offered by modern ultra-high performance supercritical fluid chromatography (SFC) combined with tandem mass spectrometry in doping control analysis. A high-throughput screening method was developed for 100 substances belonging to the challenging classes of anabolic agents, hormones and metabolic modulators, synthetic cannabinoids and glucocorticoids, which must be detected at low concentrations in urine. To selectively extract these doping agents from urine, a supported liquid extraction procedure was implemented in a 48-well plate format. At the tested concentration levels, ranging from 0.5 to 5 ng/mL, recoveries were better than 70% for 48-68% of the compounds and higher than 50% for 83-87% of the tested substances. Because of the numerous interferences related to steroid isomers and to ions produced by the loss of water in the electrospray source, the choice of SFC separation conditions was very challenging. After careful optimization, a Diol stationary phase was employed. The total analysis time for the screening assay was only 8 min, and interferences as well as susceptibility to matrix effect (ME) were minimized. With the developed method, about 70% of the compounds had a relative ME within ±20% at concentrations of 1 and 5 ng/mL. Finally, the limits of detection (LODs) achieved with the above-described strategy, including 5-fold preconcentration, were below 0.1 ng/mL for the majority of the tested compounds. The LODs were therefore systematically better than the minimum required performance levels established by the World Anti-Doping Agency, except for a very few metabolites.
Abstract:
There are numerous text documents available in electronic form, and more are becoming available every day. Such documents represent a massive amount of easily accessible information. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy of such systems, and our understanding of them, greatly influence their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, to improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways to extend and improve it. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem that explains the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.
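As an illustration of the Error-Correcting Output Codes idea mentioned in this abstract, the following is a minimal sketch of the decoding step only. It assumes each bit comes from a separate binary classifier (for example, one Naive Bayes model per code column); the 7-bit code matrix, the class names and the function names are hypothetical and are not taken from the paper.

```python
def hamming(a, b):
    """Number of positions in which two equal-length bit tuples differ."""
    return sum(u != v for u, v in zip(a, b))


def ecoc_decode(code_matrix, bits):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(code_matrix, key=lambda label: hamming(code_matrix[label], bits))


# Hypothetical code for four classes. Every pair of codewords differs in
# four positions, so any single wrong binary prediction can be corrected.
CODE = {
    "a": (1, 1, 1, 1, 1, 1, 1),
    "b": (0, 0, 0, 0, 1, 1, 1),
    "c": (0, 0, 1, 1, 0, 0, 1),
    "d": (0, 1, 0, 1, 0, 1, 0),
}

# Outputs of the seven binary classifiers for one document; the last bit
# is wrong with respect to the codeword of class "c".
predicted_bits = (0, 0, 1, 1, 0, 0, 0)
decoded = ecoc_decode(CODE, predicted_bits)
```

Because the nearest codeword is chosen by Hamming distance, an error in one binary classifier does not necessarily change the multiclass decision.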
Abstract:
In this paper, an improved stochastic discrimination (SD) method is introduced to reduce the error rate of the standard SD in multi-class classification problems. The learning procedure of the improved SD consists of two stages. In the first stage, a standard SD with a shorter learning period is carried out to identify an important space where all the misclassified samples are located. In the second stage, the standard SD is modified by (i) restricting sampling to the important space; and (ii) introducing a new discriminant function for samples in the important space. It is shown by mathematical derivation that the new discriminant function has the same mean as, but a smaller variance than, that of the standard SD for samples in the important space. The analysis also shows that the smaller the variance of the discriminant function, the lower the error rate of the classifier. Consequently, the proposed improved SD achieves higher classification accuracy than the standard SD. Illustrative examples are provided to demonstrate the effectiveness of the proposed improved SD.
Abstract:
The family of Boosting algorithms represents a type of classification and regression approach that has proven to be very effective in Computer Vision problems. Such is the case for detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular algorithm, AdaBoost, was introduced in the context of binary classification. Since then, many works have been proposed to extend it to more general domains: multi-class, multi-label, cost-sensitive, etc. Our interest is centered on extending AdaBoost to two problems in the multi-class field, considering it a first step toward further generalizations.
In this dissertation we propose two Boosting algorithms for multi-class classification based on new generalizations of the concept of margin. The first of them, PIBoost, is conceived to tackle the multi-class problem by solving many binary sub-problems. We use a vectorial codification to represent class labels and a multi-class exponential loss function to evaluate classifier responses. This representation produces a set of margin values that provide a range of penalties for failures and rewards for successes. The stagewise optimization of this model introduces an asymmetric Boosting procedure whose costs depend on the number of classes separated by each weak learner. In this way the Boosting procedure takes class imbalances into account when building the ensemble. The resulting algorithm is a well-grounded method that canonically extends the original AdaBoost. The second algorithm proposed, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the scarcity of cost-sensitive extensions of AdaBoost to the multi-class field, we propose a new margin that, in turn, yields a new loss function appropriate for evaluating costs. Since BAdaCost generalizes the SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms, we consider our algorithm a canonical extension of AdaBoost to this kind of problem. We additionally suggest a simple procedure to compute cost matrices that improve the performance of Boosting in standard and unbalanced problems. A set of experiments is carried out to demonstrate the effectiveness of both methods against other relevant Boosting algorithms in their respective areas. In the experiments we use benchmark data sets from the Machine Learning community, first for minimizing classification errors and second for minimizing costs. In addition, we successfully applied BAdaCost to a segmentation task, a particular case of a problem with imbalanced data.
We conclude the thesis by justifying the horizon of future improvements encompassed by our framework, owing to its applicability and theoretical flexibility.
Abstract:
Two algorithms based on Bayesian Networks (BNs) for bacterial subcellular location prediction are explored in this paper: one predicts all locations for Gram+ bacteria and the other all locations for Gram- bacteria. Methods were evaluated using different numbers of residues (from the N-terminal 10 residues to the whole sequence) and different residue representations (amino acid composition, percentage amino acid composition or normalised amino acid composition). The accuracy of the best resulting BN was compared to PSORTB. The accuracy of this multi-location BN was roughly comparable to PSORTB; the difference in predictions is low, often less than 2%. The BN method thus represents both an important new avenue of methodological development for subcellular location prediction and a potentially valuable new tool for candidate subunit vaccine selection.
Abstract:
This work proposes a new hybrid system for multi-class sentiment analysis based on the General Inquirer (GI) dictionary and a hierarchical approach to the Logistic Model Tree (LMT) classifier. The new system consists of three layers: the Bipolar Layer (BL), which consists of one LMT (LMT-1) for classifying sentiment polarity; the Intensity Layer (IL), which comprises two LMTs (LMT-2 and LMT-3) that separately detect three intensities of positive sentiment and three intensities of negative sentiment; and the Grouping Layer (GL), used only in the construction phase, which groups the positive and the negative instances by means of two k-means clusterings, respectively. In the Pre-processing phase, texts are segmented into words, which are tagged, reduced to their stems and finally submitted to the GI dictionary in order to count and label only the verbs, nouns, adjectives and adverbs with 24 markers that are then used to compute the feature vectors. In the Sentiment Classification phase, the feature vectors are first fed to LMT-1 and then grouped in GL according to class label; these groups are then labeled manually, and finally the positive instances are fed to LMT-2 and the negative instances to LMT-3. The three trees are trained and evaluated on the Movie Review and SenTube datasets using stratified 10-fold cross-validation. LMT-1 yields a tree with 48 leaves and size 95, with 90.88% accuracy, while LMT-2 and LMT-3 each yield a tree with one leaf and size one, with 99.28% and 99.37% accuracy, respectively. The experiments show that the proposed hierarchical classification methodology performs better than other prevailing approaches.
Abstract:
SUMMARY: A top scoring pair (TSP) classifier consists of a pair of variables whose relative ordering can be used to accurately predict the class label of a sample. This classification rule has the advantage of being easily interpretable and more robust against technical variations in data, such as those due to different microarray platforms. Here we describe a parallel implementation of this classifier, which significantly reduces the training time, and a number of extensions, including a multi-class approach, which has the potential to improve classification performance. AVAILABILITY AND IMPLEMENTATION: Full C++ source code and the R package Rgtsp are freely available from http://lausanne.isb-sib.ch/~vpopovic/research/. The implementation relies on existing OpenMP libraries.
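The pairwise ordering rule described in this abstract can be sketched for the two-class case as follows. This is a minimal serial illustration under stated assumptions, not the parallel Rgtsp implementation: the function names and the toy data are hypothetical, and the score used is the usual difference between the per-class frequencies of the event x[i] < x[j].

```python
from itertools import combinations


def pair_probs(X, y, i, j):
    """Per-class frequency of the ordering x[i] < x[j] for labels 0 and 1."""
    probs = []
    for label in (0, 1):
        rows = [x for x, c in zip(X, y) if c == label]
        probs.append(sum(x[i] < x[j] for x in rows) / len(rows))
    return probs


def fit_tsp(X, y):
    """Pick the pair whose ordering frequencies differ most between classes."""
    def score(pair):
        p0, p1 = pair_probs(X, y, *pair)
        return abs(p0 - p1)

    # max() keeps the first pair encountered when several pairs tie.
    best = max(combinations(range(len(X[0])), 2), key=score)
    p0, p1 = pair_probs(X, y, *best)
    label_if_less = 0 if p0 > p1 else 1
    return best, label_if_less


def predict_tsp(model, x):
    """Predict from the relative ordering of the two selected variables."""
    (i, j), label_if_less = model
    return label_if_less if x[i] < x[j] else 1 - label_if_less


# Toy data: four samples, three variables, two classes.
X = [[1, 5, 3], [2, 6, 1], [5, 1, 2], [6, 2, 4]]
y = [0, 0, 1, 1]
model = fit_tsp(X, y)
```

Because the rule depends only on which of two values is larger, it is unchanged by any monotone rescaling of a sample, which is one way to read the robustness claim above.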
Abstract:
The purpose of our project is to contribute to earlier diagnosis of AD and better estimates of its severity by using automatic analysis performed through new biomarkers extracted with non-invasive intelligent methods. The methods selected in this case are speech biomarkers oriented to Spontaneous Speech and Emotional Response Analysis. Thus, the main goal of the present work is feature search in Spontaneous Speech, oriented to pre-clinical evaluation, in order to define a test for AD diagnosis using a one-class classifier. The one-class classification problem differs from the multi-class one in one essential aspect: in one-class classification it is assumed that only information about one of the classes, the target class, is available. In this work we explore the problem of imbalanced datasets, which is particularly crucial in applications where the goal is to maximize recognition of the minority class, as in medical diagnosis. The use of information about outliers and of Fractal Dimension features improves the system performance.
Abstract:
Metabolomics, one of the most rapidly growing technologies in the "-omics" field, denotes the comprehensive analysis of low-molecular-weight compounds and their pathways. Cancer-specific alterations of the metabolome can be detected by high-throughput mass-spectrometric metabolite profiling and serve as a considerable source of new markers for the early differentiation of malignant diseases as well as their distinction from benign states. However, a comprehensive framework for the statistical evaluation of marker panels in a multi-class setting has not yet been established. We collected serum samples from 40 pancreatic carcinoma patients, 40 controls, and 23 pancreatitis patients according to standard protocols and generated amino acid profiles by routine mass spectrometry. In an intrinsic three-class bioinformatic approach we compared these profiles, evaluated their selectivity and computed multi-marker panels combined with the conventional tumor marker CA 19-9. Additionally, we tested for non-inferiority and superiority to determine the diagnostic surplus value of our multi-metabolite marker panels. Compared to CA 19-9 alone, the combined amino acid-based metabolite panel had a superior selectivity for the discrimination of healthy controls, pancreatitis, and pancreatic carcinoma patients [Formula: see text]. We combined highly standardized samples, a three-class study design, a high-throughput mass-spectrometric technique, and a comprehensive bioinformatic framework to identify metabolite panels selective for all three groups in a single approach. Our results suggest that metabolomic profiling necessitates appropriate evaluation strategies and, despite all its current limitations, can deliver marker panels with high selectivity even in multi-class settings.
Abstract:
Sentiment analysis has long focused on binary classification of text as either positive or negative. There has been little work on mapping sentiments or emotions into multiple dimensions. This paper studies a Bayesian modeling approach to multi-class sentiment classification and multi-dimensional sentiment distribution prediction. It proposes effective mechanisms to incorporate supervised information, such as labeled feature constraints and document-level sentiment distributions derived from the training data, into model learning. We evaluated our approach on datasets collected from the confession section of the Experience Project website, where people share their life experiences and personal stories. Our results show that using the latent representation of the training documents derived from our approach as features to build a maximum entropy classifier outperforms other approaches on multi-class sentiment classification. In the more difficult task of multi-dimensional sentiment distribution prediction, our approach gives superior performance compared to a few competitive baselines. © 2012 ACM.
Abstract:
In the Hammersley-Aldous-Diaconis process, infinitely many particles sit in R and at most one particle is allowed at each position. A particle at x, whose nearest neighbor to the right is at y, jumps at rate y - x to a position uniformly distributed in the interval (x, y). The basic coupling between trajectories with different initial configurations induces a process with different classes of particles. We show that the invariant measures for the two-class process can be obtained as follows. First, a stationary M/M/1 queue is constructed as a function of two homogeneous Poisson processes, the arrivals with rate lambda, and the (attempted) services with rate rho > lambda. Then put first class particles at the instants of departures (effective services) and second class particles at the instants of unused services. The procedure is generalized to the n-class case by using n - 1 queues in tandem with n - 1 priority types of customers. A multi-line process is introduced; it consists of a coupling (different from Liggett's basic coupling) that has the product of Poisson processes as its invariant measure. The definition of the multi-line process involves the dual points of the space-time Poisson process used in the graphical construction of the reversed process. The coupled process is a transformation of the multi-line process, and its invariant measure is the transformation described above of the product measure.
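The two-class construction described in this abstract (departures give first class particles, unused services give second class particles) can be sketched on a finite window as below. This is only an illustration under simplifying assumptions: the queue starts empty at time 0, so it is not the stationary queue of the paper, and the function names, rates and horizon are hypothetical.

```python
import random


def poisson_times(rate, horizon, rng):
    """Points of a homogeneous Poisson process of the given rate on (0, horizon]."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)  # independent exponential gaps
        if t > horizon:
            return times
        times.append(t)


def two_class_points(arrivals, services, initial_queue=0):
    """Split service instants into departures and unused services.

    Departures (queue non-empty) become first class particles;
    unused services (queue empty) become second class particles.
    """
    events = sorted([(t, "a") for t in arrivals] + [(t, "s") for t in services])
    queue = initial_queue
    first, second = [], []
    for t, kind in events:
        if kind == "a":
            queue += 1
        elif queue > 0:
            queue -= 1
            first.append(t)
        else:
            second.append(t)
    return first, second


# Small deterministic example.
first, second = two_class_points([1.0, 2.0], [0.5, 1.5, 1.7, 2.5, 3.0])

# Random example with arrival rate lambda = 1 and service rate rho = 2.
rng = random.Random(0)
arr = poisson_times(1.0, 100.0, rng)
srv = poisson_times(2.0, 100.0, rng)
first_rand, second_rand = two_class_points(arr, srv)
```

Every attempted service is either a departure or an unused service, so the two output lists always partition the service instants, and there can never be more departures than arrivals.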
Abstract:
Discrete data representations are necessary, or at least convenient, in many machine learning problems. While feature selection (FS) techniques aim at finding relevant subsets of features, the goal of feature discretization (FD) is to find concise (quantized) data representations, adequate for the learning task at hand. In this paper, we propose two incremental methods for FD. The first method belongs to the filter family, in which the quality of the discretization is assessed by a (supervised or unsupervised) relevance criterion. The second method is a wrapper, where discretized features are assessed using a classifier. Both methods can be coupled with any static (unsupervised or supervised) discretization procedure and can be used to perform FS as pre-processing or post-processing stages. The proposed methods attain efficient representations suitable for binary and multi-class problems with different types of data, being competitive with existing methods. Moreover, using well-known FS methods with the features discretized by our techniques leads to better accuracy than with the features discretized by other methods or with the original features. (C) 2013 Elsevier B.V. All rights reserved.
Abstract:
The achievable region approach seeks solutions to stochastic optimisation problems by: (i) characterising the space of all possible performances (the achievable region) of the system of interest, and (ii) optimising the overall system-wide performance objective over this space. This is radically different from conventional formulations based on dynamic programming. The approach is explained with reference to a simple two-class queueing system. Powerful new methodologies due to the authors and co-workers are deployed to analyse a general multiclass queueing system with parallel servers and then to develop an approach to optimal load distribution across a network of interconnected stations. Finally, the approach is used for the first time to analyse a class of intensity control problems.