967 results for classification models
Abstract:
The diagnosis, grading and classification of tumours have benefited considerably from the development of DCE-MRI, which is now essential to the adequate clinical management of many tumour types thanks to its ability to detect active angiogenesis. Several strategies have been proposed for DCE-MRI evaluation. Visual inspection of contrast agent concentration curves versus time is a very simple yet operator-dependent procedure, so more objective approaches have been developed to facilitate comparison between studies. In so-called model-free approaches, descriptive or heuristic information extracted from the raw time series is used for tissue classification. The main issue with these schemes is that they have no direct interpretation in terms of the physiological properties of the tissues. Model-based investigations, on the other hand, typically involve compartmental tracer kinetic modelling and pixel-by-pixel estimation of kinetic parameters via non-linear regression applied to regions of interest selected by the physician. This approach has the advantage of providing parameters directly related to the pathophysiological properties of the tissue, such as vessel permeability, regional blood flow, extraction fraction, and the concentration gradient between plasma and the extravascular-extracellular space. However, nonlinear modelling is computationally demanding, and the accuracy of the estimates can be affected by the signal-to-noise ratio and by the initial solutions. The principal aim of this thesis is to investigate the use of semi-quantitative and quantitative parameters for segmentation and classification of breast lesions.
The objectives can be subdivided as follows: to describe the principal techniques for evaluating time-intensity curves in DCE-MRI, with a focus on the kinetic models proposed in the literature; to evaluate the influence of the parametrization choice for a classic bi-compartmental kinetic model; to evaluate the performance of a method for simultaneous tracer kinetic modelling and pixel classification; and to evaluate the performance of machine learning techniques trained for segmentation and classification of breast lesions.
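The pixel-by-pixel nonlinear regression described above can be sketched as follows. This is an illustrative fit of the standard Tofts model; the mono-exponential arterial input function, its constants and all parameter values are assumptions for the sketch, not the thesis's actual protocol.

```python
import numpy as np
from scipy.optimize import curve_fit

def tofts(t, Ktrans, ve, aif_amp=5.0, aif_dec=0.5):
    """Tissue concentration for a mono-exponential arterial input function.
    C_t(t) = Ktrans * int_0^t Cp(u) exp(-(Ktrans/ve)(t-u)) du has a closed
    form when Cp(t) = A * exp(-m t)."""
    kep = Ktrans / ve
    A, m = aif_amp, aif_dec
    return Ktrans * A * (np.exp(-m * t) - np.exp(-kep * t)) / (kep - m)

t = np.linspace(0, 6, 60)                       # minutes (invented grid)
true = (0.25, 0.4)                              # Ktrans [1/min], ve [-]
noisy = tofts(t, *true) + np.random.default_rng(0).normal(0, 0.01, t.size)
# per-pixel estimate: nonlinear least squares from an initial guess
est, _ = curve_fit(tofts, t, noisy, p0=(0.1, 0.3), bounds=(1e-3, 2.0))
```

In a real study this fit would be repeated for every pixel of the region of interest, which is where the computational cost and the sensitivity to noise and initial solutions mentioned above come from.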
Abstract:
The Thermodynamic Bethe Ansatz analysis is carried out for the extended-CP^N class of integrable 2-dimensional Non-Linear Sigma Models related to the low energy limit of the AdS_4xCP^3 type IIA superstring theory. The principal aim of this program is to obtain further non-perturbative consistency checks of the S-matrix proposed to describe the scattering processes between the fundamental excitations of the theory, by analyzing the structure of the Renormalization Group flow. As a noteworthy byproduct we eventually obtain a novel class of TBA models which fits in the known classification but with several important differences. The TBA framework allows the evaluation of some exact quantities related to the conformal UV limit of the model: the effective central charge, the conformal dimension of the perturbing operator and the field content of the underlying CFT. The knowledge of these physical quantities has led to the possibility of conjecturing a perturbed CFT realization of the integrable models in terms of coset Kac-Moody CFTs. The set of numerical tools and programs developed ad hoc to solve the problem at hand is also discussed in some detail, with references to the code.
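As a worked illustration of one of these UV quantities: for constant TBA solutions the effective central charge is a sum of Rogers dilogarithms, c_eff = (6/pi^2) * sum_a L(x_a). The standard sanity check below (unrelated to the specific models of the thesis) is a single free-fermion node with x = 1/2, which must give c_eff = 1/2.

```python
import math

def rogers_L(x, terms=200):
    """Rogers dilogarithm L(x) = Li2(x) + (1/2) ln(x) ln(1-x), with Li2
    evaluated by its power series (converges fast for x = 1/2)."""
    li2 = sum(x**n / n**2 for n in range(1, terms + 1))
    return li2 + 0.5 * math.log(x) * math.log(1 - x)

# single free-fermion node: L(1/2) = pi^2/12, hence c_eff = 1/2
c_eff = 6 / math.pi**2 * rogers_L(0.5)
```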
Abstract:
Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in the form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into predefined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a sizeable training set and notable computational effort. Methods for cross-domain text categorization have been proposed, which allow a set of labeled documents from one domain to be leveraged to classify those of another. Most methods use advanced statistical techniques, usually involving the tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy on most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification.
Results show that classification accuracy still requires improvement, but models generated from one domain prove to be effectively reusable in a different one.
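The iterative centroid adaptation can be sketched roughly as below. The toy vectors, the Euclidean distance and the fixed iteration count are simplifications for illustration; the actual method works on term-weighted document profiles with similarity measures appropriate to text.

```python
import numpy as np

def adapt_centroids(X_src, y_src, X_tgt, n_iter=5):
    """Build class centroids on the labeled source domain, then
    iteratively re-estimate them from the target documents they attract."""
    classes = np.unique(y_src)
    cents = np.array([X_src[y_src == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        # assign each target item to its nearest centroid
        d = ((X_tgt[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        y_tgt = classes[d.argmin(axis=1)]
        # move the centroids onto the target domain's own data
        cents = np.array([X_tgt[y_tgt == c].mean(axis=0)
                          if (y_tgt == c).any() else cents[i]
                          for i, c in enumerate(classes)])
    return cents, y_tgt
```

The appeal of the scheme is visible even in this sketch: the only moving parts are the distance measure and the number of adaptation rounds.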
Abstract:
Membrane interactions of porphyrinic photosensitizers (PSs) are known to play a crucial role in PS efficiency in photodynamic therapy (PDT). In the current paper, the interactions between 15 different porphyrinic PSs with various hydrophilic/lipophilic properties and phospholipid bilayers were probed by NMR spectroscopy. Unilamellar vesicles consisting of dioleoyl-phosphatidyl-choline (DOPC) were used as membrane models. PS-membrane interactions were deduced from analysis of the main DOPC ¹H-NMR resonances (choline and lipid chain signals). Initial membrane adsorption of the PSs was indicated by induced changes to the DOPC choline signal, i.e., a split into inner and outer choline peaks. Based on this parameter, the PSs could be classified into two groups: Type-A PSs causing a split and Type-B PSs causing no split. A further classification into two subgroups each (A1, A2 and B1, B2) was based on the observed time-dependent changes of the main DOPC NMR signals following initial PS adsorption. Four different time-correlated patterns were found, indicating different levels and rates of PS penetration into the hydrophobic membrane interior. The type of interaction was mainly affected by the amphiphilicity and the overall lipophilicity of the applied PS structures. In conclusion, the NMR data provided valuable structural and dynamic insights into the PS-membrane interactions, which allow the structural requirements for high membrane affinity and high membrane penetration of a given PS to be derived. (C) 2011 Elsevier B.V. All rights reserved.
Abstract:
The advances in computational biology have made simultaneous monitoring of thousands of features possible. High-throughput technologies not only bring about a much richer information context in which to study various aspects of gene function, but they also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, to the context of generalized linear regression based on a previous approach, Iteratively ReWeighted Partial Least Squares (IRWPLS; Marx, 1996). We compare our results with two-stage PLS (Nguyen and Rocke, 2002a; Nguyen and Rocke, 2002b) and other classifiers. We show that by phrasing the problem in a generalized linear model setting and by applying bias correction to the likelihood to avoid (quasi)separation, we often obtain lower classification error rates.
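The dimension-reduction step being extended can be sketched minimally: the generic NIPALS recipe for PLS1 score extraction, whose scores can then feed any classifier. This illustrates the two-stage idea the paper compares against, not the IRWPLS algorithm itself, and the data dimensions are invented.

```python
import numpy as np

def pls_scores(X, y, n_comp=2):
    """NIPALS-style PLS1: extract components maximizing covariance with y."""
    Xr = X - X.mean(axis=0)
    yc = y - y.mean()
    T = []
    for _ in range(n_comp):
        w = Xr.T @ yc                    # weight vector ~ cov(X_j, y)
        w /= np.linalg.norm(w)
        t = Xr @ w                       # score vector
        p = Xr.T @ t / (t @ t)           # loading
        Xr = Xr - np.outer(t, p)         # deflate X before next component
        T.append(t)
    return np.column_stack(T)
```

With many covariates and few samples, exactly the setting described above, the handful of scores replaces thousands of original features.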
Abstract:
Suppose that we are interested in establishing simple but reliable rules for predicting future t-year survivors via censored regression models. In this article, we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models we derive consistent estimators for the above measures via substitution and cross-validation estimation procedures. Furthermore, we provide large-sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example for the difference of the precision measures between two competing rules, can then be constructed. All the proposals are illustrated with two real examples, and their finite sample properties are evaluated via a simulation study.
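The precision measures in question have simple empirical (substitution) forms when outcomes are fully observed; the sketch below uses plain counts and ignores the censoring corrections the article actually develops.

```python
def precision_measures(y_true, y_pred):
    """Empirical misclassification rate, sensitivity, specificity, PPV and
    NPV for binary labels (1 = event within t years, 0 = no event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    return {
        "error": (fp + fn) / n,          # overall misclassification rate
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }
```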
Abstract:
BACKGROUND Low-grade gliomas (LGGs) are rare brain neoplasms, with survival spanning up to a few decades. Thus, accurate evaluations of how biomarkers impact survival among patients with LGG require long-term studies on samples prospectively collected over a long period. METHODS The 210 adult LGGs collected in our databank were screened for IDH1 and IDH2 mutations (IDHmut), MGMT gene promoter methylation (MGMTmet), 1p/19q loss of heterozygosity (1p19qloh), and nuclear TP53 immunopositivity (TP53pos). Multivariate survival analyses with multiple imputation of missing data were performed using either histopathology or molecular markers. Both models were compared using Akaike's information criterion (AIC). The molecular model was reduced by stepwise model selection to filter out the most critical predictors. A third model was generated to assess various marker combinations. RESULTS Molecular parameters were better survival predictors than histology (ΔAIC = 12.5, P < .001). Forty-five percent of the studied patients died. MGMTmet was positively associated with IDHmut (P < .001). In the molecular model with marker combinations, IDHmut/MGMTmet combined status had a favorable impact on overall survival compared with IDHwt (hazard ratio [HR] = 0.33, P < .01), and even more so the triple combination IDHmut/MGMTmet/1p19qloh (HR = 0.18, P < .001). Furthermore, the IDHmut/MGMTmet/TP53pos triple combination was a significant risk factor for malignant transformation (HR = 2.75, P < .05). CONCLUSION By integrating networks of activated molecular glioma pathways, the model based on genotype better predicts prognosis than histology and therefore provides a more reliable tool for standardizing future treatment strategies.
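The model comparison reported above follows the standard definition AIC = 2k - 2 ln L, with the lower-AIC model preferred. The log-likelihoods and parameter counts below are invented for illustration and are not the study's values.

```python
def aic(log_lik, n_params):
    """Akaike's information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

# hypothetical histology model vs. hypothetical molecular model
delta = aic(log_lik=-250.0, n_params=4) - aic(log_lik=-242.0, n_params=5)
# delta > 0 means the second (molecular) model is preferred
```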
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting, and that it can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. Random Forests can handle large-scale data sets without rigorous data preprocessing, and it has a robust variable importance ranking measure. We propose a novel variable selection method in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhances the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests that Random Forests is recommended for such high-dimensional data: one can use Random Forests to select the important variables and then use logistic regression, or Random Forests itself, to estimate the effect sizes of the predictors and to classify new observations. We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
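The noise-level cut-off idea can be sketched as follows. For brevity, a simple correlation score stands in for Random Forests' mean-decrease-in-accuracy importance, so this illustrates only the selection rule (append a pure-noise predictor, keep variables that beat it), not the forest itself.

```python
import numpy as np

def select_above_noise(X, y, rng):
    """Keep the columns of X whose (stand-in) importance exceeds that of
    an appended synthetic noise column -- the data-driven cut-off."""
    Xn = np.column_stack([X, rng.normal(size=len(y))])   # add noise column
    imp = np.abs([np.corrcoef(Xn[:, j], y)[0, 1] for j in range(Xn.shape[1])])
    cutoff = imp[-1]                                     # noise importance
    return [j for j in range(X.shape[1]) if imp[j] > cutoff]
```

With a real forest one would substitute permutation importance for the correlation score; the cut-off logic is unchanged.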
Abstract:
It is well accepted that tumorigenesis is a multi-step process involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high-throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays, as well as next-generation sequencing assays interrogating somatic mutations, insertions, deletions, translocations and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation and copy number data, through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes from next-generation RNA-seq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with gene expression measurements dichotomized according to their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the way for more accurate and interpretable cancer classification by integrating signals from multiple sources.
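A toy version of bimodal-gene scoring (not the SIBER estimator, which fits mixture models to RNA-seq data) conveys the idea: split expression at the midrange and score how well-separated and balanced the two modes are.

```python
import numpy as np

def bimodality_index(x):
    """Crude bimodality score: split at the midrange, then
    BI = sqrt(p * (1-p)) * |mu1 - mu2| / pooled_sd."""
    cut = (x.min() + x.max()) / 2
    lo, hi = x[x <= cut], x[x > cut]
    if len(lo) < 2 or len(hi) < 2:
        return 0.0
    p = len(lo) / len(x)
    pooled = np.sqrt((lo.var(ddof=1) * (len(lo) - 1)
                      + hi.var(ddof=1) * (len(hi) - 1)) / (len(x) - 2))
    return np.sqrt(p * (1 - p)) * abs(lo.mean() - hi.mean()) / pooled
```

Genes scoring high under such an index are natural candidates for the dichotomized (on/off) features described above.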
Abstract:
The geometries of a catchment constitute the basis for distributed, physically based numerical modeling in different geoscientific disciplines. In this paper, results from ground-penetrating radar (GPR) measurements, in the form of a 3D model of total sediment thickness and active layer thickness in a periglacial catchment in western Greenland, are presented. Using the topography, the thickness and distribution of sediments are calculated. Vegetation classification and GPR measurements are used to scale active layer thickness from local measurements to catchment-scale models. Annual maximum active layer thickness varies from 0.3 m in wetlands to 2.0 m in barren areas and areas of exposed bedrock. Maximum sediment thickness is estimated to be 12.3 m in the major valleys of the catchment. A method to correlate surface vegetation with active layer thickness is also presented. By using relatively simple methods, such as probing and vegetation classification, it is possible to upscale local point measurements to catchment-scale models in areas where the upper subsurface is relatively homogeneous. The resulting spatial model of active layer thickness can be used in combination with the sediment model as geometrical input to further studies of subsurface mass transport and hydrological flow paths in the periglacial catchment through numerical modelling.
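The probing-plus-vegetation-class upscaling can be sketched as a per-class lookup: average the probed thickness within each vegetation class, then map every classified pixel through that table. The class names reuse the figures quoted above, but the individual data points are invented.

```python
def upscale(probe_class, probe_depth, pixel_class):
    """Mean probed active-layer thickness per vegetation class, mapped
    onto every classified pixel of the catchment."""
    table = {}
    for c in set(probe_class):
        vals = [d for pc, d in zip(probe_class, probe_depth) if pc == c]
        table[c] = sum(vals) / len(vals)
    return [table[c] for c in pixel_class]
```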
Abstract:
We present a novel approach, using both sustained vowels and connected speech, to detect obstructive sleep apnea (OSA) cases within a homogeneous group of speakers. The proposed scheme is based on state-of-the-art GMM-based classifiers and specifically addresses the way acoustic models are trained on standard databases, as well as the complexity of the resulting models and their adaptation to specific data. Our experimental database contains a suitable number of utterances and sustained speech samples from healthy (i.e., control) and OSA Spanish speakers. Finally, a 25.1% relative reduction in classification error is achieved when fusing continuous and sustained speech classifiers. Index Terms: obstructive sleep apnea (OSA), Gaussian mixture models (GMMs), background model (BM), classifier fusion.
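The fusion step can be sketched as a weighted combination of the two classifiers' per-class scores. The log-likelihood numbers and the equal weight below are invented; a real system would obtain the scores from the two GMMs.

```python
def fuse(ll_sustained, ll_connected, w=0.5):
    """Late fusion: each argument maps class -> log-likelihood from one
    classifier; return the class maximizing the weighted combination."""
    fused = {c: w * ll_sustained[c] + (1 - w) * ll_connected[c]
             for c in ll_sustained}
    return max(fused, key=fused.get)
```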
Abstract:
Many existing engineering works model the statistical characteristics of the entities under study as normal distributions. These models are eventually used for decision making, which in practice requires defining the classification region corresponding to the desired confidence level. Surprisingly, however, a great number of computer vision works using multidimensional normal models leave the confidence region unspecified, or fail to establish it correctly due to misconceptions about the features of Gaussian functions or wrong analogies with the unidimensional case. The resulting regions incur deviations that can be unacceptable in high-dimensional models. Here we provide a comprehensive derivation of the optimal confidence regions for multivariate normal distributions of arbitrary dimensionality. To this end, we first derive the condition for region optimality of general continuous multidimensional distributions, and then apply it to the widespread case of the normal probability density function. The obtained results are used to analyze the confidence error incurred by previous works related to vision research, showing that deviations caused by wrong regions may become unacceptable as dimensionality increases. To support the theoretical analysis, a quantitative example is given in the context of moving object detection by means of background modeling.
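For the Gaussian case, the optimal confidence region is the Mahalanobis ellipsoid whose squared radius is a chi-square quantile with d degrees of freedom, not the squared 1-D z-value that the faulty unidimensional analogy suggests. A minimal membership test:

```python
import numpy as np
from scipy.stats import chi2

def in_confidence_region(x, mu, Sigma, conf=0.95):
    """True if x lies in the optimal conf-level region of N(mu, Sigma):
    squared Mahalanobis distance <= chi-square quantile in d dims."""
    d = len(mu)
    diff = x - mu
    m2 = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    return m2 <= chi2.ppf(conf, df=d)
```

Note how the threshold grows with dimensionality (chi2.ppf(0.95, 1) ≈ 3.84 but chi2.ppf(0.95, 10) ≈ 18.3), which is precisely why regions built from 1-D reasoning deviate more as d increases.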
Abstract:
ABSTRACT: Crop simulation models allow analyzing various tillage-rotation combinations and exploring management scenarios. The DSSAT model was tested under rainfed conditions in a 16-year field experiment in semiarid central Spain. The effect of tillage system and winter cereal-based rotations on crop yield and soil quality was evaluated. The CERES and CROPGRO models were used to simulate crop growth and yield, while DSSAT CENTURY was used in the SOC and SN simulations. Both field observations and CERES-Barley simulations showed that barley grain yield was lower for continuous cereal (BB) than for vetch (VB) and fallow (FB) rotations under both tillage systems. The model predicted higher nitrogen availability under conventional tillage (CT) than under no tillage (NT), leading to a higher yield under CT. The SOC and SN in the top layer were higher in NT than in CT, and decreased with depth in both simulated and observed values. The best combinations for the dryland conditions studied were CT-VB and CT-FB, but CT presented lower SN and SOC content than NT. The beneficial effect of NT on SOC and SN under semiarid Mediterranean conditions can be identified both by field observations and by crop model simulations. The simulation of the water balance in cropping systems is a useful tool to study how water can be used efficiently. The comparison of the DSSAT soil water balance, a simpler "tipping bucket" approach, with the more mechanistic WAVE model, which integrates the Richards equation, is a powerful method to assess model performance. The soil parameters were calibrated using the Simulated Annealing (SA) global optimization method. A continuous weighing lysimeter in a bare fallow provided the observed values of drainage and evapotranspiration (ET), while soil water content (SW) was supplied by capacitance sensors. Both models performed well after optimizing the soil parameters with SA, simulating the soil water balance components for the calibration period.
For the validation period, the optimized models predicted soil water content and soil evaporation well over time. However, drainage was predicted better by WAVE than by DSSAT, which presented larger errors in the cumulative values. That could be due to the mechanistic nature of WAVE as opposed to the more functional nature of DSSAT. The good results from WAVE indicate that, after calibration, it could be used as a benchmark for other models in periods when no drainage field measurements are available. The performance of DSSAT-CENTURY in simulating SOC and N strongly depends on the initialization process. Initializing the SOC pools from apparent soil N mineralization (Napmin) measurements was proposed as an alternative method (Met.2). Met.2 was compared to the initialization method of Basso et al. (2011) (Met.1) by applying both methods to a 4-year field experiment in an irrigated area of central Spain. Nmin and Napmin were overestimated by Met.1, since the stable pool obtained (SOC3) in the upper layers was lower than with Met.2. Simulated N leaching was similar for both methods, with good results in the fallow and barley treatments. Met.1 underestimated topsoil SOC when compared with a 12-year observed series. Crop growth and yield were properly simulated by both methods, but N in shoots and grain was overestimated by Met.1. Results varied significantly with the initial SOC pools, highlighting the importance of the initialization procedure. Met.2 offers an alternative way to initialize the CENTURY model, enhancing the simulation of soil N processes. The continuous emergence of new varieties of modern maize hybrids limits the application of crop simulation models, since these new hybrids must be calibrated in the field to be suitable for model use. The development of relationships based on cycle duration would simplify the calibration requirements, facilitating the rapid incorporation of new cultivars into DSSAT.
Six maize hybrids (FAO 300 through FAO 700) were grown in a 2-year field experiment in a semiarid irrigated area of central Spain. Genetic coefficients were obtained sequentially, starting with the phenological development parameters (P1, P2, P5 and PHINT), followed by the crop growth parameters (G2 and G3). The procedure was continued until the simulated outputs were in good agreement with the field phenological observations. After calibration, simulated parameters matched observed parameters well, with low RMSE in most cases. The calibrated P1 and P5 increased with the duration of the cycle. P1 was a linear function of the thermal time (TT) from emergence to silking, and P5 was linearly related to the TT from silking to maturity. There were no significant differences in PHINT between hybrids from FAO-500 to FAO-700, as they had a similar leaf number. Since the phenological coefficients were directly related to cycle duration, it would be possible to develop ranges and correlations that allow such coefficients to be estimated from the cycle classification.
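The thermal time (TT) in which P1 and P5 are expressed is accumulated growing degree days. A minimal sketch, with the base temperature of 8 °C taken as an assumption for maize and the daily temperatures invented:

```python
def thermal_time(tmax, tmin, tbase=8.0):
    """Accumulated growing degree days: sum over days of
    max(0, (Tmax + Tmin)/2 - Tbase)."""
    return sum(max(0.0, (hi + lo) / 2 - tbase) for hi, lo in zip(tmax, tmin))
```

For example, two days at 30/14 °C and 28/12 °C contribute (22 - 8) + (20 - 8) = 26 degree days toward a coefficient such as P1.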
Abstract:
Bayesian network classifiers are widely used in machine learning because they intuitively represent causal relations. Multi-label classification problems require each instance to be assigned a subset of a defined set of h labels. This problem is equivalent to finding a multi-valued decision function that predicts a vector of h binary classes. In this paper we obtain the decision boundaries of two widely used Bayesian network approaches for building multi-label classifiers: Multi-label Bayesian network classifiers built using the binary relevance method and Bayesian network chain classifiers. We extend our previous single-label results to multi-label chain classifiers, and we prove that, as expected, chain classifiers provide a more expressive model than the binary relevance method.
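The chain construction analyzed above can be sketched generically: each label's classifier receives the predictions for the previous labels as extra features, which is what lets a chain express label dependencies that binary relevance (independent per-label models) cannot. A trivial centroid rule stands in here for the Bayesian network base classifiers.

```python
import numpy as np

class Centroid:
    """Toy binary base learner: nearest class centroid."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def predict(self, X):
        d0 = ((X - self.c0) ** 2).sum(axis=1)
        d1 = ((X - self.c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)

def chain_fit_predict(X, Y, Xt):
    """Classifier chain: the model for label j sees X plus labels 1..j-1
    (true labels at training time, predicted labels at test time)."""
    preds = np.zeros((len(Xt), Y.shape[1]), dtype=int)
    Xa, Xta = X, Xt
    for j in range(Y.shape[1]):
        m = Centroid().fit(Xa, Y[:, j])
        preds[:, j] = m.predict(Xta)
        Xa = np.column_stack([Xa, Y[:, j]])        # augment with true label
        Xta = np.column_stack([Xta, preds[:, j]])  # augment with prediction
    return preds
```

Dropping the two augmentation lines turns the chain back into binary relevance, which makes the expressiveness gap discussed in the paper easy to see experimentally.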