917 results for Cross-validation


Relevance:

60.00%

Publisher:

Abstract:

Many problems in early vision are ill posed. Edge detection is a typical example. This paper applies regularization techniques to the problem of edge detection. We derive an optimal filter for edge detection with a size controlled by the regularization parameter $\lambda$ and compare it to the Gaussian filter. A formula relating the signal-to-noise ratio to the parameter $\lambda$ is derived from regularization analysis for the case of small values of $\lambda$. We also discuss the method of Generalized Cross Validation for obtaining the optimal filter scale. Finally, we use our framework to explain two perceptual phenomena: coarsely quantized images become recognizable by either blurring or adding noise.
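
As a concrete illustration of the Generalized Cross Validation idea mentioned above, the sketch below selects a Tikhonov regularization parameter by minimizing the GCV score on synthetic data. This is a generic minimal sketch, not the paper's edge-detection filter; the function name and data are invented for illustration.

```python
# Minimal sketch: choosing a Tikhonov regularization parameter by
# Generalized Cross Validation (GCV) on synthetic data. This is generic
# smoothing code, not the paper's optimal edge-detection filter.
import numpy as np

def gcv_score(A, y, lam):
    """GCV(lam) = n * ||y - H y||^2 / trace(I - H)^2, with H the hat matrix."""
    n, p = A.shape
    # Hat matrix H = A (A^T A + lam I)^-1 A^T maps y to the regularized fit.
    H = A @ np.linalg.solve(A.T @ A + lam * np.eye(p), A.T)
    residual = y - H @ y
    return n * (residual @ residual) / np.trace(np.eye(n) - H) ** 2

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
y = A @ rng.normal(size=20) + 0.5 * rng.normal(size=100)
lams = np.logspace(-4, 2, 25)
best = min(lams, key=lambda lam: gcv_score(A, y, lam))
print(f"GCV-selected lambda: {best:.4g}")
```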

Relevance:

60.00%

Publisher:

Abstract:

BACKGROUND: In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top-to-bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is to apply transitive closure to their predictions. RESULTS: We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with a computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing. CONCLUSION: A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., nearest-neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent across GO-term depths. Additional in silico validation on a collection of annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, one that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.
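
The 'true-path' rule above says that an annotation at a term implies annotation at all its ancestors. The authors' classifier satisfies this by construction, but the sketch below shows what the naive post-processing alternative looks like: pushing each term's score up to every ancestor. The toy GO terms and scores are invented for illustration.

```python
# Hypothetical post-processing sketch (the paper's classifier needs no such
# step): enforce 'true-path' consistency by making every ancestor's score
# at least as large as its descendants'.
def enforce_true_path(parents, scores):
    """parents: term -> list of parent terms; scores: term -> raw score."""
    consistent = dict(scores)

    def push_up(term):
        for p in parents.get(term, []):
            if consistent[p] < consistent[term]:
                consistent[p] = consistent[term]
                push_up(p)  # propagate further toward the root

    for term in scores:
        push_up(term)
    return consistent

# Invented toy hierarchy: leaf -> mid -> root
parents = {"GO:leaf": ["GO:mid"], "GO:mid": ["GO:root"], "GO:root": []}
scores = {"GO:root": 0.4, "GO:mid": 0.2, "GO:leaf": 0.9}
print(enforce_true_path(parents, scores))
# root and mid are raised to 0.9, so predictions obey the hierarchy
```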

Relevance:

60.00%

Publisher:

Abstract:

Spotting patterns of interest in an input signal is a very useful task in many different fields, including medicine, bioinformatics, economics, speech recognition and computer vision. Example instances of this problem include spotting an object of interest in an image (e.g., a tumor), a pattern of interest in a time-varying signal (e.g., audio analysis), or an object of interest moving in a specific way (e.g., a human's body gesture). Traditional spotting methods, which are based on Dynamic Time Warping or hidden Markov models, use some variant of dynamic programming to register the pattern and the input while accounting for temporal variation between them. At the same time, those methods often suffer from several shortcomings: they may give meaningless solutions when input observations are unreliable or ambiguous, they require a high-complexity search across the whole input signal, and they may give incorrect solutions if some patterns appear as smaller parts within other patterns. In this thesis, we develop a framework that addresses these three problems, and evaluate the framework's performance in spotting and recognizing hand gestures in video. The first contribution is a spatiotemporal matching algorithm that extends the dynamic programming formulation to accommodate multiple candidate hand detections in every video frame. The algorithm finds the best alignment between the gesture model and the input, and simultaneously locates the best candidate hand detection in every frame. This allows a gesture to be recognized even when the hand location is highly ambiguous. The second contribution is a pruning method that uses model-specific classifiers to reject dynamic programming hypotheses with a poor match between the input and model. Pruning improves the efficiency of the spatiotemporal matching algorithm, and in some cases may improve the recognition accuracy. The pruning classifiers are learned from training data, and cross-validation is used to reduce the chance of overpruning. The third contribution is a subgesture reasoning process that models the fact that some gesture models can falsely match parts of other, longer gestures. By integrating subgesture reasoning, the spotting algorithm can avoid the premature detection of a subgesture when the longer gesture is actually being performed. Subgesture relations between pairs of gestures are automatically learned from training data. The performance of the approach is evaluated on two challenging video datasets: hand-signed digits gestured by users wearing short-sleeved shirts in front of a cluttered background, and American Sign Language (ASL) utterances gestured by native ASL signers. The experiments demonstrate that the proposed method is more accurate and efficient than competing approaches. The proposed approach can be generally applied to alignment or search problems with multiple input observations that use dynamic programming to find a solution.
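
For readers unfamiliar with the dynamic-programming formulation being extended, here is a minimal classic DTW cost computation on 1-D sequences; the thesis's first contribution would, roughly speaking, replace the single per-frame observation with a minimum over multiple candidate hand detections. The sequences below are invented.

```python
# Minimal sketch of the standard dynamic-programming alignment (classic DTW)
# that the thesis extends; the extension takes the min over multiple candidate
# detections per input frame instead of a single observation.
import numpy as np

def dtw_cost(model, query):
    """Cumulative alignment cost between two 1-D sequences."""
    m, n = len(model), len(query)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(model[i - 1] - query[j - 1])  # local match cost
            # Extend the cheapest of insertion, deletion, or match
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

print(dtw_cost([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # 0.0: perfect warp
```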

Relevance:

60.00%

Publisher:

Abstract:

As more diagnostic testing options become available to physicians, it becomes more difficult to combine the various types of medical information in order to optimize the overall diagnosis. To improve diagnostic performance, here we introduce an approach to optimize a decision-fusion technique to combine heterogeneous information, such as from different modalities, feature categories, or institutions. For classifier comparison we used two performance metrics: the area under the receiver operating characteristic (ROC) curve (AUC) and the normalized partial area under the curve (pAUC). This study used four classifiers: linear discriminant analysis (LDA), an artificial neural network (ANN), and two variants of our decision-fusion technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We applied each of these classifiers with 100-fold cross-validation to two heterogeneous breast cancer data sets: one of mass lesion features and a much more challenging one of microcalcification lesion features. For the calcification data set, DF-A outperformed the other classifiers in terms of AUC (p < 0.02) and achieved AUC=0.85 +/- 0.01. DF-P surpassed the other classifiers in terms of pAUC (p < 0.01) and reached pAUC=0.38 +/- 0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p < 0.04) and achieved AUC=0.94 +/- 0.01. Although for this data set there were no statistically significant differences among the classifiers' pAUC values (pAUC=0.57 +/- 0.07 to 0.67 +/- 0.05, p > 0.10), DF-P did significantly improve specificity versus the LDA at both 98% and 100% sensitivity (p < 0.04). In conclusion, decision fusion directly optimized clinically significant performance measures, such as AUC and pAUC, and sometimes outperformed two well-known machine-learning techniques when applied to two different breast cancer data sets.
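
A quick illustration of the two performance metrics with scikit-learn: `roc_auc_score` computes the AUC, and its `max_fpr` argument gives a standardized partial AUC over a low-false-positive-rate range. Note the study's pAUC is defined over a high-sensitivity region, so this is an analogous computation rather than the paper's exact definition; the labels and scores below are synthetic.

```python
# Illustrative sketch: full AUC, plus a partial AUC restricted to low
# false-positive rates via max_fpr (analogous to, not identical with,
# the high-sensitivity pAUC used in the study).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)                   # synthetic labels
y_score = y_true * 0.8 + rng.normal(0, 0.5, size=500)   # synthetic classifier output

print("AUC :", roc_auc_score(y_true, y_score))
print("pAUC:", roc_auc_score(y_true, y_score, max_fpr=0.1))
```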

Relevance:

60.00%

Publisher:

Abstract:

BACKGROUND: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are naturally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. RESULTS: We have developed a CUDA-based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. CONCLUSIONS: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
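
To make "embarrassingly parallel" concrete, here is a plain NumPy sketch of permutation resampling for a per-gene two-sample statistic; each permutation is independent of the others, which is what permGPU exploits on the GPU. This is generic illustration code, not the permGPU API, and the expression matrix is simulated.

```python
# Generic CPU sketch of the kind of permutation resampling permGPU
# accelerates: permute group labels, recompute a statistic per gene.
import numpy as np

rng = np.random.default_rng(2)
expr = rng.normal(size=(1000, 40))          # 1000 genes x 40 samples (simulated)
labels = np.array([0] * 20 + [1] * 20)

def mean_diff(expr, labels):
    return expr[:, labels == 1].mean(axis=1) - expr[:, labels == 0].mean(axis=1)

observed = mean_diff(expr, labels)
n_perm = 500
exceed = np.zeros(expr.shape[0])
for _ in range(n_perm):                     # each iteration is independent,
    perm = rng.permutation(labels)          # hence embarrassingly parallel
    exceed += np.abs(mean_diff(expr, perm)) >= np.abs(observed)
pvals = (exceed + 1) / (n_perm + 1)
print("smallest permutation p-value:", pvals.min())
```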

Relevance:

60.00%

Publisher:

Abstract:

Motivated by recent findings in the field of consumer science, this paper evaluates the causal effect of debit cards on household consumption using population-based data from the Italian Survey on Household Income and Wealth (SHIW). Within the Rubin Causal Model, we focus on the estimand of the population average treatment effect for the treated (PATT). We consider three existing estimators, based on regression, mixed matching and regression, and propensity score weighting, and propose a new doubly robust estimator. A semiparametric specification based on power series is adopted for the potential outcomes and the propensity score. Cross-validation is used to select the order of the power series. We conduct a simulation study to compare the performance of the estimators. The key assumptions, overlap and unconfoundedness, are systematically assessed and validated in the application. Our empirical results suggest statistically significant positive effects of debit cards on monthly household spending in Italy.
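
One ingredient of the procedure, selecting the order of a power-series specification by cross-validation, can be sketched with scikit-learn as below. The data are synthetic; the paper applies this step to potential-outcome and propensity-score models on the SHIW data.

```python
# Hedged sketch: choose the order of a polynomial (power-series)
# specification by 5-fold cross-validation on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=(300, 1))
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 3 + rng.normal(0, 0.3, size=300)

def cv_mse(order):
    model = make_pipeline(PolynomialFeatures(order), LinearRegression())
    return -cross_val_score(model, x, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

best_order = min(range(1, 8), key=cv_mse)
print("CV-selected polynomial order:", best_order)
```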

Relevance:

60.00%

Publisher:

Abstract:

This paper presents an approach for detecting local damage in large-scale frame structures by utilizing regularization methods for ill-posed problems. A direct relationship between the change in stiffness caused by local damage and the measured modal data for the damaged structure is developed, based on the perturbation method for structural dynamic systems. Thus, the measured incomplete modal data can be adopted directly in damage identification without requiring model reduction techniques, and common regularization methods can be effectively employed to solve the resulting equations. Damage indicators are chosen to reflect both the location and severity of local damage in individual components of frame structures, such as in brace members and at beam-column joints. The Truncated Singular Value Decomposition solution incorporating the Generalized Cross Validation method is introduced to evaluate the damage indicators for cases where realistic errors exist in the modal data measurements. Results for a 16-story building model structure show that structural damage can be correctly identified at a detailed level using only limited, noisy modal data measured for the damaged structure.
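
A minimal sketch of the Truncated Singular Value Decomposition solution with the truncation level chosen by Generalized Cross Validation, run on a synthetic ill-conditioned system rather than the paper's structural model:

```python
# Sketch: TSVD regularization with the truncation level k selected by GCV,
# on a synthetic ill-posed system (a Hilbert matrix), not structural data.
import numpy as np

rng = np.random.default_rng(4)
n = 12
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)  # Hilbert matrix
x_true = np.ones(n)
b = A @ x_true + 1e-10 * rng.normal(size=n)   # noisy right-hand side

U, s, Vt = np.linalg.svd(A)
coef = U.T @ b

def tsvd_solution(k):
    """Keep only the k largest singular values in the inverse."""
    return Vt[:k].T @ (coef[:k] / s[:k])

def gcv(k):
    residual = b - A @ tsvd_solution(k)
    return (residual @ residual) / (n - k) ** 2

k_best = min(range(1, n), key=gcv)
print("GCV-selected truncation level:", k_best)
```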

Relevance:

60.00%

Publisher:

Abstract:

Current knowledge about the spread of pathogens in aquatic environments is scarce, probably because bacteria, viruses, algae and their toxins tend to occur at low concentrations in water, making them very difficult to measure directly. The purpose of this study was the development and validation of tools to detect pathogens in freshwater systems close to an urban area. In order to evaluate anthropogenic impacts on water microbiological quality, a phylogenetic microarray was developed in the context of the EU project µAQUA to detect numerous pathogens simultaneously, and was applied to samples from two locations on the Tiber River, upstream and downstream of Rome. Human enteric viruses were also targeted. Fifty liters of water were collected and concentrated using a hollow-fiber ultrafiltration approach. The resulting concentrate was further size-fractionated through a series of filters of decreasing pore size. RNA was extracted from pooled filters and hybridized to the newly designed microarray to detect pathogenic bacteria, protozoa and toxic cyanobacteria. Diatoms, as indicators of water quality status, were also included in the microarray. The microarray results gave positive signals for bacteria, diatoms, cyanobacteria and protozoa. Cross-validation of the microarray was performed using standard microbiological methods for the bacteria. The presence of human enteric viruses transmitted via the oral-fecal route was detected using qPCR. Significant concentrations of Salmonella, Clostridium, Campylobacter and Staphylococcus, as well as hepatitis E virus (HEV), noroviruses GI (NoGI) and GII (NoGII) and human adenovirus 41 (ADV41), were found at the Mezzocammino site, whereas lower concentrations of other bacteria, and only ADV41, were recovered at the Castel Giubileo site. This study revealed that the pollution level in the Tiber River was considerably higher downstream than upstream of Rome, and that the downstream location was contaminated by emerging and re-emerging pathogens.

Relevance:

60.00%

Publisher:

Abstract:

Aim: Ecological niche modelling can provide valuable insight into species' environmental preferences and aid the identification of key habitats for populations of conservation concern. Here, we integrate biologging, satellite remote-sensing and ensemble ecological niche models (EENMs) to identify predictable foraging habitats for a globally important population of the grey-headed albatross (GHA) Thalassarche chrysostoma. Location: Bird Island, South Georgia; Southern Atlantic Ocean. Methods: GPS and geolocation-immersion loggers were used to track at-sea movements and activity patterns of GHA over two breeding seasons (n = 55; brood-guard). Immersion frequency (landings per 10-min interval) was used to define foraging events. EENM combining Generalized Additive Models (GAM), MaxEnt, Random Forest (RF) and Boosted Regression Trees (BRT) identified the biophysical conditions characterizing the locations of foraging events, using time-matched oceanographic predictors (Sea Surface Temperature, SST; chlorophyll a, chl-a; thermal front frequency, TFreq; depth). Model performance was assessed through iterative cross-validation and extrapolative performance through cross-validation among years. Results: Predictable foraging habitats identified by EENM spanned neritic (<500 m), shelf break and oceanic waters, coinciding with a set of persistent biophysical conditions characterized by particular thermal ranges (3–8 °C, 12–13 °C), elevated primary productivity (chl-a > 0.5 mg m−3) and frequent manifestation of mesoscale thermal fronts. Our results confirm previous indications that GHA exploit enhanced foraging opportunities associated with frontal systems and objectively identify the Antarctic Polar Frontal Zone (APFZ) as a region of high foraging habitat suitability. Moreover, at the spatial and temporal scales investigated here, the performance of multi-model ensembles was superior to that of single-algorithm models, and cross-validation among years indicated reasonable extrapolative performance. Main conclusions: EENM techniques are useful for integrating the predictions of several single-algorithm models, reducing potential bias and increasing confidence in predictions. Our analysis highlights the value of EENM for use with movement data in identifying at-sea habitats of wide-ranging marine predators, with clear implications for conservation and management.
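
The ensemble step can be sketched generically: fit several single-algorithm models and average their predicted habitat suitabilities. The models below are scikit-learn stand-ins (GAM and MaxEnt are not part of scikit-learn), and the data are synthetic, so this illustrates the idea rather than the authors' workflow.

```python
# Simplified sketch of the ensemble idea behind EENM: average the predicted
# suitability from several single-algorithm models (stand-in learners here).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 4))   # e.g. SST, chl-a, front frequency, depth
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 400) > 1).astype(int)

models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=200, random_state=0),
          GradientBoostingClassifier(random_state=0)]
for m in models:
    m.fit(X, y)

X_new = rng.normal(size=(5, 4))
# Ensemble suitability = unweighted mean of per-model probabilities
suitability = np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
print(suitability)
```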

Relevance:

60.00%

Publisher:

Abstract:

In this paper, NOx emissions modelling for real-time operation and control of a 200 MWe coal-fired power generation plant is studied. Three model types are compared. The first is a grey-box model developed from the fundamentals governing the NOx formation mechanisms together with a system identification technique. Then a linear AutoRegressive with eXogenous inputs (ARX) model and a non-linear ARX (NARX) model are built. Plant operation data are used for modelling and validation. Model cross-validation tests show that the developed grey-box model consistently produces better overall long-term prediction performance than the other two models.
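
For reference, a linear ARX model can be identified by ordinary least squares over lagged inputs and outputs; the sketch below does this on a simulated second-order system (the paper's grey-box and NARX models are not reproduced here).

```python
# Minimal sketch of linear ARX identification by least squares on
# synthetic data, not the plant data used in the paper.
import numpy as np

rng = np.random.default_rng(6)
N = 500
u = rng.normal(size=N)
y = np.zeros(N)
for t in range(2, N):   # simulate a 2nd-order system as ground truth
    y[t] = 1.2 * y[t-1] - 0.5 * y[t-2] + 0.8 * u[t-1] + 0.05 * rng.normal()

# Regression matrix of lagged outputs and inputs: y(t) = theta^T phi(t)
Phi = np.column_stack([y[1:-1], y[:-2], u[1:-1]])
theta, *_ = np.linalg.lstsq(Phi, y[2:], rcond=None)
print("estimated [a1, a2, b1]:", theta)   # should be near [1.2, -0.5, 0.8]
```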

Relevance:

60.00%

Publisher:

Abstract:

This is the first paper to introduce a nonlinearity test for principal component models. The methodology involves dividing the data space into disjoint regions that are analysed by principal component analysis based on the cross-validation principle. Several toy examples have been successfully analysed, and the nonlinearity test has subsequently been applied to data from an internal combustion engine.
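
The core idea can be sketched roughly as follows: fit PCA in one disjoint region of the data space and compare within-region and cross-region reconstruction errors; for linear data the errors should be comparable, while a large gap signals nonlinearity. This illustrates the principle, not the paper's exact test statistic.

```python
# Rough sketch: region-wise PCA reconstruction errors as a nonlinearity check.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
t = rng.uniform(-2, 2, size=1000)
X = np.column_stack([t, t ** 2]) + 0.05 * rng.normal(size=(1000, 2))  # curved data

left, right = X[t < 0], X[t >= 0]      # two disjoint regions of the data space
pca = PCA(n_components=1).fit(left)    # linear model fitted in one region

def recon_error(pca, data):
    recon = pca.inverse_transform(pca.transform(data))
    return np.mean(np.sum((data - recon) ** 2, axis=1))

print("within-region error:", recon_error(pca, left))
print("cross-region error :", recon_error(pca, right))  # much larger => nonlinearity
```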

Relevance:

60.00%

Publisher:

Abstract:

A comparative molecular field analysis (CoMFA) of alkanoic acid 3-oxo-cyclohex-1-enyl ester and 2-acylcyclohexane-1,3-dione derivatives of 4-hydroxyphenylpyruvate dioxygenase (HPPD) inhibitors has been performed to determine the factors required for the activity of these compounds. The substrate's conformation, abstracted from dynamic modeling of the enzyme-substrate complex, was used to build the initial structures of the inhibitors. Satisfactory results were obtained after an all-space searching procedure, performing a leave-one-out (LOO) cross-validation study with cross-validated q² and conventional r² values of 0.779 and 0.989, respectively. The results provide tools for predicting the affinity of related compounds, and for guiding the design and synthesis of new HPPD ligands with predetermined affinities.
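
The leave-one-out statistic reported above can be computed generically as q² = 1 − PRESS/SS, where PRESS is the sum of squared leave-one-out prediction errors. The sketch below uses PLS regression (typical for CoMFA) on synthetic descriptors, since the actual CoMFA field data are not available here.

```python
# Generic sketch of the cross-validated q^2 statistic via leave-one-out,
# using PLS regression on synthetic descriptors (not real CoMFA fields).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(8)
X = rng.normal(size=(30, 10))                       # synthetic descriptor matrix
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=30)

press = 0.0
for train, test in LeaveOneOut().split(X):
    model = PLSRegression(n_components=3).fit(X[train], y[train])
    press += ((y[test] - model.predict(X[test]).ravel()) ** 2).item()
q2 = 1 - press / np.sum((y - y.mean()) ** 2)        # q^2 = 1 - PRESS/SS
print(f"LOO q^2 = {q2:.3f}")
```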

Relevance:

60.00%

Publisher:

Abstract:

The identification of non-linear systems using only observed finite datasets has become a mature research area over the last two decades. A class of linear-in-the-parameters models with universal approximation capabilities has been intensively studied and widely used due to the availability of many linear-learning algorithms and their inherent convergence conditions. This article presents a systematic overview of basic research on model selection approaches for linear-in-the-parameters models. One of the fundamental problems in non-linear system identification is to find the minimal model with the best generalisation performance from observational data alone. The important concepts used to achieve good model generalisation in various non-linear system-identification algorithms are first reviewed, including Bayesian parameter regularisation and model selection criteria based on cross-validation and experimental design. A significant advance in machine learning has been the development of the support vector machine as a means of identifying kernel models based on the structural risk minimisation principle. Developments in convex-optimisation-based model construction algorithms, including support vector regression, are outlined. Input selection algorithms and on-line system identification algorithms are also included in this review. Finally, some industrial applications of non-linear models are discussed.
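
One reviewed theme, selecting a minimal linear-in-the-parameters model by cross-validation, can be sketched as greedy forward selection over a dictionary of candidate basis functions. This is generic illustration code under those assumptions, not a specific algorithm from the article.

```python
# Sketch: greedy forward selection of basis functions for a
# linear-in-the-parameters model, scored by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=200)

# Candidate dictionary of basis functions (monomials, for illustration)
candidates = {f"x^{k}": x[:, 0] ** k for k in range(1, 10)}

selected, best_score = [], -np.inf
while True:
    scored = []
    for name in candidates:
        if name in selected:
            continue
        cols = np.column_stack([candidates[n] for n in selected + [name]])
        scored.append((cross_val_score(LinearRegression(), cols, y, cv=5).mean(), name))
    if not scored:
        break
    score, name = max(scored)
    if score <= best_score:         # stop when the CV score no longer improves
        break
    selected.append(name)
    best_score = score
print("selected terms:", selected, "CV R^2:", round(best_score, 3))
```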

Relevance:

60.00%

Publisher:

Abstract:

Ground-penetrating radar (GPR) is a rapid geophysical technique that we have used to assess four illegal waste burial sites in Northern Ireland. GPR allowed informed positioning of the slower but more accurate electrical resistivity imaging (ERI) surveys. In conductive waste, GPR signal loss can be used to map the areal extent of the waste, allowing ERI survey lines to be positioned. In less conductive waste, the geometry of the burial can be ascertained from GPR alone, allowing rapid assessment. In both circumstances, the conjunctive use of GPR and ERI is considered best practice for cross-validating results and enhancing data interpretation.