920 results for Least-Squares prediction
Abstract:
The objective of this study was to estimate, in a population of crossbred cattle, the non-additive genetic effects for weight at 205 and 390 days and for scrotal circumference, and to evaluate how accounting for these effects influences the prediction of sire breeding values under different estimation methodologies. In method 1, the data were pre-adjusted for the non-additive effects, which were obtained by the least squares means method in a model that included the direct additive, maternal and non-additive fixed genetic effects, the direct and total maternal heterozygosities, and epistasis. In method 2, the non-additive effects were included as covariates in the genetic model. Genetic values for adjusted and non-adjusted data were predicted considering additive direct and maternal effects and, for weight at 205 days, also the permanent environmental effect, as random effects in the model. The breeding values of the categories of sires considered for weight at 205 days were organized in files in order to verify changes in the magnitude of the predictions and in the ranking of animals between the two methods of correcting the data for non-additive effects. The non-additive effects were not similar in magnitude and direction across the two estimation methods, nor across the traits evaluated. Pearson and Spearman correlations between breeding values were higher than 0.94, and the use of different methods does not imply changes in the selection of animals.
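The comparison of breeding values between the two methods rests on Pearson (magnitude) and Spearman (ranking) correlations. A minimal sketch of both is given here, assuming hypothetical breeding values for five sires; the numbers and function names are illustrative and are not taken from the study, and the rank helper assumes there are no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1 for the smallest value (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical breeding values of the same sires under the two methods.
method_1 = [12.4, 9.8, 15.1, 7.3, 11.0]
method_2 = [12.9, 9.1, 14.6, 7.9, 11.8]

r_pearson = pearson(method_1, method_2)
r_spearman = spearman(method_1, method_2)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

A Spearman correlation close to 1 means the ranking of sires is essentially unchanged, which is the property that matters for selection decisions.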
Abstract:
PPAR delta is a nuclear receptor that, when activated, regulates the metabolism of carbohydrates and lipids and is related to metabolic syndrome and type 2 diabetes. To understand the main interactions between ligands and PPAR delta, we constructed 2D and 3D QSAR models and compared them with HOMO, LUMO and molecular electrostatic potential (MEP) maps of the compounds studied, as well as with docking results. All QSAR models showed good statistical parameters and prediction outcomes. The QSAR models were used to predict the biological activity of an external test set, and the predicted values are in good agreement with the experimental results. Furthermore, we employed all maps to evaluate the possible interactions between the ligands and PPAR delta. These predictive QSAR models, along with the HOMO, LUMO and MEP maps, can provide insights into the structural and chemical properties needed to design new PPAR delta ligands that have improved biological activity and can be employed to treat metabolic diseases.
Abstract:
Current methods for quality control of sugar cane are performed on extracted juice using several methodologies, often requiring appreciable time and chemicals (possibly toxic), which makes them expensive and not environmentally friendly. The present study proposes the use of X-ray spectrometry together with chemometric methods as an innovative alternative technique for determining sugar cane quality parameters, specifically sucrose concentration, POL, and fiber content. Measurements were performed in stem, leaf, and juice, and those made directly in the stem provided the best results. Prediction models for sugar cane stem determinations with a single 60 s irradiation using portable X-ray fluorescence equipment allow the % sucrose, % fiber, and POL to be estimated simultaneously. Average relative deviations of around 8% in the prediction step are acceptable considering that the measurements were made in the field. These results may help indicate the best time to harvest a particular crop and to evaluate the quality of sugar cane for the sugar and alcohol industries.
Abstract:
Metals price risk management is a key issue in metal markets because of the uncertainty in commodity prices, exchange rates and interest rates, and because of the large price risk borne by both metal producers and consumers. It is therefore taken into account by all participants in metal markets, including producers, consumers, merchants, banks, investment funds, speculators and traders. Managing price risk provides stable income for both metal producers and consumers, and so increases the chance that a firm will invest in attractive projects. The purpose of this research is to evaluate risk management strategies in the copper market. The main tools and strategies of price risk management are hedging and derivatives such as futures contracts, swaps and options contracts. Hedging is a transaction designed to reduce or eliminate price risk. Derivatives are financial instruments whose returns are derived from other financial instruments, and they are commonly used for managing financial risks. Although derivatives have been around in some form for centuries, their growth has accelerated rapidly during the last 20 years. Nowadays, they are widely used by financial institutions, corporations, professional investors and individuals. This project focuses on the over-the-counter (OTC) market and its products, such as exotic options and particularly Asian options. The first part of the project describes basic derivatives and risk management strategies. It also discusses basic concepts of spot and futures (forward) markets, the benefits and costs of risk management, and the risks and rewards of positions in the derivative markets. The second part considers the valuation of commodity derivatives.
In this part, the options pricing model DerivaGem is applied to Asian call and put options on London Metal Exchange (LME) copper, because it is important to understand how Asian options are valued and to compare the theoretical values of the options with their observed market values. Predicting future trends in copper prices is essential for managing market price risk successfully. Therefore, the third part is a discussion of econometric commodity models. Based on this literature review, the fourth part of the project reports the construction and testing of an econometric model designed to forecast the monthly average price of copper on the LME. More specifically, this part aims to show how LME copper prices can be explained by means of a simultaneous-equation structural model (two-stage least squares regression) connecting supply and demand variables. A simultaneous econometric model for the copper industry is built:

$$
\begin{cases}
Q_t^D = e^{-5.0485} \cdot P_{t-1}^{-0.1868} \cdot \mathrm{GDP}_t^{1.7151} \cdot e^{0.0158 \cdot \mathrm{IP}_t} \\
Q_t^S = e^{-3.0785} \cdot P_{t-1}^{0.5960} \cdot T_t^{0.1408} \cdot P_{\mathrm{OIL}(t)}^{-0.1559} \cdot \mathrm{USDI}_t^{1.2432} \cdot \mathrm{LIBOR}_{t-6}^{-0.0561} \\
Q_t^D = Q_t^S
\end{cases}
$$

$$
P_{t-1}^{CU} = e^{-2.5165} \cdot \mathrm{GDP}_t^{2.1910} \cdot e^{0.0202 \cdot \mathrm{IP}_t} \cdot T_t^{-0.1799} \cdot P_{\mathrm{OIL}(t)}^{0.1991} \cdot \mathrm{USDI}_t^{-1.5881} \cdot \mathrm{LIBOR}_{t-6}^{0.0717}
$$

where $Q_t^D$ and $Q_t^S$ are world demand for and supply of copper at time $t$, respectively. $P_{t-1}$ is the lagged price of copper, which is the focus of the analysis in this part. $\mathrm{GDP}_t$ is world gross domestic product at time $t$, which represents aggregate economic activity. Industrial production should also be considered, so global industrial production growth, denoted $\mathrm{IP}_t$, is included in the model. $T_t$ is the time variable, which is a useful proxy for technological change. The price of oil at time $t$, denoted $P_{\mathrm{OIL}(t)}$, is a proxy for the cost of energy in producing copper. $\mathrm{USDI}_t$ is the U.S. dollar index at time $t$, which is an important variable for explaining the copper supply and copper prices. Finally, $\mathrm{LIBOR}_{t-6}$ is the 6-month lagged 1-year London Interbank Offered Rate. Although the model can be applied to other base metal industries, omitted exogenous variables, such as the price of a substitute or a combined variable related to the prices of substitutes, have not been considered in this study. Based on this econometric model and using a Monte Carlo simulation analysis, the probabilities that the monthly average copper prices in 2006 and 2007 will be greater than the specific strike price of an option are determined. The final part evaluates risk management strategies, including options strategies, metal swaps and simple options, in relation to the simulation results. The basic options strategies, such as bull spreads, bear spreads and butterfly spreads, which are created by using both call and put options in 2006 and 2007, are evaluated. Consequently, each risk management strategy in 2006 and 2007 is analyzed based on the day of data and the price prediction model. As a result, applications stemming from this project include valuing Asian options, developing a copper price prediction model, forecasting and planning, and decision making for price risk management in the copper market.
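The price equation in this abstract appears to be the reduced form obtained by setting demand equal to supply and solving for the lagged price. The following sketch checks that reading by taking logarithms of the two structural equations and solving for ln P(t-1). The coefficient values are those quoted in the abstract; the dictionary keys and the function name are illustrative choices, not part of the original work.

```python
# Structural coefficients as reported in the abstract (log-linear form).
demand = {"const": -5.0485, "price": -0.1868, "gdp": 1.7151, "ip": 0.0158}
supply = {"const": -3.0785, "price": 0.5960, "time": 0.1408,
          "oil": -0.1559, "usdi": 1.2432, "libor": -0.0561}

def reduced_form(demand, supply):
    """Solve ln Q^D = ln Q^S for ln P(t-1); return the price-equation coefficients."""
    # Collect the price terms on one side: (supply elasticity - demand elasticity).
    denom = supply["price"] - demand["price"]
    coeffs = {"const": (demand["const"] - supply["const"]) / denom}
    # Demand shifters keep their sign; supply shifters change sign.
    for name in ("gdp", "ip"):
        coeffs[name] = demand[name] / denom
    for name in ("time", "oil", "usdi", "libor"):
        coeffs[name] = -supply[name] / denom
    return coeffs

price_eq = reduced_form(demand, supply)
for name, value in price_eq.items():
    print(f"{name:>6}: {value:+.4f}")
```

The computed exponents agree with the reported price equation to within about 1e-4, which is consistent with rounding of the published coefficients.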
Abstract:
We present an independent calibration model for the determination of biogenic silica (BSi) in sediments, developed from the analysis of synthetic sediment mixtures and the application of Fourier transform infrared spectroscopy (FTIRS) and partial least squares regression (PLSR) modeling. In contrast to current FTIRS applications for quantifying BSi, this new calibration is independent of conventional wet-chemical techniques and their associated measurement uncertainties. This approach also removes the need to develop internal calibrations between the two methods for individual sediment records. For the independent calibration, we produced six series of synthetic sediment mixtures using two purified diatom extracts, with one extract mixed with quartz sand, calcite, 60/40 quartz/calcite and two different natural sediments, and a second extract mixed with one of the natural sediments. A total of 306 samples (51 per series) yielded BSi contents ranging from 0 to 100 %. The resulting PLSR calibration model between the FTIR spectral information and the defined BSi concentration of the synthetic sediment mixtures exhibits a strong cross-validated correlation (R²cv = 0.97) and a low root mean square error of cross-validation (RMSECV = 4.7 %). Application of the independent calibration to natural lacustrine and marine sediments yields robust BSi reconstructions. At present, the synthetic mixtures do not include the variation in organic matter that occurs in natural samples, which may explain the somewhat lower prediction accuracy of the calibration model for organic-rich samples.
Abstract:
Linear- and unimodal-based inference models for mean summer temperatures (partial least squares, weighted averaging, and weighted averaging partial least squares models) were applied to a high-resolution pollen and cladoceran stratigraphy from Gerzensee, Switzerland. The time-window of investigation included the Allerød, the Younger Dryas, and the Preboreal. Characteristic major and minor oscillations in the oxygen-isotope stratigraphy, such as the Gerzensee oscillation, the onset and end of the Younger Dryas stadial, and the Preboreal oscillation, were identified by isotope analysis of bulk-sediment carbonates of the same core and were used as independent indicators for hemispheric or global scale climatic change. In general, the pollen-inferred mean summer temperature reconstruction using all three inference models follows the oxygen-isotope curve more closely than the cladoceran curve. The cladoceran-inferred reconstruction suggests generally warmer summers than the pollen-based reconstructions, which may be an effect of terrestrial vegetation not being in equilibrium with climate due to migrational lags during the Late Glacial and early Holocene. Allerød summer temperatures range between 11 and 12°C based on pollen, whereas the cladoceran-inferred temperatures lie between 11 and 13°C. Pollen and cladocera-inferred reconstructions both suggest a drop to 9–10°C at the beginning of the Younger Dryas. Although the Allerød–Younger Dryas transition lasted 150–160 years in the oxygen-isotope stratigraphy, the pollen-inferred cooling took 180–190 years and the cladoceran-inferred cooling lasted 250–260 years. The pollen-inferred summer temperature rise to 11.5–12°C at the transition from the Younger Dryas to the Preboreal preceded the oxygen-isotope signal by several decades, whereas the cladoceran-inferred warming lagged. 
Major discrepancies between the pollen- and cladoceran-inference models are observed for the Preboreal, where the cladoceran-inference model suggests mean summer temperatures of up to 14–15°C. Both pollen- and cladoceran-inferred reconstructions suggest a cooling that may be related to the Gerzensee oscillation, but there is no evidence for a cooling synchronous with the Preboreal oscillation as recorded in the oxygen-isotope record. For the Gerzensee oscillation the inferred cooling was ca. 1 and 0.5°C based on pollen and cladocera, respectively, which lies well within the inherent prediction errors of the inference models.
Abstract:
Surface sediments from 68 small lakes in the Alps and 9 well-dated sediment core samples that cover a gradient of total phosphorus (TP) concentrations of 6 to 520 μg TP l⁻¹ were studied for diatom, chrysophyte cyst, cladocera, and chironomid assemblages. Inference models for mean circulation log10 TP were developed for diatoms, chironomids, and benthic cladocera using weighted-averaging partial least squares. After screening for outliers, the final transfer functions have coefficients of determination (r², as assessed by cross-validation) of 0.79 (diatoms), 0.68 (chironomids), and 0.49 (benthic cladocera). Planktonic cladocera and chrysophytes show very weak relationships to TP, and no TP inference models were developed for these biota. Diatoms showed the best relationship with TP, whereas the other biota all have large secondary gradients, suggesting that variables other than TP have a strong influence on their composition and abundance. Comparison with other diatom-TP inference models shows that our model has high predictive power and a low root mean squared error of prediction, as assessed by cross-validation.
Abstract:
Multi-dimensional Bayesian network classifiers (MBCs) are probabilistic graphical models recently proposed to deal with multi-dimensional classification problems, where each instance in the data set has to be assigned to more than one class variable. In this paper, we propose a Markov blanket-based approach for learning MBCs from data. Basically, it consists of determining the Markov blanket around each class variable using the HITON algorithm, then specifying the directionality over the MBC subgraphs. Our approach is applied to the problem of predicting the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson’s Disease Questionnaire (PDQ-39) in order to estimate the health-related quality of life of Parkinson’s patients. Fivefold cross-validation experiments were carried out on randomly generated synthetic data sets, the Yeast data set, and a real-world Parkinson’s disease data set containing 488 patients. The experimental study, including comparison with additional Bayesian network-based approaches, back propagation for multi-label learning, multi-label k-nearest neighbor, multinomial logistic regression, ordinary least squares, and censored least absolute deviations, shows encouraging results in terms of predictive accuracy as well as the identification of dependence relationships among class and feature variables.
Abstract:
Most empirical disciplines promote the reuse and sharing of datasets, as it leads to a greater possibility of replication. While this is increasingly the case in Empirical Software Engineering (ESE), some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of the studies based on biased datasets may be suspect. This issue has raised considerable consternation in the ESE literature in recent years. However, there is a confounding factor in these datasets that has not been examined carefully: size. Biased datasets sample only some of the data that could be sampled, and do so in a biased fashion; but biased samples could be smaller or larger. Smaller data sets in general provide less reliable bases for estimating models, and thus could lead to inferior model performance. In this setting, we ask: what affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset which is relatively free of bias. Our results suggest that size always matters just as much as bias direction, and in fact much more than bias direction when considering information-retrieval measures such as AUC and F-score. This indicates that, at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue, and raises further issues to be explored in the future.
Abstract:
So far, the majority of reports on on-line measurement have considered soil properties with direct spectral responses in near infrared spectroscopy (NIRS). This work reports the results of on-line measurement of soil properties with indirect spectral responses, e.g. pH, cation exchange capacity (CEC), exchangeable calcium (Caex) and exchangeable magnesium (Mgex), in one field in Bedfordshire in the UK. The on-line sensor consisted of a subsoiler coupled with an AgroSpec mobile, fibre type, visible and near infrared (vis–NIR) spectrophotometer (tec5 Technology for Spectroscopy, Germany), with a measurement range of 305–2200 nm, to acquire soil spectra in diffuse reflectance mode. General calibration models for the studied soil properties were developed with partial least squares regression (PLSR) with leave-one-out cross-validation, using spectra measured under non-mobile laboratory conditions on 160 soil samples collected from different fields on four farms in Europe, namely in the Czech Republic, Denmark, the Netherlands and the UK. A group of 25 samples independent of the calibration set was used as an independent validation set. Higher accuracy was obtained for laboratory scanning than for on-line scanning of the 25 independent samples. The prediction accuracy for the laboratory and on-line measurements was classified as excellent/very good for pH (RPD = 2.69 and 2.14 and r2 = 0.86 and 0.78, respectively), and moderately good for CEC (RPD = 1.77 and 1.61 and r2 = 0.68 and 0.62, respectively) and Mgex (RPD = 1.72 and 1.49 and r2 = 0.66 and 0.67, respectively). For Caex, very good accuracy was calculated for the laboratory method (RPD = 2.19 and r2 = 0.86), compared with the poor accuracy reported for the on-line method (RPD = 1.30 and r2 = 0.61).
The ability to collect a large number of data points per field area (about 12,800 points per 21 ha) and to analyse simultaneously several soil properties without direct spectral response in the NIR range, at relatively high operational speed and with appreciable accuracy, encourages the recommendation of the on-line measurement system for site-specific fertilisation.
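The accuracy classes above are expressed through the RPD statistic. As a minimal sketch, assuming the common definition of RPD as the standard deviation of the reference values in the validation set divided by the root mean square error of prediction (RMSEP), it can be computed as follows. The abstract does not state which definition was used, and the pH values below are hypothetical and chosen only for illustration.

```python
import math
import statistics

def rmsep(reference, predicted):
    """Root mean square error of prediction on a validation set."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def rpd(reference, predicted):
    """Ratio of the sample standard deviation of the reference values to RMSEP."""
    return statistics.stdev(reference) / rmsep(reference, predicted)

# Hypothetical laboratory reference pH and spectroscopic predictions.
reference_ph = [5.8, 6.1, 6.4, 6.9, 7.2, 7.5]
predicted_ph = [6.0, 6.0, 6.6, 6.8, 7.0, 7.6]

error = rmsep(reference_ph, predicted_ph)
ratio = rpd(reference_ph, predicted_ph)
print(f"RMSEP = {error:.3f}, RPD = {ratio:.2f}")
```

A larger RPD means the prediction error is small relative to the natural spread of the property, which is why the same RMSEP can be rated differently for properties with different variability.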
Abstract:
This final degree project describes and analyzes a comprehensive study of the effect of the vibrations produced by surface blasting carried out in the "Third Set of Locks" project executed for the expansion of the Panama Canal. A total of 53 records were collected, generated by the monitoring of 7 seismographs during 10 production blasts carried out in 2010. The vibratory phenomenon has two fundamental parameters, the peak particle velocity (PPV) and the dominant frequency, which characterize how damaging it can be to civil structures; the aim is therefore to characterize them and, above all, to predict them, which will allow them to be properly controlled. Accordingly, the study consists of two parts: the first describes the behavior of the terrain by estimating the attenuation law for peak particle velocity using ordinary least squares linear regression; the second details a verifiable procedure for predicting the dominant frequency and the pseudo-velocity response spectrum (PVRS) based on the theory of Newmark & Hall. The following have been obtained: (i) the attenuation law of the terrain for different degrees of reliability, (ii) blast design tools based on the charge-distance relationship, (iii) the demonstration that the PPV values fit a log-normal distribution, (iv) the isoline map of PPV for the study area, (v) a detailed and valid technique for predicting the dominant frequency and the response spectrum, (vi) mathematical formulations of the amplification factors for displacement, velocity and acceleration, and (vii) an isoline map of amplification for the study area. The results obtained provide useful information for the design and control of subsequent blasts in the project.
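To illustrate how an attenuation law can be estimated by ordinary least squares, the sketch below assumes the commonly used scaled-distance form PPV = K · (D / √W)^(−β), where D is the distance and W the charge per delay. The abstract does not state the functional form that was actually used, so this form, the function name and the synthetic records are assumptions for illustration only. Taking logarithms turns the law into a straight line, ln PPV = ln K − β ln(D / √W), which is fitted with the closed-form OLS formulas.

```python
import math

def fit_attenuation_law(distances_m, charges_kg, ppv_mm_s):
    """OLS fit of ln(PPV) = ln(K) - beta * ln(D / sqrt(W)); returns (K, beta)."""
    x = [math.log(d / math.sqrt(w)) for d, w in zip(distances_m, charges_kg)]
    y = [math.log(v) for v in ppv_mm_s]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free records generated from K = 1000 and beta = 1.6.
distances = [50.0, 100.0, 200.0, 400.0, 800.0]
charges = [100.0, 100.0, 100.0, 100.0, 100.0]
ppv = [1000.0 * (d / math.sqrt(w)) ** -1.6 for d, w in zip(distances, charges)]

k, beta = fit_attenuation_law(distances, charges, ppv)
print(f"K = {k:.1f}, beta = {beta:.3f}")
```

With real, noisy records, the residuals of this log-log fit are what would be used to derive attenuation curves at different reliability levels, in line with the log-normal behavior of PPV reported in the abstract.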
Resumo:
Hoy en día, con la evolución continua y rápida de las tecnologías de la información y los dispositivos de computación, se recogen y almacenan continuamente grandes volúmenes de datos en distintos dominios y a través de diversas aplicaciones del mundo real. La extracción de conocimiento útil de una cantidad tan enorme de datos no se puede realizar habitualmente de forma manual, y requiere el uso de técnicas adecuadas de aprendizaje automático y de minería de datos. La clasificación es una de las técnicas más importantes que ha sido aplicada con éxito a varias áreas. En general, la clasificación se compone de dos pasos principales: en primer lugar, aprender un modelo de clasificación o clasificador a partir de un conjunto de datos de entrenamiento, y en segundo lugar, clasificar las nuevas instancias de datos utilizando el clasificador aprendido. La clasificación es supervisada cuando todas las etiquetas están presentes en los datos de entrenamiento (es decir, datos completamente etiquetados), semi-supervisada cuando sólo algunas etiquetas son conocidas (es decir, datos parcialmente etiquetados), y no supervisada cuando todas las etiquetas están ausentes en los datos de entrenamiento (es decir, datos no etiquetados). Además, aparte de esta taxonomía, el problema de clasificación se puede categorizar en unidimensional o multidimensional en función del número de variables clase, una o más, respectivamente; o también puede ser categorizado en estacionario o cambiante con el tiempo en función de las características de los datos y de la tasa de cambio subyacente. A lo largo de esta tesis, tratamos el problema de clasificación desde tres perspectivas diferentes, a saber, clasificación supervisada multidimensional estacionaria, clasificación semisupervisada unidimensional cambiante con el tiempo, y clasificación supervisada multidimensional cambiante con el tiempo. Para llevar a cabo esta tarea, hemos usado básicamente los clasificadores Bayesianos como modelos. 
La primera contribución, dirigiéndose al problema de clasificación supervisada multidimensional estacionaria, se compone de dos nuevos métodos de aprendizaje de clasificadores Bayesianos multidimensionales a partir de datos estacionarios. Los métodos se proponen desde dos puntos de vista diferentes. El primer método, denominado CB-MBC, se basa en una estrategia de envoltura de selección de variables que es voraz y hacia delante, mientras que el segundo, denominado MB-MBC, es una estrategia de filtrado de variables con una aproximación basada en restricciones y en el manto de Markov. Ambos métodos han sido aplicados a dos problemas reales importantes, a saber, la predicción de los inhibidores de la transcriptasa inversa y de la proteasa para el problema de infección por el virus de la inmunodeficiencia humana tipo 1 (HIV-1), y la predicción del European Quality of Life-5 Dimensions (EQ-5D) a partir de los cuestionarios de la enfermedad de Parkinson con 39 ítems (PDQ-39). El estudio experimental incluye comparaciones de CB-MBC y MB-MBC con los métodos del estado del arte de la clasificación multidimensional, así como con métodos comúnmente utilizados para resolver el problema de predicción de la enfermedad de Parkinson, a saber, la regresión logística multinomial, mínimos cuadrados ordinarios, y mínimas desviaciones absolutas censuradas. En ambas aplicaciones, los resultados han sido prometedores con respecto a la precisión de la clasificación, así como en relación al análisis de las estructuras gráficas que identifican interacciones conocidas y novedosas entre las variables. La segunda contribución, referida al problema de clasificación semi-supervisada unidimensional cambiante con el tiempo, consiste en un método nuevo (CPL-DS) para clasificar flujos de datos parcialmente etiquetados. Los flujos de datos difieren de los conjuntos de datos estacionarios en su proceso de generación muy rápido y en su aspecto de cambio de concepto. 
Es decir, los conceptos aprendidos y/o la distribución subyacente están probablemente cambiando y evolucionando en el tiempo, lo que hace que el modelo de clasificación actual sea obsoleto y deba ser actualizado. CPL-DS utiliza la divergencia de Kullback-Leibler y el método de bootstrapping para cuantificar y detectar tres tipos posibles de cambio: en las predictoras, en la a posteriori de la clase o en ambas. Después, si se detecta cualquier cambio, un nuevo modelo de clasificación se aprende usando el algoritmo EM; si no, el modelo de clasificación actual se mantiene sin modificaciones. CPL-DS es general, ya que puede ser aplicado a varios modelos de clasificación. Usando dos modelos diferentes, el clasificador naive Bayes y la regresión logística, CPL-DS se ha probado con flujos de datos sintéticos y también se ha aplicado al problema real de la detección de código malware, en el cual los nuevos ficheros recibidos deben ser continuamente clasificados en malware o goodware. Los resultados experimentales muestran que nuestro método es efectivo para la detección de diferentes tipos de cambio a partir de los flujos de datos parcialmente etiquetados y también tiene una buena precisión de la clasificación. Finalmente, la tercera contribución, sobre el problema de clasificación supervisada multidimensional cambiante con el tiempo, consiste en dos métodos adaptativos, a saber, Locally Adpative-MB-MBC (LA-MB-MBC) y Globally Adpative-MB-MBC (GA-MB-MBC). Ambos métodos monitorizan el cambio de concepto a lo largo del tiempo utilizando la log-verosimilitud media como métrica y el test de Page-Hinkley. Luego, si se detecta un cambio de concepto, LA-MB-MBC adapta el actual clasificador Bayesiano multidimensional localmente alrededor de cada nodo cambiado, mientras que GA-MB-MBC aprende un nuevo clasificador Bayesiano multidimensional. El estudio experimental realizado usando flujos de datos sintéticos multidimensionales indica los méritos de los métodos adaptativos propuestos. 
ABSTRACT Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important techniques that has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from an available training data, and secondly, classify the new incoming unseen data instances using the learned classifier. Classification is supervised when the whole class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when the whole class values are missing in the training data (i.e., unlabeled data). In addition, besides this taxonomy, the classification problem can be categorized into uni-dimensional or multi-dimensional depending on the number of class variables, one or more, respectively; or can be also categorized into stationary or streaming depending on the characteristics of the data and the rate of change underlying it. Through this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised unidimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we basically used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data. 
They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from 39-item Parkinson’s Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson’s disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both considered case studies, results are promising in terms of classification accuracy as well as regarding the analysis of the learned MBC graphical structures identifying known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from the stationary data sets by their highly rapid generation process and their concept-drifting aspect. That is, the learned concepts and/or the underlying distribution are likely changing and evolving over time, which makes the current classification model out-of-date requiring to be updated. CPL-DS uses the Kullback-Leibler divergence and bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged. 
CPL-DS is general, as it can be applied to several classification models. Using two different models, namely, the naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where newly received files must be continuously classified as malware or goodware. Experimental results show that our approach is effective at detecting different kinds of drift from partially labeled data streams and achieves good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. An experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.
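As an illustration of the monitoring step, the following is a minimal sketch of a Page-Hinkley test applied to a stream of scores. It assumes the monitored quantity is a per-batch average log-likelihood that drops when drift occurs, so the test is written to detect a decrease in the mean; the class name and the `delta` and `threshold` values are hypothetical choices, not parameters taken from the thesis.

```python
class PageHinkley:
    """Page-Hinkley test for a downward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level (often written lambda)
        self.n = 0
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation from the running mean
        self.cum_max = 0.0          # largest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean + self.delta
        self.cum_max = max(self.cum_max, self.cum)
        return (self.cum_max - self.cum) > self.threshold


# Usage on a synthetic score stream whose mean drops from -1.0 to -3.0.
scores = [-1.0] * 100 + [-3.0] * 100
detector = PageHinkley()
alarms = [i for i, s in enumerate(scores) if detector.update(s)]
first_alarm = alarms[0] if alarms else None
```

On an alarm, a locally adaptive method would update only the affected part of the model, while a globally adaptive one would relearn it; the sketch covers only the detection step.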
Resumo:
The impact of Parkinson's disease and its treatment on patients' health-related quality of life can be estimated either by means of generic measures such as the European Quality of Life-5 Dimensions (EQ-5D) or specific measures such as the 8-item Parkinson's Disease Questionnaire (PDQ-8). In clinical studies, PDQ-8 may be used instead of EQ-5D owing to a lack of resources, time or clinical interest in generic measures. Nevertheless, PDQ-8 cannot be applied in cost-effectiveness analyses, which require generic measures and quantitative utility scores, such as EQ-5D. To deal with this problem, a commonly used solution is the prediction of EQ-5D from PDQ-8. In this paper, we propose a new probabilistic method to predict EQ-5D from PDQ-8 using multi-dimensional Bayesian network classifiers. Our approach is evaluated using five-fold cross-validation experiments carried out on a Parkinson's data set containing 488 patients, and is compared with two additional Bayesian network-based approaches, two commonly used mapping methods, namely, ordinary least squares and censored least absolute deviations, and a deterministic model. Experimental results are promising in terms of predictive performance as well as the identification of dependence relationships among EQ-5D and PDQ-8 items that the mapping approaches are unable to detect.
Resumo:
Different theoretical approaches have been used in studies of biomolecular systems with the aim of contributing to the treatment of various diseases. For neuropathic pain, for example, the study of compounds that interact with the sigma-1 receptor (Sig-1R) can elucidate the main factors associated with their biological activity. To this end, Quantitative Structure-Activity Relationship (QSAR) studies using Partial Least Squares (PLS) regression and Artificial Neural Network (ANN) methods were applied to 64 Sig-1R antagonists belonging to the 1-arylpyrazole class. PLS and ANN models were used to describe linear and nonlinear relationships, respectively, between a set of descriptors and the biological activity of the selected compounds. The PLS model was obtained with 51 compounds in the training set and 13 compounds in the test set (r² = 0.768, q² = 0.684 and r²test = 0.785). Leave-N-out tests, randomization of the biological activity and outlier detection confirmed the robustness and stability of the models and showed that they were not obtained by chance correlations. Models were also generated with a Multilayer Perceptron Artificial Neural Network (MLP-ANN); the 6-12-1 architecture, trained with tansig-tansig transfer functions, gave the best response for predicting the biological activity of the compounds (r²training = 0.891, r²validation = 0.852 and r²test = 0.793). Another approach was used to simulate the environment of synaptic membranes using lipid bilayers composed of POPC, DOPE, POPS and cholesterol. The molecular dynamics studies showed that high cholesterol concentrations reduce the area per lipid and the lateral diffusion, and increase the membrane thickness and the order parameter values, as a result of the ordering of the phospholipid acyl chains. 
The lipid bilayers obtained can be used to simulate interactions between lipids and small molecules or proteins, contributing to research on diseases such as Alzheimer's and Parkinson's. The approaches used in this thesis are essential for the development of new research in Computational Medicinal Chemistry.
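The difference between the fitted r² and the cross-validated q² reported for the PLS model can be made concrete with a small sketch. It swaps in a single-descriptor ordinary least-squares line for the multi-descriptor PLS model, uses leave-one-out cross-validation, and divides by the total sum of squares around the full-data mean; these are simplifying assumptions, and the data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys):
    """Coefficient of determination of the line fitted on all the data."""
    slope, intercept = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def q_squared_loo(xs, ys):
    """Leave-one-out q2: each point is predicted by a model fitted without it."""
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    press = 0.0  # predictive residual sum of squares
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / ss_tot


# Made-up descriptor and activity values, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
r2 = r_squared(descriptor, activity)
q2 = q_squared_loo(descriptor, activity)
```

Because every held-out point is predicted by a model that never saw it, q² cannot exceed r² for a least-squares fit under this definition, which is why it is read as a measure of robustness rather than of fit.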
Resumo:
In the present paper, a methodology is proposed for obtaining empirical equations describing the sound absorption characteristics of an absorbing material obtained from natural fibers, specifically from coconut. The method, which was previously applied to other materials, requires measurements of the air-flow resistivity and of the acoustic impedance of samples of the material under study. The equations that govern the acoustic behavior of the material are then derived by means of a least-squares fit of the acoustic impedance and of the propagation constant. These results are useful because the empirically obtained analytical equations can be easily incorporated into prediction and simulation models of acoustic systems for noise control that use the studied materials.
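A least-squares fit of this kind can be sketched as follows. The abstract does not state the functional form of the empirical equations, so the sketch assumes a power law y = a·x^b, a form common in empirical porous-absorber models, and fits it as a straight line in log-log space. The function name and the synthetic values (a = 0.08, b = -0.7) are illustrative only and are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight-line fit in log-log space.

    Requires strictly positive xs and ys. Note that this minimizes squared
    errors of log(y), not of y itself.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b


# Synthetic, noise-free data generated from an assumed power law.
x_values = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
y_values = [0.08 * x ** -0.7 for x in x_values]
a_fit, b_fit = fit_power_law(x_values, y_values)
```

With measured data, x would typically be a dimensionless combination of frequency and air-flow resistivity, and one such fit would be run for each real and imaginary component of the impedance and of the propagation constant.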