856 resultados para regression algorithm
Resumo:
Using the classical Parzen window (PW) estimate as the target function, the sparse kernel density estimator is constructed in a forward-constrained regression (FCR) manner. The proposed algorithm selects significant kernels one at a time, while the leave-one-out (LOO) test score is minimized subject to a simple positivity constraint in each forward stage. The model parameter estimation in each forward stage is simply the solution of jackknife parameter estimator for a single parameter, subject to the same positivity constraint check. For each selected kernels, the associated kernel width is updated via the Gauss-Newton method with the model parameter estimate fixed. The proposed approach is simple to implement and the associated computational cost is very low. Numerical examples are employed to demonstrate the efficacy of the proposed approach.
Resumo:
In this brief, we propose an orthogonal forward regression (OFR) algorithm based on the principles of the branch and bound (BB) and A-optimality experimental design. At each forward regression step, each candidate from a pool of candidate regressors, referred to as S, is evaluated in turn with three possible decisions: 1) one of these is selected and included into the model; 2) some of these remain in S for evaluation in the next forward regression step; and 3) the rest are permanently eliminated from S. Based on the BB principle in combination with an A-optimality composite cost function for model structure determination, a simple adaptive diagnostics test is proposed to determine the decision boundary between 2) and 3). As such the proposed algorithm can significantly reduce the computational cost in the A-optimality OFR algorithm. Numerical examples are used to demonstrate the effectiveness of the proposed algorithm.
Resumo:
In this paper we propose an efficient two-level model identification method for a large class of linear-in-the-parameters models from the observational data. A new elastic net orthogonal forward regression (ENOFR) algorithm is employed at the lower level to carry out simultaneous model selection and elastic net parameter estimation. The two regularization parameters in the elastic net are optimized using a particle swarm optimization (PSO) algorithm at the upper level by minimizing the leave one out (LOO) mean square error (LOOMSE). Illustrative examples are included to demonstrate the effectiveness of the new approaches.
Resumo:
In this letter, a Box-Cox transformation-based radial basis function (RBF) neural network is introduced using the RBF neural network to represent the transformed system output. Initially a fixed and moderate sized RBF model base is derived based on a rank revealing orthogonal matrix triangularization (QR decomposition). Then a new fast identification algorithm is introduced using Gauss-Newton algorithm to derive the required Box-Cox transformation, based on a maximum likelihood estimator. The main contribution of this letter is to explore the special structure of the proposed RBF neural network for computational efficiency by utilizing the inverse of matrix block decomposition lemma. Finally, the Box-Cox transformation-based RBF neural network, with good generalization and sparsity, is identified based on the derived optimal Box-Cox transformation and a D-optimality-based orthogonal forward regression algorithm. The proposed algorithm and its efficacy are demonstrated with an illustrative example in comparison with support vector machine regression.
Resumo:
SNP genotyping arrays have been developed to characterize single-nucleotide polymorphisms (SNPs) and DNA copy number variations (CNVs). The quality of the inferences about copy number can be affected by many factors including batch effects, DNA sample preparation, signal processing, and analytical approach. Nonparametric and model-based statistical algorithms have been developed to detect CNVs from SNP genotyping data. However, these algorithms lack specificity to detect small CNVs due to the high false positive rate when calling CNVs based on the intensity values. Association tests based on detected CNVs therefore lack power even if the CNVs affecting disease risk are common. In this research, by combining an existing Hidden Markov Model (HMM) and the logistic regression model, a new genome-wide logistic regression algorithm was developed to detect CNV associations with diseases. We showed that the new algorithm is more sensitive and can be more powerful in detecting CNV associations with diseases than an existing popular algorithm, especially when the CNV association signal is weak and a limited number of SNPs are located in the CNV.^
Resumo:
Phase equilibrium data regression is an unavoidable task necessary to obtain the appropriate values for any model to be used in separation equipment design for chemical process simulation and optimization. The accuracy of this process depends on different factors such as the experimental data quality, the selected model and the calculation algorithm. The present paper summarizes the results and conclusions achieved in our research on the capabilities and limitations of the existing GE models and about strategies that can be included in the correlation algorithms to improve the convergence and avoid inconsistencies. The NRTL model has been selected as a representative local composition model. New capabilities of this model, but also several relevant limitations, have been identified and some examples of the application of a modified NRTL equation have been discussed. Furthermore, a regression algorithm has been developed that allows for the advisable simultaneous regression of all the condensed phase equilibrium regions that are present in ternary systems at constant T and P. It includes specific strategies designed to avoid some of the pitfalls frequently found in commercial regression tools for phase equilibrium calculations. Most of the proposed strategies are based on the geometrical interpretation of the lowest common tangent plane equilibrium criterion, which allows an unambiguous comprehension of the behavior of the mixtures. The paper aims to show all the work as a whole in order to reveal the necessary efforts that must be devoted to overcome the difficulties that still exist in the phase equilibrium data regression problem.
Resumo:
Background: Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of C-beta atoms in other residues within a sphere around the C-beta atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence. Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either contacted or non-contacted, the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds. Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary sequence and higher order consecutive protein structural and functional properties.
Resumo:
L'application de classifieurs linéaires à l'analyse des données d'imagerie cérébrale (fMRI) a mené à plusieurs percées intéressantes au cours des dernières années. Ces classifieurs combinent linéairement les réponses des voxels pour détecter et catégoriser différents états du cerveau. Ils sont plus agnostics que les méthodes d'analyses conventionnelles qui traitent systématiquement les patterns faibles et distribués comme du bruit. Dans le présent projet, nous utilisons ces classifieurs pour valider une hypothèse portant sur l'encodage des sons dans le cerveau humain. Plus précisément, nous cherchons à localiser des neurones, dans le cortex auditif primaire, qui détecteraient les modulations spectrales et temporelles présentes dans les sons. Nous utilisons les enregistrements fMRI de sujets soumis à 49 modulations spectro-temporelles différentes. L'analyse fMRI au moyen de classifieurs linéaires n'est pas standard, jusqu'à maintenant, dans ce domaine. De plus, à long terme, nous avons aussi pour objectif le développement de nouveaux algorithmes d'apprentissage automatique spécialisés pour les données fMRI. Pour ces raisons, une bonne partie des expériences vise surtout à étudier le comportement des classifieurs. Nous nous intéressons principalement à 3 classifieurs linéaires standards, soient l'algorithme machine à vecteurs de support (linéaire), l'algorithme régression logistique (régularisée) et le modèle bayésien gaussien naïf (variances partagées).
Resumo:
Inferring the spatial expansion dynamics of invading species from molecular data is notoriously difficult due to the complexity of the processes involved. For these demographic scenarios, genetic data obtained from highly variable markers may be profitably combined with specific sampling schemes and information from other sources using a Bayesian approach. The geographic range of the introduced toad Bufo marinus is still expanding in eastern and northern Australia, in each case from isolates established around 1960. A large amount of demographic and historical information is available on both expansion areas. In each area, samples were collected along a transect representing populations of different ages and genotyped at 10 microsatellite loci. Five demographic models of expansion, differing in the dispersal pattern for migrants and founders and in the number of founders, were considered. Because the demographic history is complex, we used an approximate Bayesian method, based on a rejection-regression algorithm. to formally test the relative likelihoods of the five models of expansion and to infer demographic parameters. A stepwise migration-foundation model with founder events was statistically better supported than other four models in both expansion areas. Posterior distributions supported different dynamics of expansion in the studied areas. Populations in the eastern expansion area have a lower stable effective population size and have been founded by a smaller number of individuals than those in the northern expansion area. Once demographically stabilized, populations exchange a substantial number of effective migrants per generation in both expansion areas, and such exchanges are larger in northern than in eastern Australia. The effective number of migrants appears to be considerably lower than that of founders in both expansion areas. We found our inferences to be relatively robust to various assumptions on marker. demographic, and historical features. The method presented here is the only robust, model-based method available so far, which allows inferring complex population dynamics over a short time scale. It also provides the basis for investigating the interplay between population dynamics, drift, and selection in invasive species.
Resumo:
A modified radial basis function (RBF) neural network and its identification algorithm based on observational data with heterogeneous noise are introduced. The transformed system output of Box-Cox is represented by the RBF neural network. To identify the model from observational data, the singular value decomposition of the full regression matrix consisting of basis functions formed by system input data is initially carried out and a new fast identification method is then developed using Gauss-Newton algorithm to derive the required Box-Cox transformation, based on a maximum likelihood estimator (MLE) for a model base spanned by the largest eigenvectors. Finally, the Box-Cox transformation-based RBF neural network, with good generalisation and sparsity, is identified based on the derived optimal Box-Cox transformation and an orthogonal forward regression algorithm using a pseudo-PRESS statistic to select a sparse RBF model with good generalisation. The proposed algorithm and its efficacy are demonstrated with numerical examples.
Resumo:
The modelling of a nonlinear stochastic dynamical processes from data involves solving the problems of data gathering, preprocessing, model architecture selection, learning or adaptation, parametric evaluation and model validation. For a given model architecture such as associative memory networks, a common problem in non-linear modelling is the problem of "the curse of dimensionality". A series of complementary data based constructive identification schemes, mainly based on but not limited to an operating point dependent fuzzy models, are introduced in this paper with the aim to overcome the curse of dimensionality. These include (i) a mixture of experts algorithm based on a forward constrained regression algorithm; (ii) an inherent parsimonious delaunay input space partition based piecewise local lineal modelling concept; (iii) a neurofuzzy model constructive approach based on forward orthogonal least squares and optimal experimental design and finally (iv) the neurofuzzy model construction algorithm based on basis functions that are Bézier Bernstein polynomial functions and the additive decomposition. Illustrative examples demonstrate their applicability, showing that the final major hurdle in data based modelling has almost been removed.
Resumo:
Considering the Wald, score, and likelihood ratio asymptotic test statistics, we analyze a multivariate null intercept errors-in-variables regression model, where the explanatory and the response variables are subject to measurement errors, and a possible structure of dependency between the measurements taken within the same individual are incorporated, representing a longitudinal structure. This model was proposed by Aoki et al. (2003b) and analyzed under the bayesian approach. In this article, considering the classical approach, we analyze asymptotic test statistics and present a simulation study to compare the behavior of the three test statistics for different sample sizes, parameter values and nominal levels of the test. Also, closed form expressions for the score function and the Fisher information matrix are presented. We consider two real numerical illustrations, the odontological data set from Hadgu and Koch (1999), and a quality control data set.
Resumo:
A identificação e descrição dos caracteres litológicos de uma formação são indispensáveis à avaliação de formações complexas. Com este objetivo, tem sido sistematicamente usada a combinação de ferramentas nucleares em poços não-revestidos. Os perfis resultantes podem ser considerados como a interação entre duas fases distintas: • Fase de transporte da radiação desde a fonte até um ou mais detectores, através da formação. • Fase de detecção, que consiste na coleção da radiação, sua transformação em pulsos de corrente e, finalmente, na distribuição espectral destes pulsos. Visto que a presença do detector não afeta fortemente o resultado do transporte da radiação, cada fase pode ser simulada independentemente uma da outra, o que permite introduzir um novo tipo de modelamento que desacopla as duas fases. Neste trabalho, a resposta final é simulada combinando soluções numéricas do transporte com uma biblioteca de funções resposta do detector, para diferentes energias incidentes e para cada arranjo específico de fontes e detectores. O transporte da radiação é calculado através do algoritmo de elementos finitos (FEM), na forma de fluxo escalar 2½-D, proveniente da solução numérica da aproximação de difusão para multigrupos da equação de transporte de Boltzmann, no espaço de fase, dita aproximação P1, onde a variável direção é expandida em termos dos polinômios ortogonais de Legendre. Isto determina a redução da dimensionalidade do problema, tornando-o mais compatível com o algoritmo FEM, onde o fluxo dependa exclusivamente da variável espacial e das propriedades físicas da formação. A função resposta do detector NaI(Tl) é obtida independentemente pelo método Monte Carlo (MC) em que a reconstrução da vida de uma partícula dentro do cristal cintilador é feita simulando, interação por interação, a posição, direção e energia das diferentes partículas, com a ajuda de números aleatórios aos quais estão associados leis de probabilidades adequadas. Os possíveis tipos de interação (Rayleigh, Efeito fotoelétrico, Compton e Produção de pares) são determinados similarmente. Completa-se a simulação quando as funções resposta do detector são convolvidas com o fluxo escalar, produzindo como resposta final, o espectro de altura de pulso do sistema modelado. Neste espectro serão selecionados conjuntos de canais denominados janelas de detecção. As taxas de contagens em cada janela apresentam dependências diferenciadas sobre a densidade eletrônica e a fitologia. Isto permite utilizar a combinação dessas janelas na determinação da densidade e do fator de absorção fotoelétrico das formações. De acordo com a metodologia desenvolvida, os perfis, tanto em modelos de camadas espessas quanto finas, puderam ser simulados. O desempenho do método foi testado em formações complexas, principalmente naquelas em que a presença de minerais de argila, feldspato e mica, produziram efeitos consideráveis capazes de perturbar a resposta final das ferramentas. Os resultados mostraram que as formações com densidade entre 1.8 e 4.0 g/cm3 e fatores de absorção fotoelétrico no intervalo de 1.5 a 5 barns/e-, tiveram seus caracteres físicos e litológicos perfeitamente identificados. As concentrações de Potássio, Urânio e Tório, puderam ser obtidas com a introdução de um novo sistema de calibração, capaz de corrigir os efeitos devidos à influência de altas variâncias e de correlações negativas, observadas principalmente no cálculo das concentrações em massa de Urânio e Potássio. Na simulação da resposta da sonda CNL, utilizando o algoritmo de regressão polinomial de Tittle, foi verificado que, devido à resolução vertical limitada por ela apresentada, as camadas com espessuras inferiores ao espaçamento fonte - detector mais distante tiveram os valores de porosidade aparente medidos erroneamente. Isto deve-se ao fato do algoritmo de Tittle aplicar-se exclusivamente a camadas espessas. Em virtude desse erro, foi desenvolvido um método que leva em conta um fator de contribuição determinado pela área relativa de cada camada dentro da zona de máxima informação. Assim, a porosidade de cada ponto em subsuperfície pôde ser determinada convolvendo estes fatores com os índices de porosidade locais, porém supondo cada camada suficientemente espessa a fim de adequar-se ao algoritmo de Tittle. Por fim, as limitações adicionais impostas pela presença de minerais perturbadores, foram resolvidas supondo a formação como que composta por um mineral base totalmente saturada com água, sendo os componentes restantes considerados perturbações sobre este caso base. Estes resultados permitem calcular perfis sintéticos de poço, que poderão ser utilizados em esquemas de inversão com o objetivo de obter uma avaliação quantitativa mais detalhada de formações complexas.