874 resultados para correlation-based feature selection


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Photo-mosaicing techniques have become popular for seafloor mapping in various marine science applications. However, the common methods cannot accurately map regions with high relief and topographical variations. Ortho-mosaicing borrowed from photogrammetry is an alternative technique that enables taking into account the 3-D shape of the terrain. A serious bottleneck is the volume of elevation information that needs to be estimated from the video data, fused, and processed for the generation of a composite ortho-photo that covers a relatively large seafloor area. We present a framework that combines the advantages of dense depth-map and 3-D feature estimation techniques based on visual motion cues. The main goal is to identify and reconstruct certain key terrain feature points that adequately represent the surface with minimal complexity in the form of piecewise planar patches. The proposed implementation utilizes local depth maps for feature selection, while tracking over several views enables 3-D reconstruction by bundle adjustment. Experimental results with synthetic and real data validate the effectiveness of the proposed approach

Relevância:

100.00% 100.00%

Publicador:

Resumo:

It is well known that cointegration between the level of two variables (labeled Yt and yt in this paper) is a necessary condition to assess the empirical validity of a present-value model (PV and PVM, respectively, hereafter) linking them. The work on cointegration has been so prevalent that it is often overlooked that another necessary condition for the PVM to hold is that the forecast error entailed by the model is orthogonal to the past. The basis of this result is the use of rational expectations in forecasting future values of variables in the PVM. If this condition fails, the present-value equation will not be valid, since it will contain an additional term capturing the (non-zero) conditional expected value of future error terms. Our article has a few novel contributions, but two stand out. First, in testing for PVMs, we advise to split the restrictions implied by PV relationships into orthogonality conditions (or reduced rank restrictions) before additional tests on the value of parameters. We show that PV relationships entail a weak-form common feature relationship as in Hecq, Palm, and Urbain (2006) and in Athanasopoulos, Guillén, Issler and Vahid (2011) and also a polynomial serial-correlation common feature relationship as in Cubadda and Hecq (2001), which represent restrictions on dynamic models which allow several tests for the existence of PV relationships to be used. Because these relationships occur mostly with nancial data, we propose tests based on generalized method of moment (GMM) estimates, where it is straightforward to propose robust tests in the presence of heteroskedasticity. We also propose a robust Wald test developed to investigate the presence of reduced rank models. Their performance is evaluated in a Monte-Carlo exercise. Second, in the context of asset pricing, we propose applying a permanent-transitory (PT) decomposition based on Beveridge and Nelson (1981), which focus on extracting the long-run component of asset prices, a key concept in modern nancial theory as discussed in Alvarez and Jermann (2005), Hansen and Scheinkman (2009), and Nieuwerburgh, Lustig, Verdelhan (2010). Here again we can exploit the results developed in the common cycle literature to easily extract permament and transitory components under both long and also short-run restrictions. The techniques discussed herein are applied to long span annual data on long- and short-term interest rates and on price and dividend for the U.S. economy. In both applications we do not reject the existence of a common cyclical feature vector linking these two series. Extracting the long-run component shows the usefulness of our approach and highlights the presence of asset-pricing bubbles.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This study aimed to determine the best auxiliary trait for indirect selection of soybean grain yield, through path analysis and in avoidance of the adverse effects of multicollinearity and expected response. Seventy-nine F5 soybean genotypes from the cross FT-Cometa x Bossier were used. The populations were distributed on the field was the families inserted with replicated controls. Primary and secondary traits of grain yield were evaluated in four phenotypically superior plants per family. The traits number of pods, height and number of nodes were considered as the most important, showing the best combination of direct effect and genotypic correlation. The number of pods achieved the highest expected gain through the estimation method based on the selection differential. On the other hand, plant height, by the method based on selection intensity, was not a good indicator of the most productive plants.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Several recent studies in literature have identified brain morphological alterations associated to Borderline Personality Disorder (BPD) patients. These findings are reported by studies based on voxel-based-morphometry analysis of structural MRI data, comparing mean gray-matter concentration between groups of BPD patients and healthy controls. On the other hand, mean differences between groups are not informative about the discriminative value of neuroimaging data to predict the group of individual subjects. In this paper, we go beyond mean differences analyses, and explore to what extent individual BPD patients can be differentiated from controls (25 subjects in each group), using a combination of automated-morphometric tools for regional cortical thickness/volumetric estimation and Support Vector Machine classifier. The approach included a feature selection step in order to identify the regions containing most discriminative information. The accuracy of this classifier was evaluated using the leave-one-subject-out procedure. The brain regions indicated as containing relevant information to discriminate groups were the orbitofrontal, rostral anterior cingulate, posterior cingulate, middle temporal cortices, among others. These areas, which are distinctively involved in emotional and affect regulation of BPD patients, were the most informative regions to achieve both sensitivity and specificity values of 80% in SVM classification. The findings suggest that this new methodology can add clinical and potential diagnostic value to neuroimaging of psychiatric disorders. (C) 2012 Elsevier Ltd. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The present work belongs to the PRANA project, the first extensive field campaign of observation of atmospheric emission spectra covering the Far InfraRed spectral region, for more than two years. The principal deployed instrument is REFIR-PAD, a Fourier transform spectrometer used by us to study Antarctic cloud properties. A dataset covering the whole 2013 has been analyzed and, firstly, a selection of good quality spectra is performed, using, as thresholds, radiance values in few chosen spectral regions. These spectra are described in a synthetic way averaging radiances in selected intervals, converting them into BTs and finally considering the differences between each pair of them. A supervised feature selection algorithm is implemented with the purpose to select the features really informative about the presence, the phase and the type of cloud. Hence, training and test sets are collected, by means of Lidar quick-looks. The supervised classification step of the overall monthly datasets is performed using a SVM. On the base of this classification and with the help of Lidar observations, 29 non-precipitating ice cloud case studies are selected. A single spectrum, or at most an average over two or three spectra, is processed by means of the retrieval algorithm RT-RET, exploiting some main IR window channels, in order to extract cloud properties. Retrieved effective radii and optical depths are analyzed, to compare them with literature studies and to evaluate possible seasonal trends. Finally, retrieval output atmospheric profiles are used as inputs for simulations, assuming two different crystal habits, with the aim to examine our ability to reproduce radiances in the FIR. Substantial mis-estimations are found for FIR micro-windows: a high variability is observed in the spectral pattern of simulation deviations from measured spectra and an effort to link these deviations to cloud parameters has been performed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Nowadays communication is switching from a centralized scenario, where communication media like newspapers, radio, TV programs produce information and people are just consumers, to a completely different decentralized scenario, where everyone is potentially an information producer through the use of social networks, blogs, forums that allow a real-time worldwide information exchange. These new instruments, as a result of their widespread diffusion, have started playing an important socio-economic role. They are the most used communication media and, as a consequence, they constitute the main source of information enterprises, political parties and other organizations can rely on. Analyzing data stored in servers all over the world is feasible by means of Text Mining techniques like Sentiment Analysis, which aims to extract opinions from huge amount of unstructured texts. This could lead to determine, for instance, the user satisfaction degree about products, services, politicians and so on. In this context, this dissertation presents new Document Sentiment Classification methods based on the mathematical theory of Markov Chains. All these approaches bank on a Markov Chain based model, which is language independent and whose killing features are simplicity and generality, which make it interesting with respect to previous sophisticated techniques. Every discussed technique has been tested in both Single-Domain and Cross-Domain Sentiment Classification areas, comparing performance with those of other two previous works. The performed analysis shows that some of the examined algorithms produce results comparable with the best methods in literature, with reference to both single-domain and cross-domain tasks, in $2$-classes (i.e. positive and negative) Document Sentiment Classification. However, there is still room for improvement, because this work also shows the way to walk in order to enhance performance, that is, a good novel feature selection process would be enough to outperform the state of the art. Furthermore, since some of the proposed approaches show promising results in $2$-classes Single-Domain Sentiment Classification, another future work will regard validating these results also in tasks with more than $2$ classes.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

An investigation was conducted to evaluate the impact of experimental designs and spatial analyses (single-trial models) of the response to selection for grain yield in the northern grains region of Australia (Queensland and northern New South Wales). Two sets of multi-environment experiments were considered. One set, based on 33 trials conducted from 1994 to 1996, was used to represent the testing system of the wheat breeding program and is referred to as the multi-environment trial (MET). The second set, based on 47 trials conducted from 1986 to 1993, sampled a more diverse set of years and management regimes and was used to represent the target population of environments (TPE). There were 18 genotypes in common between the MET and TPE sets of trials. From indirect selection theory, the phenotypic correlation coefficient between the MET and TPE single-trial adjusted genotype means [r(p(MT))] was used to determine the effect of the single-trial model on the expected indirect response to selection for grain yield in the TPE based on selection in the MET. Five single-trial models were considered: randomised complete block (RCB), incomplete block (IB), spatial analysis (SS), spatial analysis with a measurement error (SSM) and a combination of spatial analysis and experimental design information to identify the preferred (PF) model. Bootstrap-resampling methodology was used to construct multiple MET data sets, ranging in size from 2 to 20 environments per MET sample. The size and environmental composition of the MET and the single-trial model influenced the r(p(MT)). On average, the PF model resulted in a higher r(p(MT)) than the IB, SS and SSM models, which were in turn superior to the RCB model for MET sizes based on fewer than ten environments. For METs based on ten or more environments, the r(p(MT)) was similar for all single-trial models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

One of the main challenges of classifying clinical data is determining how to handle missing features. Most research favours imputing of missing values or neglecting records that include missing data, both of which can degrade accuracy when missing values exceed a certain level. In this research we propose a methodology to handle data sets with a large percentage of missing values and with high variability in which particular data are missing. Feature selection is effected by picking variables sequentially in order of maximum correlation with the dependent variable and minimum correlation with variables already selected. Classification models are generated individually for each test case based on its particular feature set and the matching data values available in the training population. The method was applied to real patients' anonymous mental-health data where the task was to predict the suicide risk judgement clinicians would give for each patient's data, with eleven possible outcome classes: zero to ten, representing no risk to maximum risk. The results compare favourably with alternative methods and have the advantage of ensuring explanations of risk are based only on the data given, not imputed data. This is important for clinical decision support systems using human expertise for modelling and explaining predictions.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Thanks to recent advances in molecular biology, allied to an ever increasing amount of experimental data, the functional state of thousands of genes can now be extracted simultaneously by using methods such as cDNA microarrays and RNA-Seq. Particularly important related investigations are the modeling and identification of gene regulatory networks from expression data sets. Such a knowledge is fundamental for many applications, such as disease treatment, therapeutic intervention strategies and drugs design, as well as for planning high-throughput new experiments. Methods have been developed for gene networks modeling and identification from expression profiles. However, an important open problem regards how to validate such approaches and its results. This work presents an objective approach for validation of gene network modeling and identification which comprises the following three main aspects: (1) Artificial Gene Networks (AGNs) model generation through theoretical models of complex networks, which is used to simulate temporal expression data; (2) a computational method for gene network identification from the simulated data, which is founded on a feature selection approach where a target gene is fixed and the expression profile is observed for all other genes in order to identify a relevant subset of predictors; and (3) validation of the identified AGN-based network through comparison with the original network. The proposed framework allows several types of AGNs to be generated and used in order to simulate temporal expression data. The results of the network identification method can then be compared to the original network in order to estimate its properties and accuracy. Some of the most important theoretical models of complex networks have been assessed: the uniformly-random Erdos-Renyi (ER), the small-world Watts-Strogatz (WS), the scale-free Barabasi-Albert (BA), and geographical networks (GG). The experimental results indicate that the inference method was sensitive to average degree k variation, decreasing its network recovery rate with the increase of k. The signal size was important for the inference method to get better accuracy in the network identification rate, presenting very good results with small expression profiles. However, the adopted inference method was not sensible to recognize distinct structures of interaction among genes, presenting a similar behavior when applied to different network topologies. In summary, the proposed framework, though simple, was adequate for the validation of the inferred networks by identifying some properties of the evaluated method, which can be extended to other inference methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Background: The inference of gene regulatory networks (GRNs) from large-scale expression profiles is one of the most challenging problems of Systems Biology nowadays. Many techniques and models have been proposed for this task. However, it is not generally possible to recover the original topology with great accuracy, mainly due to the short time series data in face of the high complexity of the networks and the intrinsic noise of the expression measurements. In order to improve the accuracy of GRNs inference methods based on entropy (mutual information), a new criterion function is here proposed. Results: In this paper we introduce the use of generalized entropy proposed by Tsallis, for the inference of GRNs from time series expression profiles. The inference process is based on a feature selection approach and the conditional entropy is applied as criterion function. In order to assess the proposed methodology, the algorithm is applied to recover the network topology from temporal expressions generated by an artificial gene network (AGN) model as well as from the DREAM challenge. The adopted AGN is based on theoretical models of complex networks and its gene transference function is obtained from random drawing on the set of possible Boolean functions, thus creating its dynamics. On the other hand, DREAM time series data presents variation of network size and its topologies are based on real networks. The dynamics are generated by continuous differential equations with noise and perturbation. By adopting both data sources, it is possible to estimate the average quality of the inference with respect to different network topologies, transfer functions and network sizes. Conclusions: A remarkable improvement of accuracy was observed in the experimental results by reducing the number of false connections in the inferred topology by the non-Shannon entropy. The obtained best free parameter of the Tsallis entropy was on average in the range 2.5 <= q <= 3.5 (hence, subextensive entropy), which opens new perspectives for GRNs inference methods based on information theory and for investigation of the nonextensivity of such networks. The inference algorithm and criterion function proposed here were implemented and included in the DimReduction software, which is freely available at http://sourceforge.net/projects/dimreduction and http://code.google.com/p/dimreduction/.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

There are many techniques for electricity market price forecasting. However, most of them are designed for expected price analysis rather than price spike forecasting. An effective method of predicting the occurrence of spikes has not yet been observed in the literature so far. In this paper, a data mining based approach is presented to give a reliable forecast of the occurrence of price spikes. Combined with the spike value prediction techniques developed by the same authors, the proposed approach aims at providing a comprehensive tool for price spike forecasting. In this paper, feature selection techniques are firstly described to identify the attributes relevant to the occurrence of spikes. A simple introduction to the classification techniques is given for completeness. Two algorithms: support vector machine and probability classifier are chosen to be the spike occurrence predictors and are discussed in details. Realistic market data are used to test the proposed model with promising results.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Genetic improvement of common bean nutritional quality has advantages in marketing and can contribute to society as a food source. The objective of this study was to evaluate the genetic variability for grain yield, calcium and iron concentrations in grains of inbred common bean lines obtained by different breeding methods. For this, 136 F7 inbred lines were obtained using the Pedigree method and 136 F7 inbred lines were obtained using the Single-Seed Descent (SSD) method. The lines showed genetic variability for grain yield, and concentrations of calcium and iron independently of the method of advancing segregating populations. The Pedigree method allows obtaining a greater number of lines with high grain yield. Selection using the SSD method allows the identification of a larger number of lines with high concentrations of calcium and iron in grains. Weak negative correlations were found between grain yield and calcium concentration (r = -0.0994) and grain yield and iron concentration (r = -0.3926). Several lines show genetic superiority for grain yield and concentrations of calcium and iron in grains and their selection can result in new common bean cultivars with high nutritional quality.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Low noise surfaces have been increasingly considered as a viable and cost-effective alternative to acoustical barriers. However, road planners and administrators frequently lack information on the correlation between the type of road surface and the resulting noise emission profile. To address this problem, a method to identify and classify different types of road pavements was developed, whereby near field road noise is analyzed using statistical learning methods. The vehicle rolling sound signal near the tires and close to the road surface was acquired by two microphones in a special arrangement which implements the Close-Proximity method. A set of features, characterizing the properties of the road pavement, was extracted from the corresponding sound profiles. A feature selection method was used to automatically select those that are most relevant in predicting the type of pavement, while reducing the computational cost. A set of different types of road pavement segments were tested and the performance of the classifier was evaluated. Results of pavement classification performed during a road journey are presented on a map, together with geographical data. This procedure leads to a considerable improvement in the quality of road pavement noise data, thereby increasing the accuracy of road traffic noise prediction models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A organização automática de mensagens de correio electrónico é um desafio actual na área da aprendizagem automática. O número excessivo de mensagens afecta cada vez mais utilizadores, especialmente os que usam o correio electrónico como ferramenta de comunicação e trabalho. Esta tese aborda o problema da organização automática de mensagens de correio electrónico propondo uma solução que tem como objectivo a etiquetagem automática de mensagens. A etiquetagem automática é feita com recurso às pastas de correio electrónico anteriormente criadas pelos utilizadores, tratando-as como etiquetas, e à sugestão de múltiplas etiquetas para cada mensagem (top-N). São estudadas várias técnicas de aprendizagem e os vários campos que compõe uma mensagem de correio electrónico são analisados de forma a determinar a sua adequação como elementos de classificação. O foco deste trabalho recai sobre os campos textuais (o assunto e o corpo das mensagens), estudando-se diferentes formas de representação, selecção de características e algoritmos de classificação. É ainda efectuada a avaliação dos campos de participantes através de algoritmos de classificação que os representam usando o modelo vectorial ou como um grafo. Os vários campos são combinados para classificação utilizando a técnica de combinação de classificadores Votação por Maioria. Os testes são efectuados com um subconjunto de mensagens de correio electrónico da Enron e um conjunto de dados privados disponibilizados pelo Institute for Systems and Technologies of Information, Control and Communication (INSTICC). Estes conjuntos são analisados de forma a perceber as características dos dados. A avaliação do sistema é realizada através da percentagem de acerto dos classificadores. Os resultados obtidos apresentam melhorias significativas em comparação com os trabalhos relacionados.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Dissertação para obtenção do grau de Mestre em Engenharia Informática