880 results for Feature selection process
Abstract:
Background: A current challenge in gene annotation is to define gene function in the context of a network of relationships rather than for single genes. The inference of gene networks (GNs) has emerged as an approach to better understand the biology of the system and to study how the components of the network interact with each other and keep their functions stable. In general, however, there are not sufficient data to accurately recover GNs from expression levels alone, which leads to the curse of dimensionality: the number of variables is higher than the number of samples. One way to mitigate this problem is to integrate biological data rather than using only expression profiles in the inference process. The use of several types of biological information in inference methods has increased significantly in recent years, in order to better recover the connections between genes and to reduce false positives. What makes this strategy interesting is the possibility of confirming known connections through the included biological data, and of discovering new relationships between genes when the expression data are examined. Although several works on data integration have increased the performance of network inference methods, the real contribution of each type of biological information to the obtained improvement is not clear. Methods: We propose a methodology for including biological information in an inference algorithm in order to assess the prediction gain from using biological information and expression profiles together. We also evaluated and compared the gain from adding four types of biological information: (a) protein-protein interaction, (b) Rosetta stone fusion proteins, (c) KEGG and (d) KEGG+GO. Results and conclusions: This work presents a first comparison of the gain from using prior biological information in the inference of GNs for a eukaryotic organism (P. falciparum).
Our results indicate that information based on direct interaction can produce a higher gain than data about less specific relationships, such as GO or KEGG. Also, as expected, the results show that the use of biological information is a very important approach for improving the inference. We also compared the gain in the inference of the global network with the gain for the hubs only. The results indicate that the use of biological information can improve the identification of the most connected proteins.
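As a generic illustration of the data-integration idea described above, the sketch below scores a candidate gene-gene edge by blending absolute expression correlation with a bonus for a known prior interaction (e.g. a PPI link). The blending scheme, the weight alpha=0.3 and the toy expression profiles are assumptions for illustration, not the method evaluated in the study.

```python
import math

def pearson(x, y):
    # Plain Pearson correlation between two expression profiles.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def edge_score(expr_i, expr_j, has_prior, alpha=0.3):
    # Blend expression similarity with a prior-knowledge bonus.
    # alpha weights the biological prior; both the linear blend and
    # alpha=0.3 are illustrative choices, not the cited method.
    corr = abs(pearson(expr_i, expr_j))
    return (1 - alpha) * corr + alpha * (1.0 if has_prior else 0.0)

# Tiny synthetic profiles: g1 and g2 co-vary, g3 is unrelated noise.
g1 = [0.1, 0.4, 0.9, 1.3, 1.8]
g2 = [0.2, 0.5, 1.0, 1.2, 1.9]
g3 = [1.0, 0.2, 0.8, 0.1, 0.5]

with_prior = edge_score(g1, g2, has_prior=True)     # correlated + PPI support
without_prior = edge_score(g1, g3, has_prior=False)  # uncorrelated, no prior
```

With this weighting, an edge supported by both expression and prior knowledge outranks one supported by neither, which is the intuition behind reducing false positives via integration.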
Abstract:
This thesis presents a creative and practical approach to dealing with the problem of selection bias. Selection bias may be the most vexing problem in program evaluation, or in any line of research that attempts to assert causality. Some of the greatest minds in economics and statistics have scrutinized the problem of selection bias, with the resulting approaches – Rubin's Potential Outcome Approach (Rosenbaum and Rubin, 1983; Rubin, 1991, 2001, 2004) or Heckman's Selection Model (Heckman, 1979) – being widely accepted and used as the best fixes. These solutions to the bias that arises in particular from self-selection are imperfect, and many researchers, when feasible, reserve their strongest causal inference for data from experimental rather than observational studies. The innovative aspect of this thesis is to propose a data transformation that allows measuring and testing the presence of selection bias in an automatic and multivariate way. The approach involves the construction of a multi-dimensional conditional space of the X matrix in which the bias associated with the treatment assignment has been eliminated. Specifically, we propose the use of a partial dependence analysis of the X-space as a tool for investigating the dependence relationship between a set of observable pre-treatment categorical covariates X and a treatment indicator variable T, in order to obtain a measure of bias according to their dependence structure. The measure of selection bias is then expressed in terms of the inertia due to the dependence between X and T that has been eliminated. Given the measure of selection bias, we propose a multivariate test of imbalance to check whether the detected bias is significant, by using the asymptotic distribution of the inertia due to T (Estadella et al., 2005) and by preserving the multivariate nature of the data.
Further, we propose the use of a clustering procedure as a tool to find groups of comparable units on which to estimate local causal effects, and the use of the multivariate test of imbalance as a stopping rule for choosing the best cluster solution. The method is non-parametric: it does not call for modeling the data based on some underlying theory or assumption about the selection process, but instead uses the existing variability within the data and lets the data speak. The idea of proposing this multivariate approach to measuring selection bias and testing balance comes from the consideration that, in applied research, all aspects of multivariate balance not represented in the univariate variable-by-variable summaries are ignored. The first part contains an introduction to evaluation methods as part of public and private decision processes and a review of the literature on evaluation methods. The attention is focused on Rubin's Potential Outcome Approach, matching methods, and briefly on Heckman's Selection Model. The second part focuses on some resulting limitations of conventional methods, with particular attention to the problem of how to test balance correctly. The third part contains the original contribution, a simulation study that checks the performance of the method for a given dependence setting, and an application to a real data set. Finally, we discuss, conclude and explain our future perspectives.
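The notion of measuring the dependence between X and T as inertia can be hinted at with a plain chi-square decomposition of a contingency table: in correspondence-analysis terms, total inertia is the chi-square statistic divided by n. This univariate sketch with invented data only gestures at the thesis's multivariate construction.

```python
from collections import Counter

def chi2_inertia(x, t):
    # Chi-square statistic and total inertia (chi2 / n) for one
    # categorical covariate x against a treatment indicator t.
    n = len(x)
    joint = Counter(zip(x, t))
    px, pt = Counter(x), Counter(t)
    chi2 = 0.0
    for (xv, tv), obs in joint.items():
        exp = px[xv] * pt[tv] / n
        chi2 += (obs - exp) ** 2 / exp
    # Cells absent from the joint table still contribute (0 - exp)^2 / exp.
    for xv in px:
        for tv in pt:
            if (xv, tv) not in joint:
                chi2 += px[xv] * pt[tv] / n
    return chi2, chi2 / n

# Invented data: in the biased assignment, category 'a' is mostly treated.
x_cov = ["a", "a", "a", "a", "b", "b", "b", "b"]
t_biased = [1, 1, 1, 0, 0, 0, 0, 1]
t_balanced = [1, 0, 1, 0, 1, 0, 1, 0]

chi_biased, inertia_biased = chi2_inertia(x_cov, t_biased)
chi_balanced, _ = chi2_inertia(x_cov, t_balanced)
```

Zero inertia corresponds to independence between X and T (no imbalance); larger inertia signals dependence that a balancing procedure would need to eliminate.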
Abstract:
Advances in biomedical signal acquisition systems for motion analysis have led to low-cost and ubiquitous wearable sensors which can be used to record movement data in different settings. This implies the potential availability of large amounts of quantitative data. It is then crucial to identify and extract the information of clinical relevance from the large amount of available data. This quantitative and objective information can be an important aid for clinical decision making. Data mining is the process of discovering such information in databases through data processing, selection of informative data, and identification of relevant patterns. The databases considered in this thesis store motion data from wearable sensors (specifically accelerometers) and clinical information (clinical data, scores, tests). The main goal of this thesis is to develop data mining tools which can provide quantitative information to the clinician in the field of movement disorders. This thesis focuses on motor impairment in Parkinson's disease (PD). Different databases related to Parkinson subjects at different stages of the disease were considered for this thesis. Each database is characterized by the data recorded during a specific motor task performed by different groups of subjects. The data mining techniques used in this thesis are feature selection (a technique used to find relevant information and to discard useless or redundant data), classification, clustering, and regression. The aims were to identify subjects at high risk for PD, characterize the differences between early PD subjects and healthy ones, characterize PD subtypes, and automatically assess the severity of symptoms in the home setting.
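A minimal sketch of filter-style feature selection of the kind mentioned above: rank features by a Fisher-like class-separation score and keep the top k. The criterion and the toy "accelerometer" features are illustrative assumptions, not the thesis's actual pipeline.

```python
def fisher_score(values, labels):
    # Filter criterion for one feature and binary labels:
    # between-class separation over within-class spread.
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    v0 = sum((v - m0) ** 2 for v in g0) / len(g0)
    v1 = sum((v - m1) ** 2 for v in g1) / len(g1)
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)

def select_top_k(features, labels, k):
    # Keep the k features with the highest separation score.
    ranked = sorted(features,
                    key=lambda name: fisher_score(features[name], labels),
                    reverse=True)
    return ranked[:k]

# Invented features: one separates healthy (0) from PD-like (1), one does not.
features = {
    "step_variance": [0.10, 0.20, 0.10, 0.90, 1.00, 0.80],
    "noise":         [0.50, 0.40, 0.60, 0.50, 0.60, 0.40],
}
labels = [0, 0, 0, 1, 1, 1]
```

Filter criteria like this one are cheap to compute and classifier-independent, which is why they are a common first step before classification, clustering or regression.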
Abstract:
Interaction between differentiating neurons and the extracellular environment guides the establishment of cell polarity during nervous system development. Developing neurons read the physical properties of the local substrate in a contact-dependent manner and retrieve essential guidance cues. In previous work we demonstrated that PC12 cell interaction with nanogratings (alternating lines of ridges and grooves of submicron size) promotes bipolarity and alignment to the substrate topography. Here, we investigate the role of focal adhesions, cell contractility, and actin dynamics in this process. Exploiting nanoimprint lithography techniques and a cyclic olefin copolymer, we engineered biocompatible nanostructured substrates designed for high-resolution live-cell microscopy. Our results reveal that neuronal polarization and contact guidance are based on a geometrical constraint of focal adhesions, resulting in an angular modulation of their maturation and persistence. We report on ROCK1/2-myosin-II pathway activity and demonstrate that ROCK-mediated contractility contributes to polarity selection during neuronal differentiation. Importantly, the selection process confined the generation of actin-supported membrane protrusions and the initiation of new neurites to the poles. Maintenance of the established polarity was independent of NGF stimulation. Altogether, our results imply that focal adhesions and cell contractility stably link the topographical configuration of the extracellular environment to a corresponding neuronal polarity state.
Abstract:
The selection of oviposition sites by syrphids and other aphidophagous insects is influenced by the presence of con- and heterospecific competitors. Chemical cues play a role in this selection process, some of them being volatile semiochemicals. Yet, little is known about the identity and specificity of the chemical signals that are involved in the searching behavior of these predators. In this study, we used olfactometer bioassays to explore the olfactory responses of gravid females and larvae of the syrphid Sphaerophoria rueppellii, focusing on volatiles from conspecific immature stages, as well as odors from immature stages of the competing coccinellid Adalia bipunctata. In addition, a multiple-choice oviposition experiment was conducted to study whether females respond differently when they can also sense their competitors through visual or tactile cues. Results showed that volatiles from plants and aphids did not affect the behavior of second-instars, whereas adult females strongly preferred odors from aphid colonies without competitors. Odors from conspecific immature stages had a repellent effect on S. rueppellii adult females, whereas their choices were not affected by volatiles coming from immature heterospecific A. bipunctata. The results imply that the syrphid uses odors to avoid sites that are already occupied by conspecifics. They did not avoid the odor of the heterospecific competitor, although in close vicinity they were found to avoid laying eggs on leaves that had traces of the coccinellid. Apparently, adult syrphids do not rely greatly on volatile semiochemicals to detect the coccinellid, but rather use other stimuli at close range (e.g., visual or non-volatile compounds) to avoid this competitor.
Abstract:
Before rural local government units were established in Thailand, reform debates within the country faced a crucial issue: candidates at the rural sub-district level might adopt electioneering methods such as vote buying and the patronage system of the local political and economic elite, the methods that had been used in national elections. In fact, the results of the 2006 survey in this paper, which followed the introduction of direct elections in rural local government units in 2003, contrast with the result anticipated during the debates on political reform. The preliminary data from the survey show that the decentralization process and the introduction of the direct election system in rural areas had some effect in changing the selection process of the local elite in Thailand.
Abstract:
The main purpose of a gene interaction network is to map the relationships among genes that would otherwise remain out of sight when a genomic study is tackled. DNA microarrays allow the measurement of the expression of thousands of genes at the same time. These data constitute the numeric seed for the induction of gene networks. In this paper, we propose a new approach to build gene networks by means of Bayesian classifiers, variable selection and bootstrap resampling. The interactions induced by the Bayesian classifiers are based both on the expression levels and on the phenotype information of the supervised variable. Feature selection and bootstrap resampling add reliability and robustness to the overall process, removing false positive findings. The consensus among all the induced models produces a hierarchy of dependences and, thus, of variables. Biologists can define the depth level of the model hierarchy, so the set of interactions and genes involved can vary from a sparse to a dense set. Experimental results show how these networks perform well on classification tasks. The biological validation matches previous biological findings and opens new hypotheses for future studies.
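The bootstrap-resampling consensus can be sketched as follows, with a plain correlation ranking standing in for the paper's Bayesian classifiers; the expression profiles, the 200 rounds and the 0.8 consensus threshold are all invented for illustration.

```python
import math
import random

def pearson(x, y):
    # Plain Pearson correlation between two profiles.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def bootstrap_consensus(expr, target, candidates, rounds=200, threshold=0.8, seed=0):
    # In each bootstrap resample, keep the candidate most correlated with
    # the target gene; retain edges selected in at least `threshold` of
    # the rounds. The consensus filters out unstable (false positive) edges.
    rng = random.Random(seed)
    n = len(expr[target])
    counts = {g: 0 for g in candidates}
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [expr[target][i] for i in idx]
        best = max(candidates,
                   key=lambda g: abs(pearson([expr[g][i] for i in idx], t)))
        counts[best] += 1
    return [g for g in candidates if counts[g] / rounds >= threshold]

# Invented profiles: g_strong is a linear function of the target,
# g_noise is unrelated.
expr = {
    "target":   [0.1, 0.5, 0.2, 0.9, 0.4, 0.7, 0.3, 0.8],
    "g_strong": [0.3, 1.1, 0.5, 1.9, 0.9, 1.5, 0.7, 1.7],
    "g_noise":  [0.6, 0.2, 0.8, 0.1, 0.9, 0.3, 0.5, 0.4],
}
edges = bootstrap_consensus(expr, "target", ["g_strong", "g_noise"])
```

Only edges that survive resampling reach the consensus network, which is the robustness argument made in the abstract.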
Abstract:
In recent years, international cooperation processes have become a key mechanism for companies to internationalise their innovative activities, particularly in the case of small businesses whose size reduces their possibilities of developing internationalisation strategies autonomously in the same way as larger companies. In Spain, the existence of two parallel programmes with similar structures oriented towards Europe (EUREKA) and Latin America (IBEROEKA) raises the question as to whether the fact that companies participate in only one (unipolar) or both (bipolar) of these programmes is the result of a selection process, which, in turn, results in the existence of different collectives with different efficiency parameters. The aim of this study is to provide a comparative analysis based on the final reports of Spanish companies that have participated in the EUREKA programme. Two groups of companies were compared: one comprising companies that have only had international experience in Europe (EUREKA); and another formed by companies that have also carried out IBEROEKA projects. The conclusions confirm that the behaviour of both groups of companies differs substantially and reveal the importance of geographical perspective in the analysis of international cooperation in technology. This disparate behaviour is a relevant aspect that must be taken into account when designing policies to promote international technological cooperation.
Abstract:
Using the Bayesian approach as the model selection criterion, the main purpose of this study is to establish a practical road accident model that can provide better interpretation and prediction performance. For this purpose we use a structural explanatory model with an autoregressive error term. The model estimation is carried out through Bayesian inference, and the best model is selected based on goodness-of-fit measures. To cross-validate the model estimation, a further prediction analysis was carried out. As the road safety measure, the number of fatal accidents in Spain during 2000-2011 was employed. The results of the variable selection process show that the factors explaining fatal road accidents are mainly exposure, economic factors, and surveillance and legislative measures. The model selection shows that the impact of economic factors on fatal accidents during the period under study was higher compared to surveillance and legislative measures.
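Bayesian model comparison by goodness of fit is commonly approximated with the Bayesian information criterion (lower is better). The formula below is the standard form for a Gaussian linear model; the residual sums of squares and parameter counts in the example are invented, and BIC is only a stand-in for the full Bayesian inference used in the study.

```python
import math

def bic(rss, n, k):
    # Bayesian information criterion for a Gaussian linear model:
    # n * ln(RSS / n) + k * ln(n). Lower values indicate a better
    # trade-off between fit and model complexity.
    return n * math.log(rss / n) + k * math.log(n)

# Invented example over n = 12 yearly observations: model A (3 parameters)
# vs a richer model B (6 parameters) that barely lowers the RSS.
bic_a = bic(rss=10.0, n=12, k=3)
bic_b = bic(rss=9.8, n=12, k=6)
```

Here the extra parameters of model B do not buy enough fit to offset the complexity penalty, so the criterion favors the simpler model.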
Abstract:
Nonlinear analysis tools for studying and characterizing the dynamics of physiological signals have gained popularity, mainly because tracking sudden alterations in the inherent complexity of biological processes might be an indicator of altered physiological states. Typically, in order to perform an analysis with such tools, the physiological variables that describe the biological process under study are used to reconstruct the underlying dynamics of the process. For that goal, a procedure called time-delay or uniform embedding is usually employed. Nonetheless, there is evidence of its inability to deal with non-stationary signals, such as those recorded from many physiological processes. To handle this drawback, this paper evaluates the utility of non-conventional time series reconstruction procedures based on non-uniform embedding, applying them to automatic pattern recognition tasks. The paper compares a state-of-the-art non-uniform approach with a novel scheme that fuses embedding and feature selection at once, searching for better reconstructions of the dynamics of the system. Moreover, results are also compared with two classic uniform embedding techniques. Thus, the goal is to compare uniform and non-uniform reconstruction techniques, including the one proposed in this work, for pattern recognition in biomedical signal processing tasks. Once the state space is reconstructed, the scheme characterizes it with three classic nonlinear dynamic features (Largest Lyapunov Exponent, Correlation Dimension and Recurrence Period Density Entropy), while classification is carried out by means of a simple k-nn classifier. In order to test its generalization capabilities, the approach was tested with three different physiological databases (Speech Pathologies, Epilepsy and Heart Murmurs).
In terms of the accuracy obtained in automatically detecting the presence of pathologies, and for the three types of biosignals analyzed, the non-uniform techniques used in this work slightly outperformed the uniform methods, suggesting their usefulness for characterizing non-stationary biomedical signals in pattern recognition applications. Moreover, in view of the results obtained and its low computational load, the proposed technique appears applicable to the applications under study.
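The difference between the two families of reconstruction procedures can be sketched directly: uniform embedding uses a single delay for every coordinate, while the non-uniform variant allows a distinct delay per coordinate (chosen in the paper by a feature-selection criterion; hand-picked here purely for illustration).

```python
def uniform_embedding(signal, dim, delay):
    # Classic time-delay embedding: state vectors
    # [x(t), x(t+tau), ..., x(t+(dim-1)*tau)] with a single delay tau.
    n = len(signal) - (dim - 1) * delay
    return [[signal[i + k * delay] for k in range(dim)] for i in range(n)]

def nonuniform_embedding(signal, delays):
    # Non-uniform variant: one delay per coordinate. In the paper the
    # delays are selected by a feature-selection criterion; here they
    # are fixed by hand.
    horizon = max(delays)
    n = len(signal) - horizon
    return [[signal[i + d] for d in delays] for i in range(n)]

signal = list(range(10))  # toy stand-in for a biomedical signal
uni = uniform_embedding(signal, dim=3, delay=2)
non = nonuniform_embedding(signal, delays=[0, 1, 5])
```

The state vectors produced by either routine are what the nonlinear features (Lyapunov exponent, correlation dimension, RPDE) and the k-nn classifier would then operate on.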
Abstract:
Piotr Omenzetter and Simon Hoell’s work within the Lloyd’s Register Foundation Centre for Safety and Reliability Engineering at the University of Aberdeen is supported by Lloyd’s Register Foundation. The Foundation helps to protect life and property by supporting engineering-related education, public engagement and the application of research.
Abstract:
The rapid evolution of hardware demands a continuous evolution of compilers. A tuning process must be carried out by compiler designers to ensure that the code generated by the compiler maintains a given level of quality, whether in terms of processing time or of another predefined characteristic. This work aimed to automate the compiler tuning process by means of machine learning techniques. As a result, the compilation plans obtained using machine learning with the proposed features produced code whose execution times approached those obtained with the default plan used by LLVM.
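A hypothetical sketch of learned plan selection: predict a new program's runtime under each candidate compilation plan from the most feature-similar previously measured program (1-nearest-neighbour), and pick the plan with the lowest prediction. The program features, plan names and timings are all invented, and 1-NN merely stands in for the learning techniques used in the work.

```python
def choose_plan(new_feats, history):
    # history entries: (program_features, plan_name, measured_runtime).
    # For each plan, predict the new program's runtime as that of the
    # most similar program compiled under the plan, then pick the plan
    # with the lowest predicted runtime.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best_plan, best_rt = None, float("inf")
    for plan in sorted({p for _, p, _ in history}):
        nearest = min((h for h in history if h[1] == plan),
                      key=lambda h: dist(new_feats, h[0]))
        if nearest[2] < best_rt:
            best_plan, best_rt = plan, nearest[2]
    return best_plan

# Invented features [loop_count, memory_intensity] and timings.
history = [
    ([8, 2], "O2-vectorize", 1.1),
    ([8, 3], "O2",           1.9),
    ([1, 9], "O2-vectorize", 3.0),
    ([1, 8], "O2",           2.1),
]
```

The point of learning here is that different program profiles favour different plans, so a single default plan cannot be optimal for all of them.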
Abstract:
Hypertrophic cardiomyopathy (HCM) is a cardiovascular disease in which the heart muscle is partially thickened and blood flow is - potentially fatally - obstructed. It is one of the leading causes of sudden cardiac death in young people. Electrocardiography (ECG) and echocardiography (Echo) are the standard tests for identifying HCM and other cardiac abnormalities. The American Heart Association has recommended using a pre-participation questionnaire for young athletes instead of ECG or Echo tests, due to considerations of the cost and time involved in interpreting the results of these tests by an expert cardiologist. Initially, we set out to develop a classifier for automated prediction of young athletes' heart conditions based on the answers to the questionnaire. Classification results and further in-depth analysis using computational and statistical methods indicated significant shortcomings of the questionnaire in predicting cardiac abnormalities. Automated methods for analyzing ECG signals can help reduce cost and save time in the pre-participation screening process by detecting HCM and other cardiac abnormalities. Therefore, the main goal of this dissertation work is to identify HCM through computational analysis of 12-lead ECG. ECG signals recorded on one or two leads have been analyzed in the past for classifying individual heartbeats into different types of arrhythmia, as annotated primarily in the MIT-BIH database. In contrast, we classify complete sequences of 12-lead ECGs to assign patients into two groups: HCM vs. non-HCM. The challenges and issues we address include missing ECG waves in one or more leads and the dimensionality of a large feature set. We address these by proposing imputation and feature-selection methods. We develop heartbeat classifiers by employing Random Forests and Support Vector Machines, and propose a method to classify full 12-lead ECGs based on the proportion of heartbeats classified as HCM.
The results from our experiments show that the classifiers developed using our methods perform well in identifying HCM. Thus, the two contributions of this thesis are the utilization of computational and statistical methods for discovering shortcomings in a current screening procedure, and the development of methods to identify HCM through computational analysis of 12-lead ECG signals.
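The proportion rule for aggregating per-beat predictions into a recording-level decision can be stated in a few lines; the 0.5 cut-off below is an assumed value, not one reported in the dissertation.

```python
def classify_recording(beat_predictions, threshold=0.5):
    # Recording-level decision from per-beat predictions (1 = HCM-like):
    # label the whole 12-lead ECG by the proportion of beats classified
    # as HCM. The 0.5 threshold is an illustrative assumption.
    proportion = sum(beat_predictions) / len(beat_predictions)
    label = "HCM" if proportion >= threshold else "non-HCM"
    return label, proportion
```

Aggregating over many beats makes the recording-level call robust to occasional per-beat misclassifications by the Random Forest or SVM.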
Abstract:
Understanding and predicting the distribution of organisms in heterogeneous environments lies at the heart of ecology, and the theory of density-dependent habitat selection (DDHS) provides ecologists with an inferential framework linking evolution and population dynamics. Current theory does not allow for temporal variation in habitat quality, a serious limitation when confronted with real ecological systems. We develop a stochastic equivalent of the ideal free distribution, to study how spatial patterns of habitat use depend on the magnitude and spatial correlation of environmental stochasticity, as well as a stochastic habitat selection rule. The emerging patterns are confronted with deterministic predictions based on isodar analysis, an established empirical approach to the analysis of habitat selection patterns. Our simulations highlight some consistent patterns of habitat use, indicating that it is possible to make inferences about the habitat selection process based on observed patterns of habitat use. However, isodar analysis gives results that are contingent on the magnitude and spatial correlation of environmental stochasticity. Hence, DDHS is better revealed by a measure of habitat selectivity than by empirical isodars. The detection of DDHS is but a small component of isodar theory, which remains an important conceptual framework for linking evolutionary strategies in behavior and population dynamics.
Abstract:
Document classification is a supervised machine learning process in which predefined category labels are assigned to documents based on a hypothesis derived from a training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, pp. 154-161] pointed out that the effectiveness of a document classification system may vary across domains. This implies that the quality of the document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on model candidates. However, this kind of evaluation method may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of the classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still competent to provide satisfactory effectiveness with a small feature subset. Our experiments demonstrate how the fitness of models is assessed. The results of our work contribute to research on feature selection, dimensionality reduction and document classification.
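The idea of judging a feature subset by whether it alone separates positive and negative instances can be sketched with a nearest-centroid check; this crude stand-in and its toy term-count documents are assumptions, not the fitness measure proposed in the paper.

```python
def subset_fitness(docs, labels, subset):
    # Score a candidate feature subset by nearest-centroid accuracy on the
    # training documents: a subset that cannot separate the classes even
    # here is unlikely to support an effective classifier.
    def project(doc):
        return [doc.get(term, 0) for term in subset]
    centroids = {}
    for y in sorted(set(labels)):
        rows = [project(d) for d, l in zip(docs, labels) if l == y]
        centroids[y] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for d, y in zip(docs, labels):
        v = project(d)
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        correct += pred == y
    return correct / len(docs)

# Toy term-count documents: "spam" marks the positive class,
# "the" carries no class signal.
docs = [
    {"spam": 3, "the": 2},
    {"spam": 2, "the": 1},
    {"ham": 3, "the": 2},
    {"ham": 1, "the": 3},
]
labels = ["pos", "pos", "neg", "neg"]
```

A discriminative one-term subset scores perfectly here while an uninformative one does not, which is the kind of contrast a fitness measure for small feature subsets has to expose.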