921 resultados para tree-based
Resumo:
The identification of genes essential for survival is important for the understanding of the minimal requirements for cellular life and for drug design. As experimental studies with the purpose of building a catalog of essential genes for a given organism are time-consuming and laborious, a computational approach which could predict gene essentiality with high accuracy would be of great value. We present here a novel computational approach, called NTPGE (Network Topology-based Prediction of Gene Essentiality), that relies on the network topology features of a gene to estimate its essentiality. The first step of NTPGE is to construct the integrated molecular network for a given organism comprising protein physical, metabolic and transcriptional regulation interactions. The second step consists in training a decision-tree-based machine-learning algorithm on known essential and non-essential genes of the organism of interest, considering as learning attributes the network topology information for each of these genes. Finally, the decision-tree classifier generated is applied to the set of genes of this organism to estimate essentiality for each gene. We applied the NTPGE approach for discovering the essential genes in Escherichia coli and then assessed its performance. (C) 2007 Elsevier B.V. All rights reserved.
Resumo:
Background: The genome-wide identification of both morbid genes, i.e., those genes whose mutations cause hereditary human diseases, and druggable genes, i.e., genes coding for proteins whose modulation by small molecules elicits phenotypic effects, requires experimental approaches that are time-consuming and laborious. Thus, a computational approach which could accurately predict such genes on a genome-wide scale would be invaluable for accelerating the pace of discovery of causal relationships between genes and diseases as well as the determination of druggability of gene products.Results: In this paper we propose a machine learning-based computational approach to predict morbid and druggable genes on a genome-wide scale. For this purpose, we constructed a decision tree-based meta-classifier and trained it on datasets containing, for each morbid and druggable gene, network topological features, tissue expression profile and subcellular localization data as learning attributes. This meta-classifier correctly recovered 65% of known morbid genes with a precision of 66% and correctly recovered 78% of known druggable genes with a precision of 75%. It was than used to assign morbidity and druggability scores to genes not known to be morbid and druggable and we showed a good match between these scores and literature data. Finally, we generated decision trees by training the J48 algorithm on the morbidity and druggability datasets to discover cellular rules for morbidity and druggability and, among the rules, we found that the number of regulating transcription factors and plasma membrane localization are the most important factors to morbidity and druggability, respectively.Conclusions: We were able to demonstrate that network topological features along with tissue expression profile and subcellular localization can reliably predict human morbid and druggable genes on a genome-wide scale. Moreover, by constructing decision trees based on these data, we could discover cellular rules governing morbidity and druggability.
Resumo:
The security of the two party Diffie-Hellman key exchange protocol is currently based on the discrete logarithm problem (DLP). However, it can also be built upon the elliptic curve discrete logarithm problem (ECDLP). Most proposed secure group communication schemes employ the DLP-based Diffie-Hellman protocol. This paper proposes the ECDLP-based Diffie-Hellman protocols for secure group communication and evaluates their performance on wireless ad hoc networks. The proposed schemes are compared at the same security level with DLP-based group protocols under different channel conditions. Our experiments and analysis show that the Tree-based Group Elliptic Curve Diffie-Hellman (TGECDH) protocol is the best in overall performance for secure group communication among the four schemes discussed in the paper. Low communication overhead, relatively low computation load and short packets are the main reasons for the good performance of the TGECDH protocol.
Resumo:
Fuzzy community detection is to identify fuzzy communities in a network, which are groups of vertices in the network such that the membership of a vertex in one community is in [0,1] and that the sum of memberships of vertices in all communities equals to 1. Fuzzy communities are pervasive in social networks, but only a few works have been done for fuzzy community detection. Recently, a one-step forward extension of Newman’s Modularity, the most popular quality function for disjoint community detection, results into the Generalized Modularity (GM) that demonstrates good performance in finding well-known fuzzy communities. Thus, GMis chosen as the quality function in our research. We first propose a generalized fuzzy t-norm modularity to investigate the effect of different fuzzy intersection operators on fuzzy community detection, since the introduction of a fuzzy intersection operation is made feasible by GM. The experimental results show that the Yager operator with a proper parameter value performs better than the product operator in revealing community structure. Then, we focus on how to find optimal fuzzy communities in a network by directly maximizing GM, which we call it Fuzzy Modularity Maximization (FMM) problem. The effort on FMM problem results into the major contribution of this thesis, an efficient and effective GM-based fuzzy community detection method that could automatically discover a fuzzy partition of a network when it is appropriate, which is much better than fuzzy partitions found by existing fuzzy community detection methods, and a crisp partition of a network when appropriate, which is competitive with partitions resulted from the best disjoint community detections up to now. We address FMM problem by iteratively solving a sub-problem called One-Step Modularity Maximization (OSMM). We present two approaches for solving this iterative procedure: a tree-based global optimizer called Find Best Leaf Node (FBLN) and a heuristic-based local optimizer. The OSMM problem is based on a simplified quadratic knapsack problem that can be solved in linear time; thus, a solution of OSMM can be found in linear time. Since the OSMM algorithm is called within FBLN recursively and the structure of the search tree is non-deterministic, we can see that the FMM/FBLN algorithm runs in a time complexity of at least O (n2). So, we also propose several highly efficient and very effective heuristic algorithms namely FMM/H algorithms. We compared our proposed FMM/H algorithms with two state-of-the-art community detection methods, modified MULTICUT Spectral Fuzzy c-Means (MSFCM) and Genetic Algorithm with a Local Search strategy (GALS), on 10 real-world data sets. The experimental results suggest that the H2 variant of FMM/H is the best performing version. The H2 algorithm is very competitive with GALS in producing maximum modularity partitions and performs much better than MSFCM. On all the 10 data sets, H2 is also 2-3 orders of magnitude faster than GALS. Furthermore, by adopting a simply modified version of the H2 algorithm as a mutation operator, we designed a genetic algorithm for fuzzy community detection, namely GAFCD, where elite selection and early termination are applied. The crossover operator is designed to make GAFCD converge fast and to enhance GAFCD’s ability of jumping out of local minimums. Experimental results on all the data sets show that GAFCD uncovers better community structure than GALS.
Resumo:
Effective automatic summarization usually requires simulating human reasoning such as abstraction or relevance reasoning. In this paper we describe a solution for this type of reasoning in the particular case of surveillance of the behavior of a dynamic system using sensor data. The paper first presents the approach describing the required type of knowledge with a possible representation. This includes knowledge about the system structure, behavior, interpretation and saliency. Then, the paper shows the inference algorithm to produce a summarization tree based on the exploitation of the physical characteristics of the system. The paper illustrates how the method is used in the context of automatic generation of summaries of behavior in an application for basin surveillance in the presence of river floods.
Resumo:
Electrocardiography (ECG) has been recently proposed as biometric trait for identification purposes. Intra-individual variations of ECG might affect identification performance. These variations are mainly due to Heart Rate Variability (HRV). In particular, HRV causes changes in the QT intervals along the ECG waveforms. This work is aimed at analysing the influence of seven QT interval correction methods (based on population models) on the performance of ECG-fiducial-based identification systems. In addition, we have also considered the influence of training set size, classifier, classifier ensemble as well as the number of consecutive heartbeats in a majority voting scheme. The ECG signals used in this study were collected from thirty-nine subjects within the Physionet open access database. Public domain software was used for fiducial points detection. Results suggested that QT correction is indeed required to improve the performance. However, there is no clear choice among the seven explored approaches for QT correction (identification rate between 0.97 and 0.99). MultiLayer Perceptron and Support Vector Machine seemed to have better generalization capabilities, in terms of classification performance, with respect to Decision Tree-based classifiers. No such strong influence of the training-set size and the number of consecutive heartbeats has been observed on the majority voting scheme.
Resumo:
The economiser is a critical component for efficient operation of coal-fired power stations. It consists of a large system of water-filled tubes which extract heat from the exhaust gases. When it fails, usually due to erosion causing a leak, the entire power station must be shut down to effect repairs. Not only are such repairs highly expensive, but the overall repair costs are significantly affected by fluctuations in electricity market prices, due to revenue lost during the outage. As a result, decisions about when to repair an economiser can alter the repair costs by millions of dollars. Therefore, economiser repair decisions are critical and must be optimised. However, making optimal repair decisions is difficult because economiser leaks are a type of interactive failure. If left unfixed, a leak in a tube can cause additional leaks in adjacent tubes which will need more time to repair. In addition, when choosing repair times, one also needs to consider a number of other uncertain inputs such as future electricity market prices and demands. Although many different decision models and methodologies have been developed, an effective decision-making method specifically for economiser repairs has yet to be defined. In this paper, we describe a Decision Tree based method to meet this need. An industrial case study is presented to demonstrate the application of our method.
Resumo:
Genetic variation is the resource animal breeders exploit in stock improvement programs. Both the process of selection and husbandry practices employed in aquaculture will erode genetic variation levels overtime, hence the critical resource can be lost and this may compromise future genetic gains in breeding programs. The amount of genetic variation in five lines of Sydney Rock Oyster (SRO) that had been selected for QX (Queensland unknown) disease resistance were examined and compared with that in a wild reference population using seven specific SRO microsatellite loci. The five selected lines had significantly lower levels of genetic diversity than did the wild reference population with allelic diversity declining approximately 80%, but impacts on heterozygosity per locus were less severe. Significant deficiencies in heterozygotes were detected at six of the seven loci in both mass selected lines and the wild reference population. Against this trend however, a significant excess of heterozygotes was recorded at three loci Sgo9, Sgo14 and Sgo21 in three QX disease resistant lines (#2, #5 and #13). All populations were significantly genetic differentiated from each other based on pairwise FST values. A neighbour joining tree based on DA genetic distances showed a clear separation between all culture and wild populations. Results of this study show clearly, that the impacts of the stock improvement program for SRO has significantly eroded natural levels of genetic variation in the culture lines. This could compromise long-term genetic gains and affect sustainability of the SRO breeding program over the long-term.
Resumo:
A novel m-ary tree based approach is presented to solve asset management decisions which are combinatorial in nature. The approach introduces a new dynamic constraint based control mechanism which is capable of excluding infeasible solutions from the solution space. The approach also provides a solution to the challenges with ordering of assets decisions.
Resumo:
The benefits of applying tree-based methods to the purpose of modelling financial assets as opposed to linear factor analysis are increasingly being understood by market practitioners. Tree-based models such as CART (classification and regression trees) are particularly well suited to analysing stock market data which is noisy and often contains non-linear relationships and high-order interactions. CART was originally developed in the 1980s by medical researchers disheartened by the stringent assumptions applied by traditional regression analysis (Brieman et al. [1984]). In the intervening years, CART has been successfully applied to many areas of finance such as the classification of financial distress of firms (see Frydman, Altman and Kao [1985]), asset allocation (see Sorensen, Mezrich and Miller [1996]), equity style timing (see Kao and Shumaker [1999]) and stock selection (see Sorensen, Miller and Ooi [2000])...
Resumo:
With increasing interest shown by Universities in workplace learning, especially in STEM disciplines, an issue has arisen amongst educators and industry partners regarding authentic assessment tasks for work integrated learning (WIL) subjects. This paper describes the use of a matrix, which is also available as a decision-tree, based on the features of the WIL experience, in order to facilitate the selection of appropriate assessment strategies. The matrix divides the WIL experiences into seven categories, based on such factors as: the extent to which the experience is compulsory, required for membership of a professional body or elective; whether the student is undertaking a project, or embedding in a professional culture; and other key aspects of the WIL experience. One important variable is linked to the fundamental purpose of the assessment. This question revolves around the focus of the assessment: whether on the person (student development); the process (professional conduct/language); or the product (project, assignment, literature review, report, software). The matrix has been trialed at QUT in the Faculty of Science and Technology, and also at the University of Surrey, UK, and has proven to have good applicability in both universities.
Resumo:
We examine the security of the 64-bit lightweight block cipher PRESENT-80 against related-key differential attacks. With a computer search we are able to prove that for any related-key differential characteristic on full-round PRESENT-80, the probability of the characteristic only in the 64-bit state is not higher than 2−64. To overcome the exponential (in the state and key sizes) computational complexity of the search we use truncated differences, however as the key schedule is not nibble oriented, we switch to actual differences and apply early abort techniques to prune the tree-based search. With a new method called extended split approach we are able to make the whole search feasible and we implement and run it in real time. Our approach targets the PRESENT-80 cipher however,with small modifications can be reused for other lightweight ciphers as well.
Resumo:
Primary biliary cirrhosis (PBC) and autoimmune cholangitis (AIC) are serologic expressions of an autoimmune liver disease affecting biliary ductular cells. Previously we screened a phage-displayed random peptide library with polyclonal IgG from 2 Australian patients with PBC and derived peptides that identified a single conformational (discontinuous) epitope in the inner lipoyl domain of the E2 subunit of the pyruvate dehydrogenase complex (PDC-E2), the characteristic autoantigen in PBC. Here we have used phage display to investigate the reactivity of PBC sera from 2 ethnically and geographically distinct populations, Japanese and Australian, and the 2 serologic expressions, PBC and AIC. Random 7-mer and 12-mer peptide libraries were biopanned with IgG from 3 Japanese patients with PBC and 3 with AIC who did not have anti-PDC-E2. The phage clones (phagotopes) obtained were tested by capture enzyme-linked immunosorbent assay (ELISA) for reactivity with affinity-purified anti-PDC-E2, and compared with those obtained from Australian patients with PBC. Peptide sequences of the derived phagotopes and sequences derived by biopanning with irrelevant antisera were aligned to develop a guide tree based on physicochemical similarity. Both Australian and Japanese PBC-derived phagotopes were distributed in branches of the guide tree that contained the peptide sequences MH and FV previously identified as part of an immunodominant conformational epitope of PDC-E2, indicating that epitope selection was not influenced by the racial origin of the PBC sera. Biopanning with either PBC or AIC-derived IgG yielded phagotopes that reacted with anti-PDC-E2 by capture ELISA, further establishing that there is a similar autoimmune targeting in PBC and AIC.
Resumo:
This thesis studies human gene expression space using high throughput gene expression data from DNA microarrays. In molecular biology, high throughput techniques allow numerical measurements of expression of tens of thousands of genes simultaneously. In a single study, this data is traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, this data has been largely unavailable and the global structure of human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting the previously unusable and missing data and by improving the access to its data. It also contributed to creation of several new tools for microarray data manipulation and establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required creation of a new large ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text mining and decision tree based method for automatic conversion of human readable free text microarray data annotations into categorised format. The data comparability and minimisation of the systematic measurement errors that are characteristic to each lab- oratory in this large cross-laboratories integrated dataset, was ensured by computation of a range of microarray data quality metrics and exclusion of incomparable data. The structure of a global map of human gene expression was then explored by principal component analysis and hierarchical clustering using heuristics and help from another purpose built sample ontology. A preface and motivation to the construction and analysis of a global map of human gene expression is given by analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporate indirect comparison of statistical methods for finding differentially expressed genes and point to the need to study gene expression on a global level.
Resumo:
This PhD research has proposed new machine learning techniques to improve human action recognition based on local features. Several novel video representation and classification techniques have been proposed to increase the performance with lower computational complexity. The major contributions are the construction of new feature representation techniques, based on advanced machine learning techniques such as multiple instance dictionary learning, Latent Dirichlet Allocation (LDA) and Sparse coding. A Binary-tree based classification technique was also proposed to deal with large amounts of action categories. These techniques are not only improving the classification accuracy with constrained computational resources but are also robust to challenging environmental conditions. These developed techniques can be easily extended to a wide range of video applications to provide near real-time performance.