5 resultados para Feature taxonomy

em Duke University


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on the random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis.Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

BACKGROUND: A hierarchical taxonomy of organisms is a prerequisite for semantic integration of biodiversity data. Ideally, there would be a single, expansive, authoritative taxonomy that includes extinct and extant taxa, information on synonyms and common names, and monophyletic supraspecific taxa that reflect our current understanding of phylogenetic relationships. DESCRIPTION: As a step towards development of such a resource, and to enable large-scale integration of phenotypic data across vertebrates, we created the Vertebrate Taxonomy Ontology (VTO), a semantically defined taxonomic resource derived from the integration of existing taxonomic compilations, and freely distributed under a Creative Commons Zero (CC0) public domain waiver. The VTO includes both extant and extinct vertebrates and currently contains 106,947 taxonomic terms, 22 taxonomic ranks, 104,736 synonyms, and 162,400 cross-references to other taxonomic resources. Key challenges in constructing the VTO included (1) extracting and merging names, synonyms, and identifiers from heterogeneous sources; (2) structuring hierarchies of terms based on evolutionary relationships and the principle of monophyly; and (3) automating this process as much as possible to accommodate updates in source taxonomies. CONCLUSIONS: The VTO is the primary source of taxonomic information used by the Phenoscape Knowledgebase (http://phenoscape.org/), which integrates genetic and evolutionary phenotype data across both model and non-model vertebrates. The VTO is useful for inferring phenotypic changes on the vertebrate tree of life, which enables queries for candidate genes for various episodes in vertebrate evolution.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A juvenile cranium of Homunculus patagonicus Ameghino, 1891a from the late Early Miocene of Santa Cruz Province (Argentina) provides the first evidence of developing cranial anatomy for any fossil platyrrhine. The specimen preserves the rostral part of the cranium with deciduous and permanent alveoli and teeth. The dental eruption sequence in the new specimen and a reassessment of eruption patterns in living and fossil platyrrhines suggest that the ancestral platyrrhine pattern of tooth replacement was for the permanent incisors to erupt before M(1), not an accelerated molar eruption (before the incisors) as recently proposed. Two genera and species of Santacrucian monkeys are now generally recognized: H. patagonicus Ameghino, 1891a and Killikaike blakei Tejedor et al., 2006. Taxonomic allocation of Santacrucian monkeys to these species encounters two obstacles: 1) the (now lost) holotype and a recently proposed neotype of H. patagonicus are mandibles from different localities and different geologic members of the Santa Cruz Formation, separated by approximately 0.7 million years, whereas the holotype of K. blakei is a rostral part of a cranium without a mandible; 2) no Santacrucian monkey with associated cranium and mandible has ever been found. Bearing in mind these uncertainties, our examination of the new specimen as well as other cranial specimens of Santacrucian monkeys establishes the overall dental and cranial similarity between the holotype of Killikaike blakei, adult cranial material previously referred to H. patagonicus, and the new juvenile specimen. This leads us to conclude that Killikaike blakei is a junior subjective synonym of H. patagonicus.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.

While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.

For datasets with both large sample sizes and high dimensionality, I propose a new "divided-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithm. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partition the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthezied via the {\em DECO} and {\em message} algorithm in a reverse order to produce the final output. The whole framework is extremely scalable.