5 resultados para Feature vectors
em Duke University
Resumo:
A tree-based dictionary learning model is developed for joint analysis of imagery and associated text. The dictionary learning may be applied directly to the imagery from patches, or to general feature vectors extracted from patches or superpixels (using any existing method for image feature extraction). Each image is associated with a path through the tree (from root to a leaf), and each of the multiple patches in a given image is associated with one node in that path. Nodes near the tree root are shared between multiple paths, representing image characteristics that are common among different types of images. Moving toward the leaves, nodes become specialized, representing details in image classes. If available, words (text) are also jointly modeled, with a path-dependent probability over words. The tree structure is inferred via a nested Dirichlet process, and a retrospective stick-breaking sampler is used to infer the tree depth and width.
Resumo:
BACKGROUND: MicroRNAs (miRNAs) are small non-coding RNAs that post-transcriptionally regulate gene expression in a variety of organisms, including insects, vertebrates, and plants. miRNAs play important roles in cell development and differentiation as well as in the cellular response to stress and infection. To date, there are limited reports of miRNA identification in mosquitoes, insects that act as essential vectors for the transmission of many human pathogens, including flaviviruses. West Nile virus (WNV) and dengue virus, members of the Flaviviridae family, are primarily transmitted by Aedes and Culex mosquitoes. Using high-throughput deep sequencing, we examined the miRNA repertoire in Ae. albopictus cells and Cx. quinquefasciatus mosquitoes. RESULTS: We identified a total of 65 miRNAs in the Ae. albopictus C7/10 cell line and 77 miRNAs in Cx. quinquefasciatus mosquitoes, the majority of which are conserved in other insects such as Drosophila melanogaster and Anopheles gambiae. The most highly expressed miRNA in both mosquito species was miR-184, a miRNA conserved from insects to vertebrates. Several previously reported Anopheles miRNAs, including miR-1890 and miR-1891, were also found in Culex and Aedes, and appear to be restricted to mosquitoes. We identified seven novel miRNAs, arising from nine different precursors, in C7/10 cells and Cx. quinquefasciatus mosquitoes, two of which have predicted orthologs in An. gambiae. Several of these novel miRNAs reside within a ~350 nt long cluster present in both Aedes and Culex. miRNA expression was confirmed by primer extension analysis. To determine whether flavivirus infection affects miRNA expression, we infected female Culex mosquitoes with WNV. Two miRNAs, miR-92 and miR-989, showed significant changes in expression levels following WNV infection. CONCLUSIONS: Aedes and Culex mosquitoes are important flavivirus vectors. Recent advances in both mosquito genomics and high-throughput sequencing technologies enabled us to interrogate the miRNA profile in these two species. Here, we provide evidence for over 60 conserved and seven novel mosquito miRNAs, expanding upon our current understanding of insect miRNAs. Undoubtedly, some of the miRNAs identified will have roles not only in mosquito development, but also in mediating viral infection in the mosquito host.
Resumo:
Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on the random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis.Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.
Resumo:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divided-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithm. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partition the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthezied via the {\em DECO} and {\em message} algorithm in a reverse order to produce the final output. The whole framework is extremely scalable.