990 resultados para Co-clustering
Resumo:
We present a new co-clustering problem of images and visual features. The problem involves a set of non-object images in addition to a set of object images and features to be co-clustered. Co-clustering is performed in a way that maximises discrimination of object images from non-object images, thus emphasizing discriminative features. This provides a way of obtaining perceptual joint-clusters of object images and features. We tackle the problem by simultaneously boosting multiple strong classifiers which compete for images by their expertise. Each boosting classifier is an aggregation of weak-learners, i.e. simple visual features. The obtained classifiers are useful for object detection tasks which exhibit multimodalities, e.g. multi-category and multi-view object detection tasks. Experiments on a set of pedestrian images and a face data set demonstrate that the method yields intuitive image clusters with associated features and is much superior to conventional boosting classifiers in object detection tasks.
Resumo:
This project is a step forward in the study of text mining where enhanced text representation with semantic information plays a significant role. It develops effective methods of entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several start-of-art benchmarking methods on real-life data-sets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles clustering documents with multiple feature spaces.
Resumo:
High-Order Co-Clustering (HOCC) methods have attracted high attention in recent years because of their ability to cluster multiple types of objects simultaneously using all available information. During the clustering process, HOCC methods exploit object co-occurrence information, i.e., inter-type relationships amongst different types of objects as well as object affinity information, i.e., intra-type relationships amongst the same types of objects. However, it is difficult to learn accurate intra-type relationships in the presence of noise and outliers. Existing HOCC methods consider the p nearest neighbours based on Euclidean distance for the intra-type relationships, which leads to incomplete and inaccurate intra-type relationships. In this paper, we propose a novel HOCC method that incorporates multiple subspace learning with a heterogeneous manifold ensemble to learn complete and accurate intra-type relationships. Multiple subspace learning reconstructs the similarity between any pair of objects that belong to the same subspace. The heterogeneous manifold ensemble is created based on two-types of intra-type relationships learnt using p-nearest-neighbour graph and multiple subspaces learning. Moreover, in order to make sure the robustness of clustering process, we introduce a sparse error matrix into matrix decomposition and develop a novel iterative algorithm. Empirical experiments show that the proposed method achieves improved results over the state-of-art HOCC methods for FScore and NMI.
Resumo:
Major histocompatibility complex (MHC) class II molecules displayed clustered patterns at the surfaces of T (HUT-102B2) and B (JY) lymphoma cells characterized by interreceptor distances in the micrometer range as detected by scanning force microscopy of immunogold-labeled antigens. Electron microscopy revealed that a fraction of the MHC class II molecules was also heteroclustered with MHC class I antigens at the same hierarchical level as described by the scanning force microscopy data, after specifically and sequentially labeling the antigens with 30- and 15-nm immunogold beads. On JY cells the estimated fraction of co-clustered HLA II was 0.61, whereas that of the HLA I was 0.24. Clusterization of the antigens was detected by the deviation of their spatial distribution from the Poissonian distribution representing the random case. Fluorescence resonance energy transfer measurements also confirmed partial co-clustering of the HLA class I and II molecules at another hierarchical level characterized by the 2- to 10-nm Förster distance range and providing fine details of the molecular organization of receptors. The larger-scale topological organization of the MHC class I and II antigens may reflect underlying membrane lipid domains and may fulfill significant functions in cell-to-cell contacts and signal transduction.
Resumo:
Upon overexpression of integrin αvβ3 and its engagement by vitronectin, we previously showed enhanced adhesion, proliferation, and motility of human ovarian cancer cells. By studying differential expression of genes possibly related to these tumor biological events, we identified the epidermal growth-factor receptor (EGF-R) to be under control of αvβ3 expression levels. Thus in the present study we characterized αvβ3-dependent changes of EGF-R and found significant upregulation of its expression and activity which was reflected by prominent changes of EGF-R promoter activity. Upon disruption of DNA-binding motifs for the transcription factors p53, ETF, the repressor ETR, p50, and c-rel, respectively, we sought to identify DNA elements contributing to αvβ3-mediated EGF-R promoter induction. Both, the p53- and ETF-mutant, while exhibiting considerably lower EGF-R promoter activity than the wild type promoter, retained inducibility by αvβ3. Mutation of the repressor motif ETR, as expected, enhanced EGF-R promoter activity with a further moderate increase upon αvβ3 elevation. The p50-mutant displayed EGF-R promoter activity almost comparable to that of the wild type promoter with no impairment of induction by αvβ3. However, the activity of an EGF-R promoter mutant displaying a disrupted c-rel-binding motif did not only prominently decline, but, moreover, was not longer responsive to enhanced αvβ3, involving this DNA element in αvβ3-dependent EGF-R upregulation. Moreover, αvβ3 did not only increase the EGF-R but, moreover, also led to obvious co-clustering on the cancer cell surface. By studying αvβ3/EGF-R-effects on the focal adhesion kinase (FAK) and the mitogen activated protein kinases (MAPK) p44/42 (erk−1/erk−2), having important functions in synergistic crosstalk between integrins and growth-factor receptors, we found for both significant enhancement of expression and activity upon αvβ3/VN interaction and cell stimulation by EGF. Upregulation of the EGF-R by integrin αvβ3, both receptor molecules with a well-defined role as targets for cancer treatment, might represent an additional mechanism to adapt synergistic receptor signaling and crosstalk in response to an altered tumor cell microenvironment during ovarian cancer progression.
Resumo:
In this work, we present the challenges associated with the two-way recommendation methods in social networks and the solutions. We discuss them from the perspective of community-type social networks such as online dating networks.
Resumo:
互联网个性化推荐系统(Internet personal recommender systems)是根据用户的兴趣推荐最相关的互联网信息给用户的系统。在网上信息过载矛盾越来越严重、用户信息检索的个性化需求日益增强的现状下,推荐系统已经在搜索引擎、电子商务、网上社区等互联网关键应用中起到了关键性的作用,并且越来越受到重视。 然而,在大型网站上部署一个成熟推荐系统的代价依然很大,需要大量的计算和存储资源,推荐的准确性也依然有很大提升空间和需求,这就为推荐系统的研究提供了很多挑战。在这些挑战中推荐算法的准确性和可扩展性一直是该领域最为关注的两个问题,所谓推荐的准确性是指推荐的信息中用户真正感兴趣的比例,而可扩展性指的是系统能否在可容忍的时间和空间复杂度内处理海量的数据。如何在提高算法推荐准确性的同时增强算法的可扩展性是推荐系统改进的主要研究目标。然而,目前学术界的研究更多侧重于提高推荐算法的准确性,而对于可扩展性,很多准确性很高的算法由于需要比较复杂的计算,处理大规模动态数据的能力往往比较有限,并且它们的评测实验中并没有将可扩展性纳入到评价范畴,导致这些算法目前还很难在工业界大规模应用。 本论文的研究试图解决这一问题。通过在推荐算法中借鉴增量学习(Incremental learning)的思想,即考虑最新的训练数据来更新原有的机器学习模型,不需要或仅需要参考部分旧的训练数据,相对于使用全部数据也即批量的处理方式,增量式改进可以大大降低模型更新的复杂度,从而可以大幅度提高推荐算法在遇到新的训练数据时推荐模型更新的效率,降低计算代价,使得推荐模型的更新可以更加及时,进而提高推荐结果的准确性。具体来说,我们在提出了两种新的增量式协同过滤算法的同时,采用增量式学习的方法对目前准确性最好的若干推荐算法进行加速,特别是提高这些算法面对新的训练数据的更新模型的速度和效率,从而为这些算法的大规模的应用提供了可能。另一方面,新的训练数据包含了最新的用户兴趣,因此相对于旧的训练数据,算法在做更新时应给予更高的权重,这样才能做到推荐的结果在考虑到用户长期兴趣的同时,特别考虑用户近期的兴趣,从而使得推荐结果更加准确。这两方面归纳起来,我们旨在通过增量式学习使得推荐算法在更新时更加高效和精确,真正适用于互联网上海量数据的推荐,同时对其他增量式推荐系统方面的研究也具有借鉴意义。我们的改进工作主要包括以下几个方面: 基于主题模型的增量式推荐算法。主题模型,特别是概率隐含主题模型(PLSA)是一种广泛应用于推荐系统的主流方法,在文本推荐、图像推荐以及协同过滤推荐领域都有着很好的推荐效果。目前制约PLSA算法取得更大成功的重要因素就是PLSA算法更新的复杂度过高,使得学习模型的更新只能做批量式处理,这样就导致推荐的时效性不高,也没有办法体现用户的最新的兴趣和整体的最新动态。我们提出了一种增量式学习方法,可以应用于文本分析领域和协同过滤领域,当有新的训练数据到来时,对于基于文本的推荐,增量式更新方法仅寻找最相关的用户和文本以及涉及到的单词进行主题分布的更新,并给予新的文本以更高权重;对于协同过滤,我们的方法仅对当前用户所评分过得物品以及当前物品所涉及的用户进行更新,大大降低了更新的运算复杂度,提高了新数据在推荐算法中所占的权重,使得推荐更加准确、及时。我们的算法在天涯问答文本数据集上和MovieLens电影推荐数据集、Last.FM歌曲推荐数据集、豆瓣图书推荐数据集等协同过滤数据集上取得了很好的效果。 基于蚁群算法(Ant colony algorithm)的协同过滤推荐方法。受到群体智能(Swarm intelligence)算法的启发,我们提出了一种类似于蚁群算法的协同过滤推荐方法——Ant Collaborative Filtering,初始化阶段该方法给予每个用户或一组用户以全局唯一的单位数量的信息素,当用户对物品评分或者用户表示对该物品感兴趣时,用户所携带的信息素相应的传播到该物品上,同时该物品上已有的信息素(初始化为0)也会相应的传播给该用户;此外,用户和物品所携带的信息素会随着时间的推移有一定速率的挥发,通过挥发机制,可以在推荐时更重视用户近期的兴趣;推荐阶段,按照用户和物品所携带的信息素的种类和数量,我们可以得到相应的相似度,进而通过经典的相似度比较的方法来进行推荐。基于蚁群的协同过滤方法的优势在于可以有效的降低训练数据中的稀疏性,并且推荐算法可以实时的进行更新和推荐,同时考虑了用户兴趣随着时间的变化。我们在MovieLens电影评分、豆瓣书籍推荐、Last.FM音乐推荐数据集上验证了我们的方法。最后,我们建立了一个互联网新闻推荐系统,该系统以Firefox插件形式实现,自动采集用户浏览兴趣和偏好,后端使用不同的推荐算法推荐用户感兴趣的新闻给用户。 基于联合聚类(Co-clustering)的两阶段协同过滤方法。聚类(Clustering)是一种缩小数据规模、降低数据稀疏性的有效方法。对于庞大而稀疏的协同过滤训练数据来说,聚类是一种很自然事实上也的确很有效的预处理方法。因此我们提出了一种两阶段协同过滤框架:首先通过我们提出的一种联合聚类的方法,将原始评分矩阵分解成很多维度很小的块,每一块里面包含相似的用户对相似的物品的评分,然后通过矩阵拟合的方法(我们使用了非负矩阵分解NMF和主题模型PLSA)来对这些小块中的未知评分进行预测。当用户新增了对于某物品的一条评分,我们仅需要更新该用户或该物品所处的数据块进行重新评分预估,大大加快了评分预估的速度。我们在MovieLens电影评分数据集上验证了该算法的效果。 本文的研究成果不仅可以直接应用于大型推荐系统中,而且对于增量式推荐系统的后续研究也具有一定的指导意义。首先基于PLSA的增量式推荐算法对于其他基于图模型的推荐系统具有借鉴价值,其次蚁群推荐算法为一类新的、基于群体智能(Swarm intellignece)的协同过滤算法做出了有价值的探索,最后我们提出的两阶段协同过滤框架对于提高推荐算法的可扩展性和更新效率提出了一个通用的有效解决方案。 推荐系统是一个无止尽的优化的过程,除了推荐精度的不断提高之外,推荐算法的性能随着互联网上数据量的增加也需要进一步提高,增量式学习无疑是提高推荐算法更新速度最重要的方法,本文的研究为这一方向提供了参考。
Resumo:
Recognition of bacterial lipopolysaccharide (LPS) by the innate immune system involves at least three receptor molecules: CD14, TLR4 and MD-2. Additional receptor components such as heat shock proteins, chemokine receptor 4 (CXCR4), or CD55 have been suggested to be part of this activation cluster; possibly acting as additional LPS transfer molecules. Our group has previously identified CXCR4 as a component of the "LPS-sensing apparatus". In this study we aimed to elucidate the role that CXCR4 plays in innate immune responses to LPS. Here we demonstrate that CXCR4 transfection results in responsiveness to LPS. Fluorescence correlation spectroscopy experiments further showed that LPS directly interacts with CXCR4. Our data suggest that CXCR4 is not only involved in LPS binding but is also responsible for triggering signalling, especially mitogen-activated protein kinases in response to LPS. Finally, co-clustering of CXCR4 with other LPS receptors seems to be crucial for LPS signalling, thus suggesting that CXCR4 is a functional part of the multimeric LPS "sensing apparatus".
Resumo:
Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
Resumo:
Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
Resumo:
Analyzing geographical patterns by collocating events, objects or their attributes has a long history in surveillance and monitoring, and is particularly applied in environmental contexts, such as ecology or epidemiology. The identification of patterns or structures at some scales can be addressed using spatial statistics, particularly marked point processes methodologies. Classification and regression trees are also related to this goal of finding "patterns" by deducing the hierarchy of influence of variables on a dependent outcome. Such variable selection methods have been applied to spatial data, but, often without explicitly acknowledging the spatial dependence. Many methods routinely used in exploratory point pattern analysis are2nd-order statistics, used in a univariate context, though there is also a wide literature on modelling methods for multivariate point pattern processes. This paper proposes an exploratory approach for multivariate spatial data using higher-order statistics built from co-occurrences of events or marks given by the point processes. A spatial entropy measure, derived from these multinomial distributions of co-occurrences at a given order, constitutes the basis of the proposed exploratory methods. © 2010 Elsevier Ltd.
Resumo:
Analyzing geographical patterns by collocating events, objects or their attributes has a long history in surveillance and monitoring, and is particularly applied in environmental contexts, such as ecology or epidemiology. The identification of patterns or structures at some scales can be addressed using spatial statistics, particularly marked point processes methodologies. Classification and regression trees are also related to this goal of finding "patterns" by deducing the hierarchy of influence of variables on a dependent outcome. Such variable selection methods have been applied to spatial data, but, often without explicitly acknowledging the spatial dependence. Many methods routinely used in exploratory point pattern analysis are2nd-order statistics, used in a univariate context, though there is also a wide literature on modelling methods for multivariate point pattern processes. This paper proposes an exploratory approach for multivariate spatial data using higher-order statistics built from co-occurrences of events or marks given by the point processes. A spatial entropy measure, derived from these multinomial distributions of co-occurrences at a given order, constitutes the basis of the proposed exploratory methods. © 2010 Elsevier Ltd.
Resumo:
In the knowledge-based clustering approaches reported in the literature, explicit know ledge, typically in the form of a set of concepts, is used in computing similarity or conceptual cohesiveness between objects and in grouping them. We propose a knowledge-based clustering approach in which the domain knowledge is also used in the pattern representation phase of clustering. We argue that such a knowledge-based pattern representation scheme reduces the complexity of similarity computation and grouping phases. We present a knowledge-based clustering algorithm for grouping hooks in a library.
Resumo:
Bycatch and resultant discard mortality are issues of global concern. The groundfish demersal trawl fishery on the west coast of the United States is a multispecies fishery with significant catch of target and nontarget species. These catches are of particular concern in regard to species that have previously been declared overfished and are currently rebuilding biomass back to target levels. To understand these interactions better, we used data from the West Coast Groundfish Observer Program in a series of cluster analyses to evaluate 3 questions: 1) Are there identifiable associations between species caught in the bottom trawl fishery; 2) Do species that are undergoing population rebuilding toward target biomass levels (“rebuilding species”) cluster with targeted species in a consistent way; 3) Are the relationships between rebuilding bycatch species and target species more resolved at particular spatial scales or are relationships spatially consistent across the whole data set? Two strong species clusters emerged—a deepwater slope cluster and a shelf cluster—neither of which included rebuilding species. The likelihood of encountering rebuilding rockfish species is relatively low. To evaluate whether weak clustering of rebuilding rockfish was attributable to their low rate of occurrence, we specified null models of species occurrence. Results indicated that the ability to predict occurrence of rebuilding rockfish when target species were caught was low. Cluster analyses performed at a variety of spatial scales indicated that the most reliable clustering of rebuilding species was at the spatial scale of individual fishing ports. This finding underscores the value of spatially resolved data for fishery management.