931 results for Document Segmentation


Relevance:

20.00%

Abstract:

With the growing size and variety of social media files on the web, it is becoming critical to organize them efficiently into clusters for further processing. This paper presents a novel, scalable constrained document clustering method that harnesses search engines capable of handling large text data. Instead of calculating the distance between each document and every cluster centroid, the method uses a document ranking scheme to choose a neighborhood of the best cluster candidates. To make the method faster and less memory dependent, in-memory and in-database processing are combined in a semi-incremental manner. The method has been extensively tested in a social event detection application. Empirical analysis shows that the proposed method is efficient in both computation and memory usage while producing notable accuracy.
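The candidate-neighborhood idea can be illustrated with a minimal sketch. This is not the authors' implementation: the inverted index stands in for the search engine, the term-overlap vote stands in for their ranking scheme, and the names `candidate_clusters`, `assign`, `k`, and `threshold` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def candidate_clusters(doc, index, k=2):
    """Rank clusters by how many of the document's terms they contain
    (a stand-in for a search-engine ranking) and keep the top k."""
    votes = Counter()
    for term in doc:
        for cid in index.get(term, ()):
            votes[cid] += 1
    return [cid for cid, _ in votes.most_common(k)]


def assign(doc, centroids, index, k=2, threshold=0.2):
    """Compare the document only with its candidate centroids.
    Returns None when no candidate is similar enough (assumed here
    to mean that a new cluster should be started)."""
    best, best_sim = None, threshold
    for cid in candidate_clusters(doc, index, k):
        sim = cosine(doc, centroids[cid])
        if sim > best_sim:
            best, best_sim = cid, sim
    return best


# Toy centroids, invented for illustration.
centroids = {
    "concert": Counter({"music": 3, "stage": 2, "band": 2}),
    "football": Counter({"goal": 3, "match": 2, "stadium": 2}),
    "protest": Counter({"march": 3, "police": 2, "street": 1}),
}

# Inverted index: term -> clusters whose centroid contains that term.
index = defaultdict(set)
for cid, centroid in centroids.items():
    for term in centroid:
        index[term].add(cid)

doc = Counter("band music stage music".split())
print(assign(doc, centroids, index))
```

Only the clusters returned by the index lookup are scored, so the cost per document depends on `k` rather than on the total number of clusters.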

Abstract:

This thesis investigates the fusion of 3D visual information with 2D image cues to provide 3D semantic maps of the large-scale environments that a robot traverses. A major theme of this thesis was to exploit the availability of 3D information acquired from robot sensors to improve upon 2D object classification alone. The proposed methods have been evaluated on several indoor and outdoor datasets collected from mobile robotic platforms, including a quadcopter and a ground vehicle, covering several kilometres of urban roads.

Abstract:

This article presents a study of how humans perceive and judge the relevance of documents. Humans are adept at making reasonably robust and quick decisions about what information is relevant to them, despite the ever-increasing complexity and volume of their surrounding information environment. The literature on document relevance has identified various dimensions of relevance (e.g., topicality and novelty); however, little is understood about how these dimensions may interact. We performed a crowdsourced study of how human subjects judge two relevance dimensions in relation to document snippets retrieved from an internet search engine. The order of the judgments was controlled. For those judgments exhibiting an order effect, a q–test was performed to determine whether the order effects can be explained by a quantum decision model based on incompatible decision perspectives. Some evidence of incompatibility was found, which suggests that a model of incompatible decision perspectives is appropriate for explaining interacting dimensions of relevance in such instances.

Abstract:

This thesis presents new methods for the classification and thematic grouping of billions of web pages, at scales not previously achievable. This process is also known as document clustering, where similar documents are automatically associated with clusters that represent various distinct topics. These automatically discovered topics are in turn used to improve search engine performance by searching only the topics that are deemed relevant to particular user queries.

Abstract:

The use of ‘topic’ concepts has been shown to improve search performance, given a query, by bringing together relevant documents which use different terms to describe a higher-level concept. In this paper, we propose a method for discovering and utilizing concepts in indexing and search for a domain-specific document collection used in industry. This approach differs from others in that we collect only focused concepts to build the concept space and that, instead of turning a user’s query into a concept-based query, we experiment with different techniques for combining the original query with a concept query. We apply the proposed approach to a real-world document collection, and the results show that in this scenario the use of concept knowledge at index and search time can improve the relevancy of results.

Abstract:

Transit passenger market segmentation enables transit operators to reach different classes of transit users with targeted surveys and various operational and strategic planning improvements. However, existing market segmentation studies in the literature have generally relied on passenger surveys, which have various limitations. The smart card (SC) data from an automated fare collection system facilitate the understanding of the multiday travel patterns of transit passengers and can be used to segment them into identifiable types with similar behaviors and needs. This paper proposes a comprehensive methodology for passenger segmentation using SC data alone. After reconstructing the travel itineraries from SC transactions, this paper adopts the density-based spatial clustering of applications with noise (DBSCAN) algorithm to mine the travel pattern of each SC user. An a priori market segmentation approach then segments transit passengers into four identifiable types. The methodology proposed in this paper helps transit operators understand their passengers and provide them with targeted information and services.
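To make the DBSCAN step concrete, here is a minimal pure-Python sketch applied to one hypothetical user's boarding times. The one-dimensional feature (hour of day), the `eps` and `min_pts` values, and the sample data are assumptions for illustration; the abstract does not state which features or parameters the paper uses.

```python
NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 1-D values. Returns one label per point:
    0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


# Hypothetical boarding times (hour of day) for one smart card.
boarding_hours = [7.9, 8.0, 8.1, 8.2, 13.5, 17.4, 17.5, 17.6, 17.7]
print(dbscan(boarding_hours, eps=0.25, min_pts=3))
```

In this toy data the morning and evening boardings form two dense groups, while the isolated midday trip is labelled as noise, which is the kind of regular-versus-irregular distinction a travel-pattern analysis might draw on.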

Abstract:

This paper describes our participation in the Chinese word segmentation task of CIPS-SIGHAN 2010. We implemented an n-gram mutual information (NGMI) based segmentation algorithm that mixes features from unsupervised, supervised and dictionary-based segmentation methods. This algorithm is also combined with a simple strategy for out-of-vocabulary (OOV) word recognition. The evaluation for both open and closed training shows encouraging results for our system. The results for OOV word recognition in the closed training evaluation were, however, unsatisfactory.
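As a rough illustration of mutual-information-based segmentation, the sketch below scores each pair of adjacent characters with pointwise mutual information (PMI) and places a word boundary wherever the score falls below a threshold. It is a simplified bigram variant, not the NGMI algorithm described in the paper; the toy corpus, the `threshold` value, and the function names are assumptions.

```python
import math
from collections import Counter


def train(corpus):
    """Count characters and adjacent character pairs in unsegmented text."""
    unigrams, bigrams = Counter(), Counter()
    for s in corpus:
        unigrams.update(s)
        bigrams.update(s[i:i + 2] for i in range(len(s) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b)."""
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = bigrams[a + b] / n_bi
    if p_ab == 0:
        return float("-inf")  # unseen pair: always a boundary
    return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))


def segment(text, unigrams, bigrams, threshold):
    """Split between adjacent characters whose PMI is below threshold."""
    if not text:
        return []
    words, current = [], text[0]
    for a, b in zip(text, text[1:]):
        if pmi(a, b, unigrams, bigrams) >= threshold:
            current += b
        else:
            words.append(current)
            current = b
    words.append(current)
    return words


# Toy unsegmented corpus, invented for illustration.
corpus = ["北京大学", "北京天气", "大学天气", "天气北京", "大学北京"]
unigrams, bigrams = train(corpus)
print(segment("北京大学天气", unigrams, bigrams, threshold=1.5))
```

Characters that co-occur more often than chance keep a high PMI and stay together, while pairs that straddle a word boundary score lower and are split.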

Abstract:

Clustering is an important technique for organising and categorising web-scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer-granularity clusters, such as 10,000 or 50,000. These results were evaluated using a set of external data.

Abstract:

Market segmentation has received relatively limited attention in social marketing, particularly within the context of changing children’s physical activity behaviour. This is an important area of investigation given growing concern over childhood obesity globally. The present research aims to extend current understanding of the applicability of market segmentation within this context. The results of a two-step cluster analysis on data from 512 respondents of an online survey show three distinct segments of caregivers, each with unique beliefs about their primary school children walking to/from school. The results demonstrate the validity of employing the process of market segmentation within this social context and provide further insights for targeting the identified segments through tailored social marketing programs.

Abstract:

For traditional information filtering (IF) models, it is often assumed that the documents in one collection are related to only one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling was proposed to generate statistical models that represent multiple topics in a collection of documents, but in a topic model, topics are represented by distributions over words, which are limited in their ability to distinctively represent the semantics of topics. Patterns are generally thought to be more discriminative than single terms and are able to reveal the inner relations between words. This paper proposes a novel information filtering model, the Significant matched Pattern-based Topic Model (SPBTM). The SPBTM represents user information needs in terms of multiple topics, and each topic is represented by patterns. More importantly, the patterns are organized into groups based on their statistical and taxonomic features, from which the more representative patterns, called Significant Matched Patterns, can be identified and used to estimate document relevance. Experiments on benchmark data sets demonstrate that the SPBTM significantly outperforms the state-of-the-art models.

Abstract:

Recent changes in the aviation industry and in the expectations of travellers have begun to alter the way we approach our understanding, and thus the segmentation, of airport passengers. The key to successful segmentation of any population lies in the selection of the criteria on which the partitions are based. Increasingly, the basic criteria used to segment passengers (purpose of trip and frequency of travel) no longer provide adequate insights into the passenger experience. In this paper, we propose a new model for passenger segmentation based on the passenger core value, time. The results are based on qualitative research conducted in-situ at Brisbane International Terminal during 2012-2013. Based on our research, a relationship between time sensitivity and degree of passenger engagement was identified. This relationship was used as the basis for a new passenger segmentation model, namely: Airport Enthusiast (engaged, non time sensitive); Time Filler (non engaged, non time sensitive); Efficiency Lover (non engaged, time sensitive) and Efficient Enthusiast (engaged, time sensitive). The outcomes of this research extend the theoretical knowledge about passenger experience in the terminal environment. These new insights can ultimately be used to optimise the allocation of space for future terminal planning and design.
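The four segments form a two-by-two grid over engagement and time sensitivity, which can be written down as a small lookup. This is only a restatement of the segment definitions listed in the abstract; the function name and the boolean encoding are assumptions.

```python
def passenger_segment(engaged: bool, time_sensitive: bool) -> str:
    """Map the two dimensions named in the abstract to a segment label."""
    segments = {
        (True, False): "Airport Enthusiast",
        (False, False): "Time Filler",
        (False, True): "Efficiency Lover",
        (True, True): "Efficient Enthusiast",
    }
    return segments[(engaged, time_sensitive)]


print(passenger_segment(engaged=False, time_sensitive=True))
```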

Abstract:

Objective: This study seeks to establish whether meaningful subgroups exist within a 14-16 year old adolescent population and whether these segments respond differently to the Game On: Know Alcohol (GOKA) intervention, a school-based alcohol social marketing program. Methodology: This study is part of a larger cluster randomized controlled evaluation of the GOKA program implemented in 14 schools in 2013/2014. TwoStep cluster analysis was conducted to segment 2114 high school adolescents (14-16 years old) on the basis of 22 demographic, behavioral and psychographic variables. Program effects on the knowledge, attitudes, behavioral intentions, social norms, expectancies and refusal self-efficacy of the identified segments were subsequently examined. Results: Three segments were identified: (1) Abstainers, (2) Bingers and (3) Moderate Drinkers. Program effects varied significantly across segments. The strongest positive change effects post participation were observed for the Bingers, while mixed effects were evident for Moderate Drinkers and Abstainers. Conclusions: These findings provide preliminary empirical evidence supporting the application of social marketing segmentation in alcohol education programs. Development of targeted programs that meet the unique needs of each of the three identified segments is indicated to extend the social marketing footprint in alcohol education.

Abstract:

This cross-disciplinary study was conducted as two research and development projects. The outcome is a multimodal and dynamic chronicle, which incorporates the tracking of spatial, temporal and visual elements of performative practice-led and design-led research journeys. The distilled model provides a strong new approach to demonstrating rigour in non-traditional research outputs, including provenance and an 'augmented web of facticity'.

Abstract:

Robust and automatic non-rigid registration depends on many parameters that have not yet been systematically explored. Here we determined how tissue classification influences non-linear fluid registration of brain MRI. Twin data is ideal for studying this question, as volumetric correlations between corresponding brain regions that are under genetic control should be higher in monozygotic twins (MZ) who share 100% of their genes when compared to dizygotic twins (DZ) who share half their genes on average. When these substructure volumes are quantified using tensor-based morphometry, improved registration can be defined based on which method gives higher MZ twin correlations when compared to DZs, as registration errors tend to deplete these correlations. In a study of 92 subjects, higher effect sizes were found in cumulative distribution functions derived from statistical maps when performing tissue classification before fluid registration, versus fluidly registering the raw images. This gives empirical evidence in favor of pre-segmenting images for tensor-based morphometry.