986 resultados para Similarity measure


Relevância:

60.00% 60.00%

Publicador:

Resumo:

The continuous growth of the XML data poses a great concern in the area of XML data management. The need for processing large amounts of XML data brings complications to many applications, such as information retrieval, data integration and many others. One way of simplifying this problem is to break the massive amount of data into smaller groups by application of clustering techniques. However, XML clustering is an intricate task that may involve the processing of both the structure and the content of XML data in order to identify similar XML data. This research presents four clustering methods, two methods utilizing the structure of XML documents and the other two utilizing both the structure and the content. The two structural clustering methods have different data models. One is based on a path model and other is based on a tree model. These methods employ rigid similarity measures which aim to identifying corresponding elements between documents with different or similar underlying structure. The two clustering methods that utilize both the structural and content information vary in terms of how the structure and content similarity are combined. One clustering method calculates the document similarity by using a linear weighting combination strategy of structure and content similarities. The content similarity in this clustering method is based on a semantic kernel. The other method calculates the distance between documents by a non-linear combination of the structure and content of XML documents using a semantic kernel. Empirical analysis shows that the structure-only clustering method based on the tree model is more scalable than the structure-only clustering method based on the path model as the tree similarity measure for the tree model does not need to visit the parents of an element many times. Experimental results also show that the clustering methods perform better with the inclusion of the content information on most test document collections. To further the research, the structural clustering method based on tree model is extended and employed in XML transformation. The results from the experiments show that the proposed transformation process is faster than the traditional transformation system that translates and converts the source XML documents sequentially. Also, the schema matching process of XML transformation produces a better matching result in a shorter time.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Trees are capable of portraying the semi-structured data which is common in web domain. Finding similarities between trees is mandatory for several applications that deal with semi-structured data. Existing similarity methods examine a pair of trees by comparing through nodes and paths of two trees, and find the similarity between them. However, these methods provide unfavorable results for unordered tree data and result in yielding NP-hard or MAX-SNP hard complexity. In this paper, we present a novel method that encodes a tree with an optimal traversing approach first, and then, utilizes it to model the tree with its equivalent matrix representation for finding similarity between unordered trees efficiently. Empirical analysis shows that the proposed method is able to achieve high accuracy even on the large data sets.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Accurate process model elicitation continues to be a time consuming task, requiring skill on the part of the interviewer to extract explicit and tacit process information from the interviewee. Many errors occur in this elicitation stage that would be avoided by better activity recall, more consistent specification methods and greater engagement in the elicitation process by interviewees. Theories of situated cognition indicate that interactive 3D representations of real work environments engage and prime the cognitive state of the viewer. In this paper, our major contribution is to augment a previous process elicitation methodology with virtual world context metadata, drawn from a 3D simulation of the workplace. We present a conceptual and formal approach for representing this contextual metadata, integrated into a process similarity measure that provides hints for the business analyst to use in later modelling steps. Finally, we conclude with examples from two use cases to illustrate the potential abilities of this approach.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Identifying product families has been considered as an effective way to accommodate the increasing product varieties across the diverse market niches. In this paper, we propose a novel framework to identifying product families by using a similarity measure for a common product design data BOM (Bill of Materials) based on data mining techniques such as frequent mining and clus-tering. For calculating the similarity between BOMs, a novel Extended Augmented Adjacency Matrix (EAAM) representation is introduced that consists of information not only of the content and topology but also of the fre-quent structural dependency among the various parts of a product design. These EAAM representations of BOMs are compared to calculate the similarity between products and used as a clustering input to group the product fami-lies. When applied on a real-life manufacturing data, the proposed framework outperforms a current baseline that uses orthogonal Procrustes for grouping product families.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Acoustic recordings of the environment provide an effective means to monitor bird species diversity. To facilitate exploration of acoustic recordings, we describe a content-based birdcall retrieval algorithm. A query birdcall is a region of spectrogram bounded by frequency and time. Retrieval depends on a similarity measure derived from the orientation and distribution of spectral ridges. The spectral ridge detection method caters for a broad range of birdcall structures. In this paper, we extend previous work by incorporating a spectrogram scaling step in order to improve the detection of spectral ridges. Compared to an existing approach based on MFCC features, our feature representation achieves better retrieval performance for multiple bird species in noisy recordings.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Bioacoustic data can be used for monitoring animal species diversity. The deployment of acoustic sensors enables acoustic monitoring at large temporal and spatial scales. We describe a content-based birdcall retrieval algorithm for the exploration of large data bases of acoustic recordings. In the algorithm, an event-based searching scheme and compact features are developed. In detail, ridge events are detected from audio files using event detection on spectral ridges. Then event alignment is used to search through audio files to locate candidate instances. A similarity measure is then applied to dimension-reduced spectral ridge feature vectors. The event-based searching method processes a smaller list of instances for faster retrieval. The experimental results demonstrate that our features achieve better success rate than existing methods and the feature dimension is greatly reduced.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This research is a step forward in discovering knowledge from databases of complex structure like tree or graph. Several data mining algorithms are developed based on a novel representation called Balanced Optimal Search for extracting implicit, unknown and potentially useful information like patterns, similarities and various relationships from tree data, which are also proved to be advantageous in analysing big data. This thesis focuses on analysing unordered tree data, which is robust to data inconsistency, irregularity and swift information changes, hence, in the era of big data it becomes a popular and widely used data model.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Topic detection and tracking (TDT) is an area of information retrieval research the focus of which revolves around news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new, previously unreported, tracking the development of a previously reported event, and grouping together news that discuss the same event. The performance of the traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. It has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system. The news reflect changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experiment results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30\% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4\%. The anchoring the text to a time-line based on the temporal expressions gave a further 10\% increase the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Visual tracking has been a challenging problem in computer vision over the decades. The applications of Visual Tracking are far-reaching, ranging from surveillance and monitoring to smart rooms. Mean-shift (MS) tracker, which gained more attention recently, is known for tracking objects in a cluttered environment and its low computational complexity. The major problem encountered in histogram-based MS is its inability to track rapidly moving objects. In order to track fast moving objects, we propose a new robust mean-shift tracker that uses both spatial similarity measure and color histogram-based similarity measure. The inability of MS tracker to handle large displacements is circumvented by the spatial similarity-based tracking module, which lacks robustness to object's appearance change. The performance of the proposed tracker is better than the individual trackers for tracking fast-moving objects with better accuracy.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A nonparametric, hierarchical, disaggregative clustering algorithm is developed using a novel similarity measure, called the mutual neighborhood value (MNV), which takes into account the conventional nearest neighbor ranks of two samples with respect to each other. The algorithm is simple, noniterative, requires low storage, and needs no specification of the expected number of clusters. The algorithm appears very versatile as it is capable of discerning spherical and nonspherical clusters, linearly nonseparable clusters, clusters with unequal populations, and clusters with lowdensity bridges. Changing of the neighborhood size enables discernment of strong or weak patterns.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Study of symmetric or repeating patterns in scalar fields is important in scientific data analysis because it gives deep insights into the properties of the underlying phenomenon. Though geometric symmetry has been well studied within areas like shape processing, identifying symmetry in scalar fields has remained largely unexplored due to the high computational cost of the associated algorithms. We propose a computationally efficient algorithm for detecting symmetric patterns in a scalar field distribution by analysing the topology of level sets of the scalar field. Our algorithm computes the contour tree of a given scalar field and identifies subtrees that are similar. We define a robust similarity measure for comparing subtrees of the contour tree and use it to group similar subtrees together. Regions of the domain corresponding to subtrees that belong to a common group are extracted and reported to be symmetric. Identifying symmetry in scalar fields finds applications in visualization, data exploration, and feature detection. We describe two applications in detail: symmetry-aware transfer function design and symmetry-aware isosurface extraction.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate if supervision can be automatically transferred for clustering a target task, by providing a relevant supervised partitioning of a dataset from a different source task. The target clustering is made more meaningful for the human user by trading-off intrinsic clustering goodness on the target task for alignment with relevant supervised partitions in the source task, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions. The alignment process makes use of a cross-task similarity measure that discovers hidden relationships across tasks. When the source and target tasks correspond to different domains with potentially different vocabularies, we propose a projection approach using pivot vocabularies for the cross-domain similarity measure. Using multiple real-world and synthetic datasets, we show that our approach improves clustering accuracy significantly over traditional k-means and state-of-the-art semi-supervised clustering baselines, over a wide range of data characteristics and parameter settings.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Clustering techniques which can handle incomplete data have become increasingly important due to varied applications in marketing research, medical diagnosis and survey data analysis. Existing techniques cope up with missing values either by using data modification/imputation or by partial distance computation, often unreliable depending on the number of features available. In this paper, we propose a novel approach for clustering data with missing values, which performs the task by Symmetric Non-Negative Matrix Factorization (SNMF) of a complete pair-wise similarity matrix, computed from the given incomplete data. To accomplish this, we define a novel similarity measure based on Average Overlap similarity metric which can effectively handle missing values without modification of data. Further, the similarity measure is more reliable than partial distances and inherently possesses the properties required to perform SNMF. The experimental evaluation on real world datasets demonstrates that the proposed approach is efficient, scalable and shows significantly better performance compared to the existing techniques.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Part I of the thesis describes the olfactory searching and scanning behaviors of rats in a wind tunnel, and a detailed movement analysis of terrestrial arthropod olfactory scanning behavior. Olfactory scanning behaviors in rats may be a behavioral correlate to hippocampal place cell activity.

Part II focuses on the organization of olfactory perception, what it suggests about a natural order for chemicals in the environment, and what this in tum suggests about the organization of the olfactory system. A model of odor quality space (analogous to the "color wheel") is presented. This model defines relationships between odor qualities perceived by human subjects based on a quantitative similarity measure. Compounds containing Carbon, Nitrogen, or Sulfur elicit odors that are contiguous in this odor representation, which thus allows one to predict the broad class of odor qualities a compound is likely to elicit. Based on these findings, a natural organization for olfactory stimuli is hypothesized: the order provided by the metabolic process. This hypothesis is tested by comparing compounds that are structurally similar, perceptually similar, and metabolically similar in a psychophysical cross-adaptation paradigm. Metabolically similar compounds consistently evoked shifts in odor quality and intensity under cross-adaptation, while compounds that were structurally similar or perceptually similar did not. This suggests that the olfactory system may process metabolically similar compounds using the same neural pathways, and that metabolic similarity may be the fundamental metric about which olfactory processing is organized. In other words, the olfactory system may be organized around a biological basis.

The idea of a biological basis for olfactory perception represents a shift in how olfaction is understood. The biological view has predictive power while the current chemical view does not, and the biological view provides explanations for some of the most basic questions in olfaction, that are unanswered in the chemical view. Existing data do not disprove a biological view, and are consistent with basic hypotheses that arise from this viewpoint.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A computer can assist the process of design by analogy by recording past designs. The experience these represent could be much wider than that of designers using the system, who therefore need to identify potential cases of interest. If the computer assists with this lookup, the designers can concentrate on the more interesting aspect of extracting and using the ideas which are found. However, as the knowledge base grows it becomes ever harder to find relevant cases using a keyword indexing scheme without knowing precisely what to look for. Therefore a more flexible searching system is needed.

If a similarity measure can be defined for the features of the designs, then it is possible to match and cluster them. Using a simple measure like co-occurrence of features within a particular case would allow this to happen without human intervention, which is tedious and time- consuming. Any knowledge that is acquired about how features are related to each other will be very shallow: it is not intended as a cognitive model for how humans understand, learn, or retrieve information, but more an attempt to make effective, efficient use of the information available. The question remains of whether such shallow knowledge is sufficient for the task.

A system to retrieve information from a large database is described. It uses co-occurrences to relate keywords to each other, and then extends search queries with similar words. This seems to make relevant material more accessible, providing hope that this retrieval technique can be applied to a broader knowledge base.