990 resultados para content similarity


Relevância:

70.00% 70.00%

Publicador:

Resumo:

XML document clustering is essential for many document handling applications such as information storage, retrieval, integration and transformation. An XML clustering algorithm should process both the structural and the content information of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. This paper introduces a novel approach that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The proposed method reduces the high dimensionality of input data by using only the structure-constrained content. The empirical analysis reveals that the proposed method can effectively cluster even very large XML datasets and outperform other existing methods.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

With the growing number of XML documents on theWeb it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for reallife datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3- order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose largesized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large scaled datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The emergence of cloud datacenters enhances the capability of online data storage. Since massive data is stored in datacenters, it is necessary to effectively locate and access interest data in such a distributed system. However, traditional search techniques only allow users to search images over exact-match keywords through a centralized index. These techniques cannot satisfy the requirements of content based image retrieval (CBIR). In this paper, we propose a scalable image retrieval framework which can efficiently support content similarity search and semantic search in the distributed environment. Its key idea is to integrate image feature vectors into distributed hash tables (DHTs) by exploiting the property of locality sensitive hashing (LSH). Thus, images with similar content are most likely gathered into the same node without the knowledge of any global information. For searching semantically close images, the relevance feedback is adopted in our system to overcome the gap between low-level features and high-level features. We show that our approach yields high recall rate with good load balance and only requires a few number of hops.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The continuous growth of the XML data poses a great concern in the area of XML data management. The need for processing large amounts of XML data brings complications to many applications, such as information retrieval, data integration and many others. One way of simplifying this problem is to break the massive amount of data into smaller groups by application of clustering techniques. However, XML clustering is an intricate task that may involve the processing of both the structure and the content of XML data in order to identify similar XML data. This research presents four clustering methods, two methods utilizing the structure of XML documents and the other two utilizing both the structure and the content. The two structural clustering methods have different data models. One is based on a path model and other is based on a tree model. These methods employ rigid similarity measures which aim to identifying corresponding elements between documents with different or similar underlying structure. The two clustering methods that utilize both the structural and content information vary in terms of how the structure and content similarity are combined. One clustering method calculates the document similarity by using a linear weighting combination strategy of structure and content similarities. The content similarity in this clustering method is based on a semantic kernel. The other method calculates the distance between documents by a non-linear combination of the structure and content of XML documents using a semantic kernel. Empirical analysis shows that the structure-only clustering method based on the tree model is more scalable than the structure-only clustering method based on the path model as the tree similarity measure for the tree model does not need to visit the parents of an element many times. Experimental results also show that the clustering methods perform better with the inclusion of the content information on most test document collections. To further the research, the structural clustering method based on tree model is extended and employed in XML transformation. The results from the experiments show that the proposed transformation process is faster than the traditional transformation system that translates and converts the source XML documents sequentially. Also, the schema matching process of XML transformation produces a better matching result in a shorter time.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Web sites that rely on databases for their content are now ubiquitous. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches. Furthermore, it establishes the case for use of vision-based algorithms in the context of data extraction from web sites.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The web services (WS) technology provides a comprehensive solution for representing, discovering, and invoking services in a wide variety of environments, including Service Oriented Architectures (SOA) and grid computing systems. At the core of WS technology lie a number of XML-based standards, such as the Simple Object Access Protocol (SOAP), that have successfully ensured WS extensibility, transparency, and interoperability. Nonetheless, there is an increasing demand to enhance WS performance, which is severely impaired by XML's verbosity. SOAP communications produce considerable network traffic, making them unfit for distributed, loosely coupled, and heterogeneous computing environments such as the open Internet. Also, they introduce higher latency and processing delays than other technologies, like Java RMI and CORBA. WS research has recently focused on SOAP performance enhancement. Many approaches build on the observation that SOAP message exchange usually involves highly similar messages (those created by the same implementation usually have the same structure, and those sent from a server to multiple clients tend to show similarities in structure and content). Similarity evaluation and differential encoding have thus emerged as SOAP performance enhancement techniques. The main idea is to identify the common parts of SOAP messages, to be processed only once, avoiding a large amount of overhead. Other approaches investigate nontraditional processor architectures, including micro-and macrolevel parallel processing solutions, so as to further increase the processing rates of SOAP/XML software toolkits. This survey paper provides a concise, yet comprehensive review of the research efforts aimed at SOAP performance enhancement. A unified view of the problem is provided, covering almost every phase of SOAP processing, ranging over message parsing, serialization, deserialization, compression, multicasting, security evaluation, and data/instruction-level processing.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Various factors are believed to govern the selection of references in citation networks, but a precise, quantitative determination of their importance has remained elusive. In this paper, we show that three factors can account for the referencing pattern of citation networks for two topics, namely "graphenes" and "complex networks", thus allowing one to reproduce the topological features of the networks built with papers being the nodes and the edges established by citations. The most relevant factor was content similarity, while the other two - in-degree (i.e. citation counts) and age of publication - had varying importance depending on the topic studied. This dependence indicates that additional factors could play a role. Indeed, by intuition one should expect the reputation (or visibility) of authors and/or institutions to affect the referencing pattern, and this is only indirectly considered via the in-degree that should correlate with such reputation. Because information on reputation is not readily available, we simulated its effect on artificial citation networks considering two communities with distinct fitness (visibility) parameters. One community was assumed to have twice the fitness value of the other, which amounts to a double probability for a paper being cited. While the h-index for authors in the community with larger fitness evolved with time with slightly higher values than for the control network (no fitness considered), a drastic effect was noted for the community with smaller fitness. (C) 2012 Elsevier Ltd. All rights reserved.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Abstract 1: Social Networks such as Twitter are often used for disseminating and collecting information during natural disasters. The potential for its use in Disaster Management has been acknowledged. However, more nuanced understanding of the communications that take place on social networks are required to more effectively integrate this information into the processes within disaster management. The type and value of information shared should be assessed, determining the benefits and issues, with credibility and reliability as known concerns. Mapping the tweets in relation to the modelled stages of a disaster can be a useful evaluation for determining the benefits/drawbacks of using data from social networks, such as Twitter, in disaster management.A thematic analysis of tweets’ content, language and tone during the UK Storms and Floods 2013/14 was conducted. Manual scripting was used to determine the official sequence of events, and classify the stages of the disaster into the phases of the Disaster Management Lifecycle, to produce a timeline. Twenty- five topics discussed on Twitter emerged, and three key types of tweets, based on the language and tone, were identified. The timeline represents the events of the disaster, according to the Met Office reports, classed into B. Faulkner’s Disaster Management Lifecycle framework. Context is provided when observing the analysed tweets against the timeline. This illustrates a potential basis and benefit for mapping tweets into the Disaster Management Lifecycle phases. Comparing the number of tweets submitted in each month with the timeline, suggests users tweet more as an event heightens and persists. Furthermore, users generally express greater emotion and urgency in their tweets.This paper concludes that the thematic analysis of content on social networks, such as Twitter, can be useful in gaining additional perspectives for disaster management. It demonstrates that mapping tweets into the phases of a Disaster Management Lifecycle model can have benefits in the recovery phase, not just in the response phase, to potentially improve future policies and activities. Abstract2: The current execution of privacy policies, as a mode of communicating information to users, is unsatisfactory. Social networking sites (SNS) exemplify this issue, attracting growing concerns regarding their use of personal data and its effect on user privacy. This demonstrates the need for more informative policies. However, SNS lack the incentives required to improve policies, which is exacerbated by the difficulties of creating a policy that is both concise and compliant. Standardization addresses many of these issues, providing benefits for users and SNS, although it is only possible if policies share attributes which can be standardized. This investigation used thematic analysis and cross- document structure theory, to assess the similarity of attributes between the privacy policies (as available in August 2014), of the six most frequently visited SNS globally. Using the Jaccard similarity coefficient, two types of attribute were measured; the clauses used by SNS and the coverage of forty recommendations made by the UK Information Commissioner’s Office. Analysis showed that whilst similarity in the clauses used was low, similarity in the recommendations covered was high, indicating that SNS use different clauses, but to convey similar information. The analysis also showed that low similarity in the clauses was largely due to differences in semantics, elaboration and functionality between SNS. Therefore, this paper proposes that the policies of SNS already share attributes, indicating the feasibility of standardization and five recommendations are made to begin facilitating this, based on the findings of the investigation.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Genome predictions based on selected genes would be a very welcome approach for taxonomic studies, including DNA-DNA similarity, G+C content and representative phylogeny of bacteria. At present, DNA-DNA hybridizations are still considered the gold standard in species descriptions. However, this method is time-consuming and troublesome, and datasets can vary significantly between experiments as well as between laboratories. For the same reasons, full matrix hybridizations are rarely performed, weakening the significance of the results obtained. The authors established a universal sequencing approach for the three genes recN, rpoA and thdF for the Pasteurellaceae, and determined if the sequences could be used for predicting DNA-DNA relatedness within the family. The sequence-based similarity values calculated using a previously published formula proved most useful for species and genus separation, indicating that this method provides better resolution and no experimental variation compared to hybridization. By this method, cross-comparisons within the family over species and genus borders easily become possible. The three genes also serve as an indicator of the genome G+C content of a species. A mean divergence of around 1 % was observed from the classical method, which in itself has poor reproducibility. Finally, the three genes can be used alone or in combination with already-established 16S rRNA, rpoB and infB gene-sequencing strategies in a multisequence-based phylogeny for the family Pasteurellaceae. It is proposed to use the three sequences as a taxonomic tool, replacing DNA-DNA hybridization.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Since multimedia data, such as images and videos, are way more expressive and informative than ordinary text-based data, people find it more attractive to communicate and express with them. Additionally, with the rising popularity of social networking tools such as Facebook and Twitter, multimedia information retrieval can no longer be considered a solitary task. Rather, people constantly collaborate with one another while searching and retrieving information. But the very cause of the popularity of multimedia data, the huge and different types of information a single data object can carry, makes their management a challenging task. Multimedia data is commonly represented as multidimensional feature vectors and carry high-level semantic information. These two characteristics make them very different from traditional alpha-numeric data. Thus, to try to manage them with frameworks and rationales designed for primitive alpha-numeric data, will be inefficient. An index structure is the backbone of any database management system. It has been seen that index structures present in existing relational database management frameworks cannot handle multimedia data effectively. Thus, in this dissertation, a generalized multidimensional index structure is proposed which accommodates the atypical multidimensional representation and the semantic information carried by different multimedia data seamlessly from within one single framework. Additionally, the dissertation investigates the evolving relationships among multimedia data in a collaborative environment and how such information can help to customize the design of the proposed index structure, when it is used to manage multimedia data in a shared environment. Extensive experiments were conducted to present the usability and better performance of the proposed framework over current state-of-art approaches.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Bioacoustic data can provide an important base for environmental monitoring. To explore a large amount of field recordings collected, an automated similarity search algorithm is presented in this paper. A region of an audio defined by frequency and time bounds is provided by a user; the content of the region is used to construct a query. In the retrieving process, our algorithm will automatically scan through recordings to search for similar regions. In detail, we present a feature extraction approach based on the visual content of vocalisations – in this case ridges, and develop a generic regional representation of vocalisations for indexing. Our feature extraction method works best for bird vocalisations showing ridge characteristics. The regional representation method allows the content of an arbitrary region of a continuous recording to be described in a compressed format.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background As the increasing adoption of information technology continues to offer better distant medical services, the distribution of, and remote access to digital medical images over public networks continues to grow significantly. Such use of medical images raises serious concerns for their continuous security protection, which digital watermarking has shown great potential to address. Methods We present a content-independent embedding scheme for medical image watermarking. We observe that the perceptual content of medical images varies widely with their modalities. Recent medical image watermarking schemes are image-content dependent and thus they may suffer from inconsistent embedding capacity and visual artefacts. To attain the image content-independent embedding property, we generalise RONI (region of non-interest, to the medical professionals) selection process and use it for embedding by utilising RONI’s least significant bit-planes. The proposed scheme thus avoids the need for RONI segmentation that incurs capacity and computational overheads. Results Our experimental results demonstrate that the proposed embedding scheme performs consistently over a dataset of 370 medical images including their 7 different modalities. Experimental results also verify how the state-of-the-art reversible schemes can have an inconsistent performance for different modalities of medical images. Our scheme has MSSIM (Mean Structural SIMilarity) larger than 0.999 with a deterministically adaptable embedding capacity. Conclusions Our proposed image-content independent embedding scheme is modality-wise consistent, and maintains a good image quality of RONI while keeping all other pixels in the image untouched. Thus, with an appropriate watermarking framework (i.e., with the considerations of watermark generation, embedding and detection functions), our proposed scheme can be viable for the multi-modality medical image applications and distant medical services such as teleradiology and eHealth.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The structure of time dependent jets in rotating fluids using similarity transformations is studied theoretically for which exact solutions are discussed. Approximate solution using a modified yon Mises transformation is also explored.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A thorough investigation of salt concentration dependence of lithium DNA fibres is made using X-ray diffraction. While for low salt the C-form pattern is obtained, crystalline B-type diffraction patterns result on increasing the salt concentration. The salt content in the gel (from which fibres are drawn) is estimated by equilibrium dialysis using the Donnan equilibrium principle. The salt range giving the best crystalline B pattern is determined. It is found that in this range meridional reflections occur on the fourth and sixth layer lines. In addition, the tenth layer meridian is absent at a particular salt concentration. These results strongly suggest the presence of non-helical features in the DNA molecule. Preliminary analysis of the diffraction patterns indicates a structural variability within the B-form itself. Further, the possibility of the structural parameters of DNA being similar in solid state and in solution is discussed.