A novel XML document structure comparison framework based on sub-tree commonalities and label semantics


Author(s): Tekli, Joe; Chbeir, Richard
Contributor(s)

UNIVERSIDADE DE SÃO PAULO

Date(s)

05/11/2013

05/11/2013

2012

Abstract

XML similarity evaluation has become a central issue in the database and information communities, with applications ranging over document clustering, version control, data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being commonly modeled as Ordered Labeled Trees. Yet, a thorough investigation of current approaches led us to identify several similarity aspects, i.e., sub-tree-related structural and semantic similarities, which are not sufficiently addressed while comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework to deal with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees), and to allow the end-user to adjust the comparison process according to her requirements. Our framework consists of four main modules for (i) discovering the structural commonalities between sub-trees, (ii) identifying sub-tree semantic resemblances, (iii) computing tree-based edit operation costs, and (iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy with respect to alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
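To make the notion of edit distance between Ordered Labeled Trees more concrete, the following is a minimal sketch of the classic unit-cost tree edit distance, written as a memoized recursion over ordered forests. It is not the framework described in the abstract: every insert, delete and relabel costs 1 here, whereas the paper derives operation costs from sub-tree structural commonalities and label semantics. The names `tree`, `forest_distance` and `tree_edit_distance`, and the sample element labels, are illustrative assumptions.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees).

    Assumption: all operations cost 1. A cost model based on sub-tree
    similarity or label semantics would replace the constants below.
    """
    if not f and not g:
        return 0
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, children = f[-1]
        return forest_distance(f[:-1] + children, g) + 1
    if not f:
        # Insert the rightmost root of g.
        _, children = g[-1]
        return forest_distance(f, g[:-1] + children) + 1

    (label_f, children_f), (label_g, children_g) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + children_f, g) + 1
    insert = forest_distance(f, g[:-1] + children_g) + 1
    # Map the two rightmost roots onto each other (relabel if they differ),
    # then solve their child forests and the remaining forests independently.
    match = (
        forest_distance(children_f, children_g)
        + forest_distance(f[:-1], g[:-1])
        + (0 if label_f == label_g else 1)
    )
    return min(delete, insert, match)


def tree_edit_distance(a, b):
    """Edit distance between two single trees."""
    return forest_distance((a,), (b,))


# Hypothetical XML element structures, used only for illustration.
doc_a = tree("book", tree("title"), tree("author"))
doc_b = tree("book", tree("title"), tree("writer"))
doc_c = tree("book", tree("title"))

print(tree_edit_distance(doc_a, doc_b))  # one relabel: author -> writer
print(tree_edit_distance(doc_a, doc_c))  # one delete: author
```

Under unit costs, `author` and `writer` are simply different labels, so relabeling one to the other costs as much as relabeling it to any unrelated word. Lowering that cost for semantically close labels is the kind of adjustment the abstract attributes to the semantic modules of the framework.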

Research Support Foundation of the State of Sao Paulo, Brazil, FAPESP [2010/00330-2]

Identifier

JOURNAL OF WEB SEMANTICS, AMSTERDAM, v. 11, p. 14-40, MAR, 2012

1570-8268

http://www.producao.usp.br/handle/BDPI/40935

10.1016/j.websem.2011.10.002

http://dx.doi.org/10.1016/j.websem.2011.10.002

Language(s)

eng

Publisher

ELSEVIER SCIENCE BV

AMSTERDAM

Relation

JOURNAL OF WEB SEMANTICS

Rights

restrictedAccess

Copyright ELSEVIER SCIENCE BV

Keywords #XML (SEMI-STRUCTURED) DATA #STRUCTURAL SIMILARITY #TREE EDIT DISTANCE #SEMANTIC SIMILARITY #INFORMATION RETRIEVAL #VECTOR SPACE MODEL #SIMILARITY #ALGORITHM #DISTANCE #RETRIEVAL #BOUNDS #COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE #COMPUTER SCIENCE, INFORMATION SYSTEMS #COMPUTER SCIENCE, SOFTWARE ENGINEERING

Type

article

original article

publishedVersion