Two-part Segmentation of Text Documents


Author(s): Padmanabhan, Deepak; Visweswariah, Karthik; Wiratunga, Nirmalie; Sani, Sadiq
Date(s)

2012

Abstract

We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe events relating to a problem followed by events pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part, which consists of words sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in segmentation accuracy, and that this improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant and degrades gracefully with increasing amounts of noise.
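
The generative story in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the unigram language models, the fixed mixture weight `lam`, the `FLOOR` smoothing constant, the sentence-level split points, and the names `split_log_likelihood` and `best_split` are all assumptions made for illustration. The translation table is used in the IBM Model 1 style, averaging the translation probability of a solution word over all problem words.

```python
import math

FLOOR = 1e-6  # assumed floor probability for unseen words (hypothetical smoothing)


def split_log_likelihood(sentences, k, problem_lm, solution_lm, translation, lam=0.5):
    """Log-likelihood of a document if its first k sentences form the problem part.

    problem_lm / solution_lm: dicts mapping word -> unigram probability.
    translation: dict mapping (problem_word, solution_word) -> probability.
    lam: assumed weight of the solution language model in the mixture.
    """
    problem_words = [w for s in sentences[:k] for w in s]
    solution_words = [w for s in sentences[k:] for w in s]

    score = 0.0
    # Problem words are sampled from the problem language model.
    for w in problem_words:
        score += math.log(problem_lm.get(w, FLOOR))

    # Solution words come either from the solution language model or from
    # the translation model conditioned on the problem words.
    n = max(len(problem_words), 1)
    for w in solution_words:
        trans = sum(translation.get((p, w), 0.0) for p in problem_words) / n
        mix = lam * solution_lm.get(w, FLOOR) + (1.0 - lam) * trans
        score += math.log(max(mix, FLOOR))
    return score


def best_split(sentences, problem_lm, solution_lm, translation, lam=0.5):
    """Return the number of leading sentences that best fits as the problem part."""
    candidates = range(1, len(sentences))  # keep both parts non-empty
    return max(
        candidates,
        key=lambda k: split_log_likelihood(
            sentences, k, problem_lm, solution_lm, translation, lam
        ),
    )


# Toy usage with made-up probabilities (not taken from the paper).
problem_lm = {"printer": 0.4, "error": 0.4, "jam": 0.2}
solution_lm = {"restart": 0.5, "replaced": 0.3, "tray": 0.2}
translation = {("jam", "tray"): 0.8, ("error", "restart"): 0.6}
doc = [["printer", "error"], ["printer", "jam"], ["replaced", "tray"], ["restart"]]
print(best_split(doc, problem_lm, solution_lm, translation))
```

On this toy document the sketch places the boundary after the second sentence, because problem-vocabulary words are heavily penalized when assigned to the solution part and vice versa. How the language models, the translation table, and the mixture weight are estimated from a corpus is left out here.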

Identifier

http://pure.qub.ac.uk/portal/en/publications/twopart-segmentation-of-text-documents(b38ed1ab-afd6-45ba-ab90-1d7a12099807).html

Language(s)

eng

Rights

info:eu-repo/semantics/restrictedAccess

Source

Padmanabhan, D, Visweswariah, K, Wiratunga, N & Sani, S 2012, Two-part Segmentation of Text Documents. In 21st ACM International Conference on Information and Knowledge Management, CIKM'12, Maui, HI, USA, October 29 - November 02, 2012. pp. 793-802, CIKM 2012, Maui, United States, 29 October - 2 November.

Type

contributionToPeriodical