Discourse Type Clustering using POS n-gram Profiles and High-Dimensional Embeddings


Author(s): Cocco C.
Date(s)

01/04/2012

Abstract

To cluster textual sequence types (discourse types/modes) in French texts, the K-means algorithm with high-dimensional embeddings and a fuzzy clustering algorithm were applied to clauses whose POS (part-of-speech) n-gram profiles had previously been extracted. Uni-, bi- and trigrams were used on four 19th-century French short stories by Maupassant. For the high-dimensional embeddings, power transformations of the chi-squared distances between clauses were explored. Preliminary results show that high-dimensional embeddings improve the quality of clustering, in contrast to bi- and trigrams, whose performance is disappointing, possibly because of feature space sparsity.
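As a rough illustration of the pipeline the abstract describes, the sketch below builds POS n-gram profiles per clause and computes pairwise chi-squared distances raised to a power q. It is a minimal sketch under stated assumptions, not the author's implementation: the function names, the toy tag sequences, and the specific form of the power transformation (distance raised to an exponent `q`) are hypothetical, and the paper's exact weighting may differ.

```python
from collections import Counter


def pos_ngram_profile(tags, n):
    """Count the POS n-grams occurring in one clause (a list of POS tags)."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def chi2_distances(profiles, q=1.0):
    """Pairwise chi-squared distances between clause profiles, raised to power q.

    The exponent q is an assumed stand-in for the power transformation
    mentioned in the abstract; q = 1 leaves the distances unchanged.
    """
    row_tot = [sum(p.values()) for p in profiles]
    if any(t == 0 for t in row_tot):
        raise ValueError("every clause needs at least one n-gram")
    grand = sum(row_tot)
    features = sorted(set().union(*profiles))
    # Relative weight of each n-gram over the whole corpus (column profile).
    col_w = {f: sum(p[f] for p in profiles) / grand for f in features}
    m = len(profiles)
    dist = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d = sum(
                (profiles[i][f] / row_tot[i] - profiles[j][f] / row_tot[j]) ** 2
                / col_w[f]
                for f in features
            )
            dist[i][j] = dist[j][i] = d ** q
    return dist


# Toy clauses with made-up tags, for illustration only.
clauses = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB"],
    ["PRON", "VERB", "ADV"],
]
unigram_profiles = [pos_ngram_profile(c, 1) for c in clauses]
distances = chi2_distances(unigram_profiles, q=0.5)
```

The resulting distance matrix could then be passed to a clustering routine that accepts dissimilarities. With bi- or trigrams the profiles have many more distinct features and far fewer counts per feature, which is the sparsity the abstract suggests as a possible cause of their weaker performance.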

Identifier(s)

http://serval.unil.ch/?id=serval:BIB_5A2CBDB06CA2

isbn:978-1-937284-19-0

http://my.unil.ch/serval/document/BIB_5A2CBDB06CA2.pdf

http://nbn-resolving.org/urn/resolver.pl?urn=urn:nbn:ch:serval-BIB_5A2CBDB06CA20

http://aclweb.org/anthology-new/E/E12/E12-3.pdf

Language(s)

en

Publisher

Stroudsburg: Association for Computational Linguistics

Stroudsburg: Université d'Avignon

Rights

info:eu-repo/semantics/openAccess

Source

Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics

Keywords #Discourse types; K-means; high-dimensional embeddings; fuzzy clustering
Type

info:eu-repo/semantics/conferenceObject

inproceedings