Hierarchical Multi-Label Text Classification in a Low-Resource Setting


Author(s): Lavista, Andrea
Contributor(s)

Torroni, Paolo

Savino, Giuseppe

Date(s)

06/12/2022

Abstract

In this thesis we address a hierarchical multi-label text classification problem in a low-resource setting and compare several approaches to identify the one best suited to our case. The goal is to train a model that classifies English school exercises according to a hierarchical taxonomy using only a small amount of labeled data. The experiments employ different machine learning models and text representation techniques: CatBoost with tf-idf features, classifiers based on pre-trained models (mBERT, LASER), and SetFit, a framework for few-shot text classification. SetFit proved to be the most promising approach, achieving better performance than the other approaches when only a few labeled examples per class are available during training. However, this thesis does not cover the entire hierarchical taxonomy, only its first two levels. Classifying at the third level requires further experiments that explore zero-shot text classification methods, data augmentation, and strategies to exploit the hierarchical structure of the taxonomy during training.

Format

application/pdf

Identifier

http://amslaurea.unibo.it/27453/1/thesis_andrea_lavista.pdf

Lavista, Andrea (2022) Hierarchical Multi-Label Text Classification in a Low-Resource Setting. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270] <http://amslaurea.unibo.it/view/cds/CDS9063/>

Language(s)

en

Publisher

Alma Mater Studiorum - Università di Bologna

Relation

http://amslaurea.unibo.it/27453/

Rights

cc_by_nc_sa4

Keywords #natural language processing,text classification,multi-label classification,hierarchical classification,multi-label text classification,hierarchical text classification,few-shot learning,few-shot text classification,low-resource setting,pre-trained models,contextual embedding,sentence embedding,task-adaptive pre-training,domain adaptation,multilingual,BERT,SetFit,LASER,SHAP #Artificial intelligence [LM-DM270]

Type

PeerReviewed

info:eu-repo/semantics/masterThesis