Hierarchical Multi-Label Text Classification in a Low-Resource Setting
| Field | Value |
|---|---|
| Contributor(s) | Torroni, Paolo; Savino, Giuseppe |
| Date(s) | 06/12/2022 |
| Abstract | In this thesis we address a multi-label hierarchical text classification problem in a low-resource setting and explore different approaches to identify the best one for our case. The goal is to train a model that classifies English school exercises according to a hierarchical taxonomy using only a small amount of labeled data. The experiments in this work employ different machine learning models and text representation techniques: CatBoost with tf-idf features, classifiers based on pre-trained models (mBERT, LASER), and SetFit, a framework for few-shot text classification. SetFit proved to be the most promising approach, achieving the best performance when only a few labeled examples per class are available for training. However, this thesis does not consider the full hierarchical taxonomy, but only its first two levels; to address classification with the classes at the third level, further experiments should be carried out, exploring methods for zero-shot text classification, data augmentation, and strategies that exploit the hierarchical structure of the taxonomy during training. |
| Format | application/pdf |
| Identifier | http://amslaurea.unibo.it/27453/1/thesis_andrea_lavista.pdf Lavista, Andrea (2022) Hierarchical Multi-Label Text Classification in a Low-Resource Setting. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270] <http://amslaurea.unibo.it/view/cds/CDS9063/> |
| Language(s) | en |
| Publisher | Alma Mater Studiorum - Università di Bologna |
| Relation | http://amslaurea.unibo.it/27453/ |
| Rights | CC BY-NC-SA 4.0 |
| Keywords | natural language processing, text classification, multi-label classification, hierarchical classification, multi-label text classification, hierarchical text classification, few-shot learning, few-shot text classification, low-resource setting, pre-trained models, contextual embedding, sentence embedding, task-adaptive pre-training, domain adaptation, multilingual, BERT, SetFit, LASER, SHAP; Artificial intelligence [LM-DM270] |
| Type | PeerReviewed; info:eu-repo/semantics/masterThesis |
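
As a rough illustration of the SetFit approach highlighted in the abstract, the sketch below shows how few-shot multi-label fine-tuning might be set up with the setfit library. It is a minimal sketch under assumptions: the model checkpoint, the two example taxonomy classes, the exercise texts, and all hyperparameters are placeholders for illustration, not the configuration actually used in the thesis.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Hypothetical few-shot training set: each exercise text gets a multi-hot
# label vector over two illustrative top-level classes ("math", "language").
train_ds = Dataset.from_dict({
    "text": [
        "Solve for x: 2x + 3 = 11.",
        "Write a short paragraph describing your favourite season.",
        "Compute the area of a rectangle with sides 4 cm and 7 cm.",
        "Underline the verbs in the following sentences.",
    ],
    "label": [
        [1, 0],
        [0, 1],
        [1, 0],
        [0, 1],
    ],
})

# A sentence-transformer backbone; multi_target_strategy wraps the head with
# a one-vs-rest classifier so each label is predicted independently.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    multi_target_strategy="one-vs-rest",
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    num_iterations=20,  # contrastive text pairs generated per example
    num_epochs=1,
    batch_size=16,
)
trainer.train()

# Predict multi-hot label vectors for new exercises.
preds = model.predict(["Find the perimeter of a square with side 5 cm."])
print(preds)
```

With the "one-vs-rest" strategy, each class at a given taxonomy level becomes an independent binary label, which is one simple way to handle the multi-label aspect; how the thesis combines this with the two taxonomy levels it covers is not specified in the abstract.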