2 results for word2vec

in AMS Tesi di Laurea - Alm@DL - Università di Bologna


Relevance:

10.00%

Abstract:

Sentiment analysis, which originated in computer science, is one of the most active research areas in natural language processing and has spread widely to other scientific fields such as the social sciences, economics, and marketing. Its rapid spread coincides with the growth of so-called social media: e-commerce and product review sites, discussion forums, blogs, micro-blogs, and various social networks. The goal of this thesis was to design a sentiment analysis system able to detect and classify the opinions and sentiments expressed in chat by users of the video streaming platform Twitch.tv. To set up and organize the work, and so arrive at the definition of the intended system, several analysis models were used, in particular recurrent neural networks (RNNLM) and word embedding systems (word2vec), specifically Paragraph Vectors. These were applied first to data labeled automatically through the use of emoticons and then to manually labeled data.
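The automatic labeling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which emoticons were used or how conflicting cases were handled, so the emoticon sets, the function names `label_by_emoticon` and `build_training_set`, and the rule of discarding ambiguous messages are all assumptions made for illustration, not the thesis's actual procedure.

```python
from typing import List, Optional, Tuple

# Hypothetical emoticon sets; the thesis's actual lists are not given in the abstract.
POSITIVE = {":)", ":-)", ":D", "<3"}
NEGATIVE = {":(", ":-(", ":'(", "D:"}


def label_by_emoticon(message: str) -> Optional[str]:
    """Return 'positive' or 'negative' based on emoticons, or None if unclear."""
    tokens = message.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all or both polarities present: skip the message.
        return None
    return "positive" if has_pos else "negative"


def build_training_set(messages: List[str]) -> List[Tuple[str, str]]:
    """Keep only messages with a clear label and strip the emoticons from the text."""
    emoticons = POSITIVE | NEGATIVE
    labeled = []
    for msg in messages:
        label = label_by_emoticon(msg)
        if label is not None:
            text = " ".join(t for t in msg.split() if t not in emoticons)
            labeled.append((text, label))
    return labeled
```

Removing the emoticons from the text is a design choice in this sketch: it prevents a downstream classifier from simply memorizing the symbols that produced the labels.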

Relevance:

10.00%

Abstract:

Internet traffic classification is a relevant and mature research field, yet one of growing importance and with still-open technical challenges, partly because of the pervasive presence of Internet-connected devices in everyday life. We claim the need for innovative traffic classification solutions that are lightweight, adopt a domain-based approach, and not only categorize application-level protocols but also classify Internet traffic by subject. To this end, this paper proposes an original classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, and that can be applied to any kind of traffic. Our proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, learning techniques are leveraged to generate word embeddings from a mixed dataset composed of domain names and natural language corpora, in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience, which demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $3860 per year. Reported experimental results for Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments show that the words contained in domain names are related to the kind of traffic directed towards them; therefore, using specifically trained word embeddings, we are able to classify domains into customizable categories. We also show that training word embeddings on larger natural language corpora leads to improvements in precision of up to 180%.
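The idea that words inside a domain name can drive classification can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional vectors in `EMBEDDINGS`, the seed words in `CATEGORY_SEEDS`, and the nearest-centroid rule are hypothetical stand-ins, not the paper's trained Word2vec model or its classifier. The tokenizer splits only on dots and non-letter characters, so concatenated words such as "dailynews" would need a more capable segmentation step.

```python
import math
import re
from typing import Dict, List, Optional

# Hypothetical toy embeddings; a real system would load vectors trained with Word2vec.
EMBEDDINGS: Dict[str, List[float]] = {
    "news": [0.9, 0.1],
    "daily": [0.8, 0.2],
    "shop": [0.1, 0.9],
    "store": [0.2, 0.8],
}

# Hypothetical categories, each described by a few seed words.
CATEGORY_SEEDS: Dict[str, List[str]] = {
    "news": ["news", "daily"],
    "shopping": ["shop", "store"],
}


def tokenize_domain(domain: str) -> List[str]:
    """Split a domain name into lowercase word tokens, dropping the last label (TLD)."""
    labels = domain.lower().split(".")[:-1]
    tokens: List[str] = []
    for label in labels:
        tokens.extend(t for t in re.split(r"[^a-z]+", label) if t)
    return tokens


def mean_vector(words: List[str]) -> Optional[List[float]]:
    """Average the embeddings of the known words; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_domain(domain: str) -> Optional[str]:
    """Assign the category whose seed centroid is most similar to the domain's vector."""
    vec = mean_vector(tokenize_domain(domain))
    if vec is None:
        return None
    scores = {
        category: cosine(vec, mean_vector(seeds))
        for category, seeds in CATEGORY_SEEDS.items()
    }
    return max(scores, key=scores.get)
```

Because the categories are defined by seed words rather than hard-coded rules, changing `CATEGORY_SEEDS` is enough to redefine them, which mirrors the "customizable categories" the abstract mentions.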