856 results for twitter, conversation retrieval
Abstract:
Things change. Words change, meaning changes, and use changes both words and meaning. In information access systems this means that concept schemes such as thesauri or classification schemes change. They always have. Concept schemes that have survived have evolved over time, moving from one version, often called an edition, to the next. If we want to manage how words and meanings - and, as a consequence, use - change in an effective manner, and if we want to be able to search across versions of concept schemes, we have to track these changes. This paper explores how we might expand SKOS, a World Wide Web Consortium (W3C) draft recommendation, in order to do that kind of tracking. The Simple Knowledge Organization System (SKOS) Core Guide is sponsored by the Semantic Web Best Practices and Deployment Working Group. The second draft, edited by Alistair Miles and Dan Brickley, was issued in November 2005. SKOS is a “model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, other types of controlled vocabulary and also concept schemes embedded in glossaries and terminologies” in RDF. How SKOS handles versioning in concept schemes is an open issue. The current draft guide suggests using OWL and DCTERMS as mechanisms for concept scheme revision. As it stands, an editor of a concept scheme can make notes or declare in OWL that more than one version exists. This paper adds to the SKOS Core by introducing a tracking system for changes in concept schemes. We call this tracking system vocabulary ontogeny. Ontogeny is a biological term for the development of an organism during its lifetime. Here we use the ontogeny metaphor to describe how vocabularies change over their lifetime. Our purpose is to create a conceptual mechanism that will track these changes and, in so doing, enhance information retrieval and prevent document loss through versioning, thereby enabling persistent retrieval.
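As a rough illustration of how such ontogeny statements might be recorded alongside SKOS data, the following Python sketch uses rdflib; the onto: namespace and its replacedBy/changeDate properties are hypothetical stand-ins for the paper's proposed mechanism, not part of SKOS Core.

```python
# Minimal, hypothetical sketch of recording a concept-scheme change in RDF
# with rdflib. The "onto:" namespace and its replacedBy/changeDate properties
# are illustrative inventions, not part of the SKOS Core vocabulary.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS, DCTERMS, XSD

ONTO = Namespace("http://example.org/ontogeny#")
EX = Namespace("http://example.org/scheme/")

g = Graph()
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)
g.bind("onto", ONTO)

old = EX["concept/typewriting"]   # concept as used in an earlier edition
new = EX["concept/keyboarding"]   # concept that superseded it in a later edition

g.add((old, SKOS.prefLabel, Literal("Typewriting", lang="en")))
g.add((new, SKOS.prefLabel, Literal("Keyboarding", lang="en")))

# Record the ontogeny: which concept superseded which, and when.
g.add((old, ONTO.replacedBy, new))
g.add((old, ONTO.changeDate, Literal("2005-11-01", datatype=XSD.date)))
g.add((new, DCTERMS.replaces, old))

print(g.serialize(format="turtle"))
```

Searching across editions then becomes a matter of following replacedBy/replaces links from a query concept to its earlier or later counterparts.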
Abstract:
The MARS (Media Asset Retrieval System) Project is the collaborative effort of public broadcasters, libraries and schools in the Puget Sound region to create a digital online resource that provides access to content produced by public broadcasters via the public libraries.
Convergence Consortium: The Convergence Consortium is a model for community collaboration, bringing together organizations such as public broadcasters, libraries, museums, and schools in the Puget Sound region to assess the needs of their constituents and pool resources to develop solutions to meet those needs. Specifically, the archives of public broadcasters have been identified as significant resources for local communities and nationally. These resources can be accessed on the broadcasters' websites and through libraries, used by schools, and integrated with text and photographic archives from other partners.
MARS' goal: Create an online resource that provides effective access to the content produced locally by KCTS (Seattle PBS affiliate) and KUOW (Seattle NPR affiliate). The broadcasts will be made searchable using the CPB Metadata Element Set (under development) and controlled vocabularies (to be developed). This will ensure a user-friendly search and navigation mechanism and user satisfaction. Furthermore, the resource can search the local public library's catalog concurrently and provide the user with relevant TV material, radio material, and books on a given subject. The ultimate goal is to produce a model that can be used in cities around the country. The current phase of the project assesses the community's needs, analyzes the current operational systems, and makes recommendations for the design of the resource.
Deliverables:
• Literature review of the issues surrounding the organization, description and representation of media assets
• Needs assessment report of internal and external stakeholders
• Profile of the systems in the area of managing and organizing media assets for public broadcasting nationwide
Activities:
• Analysis of information seeking behavior
• Analysis of collaboration within the respective organizations
• Analysis of the scope and context of the proposed system
• Examining the availability of information resources and exchange of resources among users
Abstract:
The MARS (Media Asset Retrieval System) Project is a collaboration between public broadcasters, libraries and schools in the Puget Sound region to assess the needs of their constituents and pool resources to develop solutions to meet those needs. The Project's ultimate goal is to create a digital online resource that will provide access to content produced by public broadcasters and libraries. The MARS Project is funded by a grant from the Corporation for Public Broadcasting (CPB) Television Future Fund.
Convergence Consortium: The Convergence Consortium is a model for community collaboration, including representatives from public broadcasting, libraries and schools in the Puget Sound region. They meet regularly to consider collaborative efforts that will be mutually beneficial to their institutions and constituents. Specifically, the archives of public broadcasters have been identified as significant resources that can be accessed through libraries, used by schools, and integrated with text and photographic archives from other partners.
Using the work-centered framework, we collected data through interviews with nine engineers and observation of their searching while they performed their regular, job-related searches on the Web. The framework was used to analyze the data on two levels: 1) the activities, organizational relationships and constraints of work domains, and 2) users' cognitive and social activities and their subjective preferences during searching.
Abstract:
Conventional web search engines are centralised in that a single entity crawls and indexes the documents selected for future retrieval and controls the relevance models used to determine which documents are relevant to a given user query. As a result, these search engines suffer from several technical drawbacks, such as handling scale, timeliness and reliability, in addition to ethical concerns such as commercial manipulation and information censorship. Alleviating the need to rely entirely on a single entity, Peer-to-Peer (P2P) Information Retrieval (IR) has been proposed as a solution, as it distributes the functional components of a web search engine - from crawling and indexing documents to query processing - across the network of users (or peers) who use the search engine. This strategy for constructing an IR system poses several efficiency and effectiveness challenges which have been identified in past work. Accordingly, this thesis makes several contributions towards advancing the state of the art in P2P-IR effectiveness by improving the query processing and relevance scoring aspects of P2P web search. Federated search systems are a form of distributed information retrieval model that route the user's information need, formulated as a query, to distributed resources and merge the retrieved result lists into a final list. P2P-IR networks are one form of federated search, routing queries and merging results among participating peers. The query is propagated through disseminated nodes to reach the peers that are most likely to contain relevant documents, and the retrieved result lists are then merged at different points along the path from the relevant peers back to the query initiator (or customer). However, query routing is considered one of the major challenges and a critical part of P2P-IR networks, as relevant peers might be lost through low-quality peer selection while executing the query routing, inevitably leading to less effective retrieval results. This motivates this thesis to study and propose query routing techniques to improve retrieval quality in such networks. Cluster-based semi-structured P2P-IR networks exploit the cluster hypothesis to organise the peers into semantically similar clusters, each of which is managed by super-peers. In this thesis, I construct three semi-structured P2P-IR models and examine their retrieval effectiveness. I also leverage the cluster centroids at the super-peer level, as content representations gathered from cooperative peers, to propose a query routing approach called the Inverted PeerCluster Index (IPI), which mimics the conventional inverted index of a centralised corpus to organise the statistics of peers' terms. The results show competitive retrieval quality in comparison to baseline approaches. Furthermore, I study the applicability of using conventional Information Retrieval models as peer selection approaches, where each peer can be considered as one big document composed of its documents. The experimental evaluation shows competitive and significant results, indicating that document retrieval methods are very effective for peer selection, which reinforces the analogy between documents and peers. Additionally, Learning to Rank (LtR) algorithms are exploited to build a learned classifier for peer ranking at the super-peer level. The experiments show significant results compared with state-of-the-art resource selection methods and results competitive with the corresponding classification-based approaches.
Finally, I propose reputation-based query routing approaches that exploit the idea, familiar from social community networks, of providing feedback on a specific item and using it for future decision-making. The system monitors users' behaviour when they click on or download documents from the final ranked list as implicit feedback and mines this information to build a reputation-based data structure. The data structure is used to score peers and then rank them for query routing. I conduct a set of experiments covering various scenarios, including noisy feedback information (i.e., positive feedback given on non-relevant documents), to examine the robustness of the reputation-based approaches. The empirical evaluation shows significant results on almost all measurement metrics, with an approximate improvement of more than 56% over the baseline approaches. Thus, based on the results, if one were to choose a single technique, the reputation-based approaches are clearly the natural choice, and they can also be deployed on any P2P network.
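A toy sketch of the reputation idea, assuming a simple smoothed click-through score per peer; the scoring formula and field names are illustrative, not the thesis's actual design:

```python
# Toy sketch: aggregate implicit feedback (clicks/downloads) into per-peer
# reputation scores and rank peers for query routing. The smoothed
# click-through formula is an illustrative assumption.
from collections import defaultdict

class ReputationIndex:
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.positive = defaultdict(int)     # clicks/downloads per peer
        self.impressions = defaultdict(int)  # results shown per peer

    def record(self, peer_id, shown, clicked):
        self.impressions[peer_id] += shown
        self.positive[peer_id] += clicked

    def score(self, peer_id):
        # Smoothed click-through rate as a simple reputation estimate.
        return (self.positive[peer_id] + self.smoothing) / (
            self.impressions[peer_id] + 2 * self.smoothing)

    def route(self, candidate_peers, top_k=3):
        # Rank candidate peers by reputation and keep the top_k for routing.
        return sorted(candidate_peers, key=self.score, reverse=True)[:top_k]

rep = ReputationIndex()
rep.record("peer-A", shown=20, clicked=12)
rep.record("peer-B", shown=20, clicked=3)
rep.record("peer-C", shown=5, clicked=4)
print(rep.route(["peer-A", "peer-B", "peer-C"], top_k=2))
```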
Abstract:
Over the last decade, social media has become a hot topic for researchers of collaborative technologies (e.g., CSCW). The pervasive use of social media in our everyday lives provides a ready source of naturalistic data for researchers to empirically examine the complexities of the social world. In this talk I outline a different perspective informed by ethnomethodology and conversation analysis (EMCA) - an orientation that has been influential within CSCW, yet has only rarely been applied to social media use. EMCA approaches can complement existing perspectives through articulating how social media is embedded in everyday life, and how its social organisation is achieved by users of social media. Outlining a possible programme of research, I draw on a corpus of screen and ambient audio recordings of mobile device use to show how EMCA research can be generative for understanding social media through concepts such as adjacency pairs, sequential context, turn allocation / speaker selection, and repair. In doing so, I also raise questions about existing studies of social media use and the way they characterise interactional phenomena.
Abstract:
The experience described in this paper involves reflecting on, reasoning about, planning and implementing “Conversation Circles” as a teaching strategy in the PF-4237 course “Theory of Education: Multiculturalism and Education” of the Latin American Doctoral Program in Education, University of Costa Rica. This training experience, based on the theory of communicative action, was intended to integrate the teacher's support, the encounter with otherness, and the building of knowledge, skills and social attitudes in higher education.
Abstract:
This dissertation explores the link between hate crimes that occurred in the United Kingdom in June 2017, June 2018 and June 2019 through the posts of a robust sample of Conservative and radical right users on Twitter. In order to avoid the traditional challenges of this kind of research, I adopted a four-stage research protocol that enabled me to merge content produced by a group of randomly selected users and to observe the phenomenon from different angles. I collected tweets from thirty Conservative/right-wing accounts for each month of June over the three years with the help of programming languages such as Python and CygWin tools. I then examined the language of my data, focussing on humorous content, in order to reveal whether, and if so how, radical users online use humour as a tool to spread their views in conditions of heightened disgust and widespread political instability. A reflection on humour as a moral occurrence, expanding on the work of Christie Davies as well as applying recent findings on the behavioural immune system to online data, offers new insights into the overlooked humorous nature of radical political discourse. An unorthodox take on the moral foundations pioneered by Jonathan Haidt enriched my understanding of the analysed material by adding a moral-based layer of enquiry to my more traditional content-based one. This convergence of theory, data-driven analysis and real-life events constitutes a viable “collection of strategies” for academia, data scientists, NGOs fighting hate crimes, and the wider public alike. Bringing the ideas of Davies, Haidt and others to bear on my data helps us to perceive humorous online content in terms of complex radical narratives that are all too often compressed into a single tweet.
Abstract:
This thesis presents the study and experimentation of a retrieval-augmented generative model, based on Transformers, for the task of Abstractive Summarization of long legal judgments. Automatic Text Summarization has become a very important Natural Language Processing (NLP) task today, given the enormous amount of data coming from the web and from databases. It also makes it possible to automate a process that is very costly for experts, especially in the legal domain, where documents are long and complex and therefore difficult and expensive to summarise. State-of-the-art models for Automatic Text Summarization are based on Deep Learning solutions, in particular on Transformers, which represent the most established architecture for NLP tasks. The model proposed in this thesis is a solution for Long Document Summarization, i.e. for generating summaries of long textual sequences. In particular, the architecture is based on the RAG (Retrieval-Augmented Generation) model, recently introduced by the Facebook AI research team for the Question Answering task. The objective is to modify the RAG architecture in order to make it suitable for the Abstractive Long Document Summarization task. In detail, the aim is to exploit and test the model's non-parametric memory in order to enrich the representation of the input text to be summarised. To this end, several configurations of the model were tested in different types of experiments, and the generated summaries were evaluated with several automatic metrics.
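A simplified Python sketch of the retrieve-then-summarise idea described above, using publicly available checkpoints (a sentence-transformers encoder and BART) as stand-ins; it approximates the use of a non-parametric memory to enrich the input, not the thesis's modified RAG architecture:

```python
# Simplified sketch of retrieval-augmented summarization: retrieve the passages
# most similar to the document and feed them, together with the document, to an
# abstractive summariser. Model names are common public checkpoints chosen for
# illustration; texts are invented examples.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

encoder = SentenceTransformer("all-MiniLM-L6-v2")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Non-parametric memory: a small corpus of support passages (e.g. prior rulings).
memory = [
    "The court held that the contract was void for lack of consideration.",
    "Damages were limited because the claimant failed to mitigate losses.",
    "The appeal was dismissed on procedural grounds.",
]
document = "The defendant argued that no valid contract existed because ..."

# Retrieve the passages most relevant to the document to be summarised.
doc_emb = encoder.encode(document, convert_to_tensor=True)
mem_emb = encoder.encode(memory, convert_to_tensor=True)
scores = util.cos_sim(doc_emb, mem_emb)[0]
top = [memory[int(i)] for i in scores.argsort(descending=True)[:2]]

# Enrich the input with the retrieved passages before summarising.
augmented_input = " ".join(top + [document])
print(summarizer(augmented_input, max_length=60, min_length=15)[0]["summary_text"])
```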
Abstract:
After the first cases of Covid-19 emerged in China in autumn 2019, at the beginning of 2020 the entire planet was plunged into a global pandemic that upended our lives, with consequences not seen since the Spanish flu. The huge number of scientific papers continually being published on the coronavirus and related viruses led to the creation of a single, dynamic, freely distributed dataset called CORD19. The need to find useful information in this mass of data has further turned the spotlight on information retrieval systems, which can quickly and effectively retrieve valuable information in response to a user request called a query. Of particular relevance was the TREC-COVID Challenge, a competition for developing an IR system trained and tested on the CORD19 dataset. The main problem is that this large collection of documents is completely unlabeled, making it impossible to train neural network models directly on it. To work around the problem, we devised new self-supervised solutions, to which we applied the state of the art in deep metric learning and NLP. Deep metric learning, which is proving enormously successful above all in computer vision, trains a model to "pull together" similar images and "push apart" different ones. Since both images and text are represented as vectors of real numbers (embeddings), the same techniques can be used to pull together relevant textual elements (e.g. a query and a paragraph) and push apart non-relevant ones. We therefore trained a SciBERT model with several losses that currently represent the state of the art in deep metric learning, in a completely self-supervised manner, directly and exclusively on the CORD19 dataset, and then evaluated it on the formal TREC-COVID test set through an IR system, obtaining interesting results.
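A toy PyTorch sketch of the metric-learning idea: a SciBERT encoder is pushed to place a pseudo-query close to its source passage and far from an unrelated one via a triplet margin loss. The way the triple is formed here is an illustrative assumption, not the thesis's exact self-supervised procedure:

```python
# Toy sketch of self-supervised deep metric learning for text retrieval with
# SciBERT and a triplet margin loss. The triple construction is illustrative.
import torch
from torch.nn.functional import triplet_margin_loss
from transformers import AutoTokenizer, AutoModel

name = "allenai/scibert_scivocab_uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch).last_hidden_state
    return out.mean(dim=1)            # mean-pooled sentence embeddings

# Self-supervised triple: a sentence acting as pseudo-query (anchor),
# the paragraph it came from (positive) and an unrelated paragraph (negative).
anchor = embed(["coronavirus transmission via respiratory droplets"])
positive = embed(["Respiratory droplets are a primary transmission route ..."])
negative = embed(["The stock market reacted sharply to the announcement ..."])

loss = triplet_margin_loss(anchor, positive, negative, margin=1.0)
loss.backward()                       # one illustrative training step
print(float(loss))
```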
Abstract:
Most existing open-source search engines utilize keyword- or tf-idf-based techniques to find documents and web pages relevant to an input query. Although these methods, with the help of a page rank or knowledge graphs, have proved effective in some cases, they often fail to retrieve relevant instances for more complicated queries that require semantic understanding. In this thesis, a self-supervised information retrieval system based on transformers is employed to build a semantic search engine over the library of the Gruppo Maggioli company. Semantic search, or search with meaning, refers to understanding the query instead of simply finding word matches and, in general, it represents knowledge in a way suitable for retrieval. We chose to investigate a new self-supervised strategy for training on unlabeled data, based on the creation of pairs of ‘artificial’ queries and their respective positive passages. We argue that by removing the reliance on labeled data, we can use the large volume of unlabeled material on the web without being limited to languages or domains where labeled data is abundant.
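A minimal sketch of how such (‘artificial’ query, positive passage) pairs might be built from unlabeled text, using a TF-IDF salient-term heuristic as an illustrative stand-in for the actual pair generator:

```python
# Minimal sketch of the self-supervised pairing idea: treat a passage's most
# salient TF-IDF terms as a pseudo-query and pair it with that passage as the
# positive example. The heuristic and the example passages are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

passages = [
    "Il comune pubblica il bando per la gestione dei rifiuti urbani.",
    "La delibera approva il bilancio di previsione per il prossimo anno.",
    "Procedura negoziata per l'affidamento dei lavori di manutenzione stradale.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(passages)
terms = vectorizer.get_feature_names_out()

training_pairs = []
for i, passage in enumerate(passages):
    dense = tfidf[i].toarray().ravel()
    top_terms = [terms[j] for j in dense.argsort()[::-1][:3]]
    pseudo_query = " ".join(top_terms)
    training_pairs.append((pseudo_query, passage))   # (query, positive passage)

for q, p in training_pairs:
    print(f"{q!r} -> {p[:50]}...")
```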
Abstract:
The study of the atmospheric chemical composition is crucial to understand the climate changes we have been experiencing in recent decades and to monitor air quality over industrialized areas. Multi-AXis Differential Optical Absorption Spectroscopy (MAX-DOAS) ground-based instruments are particularly suitable for deriving the concentration of trace gases that absorb Visible (VIS) and Ultra-Violet (UV) solar radiation. The zenith-sky spectra acquired by the Gas Analyzer Spectrometer Correlating Optical Differences / New Generation 4 (GASCOD/NG4) instrument are exploited to retrieve the NO2 and O3 total Vertical Column Densities (VCDs) over Lecce. The results show that the NO2 total VCDs are significantly affected by the tropospheric content, a consequence of anthropogenic activity. Indeed, they present systematically lower values on Sundays, when less traffic is generally present around the measurement site, and during windy days, especially when the wind direction measured at 2 m height is not from the city of Lecce. Another MAX-DOAS instrument (SkySpec-2D) is exploited to create the first Italian MAX-DOAS site compliant with the Fiducial Reference Measurements for DOAS (FRM4DOAS) standards, in San Pietro Capofiume (SPC), in the middle of the Po Valley. After the assessment of SkySpec-2D's performance through two measurement campaigns, which took place in Bologna and Rome, SkySpec-2D was installed in SPC on 1 October 2021. Its MAX-DOAS spectra are used to retrieve the NO2 and O3 total VCDs, as well as aerosol extinction and NO2 tropospheric vertical profiles over the Po Valley, exploiting the Bremen Optimal estimation REtrieval for Aerosol and trace gaseS (BOREAS) algorithm. Promising results are found, with high correlations against both in-situ and satellite data. In the future, these data will play an important role in air quality studies over the Po Valley and in satellite validation.
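As a schematic illustration of the DOAS principle behind such retrievals, the following numpy sketch fits a measured optical depth as a linear combination of trace-gas absorption cross sections plus a low-order polynomial; all cross sections and numbers are synthetic, and this is not the BOREAS algorithm:

```python
# Schematic DOAS fit: optical depth ln(I0/I) is modelled as slant column
# densities times cross sections plus a broad-band polynomial; the fitted
# coefficients are the slant columns. All data here are synthetic.
import numpy as np

wl = np.linspace(425.0, 490.0, 200)                 # wavelength grid [nm]
xs_no2 = 1e-19 * (1 + 0.5 * np.sin(0.5 * wl))       # fake NO2 cross section [cm^2]
xs_o3 = 5e-22 * (1 + 0.3 * np.cos(0.3 * wl))        # fake O3 cross section [cm^2]

true_scd_no2, true_scd_o3 = 4e16, 8e18              # "true" slant columns [cm^-2]
optical_depth = (true_scd_no2 * xs_no2 + true_scd_o3 * xs_o3
                 + 0.02 + 1e-4 * wl                  # broad-band (Rayleigh/Mie) part
                 + np.random.normal(0, 2e-4, wl.size))

# Design matrix: cross sections + polynomial closure term for broad-band extinction.
A = np.column_stack([xs_no2, xs_o3, np.ones_like(wl), wl, wl**2])
coeffs, *_ = np.linalg.lstsq(A, optical_depth, rcond=None)

print(f"retrieved NO2 SCD ~ {coeffs[0]:.2e} cm^-2 (true {true_scd_no2:.0e})")
print(f"retrieved O3  SCD ~ {coeffs[1]:.2e} cm^-2 (true {true_scd_o3:.0e})")
```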
Abstract:
The aim of this thesis is to research, examine and implement a Machine Learning system, specifically a Recommendation System, capable of optimally recommending legal documents that have already been analysed and appropriately categorised. The recommender is intended to complement an already implemented Information Retrieval system, deployed on top of a web application, that allows users to search the aforementioned legal documents.
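A minimal sketch of a content-based recommender of the kind described, using TF-IDF vectors and cosine similarity over invented example documents (the actual system may use a different model):

```python
# Minimal content-based recommender over already categorised legal documents:
# TF-IDF vectors plus cosine similarity suggest the documents most similar to
# the one the user is reading. Titles and texts are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "D1": "Appalto pubblico per la fornitura di servizi informatici al comune.",
    "D2": "Contratto di appalto e subappalto nei lavori pubblici.",
    "D3": "Regolamento sulla protezione dei dati personali dei dipendenti.",
}
ids, texts = list(docs), list(docs.values())

tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)

def recommend(doc_id, k=2):
    # Rank all other documents by similarity to the given one.
    i = ids.index(doc_id)
    ranked = sim[i].argsort()[::-1]
    return [ids[j] for j in ranked if j != i][:k]

print(recommend("D1"))   # documents most similar to D1
```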
Abstract:
This work employs techniques and tools for the automatic analysis of textual data. The aim of the thesis is to conduct text mining and sentiment analysis on a set of messages in order to understand their meaning, with particular attention to the emotions and sentiments they contain, so as to extract information of interest.
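A small sketch of the kind of sentiment analysis described, applied to short Italian messages with a publicly available multilingual checkpoint (chosen for illustration, not necessarily the one used in the thesis):

```python
# Sentiment analysis sketch over short Italian messages using a public
# multilingual model that outputs a 1-5 star rating with a confidence score.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

messages = [
    "Il servizio è stato eccellente, sono davvero soddisfatto!",
    "Esperienza pessima, non ci tornerò mai più.",
]

for msg, result in zip(messages, classifier(messages)):
    print(f"{result['label']:>7}  {result['score']:.2f}  {msg}")
```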
Abstract:
Artificial Intelligence is reshaping the fashion industry in different ways. E-commerce retailers exploit their data through AI to enhance their search engines, make outfit suggestions and forecast the success of a specific fashion product. However, this is a challenging endeavour, as the data they possess is huge, complex and multi-modal. The most common way to search for fashion products online is by matching keywords with phrases in the product's description, which are often cluttered, inadequate and differ across collections and sellers. A customer may also browse an online store's taxonomy, although this is time-consuming and doesn't guarantee relevant items. With the advent of Deep Learning architectures, particularly Vision-Language models, ad-hoc solutions have been proposed to model both the product image and description to solve these problems. However, the suggested solutions do not effectively exploit the semantic or syntactic information of these modalities, nor the unique qualities and relations of clothing items. In this thesis, a novel approach is proposed to address these issues: it models and processes images and text descriptions as graphs in order to exploit the relations within and between each modality, and it employs specific techniques to extract syntactic and semantic information. The results obtained show promising performance on different tasks when compared to current state-of-the-art deep learning architectures.
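As an illustration of one ingredient of such a pipeline, the following sketch turns a product description into a syntactic graph with spaCy and networkx (assuming the small English spaCy model is installed); it is a generic example, not the thesis's actual architecture:

```python
# Turn a product description into a syntactic graph: nodes are tokens and
# edges follow the dependency parse. Generic spaCy/networkx usage for
# illustration only; the example sentence is invented.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("Red cotton midi dress with long sleeves and a floral print")

graph = nx.Graph()
for token in doc:
    graph.add_node(token.i, text=token.text, pos=token.pos_)
    if token.head.i != token.i:                     # skip the root's self-edge
        graph.add_edge(token.i, token.head.i, dep=token.dep_)

# The resulting graph could then be matched against a graph built from the
# product image, or fed to a graph neural network.
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```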