3 results for Data reliability
in AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Abstract:
Coral reefs are the most biodiverse ecosystems of the ocean and provide notable ecosystem services. They currently face a number of local anthropogenic threats, and environmental change is threatening their survival on a global scale. Large-scale monitoring is necessary to understand environmental changes and to implement effective conservation measures. Governmental agencies are often underfunded and unable to sustain the necessary large-scale spatial and temporal monitoring. To overcome these economic constraints, scientists can in some cases engage volunteers in environmental monitoring. Citizen Science enables the collection and analysis of scientific data at larger spatial and temporal scales than otherwise possible, addressing issues that would otherwise be logistically or financially unfeasible. “STE: Scuba Tourism for the Environment” was a volunteer-based Red Sea coral reef biodiversity monitoring program. SCUBA divers and snorkelers were involved in collecting data on 72 taxa by completing survey questionnaires after their dives. In my thesis, I evaluated the reliability of the data collected by volunteers by comparing their questionnaires with those completed by professional scientists. Validation trials showed a sufficient level of reliability, indicating that non-specialists performed similarly to conservation volunteer divers surveying accurate transects. Using the data collected by volunteers, I developed a biodiversity index that revealed spatial trends across the surveyed areas. The project results provided important feedback to the local authorities on the current health status of Red Sea coral reefs and on the effectiveness of environmental management. I also analysed the spatial and temporal distribution of each surveyed taxon, identifying abundance trends related to anthropogenic impacts. Finally, I evaluated the effectiveness of the project in increasing the environmental education of volunteers and showed that participation in the STE project significantly increased both knowledge of coral reef biology and ecology and awareness of the impact of human behaviour on the environment.
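As a minimal illustration of how a presence-based biodiversity index could be derived from volunteer questionnaires, the sketch below computes the mean fraction of monitored taxa sighted per dive, grouped by area. The equal weighting of taxa, the grouping key, and the example data are assumptions made for the sketch, not the thesis's actual index formulation.

# Hypothetical sketch: a simple presence-based biodiversity index computed from
# volunteer survey questionnaires. Assumes each record lists the taxa (out of the
# 72 monitored) sighted during one dive; names and weighting are illustrative only.
from collections import defaultdict

N_MONITORED_TAXA = 72  # taxa listed on the STE survey questionnaire

def biodiversity_index(surveys):
    """Mean fraction of monitored taxa sighted per dive, grouped by area.

    `surveys` is an iterable of (area, sighted_taxa) pairs, where `sighted_taxa`
    is the set of taxon names recorded on one questionnaire.
    """
    per_area = defaultdict(list)
    for area, sighted in surveys:
        per_area[area].append(len(sighted) / N_MONITORED_TAXA)
    return {area: sum(vals) / len(vals) for area, vals in per_area.items()}

# Example with made-up records:
example = [
    ("Area A", {"butterflyfish", "parrotfish", "giant clam"}),
    ("Area A", {"butterflyfish", "moray eel"}),
    ("Area B", {"parrotfish"}),
]
print(biodiversity_index(example))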
Abstract:
In recent years, there has been exponential growth in the use of virtual spaces, including dialogue systems, that handle personal information. The concept of personal privacy is much discussed and controversial in the literature, whereas in the technological field it directly influences the degree of reliability perceived in an information system (privacy ‘as trust’). This work aims to protect the right to privacy over personal data (GDPR, 2018) and to avoid the loss of sensitive content by exploring the sensitive information detection (SID) task. It is grounded on the following research questions: (RQ1) What does sensitive data mean, and how can a personal sensitive information domain be defined? (RQ2) How can a state-of-the-art model for SID be created? (RQ3) How can the model be evaluated? RQ1 theoretically investigates the concepts of privacy and the ontological state-of-the-art representation of personal information. The Data Privacy Vocabulary (DPV) is the taxonomic resource taken as the authoritative reference for the definition of the knowledge domain. Concerning RQ2, we investigate two approaches to classifying sensitive data: the first (bottom-up) explores automatic learning methods based on transformer networks; the second (top-down) proposes logical-symbolic methods with the construction of privaframe, a knowledge graph of compositional frames representing personal data categories. Both approaches are tested. For the evaluation (RQ3), we create SPeDaC, a sentence-level labeled resource. It can be used as a benchmark or as training data for the SID task, filling the gap left by the lack of a shared resource in this field. While the approach based on artificial neural networks confirms the validity of the direction taken in the most recent studies on SID, the logical-symbolic approach emerges as the preferred way to classify fine-grained personal data categories, thanks to the semantically grounded, tailored modeling it allows. At the same time, the results highlight the strong potential of hybrid architectures for solving such automatic tasks.
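To make the bottom-up direction concrete, the sketch below runs a transformer sentence classifier that flags sensitive personal-data categories. The model name, label set, and sentences are illustrative assumptions; they are not the thesis's actual configuration or the SPeDaC label scheme, and in practice the classification head would be fine-tuned on a labeled corpus.

# Minimal sketch of transformer-based sensitive information detection (SID).
# Model, labels, and sentences are illustrative assumptions only.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

LABELS = ["non-sensitive", "health", "financial", "ethnicity"]  # hypothetical categories

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # untrained head here; it would be fine-tuned on sentence-level labeled data

sentences = [
    "My doctor prescribed insulin for my diabetes.",
    "The meeting was rescheduled to Tuesday.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
for sentence, pred in zip(sentences, logits.argmax(dim=-1)):
    print(f"{LABELS[pred]}: {sentence}")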
Abstract:
Artificial Intelligence (AI) and Machine Learning (ML) are novel data analysis techniques providing very accurate prediction results. They are widely adopted in a variety of industries to improve efficiency and decision-making, and they are also being used to develop intelligent systems. Their success rests on complex mathematical models whose decisions and rationale are usually difficult for human users to comprehend, to the point that such models are dubbed black boxes. This is particularly relevant in sensitive and highly regulated domains. To mitigate and possibly solve this issue, the Explainable AI (XAI) field has become prominent in recent years. XAI consists of models and techniques that enable understanding of the intricate patterns discovered by black-box models. In this thesis, we consider model-agnostic XAI techniques applicable to tabular data, with a particular focus on the Credit Scoring domain. Special attention is dedicated to the LIME framework, for which we propose several modifications to the vanilla algorithm, in particular a pair of complementary Stability Indices that accurately measure LIME stability, and the OptiLIME policy, which helps the practitioner find the proper balance between the stability and reliability of explanations. We subsequently put forward GLEAMS, a model-agnostic interpretable surrogate model that needs to be trained only once while providing both local and global explanations of the black-box model. GLEAMS produces feature attributions and what-if scenarios from both the dataset and the model perspective. Finally, we observe that synthetic data are an emerging trend in AI, increasingly used to train complex models in place of original data. To explain the outcomes of such models, we must guarantee that synthetic data are reliable enough for their explanations to be translated to real-world individuals. To this end, we propose DAISYnt, a suite of tests to measure the quality and privacy of synthetic tabular data.
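The instability the thesis addresses comes from LIME's random sampling: repeated runs on the same instance can select different features. The sketch below illustrates one generic way to quantify this, by running LIME several times and computing the mean pairwise Jaccard similarity of the selected feature sets; it is an assumed illustration, not the thesis's actual Stability Indices or the OptiLIME policy.

# Illustrative sketch: measure how consistently LIME selects the same features
# across repeated runs on one instance (a generic agreement measure).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=[f"f{i}" for i in range(X.shape[1])], mode="classification"
)

def repeated_feature_sets(instance, n_runs=10, num_features=5):
    """Collect the feature indices selected by LIME over repeated runs."""
    runs = []
    for _ in range(n_runs):
        exp = explainer.explain_instance(
            instance, black_box.predict_proba, num_features=num_features
        )
        runs.append({idx for idx, _ in exp.as_map()[1]})
    return runs

runs = repeated_feature_sets(X[0])
# Mean pairwise Jaccard similarity: 1.0 means identical feature sets every run.
pairs = [(a, b) for i, a in enumerate(runs) for b in runs[i + 1:]]
jaccard = np.mean([len(a & b) / len(a | b) for a, b in pairs])
print(f"Mean pairwise Jaccard of selected features: {jaccard:.2f}")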