8 results for on-disk data layout

in AMS Tesi di Laurea - Alm@DL - Università di Bologna


Relevance: 100.00%

Abstract:

During the last semester of the Master's Degree in Artificial Intelligence, I carried out my internship at TXT e-Solution, working on the ADMITTED project. This thesis describes the work done in those months and is divided into two parts, corresponding to the two tasks I was assigned. The first part introduces the project and the work done on the admittedly library: maintaining the code base and writing the test suites. This work is closer to the software engineering role, developing features, fixing bugs and testing. The second part describes the experiments on the anomaly detection task using a deep learning technique called the autoencoder; this task is closer to the data science role. The two tasks were not carried out simultaneously but one after the other, which is why I preferred to present them in two separate parts of this thesis.
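As an illustration of the technique named in the second part, the sketch below shows a minimal reconstruction-based anomaly detector in PyTorch; the architecture, layer sizes and thresholding strategy are assumptions for the example and do not reflect the actual ADMITTED models.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal autoencoder: normal samples reconstruct well, anomalies do not."""
    def __init__(self, n_features: int, latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model: AutoEncoder, x: torch.Tensor) -> torch.Tensor:
    # Per-sample mean squared reconstruction error; samples whose score
    # exceeds a threshold fitted on normal (training) data are flagged.
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)
```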

Relevance: 100.00%

Abstract:

This project is about retrieving data by range without allowing the server to read it, when the database is stored on the server. Our goal is to build a database that lets the client keep the stored data confidential even though it resides somewhere other than the client's hard disk. Anything written to the server's disk could be read by another party and misused, for example sold, or used to log into accounts and steal money or identities, so the data must be protected from eavesdroppers and other parties. To achieve this, we encrypt the data stored on the drive, so that only the holder of the key can read the information, while everyone else sees only ciphertext. Accordingly, all data management must be done by the client; otherwise a malicious party could easily retrieve the data and use it for any malicious purpose. All the methods analysed here rely on encrypting the data in transit. At the end of this project we analyse two theoretical and practical methods for building such databases and test them on three datasets with 10, 100 and 1000 queries. The aim of this work is to identify a trend that can be useful for future work based on this project.
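As a naive baseline of the general idea (not one of the two methods analysed in the thesis), the sketch below keeps the encryption key and a sorted plaintext index on the client, while the server only ever stores ciphertexts; the `cryptography` library and all names are assumptions for the illustration.

```python
import bisect
from cryptography.fernet import Fernet

class ClientSideRangeStore:
    """The key and the sorted index never leave the client; the dict
    stands in for the remote server, which sees only ciphertexts."""

    def __init__(self):
        self._fernet = Fernet(Fernet.generate_key())  # key held only by the client
        self._index = []                              # sorted plaintext keys (client side)
        self._server = {}                             # remote encrypted store (server side)

    def put(self, key: int, value: str) -> None:
        bisect.insort(self._index, key)
        self._server[key] = self._fernet.encrypt(value.encode())

    def range_query(self, lo: int, hi: int) -> list[str]:
        # The client decides which keys fall in [lo, hi] and decrypts locally.
        lo_i = bisect.bisect_left(self._index, lo)
        hi_i = bisect.bisect_right(self._index, hi)
        return [self._fernet.decrypt(self._server[k]).decode()
                for k in self._index[lo_i:hi_i]]
```

A real scheme would also need to hide the record keys and the access pattern from the server; the sketch only shows that key management and decryption stay entirely on the client side.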

Relevance: 100.00%

Abstract:

This thesis presents a study of Grid data access patterns in distributed analysis in the CMS experiment at the LHC accelerator. The study ranges from a deep analysis of the historical access patterns for the most relevant data types in CMS to the use of a supervised Machine Learning classification system to set up machinery able to predict future data access patterns - the so-called dataset "popularity" of CMS datasets on the Grid - with a focus on specific data types. All CMS workflows run on the Worldwide LHC Computing Grid (WLCG) computing centres (Tiers), and the distributed analysis system in particular sustains hundreds of users and applications submitted every day. These applications (or "jobs") access different data types hosted on disk storage systems at a large set of WLCG Tiers. A detailed study of how this data is accessed, in terms of data types, hosting Tiers and time periods, gives precious insight into storage occupancy over time and into the different access patterns, and ultimately allows suggested actions to be extracted from this information (e.g. targeted disk clean-up and/or data replication). In this sense, the application of Machine Learning techniques makes it possible to learn from past data and to gain predictive power over future CMS data access patterns. Chapter 1 provides an introduction to High Energy Physics at the LHC. Chapter 2 describes the CMS Computing Model, with special focus on the data management sector, and discusses the concept of dataset popularity. Chapter 3 describes the study of CMS data access patterns at different levels of depth. Chapter 4 offers a brief introduction to basic machine learning concepts, introduces their application in CMS, and discusses the results obtained with this approach in the context of this thesis.
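A minimal sketch of the kind of supervised classification used for dataset popularity might look as follows; the features, the synthetic labels and the choice of a random forest are illustrative assumptions, not the configuration used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical per-dataset features: past accesses, distinct users,
# dataset size, weeks since creation (values are synthetic here).
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = (X[:, 0] + 0.3 * X[:, 1] > 0.8).astype(int)  # stand-in "popular next period" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```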

Relevance: 100.00%

Abstract:

The first part of my work consisted of samplings conducted in nine localities of the Salento peninsula and Apulia (Italy): Costa Merlata (BR), Punta Penne (BR), Santa Cesarea Terme (LE), Santa Caterina (LE), Torre Inserraglio (LE), Torre Guaceto (BR), Porto Cesareo (LE), Otranto (LE), Isole Tremiti (FG). I collected data on species percentage cover in the infralittoral rocky zone, using 50x50 cm quadrats, with 3 sites per locality and 10 replicates per site, taken randomly. I then combined these data with other data collected in the same places over several years in order to carry out a spatial analysis. I thus started from a data set of 1896 samples, but decided not to treat time as a factor, since I have reason to think that over this period the anthropogenic stressors and their effects (if present) did not change considerably. The response variable analysed is the percentage cover of 243 species (subsequently merged into 32 functional groups), including seaweeds, invertebrates, sediment and rock.

After the sampling, I spent two months at the Hopkins Marine Station of Stanford University in Monterey (California, USA), in Fiorenza Micheli's laboratory, where I carried out the statistical analysis of my data set using the software PRIMER 6. The exploratory analysis starts with an nMDS on the original data matrix, without, for the moment, the effect of the stressors. It shows a good separation between localities and confirms the result of the ANOSIM analysis conducted on the original data matrix. The separation is not driven by a geographic pattern; something else must be driving the differences. The presence of at least three groups is clear: one composed of Porto Cesareo, Torre Guaceto and Isole Tremiti (the only marine protected areas considered in this work); another composed of Otranto; and the last composed of the remaining small, impacted localities. Within the localities that include MPAs (Marine Protected Areas), a grouping between protected and control areas can also be observed. SIMPER analysis shows that most of the species driving the differences between populations are not rare species, e.g. Cystoseira spp., Mytilus sp. and ECR. Moreover, I assigned discrete values (0, 1, 2) of each stressor to all the sites considered, according to the intensity with which the anthropogenic factor affects each locality.

I then tried to establish whether there were significant interactions between stressors: using Spearman rank correlation and the Spearman tables of significance, with 17 degrees of freedom, the outcome shows some significant stressor interactions. I then built an nMDS using the stressors as response variables. The result was positive: localities are well separated by stressors. Consequently, I related the 'localities and species' matrix to the 'localities and stressors' one. The stressor combination explains, with a good significance level, the variability within my populations. I tried all the possible data transformations (none, square root, fourth root, log(X+1), P/A); the fourth root proved to be the best one, with the highest level of significance, meaning that rare species too can influence the result.
The challenge will be to better characterize which kinds of stressors (including natural ones) act on the ecosystem, to assign them more accurate quantitative values, and to understand how they interact (in an additive or non-additive way).
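The stressor-interaction step (done in the thesis with PRIMER 6 and Spearman significance tables) could be reproduced in Python as sketched below; the site-by-stressor scores are made-up placeholders, and the fourth-root transformation of percentage cover is shown alongside.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical 0/1/2 intensity scores, one row per site, one column per stressor.
stressors = np.array([
    [2, 1, 0],
    [1, 2, 1],
    [0, 0, 2],
    [2, 2, 0],
    [1, 0, 1],
])
rho, p = spearmanr(stressors)   # pairwise rank correlations between stressor columns
print(np.round(rho, 2))
print(np.round(p, 3))

# Fourth-root transformation of percentage cover, used to down-weight dominant
# species so that rare species can also influence the ordination.
cover = np.array([35.0, 4.0, 0.5, 60.0])
print(cover ** 0.25)
```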

Relevance: 100.00%

Abstract:

In this work we discuss a project started by the Emilia-Romagna Regional Government regarding the management of public transport. In particular, we perform a data mining analysis on the data set of this project. After introducing the Weka software used for our analysis, we review the most useful data mining techniques and algorithms, and we show how these results can be used to violate the privacy of the public transport operators themselves. At the end, although it is off topic for this work, we also spend a few words on how this kind of attack can be prevented.

Relevance: 100.00%

Abstract:

Reinforcement learning is a particular paradigm of machine learning that has recently proved, time and time again, to be a very effective and powerful approach. Cryptography, on the other hand, usually takes the opposite direction: while machine learning aims at analyzing data, cryptography aims at preserving privacy by hiding such data. However, the two techniques can be used jointly to create privacy-preserving models, able to make inferences on the data without leaking sensitive information. Despite the numerous studies on machine learning and cryptography, reinforcement learning in particular had never been applied to such cases before. Being able to use reinforcement learning successfully in an encrypted scenario would allow us to create an agent that efficiently controls a system without giving it full knowledge of the environment it operates in, opening the way to many possible use cases. We have therefore decided to apply the reinforcement learning paradigm to encrypted data. In this project we applied one of the best-known reinforcement learning algorithms, Deep Q-Learning, to simple simulated environments and studied how the encryption affects the training performance of the agent, in order to see whether it can still learn how to behave even when the input data is no longer readable by humans. The results of this work highlight that the agent can still learn with no issues whatsoever in small state spaces with non-secure encryption, such as AES in ECB mode. For fixed environments, it can also reach a suboptimal solution even with secure modes, such as AES in CBC mode, showing a significant improvement over a random agent; however, its ability to generalize in stochastic environments or large state spaces suffers greatly.
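A minimal sketch of this kind of preprocessing, assuming the PyCryptodome library: the environment state is serialized, padded to the AES block size, encrypted (ECB or CBC), and the ciphertext bytes are fed to the Deep Q-Network as a normalized vector. The key, padding and scaling choices here are illustrative, not those of the thesis.

```python
import numpy as np
from Crypto.Cipher import AES  # PyCryptodome

KEY = b"0123456789abcdef"  # 16-byte demo key; a real agent would manage keys properly

def encrypt_state(state, mode=AES.MODE_ECB):
    """Serialize the state, pad to the 16-byte block size, encrypt,
    and return the ciphertext as a float vector usable as DQN input."""
    raw = np.asarray(state, dtype=np.float32).tobytes()
    raw += b"\x00" * ((-len(raw)) % 16)      # zero-pad to the AES block size
    cipher = AES.new(KEY, mode)              # CBC draws a fresh random IV on each call
    ct = cipher.encrypt(raw)
    return np.frombuffer(ct, dtype=np.uint8).astype(np.float32) / 255.0

obs = encrypt_state([0.1, -0.4, 0.0, 1.2])   # e.g. a CartPole-like state vector
```

With ECB the same state always maps to the same ciphertext, so the mapping remains learnable; with CBC the random IV makes repeated encryptions of the same state differ, which is consistent with the harder learning problem described above.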

Relevance: 100.00%

Abstract:

Most existing open-source search engines utilize keyword or tf-idf based techniques to find documents and web pages relevant to an input query. Although these methods, with the help of PageRank or knowledge graphs, have proved effective in some cases, they often fail to retrieve relevant results for more complicated queries that require semantic understanding. In this thesis, a self-supervised information retrieval system based on transformers is employed to build a semantic search engine over the library of the Gruppo Maggioli company. Semantic search, or search with meaning, refers to understanding the query instead of simply finding word matches and, in general, represents knowledge in a way suitable for retrieval. We chose to investigate a new self-supervised strategy to handle training on unlabeled data, based on the creation of pairs of 'artificial' queries and their respective positive passages. We claim that by removing the reliance on labeled data, we can use the large volume of unlabeled material on the web without being limited to languages or domains where labeled data is abundant.
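A minimal dense-retrieval sketch with the sentence-transformers library is shown below; the pretrained model name and the toy corpus are assumptions for the example, whereas the thesis trains its own model on self-supervised pairs of artificial queries and positive passages.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model

corpus = [
    "Regolamento per la gestione dei rifiuti urbani.",
    "Procedura per il rilascio del permesso di costruire.",
    "Linee guida per gli appalti pubblici sotto soglia.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("come ottenere un permesso edilizio", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]  # ranked by cosine similarity
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```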

Relevance: 100.00%

Abstract:

Semantic Web technologies provide the means to express knowledge in a formal and standardized manner, enabling machines to automatically derive meaning from data. Often this knowledge is uncertain, or different degrees of certainty may be assigned to the same statements. This is the case in many fields of study, such as the Digital Humanities, Science and the Arts. The challenge lies in the fact that our knowledge about the surrounding world is dynamic and may evolve as new data comes in from the latest discoveries. Furthermore, we should be able to express conflicting, debated or disputed statements in an efficient, effective and consistent way without the need to assert them. We call this approach 'Expressing Without Asserting' (EWA). In this work we identify the existing methods that are compatible with current Semantic Web standards and enable EWA. In our research we were able to show that existing reification methods such as Named Graphs, Singleton Properties, Wikidata Statements and RDF-Star are the most suitable methods to represent EWA reliably. We then compare these methods with our own method, Conjectures, from a quantitative perspective. Our main objective was to put Conjectures under stress tests, leveraging enormous datasets created ad hoc from art-related Wikidata dumps, and to measure performance in various triplestores against similar concurrent methods. Our experiments show that Conjectures are a formidable tool for expressing EWA efficiently and effectively. In some cases, Conjectures outperform state-of-the-art methods such as Singleton Properties and RDF-Star, showing their great potential. It is our firm belief that Conjectures represent a suitable solution to the EWA problem. Conjectures in their weak form are fully compatible with Semantic Web standards, especially RDF and SPARQL. Furthermore, Conjectures benefit from a comprehensive syntax and intuitive semantics that make them easy to learn and adapt.
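As a minimal illustration of one of the compatible reification methods (Named Graphs) in Python with rdflib, the sketch below records a debated attribution inside a dedicated named graph so that it is expressed but not asserted in the default graph; all URIs and names are invented for the example and do not come from the thesis datasets.

```python
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
ds = Dataset()

# The disputed statement lives only in its own named graph ("conjecture"),
# so it is expressed without being asserted in the default graph.
conj = ds.graph(URIRef("http://example.org/conjecture/attribution-1"))
conj.add((EX.Painting42, EX.attributedTo, EX.Caravaggio))

# Provenance about the conjecture itself can be asserted normally.
ds.add((URIRef("http://example.org/conjecture/attribution-1"),
        EX.claimedBy, Literal("Scholar A")))

for g in ds.graphs():
    print(g.identifier, len(g))
```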