976 results for Information Requirements: Data Availability
Abstract:
Data mining can be defined as the extraction of implicit, previously unknown, and potentially useful information from data. Numerous researchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for intrusion detection and its modified versions, KDDCup99 and NSL-KDD, but until now no one has examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The classification learning algorithms compared in this thesis are C4.5, CART, k-NN and Naïve Bayes. Their performance is compared in terms of accuracy, error rate and average cost on modified versions of the NSL-KDD train and test datasets, where the instances are classified into normal traffic and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally, the most important features for detecting cyber-attacks overall and within each category are evaluated with Weka's Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with the best performance on the dataset is the k-NN algorithm. The most important features for detecting cyber-attacks are basic features such as the duration of a network connection in seconds, the protocol used for the connection, the network service used, the normal or error status of the connection, and the number of data bytes sent. For DoS, Probing and R2L attacks the most important features are basic features and the least important are content features, whereas for U2R attacks the content features are the most important.
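As a rough illustration of the comparison workflow described above (the thesis itself uses Weka), the Python sketch below cross-validates CART-style, k-NN and Naïve Bayes classifiers and ranks features by information gain. The data are synthetic stand-ins, the feature names only echo NSL-KDD basic features, scikit-learn is an assumed library choice, and C4.5 has no direct scikit-learn analogue.

```python
# Rough sketch only: the thesis uses Weka; this approximates the same workflow
# with scikit-learn on synthetic stand-in data (a real run would load a
# preprocessed NSL-KDD table instead).
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier        # CART-style learner
from sklearn.neighbors import KNeighborsClassifier     # k-NN
from sklearn.naive_bayes import GaussianNB              # Naive Bayes
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
feature_names = ["duration", "src_bytes", "dst_bytes", "count", "srv_count"]
X = pd.DataFrame(rng.random((500, 5)), columns=feature_names)
y = rng.choice(["normal", "DoS", "Probing", "R2L", "U2R"], size=500)

for name, model in {"CART": DecisionTreeClassifier(),
                    "k-NN": KNeighborsClassifier(n_neighbors=5),
                    "Naive Bayes": GaussianNB()}.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: accuracy={acc:.3f}, error rate={1 - acc:.3f}")

# Rank features by information gain (mutual information), analogous to
# Weka's InfoGainAttributeEval with the Ranker search method.
gain = pd.Series(mutual_info_classif(X, y), index=feature_names)
print(gain.sort_values(ascending=False))
```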
Abstract:
Twitter is the biggest social network in the world, and every day millions of tweets are posted and discussed, expressing various views and opinions. A large variety of research activities have been conducted to study how these opinions can be clustered and analyzed so that tendencies can be uncovered. Because of the inherent weaknesses of tweets - very short texts and very informal styles of writing - it is rather hard for tweet data analysis to produce results with good performance and accuracy. In this paper, we approach the problem from another angle, using a two-layer structure to analyze the Twitter data: LDA combined with topic map modelling. The experimental results demonstrate that this approach improves Twitter data analysis. However, more experiments with this method are needed to ensure that accurate analytic results can be maintained.
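A minimal sketch of the first layer of such a pipeline (topic extraction with LDA); the topic-map layer described in the paper is not reproduced, the example tweets are placeholders, and scikit-learn is an assumed choice of library.

```python
# First (LDA) layer only; the topic-map layer is the paper's own contribution
# and is not reproduced here. Tweets are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "new phone battery lasts all day",
    "traffic downtown is terrible again",
    "loving the new camera on this phone",
]
vec = CountVectorizer(stop_words="english", max_features=5000)
X = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-tweet topic mixture

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```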
Abstract:
Recommendation systems aim to help users make decisions more efficiently. The most widely used method in recommendation systems is collaborative filtering, a critical step of which is to analyze a user's preferences and recommend products or services based on a similarity analysis with other users' ratings. However, collaborative filtering is less useful for recommendation under the "cold start" problem, i.e. when few ratings have been given to products or services. To tackle this problem, we propose an improved method that combines collaborative filtering and data classification. We use hotel recommendation data to test the proposed method. The accuracy of the recommendation is determined by the rankings. We evaluate the accuracy of the Top-3 and Top-10 recommendation lists using 10-fold cross-validation and ROC curves. The results show that, under the cold-start condition, the Top-3 hotel recommendation list produced by the combined method outperforms the Top-10 list in most cases.
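The following is a toy sketch of the user-based collaborative-filtering step under a sparse rating matrix; the classification component the paper combines with it is not reproduced, and the ratings are invented.

```python
# Toy user-based collaborative filtering: predict scores for a target user's
# unrated hotels as a similarity-weighted mean of other users' ratings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows: users, cols: hotels, 0 = not rated
ratings = np.array([
    [5, 0, 0, 1, 0],
    [4, 0, 0, 1, 2],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)

target = 0                                    # user to recommend for
sim = cosine_similarity(ratings)[target]      # similarity to every user
sim[target] = 0.0                             # ignore self-similarity

rated = (ratings > 0).astype(float)
pred = (sim @ ratings) / (sim @ rated + 1e-9)  # weighted mean of others' ratings
unrated = ratings[target] == 0
top3 = np.argsort(np.where(unrated, pred, -np.inf))[::-1][:3]
print("Top-3 hotel indices:", top3)
```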
Abstract:
Internet users consume online targeted advertising based on information collected about them and voluntarily share personal information in social networks. Sensor information and data from smartphones are collected and used by applications, sometimes in unclear ways. As happens today with smartphones, in the near future sensors will be shipped in all types of connected devices, enabling ubiquitous information gathering from the physical environment and realizing the vision of Ambient Intelligence. The value of gathered data, if not obvious, can be harnessed through data mining techniques and put to use by enabling personalized and tailored services as well as business intelligence practices, fueling the digital economy. However, the ever-expanding gathering and use of information undermines the privacy conceptions of the past. Natural social practices of managing privacy in daily relations are overridden by socially awkward communication tools, service providers struggle with security issues resulting in harmful data leaks, governments use mass surveillance techniques, the incentives of the digital economy threaten consumer privacy, and the advancement of consumer-grade data-gathering technology enables new inter-personal abuses. A wide range of fields attempts to address technology-related privacy problems, but they vary immensely in terms of assumptions, scope and approach. Privacy of future use cases is typically handled vertically, instead of building upon previous work that can be re-contextualized, while current privacy problems are typically addressed per type in a more focused way. Because significant effort was required to make sense of the relations and structure of privacy-related work, this thesis attempts to transmit a structured view of it. It is multi-disciplinary - from cryptography to economics, including distributed systems and information theory - and addresses privacy issues of different natures. As existing work is framed and discussed, the contributions to the state of the art made in the scope of this thesis are presented. The contributions add to five distinct areas: 1) identity in distributed systems; 2) future context-aware services; 3) event-based context management; 4) low-latency information flow control; 5) high-dimensional dataset anonymity. Finally, having laid out such a landscape of privacy-preserving work, current and future privacy challenges are discussed, considering not only technical but also socio-economic perspectives.
Abstract:
In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (the Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due to the several advantages provided by the Cloud, such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has traditionally been used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
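To illustrate the neighborhood-centric idea in the simplest possible terms, the sketch below extracts 1-hop ego networks with networkx and analyzes each subgraph. It is a single-machine toy on a stand-in graph, not NSCALE's distributed execution model or API.

```python
# Toy illustration of the neighborhood-centric idea: run an analysis task on
# each vertex's ego network (subgraph) rather than on single-vertex state.
import networkx as nx

G = nx.karate_club_graph()                    # stand-in for a large graph
for v in list(G.nodes)[:3]:
    ego = nx.ego_graph(G, v, radius=1)        # the neighborhood of interest
    density = nx.density(ego)
    print(f"vertex {v}: ego network with {ego.number_of_nodes()} nodes, "
          f"density {density:.2f}")
```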
Abstract:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology to analyze structured data. In essence, DDA helps analysts to discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By giving indications to the analyst on what dimensions to explore, DDA speeds up the process of discovering anomalies and their causes. However, Discovery Driven Analysis (and OLAP in general) is only applicable on structured data, such as records in databases. We propose a system to extend DDA technology to semi-structured text documents, that is, text documents with a few structured data. Our system pipeline consists of two stages: first, the text part of each document is structured around user specified dimensions, using semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle reasonable datasets of documents, in real time, without any need for pre-computation.
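A toy sketch of the discovery-driven principle on a structured cube slice: cells whose measure deviates strongly from what coarser aggregates would predict are highlighted. The median-based expectation model, the threshold and the data are simplifications, not the system's actual algorithm.

```python
# Flag 'unexpected' cube cells: compare each cell with an expectation derived
# from row/column aggregates and highlight large relative deviations.
import numpy as np
import pandas as pd

# Hypothetical cube slice: sales by region x quarter (400 is a planted anomaly)
cube = pd.DataFrame(
    [[100, 110, 105, 400],
     [ 95, 100,  98, 102],
     [ 90,  92,  91,  93]],
    index=["north", "south", "west"],
    columns=["Q1", "Q2", "Q3", "Q4"],
)

expected = np.outer(cube.median(axis=1), cube.median(axis=0)) / np.median(cube.values)
surprise = (cube - expected).abs() / expected
print(surprise[surprise > 0.5].stack())   # cells an analyst should explore
```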
Abstract:
Open Access to publicly funded scholarly publications is, under the banner of "openness", part of an increasingly significant global development with structural consequences for science, research and education. How, with what reach and with what acceptance the Open Access paradigm actually materialises depends strongly on the respective disciplinary cultures and constellations of economic interests. This thesis pursues this question using the example of the inter- and pluridisciplinary field of educational science/educational research. On the one hand, the disciplinary and socio-cultural constellations of publishing in the field, the market constellations of the publishing industry, and the information-infrastructure conditions of the discipline are analysed and a differentiated overall picture is drawn. On the other hand, based on an online survey of the educational science/educational research community, further insights are obtained into existing Open Access experience in the field and into obstacles to, and requirements for, the new publication model from the perspective of the researchers themselves, as well as, exploratively, from the perspective of students and educational practice. Key factors in assessing the potential and effects of Open Access in the field are academic status and function, interdisciplinarity and disciplinary provenance, and the relationship between educational practice and the academic sector. (DIPF/Orig.)
Abstract:
Manufacturing companies have moved from selling uniquely tangible products to adopting a service-oriented approach to generate steady and continuous revenue streams. Nowadays, equipment and machine manufacturers possess technologies to track and analyze product-related data and obtain relevant information about how customers use the product after it is sold. The Internet of Things in industrial environments will allow manufacturers to leverage lifecycle product traceability to innovate towards an information-driven services approach, commonly referred to as "Smart Services", for achieving improvements in support, maintenance and usage processes. The aim of this study is to conduct a literature review and empirical analysis to present a framework that describes a customer-oriented approach for developing information-driven services leveraged by the Internet of Things in manufacturing companies. The empirical study employed customer needs assessment tools to analyze the case company in terms of information requirements and digital needs. The literature review supported the empirical analysis with in-depth research on product lifecycle traceability and the digitalization of product-related services within manufacturing value chains, as well as the role of simulation-based technologies in supporting the "Smart Service" development process. The results obtained from the case company analysis show that customers mainly demand information that allows them to monitor machine conditions, machine behavior under different geographical conditions, machine-implement interactions, and resource and energy consumption. Put simply, they demand information outputs that allow them to increase machine productivity to maximize yields, save time and optimize resources in the most sustainable way. Based on the customer needs assessment, this study presents a framework describing the initial phases of a "Smart Service" development process, considering the requirements of Smart Engineering methodologies.
Abstract:
Due to the sensitive nature of patient data, the secondary use of electronic health records (EHR) is restricted in scientific research and product development. Such restrictions aim to preserve the privacy of the respective patients by limiting the availability and variety of sensitive patient data. Current limitations do not correspond to the actual needs of potential secondary users. In this thesis, the secondary use of Finnish and Swedish EHR data is explored for the purpose of enhancing the availability of such data for clinical research and product development. The EHR-related procedures and technologies involved are analysed to identify the issues limiting the secondary use of patient data. Successful secondary use of patient data increases the data's value. To explore the identified circumstances, a case study of potential secondary users and use intentions regarding EHR data was carried out in Finland and Sweden. The data collection for the case study was performed using semi-structured interviews. In total, 14 Finnish and Swedish experts representing scientific research, health management, and business were interviewed. The motivation for the interviews was to evaluate the protection of EHR data used for secondary purposes. The efficiency of the implemented procedures and technologies was analysed in terms of data availability and privacy preservation. The results of the case study show that the factors affecting EHR availability fall into three categories: management of patient data, preservation of patients' privacy, and potential secondary users. Identified issues regarding data management included laborious and inconsistent data request procedures and the role and effect of external service providers. Based on the study findings, two approaches enabling the secondary use of EHR data are identified: data alteration and a protected processing environment. Data alteration increases the availability of relevant EHR data but decreases the value of that data. The protected processing approach restricts the number of potential users and use intentions while providing more valuable data content.
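As a toy illustration of what a "data alteration" approach can involve (pseudonymisation of identifiers and generalisation of quasi-identifiers), the following sketch is hypothetical and is not the procedure evaluated in the thesis; the records, salt and field names are invented.

```python
# Toy data alteration: pseudonymise identifiers and generalise quasi-identifiers
# so that the altered records are more shareable but less precise.
import hashlib
import pandas as pd

records = pd.DataFrame({
    "patient_id": ["19001", "19002", "19003"],
    "age":        [34, 67, 52],
    "postcode":   ["00150", "00530", "33100"],
    "diagnosis":  ["J45", "I10", "E11"],        # ICD-10 codes
})

altered = records.copy()
salt = "local-secret-salt"                       # would be kept secret in practice
altered["patient_id"] = [
    hashlib.sha256((salt + pid).encode()).hexdigest()[:12]
    for pid in records["patient_id"]
]
altered["age"] = (records["age"] // 10 * 10).astype(str) + "s"   # 10-year bands
altered["postcode"] = records["postcode"].str[:3] + "**"          # truncated area
print(altered)
```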
Abstract:
Healthcare systems have assimilated information and communication technologies in order to improve the quality of healthcare and the patient's experience at reduced costs. The increasing digitalization of people's health information, however, raises new threats regarding information security and privacy. Accidental or deliberate breaches of health data may lead to societal pressures, embarrassment and discrimination. Information security and privacy are paramount to achieving high-quality healthcare services and, further, to not harming individuals when providing care. With that in mind, we give special attention to the category of Mobile Health (mHealth) systems, that is, the use of mobile devices (e.g., mobile phones, sensors, PDAs) to support medical and public health. Such systems have been particularly successful in developing countries, taking advantage of the flourishing mobile market and the need to expand the coverage of primary healthcare programs. Many mHealth initiatives, however, fail to address security and privacy issues. This, coupled with the lack of specific legislation for privacy and data protection in these countries, increases the risk of harm to individuals. The overall objective of this thesis is to enhance knowledge regarding the design of security and privacy technologies for mHealth systems. In particular, we deal with mHealth Data Collection Systems (MDCSs), which consist of mobile devices for collecting and reporting health-related data, replacing paper-based approaches for health surveys and surveillance. This thesis consists of publications contributing to mHealth security and privacy in various ways: a comprehensive literature review about mHealth in Brazil; the design of a security framework for MDCSs (SecourHealth); the design of an MDCS (GeoHealth); the design of a Privacy Impact Assessment template for MDCSs; and the study of ontology-based obfuscation and anonymisation functions for health data.
Abstract:
In recent years, there has been exponential growth in the use of virtual spaces, including dialogue systems, that handle personal information. The concept of personal privacy is much discussed and controversial in the literature, whereas in the technological field it directly influences the degree of reliability perceived in an information system (privacy 'as trust'). This work aims to protect the right to privacy over personal data (GDPR, 2018) and avoid the loss of sensitive content by exploring the sensitive information detection (SID) task. It is grounded in the following research questions: (RQ1) What does sensitive data mean? How can a domain of personal sensitive information be defined? (RQ2) How can a state-of-the-art model for SID be created? (RQ3) How should the model be evaluated? RQ1 theoretically investigates the concepts of privacy and the ontological state-of-the-art representation of personal information. The Data Privacy Vocabulary (DPV) is the taxonomic resource taken as the authoritative reference for the definition of the knowledge domain. Concerning RQ2, we investigate two approaches to classifying sensitive data: the first - bottom-up - explores automatic learning methods based on transformer networks; the second - top-down - proposes logical-symbolic methods with the construction of privaframe, a knowledge graph of compositional frames representing personal data categories. Both approaches are tested. For the evaluation - RQ3 - we create SPeDaC, a sentence-level labeled resource. This can be used as a benchmark or training resource for the SID task, filling the gap left by the lack of a shared resource in this field. While the approach based on artificial neural networks confirms the validity of the direction adopted in the most recent studies on SID, the logical-symbolic approach emerges as the preferred way to classify fine-grained personal data categories, thanks to the semantically grounded, tailored modeling it allows. At the same time, the results highlight the strong potential of hybrid architectures for solving automatic tasks.
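For orientation, a deliberately simple baseline for sentence-level sensitive-information detection (TF-IDF plus logistic regression) is sketched below. It stands in for neither the transformer-based nor the frame-based approach of this work, and the sentences and labels are invented.

```python
# Minimal SID baseline: classify sentences as sensitive vs. neutral.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "My home address is 12 Rose Street.",        # sensitive
    "I was diagnosed with asthma last year.",    # sensitive
    "The weather is lovely today.",              # neutral
    "The meeting starts at ten.",                # neutral
]
labels = ["sensitive", "sensitive", "neutral", "neutral"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["Please send it to my personal email account."]))
```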
Abstract:
Today’s data are increasingly complex, and classical statistical techniques need ever more refined mathematical tools to model and investigate them. Paradigmatic situations are represented by data which need to be considered up to some kind of transformation, and by all those circumstances in which the analyst needs to define a general concept of shape. Topological Data Analysis (TDA) is a field which is fundamentally contributing to such challenges by extracting topological information from data with a plethora of interpretable and computationally accessible pipelines. We contribute to this field by developing a series of novel tools, techniques and applications for working with a particular topological summary called the merge tree. To analyze sets of merge trees, we introduce a novel metric structure along with an algorithm to compute it, define a framework to compare different functions defined on merge trees, and investigate the metric space obtained with the aforementioned metric. Different geometric and topological properties of the space of merge trees are established, with the aim of obtaining a deeper understanding of such trees. To showcase the effectiveness of the proposed metric, we develop an application in the field of Functional Data Analysis, working with functions up to homeomorphic reparametrization, and in the field of radiomics, where each patient is represented via a clustering dendrogram.
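As a small, hypothetical illustration of the radiomics use case mentioned above, the following sketch builds a single-linkage clustering dendrogram (the merge tree of the corresponding distance filtration) from placeholder feature vectors; it does not implement the metric proposed in the thesis.

```python
# Summarise a patient's feature vectors by a clustering dendrogram (merge tree
# of the single-linkage filtration). Data are random placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
features = rng.normal(size=(12, 5))       # e.g. 12 lesions x 5 radiomic features

Z = linkage(features, method="single")    # merge history = the dendrogram
tree = dendrogram(Z, no_plot=True)        # structure only, no plotting
print(Z[:3])                              # first merges: [idx_a, idx_b, height, size]
```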
Abstract:
Diapoma is reviewed and four species are recognized: (1) Diapoma thauma, new species, from streams of the rio Jacuí basin, state of Rio Grande do Sul; (2) D. pyrrhopteryx, new species, collected from the rio Canoas and streams flowing into this basin in the states of Rio Grande do Sul and Santa Catarina, Brazil; (3) Diapoma terofali, from streams flowing into the rio Uruguay in Uruguay and Rio Grande do Sul, Brazil, and streams flowing into the rio de la Plata, Argentina; and (4) Diapoma speculiferum, from lowland coastal streams in Rio Grande do Sul, Brazil, and Uruguay. Diapoma pyrrhopteryx possesses the posteroventral opercular elongation typical of D. speculiferum, the type species of the genus, which is absent in D. thauma and D. terofali. Nonetheless, all the diapomin species have the caudal pouch organ about equally developed in both sexes and the dorsal portion of the pouch opening bordered by a series of 3 to 8 elongated scales, the two derived features that characterize the group. The two previously described species, D. speculiferum and D. terofali, are redescribed. Previous hypotheses of relationships among the diapomin genera Planaltina, Diapoma and Acrobrycon are discussed on the basis of preliminary morphological information. It is proposed that the Diapomini is a monophyletic group. An identification key and information on sexual dimorphism, gonad anatomy, reproductive mode and distribution of the species of Diapoma are provided.
Abstract:
A catalogue is provided of the type material of four "Acalyptrate" superfamilies (Conopoidea, Diopsoidea, Nerioidea and Tephritoidea) held in the collection of the Museu de Zoologia da Universidade de São Paulo (MZUSP), São Paulo, Brazil. Concerning the taxa dealt with herein, the Diptera collection of the MZUSP holds 77 holotypes, 4 "allotypes" and 194 paratypes. In this paper, information about data labels, preservation and missing structures of the type specimens is given.
Abstract:
The aim was to investigate stress intensity and coping style in older people with mild Alzheimer's disease. The potential risk assessment of a stressful event and the devising of coping strategies depend on cognitive function. Although older individuals with Alzheimer's disease present significant cognitive impairment, little is known about how these individuals experience stressful events and select coping strategies in stressful situations. The design was a survey. A convenience sample of 30 cognitively healthy older people and 30 individuals with mild Alzheimer's disease was assessed with a battery covering stress indicators (Symptom Stress List, Cornell Scale for Depression in Dementia, State-Trait Anxiety Inventory), coping style (Jalowiec Coping Scale) and cognitive performance (Mini-Mental State Examination), applied in both groups. Statistical analysis of the data employed the Mann-Whitney test to compare medians of stress indicators and coping style, Fisher's exact test to compare proportions when expected frequencies were lower than five, and Spearman's correlation coefficient to assess the correlation between coping style and cognitive performance. Both groups experienced the same stress intensity (p = 0.254). Regarding coping styles, although the differences were not statistically significant (p = 0.124), emotion-oriented coping was predominant in the patients with Alzheimer's disease. However, those individuals displaying better cognitive performance in the Alzheimer's disease group selected coping strategies focused on problem solving (p = 0.0074). Despite a tendency for older people with Alzheimer's disease to select escape strategies and emotional control rather than attempting to resolve or lessen the consequences arising from a problem, coping ultimately depends on the cognitive performance of the individual. The findings of this study provide information and data to assist the planning of appropriate support care, based on cognitive performance, for individuals with Alzheimer's disease who experience stressful situations.
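For readers unfamiliar with the statistical tests named above, a minimal sketch using scipy.stats follows; the numbers are placeholders, not the study's data.

```python
# The three tests used in the analysis, on made-up placeholder data.
from scipy import stats

# Mann-Whitney U: compare stress-indicator medians between the two groups
stress_control = [12, 15, 11, 18, 14]
stress_ad      = [13, 16, 12, 17, 15]
print(stats.mannwhitneyu(stress_control, stress_ad))

# Fisher's exact test: 2x2 table with small expected counts
table = [[3, 7],
         [8, 2]]
print(stats.fisher_exact(table))

# Spearman correlation: coping style score vs. MMSE cognitive performance
coping = [20, 25, 22, 30, 28]
mmse   = [18, 22, 20, 24, 23]
print(stats.spearmanr(coping, mmse))
```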