867 results for data-basemanagement
Abstract:
A tag-based item recommendation method generates an ordered list of items likely to interest a particular user, using the user's past tagging behaviour. However, users' tagging behaviour varies across tagging systems. A potential problem in generating quality recommendations is how to build user profiles that interpret user behaviour so that it can be used effectively in recommendation models. Generally, recommendation methods are made to work with specific types of user profiles and may not work well with different datasets. In this paper, we investigate several tagging data interpretation and representation schemes that can lead to building an effective user profile. We discuss the various benefits a scheme brings to a recommendation method by highlighting the representative features of user tagging behaviours on a specific dataset. Empirical analysis shows that each interpretation scheme forms a distinct data representation, which eventually affects the recommendation result. Results on various datasets show that an interpretation scheme should be selected based on the dominant usage in the tagging data (i.e. whether a higher number of tags or a higher number of items is present). This usage represents the characteristic of user tagging behaviour in the system. The results also demonstrate how the scheme is able to address the cold-start user problem.
Abstract:
Bactrocera papayae Drew & Hancock, Bactrocera philippinensis Drew & Hancock, Bactrocera carambolae Drew & Hancock, and Bactrocera invadens Drew, Tsuruta & White are four horticultural pest tephritid fruit fly species that are highly similar, morphologically and genetically, to the destructive pest, the Oriental fruit fly, Bactrocera dorsalis (Hendel) (Diptera: Tephritidae). This similarity has rendered the discovery of reliable diagnostic characters problematic, which, in view of the economic importance of these taxa and the international trade implications, has resulted in ongoing difficulties for many areas of plant protection and food security. Consequently, a major international collaborative and integrated multidisciplinary research effort was initiated in 2009 to build upon existing literature with the specific aim of resolving biological species limits among B. papayae, B. philippinensis, B. carambolae, B. invadens and B. dorsalis to overcome constraints to pest management and international trade. Bactrocera philippinensis has recently been synonymized with B. papayae as a result of this initiative and this review corroborates that finding; however, the other names remain in use. While consistent characters have been found to reliably distinguish B. carambolae from B. dorsalis, B. invadens and B. papayae, no such characters have been found to differentiate the latter three putative species. We conclude that B. carambolae is a valid species and that the remaining taxa, B. dorsalis, B. invadens and B. papayae, represent the same species. Thus, we consider B. dorsalis (Hendel) as the senior synonym of B. papayae Drew and Hancock syn.n. and B. invadens Drew, Tsuruta & White syn.n. A redescription of B. dorsalis is provided. Given the agricultural importance of B. dorsalis, this taxonomic decision will have significant global plant biosecurity implications, affecting pest management, quarantine, international trade, postharvest treatment and basic research. 
Throughout the paper, we emphasize the value of independent and multidisciplinary tools in delimiting species, particularly in complicated cases involving morphologically cryptic taxa.
Abstract:
This thesis presents a novel program parallelization technique incorporating dynamic and static scheduling. It utilizes a problem-specific pattern developed from prior knowledge of the targeted problem abstraction. Suitable for solving complex parallelization problems such as data-intensive all-to-all comparison constrained by memory, the technique delivers more robust and faster task scheduling than state-of-the-art techniques. The technique achieves good performance in data-intensive bioinformatics applications.
Abstract:
This pilot project investigated the existing practices and processes of Proficient, Highly Accomplished and Lead teachers in the interpretation, analysis and implementation of National Assessment Program – Literacy and Numeracy (NAPLAN) data. A qualitative case study approach was the chosen methodology, with nine teachers across a variety of school sectors interviewed. Themes and sub-themes were identified from the participants’ interview responses revealing the ways in which Queensland teachers work with NAPLAN data. The data illuminated that generally individual schools and teachers adopted their own ways of working with data, with approaches ranging from individual/ad hoc, to hierarchical or a whole school approach. Findings also revealed that data are the responsibility of various persons from within the school hierarchy; some working with the data electronically whilst others rely on manual manipulation. Manipulation of data is used for various purposes including tracking performance, value adding and targeting programmes for specific groups of students, for example the gifted and talented. Whilst all participants had knowledge of intervention programmes and how practice could be modified, there were large inconsistencies in knowledge and skills across schools. Some see the use of data as a mechanism for accountability, whilst others mention data with regards to changing the school culture and identifying best practice. Overall, the findings showed inconsistencies in approach to focus area 5.4. Recommendations therefore include a more national approach to the use of educational data.
Abstract:
In this chapter, we draw out the relevant themes from a range of critical scholarship from the small body of digital media and software studies work that has focused on the politics of Twitter data and the sociotechnical means by which access is regulated. We highlight in particular the contested relationships between social media research (in both academic and non-academic contexts) and the data wholesale, retail, and analytics industries that feed on them. In the second major section of the chapter we discuss in detail the pragmatic edge of these politics in terms of what kinds of scientific research are and are not possible in the current political economy of Twitter data access. Finally, at the end of the chapter we return to the much broader implications of these issues for the politics of knowledge, demonstrating how the apparently microscopic level of how the Twitter API mediates access to Twitter data actually inscribes and influences the macro level of the global political economy of science itself, through re-inscribing institutional and traditional disciplinary privilege. We conclude with some speculations about future developments in data rights and data philanthropy that may at least mitigate some of these negative impacts.
Abstract:
Monitoring the environment with acoustic sensors is an effective method for understanding changes in ecosystems. Through extensive monitoring, large-scale, ecologically relevant datasets can be produced that can inform environmental policy. The collection of acoustic sensor data is a solved problem; the current challenge is the management and analysis of raw audio data to produce useful datasets for ecologists. This paper presents the applied research we use to analyze big acoustic datasets. Its core contribution is the presentation of practical large-scale acoustic data analysis methodologies. We describe details of the data workflows we use to provide both citizen scientists and researchers with practical access to large volumes of ecoacoustic data. Finally, we propose a work-in-progress large-scale architecture for analysis driven by a hybrid cloud-and-local production-grade website.
Abstract:
This program of research linked police and health data collections to investigate the potential benefits for road safety in terms of enhancing the quality of data. This research has important implications for road safety because, although police-collected data have historically underpinned efforts in the area, it is known that many road crashes are not reported to police and that these data lack specific injury severity information. This research shows that data linkage provides a more accurate quantification of the severity and prevalence of road crash injuries, which is essential for prioritising funding, targeting interventions, and estimating the burden and cost of road trauma.
Abstract:
Transit passenger market segmentation enables transit operators to target different classes of transit users for targeted surveys and various operational and strategic planning improvements. However, the existing market segmentation studies in the literature have generally been done using passenger surveys, which have various limitations. The smart card (SC) data from an automated fare collection system facilitate the understanding of the multiday travel pattern of transit passengers and can be used to segment them into identifiable types of similar behaviors and needs. This paper proposes a comprehensive methodology for passenger segmentation solely using SC data. After reconstructing the travel itineraries from SC transactions, this paper adopts the density-based spatial clustering of applications with noise (DBSCAN) algorithm to mine the travel pattern of each SC user. An a priori market segmentation approach then segments transit passengers into four identifiable types. The methodology proposed in this paper helps transit operators to understand their passengers and to provide them with targeted information and services.
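The DBSCAN step mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes, purely for illustration, that one cardholder's boarding times are given as minutes after midnight, and the values of `eps` and `min_pts` and the sample data are hypothetical. Dense groups of boarding times become clusters (candidate regular trips), and isolated boardings are labelled as noise.

```python
def dbscan_1d(points, eps, min_pts):
    """Label each 1-D point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point: border
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep expanding
    return labels


# Hypothetical boarding times (minutes after midnight) for one smart card.
boarding_minutes = [452, 455, 458, 461, 1035, 1040, 1043, 1047, 780]
labels = dbscan_1d(boarding_minutes, eps=10, min_pts=3)
```

With this sample, the morning and late-afternoon boardings form two clusters and the single midday boarding is labelled as noise. A real pipeline would more likely cluster on several features (for example time and stop location) and use a library implementation.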
Abstract:
High-Order Co-Clustering (HOCC) methods have attracted considerable attention in recent years because of their ability to cluster multiple types of objects simultaneously using all available information. During the clustering process, HOCC methods exploit object co-occurrence information, i.e., inter-type relationships amongst different types of objects, as well as object affinity information, i.e., intra-type relationships amongst objects of the same type. However, it is difficult to learn accurate intra-type relationships in the presence of noise and outliers. Existing HOCC methods consider the p nearest neighbours based on Euclidean distance for the intra-type relationships, which leads to incomplete and inaccurate intra-type relationships. In this paper, we propose a novel HOCC method that incorporates multiple subspace learning with a heterogeneous manifold ensemble to learn complete and accurate intra-type relationships. Multiple subspace learning reconstructs the similarity between any pair of objects that belong to the same subspace. The heterogeneous manifold ensemble is created based on two types of intra-type relationships learnt using a p-nearest-neighbour graph and multiple subspace learning. Moreover, to ensure the robustness of the clustering process, we introduce a sparse error matrix into the matrix decomposition and develop a novel iterative algorithm. Empirical experiments show that the proposed method achieves improved results over state-of-the-art HOCC methods in terms of FScore and NMI.
Abstract:
This thesis proposes three novel models which extend the statistical methodology for motor unit number estimation, a clinical neurology technique. Motor unit number estimation is important in the treatment of degenerative muscular diseases and, potentially, spinal injury. Additionally, a recent and untested statistic to enable statistical model choice is found to be a practical alternative for larger datasets. The existing methods for dose finding in dual-agent clinical trials are found to be suitable only for designs of modest dimensions. The model choice case-study is the first of its kind containing interesting results using so-called unit information prior distributions.
Abstract:
When crystallization screening is conducted, many outcomes are observed, but typically the only trial recorded in the literature is the condition that yielded the crystal(s) used for subsequent diffraction studies. The initial hit that was optimized and the results of all the other trials are lost. These missing results contain information that would be useful for an improved general understanding of crystallization. This paper provides a report of a crystallization data exchange (XDX) workshop organized by several international large-scale crystallization screening laboratories to discuss how this information may be captured and utilized. A group that administers a significant fraction of the world's crystallization screening results was convened, together with chemical and structural data informaticians and computational scientists who specialize in creating and analysing large disparate data sets. The development of a crystallization ontology for the crystallization community was proposed. This paper (by the attendees of the workshop) provides the thoughts and rationale leading to this conclusion. This is brought to the attention of the wider audience of crystallographers so that they are aware of these early efforts and can contribute to the process going forward. © 2012 International Union of Crystallography. All rights reserved.
Abstract:
Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole - term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering. Copyright 2014 ACM.
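To make the contrast between composition-naive and composition-aware distances concrete, here is a small sketch. The abstract does not say which measures the authors evaluated; the Aitchison distance (Euclidean distance after a centred log-ratio transform) is used here as one standard composition-aware measure, and the sample term proportions are hypothetical. The sketch assumes strictly positive parts, so zero counts would first need some replacement strategy.

```python
import math


def clr(composition):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def euclidean_distance(x, y):
    """Plain Euclidean distance, which ignores the compositional constraint."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def aitchison_distance(x, y):
    """Euclidean distance between clr-transformed compositions."""
    return euclidean_distance(clr(x), clr(y))


# Hypothetical term proportions for two pairs of documents.
# In both pairs one term gains 0.01 and another loses 0.01.
rare_a, rare_b = [0.01, 0.49, 0.50], [0.02, 0.48, 0.50]      # rare term doubles
common_a, common_b = [0.30, 0.20, 0.50], [0.31, 0.19, 0.50]  # small relative change

naive_rare = euclidean_distance(rare_a, rare_b)
naive_common = euclidean_distance(common_a, common_b)
aware_rare = aitchison_distance(rare_a, rare_b)
aware_common = aitchison_distance(common_a, common_b)
```

The Euclidean distance rates both pairs as equally far apart, although in the first pair a rare term doubles its share. The Aitchison distance depends on ratios between parts, so it rates the first pair as further apart, and it gives the same result whether it is fed raw counts or proportions.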
Abstract:
Due to the availability of a huge number of Web services, finding an appropriate Web service according to the requirements of a service consumer is still a challenge. Moreover, sometimes a single Web service is unable to fully satisfy the requirements of the service consumer. In such cases, combinations of multiple inter-related Web services can be utilised. This paper proposes a method that first utilises a semantic kernel model to find related services and then models these related Web services as nodes of a graph. An all-pairs shortest-path algorithm is applied to find the best compositions of Web services that are semantically related to the service consumer requirement. Finally, individual and composite Web services are recommended for a service request. Empirical evaluation confirms that the proposed method significantly improves the accuracy of service discovery in comparison to traditional keyword-based discovery methods.
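The graph step described above can be sketched with the Floyd-Warshall algorithm, one common all-pairs shortest-path method; the abstract does not name the specific algorithm used. In this sketch the service names are hypothetical, and the edge weights are assumed to be semantic dissimilarities between services that can be chained, so a shorter path stands for a more closely related composition.

```python
from math import inf


def floyd_warshall(nodes, edges):
    """Return (dist, nxt): shortest distances and next-hop table for all pairs."""
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for (u, v), weight in edges.items():
        dist[u][v] = weight
        nxt[u][v] = v
    for k in nodes:                      # allow k as an intermediate service
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def composition(nxt, source, target):
    """Rebuild the chain of services from source to target ([] if unreachable)."""
    if source == target:
        return [source]
    if nxt[source][target] is None:
        return []
    chain = [source]
    while source != target:
        source = nxt[source][target]
        chain.append(source)
    return chain


# Hypothetical services and directed dissimilarity weights.
services = ["geocode", "weather", "route", "booking"]
links = {
    ("geocode", "weather"): 0.2,
    ("weather", "booking"): 0.3,
    ("geocode", "booking"): 0.9,
    ("geocode", "route"): 0.4,
    ("route", "booking"): 0.6,
}
dist, nxt = floyd_warshall(services, links)
best_chain = composition(nxt, "geocode", "booking")
```

Here the chain through `weather` beats the direct link from `geocode` to `booking` because its total weight is lower. How the weights are derived from the semantic kernel model is outside this sketch.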