970 results for data collections
Abstract:
This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections from text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models in automating the identification of indirect associations underpinning literature-based discovery has been demonstrated extensively in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks such as predicting future disruptive innovations. In this paper we perform a computational complexity analysis on four successful corpus-based distributional models to evaluate their suitability for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.
Abstract:
Collecting regular personal reflections from first year teachers in rural and remote schools is challenging as they are busily absorbed in their practice, and separated from each other and the researchers by thousands of kilometres. In response, an innovative web-based solution was designed to both collect data and be a responsive support system for early career teachers as they came to terms with their new professional identities within rural and remote school settings. Using an emailed link to a web-based application named goingok.com, the participants are charting their first year plotlines using a sliding scale from ‘distressed’, ‘ok’ to ‘soaring’ and describing their self-assessment in short descriptive posts. These reflections are visible to the participants as a developing online journal, while the collections of de-identified developing plotlines are visible to the research team, alongside numerical data. This paper explores important aspects of the design process, together with the challenges and opportunities encountered in its implementation. A number of the key considerations for choosing to develop a web application for data collection are initially identified, and the resultant application features and scope are then examined. Examples are then provided about how a responsive software development approach can be part of a supportive feedback loop for participants while being an effective data collection process. Opportunities for further development are also suggested with projected implications for future research.
Abstract:
Data associated with germplasm collections are typically large and multivariate with a considerable number of descriptors measured on each of many accessions. Pattern analysis methods of clustering and ordination have been identified as techniques for statistically evaluating the available diversity in germplasm data. While used in many studies, the approaches have not dealt explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions). To consider the application of these techniques to germplasm evaluation data, 11328 accessions of groundnut (Arachis hypogaea L.) from the International Crops Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the rainy and post-rainy growing seasons were used. The ordination technique of principal component analysis was used to reduce the dimensionality of the germplasm data. The identification of phenotypically similar groups of accessions within large-scale data via the computationally intensive hierarchical clustering techniques was not feasible and non-hierarchical techniques had to be used. Finite mixture models that maximise the likelihood of an accession belonging to a cluster were used to cluster the accessions in this collection. The patterns of response for the different growing seasons were found to be highly correlated. However, in relating the results to passport and other characterisation and evaluation descriptors, the observed patterns did not appear to be related to taxonomy or any other well-known characteristics of groundnut.
Abstract:
As a sequel to a paper that dealt with the analysis of two-way quantitative data in large germplasm collections, this paper presents analytical methods appropriate for two-way data matrices consisting of mixed data types, namely, ordered multicategory and quantitative data types. While various pattern analysis techniques have been identified as suitable for analysis of the mixed data types which occur in germplasm collections, the clustering and ordination methods used often cannot deal explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions) with incomplete information. However, it is shown that the ordination technique of principal component analysis and the mixture maximum likelihood method of clustering can be employed to achieve such analyses. Germplasm evaluation data for 11436 accessions of groundnut (Arachis hypogaea L.) from the International Crops Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the post-rainy season and five ordered multicategory descriptors were used. Pattern analysis results generally indicated that the accessions could be distinguished into four regions along the continuum of growth habit (or plant erectness). Interpretation of accession membership in these regions was found to be consistent with taxonomic information, such as subspecies. Each growth habit region contained accessions from three of the most common groundnut botanical varieties. This implies that within each of the habit types there is the full range of expression for the other descriptors used in the analysis. Using these types of insights, the patterns of variability in germplasm collections can provide scientists with valuable information for their plant improvement programs.
Abstract:
Data in germplasm collections contain a mixture of data types: binary, multistate and quantitative. Given the multivariate nature of these data, the pattern analysis methods of classification and ordination have been identified as suitable techniques for statistically evaluating the available diversity. The proximity (or resemblance) measure, which is in part the basis of the complementary nature of classification and ordination techniques, is often specific to particular data types. The use of a combined resemblance matrix has an advantage over data type specific proximity measures. This measure accommodates the different data types without manipulating them to be of a specific type. Descriptors are partitioned into their data types and an appropriate proximity measure is used on each. The separate proximity matrices, after range standardisation, are added as a weighted average and the combined resemblance matrix is then used for classification and ordination. Germplasm evaluation data for 831 accessions of groundnut (Arachis hypogaea L.) from the Australian Tropical Field Crops Genetic Resource Centre, Biloela, Queensland were examined. Data for four binary, five ordered multistate and seven quantitative descriptors have been documented. The interpretative value of different weightings (equal and unequal weighting of data types to obtain a combined resemblance matrix) was investigated by using principal co-ordinate analysis (ordination) and hierarchical cluster analysis. Equal weighting of data types was found to be more valuable for these data as the results provided a greater insight into the patterns of variability available in the Australian groundnut germplasm collection. The complementary nature of pattern analysis techniques enables plant breeders to identify relevant accessions in relation to the descriptors which distinguish amongst them. This additional information may provide plant breeders with a more defined entry point into the germplasm collection for identifying sources of variability for their plant improvement program, thus improving the utilisation of germplasm resources.
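The combined resemblance matrix described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say which proximity measure was used for each data type, so the simple mismatch proportion (binary), Manhattan distance (ordered multistate) and Euclidean distance (quantitative) are stand-ins, and the three toy accessions and all function names are hypothetical.

```python
from math import sqrt

def pairwise(rows, dist):
    """Square dissimilarity matrix for a list of descriptor tuples."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Assumed per-type measures; the abstract does not name the ones actually used.
def mismatch(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_standardise(matrix):
    """Rescale all entries to [0, 1] using (value - min) / (max - min)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[0.0 if span == 0 else (v - lo) / span for v in row] for row in matrix]

def combine(matrices, weights):
    """Weighted average of equally sized, range-standardised matrices."""
    total = sum(weights)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total for j in range(n)]
        for i in range(n)
    ]

# Hypothetical toy data: three accessions, descriptors partitioned by data type.
binary = [(1, 0, 1, 1), (1, 0, 0, 1), (0, 1, 0, 0)]
ordered = [(1, 2), (2, 2), (5, 4)]
quantitative = [(10.0, 3.0), (12.0, 3.5), (20.0, 7.0)]

parts = [
    range_standardise(pairwise(binary, mismatch)),
    range_standardise(pairwise(ordered, manhattan)),
    range_standardise(pairwise(quantitative, euclidean)),
]
combined = combine(parts, weights=[1.0, 1.0, 1.0])  # equal weighting of data types
```

Changing `weights` to unequal values gives the alternative weighting the abstract compares against; the resulting `combined` matrix is what would then be passed to an ordination or clustering routine.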
Abstract:
Herbarium accession data offer a useful historical botanical perspective and have been used to track the spread of plant invasions through time and space. Nevertheless, few studies have utilised this resource for genetic analysis to reconstruct a more complete picture of historical invasion dynamics, including the occurrence of separate introduction events. In this study, we combined nuclear and chloroplast microsatellite analyses of contemporary and historical collections of Senecio madagascariensis, a globally invasive weed first introduced to Australia c. 1918 from its native South Africa. Analysis of nuclear microsatellites, together with temporal spread data and simulations of herbarium voucher sampling, revealed distinct introductions to south-eastern Australia and mid-eastern Australia. Genetic diversity of the south-eastern invasive population was lower than in the native range, but higher than in the mid-eastern invasion. In the invasive range, despite its low resolution, our chloroplast microsatellite data revealed the occurrence of new haplotypes over time, probably as the result of subsequent introduction(s) to Australia from the native range during the latter half of the 20th century. Our work demonstrates how molecular studies of contemporary and historical field collections can be combined to reconstruct a more complete picture of the invasion history of introduced taxa. Further, our study indicates that a survey of contemporary samples only (as undertaken for the majority of invasive species studies) would be insufficient to identify potential source populations and occurrence of multiple introductions.
Abstract:
Many organizations realize that increasing amounts of data (“Big Data”) need to be dealt with intelligently in order to compete with other organizations in terms of efficiency, speed and services. The goal is not to collect as much data as possible, but to turn event data into valuable insights that can be used to improve business processes. However, data-oriented analysis approaches fail to relate event data to process models. At the same time, large organizations are generating piles of process models that are disconnected from the real processes and information systems. In this chapter we propose to manage large collections of process models and event data in an integrated manner. Observed and modeled behavior need to be continuously compared and aligned. This results in a “liquid” business process model collection, i.e. a collection of process models that is in sync with the actual organizational behavior. The collection should self-adapt to evolving organizational behavior and incorporate relevant execution data (e.g. process performance and resource utilization) extracted from the logs, thereby allowing insightful reports to be produced from factual organizational data.
Abstract:
QUT Library Research Support has simplified and streamlined the process of research data management planning, storage, discovery and reuse through collaboration and the use of integrated and tailored online tools, and a simplification of the metadata schema. This poster presents the integrated data management services at QUT, including QUT’s Data Management Planning Tool, Research Data Finder, Spatial Data Finder and Software Finder, and information on the simplified Registry Interchange Format – Collections and Services (RIF-CS) Schema. The QUT Data Management Planning (DMP) Tool was built using the Digital Curation Centre’s DMP Online Tool and modified to QUT’s needs and policies. The tool allows researchers and Higher Degree Research students to plan how to handle research data throughout the active phase of their research. The plan is promoted as a ‘live’ document and researchers are encouraged to update it as required. The information entered into the plan can be made private or shared with supervisors, project members and external examiners. A plan is mandatory when requesting storage space on the QUT Research Data Storage Service. QUT’s Research Data Finder is integrated with QUT’s Academic Profiles and the Data Management Planning Tool to create a seamless data management process. This process aims to encourage the creation of high quality rich records which facilitate discovery and reuse of quality data. The Registry Interchange Format – Collections and Services (RIF-CS) Schema that is used in the QUT Research Data Finder was simplified to “RIF-CS lite” to reflect mandatory and optional metadata requirements. RIF-CS lite removed schema fields that were underused or extra to the needs of the users and system. This has reduced the number of metadata fields required from users and made system integration far simpler: field content is easily shared across services, making the process of collecting metadata as transparent as possible.
Abstract:
Developing and maintaining a successful institutional repository for research publications requires a considerable investment by the institution. Most of the money is spent on developing the skill-sets of existing staff or hiring new staff with the necessary skills. The return on this investment can be magnified by using this valuable infrastructure to curate collections of other materials such as learning objects, student work, conference proceedings and institutional or local community heritage materials. When Queensland University of Technology (QUT) implemented its repository for research publications (QUT ePrints) over 11 years ago, it was one of the first institutional repositories to be established in Australia. Currently, the repository holds over 29,000 open access research publications and the cumulative total number of full-text downloads for these documents now exceeds 16 million. The full-text deposit rate for recently published peer-reviewed papers (currently over 74%) shows how well the repository has been embraced by QUT researchers. The success of QUT ePrints has resulted in requests to accommodate a plethora of materials which are ‘out of scope’ for this repository. QUT Library saw this as an opportunity to use its repository infrastructure (software, technical know-how and policies) to develop and implement a metadata repository for its research datasets (QUT Research Data Finder), a repository for research-related software (QUT Software Finder) and to curate a number of digital collections of institutional and local community heritage materials (QUT Digital Collections). This poster describes the repositories and digital collections curated by QUT Library and outlines the value delivered to the institution, and the wider community, by these initiatives.
Abstract:
The mountain yellow-legged frog Rana muscosa sensu lato, once abundant in the Sierra Nevada of California and Nevada, and the disjunct Transverse Ranges of southern California, has declined precipitously throughout its range, even though most of its habitat is protected. The species is now extinct in Nevada and reduced to tiny remnants in southern California, where as a distinct population segment, it is classified as Endangered. Introduced predators (trout), air pollution and an infectious disease (chytridiomycosis) threaten remaining populations. A Bayesian analysis of 1901 base pairs of mitochondrial DNA confirms the presence of two deeply divergent clades that come into near contact in the Sierra Nevada. Morphological studies of museum specimens and analysis of acoustic data show that the two major mtDNA clades are readily differentiated phenotypically. Accordingly, we recognize two species, Rana sierrae, in the northern and central Sierra Nevada, and R. muscosa, in the southern Sierra Nevada and southern California. Existing data indicate no range overlap. These results have important implications for the conservation of these two species as they illuminate a profound mismatch between the current delineation of the distinct population segments (southern California vs. Sierra Nevada) and actual species boundaries. For example, our study finds that remnant populations of R. muscosa exist in both the southern Sierra Nevada and the mountains of southern California, which may broaden options for management. In addition, despite the fact that only the southern California populations are listed as Endangered, surveys conducted since 1995 at 225 historic (1899-1994) localities from museum collections show that 93.3% (n=146) of R. sierrae populations and 95.2% (n=79) of R. muscosa populations are extinct. Evidence presented here underscores the need for revision of protected population status to include both species throughout their ranges.
Abstract:
The red porgy, Pagrus pagrus, is an important reef fish in several offshore fisheries along the southeastern United States. We examined samples from North Carolina through southeast Florida from recreational (headboat) and commercial (hook and line) fisheries, as well as samples from a fishery-independent source. Red porgy attain a maximum age of at least 18 years and 733 mm total length. The weight-length relationship is represented by the ln-ln transformed equation: $W = 8.85 \times 10^{-6} L^{3.06}$, where W = whole weight in grams, and L = total length in mm. The von Bertalanffy growth equation fitted to the most recent, back-calculated lengths from all the samples is $L_t = 644\,(1 - e^{-0.15(t + 0.76)})$. Our study revealed a difference in mean length at age of red porgy from the three sources. Red porgy in fishery-independent collections were smaller at age than specimens examined from fishery-dependent sources. The difference in length-at-age may be related to gear selectivity and have important consequences in the assessment of fish stocks.
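As a quick illustration of how the two fitted equations in this abstract can be evaluated, the following sketch uses only the coefficients quoted above. The function names are hypothetical, and it assumes that age t is in years and that L_t is total length in mm (the abstract states the length unit only for the weight-length relationship).

```python
from math import exp

def weight_g(total_length_mm):
    """Whole weight in grams from total length in mm: W = 8.85e-6 * L^3.06."""
    return 8.85e-6 * total_length_mm ** 3.06

def length_at_age_mm(age_years):
    """von Bertalanffy growth curve: L_t = 644 * (1 - exp(-0.15 * (t + 0.76))).

    Assumes age in years and length in mm; 644 is the asymptotic length,
    0.15 the growth coefficient and -0.76 the theoretical age at zero length.
    """
    return 644.0 * (1.0 - exp(-0.15 * (age_years + 0.76)))

# Predicted length and weight at a few ages along the fitted curve.
for age in (1, 5, 10, 18):
    length = length_at_age_mm(age)
    print(f"age {age:2d} y: {length:6.1f} mm, {weight_g(length):7.1f} g")
```

Because the curve approaches 644 mm asymptotically, predicted lengths stay below that value at every age; individual fish can exceed it, as the 733 mm maximum reported in the abstract shows.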
Abstract:
This workshop followed on from two previous workshops held in Colombo, Sri Lanka, in 2012 and Kochi, India, in 2013. The 14 microsatellite markers that had previously been developed for Indian Mackerel (Rastrelliger kanagurta) were used to genotype 31 tissue collections from all eight countries; the genotyping was carried out in India.
Abstract:
ARK (‘Access Research Knowledge’) was set up with a single goal: to make social science information on Northern Ireland available to the widest possible audience. The most well-known and widely used part of the ARK resource is CAIN (Conflict Archive on the INternet), which is one of the largest on-line collections of source material and information about the Northern Ireland conflict. The compilation of CAIN's new Remembering: Victims, Survivors and Commemoration section raised issues related to the sensitivity of the material, as it feeds into the fundamental debate on the legacy of the Northern Ireland conflict. It also raises the fundamental question of to what extent archiving is a neutral or political activity, and necessitates a discourse on responsibility and ethics among social researchers. Experiences from the establishment of the Northern Ireland Qualitative Archive (NIQA) shed light on future possibilities with regard to qualitative archives on the Northern Ireland conflict.
Abstract:
Single-zone modelling is used to assess different collections of impeller 1D loss models. Three collections of loss models have been identified in literature, and the background to each of these collections is discussed. Each collection is evaluated using three modern automotive turbocharger style centrifugal compressors; comparisons of performance for each of the collections are made. An empirical data set taken from standard hot gas stand tests for each turbocharger is used as a baseline for comparison. Compressor range is predicted in this study; impeller diffusion ratio is shown to be a useful method of predicting compressor surge in 1D, and choke is predicted using basic compressible flow theory. The compressor designer can use this as a guide to identify the most compatible collection of losses for turbocharger compressor design applications. The analysis indicates the most appropriate collection for the design of automotive turbocharger centrifugal compressors.