957 results for Imbalanced datasets
Abstract:
At QUT, research data refers to information that is generated or collected for use as a primary source in the production of original research results, and which would be required to validate or replicate research findings (Callan, De Vine, & Baker, 2010). Making publicly funded research data discoverable by the broader research community and the public is a key aim of the Australian National Data Service (ANDS). Queensland University of Technology (QUT) has been innovating in this space by undertaking mutually dependent technical and content (metadata) focused projects funded by ANDS. Research Data Librarians identified and described datasets generated from Category 1 funded research at QUT by interviewing researchers, collecting metadata and fashioning metadata records for upload to the Australian Research Data Commons (ARDC) and exposure through the Research Data Australia (RDA) interface. In parallel with this project, a Research Data Management Service and a Metadata Hub project were being undertaken by QUT High Performance Computing & Research Support specialists. These projects will collectively store and aggregate QUT's metadata and research data from multiple repositories and administration systems, and will contribute metadata directly to RDA via an OAI-PMH compliant feed. The pioneering nature of the work resulted in a collaborative project dynamic in which good data management practices and the discoverability and sharing of research data were the shared drivers for all activity. Each project's development and progress depended on feedback from the other: the metadata structure evolved in tandem with the development of the repository, and the repository interface developed in response to the needs of the data interview process. The project environment was one of bottom-up collaborative approaches to process and system development, matched by top-down strategic alliances crossing organisational boundaries to provide the deliverables required by ANDS. This paper showcases the work undertaken at QUT, focusing on the Seeding the Commons project as a case study, and illustrates how the data management projects are interconnected. It describes the processes and systems being established to make QUT research data more visible and the nature of the collaborations between organisational areas required to achieve this. The paper concludes with the Seeding the Commons project outcomes and the contribution this project made to getting more research data 'out there'.
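The metadata contribution mechanism mentioned above is the standard OAI-PMH protocol. The following is a minimal, hypothetical sketch of how an aggregator such as RDA could harvest Dublin Core records from a repository feed; the endpoint URL is a placeholder, not QUT's actual service.

```python
# Hypothetical sketch: an OAI-PMH ListRecords request against a repository
# feed, as an aggregator would issue. BASE is a placeholder endpoint;
# oai_dc is the baseline Dublin Core metadata format the protocol requires.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://repository.example.edu/oai"          # placeholder endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as r:
    tree = ET.parse(r)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}
for record in tree.iterfind(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    if title is not None:
        print(title.text)                            # one harvested title
```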
Abstract:
In automatic facial expression detection, very accurate registration is desired; this can be achieved via a deformable model approach in which a dense mesh of 60-70 points on the face is used, such as an active appearance model (AAM). However, for applications where manually labelling frames is prohibitive, AAMs do not work well because they do not generalise well to unseen subjects. In such cases, a coarser approach is taken for person-independent facial expression detection, where just a couple of key features (such as the face and eyes) are tracked using a Viola-Jones type approach. The tracked image is normally post-processed with a linear bank of filters to encode shift and illumination invariance. Recently, it was shown that this preprocessing step is of no benefit once close to ideal registration has been obtained. In this paper, we present a system based on the Constrained Local Model (CLM), a generic (person-independent) face alignment algorithm that achieves high accuracy. We compare these results against LBP feature extraction on the CK+ and GEMEP datasets.
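For readers unfamiliar with the baseline pipeline being compared against, the sketch below shows coarse Viola-Jones registration followed by LBP histogram features. It is an illustrative reconstruction, not the paper's CLM system; the grid size, LBP parameters and input filename are assumptions.

```python
# Hypothetical sketch: Viola-Jones face tracking + LBP features, i.e. the
# coarse person-independent baseline the abstract compares the CLM against.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, grid=(7, 7), points=8, radius=1):
    """Divide a face crop into a grid and concatenate per-cell
    uniform-LBP histograms (a common expression descriptor)."""
    lbp = local_binary_pattern(gray_face, points, radius, method="uniform")
    n_bins = points + 2  # uniform patterns + one "non-uniform" bin
    h, w = lbp.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                       j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins),
                                   density=True)
            feats.append(hist)
    return np.concatenate(feats)

# Coarse registration with a Viola-Jones detector (OpenCV's stock cascade).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
    face = cv2.resize(gray[y:y + h, x:x + w], (96, 96))
    features = lbp_histogram(face)  # feed to an SVM or similar classifier
```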
Abstract:
Ross River virus (RRV) is a mosquito-borne member of the genus Alphavirus that causes epidemic polyarthritis in humans, costing the Australian health system at least US$10 million annually. Recent progress in RRV vaccine development requires accurate assessment of RRV genetic diversity and evolution, particularly as they may affect the utility of future vaccination. In this study, we provide novel RRV genome sequences and investigate the evolutionary dynamics of RRV from time-structured E2 gene datasets. Our analysis indicates that, although RRV evolves at a similar rate to other alphaviruses (mean evolutionary rate of approximately 8 × 10⁻⁴ nucleotide substitutions per site per year), the relative genetic diversity of RRV has been continuously low through time, possibly as a result of purifying selection imposed by replication in a wide range of natural host and vector species. Together, these findings suggest that vaccination against RRV is unlikely to result in the rapid antigenic evolution that could compromise the future efficacy of current RRV vaccines.
Abstract:
Modern statistical models and computational methods can now incorporate uncertainty of the parameters used in Quantitative Microbial Risk Assessments (QMRA). Many QMRAs use Monte Carlo methods, but work from fixed estimates for means, variances and other parameters. We illustrate the ease of estimating all parameters contemporaneously with the risk assessment, incorporating all the parameter uncertainty arising from the experiments from which these parameters are estimated. A Bayesian approach is adopted, using Markov Chain Monte Carlo Gibbs sampling (MCMC) via the freely available software, WinBUGS. The method and its ease of implementation are illustrated by a case study that involves incorporating three disparate datasets into an MCMC framework. The probabilities of infection when the uncertainty associated with parameter estimation is incorporated into a QMRA are shown to be considerably more variable over various dose ranges than the analogous probabilities obtained when constants from the literature are simply ‘plugged’ in as is done in most QMRAs. Neglecting these sources of uncertainty may lead to erroneous decisions for public health and risk management.
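As an illustration of the approach (though not the paper's WinBUGS model), the sketch below fits an assumed exponential dose-response model to toy challenge-trial data with a random-walk Metropolis sampler and propagates the posterior of the parameter into the infection-probability calculation, rather than plugging in a fixed estimate. All data and priors are invented for illustration.

```python
# Hypothetical sketch: propagate posterior uncertainty in a dose-response
# parameter into a QMRA. Assumes an exponential dose-response model
# P(inf | dose) = 1 - exp(-r * dose); data are illustrative only.
import numpy as np

rng = np.random.default_rng(1)

# Toy challenge-trial data: doses, subjects per group, infections observed.
dose = np.array([1e1, 1e2, 1e3, 1e4])
n    = np.array([10, 10, 10, 10])
y    = np.array([0, 2, 6, 9])

def log_post(log_r):
    """Binomial log-likelihood with a flat prior on log r."""
    p = np.clip(1.0 - np.exp(-np.exp(log_r) * dose), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

# Random-walk Metropolis on log r (a minimal stand-in for Gibbs/WinBUGS).
samples = []
cur, cur_lp = np.log(1e-3), log_post(np.log(1e-3))
for _ in range(20000):
    prop = cur + rng.normal(0, 0.3)
    prop_lp = log_post(prop)
    if np.log(rng.uniform()) < prop_lp - cur_lp:
        cur, cur_lp = prop, prop_lp
    samples.append(cur)
r_post = np.exp(np.array(samples[5000:]))  # discard burn-in

# Risk at an exposure dose is a posterior distribution, not a single number.
exposure = 50.0
p_inf = 1.0 - np.exp(-r_post * exposure)
print(f"P(infection): median {np.median(p_inf):.3f}, "
      f"95% CrI ({np.quantile(p_inf, 0.025):.3f}, "
      f"{np.quantile(p_inf, 0.975):.3f})")
```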
Abstract:
Anthropometric assessment is a simple, safe, and cost-efficient method to examine the health status of individuals. The Japanese obesity classification based on the sum of two skinfolds (Σ2SF) was proposed nearly 40 years ago; its applicability to Japanese people living today is therefore unknown. The current study aimed to determine Σ2SF cut-off values that correspond to percent body fat (%BF) and BMI values using two datasets from young Japanese adults (233 males and 139 females). Using regression analysis, Σ2SF and height-corrected Σ2SF (HtΣ2SF) values that correspond to %BF of 20, 25, and 30% for males and 30, 35, and 40% for females were determined. In addition, cut-off values of both Σ2SF and HtΣ2SF that correspond to BMI values of 23 kg/m², 25 kg/m² and 30 kg/m² were determined. In comparison with the original Σ2SF values, the proposed values are smaller by about 10 mm at maximum. The proposed values improve sensitivity from about 25% to above 90% for identifying individuals with ≥20% body fat in males and ≥30% body fat in females, with high specificity of about 95% in both sexes. The results indicate that the original Σ2SF cut-off values for screening obese individuals cannot be applied to young Japanese adults living today and require modification. Application of the proposed values may assist screening in the clinical setting.
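A minimal sketch of the cut-off derivation, on simulated data rather than the study's: regress %BF on Σ2SF, invert the fit at a target %BF to obtain a Σ2SF cut-off, and check the resulting sensitivity and specificity. The regression form and all numbers are assumptions.

```python
# Hypothetical sketch: derive a skinfold cut-off by inverting a linear
# regression of %BF on the skinfold sum, then score it as a screen.
import numpy as np

rng = np.random.default_rng(0)
s2sf = rng.uniform(10, 60, 233)                      # skinfold sum (mm)
pbf  = 5.0 + 0.45 * s2sf + rng.normal(0, 2.5, 233)   # toy %BF relationship

# Least-squares fit: %BF = a + b * Σ2SF
b, a = np.polyfit(s2sf, pbf, 1)

target_pbf = 20.0                                    # male obesity threshold
cutoff = (target_pbf - a) / b                        # invert the regression
print(f"Σ2SF cut-off for %BF >= {target_pbf}: {cutoff:.1f} mm")

# Screening performance of the derived cut-off against "true" %BF status.
obese    = pbf >= target_pbf
screened = s2sf >= cutoff
sens = (screened & obese).sum() / obese.sum()
spec = (~screened & ~obese).sum() / (~obese).sum()
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```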
Abstract:
Process models in organizational collections are typically modeled by the same team and using the same conventions. As such, these models share many characteristic features, such as size range and the type and frequency of errors. In most cases only small samples of these collections are available, due for example to the sensitive information they contain. Because of their size, these samples may not provide an accurate representation of the characteristics of the originating collection. This paper deals with the problem of constructing collections of process models, in the form of Petri nets, from small samples of a collection, for accurate estimation of the characteristics of that collection. Given a small sample of process models drawn from a real-life collection, we mine a set of generation parameters that we use to generate arbitrarily large collections that feature the same characteristics as the original collection. In this way we can estimate the characteristics of the original collection on the generated collections. We extensively evaluate the quality of our technique on various sample datasets drawn from both research and industry.
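The sketch below illustrates the idea on a deliberately simplified scale, assuming each model is summarised by its place and transition counts: mine distribution parameters from a small sample, then generate an arbitrarily large synthetic collection with matching size statistics. The paper's actual Petri-net generation parameters are richer than this.

```python
# Hypothetical sketch: mine simple generation parameters from a small
# sample of models, then draw a large synthetic collection from them.
# Each "model" is reduced to (place count, transition count).
import numpy as np

rng = np.random.default_rng(7)

# Small sample: (places, transitions) per model; illustrative numbers only.
sample = np.array([(12, 9), (20, 15), (8, 6), (15, 11), (25, 18)])

# Mine generation parameters: log-normal size distribution + P/T ratio.
log_places = np.log(sample[:, 0])
mu, sigma = log_places.mean(), log_places.std(ddof=1)
pt_ratio = (sample[:, 1] / sample[:, 0]).mean()

def generate_collection(n_models):
    """Draw model skeletons whose size statistics match the sample."""
    places = np.maximum(2, rng.lognormal(mu, sigma, n_models).round())
    transitions = np.maximum(1, (places * pt_ratio).round())
    return np.column_stack([places, transitions]).astype(int)

# Estimate collection characteristics on the large synthetic collection.
synthetic = generate_collection(10_000)
print("mean model size:", synthetic.sum(axis=1).mean())
```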
Abstract:
Detection of Regions of Interest (ROIs) in a video leads to more efficient utilization of bandwidth, because any ROI in a given frame can be encoded at higher quality than the rest of that frame with little or no degradation of quality as perceived by viewers. Consequently, it is not necessary to encode the whole video uniformly at high quality. One approach to determining ROIs is to use saliency detectors to locate salient regions. This paper proposes a methodology for obtaining ground truth saliency maps to measure the effectiveness of ROI detection, by considering the role of user experience during the labelling of such maps. User perceptions can be captured and incorporated into the definition of salience in a particular video, taking advantage of human visual recall within a given context. Experiments with two state-of-the-art saliency detectors demonstrate the effectiveness of this approach for validating visual saliency in video. The relevant datasets associated with the experiments are also provided.
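A minimal sketch of how a detector's saliency map can be scored against such a ground truth, using two common agreement metrics (Pearson's linear correlation and ROC AUC over labelled pixels); the paper's exact evaluation protocol may differ, and the maps below are synthetic.

```python
# Hypothetical sketch: score a detector's saliency map against a
# ground-truth map with Pearson correlation and ROC AUC.
import numpy as np

def pearson_cc(pred, gt):
    """Linear correlation between predicted and ground-truth maps."""
    p, g = pred.ravel(), gt.ravel()
    p = (p - p.mean()) / (p.std() + 1e-12)
    g = (g - g.mean()) / (g.std() + 1e-12)
    return float((p * g).mean())

def auc(pred, gt_binary):
    """ROC AUC: rank predicted saliency of labelled vs. unlabelled pixels."""
    pos = pred.ravel()[gt_binary.ravel() > 0]
    neg = pred.ravel()[gt_binary.ravel() == 0]
    # Probability a random labelled pixel outranks a random unlabelled one.
    return float((pos[:, None] > neg[None, :]).mean())

rng = np.random.default_rng(0)
gt = np.zeros((48, 64)); gt[20:30, 30:45] = 1.0      # labelled ROI
pred = gt * 0.8 + rng.uniform(0, 0.4, gt.shape)      # noisy detector output
print(f"CC = {pearson_cc(pred, gt):.2f}, AUC = {auc(pred, gt):.2f}")
```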
Abstract:
Scalable high-resolution tiled display walls are becoming increasingly important to decision makers and researchers because high pixel counts in combination with large screen areas facilitate content-rich, simultaneous display of computer-generated visualization information and high-definition video data from multiple sources. This tutorial is designed to cater for new users as well as researchers who are currently operating tiled display walls or 'OptiPortals'. We will discuss the current and future applications of display wall technology and explore opportunities for participants to collaborate and contribute in a growing community. Multiple tutorial streams will cover hands-on practical development as well as policy and method design for embedding these technologies into the research process. Attendees will be able to gain an understanding of how to get started with developing similar systems themselves, in addition to becoming familiar with typical applications and large-scale visualization techniques. Presentations in this tutorial will describe current implementations of tiled display walls that highlight the effective usage of screen real estate with various visualization datasets, including collaborative applications such as visualcasting, classroom learning and video conferencing. A feature presentation for this tutorial will be given by Jurgen Schulze from Calit2 at the University of California, San Diego. Jurgen is an expert in scientific visualization in virtual environments, human-computer interaction, real-time volume rendering, and graphics algorithms on programmable graphics hardware.
Abstract:
NF-Y is a heterotrimeric transcription factor complex. Each of the NF-Y subunits (NF-YA, NF-YB and NF-YC) in plants is encoded by multiple genes. Quantitative RT-PCR analysis revealed that five wheat NF-YC members (TaNF-YC5, 8, 9, 11 & 12) were upregulated by light in both the leaf and the seedling shoot. Co-expression analysis of Affymetrix wheat genome array datasets revealed that transcript levels of a large number of genes were consistently correlated with those of the TaNF-YC11 and TaNF-YC8 genes in three to four separate Affymetrix array datasets. TaNF-YC11-correlated transcripts were significantly enriched for the Gene Ontology term 'photosynthesis'. Sequence analysis of the promoters of TaNF-YC11-correlated genes revealed the presence of putative NF-Y complex binding sites (CCAAT motifs). Quantitative RT-PCR analysis of a subset of potential TaNF-YC11 target genes showed that ten of the thirteen genes were also light-upregulated in both the leaf and the seedling shoot and had expression profiles significantly correlated with TaNF-YC11. The potential target genes for TaNF-YC11 include subunit members from all four thylakoid membrane-bound complexes required for the conversion of solar energy into chemical energy, and rate-limiting enzymes in the Calvin cycle. These data indicate that TaNF-YC11 is potentially involved in the regulation of photosynthesis-related genes.
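The two computational steps described, co-expression ranking against a bait gene and promoter scanning for CCAAT boxes, can be sketched as follows. Gene names, expression values and the promoter sequence are placeholders, not the wheat array data.

```python
# Hypothetical sketch: rank genes by expression correlation with a bait
# gene across array samples, then scan promoters for CCAAT motifs.
import numpy as np

rng = np.random.default_rng(3)

# Expression matrix: rows = genes, columns = array samples.
genes = ["TaNF-YC11", "geneA", "geneB", "geneC"]
expr = rng.normal(size=(4, 20))
expr[1] = expr[0] * 0.9 + rng.normal(0, 0.3, 20)   # geneA co-expressed

bait = expr[0]
for name, profile in zip(genes[1:], expr[1:]):
    r = np.corrcoef(bait, profile)[0, 1]           # Pearson correlation
    print(f"{name}: r = {r:+.2f}")

def ccaat_sites(promoter):
    """Positions of putative NF-Y binding sites (CCAAT boxes)."""
    return [i for i in range(len(promoter) - 4)
            if promoter[i:i + 5] == "CCAAT"]

print(ccaat_sites("TTGACCCAATGGTACCAATCG"))        # -> [5, 14]
```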
Abstract:
The Nuclear Factor Y (NF-Y) transcription factor is a heterotrimer composed of three subunits: NF-YA, NF-YB and NF-YC. Each of the three subunits in plants is encoded by multiple genes with differential expression profiles, implying functional specialisation of NF-Y subunit members in plants. In this study, we investigated the roles of NF-YB members in the light-mediated regulation of photosynthesis genes. Using quantitative RT-PCR, we identified two NF-YB members from Triticum aestivum (TaNF-YB3 & 7) which were markedly upregulated by light in the leaves and seedling shoots. A genome-wide co-expression analysis of multiple Affymetrix Wheat Genome Array datasets revealed that TaNF-YB3-coexpressed transcripts were highly enriched for the Gene Ontology term 'photosynthesis'. Transgenic wheat lines constitutively overexpressing TaNF-YB3 had a significant increase in leaf chlorophyll content, photosynthesis rate and early growth rate. Quantitative RT-PCR analysis showed that the expression levels of a number of TaNF-YB3-coexpressed transcripts were elevated in the transgenic wheat lines. The mRNA level of TaGluTR, encoding glutamyl-tRNA reductase, which catalyses the rate-limiting step of the chlorophyll biosynthesis pathway, was significantly increased in the leaves of the transgenic wheat. Significant increases in expression level in the transgenic plant leaves were also observed for four photosynthetic apparatus genes encoding chlorophyll a/b-binding proteins (Lhca4 and Lhcb4) and photosystem I reaction center subunits (subunit K and subunit N), as well as for a gene coding for a chloroplast ATP synthase subunit. These results indicate that TaNF-YB3 is involved in the positive regulation of a number of photosynthesis genes in wheat.
Abstract:
Objective: To assess the accuracy of data linkage across the spectrum of emergency care in the absence of a unique patient identifier, and to use the linked data to examine service delivery outcomes in an emergency department setting. Design: Automated data linkage and manual data linkage were compared to determine their relative accuracy. Data were extracted from three separate health information systems (ambulance, ED and hospital inpatients), then linked to provide information about the emergency journey of each patient. The linking was done manually, through physical review of records, and automatically, using a data linking tool (Health Data Integration) developed by the CSIRO. Match rate and quality of the linking were compared. Setting: 10,835 patient presentations to a large, regional teaching hospital ED over a two-month period (August-September 2007). Results: Comparison of the manual and automated linkage outcomes for each pair of linked datasets demonstrated a sensitivity of between 95% and 99%, a specificity of between 75% and 99%, and a positive predictive value of between 88% and 95%. Conclusions: Our results indicate that automated linking provides a sound basis for health service analysis, even in the absence of a unique patient identifier. The use of an automated linking tool yields accurate data suitable for planning and service delivery purposes and enables the data to be linked regularly to examine service delivery outcomes.
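The evaluation described can be expressed compactly: treat the manual review as the gold standard and score the automated linker's pair decisions. The sketch below uses invented pair decisions, not the Health Data Integration tool's output.

```python
# Hypothetical sketch: score automated linkage decisions against a manual
# gold standard using sensitivity, specificity and PPV.
from dataclasses import dataclass

@dataclass
class LinkageScores:
    sensitivity: float   # true matches the linker found
    specificity: float   # true non-matches the linker rejected
    ppv: float           # linked pairs that were truly matches

def evaluate(auto: list[bool], manual: list[bool]) -> LinkageScores:
    tp = sum(a and m for a, m in zip(auto, manual))
    fp = sum(a and not m for a, m in zip(auto, manual))
    fn = sum(not a and m for a, m in zip(auto, manual))
    tn = sum(not a and not m for a, m in zip(auto, manual))
    return LinkageScores(tp / (tp + fn), tn / (tn + fp), tp / (tp + fp))

# One decision per candidate record pair (True = "same patient").
auto   = [True, True, False, True, False, False, True, False]
manual = [True, True, False, False, False, False, True, True]
print(evaluate(auto, manual))
```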
Abstract:
As organizations reach higher levels of Business Process Management maturity, they tend to accumulate large collections of process models. These repositories may contain thousands of activities and be managed by different stakeholders with varying skills and responsibilities. However, while being of great value, these repositories induce high management costs. Thus, it becomes essential to keep track of the various model versions as they may mutually overlap, supersede one another and evolve over time. We propose an innovative versioning model, and an associated storage structure, specifically designed to maximize sharing across process models and process model versions, reduce conflicts in concurrent edits and automatically handle controlled change propagation. The focal point of this technique is to version single process model fragments rather than entire process models; indeed, empirical evidence shows that real-life process model repositories contain numerous duplicate fragments. Experiments on two industrial datasets confirm the usefulness of our technique.
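A minimal sketch of the fragment-level versioning idea, under the assumption that fragments can be serialised to strings: fragments are stored once, content-addressed by hash, and each model version holds only fragment references, so duplicate fragments are shared across models and versions.

```python
# Hypothetical sketch: a content-addressed fragment store in which model
# versions reference fragment hashes, so identical fragments are shared.
import hashlib

class FragmentStore:
    def __init__(self):
        self.fragments: dict[str, str] = {}   # hash -> serialized fragment
        self.versions: dict[tuple[str, int], list[str]] = {}

    def _put(self, fragment: str) -> str:
        key = hashlib.sha1(fragment.encode()).hexdigest()
        self.fragments.setdefault(key, fragment)   # dedup on content
        return key

    def commit(self, model: str, version: int, fragments: list[str]):
        self.versions[(model, version)] = [self._put(f) for f in fragments]

    def checkout(self, model: str, version: int) -> list[str]:
        return [self.fragments[k] for k in self.versions[(model, version)]]

store = FragmentStore()
store.commit("claims", 1, ["receive claim", "assess claim", "pay claim"])
# Version 2 edits one fragment; the unchanged ones are stored only once.
store.commit("claims", 2, ["receive claim", "assess claim", "reject claim"])
print(len(store.fragments))   # 4 unique fragments, not 6
```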
Abstract:
Copyright protects much of the creative, cultural, educational, scientific and informational material generated by federal, State/Territory and local governments and their constituent departments and agencies. Governments at all levels develop, manage and distribute a vast array of materials in the form of documents, reports, websites, datasets and databases on CD or DVD and files that can be downloaded from a website. Under the Copyright Act 1968 (Cth), with few exceptions government copyright is treated the same as copyright owned by non-government parties insofar as the range of protected materials and the exclusive proprietary rights attaching to them are concerned. However, the rationale for recognizing copyright in public sector materials and vesting ownership of copyright in governments is fundamentally different to the main rationales underpinning copyright generally. The central justification for recognizing Crown copyright is to ensure that government documents and materials created for public administrative purposes are disseminated in an accurate and reliable form. Consequently, the exclusive rights held by governments as copyright owners must be exercised in a manner consistent with the rationale for conferring copyright ownership on them. Since Crown copyright exists primarily to ensure that documents and materials produced for use in the conduct of government are circulated in an accurate and reliable form, governments should exercise their exclusive rights to ensure that their copyright materials are made available for access and reuse, in accordance with any laws and policies relating to access to public sector materials. While copyright law vests copyright owners with extensive bundles of exclusive rights which can be exercised to prevent others making use of the copyright material, in the case of Crown copyright materials these rights should rarely be asserted by government to deviate from the general rule that Crown copyright materials will be available for “full and free reproduction” by the community at large.
Abstract:
The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable to large datasets; moreover, the factorized matrices produced by the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.
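As a rough illustration of coupling structure and content (not the paper's actual TSM construction or factorization), the sketch below encodes each XML document as a (structural path × term) slice of a third-order tensor, factorizes the document-mode unfolding with an SVD, and clusters the reduced document representations.

```python
# Hypothetical sketch: a third-order tensor over documents, structural
# paths and terms; SVD of the document-mode unfolding gives low-rank
# document factors that are then clustered.
import numpy as np
from sklearn.cluster import KMeans

paths = ["article/title", "article/body", "book/title"]   # structure axis
terms = ["mining", "cluster", "novel", "tensor"]          # content axis

# tensor[d, p, t] = frequency of term t under path p in document d.
tensor = np.zeros((6, len(paths), len(terms)))
tensor[0, 0, 3] = 2; tensor[0, 1, 0] = 5      # doc 0: article on tensors
tensor[1, 0, 0] = 1; tensor[1, 1, 1] = 4      # doc 1: article on clustering
tensor[2, 2, 2] = 3; tensor[2, 2, 3] = 1      # doc 2: book
tensor[3, 1, 0] = 6; tensor[4, 2, 2] = 2; tensor[5, 1, 1] = 3

# Mode-1 (document) unfolding: one row per document, with structure/term
# combinations as columns; SVD yields a low-rank document representation.
unfolded = tensor.reshape(tensor.shape[0], -1)
u, s, _ = np.linalg.svd(unfolded, full_matrices=False)
docs_reduced = u[:, :2] * s[:2]               # rank-2 document factors

labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(docs_reduced)
print(labels)
```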
Abstract:
Information overload has become a serious issue for web users. Personalisation can provide effective solutions to overcome this problem. Recommender systems are one popular personalisation tool to help users deal with this issue. As the basis of personalisation, the accuracy and efficiency of web user profiling greatly affects the performance of recommender systems and other personalisation systems. In Web 2.0, emerging user information provides new possible solutions for profiling users. Folksonomy, or tag information, is a typical kind of Web 2.0 information. Folksonomy implies users' topic interests and opinion information, and it has become another important source of user information for profiling users and making recommendations. However, since tags are arbitrary words given by users, folksonomy contains a lot of noise, such as tag synonyms, semantic ambiguities and personal tags. Such noise makes it difficult to profile users accurately or to make quality recommendations. This thesis investigates the distinctive features and multiple relationships of folksonomy and explores novel approaches to solve the tag quality problem and profile users accurately. Harvesting the wisdom of crowds and experts, three new user profiling approaches are proposed: a folksonomy-based user profiling approach, a taxonomy-based user profiling approach, and a hybrid user profiling approach based on both folksonomy and taxonomy. The proposed user profiling approaches are applied to recommender systems to improve their performance. Based on the generated user profiles, user- and item-based collaborative filtering approaches, combined with content filtering methods, are proposed to make recommendations. The proposed user profiling and recommendation approaches have been evaluated through extensive experiments. The effectiveness evaluation experiments were conducted on two real-world datasets collected from the Amazon.com and CiteULike websites. The experimental results demonstrate that the proposed user profiling and recommendation approaches outperform related state-of-the-art approaches. In addition, this thesis proposes a parallel, scalable user profiling implementation approach based on advanced cloud computing techniques such as Hadoop, MapReduce and Cascading. The scalability evaluation experiments were conducted on a large-scale dataset collected from the Del.icio.us website. This thesis contributes to effectively using the wisdom of crowds and experts to help users solve information overload issues by providing more accurate, effective and efficient user profiling and recommendation approaches. It also contributes to better use of the taxonomy information given by experts and the folksonomy information contributed by users in Web 2.0.
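A minimal sketch of the folksonomy-based profiling idea, with invented users and tags: each user profile is a frequency-weighted tag vector, users are compared by cosine similarity, and a user-based collaborative filtering step recommends a neighbour's unseen items. The thesis's actual models handle tag noise and taxonomy integration, which this omits.

```python
# Hypothetical sketch: tag-frequency user profiles, cosine similarity
# between users, and a user-based collaborative filtering step.
import math
from collections import Counter

def profile(tag_assignments: list[str]) -> Counter:
    """User profile = frequency-weighted vector over the user's tags."""
    return Counter(tag_assignments)

def cosine(p: Counter, q: Counter) -> float:
    shared = set(p) & set(q)
    dot = sum(p[t] * q[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

alice = profile(["python", "ml", "ml", "datasets"])
bob   = profile(["ml", "datasets", "statistics"])
carol = profile(["cooking", "travel"])

print(cosine(alice, bob), cosine(alice, carol))   # bob is the neighbour

# User-based CF step: recommend bob's items that alice hasn't seen yet.
items = {"bob": {"paperX", "paperY"}, "alice": {"paperX"}}
print(items["bob"] - items["alice"])              # -> {'paperY'}
```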