902 results for Large Data Sets
Abstract:
Background: A general 2-stage method has evolved to derive preference-based measures from various condition-specific descriptive health-related quality of life (HRQOL) measures: 1) an item from each domain of the HRQOL measure is selected to form a health state classification system (HSCS); 2) a sample of health states is valued and an algorithm derived for estimating the utility of all possible health states. The aim of this analysis was to determine whether confirmatory or exploratory factor analysis (CFA, EFA) should be used to derive a cancer-specific utility measure from the EORTC QLQ-C30. Methods: Data were collected with the QLQ-C30v3 from 356 patients receiving palliative radiotherapy for recurrent or metastatic cancer (various primary sites). The dimensional structure of the QLQ-C30 was tested with EFA and CFA, the latter based on a conceptual model (the established domain structure of the QLQ-C30: physical, role, emotional, social and cognitive functioning, plus several symptoms) and clinical considerations (views of both patients and clinicians about issues relevant to HRQOL in cancer). The dimensions determined by each method were then subjected to item response theory, including Rasch analysis. Results: CFA results generally supported the proposed conceptual model, with residual correlations requiring only minor adjustments (namely, introduction of two cross-loadings) to improve model fit (incremental χ²(2) = 77.78, p < .001). Although EFA revealed a structure similar to the CFA, some items had loadings that were difficult to interpret. Further assessment of dimensionality with Rasch analysis aligned the EFA dimensions more closely with the CFA dimensions. Three items exhibited floor effects (>75% of observations at the lowest score), 6 exhibited misfit to the Rasch model (fit residual > 2.5), none exhibited disordered item response thresholds, and 4 exhibited differential item functioning (DIF) by gender or cancer site. 
Upon inspection of the remaining items, three were considered relatively less clinically important than the remaining nine. Conclusions: CFA appears more appropriate than EFA, given the well-established structure of the QLQ-C30 and its clinical relevance. Further, the confirmatory approach produced more interpretable results than the exploratory approach. Other aspects of the general method remain largely the same. The revised method will be applied to a large number of data sets as part of the international and interdisciplinary project to develop a multi-attribute utility instrument for cancer (MAUCa).
Abstract:
A simple and effective down-sampling algorithm, the Peak-Hold-Down-Sample (PHDS) algorithm, is developed in this paper to enable rapid and efficient data transfer in remote condition monitoring applications. The algorithm is particularly useful for high-frequency Condition Monitoring (CM) techniques and for low-speed machine applications, since the combination of a high sampling frequency and a low rotating speed generally leads to large, unwieldy data sizes. The effectiveness of the algorithm was evaluated and tested on four sets of data in the study. One set of the data was extracted from the condition monitoring signal of a practical industry application. Another set of data was acquired from a low-speed machine test rig in the laboratory. The other two sets of data were computer-simulated bearing defect signals having either a single or multiple bearing defects. The results show that the PHDS algorithm can substantially reduce the size of data while preserving the critical bearing defect information for all the data sets used in this work, even when a large down-sample ratio was used (i.e., 500 times down-sampled). In contrast, down-sampling with the conventional technique used in signal processing eliminates useful and critical information, such as bearing defect frequencies in a signal, when the same down-sample ratio is employed. The conventional technique also introduces noise and artificial frequency components, which limits its usefulness for machine condition monitoring applications.
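The abstract above does not spell out how PHDS works, so the following is only a minimal sketch under an assumption: that "peak hold" means keeping the largest-magnitude sample from each block of consecutive samples. The function names and the toy signal are hypothetical and are not taken from the paper. The sketch contrasts that idea with plain decimation, which keeps every n-th sample and can miss short impulses entirely.

```python
def peak_hold_down_sample(signal, ratio):
    """Keep the largest-magnitude sample from each block of `ratio` samples.

    Assumed interpretation of "peak hold"; not the paper's published method.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    reduced = []
    for start in range(0, len(signal), ratio):
        block = signal[start:start + ratio]
        reduced.append(max(block, key=abs))  # sign of the peak is preserved
    return reduced


def plain_down_sample(signal, ratio):
    """Plain decimation without filtering: keep every `ratio`-th sample."""
    return signal[::ratio]


# Toy signal: a low-level background with two short impulses that fall
# between the points a plain decimation by 5 would keep.
signal = [0.1] * 20
signal[7] = 5.0
signal[13] = -4.0

print(peak_hold_down_sample(signal, 5))  # [0.1, 5.0, -4.0, 0.1]
print(plain_down_sample(signal, 5))      # [0.1, 0.1, 0.1, 0.1]
```

In this toy case both outputs are one fifth of the original length, but only the peak-hold version still contains the impulses.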
Abstract:
Particulate matter research is essential because of the well known significant adverse effects of aerosol particles on human health and the environment. In particular, identification of the origin or sources of particulate matter emissions is of paramount importance in assisting efforts to control and reduce air pollution in the atmosphere. This thesis aims to: identify the sources of particulate matter; compare pollution conditions at urban, rural and roadside receptor sites; combine information about the sources with meteorological conditions at the sites to locate the emission sources; compare sources based on particle size or mass; and ultimately, provide the basis for control and reduction in particulate matter concentrations in the atmosphere. To achieve these objectives, data was obtained from assorted local and international receptor sites over long sampling periods. The samples were analysed using Ion Beam Analysis and Scanning Mobility Particle Sizer methods to measure the particle mass with chemical composition and the particle size distribution, respectively. Advanced data analysis techniques were employed to derive information from large, complex data sets. Multi-Criteria Decision Making (MCDM), a ranking method, drew on data variability to examine the overall trends, and provided the rank ordering of the sites and years that sampling was conducted. Coupled with the receptor model Positive Matrix Factorisation (PMF), the pollution emission sources were identified and meaningful information pertinent to the prioritisation of control and reduction strategies was obtained. This thesis is presented in the thesis by publication format. It includes four refereed papers which together demonstrate a novel combination of data analysis techniques that enabled particulate matter sources to be identified and sampling site/year ranked. 
The strength of this source identification process was corroborated when the analysis procedure was expanded to encompass multiple receptor sites. Initially applied to identify the contributing sources at roadside and suburban sites in Brisbane, the technique was subsequently applied to three receptor sites (roadside, urban and rural) located in Hong Kong. The comparable results from these international and national sites over several sampling periods indicated similarities in source contributions between receptor site-types, irrespective of global location and suggested the need to apply these methods to air pollution investigations worldwide. Furthermore, an investigation into particle size distribution data was conducted to deduce the sources of aerosol emissions based on particle size and elemental composition. Considering the adverse effects on human health caused by small-sized particles, knowledge of particle size distribution and their elemental composition provides a different perspective on the pollution problem. This thesis clearly illustrates that the application of an innovative combination of advanced data interpretation methods to identify particulate matter sources and rank sampling sites/years provides the basis for the prioritisation of future air pollution control measures. Moreover, this study contributes significantly to knowledge based on chemical composition of airborne particulate matter in Brisbane, Australia and on the identity and plausible locations of the contributing sources. Such novel source apportionment and ranking procedures are ultimately applicable to environmental investigations worldwide.
Abstract:
A user’s query is considered to be an imprecise description of their information need. Automatic query expansion is the process of reformulating the original query with the goal of improving retrieval effectiveness. Many successful query expansion techniques ignore information about the dependencies that exist between words in natural language. However, more recent approaches have demonstrated that by explicitly modeling associations between terms significant improvements in retrieval effectiveness can be achieved over those that ignore these dependencies. State-of-the-art dependency-based approaches have been shown to primarily model syntagmatic associations. Syntagmatic associations infer a likelihood that two terms co-occur more often than by chance. However, structural linguistics relies on both syntagmatic and paradigmatic associations to deduce the meaning of a word. Given the success of dependency-based approaches and the reliance on word meanings in the query formulation process, we argue that modeling both syntagmatic and paradigmatic information in the query expansion process will improve retrieval effectiveness. This article develops and evaluates a new query expansion technique that is based on a formal, corpus-based model of word meaning that models syntagmatic and paradigmatic associations. We demonstrate that when sufficient statistical information exists, as in the case of longer queries, including paradigmatic information alone provides significant improvements in retrieval effectiveness across a wide variety of data sets. More generally, when our new query expansion approach is applied to large-scale web retrieval it demonstrates significant improvements in retrieval effectiveness over a strong baseline system, based on a commercial search engine.
Abstract:
Cell trajectory data is often reported in the experimental cell biology literature to distinguish between different types of cell migration. Unfortunately, there is no accepted protocol for designing or interpreting such experiments, and this makes it difficult to quantitatively compare different published data sets and to understand how changes in experimental design influence our ability to interpret different experiments. Here, we use an individual-based mathematical model to simulate the key features of a cell trajectory experiment. This shows that our ability to correctly interpret trajectory data is extremely sensitive to the geometry and timing of the experiment, the degree of motility bias and the number of experimental replicates. We show that cell trajectory experiments produce data that is most reliable when the experiment is performed in a quasi-1D geometry with a large number of identically prepared experiments conducted over a relatively short time interval, rather than a few trajectories recorded over particularly long time intervals.
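To make the idea of an individual-based model with a motility bias concrete, here is a minimal sketch. It assumes a single agent on a 1D lattice that steps right with probability (1 + bias) / 2 and left otherwise. The function names, the parameter values and the estimator are illustrative assumptions, not the model used in the paper, and the sketch leaves out crowding, geometry and the other features the paper studies.

```python
import random


def simulate_trajectory(steps, bias, rng):
    """Return the positions of one agent on a 1D lattice.

    At each step the agent moves right with probability (1 + bias) / 2 and
    left otherwise, so bias = 0 is unbiased and bias = 1 is fully directed.
    """
    position = 0
    path = [position]
    for _ in range(steps):
        position += 1 if rng.random() < (1 + bias) / 2 else -1
        path.append(position)
    return path


def estimate_bias(paths):
    """Estimate the bias as net displacement divided by total steps taken."""
    total_steps = sum(len(p) - 1 for p in paths)
    net_displacement = sum(p[-1] - p[0] for p in paths)
    return net_displacement / total_steps


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
many_short = [simulate_trajectory(20, 0.1, rng) for _ in range(500)]
few_long = [simulate_trajectory(1000, 0.1, rng) for _ in range(10)]
print(estimate_bias(many_short), estimate_bias(few_long))
```

Both designs here contain the same total number of steps, so this toy model alone does not reproduce the paper's conclusion about many short experiments versus a few long ones. It only shows the kind of simulated trajectories and summary statistic such a comparison would start from.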
Abstract:
Queensland University of Technology (QUT) Library offers a range of resources and services to researchers as part of their research support portfolio. This poster will present key features of two of the data management services offered by research support staff at QUT Library. The first service is QUT Research Data Finder (RDF), a product of the Australian National Data Service (ANDS) funded Metadata Stores project. RDF is a data registry (metadata repository) that aims to publicise datasets that are research outputs arising from completed QUT research projects. The second is a software and code registry, which is currently under development with the sole purpose of improving discovery of source code and software as QUT research outputs. RESEARCH DATA FINDER As an integrated metadata repository, Research Data Finder aligns with institutional sources of truth, such as QUT’s research administration system, ResearchMaster, as well as QUT’s Academic Profiles system to provide high quality data descriptions that increase awareness of, and access to, shareable research data. The repository and its workflows are designed to foster better data management practices, enhance opportunities for collaboration and research, promote cross-disciplinary research and maximise the impact of existing research data sets. SOFTWARE AND CODE REGISTRY The QUT Library software and code registry project stems from concerns amongst researchers with regards to development activities, storage, accessibility, discoverability and impact, sharing, copyright and IP ownership of software and code. As a result, the Library is developing a registry for code and software research outputs, which will use existing Research Data Finder architecture. The underpinning software for both registries is VIVO, open source software developed by Cornell University. 
The registry will use the Research Data Finder service instance of VIVO and will include a searchable interface, links to code/software locations and metadata feeds to Research Data Australia. Key benefits of the project include: improving the discoverability and reuse of QUT researchers’ code and software amongst QUT and the QUT research community; increasing the profile of QUT research outputs on a national level by providing a metadata feed to Research Data Australia; and improving the metrics for access and reuse of code and software in the repository.
Abstract:
The calcium-activated potassium ion channel gene (KCNN3) is located in the vicinity of the familial hemiplegic migraine type 2 locus on chromosome 1q21.3. This gene is expressed in the central nervous system and plays a role in neural excitability. Previous association studies have provided some, although not conclusive, evidence for involvement of this gene in migraine susceptibility. To elucidate KCNN3 involvement in migraine, we performed gene-wide SNP genotyping in a high-risk genetic isolate from Norfolk Island, a population descended from a small number of eighteenth-century Isle of Man ‘Bounty Mutineer’ and Tahitian founders. Phenotype information was available for 377 individuals who are related through the single, well-defined Norfolk pedigree (96 were affected: 64 MA, 32 MO). A total of 85 SNPs spanning the KCNN3 gene were genotyped in a sub-sample of 285 related individuals (76 affected), all core members of the extensive Norfolk Island ‘Bounty Mutineer’ genealogy. All genotyping was performed using the Illumina BeadArray platform. The analysis was performed using the statistical program SOLAR v4.0.6 assuming an additive model of allelic effect adjusted for the effects of age and sex. Haplotype analysis was undertaken using the program HAPLOVIEW v4.0. A total of four intronic SNPs in the KCNN3 gene displayed significant association (P < 0.05) with migraine. Two SNPs, rs73532286 and rs6426929, separated by approximately 0.1 kb, displayed complete LD (r² = 1.00, D′ = 1.00, D′ 95% CI = 0.96–1.00). In all cases, the minor allele led to a decrease in migraine risk (beta coefficient = 0.286–0.315), suggesting that common gene variants confer an increased risk of migraine in the Norfolk pedigree. This effect may be explained by a founder effect in this genetic isolate. 
This study provides evidence for association of variants in the KCNN3 ion channel gene with migraine susceptibility in the Norfolk genetic isolate, with the rarer allelic variants conferring a possible protective role. This is the first comprehensive analysis of this potential candidate gene in migraine and also the first study that has utilised the unique Norfolk Island large pedigree isolate to implicate a specific migraine gene. Studies of additional variants in KCNN3 in the Norfolk pedigree (e.g. polyglutamine variants) are now required, and further analyses in other population data sets are required to clarify the association of the KCNN3 gene and migraine risk in the general outbred population.
Abstract:
Endometrial cancer is one of the most common female diseases in developed nations and is the most commonly diagnosed gynaecological cancer in Australia. The disease is commonly classified by histology: endometrioid or non-endometrioid endometrial cancer. While non-endometrioid endometrial cancers are accepted to be high-grade, aggressive cancers, endometrioid cancers (comprising 80% of all endometrial cancers diagnosed) generally carry a favourable patient prognosis. However, endometrioid endometrial cancer patients endure significant morbidity due to surgery and radiotherapy used for disease treatment, and patients with recurrent disease have a 5-year survival rate of less than 50%. Genetic analysis of women with endometrial cancer could uncover novel markers associated with disease risk and/or prognosis, which could then be used to identify women at high risk and to guide the use of specialised treatments. Proteases are widely accepted to play an important role in the development and progression of cancer. This PhD project hypothesised that SNPs from two protease gene families, the matrix metalloproteases (MMPs, including their tissue inhibitors, TIMPs) and the tissue kallikrein-related peptidases (KLKs), would be associated with endometrial cancer susceptibility and/or prognosis. In the first part of this study, optimisation of the genotyping techniques was performed. An attempt was then made to validate results from previously published endometrial cancer genetic association studies in a large, multicentre replication set (maximum cases n = 2,888, controls n = 4,483, 3 studies). The rs11224561 progesterone receptor SNP (PGR, A/G) was observed to be associated with increased endometrial cancer risk (per A allele OR 1.31, 95% CI 1.12-1.53; p-trend = 0.001), a result which was initially reported among a Chinese sample set. 
Previously reported associations for the remaining 8 SNPs investigated for this section of the PhD study were not confirmed, thereby reinforcing the importance of validation of genetic association studies. To examine the effect of SNPs from the MMP and KLK families on endometrial cancer risk, we selected the most significantly associated MMP and KLK SNPs from genome-wide association study analysis (GWAS) to be genotyped in the GWAS replication set (cases n = 4,725, controls n = 9,803, 13 studies). The significance of the MMP24 rs932562 SNP was unchanged after incorporation of the stage 2 samples (Stage 1 per allele OR 1.18, p = 0.002; Combined Stage 1 and 2 OR 1.09, p = 0.002). The rs10426 SNP, located 3' to KLK10, was predicted by bioinformatic analysis to affect miRNA binding. This SNP was observed in the GWAS stage 1 result to exhibit a recessive effect on endometrial cancer risk, a result which was not validated in the stage 2 sample set (Stage 1 OR 1.44, p = 0.007; Combined Stage 1 and 2 OR 1.14, p = 0.08). Investigation of the regions imputed surrounding the MMP, TIMP and KLK genes did not reveal any significant targets for further analysis. Analysis of the case data from the endometrial cancer GWAS to identify genetic variation associated with cancer grade did not reveal SNPs from the MMP, TIMP or KLK genes to be statistically significant. However, the representation of SNPs from the MMP, TIMP and KLK families by the GWAS genotyping platform used in this PhD project was examined and observed to be very low, with the genetic variation of four genes (MMP23A, MMP23B, MMP28 and TIMP1) not captured at all by this technique. This suggests that comprehensive candidate gene association studies will be required to assess the role of SNPs from these genes with endometrial cancer risk and prognosis. 
Meta-analysis of gene expression microarray datasets curated as part of this PhD study identified a number of MMP, TIMP and KLK genes to display differential expression by endometrial cancer status (MMP2, MMP10, MMP11, MMP13, MMP19, MMP25 and KLK1) and histology (MMP2, MMP11, MMP12, MMP26, MMP28, TIMP2, TIMP3, KLK6, KLK7, KLK11 and KLK12). In light of these findings these genes should be prioritised for future targeted genetic association studies. Two SNPs located 43.5 Mb apart on chromosome 15 were observed from the GWAS analysis to be associated with increased endometrial cancer grade, results that were validated in silico in two independent datasets. One of these SNPs, rs8035725, is located in the 5' untranslated region of a MYC promoter binding protein DENND4A (Stage 1 OR 1.15, p = 9.85 × 10⁻⁵; combined Stage 1 and in silico validation OR 1.13, p = 5.24 × 10⁻⁶). This SNP has previously been reported to alter the expression of PTPLAD1, a gene involved in the synthesis of very long fatty acid chains and in the Rac1 signaling pathway. Meta-analysis of gene expression microarray data found PTPLAD1 to display increased expression in the aggressive non-endometrioid histology compared with endometrioid endometrial cancer, suggesting that the causal SNP underlying the observed genetic association may influence expression of this gene. Neither rs8035725 nor significant SNPs identified by imputation were predicted bioinformatically to affect transcription factor binding sites, indicating that further studies are required to assess their potential effect on other regulatory elements. The other grade-associated SNP, rs6606792, is located upstream of an inferred pseudogene, ELMO2P1 (Stage 1 OR 1.12, p = 5 × 10⁻⁵; combined Stage 1 and in silico validation OR 1.09, p = 3.56 × 10⁻⁵). 
Imputation of the ±1 Mb region surrounding this SNP revealed a cluster of significantly associated variants which are predicted to abolish various transcription factor binding sites, and would be expected to decrease gene expression. ELMO2P1 was not included on the microarray platforms collected for this PhD, and so its expression could not be investigated. However, the high sequence homology of ELMO2P1 with ELMO2, a gene important to cell motility, indicates that ELMO2 could be the parent gene for ELMO2P1 and as such, ELMO2P1 could function to regulate the expression of ELMO2. Increased expression of ELMO2 was seen to be associated with increasing endometrial cancer grade, as well as with aggressive endometrial cancer histological subtypes by microarray meta-analysis. Thus, it is hypothesised that SNPs in linkage disequilibrium with rs6606792 decrease the transcription of ELMO2P1, reducing the regulatory effect of ELMO2P1 on ELMO2 expression. Consequently, ELMO2 expression is increased and cell motility is enhanced, leading to an aggressive endometrial cancer phenotype. In summary, these findings have identified several areas of research for further study. The results presented in this thesis provide evidence that a SNP in PGR is associated with risk of developing endometrial cancer. This PhD study also reports two independent loci on chromosome 15 to be associated with increased endometrial cancer grade, and furthermore, genes associated with these SNPs to be differentially expressed in aggressive subtypes and/or by grade. The studies reported in this thesis support the need for comprehensive SNP association studies on prioritised MMP, TIMP and KLK genes in large sample sets. Until these studies are performed, the role of MMP, TIMP and KLK genetic variation remains unclear. Overall, this PhD study has contributed to the understanding of genetic variation involvement in endometrial cancer susceptibility and prognosis. 
Importantly, the genetic regions highlighted in this study could lead to the identification of novel gene targets to better understand the biology of endometrial cancer and also aid in the development of therapeutics directed at treating this disease.
Abstract:
As of June 2009, 361 genome-wide association studies (GWAS) had been referenced by the HuGE database. GWAS require DNA from many thousands of individuals, relying on suitable DNA collections. We recently performed a multiple sclerosis (MS) GWAS where a substantial component of the cases (24%) had DNA derived from saliva. Genotyping was done on the Illumina genotyping platform using the Infinium Hap370CNV DUO microarray. Additionally, we genotyped 10 individuals in duplicate using both saliva- and blood-derived DNA. The performance of blood- versus saliva-derived DNA was compared using genotyping call rate, which reflects both the quantity and quality of genotyping per sample and the “GCScore,” an Illumina genotyping quality score, which is a measure of DNA quality. We also compared genotype calls and GCScores for the 10 sample pairs. Call rates were assessed for each sample individually. For the GWAS samples, we compared data according to source of DNA and center of origin. We observed high concordance in genotyping quality and quantity between the paired samples and minimal loss of quality and quantity of DNA in the saliva samples in the large GWAS sample, with the blood samples showing greater variation between centers of origin. This large data set highlights the usefulness of saliva DNA for genotyping, especially in high-density single-nucleotide polymorphism microarray studies such as GWAS.
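The two comparisons described above, per-sample call rate and agreement between paired blood and saliva genotypes, can be sketched as follows. This is an illustration only: the genotype values are invented, the function names are hypothetical, and concordance is assumed here to mean the share of markers called in both samples whose genotypes match. It is not the Illumina software or the authors' analysis pipeline.

```python
def call_rate(calls):
    """Fraction of markers with a non-missing genotype call (None = no call)."""
    if not calls:
        raise ValueError("no markers supplied")
    return sum(c is not None for c in calls) / len(calls)


def concordance(calls_a, calls_b):
    """Fraction of markers called in both samples whose genotypes agree."""
    if len(calls_a) != len(calls_b):
        raise ValueError("samples must cover the same markers")
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a is not None and b is not None]
    if not both:
        raise ValueError("no markers called in both samples")
    return sum(a == b for a, b in both) / len(both)


# Invented calls for one individual at six markers; None marks a failed call.
blood = ["AA", "AG", "GG", "AG", None, "AA"]
saliva = ["AA", "AG", "GG", "AA", "GG", None]

print(call_rate(blood))            # 5 of 6 markers called
print(call_rate(saliva))           # 5 of 6 markers called
print(concordance(blood, saliva))  # 3 of 4 jointly called markers agree: 0.75
```

Markers missing in either sample are excluded from the concordance denominator, so a low call rate and a low concordance are reported as separate problems.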
Abstract:
Objective: Do employees care about their relative (economic) position in comparison to their co-workers in an organization? And if so, does it raise or lower their performance? While the topic is widely discussed in the literature, behavioral evidence on these important questions is relatively rare. Methods: This article explores the pay-performance relationship using a sports data set. The strength of analyzing such data is that sports tournaments take place in a very controlled environment that helps to isolate a relative income effect. Results: Using two large unique data sets that cover 26 seasons in basketball and eight seasons in soccer (Bundesliga), we find considerable support for the idea that a relative income disadvantage is correlated with a decrease in individual performance. In addition, there does not seem to be any tolerance for income disparity based on the hope that such differences may signal that better times are ahead. Conclusions: This suggests the need to consider the impact of the relative income position when designing pay-for-performance mechanisms within firms and teams.
Abstract:
The use of hedonic models to estimate the effects of various factors on house prices is well established. This paper examines a number of international hedonic house price models that seek to quantify the effect of infrastructure charges on new house prices. This work is an important factor in the housing affordability debate, with many governments in high growth areas having user-pays infrastructure charging policies operating in tandem with housing affordability objectives, with no empirical evidence on the impact of one on the other. This research finds there is little consistency between existing models and the data sets utilised. Specification appears dependent upon data availability rather than sound theoretical grounding, which may lead to a lack of external validity.
Abstract:
A large number of methods have been published that aim to evaluate various components of multi-view geometry systems. Most of these have focused on the feature extraction, description and matching stages (the visual front end), since geometry computation can be evaluated through simulation. Many data sets are constrained to small scale scenes or planar scenes that are not challenging to new algorithms, or require special equipment. This paper presents a method for automatically generating geometry ground truth and challenging test cases from high spatio-temporal resolution video. The objective of the system is to enable data collection at any physical scale, in any location and in various parts of the electromagnetic spectrum. The data generation process consists of collecting high resolution video, computing accurate sparse 3D reconstruction, video frame culling and down sampling, and test case selection. The evaluation process consists of applying a test 2-view geometry method to every test case and comparing the results to the ground truth. This system facilitates the evaluation of the whole geometry computation process or any part thereof against data compatible with a realistic application. A collection of example data sets and evaluations is included to demonstrate the range of applications of the proposed system.
Abstract:
Long-term autonomy in robotics requires perception systems that are resilient to unusual but realistic conditions that will eventually occur during extended missions. For example, unmanned ground vehicles (UGVs) need to be capable of operating safely in adverse and low-visibility conditions, such as at night or in the presence of smoke. The key to a resilient UGV perception system lies in the use of multiple sensor modalities, e.g., operating at different frequencies of the electromagnetic spectrum, to compensate for the limitations of a single sensor type. In this paper, visual and infrared imaging are combined in a Visual-SLAM algorithm to achieve localization. We propose to evaluate the quality of data provided by each sensor modality prior to data combination. This evaluation is used to discard low-quality data, i.e., data most likely to induce large localization errors. In this way, perceptual failures are anticipated and mitigated. An extensive experimental evaluation is conducted on data sets collected with a UGV in a range of environments and adverse conditions, including the presence of smoke (obstructing the visual camera), fire, extreme heat (saturating the infrared camera), low-light conditions (dusk), and at night with sudden variations of artificial light. A total of 240 trajectory estimates are obtained using five different variations of data sources and data combination strategies in the localization method. In particular, the proposed approach for selective data combination is compared to methods using a single sensor type or combining both modalities without preselection. We show that the proposed framework allows for camera-based localization resilient to a large range of low-visibility conditions.
Abstract:
Parametric roll is a critical phenomenon for ships, whose onset may cause roll oscillations up to ±40 degrees, leading to very dangerous situations and possibly capsizing. Container ships have been shown to be particularly prone to parametric roll resonance when they are sailing in moderate to heavy head seas. A Matlab/Simulink parametric roll benchmark model for a large container ship has been implemented and validated against a wide set of experimental data. The model is a part of a Matlab/Simulink Toolbox (MSS, 2007). The benchmark implements a 3rd-order nonlinear model where the dynamics of roll is strongly coupled with the heave and pitch dynamics. The implemented model has shown good accuracy in predicting the container ship motions, both in the vertical plane and in the transversal one. Parametric roll has been reproduced for all the data sets in which it happened, and the model provides realistic results which are in good agreement with the model tank experiments.
Abstract:
We examine some variations of standard probability designs that preferentially sample sites based on how easy they are to access. Preferential sampling designs deliver unbiased estimates of mean and sampling variance and will ease the burden of data collection, but at what cost to our design efficiency? Preferential sampling has the potential to either increase or decrease sampling variance depending on the application. We carry out a simulation study to gauge what effect it will have when sampling Soil Organic Carbon (SOC) values in a large agricultural region in south-eastern Australia. Preferential sampling in this region can reduce the distance to travel by up to 16%. Our study is based on a dataset of predicted SOC values produced from a data-mining exercise. We consider three designs and two ways to determine ease of access. The overall conclusion is that sampling performance deteriorates as the strength of preferential sampling increases, because the regions of high SOC are harder to access, so our designs inadvertently target regions of low SOC value. The good news, however, is that Generalised Random Tessellation Stratification (GRTS) sampling designs are not as badly affected as others, and GRTS remains an efficient design compared to competitors.