50 results for Data mining and knowledge discovery
in QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast
Abstract:
We conducted data-mining analyses of genome-wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in the two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 shared high LD (r² > 0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P = 1.10 × 10⁻³ and rs2274736, P = 1.21 × 10⁻³). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR = 0.92, 95% CI: 0.86-0.97, P = 5.45 × 10⁻³ and rs2401751, OR = 0.92, 95% CI: 0.86-0.97, P = 5.29 × 10⁻³). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR = 1.08, 95% CI: 1.02-1.14, P = 6.43 × 10⁻³). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype-conditioned analyses indicated that the association signals observed between rs2274736/rs2401751 and rs7147796 are statistically independent. Given the results that 2 non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.
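As a reading aid for the odds ratios and confidence intervals quoted in this abstract, the following minimal sketch shows how an approximate z-score and two-sided p-value can be recovered from a reported OR and its 95% CI. It assumes a symmetric Wald interval on the log-odds scale, which the abstract does not state; the function name `p_from_or_ci` is hypothetical. Because the published figures are rounded to two decimals, the result should only match the reported p-values in order of magnitude.

```python
import math

def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate (z, two-sided p) from an odds ratio and its 95% CI.

    Assumes the interval is a Wald interval, symmetric on the log scale.
    """
    log_or = math.log(odds_ratio)
    # Standard error of log(OR), read off the width of the interval
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    # Two-sided tail probability of the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Rounded values as quoted in the abstract
markers = {
    "rs2274736": (0.92, 0.86, 0.97),
    "rs2401751": (0.92, 0.86, 0.97),
    "rs7147796": (1.08, 1.02, 1.14),
}
for name, (or_, lo, hi) in markers.items():
    z, p = p_from_or_ci(or_, lo, hi)
    print(f"{name}: z = {z:.2f}, p = {p:.2e}")
```

The sketch does not reproduce the Bonferroni step, since the abstract does not say how many tests were corrected for.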
Abstract:
The last decade has witnessed an unprecedented growth in the availability of data having spatio-temporal characteristics. Given the scale and richness of such data, finding spatio-temporal patterns that demonstrate significantly different behavior from their neighbors could be of interest for various application scenarios such as weather modeling, analyzing the spread of disease outbreaks, monitoring traffic congestion, and so on. In this paper, we propose an automated approach for exploring and discovering such anomalous patterns irrespective of the underlying domain from which the data is recovered. Our approach differs significantly from traditional methods of spatial outlier detection, and employs two phases: i) discovering homogeneous regions, and ii) evaluating these regions as anomalies based on their statistical difference from a generalized neighborhood. We evaluate the quality of our approach and distinguish it from existing techniques via an extensive experimental evaluation.
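The two phases named in this abstract can be illustrated with a toy sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes a small 2-D grid of values, uses flood fill with a tolerance `tol` for phase i, and uses a z-score against the bordering cells for phase ii. All function names and the grid are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

def neighbors(r, c, rows, cols):
    """Yield the 4-connected neighbors of cell (r, c) that lie inside the grid."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc

def homogeneous_regions(grid, tol):
    """Phase i: flood-fill adjacent cells whose values differ by at most tol."""
    rows, cols = len(grid), len(grid[0])
    label = [[None] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if label[r][c] is not None:
                continue
            region_id = len(regions)
            label[r][c] = region_id
            queue, cells = deque([(r, c)]), []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in neighbors(cr, cc, rows, cols):
                    if label[nr][nc] is None and abs(grid[nr][nc] - grid[cr][cc]) <= tol:
                        label[nr][nc] = region_id
                        queue.append((nr, nc))
            regions.append(cells)
    return regions

def anomaly_scores(grid, regions):
    """Phase ii: distance of each region's mean from the cells bordering it."""
    rows, cols = len(grid), len(grid[0])
    scores = []
    for cells in regions:
        inside = set(cells)
        border = {n for (r, c) in cells for n in neighbors(r, c, rows, cols)} - inside
        if not border:
            scores.append(0.0)  # a region covering the whole grid has no neighborhood
            continue
        border_vals = [grid[r][c] for r, c in border]
        spread = pstdev(border_vals) or 1.0  # fall back to 1.0 to avoid dividing by zero
        region_mean = mean(grid[r][c] for r, c in cells)
        scores.append(abs(region_mean - mean(border_vals)) / spread)
    return scores

# Toy grid: a 2x2 block of high values inside a low-valued background
grid = [
    [1.0, 1.1, 1.0, 1.2],
    [1.1, 9.0, 9.2, 1.0],
    [1.0, 9.1, 9.0, 1.1],
    [1.2, 1.0, 1.1, 1.0],
]
regions = homogeneous_regions(grid, tol=0.5)
scores = anomaly_scores(grid, regions)
for cells, score in zip(regions, scores):
    print(len(cells), "cells, score", round(score, 1))
```

With only two regions, each is the other's neighborhood, so both score highly. A realistic method would need the "generalized neighborhood" the abstract mentions, which this sketch leaves unspecified.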
Abstract:
In the last decade, data mining has emerged as one of the most dynamic and lively areas in information technology. Although many algorithms and techniques for data mining have been proposed, they either focus on domain-independent techniques or on very specific domain problems. A general requirement in bridging the gap between academia and business is to cater to general domain-related issues surrounding real-life applications, such as constraints, organizational factors, domain expert knowledge, domain adaptation, and operational knowledge. Unfortunately, these either have not been addressed, or have not been sufficiently addressed, in current data mining research and development. Domain-Driven Data Mining (D3M) aims to develop general principles, methodologies, and techniques for modeling and merging comprehensive domain-related factors and synthesized ubiquitous intelligence surrounding problem domains with the data mining process, and discovering knowledge to support business decision-making. This paper aims to report original, cutting-edge, and state-of-the-art progress in D3M. It covers theoretical and applied contributions aiming to: 1) propose next-generation data mining frameworks and processes for actionable knowledge discovery, 2) investigate effective (automated, human- and machine-centered, and/or human-machine-cooperated) principles and approaches for acquiring, representing, modeling, and engaging ubiquitous intelligence in real-world data mining, and 3) develop workable and operational systems balancing technical significance and application concerns, and converting and delivering actionable knowledge into operational application rules to seamlessly engage application processes and systems.
Abstract:
Background. The assembly of the tree of life has seen significant progress in recent years but algae and protists have been largely overlooked in this effort. Many groups of algae and protists have ancient roots and it is unclear how much data will be required to resolve their phylogenetic relationships for incorporation in the tree of life. The red algae, a group of primary photosynthetic eukaryotes more than a billion years old, provide the earliest fossil evidence for eukaryotic multicellularity and sexual reproduction. Despite this evolutionary significance, their phylogenetic relationships are understudied. This study aims to infer a comprehensive red algal tree of life at the family level from a supermatrix containing data mined from GenBank. We aim to locate remaining regions of low support in the topology, evaluate their causes and estimate the amount of data required to resolve them. Results. Phylogenetic analysis of a supermatrix of 14 loci and 98 red algal families yielded the most complete red algal tree of life to date. Visualization of statistical support showed the presence of five poorly supported regions. Causes for low support were identified with statistics about the age of the region, data availability and node density, showing that poor support has different origins in different parts of the tree. Parametric simulation experiments yielded optimistic estimates of how much data will be needed to resolve the poorly supported regions (ca. 10³ to ca. 10⁴ nucleotides for the different regions). Nonparametric simulations gave a markedly more pessimistic image, some regions requiring more than 2.8 × 10⁵ nucleotides or not achieving the desired level of support at all. The discrepancies between parametric and nonparametric simulations are discussed in light of our dataset and known attributes of both approaches. Conclusions. Our study takes the red algae one step closer to meaningful inclusion in the tree of life. In addition to the recovery of stable relationships, the recognition of five regions in need of further study is a significant outcome of this work. Based on our analyses of current availability and future requirements of data, we make clear recommendations for forthcoming research.
Abstract:
Here, we describe gene expression compositional assignment (GECA), a powerful, yet simple method based on compositional statistics that can validate the transfer of prior knowledge, such as gene lists, into independent data sets, platforms and technologies. Transcriptional profiling has been used to derive gene lists that stratify patients into prognostic molecular subgroups and assess biomarker performance in the pre-clinical setting. Archived public data sets are an invaluable resource for subsequent in silico validation, though their use can lead to data integration issues. We show that GECA can be used without the need for normalising expression levels between data sets and can outperform rank-based correlation methods. To validate GECA, we demonstrate its success in the cross-platform transfer of gene lists in different domains including: bladder cancer staging, tumour site of origin and mislabelled cell lines. We also show its effectiveness in transferring an epithelial ovarian cancer prognostic gene signature across technologies, from a microarray to a next-generation sequencing setting. In a final case study, we predict the tumour site of origin and histopathology of epithelial ovarian cancer cell lines. In particular, we identify and validate the commonly-used cell line OVCAR-5 as non-ovarian, being gastrointestinal in origin. GECA is available as an open-source R package.
Abstract:
There is a growing incentive for sociologists to demonstrate the use value of their research. Research ‘impact’ is a driver of research funding and a measure of academic standing. Academic debate on this issue has intensified since Burawoy’s (2004) call for a ‘public’ sociology. However, the academy is no longer the sole or primary producer of knowledge, and empirical sociologists need to contend with the ‘huge swathes’ of social data that now exist (Savage and Burrows, 2007). This article furthers these debates by considering power struggles between competing forms of knowledge. Using a case study, it specifically considers the power struggle between normative and empirical knowledge, and how providers of knowledge assert legitimacy for their truth claims. The article concludes that the ideas of ‘impact’ and ‘use-value’ are extremely complex and depend, in the policy context, on knowledge power struggles and on how policy makers want to view the world. © The Author(s) 2012
Abstract:
Kelp forests along temperate and polar coastlines represent some of the most diverse and productive habitats on Earth. Here, we synthesize information from >60 years of research on the structure and functioning of kelp forest habitats in European waters, with particular emphasis on the coasts of the UK and Ireland, which represent an important biogeographic transition zone that is subjected to multiple threats and stressors. We collated existing data on kelp distribution and abundance and reanalyzed these data to describe the structure of kelp forests along a spatial gradient spanning more than 10° of latitude. We then examined ecological goods and services provided by kelp forests, including elevated secondary production, nutrient cycling, energy capture and flow, coastal defense, direct applications, and biodiversity repositories, before discussing current and future threats posed to kelp forests and identifying key knowledge gaps. Recent evidence unequivocally demonstrates that the structure of kelp forests in the NE Atlantic is changing in response to climate- and non-climate-related stressors, which will have major implications for the structure and functioning of coastal ecosystems. However, kelp-dominated habitats along much of the NE Atlantic coastline have been chronically understudied over recent decades in comparison with other regions such as Australasia and North America. The paucity of field-based research currently impedes our ability to conserve and manage these important ecosystems. Targeted observational and experimental research conducted over large spatial and temporal scales is urgently needed to address these knowledge gaps.
Abstract:
Recent figures show that Autism Spectrum Disorder (ASD) affects at least 1 in 88 of the population, yet for years, international public awareness of ASD was limited. Over the past 5-10 years, intense efforts have been made to raise autism awareness in the general population in countries such as the UK and the US. In this paper, we report data from a large-scale general population survey (n = 1204) in which we assessed autism awareness, knowledge about autism, and perceptions about autism interventions in Northern Ireland. We found high levels of autism awareness: over 80% of the sample were aware of ASD, and over 60% of these respondents knew someone with ASD in their own family, circle of friends or work colleagues. Generally, knowledge of strengths and challenges faced by individuals with ASD was relatively accurate. However, perceptions of interventions and service provider responsibilities were vague and uncertain. Results show that local and international autism awareness campaigns have largely been successful and that the focus should shift towards disseminating accurate information regarding intervention and service provider responsibilities.
Abstract:
The problem of detecting spatially coherent groups of data that exhibit anomalous behavior has started to attract attention due to applications across areas such as epidemic analysis and weather forecasting. Earlier efforts from the data mining community have largely focused on finding outliers, individual data objects that display deviant behavior. Such point-based methods are not easy to extend to find groups of data that exhibit anomalous behavior. Scan statistics are methods from the statistics community that have considered the problem of identifying regions where data objects exhibit a behavior that is atypical of the general dataset. The spatial scan statistic and methods that build upon it mostly adopt the framework of defining a character for regions (e.g., circular or elliptical) of objects and repeatedly sampling regions of such character, followed by applying a statistical test for anomaly detection. In the past decade, there have been efforts from the statistics community to enhance the efficiency of scan statistics as well as to enable discovery of arbitrarily shaped anomalous regions. On the other hand, the data mining community has started to look at determining anomalous regions that have behavior divergent from their neighborhood. In this chapter, we survey the space of techniques for detecting anomalous regions on spatial data from across the data mining and statistics communities while outlining connections to well-studied problems in clustering and image segmentation. We analyze the techniques systematically by categorizing them appropriately to provide a structured bird's-eye view of the work on anomalous region detection; we hope that this will encourage better cross-pollination of ideas across communities to help advance the frontier in anomaly detection.
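The scan-statistic framework described in this abstract (fix a region shape, enumerate candidate regions, score each with a statistical test) can be sketched for circular zones. The sketch assumes a Poisson model with a Kulldorff-style log-likelihood ratio, which is one common choice and not necessarily what any surveyed method uses. The function names and data are hypothetical, and the Monte Carlo step that would turn the best score into a p-value is omitted.

```python
import math

def poisson_llr(c, e, total_c):
    """Log-likelihood ratio for a zone with c observed and e expected cases."""
    if c == 0 or c <= e:
        return 0.0  # only zones with an excess of cases are scored
    outside_c, outside_e = total_c - c, total_c - e
    llr = c * math.log(c / e)
    if outside_c > 0:
        llr += outside_c * math.log(outside_c / outside_e)
    return llr

def scan_circles(points):
    """points: list of (x, y, cases, population).

    Returns (best_llr, center_index, radius) over circles centered on each
    location and grown one location at a time (ties in distance are added
    one by one, a simplification).
    """
    total_c = sum(p[2] for p in points)
    total_pop = sum(p[3] for p in points)
    best = (0.0, None, 0.0)
    for i, (cx, cy, _, _) in enumerate(points):
        # Order all locations by distance from the candidate center
        by_dist = sorted(points, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        c = pop = 0
        for x, y, cases, population in by_dist:
            c += cases
            pop += population
            if pop > total_pop / 2:
                break  # assumed convention: a zone holds at most half the population
            e = total_c * pop / total_pop  # expected cases under the null
            llr = poisson_llr(c, e, total_c)
            if llr > best[0]:
                best = (llr, i, math.hypot(x - cx, y - cy))
    return best

# Hypothetical data: three neighboring locations with elevated case counts
points = [
    (0, 0, 2, 100), (1, 0, 3, 100), (0, 1, 2, 100),
    (5, 5, 20, 100), (5, 6, 18, 100), (6, 5, 19, 100),
    (9, 9, 2, 100), (9, 0, 3, 100),
]
best_llr, center, radius = scan_circles(points)
print(f"best zone: center index {center}, radius {radius:.2f}, LLR {best_llr:.2f}")
```

In practice the maximum score is compared against scores from datasets simulated under the null hypothesis, since scanning many overlapping zones invalidates a single-test p-value.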
Abstract:
The past decade has witnessed an unprecedented growth in the amount of available digital content, and its volume is expected to continue to grow over the next few years. Unstructured text data generated from web and enterprise sources form a large fraction of such content. Many of these contain large volumes of reusable data such as solutions to frequently occurring problems, and general know-how that may be reused in appropriate contexts. In this work, we address issues around leveraging unstructured text data from sources as diverse as the web and the enterprise within the Case-based Reasoning framework. Case-based Reasoning (CBR) provides a framework and methodology for systematic reuse of historical knowledge that is available in the form of problem-solution pairs, in solving new problems. Here, we consider possibilities of enhancing Textual CBR systems under three main themes: procurement, maintenance and retrieval. We adapt and build upon state-of-the-art techniques from data mining and natural language processing in addressing various challenges therein. Under procurement, we investigate the problem of extracting cases (i.e., problem-solution pairs) from data sources such as incident/experience reports. We develop case-base maintenance methods specifically tuned to text, targeted towards retaining solutions such that the utility of the filtered case base in solving new problems is maximized. Further, we address the problem of query suggestions for textual case-bases and show that exploiting the problem-solution partition can enhance retrieval effectiveness by prioritizing more useful query suggestions. Additionally, we illustrate interpretable clustering as a tool to drill down into domain-specific text collections (since CBR systems are usually very domain specific) and develop techniques for improved similarity assessment in social media sources such as microblogs. Through extensive empirical evaluations, we illustrate the improvements that we are able to achieve over the state-of-the-art methods for the respective tasks.
Abstract:
Objective
To explore the concerns, needs and knowledge of women diagnosed with Gestational Diabetes Mellitus (GDM).
Design
A qualitative study of women with GDM or a history of GDM.
Methods
Nineteen women who were either pregnant and recently diagnosed with GDM or postnatal with a recent history of GDM were recruited from outpatient diabetes care clinics. This qualitative study utilised focus groups. Participants were asked a series of open-ended questions to explore 1) current knowledge of GDM; 2) anxiety when diagnosed with GDM, and whether this changed over time; 3) understanding and managing GDM; and 4) the future impact of GDM. The data were analysed using a conventional content analysis approach.
Findings
Women experience a steep learning curve when initially diagnosed and eventually become skilled at managing their disease effectively. The use of insulin is associated with fear and guilt. Diet advice was sometimes complex and not culturally appropriate. Women appear not to be fully aware of the short or long-term consequences of a diagnosis of GDM.
Conclusions
Midwives and other health care professionals need to be cognisant of the impact of a diagnosis of GDM and give individual and culturally appropriate advice (especially with regard to diet). High-quality, evidence-based information resources need to be made available to this group of women. Future health risks and lifestyle changes need to be discussed at diagnosis to ensure women have the opportunity to improve their health.