995 resultados para clustered data
Resumo:
Overlaying maps using a desktop GIS is often the first step of a multivariate spatial analysis. The potential of this operation has increased considerably as data sources an dWeb services to manipulate them are becoming widely available via the Internet. Standards from the OGC enable such geospatial ‘mashups’ to be seamless and user driven, involving discovery of thematic data. The user is naturally inclined to look for spatial clusters and ‘correlation’ of outcomes. Using classical cluster detection scan methods to identify multivariate associations can be problematic in this context, because of a lack of control on or knowledge about background populations. For public health and epidemiological mapping, this limiting factor can be critical but often the focus is on spatial identification of risk factors associated with health or clinical status. In this article we point out that this association itself can ensure some control on underlying populations, and develop an exploratory scan statistic framework for multivariate associations. Inference using statistical map methodologies can be used to test the clustered associations. The approach is illustrated with a hypothetical data example and an epidemiological study on community MRSA. Scenarios of potential use for online mashups are introduced but full implementation is left for further research.
Resumo:
Aim: To test the efficacy of a comprehensive health assessment using the CHAP tool in adults with an intellectual disability (ID). Method: A cluster randomised control design was used. The intervention group received the CHAP, while the control group received usual care. This tool directed carers to gather a health history, which was reviewed by the person’s general practitioner (GP) who completed a medical examination and a healthcare plan. The tool acted as an advocacy tool, a ticket-of-entry to the GPs surgery and educated the GP and the caregiver about the deficits in the healthcare of adults with ID. The healthcare of the participants was followed for one-year after intervention by the collection of data from GP and service providers’ notes. Also interviews were performed with all those involved. Results: We obtained a representative sample of adults with ID (RR%). We found the intervention group received a significant increase in many health promotion/disease prevention activities e.g. hearing screening was times and a Pap smear was times more likely to have occurred in the intervention groups.We also found a trend towards earlier detection of disease. Conclusions: The CHAP process improves the provision of health screening/promotion activities and should be implemented.
Resumo:
The inv(16) and related t(16;16) are found in 10% of all cases with de novo acute myeloid leukemia. In these rearrangements the core binding factor beta (CBFB) gene on 16q22 is fused to the smooth muscle myosin heavy chain gene (MYH11) on 16p13. To gain insight into the mechanisms causing the inv(16) we have analysed 24 genomic CBFB-MYH11 breakpoints. All breakpoints in CBFB are located in a 15-Kb intron. More than 50% of the sequenced 6.2 Kb of this intron consists of human repetitive elements. Twenty-one of the 24 breakpoints in MYH11 are located in a 370-bp intron. The remaining three breakpoints in MYH11 are located more upstream. The localization of three breakpoints adjacent to a V(D)J recombinase signal sequence in MYH11 suggests a V(D)J recombinase-mediated rearrangement in these cases. V(D)J recombinase-associated characteristics (small nucleotide deletions and insertions of random nucleotides) were detected in six other cases. CBFB and MYH11 duplications were detected in four of six cases tested.
Resumo:
Most common human traits and diseases have a polygenic pattern of inheritance: DNA sequence variants at many genetic loci influence the phenotype. Genome-wide association (GWA) studies have identified more than 600 variants associated with human traits, but these typically explain small fractions of phenotypic variation, raising questions about the use of further studies. Here, using 183,727 individuals, we show that hundreds of genetic variants, in at least 180 loci, influence adult height, a highly heritable and classic polygenic trait. The large number of loci reveals patterns with important implications for genetic studies of common human diseases and traits. First, the 180 loci are not random, but instead are enriched for genes that are connected in biological pathways (P = 0.016) and that underlie skeletal growth defects (P < 0.001). Second, the likely causal gene is often located near the most strongly associated variant: in 13 of 21 loci containing a known skeletal growth gene, that gene was closest to the associated variant. Third, at least 19 loci have multiple independently associated variants, suggesting that allelic heterogeneity is a frequent feature of polygenic traits, that comprehensive explorations of already-discovered loci should discover additional variants and that an appreciable fraction of associated loci may have been identified. Fourth, associated variants are enriched for likely functional effects on genes, being over-represented among variants that alter amino-acid structure of proteins and expression levels of nearby genes. Our data explain approximately 10% of the phenotypic variation in height, and we estimate that unidentified common variants of similar effect sizes would increase this figure to approximately 16% of phenotypic variation (approximately 20% of heritable variation). Although additional approaches are needed to dissect the genetic architecture of polygenic human traits fully, our findings indicate that GWA studies can identify large numbers of loci that implicate biologically relevant genes and pathways.
Resumo:
Flooding is a major hazard in both rural and urban areas worldwide, but it is in urban areas that the impacts are most severe. An investigation of the ability of high resolution TerraSAR-X data to detect flooded regions in urban areas is described. An important application for this would be the calibration and validation of the flood extent predicted by an urban flood inundation model. To date, research on such models has been hampered by lack of suitable distributed validation data. The study uses a 3m resolution TerraSAR-X image of a 1-in-150 year flood near Tewkesbury, UK, in 2007, for which contemporaneous aerial photography exists for validation. The DLR SETES SAR simulator was used in conjunction with airborne LiDAR data to estimate regions of the TerraSAR-X image in which water would not be visible due to radar shadow or layover caused by buildings and taller vegetation, and these regions were masked out in the flood detection process. A semi-automatic algorithm for the detection of floodwater was developed, based on a hybrid approach. Flooding in rural areas adjacent to the urban areas was detected using an active contour model (snake) region-growing algorithm seeded using the un-flooded river channel network, which was applied to the TerraSAR-X image fused with the LiDAR DTM to ensure the smooth variation of heights along the reach. A simpler region-growing approach was used in the urban areas, which was initialized using knowledge of the flood waterline in the rural areas. Seed pixels having low backscatter were identified in the urban areas using supervised classification based on training areas for water taken from the rural flood, and non-water taken from the higher urban areas. Seed pixels were required to have heights less than a spatially-varying height threshold determined from nearby rural waterline heights. Seed pixels were clustered into urban flood regions based on their close proximity, rather than requiring that all pixels in the region should have low backscatter. This approach was taken because it appeared that urban water backscatter values were corrupted in some pixels, perhaps due to contributions from side-lobes of strong reflectors nearby. The TerraSAR-X urban flood extent was validated using the flood extent visible in the aerial photos. It turned out that 76% of the urban water pixels visible to TerraSAR-X were correctly detected, with an associated false positive rate of 25%. If all urban water pixels were considered, including those in shadow and layover regions, these figures fell to 58% and 19% respectively. These findings indicate that TerraSAR-X is capable of providing useful data for the calibration and validation of urban flood inundation models.
Resumo:
FoxC, FoxF, FoxL1 and FoxQ1 genes have been shown to be clustered in some animal genomes, with mesendodermal expression hypothesised as a selective force maintaining cluster integrity. Hypotheses are, however, constrained by a lack of data from the Lophotrochozoa. Here we characterise members of the FoxC, FoxF, FoxL1 and FoxQ1 families from the annelid Capitella teleta and the molluscs Lottia gigantea and Patella vulgata. We cloned FoxC, FoxF, FoxL1 and FoxQ1 genes from C. teleta, and FoxC, FoxF and FoxL1 genes from P. vulgata, and established their expression during development. We also examined their genomic organisation in C. teleta and L. gigantea, and investigated local syntenic relationships. Our results show mesodermal and anterior gut expression is a common feature of these genes in lophotrochozoans. In L. gigantea FoxC, FoxF and FoxL1 are closely linked, while in C. teleta Ct-foxC and Ct-foxL1 are closely linked, with Ct-foxF and Ct-foxQ1 on different scaffolds. Adjacent to these genes there is limited evidence of local synteny. This demonstrates conservation of genomic organisation and expression of these genes can be traced in all three bilaterian Superphyla. These data are evaluated against competing theories for the long-term maintenance of gene clusters.
Resumo:
The performance of flood inundation models is often assessed using satellite observed data; however these data have inherent uncertainty. In this study we assess the impact of this uncertainty when calibrating a flood inundation model (LISFLOOD-FP) for a flood event in December 2006 on the River Dee, North Wales, UK. The flood extent is delineated from an ERS-2 SAR image of the event using an active contour model (snake), and water levels at the flood margin calculated through intersection of the shoreline vector with LiDAR topographic data. Gauged water levels are used to create a reference water surface slope for comparison with the satellite-derived water levels. Residuals between the satellite observed data points and those from the reference line are spatially clustered into groups of similar values. We show that model calibration achieved using pattern matching of observed and predicted flood extent is negatively influenced by this spatial dependency in the data. By contrast, model calibration using water elevations produces realistic calibrated optimum friction parameters even when spatial dependency is present. To test the impact of removing spatial dependency a new method of evaluating flood inundation model performance is developed by using multiple random subsamples of the water surface elevation data points. By testing for spatial dependency using Moran’s I, multiple subsamples of water elevations that have no significant spatial dependency are selected. The model is then calibrated against these data and the results averaged. This gives a near identical result to calibration using spatially dependent data, but has the advantage of being a statistically robust assessment of model performance in which we can have more confidence. Moreover, by using the variations found in the subsamples of the observed data it is possible to assess the effects of observational uncertainty on the assessment of flooding risk.
Resumo:
This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.
Resumo:
Background: A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results: In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions: This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.
Resumo:
We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage forming tumor of uncertain cellular origin, according to similarities with surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines from chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosacoma cells could be grouped in two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating apart from primary tumor cells and normal tissue cells, but clustering together with cell lines from epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could lead to uncover new cancer subtypes.
Does published orthodontic research account for clustering effects during statistical data analysis?
Resumo:
In orthodontics, multiple site observations within patients or multiple observations collected at consecutive time points are often encountered. Clustered designs require larger sample sizes compared to individual randomized trials and special statistical analyses that account for the fact that observations within clusters are correlated. It is the purpose of this study to assess to what degree clustering effects are considered during design and data analysis in the three major orthodontic journals. The contents of the most recent 24 issues of the American Journal of Orthodontics and Dentofacial Orthopedics (AJODO), Angle Orthodontist (AO), and European Journal of Orthodontics (EJO) from December 2010 backwards were hand searched. Articles with clustering effects and whether the authors accounted for clustering effects were identified. Additionally, information was collected on: involvement of a statistician, single or multicenter study, number of authors in the publication, geographical area, and statistical significance. From the 1584 articles, after exclusions, 1062 were assessed for clustering effects from which 250 (23.5 per cent) were considered to have clustering effects in the design (kappa = 0.92, 95 per cent CI: 0.67-0.99 for inter rater agreement). From the studies with clustering effects only, 63 (25.20 per cent) had indicated accounting for clustering effects. There was evidence that the studies published in the AO have higher odds of accounting for clustering effects [AO versus AJODO: odds ratio (OR) = 2.17, 95 per cent confidence interval (CI): 1.06-4.43, P = 0.03; EJO versus AJODO: OR = 1.90, 95 per cent CI: 0.84-4.24, non-significant; and EJO versus AO: OR = 1.15, 95 per cent CI: 0.57-2.33, non-significant). The results of this study indicate that only about a quarter of the studies with clustering effects account for this in statistical data analysis.
Resumo:
High-throughput assays, such as yeast two-hybrid system, have generated a huge amount of protein-protein interaction (PPI) data in the past decade. This tremendously increases the need for developing reliable methods to systematically and automatically suggest protein functions and relationships between them. With the available PPI data, it is now possible to study the functions and relationships in the context of a large-scale network. To data, several network-based schemes have been provided to effectively annotate protein functions on a large scale. However, due to those inherent noises in high-throughput data generation, new methods and algorithms should be developed to increase the reliability of functional annotations. Previous work in a yeast PPI network (Samanta and Liang, 2003) has shown that the local connection topology, particularly for two proteins sharing an unusually large number of neighbors, can predict functional associations between proteins, and hence suggest their functions. One advantage of the work is that their algorithm is not sensitive to noises (false positives) in high-throughput PPI data. In this study, we improved their prediction scheme by developing a new algorithm and new methods which we applied on a human PPI network to make a genome-wide functional inference. We used the new algorithm to measure and reduce the influence of hub proteins on detecting functionally associated proteins. We used the annotations of the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as independent and unbiased benchmarks to evaluate our algorithms and methods within the human PPI network. We showed that, compared with the previous work from Samanta and Liang, our algorithm and methods developed in this study improved the overall quality of functional inferences for human proteins. By applying the algorithms to the human PPI network, we obtained 4,233 significant functional associations among 1,754 proteins. Further comparisons of their KEGG and GO annotations allowed us to assign 466 KEGG pathway annotations to 274 proteins and 123 GO annotations to 114 proteins with estimated false discovery rates of <21% for KEGG and <30% for GO. We clustered 1,729 proteins by their functional associations and made pathway analysis to identify several subclusters that are highly enriched in certain signaling pathways. Particularly, we performed a detailed analysis on a subcluster enriched in the transforming growth factor β signaling pathway (P<10-50) which is important in cell proliferation and tumorigenesis. Analysis of another four subclusters also suggested potential new players in six signaling pathways worthy of further experimental investigations. Our study gives clear insight into the common neighbor-based prediction scheme and provides a reliable method for large-scale functional annotations in this post-genomic era.