535 results for Clustering Analysis


Relevance:

30.00%

Publisher:

Abstract:

Barmah Forest virus (BFV) disease is one of the most widespread mosquito-borne diseases in Australia. The number of outbreaks and the incidence rate of BFV in Australia have attracted growing concerns about the spatio-temporal complexity and underlying risk factors of BFV disease. A large number of notifications has been recorded continuously in Queensland since 1992. Yet, little is known about the spatial and temporal characteristics of the disease. I aim to use notification data to better understand the effects of climatic, demographic, socio-economic and ecological risk factors on the spatial epidemiology of BFV disease transmission, develop predictive risk models and forecast future disease risks under climate change scenarios. Computerised data files of daily notifications of BFV disease and climatic variables in Queensland during 1992-2008 were obtained from Queensland Health and the Australian Bureau of Meteorology, respectively. Projections of climate data for the years 2025, 2050 and 2100 were obtained from the Commonwealth Scientific and Industrial Research Organisation. Data on socio-economic, demographic and ecological factors were also obtained from relevant government departments as follows: 1) socio-economic and demographic data from the Australian Bureau of Statistics; 2) wetlands data from the Department of Environment and Resource Management; and 3) tidal readings from the Queensland Department of Transport and Main Roads. Disease notifications were geocoded, and spatial and temporal patterns of disease were investigated using geostatistics. Visualisation of BFV disease incidence rates through mapping reveals the presence of substantial spatio-temporal variation at the statistical local area (SLA) level over time. Results reveal high incidence rates of BFV disease along coastal areas compared to the whole area of Queensland. 
A Mantel-Haenszel Chi-square analysis for trend reveals a statistically significant relationship between BFV disease incidence rates and age groups (χ² = 7587, p<0.01). Semi-variogram analysis and smoothed maps created from interpolation techniques indicate that the pattern of spatial autocorrelation was not homogeneous across the state. A cluster analysis was used to detect the hot spots/clusters of BFV disease at the SLA level. Most likely spatial and space-time clusters are detected at the same locations across coastal Queensland (p<0.05). The study demonstrates heterogeneity of disease risk at the SLA level and reveals the spatial and temporal clustering of BFV disease in Queensland. Discriminant analysis was employed to establish a link between wetland classes, climate zones and BFV disease, because the importance of wetlands in the transmission of BFV disease remains unclear. The multivariable discriminant modelling analyses demonstrate that wetland types of saline 1, riverine and saline tidal influence were the most significant risk factors for BFV disease in all climate and buffer zones, while lacustrine, palustrine, estuarine, saline 2 and saline 3 wetlands were less important. The model accuracies were 76%, 98% and 100% for BFV risk in subtropical, tropical and temperate climate zones, respectively. This study demonstrates that BFV disease risk varied with wetland class and climate zone. The study suggests that wetlands may act as potential breeding habitats for BFV vectors. Multivariable spatial regression models were applied to assess the impact of spatial climatic, socio-economic and tidal factors on BFV disease in Queensland. Spatial regression models were developed to account for spatial effects. Spatial regression models generated superior estimates over a traditional regression model. 
In the spatial regression models, BFV disease incidence shows an inverse relationship with minimum temperature, low tide and distance to coast, and a positive relationship with rainfall in coastal areas, whereas across the whole of Queensland the disease shows an inverse relationship with minimum temperature and high tide and a positive relationship with rainfall. This study determines the most significant spatial risk factors for BFV disease across Queensland. Empirical models were developed to forecast the future risk of BFV disease outbreaks in coastal Queensland using existing climatic, socio-economic and tidal conditions under climate change scenarios. Logistic regression models were developed using BFV disease outbreak data for the existing period (2000-2008). The most parsimonious model had high sensitivity, specificity and accuracy, and this model was used to estimate and forecast BFV disease outbreaks for the years 2025, 2050 and 2100 under climate change scenarios for Australia. Important contributions arising from this research are that: (i) it is innovative to identify high-risk coastal areas by creating buffers based on grid-centroid and the use of fine-grained spatial units, i.e., mesh blocks; (ii) a spatial regression method was used to account for spatial dependence and heterogeneity of data in the study area; (iii) it determined a range of potential spatial risk factors for BFV disease; and (iv) it predicted the future risk of BFV disease outbreaks under climate change scenarios in Queensland, Australia. In conclusion, the thesis demonstrates that the distribution of BFV disease exhibits a distinct spatial and temporal variation. Such variation is influenced by a range of spatial risk factors including climatic, demographic, socio-economic, ecological and tidal variables. The thesis demonstrates that spatial regression methods can be applied to better understand the transmission dynamics of BFV disease and its risk factors. 
The research findings show that disease notification data can be integrated with multi-factorial risk factor data to develop build-up models and forecast future potential disease risks under climate change scenarios. This thesis may have implications in BFV disease control and prevention programs in Queensland.

Relevance:

30.00%

Publisher:

Abstract:

Background Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset. Results We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours. Conclusions We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. 
mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.
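As a rough illustration of the outlier-profile idea described in this abstract, the sketch below median-centres one gene's expression values, scales them by the median absolute deviation (MAD), and reads off an upper and a lower percentile as over- and under-expression scores. The function names, the 95th/5th percentile choice and the toy data are assumptions made for illustration; this is not the mCOPA implementation, whose refinements are not described in the abstract.

```python
from statistics import median


def copa_transform(values):
    """Median-centre one gene's expression values and scale by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the gene has no spread to scale by")
    return [(v - med) / mad for v in values]


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def outlier_scores(values, upper_q=95, lower_q=5):
    """Return (over_score, under_score) for one gene.

    A large positive over_score suggests a subset of over-expressed samples;
    a large negative under_score suggests a subset of under-expressed samples.
    The percentile cut-offs are illustrative assumptions.
    """
    z = sorted(copa_transform(values))
    return percentile(z, upper_q), percentile(z, lower_q)


# Toy gene (hypothetical values): flat in most samples, one strongly over-expressed.
gene = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 9.0]
over, under = outlier_scores(gene)
```

Ranking genes by the over score and, separately, by the magnitude of the under score gives two candidate lists, which is one simple way to treat both directions of outlier expression.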

Relevance:

30.00%

Publisher:

Abstract:

This paper proposes the use of Bayesian approaches with the cross likelihood ratio (CLR) as a criterion for speaker clustering within a speaker diarization system, using eigenvoice modeling techniques. The CLR has previously been shown to be an effective decision criterion for speaker clustering using Gaussian mixture models. Recently, eigenvoice modeling has become an increasingly popular technique, due to its ability to adequately represent a speaker based on sparse training data, as well as to provide an improved capture of differences in speaker characteristics. The integration of eigenvoice modeling into the CLR framework to capitalize on the advantage of both techniques has also been shown to be beneficial for the speaker clustering task. Building on that success, this paper proposes the use of Bayesian methods to compute the conditional probabilities in computing the CLR, thus effectively combining the eigenvoice-CLR framework with the advantages of a Bayesian approach to the diarization problem. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show an improved clustering performance, resulting in a 33.5% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.
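To make the decision criterion concrete, the sketch below computes one commonly used form of the cross likelihood ratio between two clusters: each cluster's data is scored against the other cluster's model, relative to a background model, and the two length-normalised log ratios are summed. Single one-dimensional Gaussians stand in for the speaker models here purely as an assumption for illustration; the paper itself uses eigenvoice models with Bayesian estimates of the conditional probabilities, which this sketch does not attempt to reproduce.

```python
import math


def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a 1-D sample (variance floored)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, max(var, 1e-6)


def gaussian_loglik(xs, mean, var):
    """Total log-likelihood of the samples under a 1-D Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var) for x in xs
    )


def cross_likelihood_ratio(xi, xj, background):
    """CLR(i, j): higher values suggest the two clusters share a speaker.

    Toy stand-in models: one Gaussian per cluster and one for the background.
    """
    model_i = fit_gaussian(xi)
    model_j = fit_gaussian(xj)
    model_bg = fit_gaussian(background)
    term_i = (gaussian_loglik(xi, *model_j) - gaussian_loglik(xi, *model_bg)) / len(xi)
    term_j = (gaussian_loglik(xj, *model_i) - gaussian_loglik(xj, *model_bg)) / len(xj)
    return term_i + term_j


# Hypothetical 1-D "features": a and b resemble each other, c does not.
a = [0.0, 0.1, -0.1, 0.05]
b = [0.02, -0.03, 0.08, -0.06]
c = [5.0, 5.1, 4.9, 5.05]
background = a + b + c
same_speaker_score = cross_likelihood_ratio(a, b, background)
different_speaker_score = cross_likelihood_ratio(a, c, background)
```

In a clustering loop, the pair of clusters with the highest score would be merged first, and merging would stop once no pair exceeds a chosen threshold.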

Relevance:

30.00%

Publisher:

Abstract:

Speaker diarization determines instances of the same speaker within a recording. Extending this task to a collection of recordings for linking together segments spoken by a unique speaker requires speaker linking. In this paper we propose a speaker linking system using linkage clustering and state-of-the-art speaker recognition techniques. We evaluate our approach against two baseline linking systems using agglomerative cluster merging (AC) and agglomerative clustering with model retraining (ACR). We demonstrate that our linking method, using complete-linkage clustering, provides a relative improvement of 20% and 29% in attribution error rate (AER), over the AC and ACR systems, respectively.

Relevance:

30.00%

Publisher:

Abstract:

In this paper we propose and evaluate a speaker attribution system using a complete-linkage clustering method. Speaker attribution refers to the annotation of a collection of spoken audio based on speaker identities. This can be achieved using diarization and speaker linking. The main challenge associated with attribution is achieving computational efficiency when dealing with large audio archives. Traditional agglomerative clustering methods with model merging and retraining are not feasible for this purpose. This has motivated the use of linkage clustering methods without retraining. We first propose a diarization system using complete-linkage clustering and show that it outperforms traditional agglomerative and single-linkage clustering based diarization systems with a relative improvement of 40% and 68%, respectively. We then propose a complete-linkage speaker linking system to achieve attribution and demonstrate a 26% relative improvement in attribution error rate (AER) over the single-linkage speaker linking approach.
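The appeal of linkage clustering without retraining is that it works from a fixed matrix of pairwise distances: no model is merged or re-estimated when clusters are joined. A minimal sketch of complete-linkage clustering follows, assuming a precomputed symmetric distance matrix and a stopping threshold; both the function name and the threshold-based stopping rule are illustrative assumptions rather than details taken from the paper.

```python
def complete_linkage(dist, threshold):
    """Agglomerative clustering on a symmetric distance matrix.

    The distance between two clusters is the LARGEST pairwise distance
    between their members. The closest pair of clusters is merged
    repeatedly until that distance would exceed the threshold.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        # Merging only concatenates index lists; distances are never recomputed
        # from retrained models.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical segments placed on a line at positions 0, 1, 10 and 11.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
groups = complete_linkage(dist, threshold=2)
```

Replacing `max` with `min` in the cluster-distance line gives single-linkage clustering, the variant the abstract compares against.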

Relevance:

30.00%

Publisher:

Abstract:

Smartphones have become a critical part of our lives, as they offer advanced capabilities with PC-like functionality. They are being widely deployed and are no longer used only for classical voice-centric communication. New smartphone malware keeps emerging, most of which still targets Symbian OS. In the case of Symbian OS, application signing seemed to be an appropriate measure for slowing down the appearance of malware. Unfortunately, recent examples showed that signing can be bypassed, resulting in new malware outbreaks. In this paper, we present a novel approach to static malware detection in resource-limited mobile environments. This approach can be used to extend currently used third-party application signing mechanisms to increase malware detection capabilities. In our work, we extract function calls from binaries in order to apply our clustering mechanism, called centroid. This method is capable of detecting unknown malware. Our results are promising, and the employed mechanism might find application at distribution channels such as online application stores. Additionally, it seems suitable for direct use on smartphones for (pre-)checking installed applications.

Relevance:

30.00%

Publisher:

Abstract:

Standard differential equation-based models of collective cell behaviour, such as the logistic growth model, invoke a mean-field assumption, which is equivalent to assuming that individuals within the population interact with each other in proportion to the average population density. Implementing such assumptions implies that the dynamics of the system are unaffected by spatial structure, such as the formation of patches or clusters within the population. Recent theoretical developments have introduced a class of models, known as moment dynamics models, which aim to account for the dynamics of individuals, pairs of individuals, triplets of individuals and so on. Such models enable us to describe the dynamics of populations with clustering; however, little progress has been made with regard to applying moment dynamics models to experimental data. Here, we report new experimental results describing the formation of a monolayer of cells using two different cell types: 3T3 fibroblast cells and MDA MB 231 breast cancer cells. Our analysis indicates that the 3T3 fibroblast cells are relatively motile, and we observe that the 3T3 fibroblast monolayer forms without clustering. In contrast, the MDA MB 231 cells are less motile, and we observe that the MDA MB 231 monolayer formation is associated with significant clustering. We calibrate a moment dynamics model and a standard mean-field model to both data sets. Our results indicate that the mean-field and moment dynamics models provide similar descriptions of the 3T3 fibroblast monolayer formation, whereas these two models give very different predictions for the MDA MB 231 monolayer formation. These outcomes indicate that standard mean-field models of collective cell behaviour are not always appropriate and that care ought to be exercised when implementing such a model.
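For reference, the mean-field logistic growth model mentioned above can be written as dC/dt = r C (1 - C/K), where C is the average cell density, r the proliferation rate and K the carrying capacity. The sketch below evaluates its closed-form solution and checks it against a simple forward-Euler integration. The parameter values are arbitrary assumptions for illustration and are not calibrated to the experiments in the paper; the moment dynamics model is not reproduced here.

```python
import math


def logistic_density(t, c0, rate, capacity):
    """Closed-form solution of dC/dt = rate * C * (1 - C / capacity), C(0) = c0."""
    return capacity * c0 / (c0 + (capacity - c0) * math.exp(-rate * t))


def euler_logistic(c0, rate, capacity, dt, steps):
    """Forward-Euler integration of the same equation, for comparison."""
    c = c0
    for _ in range(steps):
        c += dt * rate * c * (1.0 - c / capacity)
    return c


# Illustrative (assumed) parameters: 10% initial confluence, unit carrying capacity.
exact = logistic_density(t=10.0, c0=0.1, rate=0.5, capacity=1.0)
approx = euler_logistic(c0=0.1, rate=0.5, capacity=1.0, dt=0.001, steps=10000)
```

Because the growth term depends only on the average density C, the model carries no information about whether cells are evenly spread or clustered, which is the limitation the abstract highlights.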

Relevance:

30.00%

Publisher:

Abstract:

Originally developed in bioinformatics, sequence analysis is increasingly used in the social sciences for the study of life-course processes. The methodology generally employed consists of computing dissimilarities between the trajectories and, if typologies are sought, clustering the trajectories according to their similarities or dissimilarities. The choice of an appropriate dissimilarity measure is a major issue when dealing with sequence analysis for life sequences. Several dissimilarities are available in the literature, but none of them has become the undisputed choice. In this paper, instead of deciding upon one dissimilarity measure, we propose to use an optimal convex combination of different dissimilarities. The optimality is automatically determined by the clustering procedure and is defined with respect to the within-class variance.
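A convex combination of dissimilarities is a weighted sum D = w_1 D_1 + ... + w_m D_m with non-negative weights that sum to one. The sketch below builds such a combined matrix from several precomputed dissimilarity matrices. The function name and the toy matrices are assumptions for illustration; how the weights are optimised with respect to the within-class variance is specific to the paper and is not reproduced here.

```python
def combine_dissimilarities(matrices, weights):
    """Convex combination of square dissimilarity matrices of equal size.

    Weights must be non-negative and sum to one, so the result stays
    within the range spanned by the input dissimilarities.
    """
    if len(matrices) != len(weights):
        raise ValueError("one weight is required per matrix")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical dissimilarity matrices over the same two trajectories.
d_first = [[0.0, 2.0], [2.0, 0.0]]
d_second = [[0.0, 4.0], [4.0, 0.0]]
combined = combine_dissimilarities([d_first, d_second], [0.25, 0.75])
```

In practice the input matrices would usually be rescaled to a comparable range first, since otherwise a measure with larger raw values dominates the combination regardless of the weights.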

Relevance:

30.00%

Publisher:

Abstract:

This paper is devoted to the analysis of career paths and employability. The state of the art on this topic is rather poor in methodologies. Some authors propose distances well adapted to the data, but limit their analysis to hierarchical clustering. Other authors apply sophisticated methods, but only after paying the price of transforming the categorical data into continuous data via a factorial analysis. The latter approach has an important drawback, since it makes a linear assumption on the data. We propose a new methodology, inspired by biology and adapted to career paths, combining optimal matching and self-organizing maps. A complete study on real-life data will illustrate our proposal.
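Optimal matching measures the dissimilarity between two categorical sequences as the minimum total cost of the insertions, deletions and substitutions needed to turn one into the other. A minimal dynamic-programming sketch follows; the constant indel and substitution costs and the example career states are assumptions for illustration, since the abstract does not state the costs used, and the self-organizing map stage is not shown.

```python
def optimal_matching(seq_a, seq_b, indel=1.0, substitution=2.0):
    """Minimum edit cost between two state sequences.

    Uses constant costs: `indel` per inserted or deleted state and
    `substitution` per replaced state (both are illustrative defaults).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel
    for j in range(1, cols):
        d[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if seq_a[i - 1] == seq_b[j - 1] else substitution
            d[i][j] = min(
                d[i - 1][j] + indel,      # delete a state from seq_a
                d[i][j - 1] + indel,      # insert a state from seq_b
                d[i - 1][j - 1] + sub,    # match or substitute
            )
    return d[-1][-1]


# Hypothetical yearly career states for two individuals.
path_a = ["school", "job", "job"]
path_b = ["school", "school", "job"]
cost = optimal_matching(path_a, path_b)
```

The resulting pairwise costs form a dissimilarity matrix over career paths, which works directly on the categorical states and avoids the linear assumption introduced by a prior factorial analysis.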

Relevance:

30.00%

Publisher:

Abstract:

OBJECTIVE(S): An individual's risk of developing cardiovascular disease (CVD) is influenced by genetic factors. This study focussed on mapping genetic loci for CVD-risk traits in a unique population isolate derived from Norfolk Island. METHODS: This investigation focussed on 377 individuals descended from the population founders. Principal component analysis was used to extract orthogonal components from 11 cardiovascular risk traits. Multipoint variance component methods were used to assess genome-wide linkage to the derived factors using SOLAR. A total of 285 of the 377 related individuals were informative for linkage analysis. RESULTS: A total of 4 principal components accounting for 83% of the total variance were derived. Principal component 1 was loaded with body size indicators; principal component 2 with body size, cholesterol and triglyceride levels; principal component 3 with the blood pressures; and principal component 4 with LDL-cholesterol and total cholesterol levels. Suggestive evidence of linkage for principal component 2 (h² = 0.35) was observed on chromosome 5q35 (LOD = 1.85; p = 0.0008). Peak regions on chromosomes 10p11.2 (LOD = 1.27; p = 0.005) and 12q13 (LOD = 1.63; p = 0.003) were also observed to segregate with principal components 1 (h² = 0.33) and 4 (h² = 0.42), respectively. CONCLUSION(S): This study investigated a number of CVD risk traits in a unique isolated population. Findings support the clustering of CVD risk traits and provide interesting evidence of a region on chromosome 5q35 segregating with weight, waist circumference, HDL-c and total triglyceride levels.

Relevance:

30.00%

Publisher:

Abstract:

We propose a cluster ensemble method to map the corpus documents into the semantic space embedded in Wikipedia and group them using multiple types of feature space. A heterogeneous cluster ensemble is constructed with multiple types of relations, i.e. document-term, document-concept and document-category. A final clustering solution is obtained by exploiting associations between document pairs and the hubness of the documents. Empirical analysis with various real data sets reveals that the proposed method outperforms state-of-the-art text clustering approaches.

Relevance:

30.00%

Publisher:

Abstract:

Crashes on motorways contribute to a significant proportion (40-50%) of non-recurrent motorway congestion. Hence, reducing crashes will help address congestion issues (Meyer, 2008). Crash likelihood estimation studies commonly focus on traffic conditions in a short time window around the time of a crash, while longer-term pre-crash traffic flow trends are neglected. In this paper we show, through data mining techniques, that a relationship between pre-crash traffic flow patterns and crash occurrence on motorways exists, and that this knowledge has the potential to improve the accuracy of existing models and open the path for new development approaches. The data for the analysis was extracted from records collected between 2007 and 2009 on the Shibuya and Shinjuku lines of the Tokyo Metropolitan Expressway in Japan. The dataset includes a total of 824 rear-end and sideswipe crashes that have been matched with traffic flow data for the hour prior to the crash using an incident detection algorithm. Traffic flow trends (traffic speed/occupancy time series) revealed that crashes could be clustered with regard to the dominant traffic flow pattern prior to the crash. Using the k-means clustering method allowed the crashes to be clustered based on their flow trends rather than their distance. Four major trends have been found in the clustering results. Based on these findings, crash likelihood estimation algorithms can be fine-tuned based on the monitored traffic flow conditions with a sliding window of 60 minutes to increase the accuracy of the results and minimize false alarms.

Relevance:

30.00%

Publisher:

Abstract:

Crashes that occur on motorways contribute to a significant proportion (40-50%) of non-recurrent motorway congestion. Hence, reducing the frequency of crashes assists in addressing congestion issues (Meyer, 2008). Crash likelihood estimation studies commonly focus on traffic conditions in a short time window around the time of a crash, while longer-term pre-crash traffic flow trends are neglected. In this paper we show, through data mining techniques, that a relationship between pre-crash traffic flow patterns and crash occurrence on motorways exists. We compare these patterns with normal traffic trends and show that this knowledge has the potential to improve the accuracy of existing models and open the path for new development approaches. The data for the analysis was extracted from records collected between 2007 and 2009 on the Shibuya and Shinjuku lines of the Tokyo Metropolitan Expressway in Japan. The dataset includes a total of 824 rear-end and sideswipe crashes that have been matched with their corresponding traffic flow data using an incident detection algorithm. Traffic trends (traffic speed time series) revealed that crashes can be clustered with regard to the dominant traffic patterns prior to the crash. Using the K-Means clustering method with a Euclidean distance function allowed the crashes to be clustered. Then, normal situation data was extracted based on the time distribution of crashes and was clustered for comparison with the "high risk" clusters. Five major trends have been found in the clustering results for both high risk and normal conditions. The study discovered that the traffic regimes differed in their speed trends. Based on these findings, crash likelihood estimation models can be fine-tuned based on the monitored traffic conditions with a sliding window of 30 minutes to increase the accuracy of the results and minimize false alarms.
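The clustering step described above can be sketched as K-Means with a Euclidean distance over equal-length pre-crash speed series. The version below is a minimal illustration that takes the initial centroids as an argument so that the result is deterministic; the function names, the initialisation strategy and the toy speed values are assumptions and are not taken from the study.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(series, initial_centroids, max_iter=100):
    """Plain K-Means: assign each series to its nearest centroid, then
    recompute each centroid as the point-wise mean of its members."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: euclidean(s, centroids[k]))
            for s in series
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for k in range(len(centroids)):
            members = [s for s, lab in zip(series, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical speed series (km/h) sampled before four crashes:
# two with stable free-flow speeds, two with a sharp speed drop.
speeds = [[80, 78, 75], [82, 79, 76], [60, 40, 20], [62, 42, 22]]
labels, centroids = k_means(speeds, initial_centroids=[speeds[0], speeds[2]])
```

Each resulting centroid is itself a speed series, so it can be read directly as the dominant pre-crash trend of its cluster and compared with centroids obtained from normal-traffic data.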

Relevance:

30.00%

Publisher:

Abstract:

The continuous growth of XML data poses a great concern in the area of XML data management. The need for processing large amounts of XML data brings complications to many applications, such as information retrieval, data integration and many others. One way of simplifying this problem is to break the massive amount of data into smaller groups by applying clustering techniques. However, XML clustering is an intricate task that may involve the processing of both the structure and the content of XML data in order to identify similar XML data. This research presents four clustering methods: two methods utilizing the structure of XML documents and the other two utilizing both the structure and the content. The two structural clustering methods have different data models. One is based on a path model and the other is based on a tree model. These methods employ rigid similarity measures which aim to identify corresponding elements between documents with different or similar underlying structure. The two clustering methods that utilize both the structural and content information vary in terms of how the structure and content similarity are combined. One clustering method calculates the document similarity by using a linear weighting combination strategy of structure and content similarities. The content similarity in this clustering method is based on a semantic kernel. The other method calculates the distance between documents by a non-linear combination of the structure and content of XML documents using a semantic kernel. Empirical analysis shows that the structure-only clustering method based on the tree model is more scalable than the structure-only clustering method based on the path model, as the tree similarity measure for the tree model does not need to visit the parents of an element many times. Experimental results also show that the clustering methods perform better with the inclusion of the content information on most test document collections. 
To further the research, the structural clustering method based on tree model is extended and employed in XML transformation. The results from the experiments show that the proposed transformation process is faster than the traditional transformation system that translates and converts the source XML documents sequentially. Also, the schema matching process of XML transformation produces a better matching result in a shorter time.
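The linear weighting strategy mentioned in this abstract can be illustrated as sim = w * structure_similarity + (1 - w) * content_similarity, with w between 0 and 1. In the sketch below, Jaccard overlap of root-to-leaf paths stands in for the structural measure and Jaccard overlap of terms stands in for the content measure. Both stand-ins, the function names and the example documents are assumptions for illustration; the thesis uses its own path and tree similarity measures and a semantic kernel for content, none of which are reproduced here.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are considered identical
    return len(a & b) / len(a | b)


def document_similarity(paths_a, paths_b, terms_a, terms_b, weight=0.5):
    """Linear combination of a structure score and a content score.

    `weight` is the share given to structure; (1 - weight) goes to content.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    structure = jaccard(paths_a, paths_b)
    content = jaccard(terms_a, terms_b)
    return weight * structure + (1.0 - weight) * content


# Two hypothetical XML documents described by their paths and their terms.
paths_a = {"/book/title", "/book/author"}
paths_b = {"/book/title", "/book/price"}
terms_a = {"xml", "clustering"}
terms_b = {"xml", "clustering"}
score = document_similarity(paths_a, paths_b, terms_a, terms_b, weight=0.5)
```

Setting `weight=1.0` reduces the measure to a structure-only comparison, which mirrors the distinction the abstract draws between the structure-only methods and the combined ones.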

Relevance:

30.00%

Publisher:

Abstract:

Crashes that occur on motorways contribute to a significant proportion (40-50%) of non-recurrent motorway congestion. Hence, reducing the frequency of crashes assists in addressing congestion issues (Meyer, 2008). Analysing traffic conditions and discovering risky traffic trends and patterns are essential to crash likelihood estimation studies and still require more attention and investigation. In this paper we show, through data mining techniques, that there is a relationship between pre-crash traffic flow patterns and crash occurrence on motorways, compare these patterns with normal traffic trends, and show that this knowledge has the potential to improve the accuracy of existing crash likelihood estimation models and open the path for new development approaches. The data for the analysis was extracted from records collected between 2007 and 2009 on the Shibuya and Shinjuku lines of the Tokyo Metropolitan Expressway in Japan. The dataset includes a total of 824 rear-end and sideswipe crashes that have been matched with their corresponding traffic flow data using an incident detection algorithm. Traffic trends (traffic speed time series) revealed that crashes can be clustered with regard to the dominant traffic patterns prior to the crash occurrence. The K-Means clustering algorithm was applied to determine dominant pre-crash traffic patterns. In the first phase of this research, traffic regimes were identified by analysing crashes and normal traffic situations using half an hour of speed data at locations upstream of crashes. Then, the second phase investigated different combinations of speed risk indicators to distinguish crashes from normal traffic situations more precisely. Five major trends have been found in the first phase of this paper for both high risk and normal conditions. The study discovered that the traffic regimes differed in their speed trends. 
Moreover, the second phase explains that spatiotemporal difference of speed is a better risk indicator among different combinations of speed related risk indicators. Based on these findings, crash likelihood estimation models can be fine-tuned to increase accuracy of estimations and minimize false alarms.