919 results for label hierarchical clustering
Abstract:
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.
Abstract:
This dissertation is primarily an applied statistical modelling investigation, motivated by a case study comprising real data and real questions. Theoretical questions on modelling and computation of normalization constants arose from pursuit of these data analytic questions. The essence of the thesis can be described as follows. Consider binary data observed on a two-dimensional lattice. A common problem with such data is the ambiguity of zeroes recorded. These may represent zero response given some threshold (presence) or that the threshold has not been triggered (absence). Suppose that the researcher wishes to estimate the effects of covariates on the binary responses, whilst taking into account underlying spatial variation, which is itself of some interest. This situation arises in many contexts and the dingo, cypress and toad case studies described in the motivation chapter are examples of this. Two main approaches to modelling and inference are investigated in this thesis. The first is frequentist and based on generalized linear models, with spatial variation modelled by using a block structure or by smoothing the residuals spatially. The EM algorithm can be used to obtain point estimates, coupled with bootstrapping or asymptotic MLE estimates for standard errors. The second approach is Bayesian and based on a three- or four-tier hierarchical model, comprising a logistic regression with covariates for the data layer, a binary Markov random field (MRF) for the underlying spatial process, and suitable priors for parameters in these main models. The three-parameter autologistic model is a particular MRF of interest. Markov chain Monte Carlo (MCMC) methods comprising hybrid Metropolis/Gibbs samplers are suitable for computation in this situation. Model performance can be gauged by MCMC diagnostics. Model choice can be assessed by incorporating another tier in the modelling hierarchy.
This requires evaluation of a normalization constant, a notoriously difficult problem. Difficulty with estimating the normalization constant for the MRF can be overcome by using a path integral approach, although this is a highly computationally intensive method. Different methods of estimating ratios of normalization constants (NCs) are investigated, including importance sampling Monte Carlo (ISMC), dependent Monte Carlo based on MCMC simulations (MCMC), and reverse logistic regression (RLR). I develop an idea present though not fully developed in the literature, and propose the Integrated mean canonical statistic (IMCS) method for estimating log NC ratios for binary MRFs. The IMCS method falls within the framework of the newly identified path sampling methods of Gelman & Meng (1998) and outperforms ISMC, MCMC and RLR. It also does not rely on simplifying assumptions, such as ignoring spatio-temporal dependence in the process. A thorough investigation is made of the application of IMCS to the three-parameter autologistic model. This work introduces background computations required for the full implementation of the four-tier model in Chapter 7. Two different extensions of the three-tier model to a four-tier version are investigated. The first extension incorporates temporal dependence in the underlying spatio-temporal process. The second extension allows the successes and failures in the data layer to depend on time. The MCMC computational method is extended to incorporate the extra layer. A major contribution of the thesis is the development of a fully Bayesian approach to inference for these hierarchical models for the first time. Note: The author of this thesis has agreed to make it open access but invites people downloading the thesis to send her an email via the 'Contact Author' function.
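For readers unfamiliar with path sampling, the standard identity behind estimating a log ratio of normalization constants can be sketched as follows. This is a generic exponential-family statement and not necessarily the thesis's exact formulation; the symbols s(x) for the canonical statistic, Z(θ) for the normalization constant, and θ(t) for a path joining the two parameter values are notation assumed here for illustration.

```latex
\[
p(x \mid \theta) = \frac{\exp\{\theta^{\top} s(x)\}}{Z(\theta)},
\qquad
\frac{\partial}{\partial \theta} \log Z(\theta) = \mathbb{E}_{\theta}\!\left[ s(X) \right]
\]
\[
\log \frac{Z(\theta_1)}{Z(\theta_0)}
  = \int_0^1 \mathbb{E}_{\theta(t)}\!\left[ s(X) \right]^{\top} \dot{\theta}(t) \, dt,
\qquad \theta(0) = \theta_0, \quad \theta(1) = \theta_1 .
\]
```

In practice the expectation at each point on the path would typically be approximated by averaging s(X) over MCMC draws and the integral by numerical quadrature, which is one reading of the name "integrated mean canonical statistic".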
Abstract:
Digital collections are growing exponentially in size as the information age takes a firm grip on all aspects of society. As a result Information Retrieval (IR) has become an increasingly important area of research. It promises to provide new and more effective ways for users to find information relevant to their search intentions. Document clustering is one of the many tools in the IR toolbox and is far from being perfected. It groups documents that share common features. This grouping allows a user to quickly identify relevant information. If these groups are misleading then valuable information can accidentally be ignored. Therefore, the study and analysis of the quality of document clustering is important. With more and more digital information available, the performance of these algorithms is also of interest. An algorithm with a time complexity of O(n^2) can quickly become impractical when clustering a corpus containing millions of documents. Therefore, the investigation of algorithms and data structures to perform clustering in an efficient manner is vital to its success as an IR tool. Document classification is another tool frequently used in the IR field. It predicts categories of new documents based on an existing database of (document, category) pairs. Support Vector Machines (SVM) have been found to be effective when classifying text documents. As the algorithms for classification are both efficient and of high quality, the largest gains can be made from improvements to representation. Document representations are vital for both clustering and classification. Representations exploit the content and structure of documents. Dimensionality reduction can improve the effectiveness of existing representations in terms of quality and run-time performance. Research into these areas is another way to improve the efficiency and quality of clustering and classification results. Evaluating document clustering is a difficult task.
Intrinsic measures of quality such as distortion only indicate how well an algorithm minimised a similarity function in a particular vector space. Intrinsic comparisons are inherently limited by the given representation and are not comparable between different representations. Extrinsic measures of quality compare a clustering solution to a “ground truth” solution. This allows comparison between different approaches. As the “ground truth” is created by humans it can suffer from the fact that not every human interprets a topic in the same manner. Whether a document belongs to a particular topic or not can be subjective.
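As a concrete illustration of an extrinsic measure, the sketch below computes purity, one common way to score a clustering against a "ground truth" labelling. The abstract does not say which extrinsic measure the thesis uses, so the choice of purity, the function name `purity`, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def purity(cluster_labels, truth_labels):
    """Fraction of documents that fall in their cluster's majority ground-truth class."""
    if len(cluster_labels) != len(truth_labels):
        raise ValueError("label lists must have the same length")
    if not cluster_labels:
        raise ValueError("label lists must be non-empty")
    # Count the ground-truth labels that occur inside each cluster.
    by_cluster = {}
    for cluster, truth in zip(cluster_labels, truth_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    # Each cluster contributes the size of its largest ground-truth class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_labels)


# Hypothetical toy data: six documents, two clusters, two human-assigned topics.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["sport", "sport", "politics", "politics", "politics", "sport"]
score = purity(clusters, truth)
```

Because the score depends only on cluster assignments and the reference labels, it can be compared across different document representations, which is the property the abstract attributes to extrinsic measures.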
Abstract:
It is important to examine the nature of the relationships between roadway, environmental, and traffic factors and motor vehicle crashes, with the aim to improve the collective understanding of causal mechanisms involved in crashes and to better predict their occurrence. Statistical models of motor vehicle crashes are one path of inquiry often used to gain these initial insights. Recent efforts have focused on the estimation of negative binomial and Poisson regression models (and related deviants) due to their relatively good fit to crash data. Of course analysts constantly seek methods that offer greater consistency with the data generating mechanism (motor vehicle crashes in this case), provide better statistical fit, and provide insight into data structure that was previously unavailable. One such opportunity exists with some types of crash data, in particular crash-level data that are collected across roadway segments, intersections, etc. It is argued in this paper that some crash data possess hierarchical structure that has not routinely been exploited. This paper describes the application of binomial multilevel models of crash types using 548 motor vehicle crashes collected from 91 two-lane rural intersections in the state of Georgia. Crash prediction models are estimated for angle, rear-end, and sideswipe (both same direction and opposite direction) crashes. The contributions of the paper are the realization of hierarchical data structure and the application of a theoretically appealing and suitable analysis approach for multilevel data, yielding insights into intersection-related crashes by crash type.
Abstract:
Traffic control at road junctions is one of the major concerns in most metropolitan cities. Controllers of various approaches are available and the required control action is the effective green-time assigned to each traffic stream within a traffic-light cycle. The application of fuzzy logic provides the controller with the capability to handle uncertainties in the system, such as drivers’ behaviour and random arrivals of vehicles. When turning traffic is allowed at the junction, the number of phases in the traffic-light cycle increases. The additional input variables inevitably complicate the controller and hence slow down the decision-making process, which is critical in this real-time control problem. In this paper, a hierarchical fuzzy logic controller is proposed to tackle this traffic control problem at a 2-way road junction with turning traffic. The two levels of fuzzy logic controllers devise the minimum effective green-time and fine-tune it respectively at each phase of a traffic-light cycle. The complexity of the controller at each level is reduced by using a smaller rule set. The performance of this hierarchical controller is examined by comparison with a fixed-time controller under various traffic conditions. Substantial delay reduction is achieved, and the performance and limitations of the controller are discussed.
Abstract:
Traffic control at a road junction by a complex fuzzy logic controller is investigated. An increase in the complexity of the junction means that more input variables must be taken into account, which increases the number of fuzzy rules in the system. A hierarchical fuzzy logic controller is introduced to reduce the number of rules. In addition, the increased complexity of the controller makes formulation of the fuzzy rules difficult. A genetic-algorithm-based off-line learning algorithm is employed to generate the fuzzy rules. The learning algorithm uses constant flow-rates as training sets. The system is tested with both constant and time-varying flow-rates. Simulation results show that the proposed controller produces lower average delay than a fixed-time controller under various traffic conditions.
Abstract:
Many cities worldwide face the prospect of major transformation as the world moves towards a global information order. In this new era, urban economies are being radically altered by dynamic processes of economic and spatial restructuring. The result is the creation of ‘informational cities’ or its new and more popular name, ‘knowledge cities’. For the last two centuries, social production had been primarily understood and shaped by neo-classical economic thought that recognized only three factors of production: land, labor and capital. Knowledge, education, and intellectual capacity were secondary, if not incidental, factors. Human capital was assumed to be either embedded in labor or just one of numerous categories of capital. In the last decades, it has become apparent that knowledge is sufficiently important to deserve recognition as a fourth factor of production. Knowledge and information and the social and technological settings for their production and communication are now seen as keys to development and economic prosperity. The rise of knowledge-based opportunity has, in many cases, been accompanied by a concomitant decline in traditional industrial activity. The replacement of physical commodity production by more abstract forms of production (e.g. information, ideas, and knowledge) has, however paradoxically, reinforced the importance of central places and led to the formation of knowledge cities. Knowledge is produced, marketed and exchanged mainly in cities. Therefore, knowledge cities aim to assist decision-makers in making their cities compatible with the knowledge economy and thus able to compete with other cities. Knowledge cities enable their citizens to foster knowledge creation, knowledge exchange and innovation. They also encourage the continuous creation, sharing, evaluation, renewal and update of knowledge. To compete nationally and internationally, cities need knowledge infrastructures (e.g. 
universities, research and development institutes); a concentration of well-educated people; technological, mainly electronic, infrastructure; and connections to the global economy (e.g. international companies and finance institutions for trade and investment). Moreover, they must possess the people and things necessary for the production of knowledge and, as importantly, function as breeding grounds for talent and innovation. The economy of a knowledge city creates high value-added products using research, technology, and brainpower. The private and public sectors value knowledge, spend money on its discovery and dissemination and, ultimately, harness it to create goods and services. Although many cities call themselves knowledge cities, currently, only a few cities around the world (e.g., Barcelona, Delft, Dublin, Montreal, Munich, and Stockholm) have earned that label. Many other cities aspire to the status of knowledge city through urban development programs that target knowledge-based urban development. Examples include Copenhagen, Dubai, Manchester, Melbourne, Monterrey, Singapore, and Shanghai.
Knowledge-Based Urban Development
To date, the development of most knowledge cities has proceeded organically as a dependent and derivative effect of global market forces. Urban and regional planning has responded slowly, and sometimes not at all, to the challenges and the opportunities of the knowledge city. That is changing, however. Knowledge-based urban development potentially brings both economic prosperity and a sustainable socio-spatial order. Its goal is to produce and circulate abstract work. The globalization of the world in the last decades of the twentieth century was a dialectical process. On one hand, as the tyranny of distance was eroded, economic networks of production and consumption were constituted at a global scale. At the same time, spatial proximity remained as important as ever, if not more so, for knowledge-based urban development.
Mediated by information and communication technology, personal contact, and the medium of tacit knowledge, organizational and institutional interactions are still closely associated with spatial proximity. The clustering of knowledge production is essential for fostering innovation and wealth creation. The social benefits of knowledge-based urban development extend beyond aggregate economic growth. On the one hand, there is the possibility of a particularly resilient form of urban development secured in a network of connections anchored at local, national, and global coordinates. On the other hand, quality of place and life, defined by the level of public service (e.g. health and education) and by the conservation and development of the cultural, aesthetic and ecological values that give cities their character and attract or repel the creative class of knowledge workers, is a prerequisite for successful knowledge-based urban development. The goal is a secure economy in a human setting: in short, smart growth or sustainable urban development.
Abstract:
This paper presents an overview of the experiments conducted using the Hybrid Clustering of XML documents using Constraints (HCXC) method for the clustering task in the INEX 2009 XML Mining track. This technique utilises frequent subtrees generated from the structure to extract the content for clustering the XML documents. The paper also presents an experimental study using several data representations, namely structure-only, content-only, and combined structure and content, for clustering XML documents. Unlike previous years, this year the XML documents were marked up using Wiki tags and contain categories derived by using the YAGO ontology. This paper also presents the results of studying the effect of these tags on XML clustering using the HCXC method.
Abstract:
Plant biosecurity requires statistical tools to interpret field surveillance data in order to manage pest incursions that threaten crop production and trade. Ultimately, management decisions need to be based on the probability that an area is infested or free of a pest. Current informal approaches to delimiting pest extent rely upon expert ecological interpretation of presence / absence data over space and time. Hierarchical Bayesian models provide a cohesive statistical framework that can formally integrate the available information on both pest ecology and data. The overarching method involves constructing an observation model for the surveillance data, conditional on the hidden extent of the pest and uncertain detection sensitivity. The extent of the pest is then modelled as a dynamic invasion process that includes uncertainty in ecological parameters. Modelling approaches to assimilate this information are explored through case studies on spiralling whitefly, Aleurodicus dispersus and red banded mango caterpillar, Deanolis sublimbalis. Markov chain Monte Carlo simulation is used to estimate the probable extent of pests, given the observation and process model conditioned by surveillance data. Statistical methods, based on time-to-event models, are developed to apply hierarchical Bayesian models to early detection programs and to demonstrate area freedom from pests. The value of early detection surveillance programs is demonstrated through an application to interpret surveillance data for exotic plant pests with uncertain spread rates. The model suggests that typical early detection programs provide a moderate reduction in the probability of an area being infested but a dramatic reduction in the expected area of incursions at a given time. Estimates of spiralling whitefly extent are examined at local, district and state-wide scales. 
The local model estimates the rate of natural spread and the influence of host architecture, host suitability and inspector efficiency. These parameter estimates can support the development of robust surveillance programs. Hierarchical Bayesian models for the human-mediated spread of spiralling whitefly are developed for the colonisation of discrete cells connected by a modified gravity model. By estimating dispersal parameters, the model can be used to predict the extent of the pest over time. An extended model predicts the climate restricted distribution of the pest in Queensland. These novel human-mediated movement models are well suited to demonstrating area freedom at coarse spatio-temporal scales. At finer scales, and in the presence of ecological complexity, exploratory models are developed to investigate the capacity for surveillance information to estimate the extent of red banded mango caterpillar. It is apparent that excessive uncertainty about observation and ecological parameters can impose limits on inference at the scales required for effective management of response programs. The thesis contributes novel statistical approaches to estimating the extent of pests and develops applications to assist decision-making across a range of plant biosecurity surveillance activities. Hierarchical Bayesian modelling is demonstrated as both a useful analytical tool for estimating pest extent and a natural investigative paradigm for developing and focussing biosecurity programs.
Abstract:
New-generation biomaterials for bone regeneration should be highly bioactive, resorbable and mechanically strong. Mesoporous bioactive glass (MBG), as a novel bioactive material, has been used for the study of bone regeneration due to its excellent bioactivity, degradation and drug-delivery ability; however, constructing a 3D MBG scaffold (or other bioactive inorganic scaffolds) for bone regeneration remains a significant challenge because of their inherent brittleness and low strength. In this brief communication, we report a new facile method to prepare hierarchical and multifunctional MBG scaffolds with controllable pore architecture, excellent mechanical strength and mineralization ability for bone regeneration applications by a modified 3D-printing technique using polyvinyl alcohol (PVA) as a binder. The method provides a new way to solve common problems of inorganic scaffold materials, for example, uncontrollable pore architecture, low strength, high brittleness and the requirement for a second sintering at high temperature. The obtained 3D-printed MBG scaffolds possess a high mechanical strength, about 200 times that of MBG scaffolds prepared with the traditional polyurethane foam template method. They have highly controllable pore architecture, excellent apatite-mineralization ability and sustained drug-delivery properties. Our study indicates that the 3D-printed MBG scaffolds may be an excellent candidate for bone regeneration.
Abstract:
Background: Waist circumference has been identified as a valuable predictor of cardiovascular risk in children. The development of waist circumference percentiles and cut-offs for various ethnic groups is necessary because of differences in body composition. The purpose of this study was to develop waist circumference percentiles for Chinese children and to explore optimal waist circumference cut-off values for predicting the clustering of cardiovascular risk factors in this population.
Methods: Height, weight, and waist circumference were measured in 5529 children (2830 boys and 2699 girls) aged 6-12 years randomly selected from southern and northern China. Blood pressure, fasting triglycerides, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and glucose were obtained in a subsample (n = 1845). Smoothed percentile curves were produced using the LMS method. Receiver-operating characteristic analysis was used to derive the optimal age- and gender-specific waist circumference thresholds for predicting the clustering of cardiovascular risk factors.
Results: Gender-specific waist circumference percentiles were constructed. The waist circumference thresholds were at the 90th and 84th percentiles for Chinese boys and girls respectively, with sensitivity and specificity ranging from 67% to 83%. The odds ratio of a clustering of cardiovascular risk factors among boys and girls with values above the cut-off points was 10.349 (95% confidence interval 4.466 to 23.979) and 8.084 (95% confidence interval 3.147 to 20.767), respectively, compared with their counterparts.
Conclusions: Percentile curves for waist circumference of Chinese children are provided. The cut-off point for waist circumference to predict the clustering of cardiovascular risk factors is at the 90th and 84th percentiles for Chinese boys and girls, respectively.
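The receiver-operating characteristic step can be illustrated with a small sketch. The abstract does not state which criterion was used to pick the "optimal" threshold; the code below assumes Youden's J (sensitivity + specificity - 1), a common choice. The function name `youden_cutoff` and the measurements are made up for illustration and are not study data.

```python
def youden_cutoff(values, has_risk):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    A child is flagged as at risk when value >= threshold.
    Ties in J are resolved in favour of the lowest threshold.
    """
    positives = sum(1 for r in has_risk if r)
    negatives = len(has_risk) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    best = None
    # Every observed value is a candidate threshold.
    for threshold in sorted(set(values)):
        true_pos = sum(1 for v, r in zip(values, has_risk) if r and v >= threshold)
        true_neg = sum(1 for v, r in zip(values, has_risk) if not r and v < threshold)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical waist circumferences (cm) and risk-factor clustering flags.
waist_cm = [52.0, 55.0, 58.0, 60.0, 63.0, 66.0, 70.0, 74.0]
risk = [False, False, False, True, False, True, True, True]
threshold, sens, spec = youden_cutoff(waist_cm, risk)
```

In a study like the one described, the chosen threshold would then be located on the smoothed percentile curves to express it as an age- and gender-specific percentile.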
Abstract:
We present a hierarchical model for assessing an object-oriented program's security. Security is quantified using structural properties of the program code to identify the ways in which 'classified' data values may be transferred between objects. The model begins with a set of low-level security metrics based on traditional design characteristics of object-oriented classes, such as data encapsulation, cohesion and coupling. These metrics are then used to characterise higher-level properties concerning the overall readability and writability of classified data throughout the program. In turn, these metrics are then mapped to well-known security design principles such as 'assigning the least privilege' and 'reducing the size of the attack surface'. Finally, the entire program's security is summarised as a single security index value. These metrics allow different versions of the same program, or different programs intended to perform the same task, to be compared for their relative security at a number of different abstraction levels. The model is validated via an experiment involving five open source Java programs, using a static analysis tool we have developed to automatically extract the security metrics from compiled Java bytecode.
Abstract:
The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.
Abstract:
Hyperthermia and local drug delivery have been proposed as potential therapeutic approaches for bone defects resulting from malignant bone tumors. Development of bioactive materials with magnetic and drug-delivery properties may potentially meet this target. The aim of this study is to develop a multifunctional mesoporous bioactive glass (MBG) scaffold system with potential for both hyperthermia and local drug delivery applications. To this end, iron (Fe)-containing MBG (Fe-MBG) scaffolds with hierarchically large pores (300-500 µm) and fingerprint-like mesopores (4.5 nm) have been successfully prepared. The effect of Fe on the mesopore structure and on the physicochemical, magnetic, drug-delivery and biological properties of MBG scaffolds has been systematically investigated. The results showed that the morphology of the mesopores changed from straight channels to curved fingerprint-like channels after Fe was incorporated into the MBG scaffolds. The magnitude of the magnetism of MBG scaffolds can be tailored by controlling the Fe content. Furthermore, incorporating Fe into MBG scaffolds enhanced the mitochondrial activity and bone-related gene (ALP and OCN) expression of human bone marrow mesenchymal stem cells (BMSCs) on the scaffolds. The obtained Fe-MBG scaffolds also possessed high specific surface areas and sustained drug delivery. Therefore, Fe-MBG scaffolds are magnetic, degradable and bioactive. The multifunctionality of Fe-MBG scaffolds indicates that they have great potential for the therapy and regeneration of large bone defects caused by malignant bone tumors through the combination of hyperthermia, local drug delivery and osteoconductivity.