977 results for "incorporate probabilistic techniques"
Abstract:
The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure of the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts, namely significant features, the level of significance of features, and the immediate neighborhood, are introduced; these implicitly meet the need for feature selection in the context of clustering techniques.
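As a concrete illustration of scoring features for clustering without labels, the sketch below ranks features by a simple gap-based dispersion heuristic; the heuristic and the synthetic data are assumptions for illustration, not the authors' definition of significance.

```python
# A minimal sketch of one heuristic for ranking feature "significance" before
# clustering: features whose values concentrate into well-separated modes are
# more useful for grouping than features spread uniformly. This gap-based
# score is an illustrative assumption, not the authors' definition.
import numpy as np

def feature_significance(X: np.ndarray) -> np.ndarray:
    """Score each column of X (n_samples x n_features); higher = more significant."""
    scores = []
    for j in range(X.shape[1]):
        col = np.sort(X[:, j])
        gaps = np.diff(col)
        # A feature with a few large gaps (cluster separation) relative to its
        # typical gap is scored high; a uniformly spread feature is scored low.
        scores.append(gaps.max() / (gaps.mean() + 1e-12))
    return np.array(scores)

X = np.vstack([np.random.normal(0, 1, (50, 3)), np.random.normal(5, 1, (50, 3))])
X[:, 2] = np.random.uniform(-5, 10, 100)        # an uninformative feature
print(feature_significance(X))                  # third score should be lowest
```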
Abstract:
Flow-graph techniques are applied in this article to the analysis of an epicyclic gear train. A gear system based on this analysis is designed and constructed for use in numerical control systems.
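Any flow-graph model of an epicyclic train ultimately encodes its fundamental kinematic constraint, the Willis equation. The sketch below solves it for the carrier speed; the tooth counts and speeds are hypothetical.

```python
# A minimal sketch of the kinematic relation a flow-graph model of an epicyclic
# train encodes: the Willis equation. Tooth counts below are hypothetical.
def planet_carrier_speed(omega_sun: float, omega_ring: float,
                         n_sun: int, n_ring: int) -> float:
    """Willis equation: (w_sun - w_c) / (w_ring - w_c) = -N_ring / N_sun."""
    ratio = -n_ring / n_sun
    # Solve (w_sun - w_c) = ratio * (w_ring - w_c) for the carrier speed w_c.
    return (omega_sun - ratio * omega_ring) / (1 - ratio)

# Example: ring gear held fixed (w_ring = 0), sun driven at 1000 rpm.
print(planet_carrier_speed(1000.0, 0.0, n_sun=24, n_ring=72))  # 250 rpm
```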
Abstract:
Sampling design is critical to the quality of quantitative research, yet it does not always receive appropriate attention in nursing research. The current article details how balancing probability techniques with practical considerations produced a representative sample of Australian nursing homes (NHs). Budgetary, logistical, and statistical constraints were managed by excluding some NHs (e.g., those too difficult to access) from the sampling frame; a stratified, random sampling methodology yielded a final sample of 53 NHs from a population of 2,774. In testing the adequacy of representation of the study population, chi-square tests for goodness of fit generated nonsignificant results for distribution by distance from major city and type of organization. A significant result for state/territory was expected and was easily corrected for by the application of weights. The current article provides recommendations for conducting high-quality, probability-based samples and stresses the importance of testing the representativeness of achieved samples.
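The representativeness check described above can be reproduced in outline: compare the achieved sample's distribution over one stratum against the population distribution with a chi-square goodness-of-fit test. The counts below are hypothetical, not the study's data.

```python
# A minimal sketch of the representativeness check: chi-square goodness of fit
# between the achieved sample and the population distribution on one stratum
# (e.g., organisation type). Proportions and counts are hypothetical.
import numpy as np
from scipy.stats import chisquare

population_props = np.array([0.55, 0.35, 0.10])   # e.g., not-for-profit / private / government
sample_counts = np.array([30, 17, 6])             # achieved sample of 53 homes

expected = population_props * sample_counts.sum()
stat, p = chisquare(f_obs=sample_counts, f_exp=expected)
print(f"chi2={stat:.2f}, p={p:.3f}")  # nonsignificant p -> representative on this stratum
```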
Abstract:
Clustering identities in a video is a useful task to aid in video search, annotation and retrieval, and cast identification. However, reliably clustering faces across multiple videos is a challenging task due to variations in the appearance of the faces, as videos are captured in an uncontrolled environment. A person's appearance may vary due to session variations, including lighting and background changes, occlusions, and changes in expression and make-up. In this paper we propose the novel Local Total Variability Modelling (Local TVM) approach to cluster faces across a news video corpus, and incorporate it into a novel two-stage video clustering system. We first cluster faces within a single video using colour, spatial and temporal cues; we then use face-track modelling and hierarchical agglomerative clustering to cluster faces across the entire corpus. We compare different face recognition approaches within this framework. Experiments on a news video database show that the Local TVM technique is able to effectively model the session variation observed in the data, resulting in improved clustering performance with much greater computational efficiency than other methods.
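A minimal sketch of the corpus-level stage, hierarchical agglomerative clustering of face-track representations, is given below; the 128-dimensional embeddings stand in for the paper's face-track models and are an assumption.

```python
# A minimal sketch of the second clustering stage: hierarchical agglomerative
# clustering of face-track representations across videos. The embeddings are
# a stand-in assumption for the paper's track models.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Hypothetical face-track embeddings: 3 identities with 5 tracks each.
centers = rng.normal(size=(3, 128))
tracks = np.vstack([c + rng.normal(scale=0.1, size=(5, 128)) for c in centers])

dists = pdist(tracks, metric="cosine")             # pairwise track distances
Z = linkage(dists, method="average")               # agglomerative merge tree
labels = fcluster(Z, t=0.5, criterion="distance")  # cut the tree at a threshold
print(labels)                                      # tracks of one identity share a label
```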
Abstract:
Advancements in analysis techniques have led to a rapid accumulation of biological data in databases. Such data often take the form of sequences of observations, examples including DNA sequences and the amino acid sequences of proteins. The scale and quality of the data promise answers to various biologically relevant questions in more detail than has been possible before. For example, one may wish to identify areas in an amino acid sequence that are important for the function of the corresponding protein, or investigate how characteristics at the level of the DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with understanding the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet the challenge of deriving meaning from the increasing amounts of data. Our main concern is modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which describes the structure of the data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets also pose a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the modeling task at hand but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. Problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters. The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
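The analytical integration that underpins this computational efficiency can be illustrated for discrete data: with a symmetric Dirichlet prior, the marginal likelihood of a cluster depends only on its counts. The alphabet, prior, and counts below are illustrative assumptions, not the thesis's exact model.

```python
# A minimal sketch of the key computational property: with a Dirichlet prior on
# discrete-data parameters, the marginal likelihood of a cluster integrates out
# analytically, so a stochastic search over partitions only needs per-cluster
# counts. Alphabet and prior here are assumptions.
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """log of the multinomial likelihood integrated under symmetric Dirichlet(alpha)."""
    n, k = sum(counts), len(counts)
    out = lgamma(k * alpha) - lgamma(n + k * alpha)
    for c in counts:
        out += lgamma(c + alpha) - lgamma(alpha)
    return out

# Score a candidate partition by summing per-cluster marginals, e.g. for
# nucleotide counts (A, C, G, T) in two clusters of aligned sequences:
cluster_counts = [[40, 5, 3, 2], [4, 38, 2, 6]]
print(sum(log_marginal(c) for c in cluster_counts))
```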
Abstract:
Topic detection and tracking (TDT) is an area of information retrieval research focused on news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new and previously unreported, tracking the development of a previously reported event, and grouping together news stories that discuss the same event. The performance of traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems; it has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of the class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural-language temporal expressions occurring in the text and use them to anchor the rest of the terms onto the time-line. When comparing documents for event-based similarity, we look not only at matching terms but also at how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system. The news reflects changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experimental results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4%. Anchoring the text to a time-line based on the temporal expressions gave a further 10% increase in the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
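A minimal sketch of the class-wise similarity framework follows: documents are compared separately within each semantic class and the results combined with per-class weights. The cosine measure and the weights are illustrative assumptions.

```python
# A minimal sketch of class-wise document similarity: documents are compared
# separately within each semantic class (persons, organisations, locations,
# general terms) and the results combined with per-class weights. The cosine
# measure and the example weights are illustrative assumptions.
def cosine(a: dict, b: dict) -> float:
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

def class_wise_similarity(doc1, doc2, weights):
    """doc1, doc2: dicts mapping class name -> term-frequency dict."""
    return sum(w * cosine(doc1.get(c, {}), doc2.get(c, {}))
               for c, w in weights.items())

d1 = {"locations": {"helsinki": 2}, "terms": {"election": 3, "vote": 1}}
d2 = {"locations": {"helsinki": 1}, "terms": {"election": 2, "strike": 1}}
print(class_wise_similarity(d1, d2, {"locations": 0.4, "terms": 0.6}))
```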
Abstract:
Minimum Description Length (MDL) is an information-theoretic principle that can be used for model selection and other statistical inference tasks. There are various ways to use the principle in practice. One theoretically valid way is to use the normalized maximum likelihood (NML) criterion. Due to computational difficulties, this approach has not been used very often. This thesis presents efficient floating-point algorithms that make it possible to compute the NML for multinomial, Naive Bayes and Bayesian forest models. None of the presented algorithms rely on asymptotic analysis, and for the first two model classes we also discuss how to compute exact rational-number solutions.
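To make the NML criterion concrete, the sketch below computes the multinomial parametric complexity by direct summation over all possible count vectors; this brute-force recursion only illustrates what is being computed and is far slower than the efficient algorithms the thesis presents.

```python
# A minimal sketch of the NML criterion for a multinomial model: the normalizer
# (parametric complexity) is the sum of maximized likelihoods over all possible
# data sets of size n. Direct recursion, for illustration only.
from math import comb, log

def multinomial_complexity(K: int, n: int) -> float:
    """Sum over count vectors (n_1..n_K) with sum n of
    multinomial(n; n_1..n_K) * prod_k (n_k / n)^n_k."""
    def rec(k_left, n_left):
        if k_left == 1:
            return (n_left / n) ** n_left if n_left else 1.0
        return sum(comb(n_left, m) * ((m / n) ** m if m else 1.0)
                   * rec(k_left - 1, n_left - m)
                   for m in range(n_left + 1))
    return rec(K, n)

counts = [6, 3, 1]                                       # observed counts, K = 3
n = sum(counts)
log_ml = sum(c * log(c / n) for c in counts if c)        # maximized log-likelihood
nml_score = log_ml - log(multinomial_complexity(3, n))   # NML log-score
print(nml_score)
```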
Abstract:
This thesis studies the human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow numerical measurements of the expression of tens of thousands of genes simultaneously. In a single study, such data are traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, such data have been largely unavailable, and the global structure of the human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and to the establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required the creation of a new large ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text-mining and decision-tree-based method for the automatic conversion of human-readable free-text microarray data annotations into a categorised format. Data comparability, and the minimisation of the systematic measurement errors characteristic of each laboratory in this large cross-laboratory integrated dataset, were ensured by computing a range of microarray data quality metrics and excluding incomparable data. The structure of the global map of human gene expression was then explored by principal component analysis and hierarchical clustering, using heuristics and help from another purpose-built sample ontology. A preface to, and motivation for, the construction and analysis of a global map of human gene expression is given by the analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporates an indirect comparison of statistical methods for finding differentially expressed genes and points to the need to study gene expression on a global level.
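The exploratory step, principal component analysis followed by hierarchical clustering of samples, can be sketched as follows; the random matrix stands in for a samples-by-genes expression matrix, and the generic sklearn/scipy pipeline is an assumption, not the thesis's actual workflow.

```python
# A minimal sketch of PCA followed by hierarchical clustering of samples. The
# simulated matrix is a stand-in for a (samples x genes) expression matrix.
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Hypothetical expression matrix: 60 samples from 3 tissue types, 500 genes.
X = np.vstack([rng.normal(loc=mu, size=(20, 500)) for mu in (0.0, 1.5, 3.0)])

pcs = PCA(n_components=10).fit_transform(X)      # denoise / reduce to 10 PCs
Z = linkage(pcs, method="ward")                  # hierarchical clustering of samples
labels = fcluster(Z, t=3, criterion="maxclust")  # ask for 3 sample groups
print(labels)
```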
Abstract:
Special switching sequences can be employed in space-vector-based generation of pulsewidth-modulated (PWM) waveforms for voltage-source inverters. These sequences involve switching a phase twice, switching the second phase once, and clamping the third phase in a subcycle. Advanced bus-clamping PWM (ABCPWM) techniques have been proposed recently that employ such switching sequences. This letter studies the spectral properties of the waveforms produced by these PWM techniques. Further, analytical closed-form expressions are derived for the total rms harmonic distortion due to these techniques. It is shown that the ABCPWM techniques lead to lower distortion than conventional space vector PWM and discontinuous PWM at higher modulation indexes. The findings are validated on a 2.2-kW constant $V/f$ induction motor drive and also on a 100-kW motor drive.
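The total rms harmonic distortion studied analytically in the letter can also be evaluated numerically from a waveform's spectrum. The sketch below does this for a naive sine-triangle PWM waveform, which is a stand-in assumption, not the ABCPWM sequences analysed above.

```python
# A minimal sketch of evaluating total rms harmonic distortion from the FFT of
# a PWM waveform. The two-level sine-triangle waveform is an illustrative
# stand-in, not the ABCPWM switching sequences analysed in the letter.
import numpy as np

fs, f1, fc, m = 1_000_000, 50.0, 2000.0, 0.8      # sample rate, fundamental, carrier, index
t = np.arange(0, 1 / f1, 1 / fs)                  # one fundamental cycle
carrier = (2 * np.abs((t * fc) % 1 - 0.5)) * 2 - 1  # triangle carrier in [-1, 1]
pwm = np.where(m * np.sin(2 * np.pi * f1 * t) > carrier, 1.0, -1.0)

spectrum = np.abs(np.fft.rfft(pwm)) / len(pwm) * 2  # single-sided amplitudes
fund = spectrum[1]                                  # bin 1 = fundamental (1-cycle window)
harmonics = spectrum[2:]
thd = np.sqrt(np.sum(harmonics ** 2)) / fund        # total rms harmonic distortion
print(f"fundamental amplitude = {fund:.3f}, THD = {thd:.2%}")
```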
Abstract:
By applying the theory of the asymptotic distribution of extremes and a certain stability criterion to the question of the domain of convergence, in the probability sense, of the renormalized perturbation expansion (RPE) for the site self-energy in a cellularly disordered system, a closed-form expression has been obtained for the probability of nonconvergence of the RPE on the real-energy axis. Hence, the intrinsic mobility $\mu(E)$ as a function of the carrier energy $E$ is deduced to be $\mu(E) = \mu_0 \exp\!\left(-\exp\!\left(\frac{|E| - E_c}{\Delta}\right)\right)$, where $E_c$ is a nominal 'mobility edge' and $\Delta$ is the width of the random site-energy distribution. Thus the mobility falls off sharply but continuously for $|E| > E_c$, in contradistinction to the notion of an abrupt 'mobility edge' proposed by Cohen et al. and Mott. Also, the calculated electrical conductivity shows a temperature dependence in qualitative agreement with experiments on disordered semiconductors.
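Evaluating the mobility expression numerically makes the "sharp but continuous" falloff visible; the parameter values below are arbitrary illustrations.

```python
# A minimal sketch evaluating the mobility expression above to show the sharp
# but continuous falloff beyond the nominal mobility edge E_c; the parameter
# values are arbitrary illustrations.
import numpy as np

def mobility(E, mu0=1.0, Ec=1.0, Delta=0.05):
    return mu0 * np.exp(-np.exp((np.abs(E) - Ec) / Delta))

for E in (0.8, 0.95, 1.0, 1.05, 1.2):
    print(f"E = {E:4.2f}  mu/mu0 = {mobility(E):.3e}")
# mu stays near mu0 below E_c, then collapses over a width ~Delta: continuous,
# but practically indistinguishable from an abrupt edge when Delta is small.
```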
Abstract:
This paper investigates the long- and short-run relationships between energy consumption and economic growth in Australia using the bounds testing ARDL approach. The analytical framework utilized in this paper includes both production-side and demand-side models and a unified model comprising both production-side and demand-side variables. The energy-GDP relationships are investigated at the aggregate level as well as for several disaggregated energy categories, such as coal, oil, gas and electricity. The possibility of one or more structural breaks in the data series is examined by applying recently developed testing techniques. We find that the results of the cointegration tests can be affected by structural breaks in the data; it is therefore crucial to incorporate information on structural breaks in the subsequent modelling and inference. Moreover, neither the production-side nor the demand-side framework alone can provide sufficient information to draw an ultimate conclusion on the cointegration and causal direction between energy and output. When alternative frameworks and structural breaks in the time series are explored properly, strong evidence of a bidirectional relationship between energy and output emerges. This finding holds at both the aggregate and the disaggregate levels of energy consumption.
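An ARDL bounds test of the kind used above can be sketched with statsmodels (assuming version 0.13 or later, which provides UECM and its bounds_test method); the simulated series are stand-ins, not the paper's Australian data.

```python
# A minimal sketch of an ARDL bounds test for cointegration between energy use
# and GDP, assuming statsmodels >= 0.13 (UECM and bounds_test). The simulated
# series below are stand-ins, not the paper's data.
import numpy as np
import pandas as pd
from statsmodels.tsa.ardl import UECM

rng = np.random.default_rng(0)
n = 200
energy = np.cumsum(rng.normal(size=n))               # I(1) driver series
gdp = 0.8 * energy + rng.normal(scale=0.5, size=n)   # cointegrated with energy
df = pd.DataFrame({"gdp": gdp, "energy": energy})

# Unrestricted error-correction form of an ARDL(2, 2) model of gdp on energy.
uecm = UECM(df["gdp"], lags=2, exog=df[["energy"]], order=2)
res = uecm.fit()
# Pesaran-Shin-Smith bounds test; case 3 = unrestricted constant, no trend.
print(res.bounds_test(case=3))
```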