999 results for Random trees


Relevance:

20.00% 20.00%

Publisher:

Abstract:

Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
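As an illustration of the kind of update the abstract describes, the following is a minimal sketch of an exponentiated gradient step for minimising a convex function under simplex constraints. It is not the paper's dual objective: the quadratic toy objective, the step size `eta`, and the names `eg_step`, `minimise_on_simplex`, and `toy_gradient` are all assumptions made for this example.

```python
import math


def eg_step(u, grad, eta):
    """One exponentiated gradient step on the probability simplex."""
    # Multiplicative update u_i <- u_i * exp(-eta * grad_i), then renormalise.
    # Subtracting min(grad) cancels after normalisation and avoids overflow.
    shift = min(grad)
    w = [ui * math.exp(-eta * (g - shift)) for ui, g in zip(u, grad)]
    total = sum(w)
    return [wi / total for wi in w]


def minimise_on_simplex(grad_fn, dim, eta=1.0, steps=500):
    """Batch EG loop starting from the uniform distribution (assumed start)."""
    u = [1.0 / dim] * dim
    for _ in range(steps):
        u = eg_step(u, grad_fn(u), eta)
    return u


def toy_gradient(u, target=(0.7, 0.2, 0.1)):
    """Gradient of the toy objective 0.5 * ||u - target||^2 (hypothetical)."""
    return [ui - ti for ui, ti in zip(u, target)]
```

Calling `minimise_on_simplex(toy_gradient, 3)` keeps every iterate non-negative and summing to one without an explicit projection, which is the practical appeal of the multiplicative form.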

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Analytical expressions are derived for the mean and variance of estimates of the bispectrum of a real-valued time series, assuming a cosinusoidal model. The effects of spectral leakage, inherent in the discrete Fourier transform operation when the modes present in the signal have a nonintegral number of wavelengths in the record, are included in the analysis. A single phase-coupled triad of modes can cause the bispectrum to have a nonzero mean value over the entire region of computation owing to leakage. The variance of bispectral estimates in the presence of leakage has contributions from individual modes and from triads of phase-coupled modes. Time-domain windowing reduces the leakage. The theoretical expressions for the mean and variance of bispectral estimates are derived in terms of a function dependent on an arbitrary symmetric time-domain window applied to the record, the number of data points, and the statistics of the phase coupling among triads of modes. The theoretical results are verified by numerical simulations for simple test cases and applied to laboratory data to examine phase coupling in a hypothesis testing framework.
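To make the estimator concrete, here is a small sketch of a direct bispectrum estimate averaged over records, B(k, l) = mean of X(k) X(l) X*(k+l), with an optional time-domain window. The record length, mode numbers, segment count, and the helper names `dft_bin`, `bispectrum_estimate`, and `make_segments` are assumptions chosen for illustration; the test signal uses integer mode numbers, so it shows phase coupling without leakage.

```python
import cmath
import math
import random


def dft_bin(x, k, window=None):
    """Single DFT coefficient X(k) of record x, optionally windowed."""
    n = len(x)
    w = window or [1.0] * n
    return sum(w[t] * x[t] * cmath.exp(-2j * math.pi * k * t / n)
               for t in range(n))


def bispectrum_estimate(segments, k, l, window=None):
    """Average of X(k) * X(l) * conj(X(k + l)) over all records."""
    acc = 0j
    for x in segments:
        acc += (dft_bin(x, k, window) * dft_bin(x, l, window)
                * dft_bin(x, k + l, window).conjugate())
    return acc / len(segments)


def make_segments(coupled, n=64, count=50, k=5, l=9, seed=0):
    """Three cosine modes per record; the third phase is p1 + p2 if coupled."""
    rng = random.Random(seed)
    segments = []
    for _ in range(count):
        p1 = rng.uniform(0.0, 2.0 * math.pi)
        p2 = rng.uniform(0.0, 2.0 * math.pi)
        p3 = p1 + p2 if coupled else rng.uniform(0.0, 2.0 * math.pi)
        segments.append([
            math.cos(2 * math.pi * k * t / n + p1)
            + math.cos(2 * math.pi * l * t / n + p2)
            + math.cos(2 * math.pi * (k + l) * t / n + p3)
            for t in range(n)
        ])
    return segments
```

With phase-coupled records the products add coherently, whereas with independent phases they tend to cancel in the average; passing a symmetric window (for example a Hann window) through the `window` argument is where leakage reduction would enter.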

Relevance:

20.00% 20.00%

Publisher:

Abstract:

The CDKN2 gene, encoding the cyclin-dependent kinase inhibitor p16, is a tumour suppressor gene that maps to chromosome band 9p21-p22. The most common mechanism of inactivation of this gene in human cancers is through homozygous deletion; however, in a smaller proportion of tumours and tumour cell lines intragenic mutations occur. In this study we have compiled a database of over 120 published point mutations in the CDKN2 gene from a wide variety of tumour types. A further 50 deletions, insertions, and splice mutations in CDKN2 have also been compiled. Furthermore, we have standardised the numbering of all mutations according to the full-length 156 amino acid form of p16. From this study we are able to define several hotspots, some of which occur at conserved residues within the ankyrin domains of p16. While many of the hotspots are shared by a number of cancers, the relative importance of each position varies, possibly reflecting the role of different carcinogens in the development of certain tumours. As reported previously, the mutational spectrum of CDKN2 in melanomas differs from that of internal malignancies and supports the involvement of UV in melanoma tumorigenesis. Notably, 52% of all substitutions in melanoma-derived samples occurred at just six nucleotide positions. Nonsense mutations comprise a comparatively high proportion of mutations present in the CDKN2 gene, and possible explanations for this are discussed.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

With the growing number of XML documents on the Web, it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses on only one feature of the XML documents, either their structure or their content, owing to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content alone is not suitable for real-life datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models and to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

A novel m-ary tree-based approach is presented for solving asset management decision problems that are combinatorial in nature. The approach introduces a new dynamic constraint-based control mechanism that is capable of excluding infeasible solutions from the solution space. The approach also addresses the challenges of ordering asset decisions.
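The abstract gives no algorithmic detail, so the following is only a hypothetical sketch of the general idea: each level of an m-ary tree corresponds to one asset, each branch to one of its options, and a constraint check prunes infeasible subtrees as the search descends. The budget constraint, the cost table, and the name `enumerate_feasible` are assumptions for this example and are not taken from the paper.

```python
def enumerate_feasible(costs, budget):
    """Depth-first walk of an m-ary decision tree with constraint pruning.

    costs[i][j] is the (non-negative, hypothetical) cost of choosing
    option j for asset i; a branch is cut as soon as it exceeds the budget.
    """
    feasible = []

    def descend(level, chosen, spent):
        if level == len(costs):
            feasible.append(tuple(chosen))
            return
        for option, cost in enumerate(costs[level]):
            if spent + cost > budget:
                continue  # infeasible: skip this branch and its whole subtree
            chosen.append(option)
            descend(level + 1, chosen, spent + cost)
            chosen.pop()

    descend(0, [], 0)
    return feasible
```

Because costs are assumed non-negative, a partial plan that already exceeds the budget cannot become feasible deeper in the tree, which is what makes the early cut safe.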

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Fusion techniques have received considerable attention as a means of improving biometric performance. While a multi-sample fusion architecture reduces false rejects, it also increases false accepts. This impact on performance also depends on the nature of subsequent attempts, i.e., random or adaptive. Expressions for error rates are presented and experimentally evaluated in this work by considering the multi-sample fusion architecture for text-dependent speaker verification using HMM-based digit-dependent speaker models. Analysis incorporating correlation modeling demonstrates that the use of adaptive samples improves overall fusion performance compared to randomly repeated samples. For a text-dependent speaker verification system using digit strings, sequential decision fusion of seven instances with three random samples is shown to reduce the overall error of the verification system by 26%, which can be further reduced by 6% for adaptive samples. This analysis, novel in its treatment of random and adaptive multiple presentations within a sequential fused decision architecture, is also applicable to other biometric modalities such as fingerprints and handwriting samples.
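The trade-off between false rejects and false accepts can be illustrated with a simplified calculation. The sketch below assumes that a claimant is accepted if any one of several attempts is accepted and that attempts are statistically independent; the paper's own expressions incorporate correlation between attempts, which this example deliberately omits. The function name `fused_error_rates` is hypothetical.

```python
def fused_error_rates(frr, far, attempts):
    """Error rates when a user is accepted if ANY attempt is accepted.

    Assumes independent attempts (a simplification, not the paper's model).
    """
    # A genuine user is rejected only if every attempt is rejected.
    fused_frr = frr ** attempts
    # An impostor is accepted if at least one attempt slips through.
    fused_far = 1.0 - (1.0 - far) ** attempts
    return fused_frr, fused_far
```

For example, with a single-attempt false reject rate of 0.1 and false accept rate of 0.01, three independent attempts lower the false reject rate to 0.001 while raising the false accept rate to about 0.0297.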

Relevance:

20.00% 20.00%

Publisher:

Abstract:

The Poisson distribution has often been used for count data such as accident data. The Negative Binomial (NB) distribution has been adopted for count data to handle the over-dispersion problem. However, Poisson and NB distributions cannot account for some unobserved heterogeneities due to the spatial and temporal effects of accident data. To overcome this problem, Random Effect models have been developed. Another challenge with existing traffic accident prediction models is the distribution of excess zero accident observations in some accident data. Although the Zero-Inflated Poisson (ZIP) model is capable of handling the dual-state system in accident data with excess zero observations, it does not accommodate the within-location and between-location correlation heterogeneities that are the basic motivation for Random Effect models. This paper proposes an effective way of fitting the ZIP model with location-specific random effects, and recommends Bayesian analysis for model calibration and assessment.
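For readers unfamiliar with the ZIP model, the sketch below writes out its probability mass function: with probability `pi` a location is in the zero state, and otherwise its count follows a Poisson distribution with rate `lam`. The log-link rate with an additive location-specific term in `site_rate` is a common way to introduce a random effect and is an assumption here, not necessarily the paper's specification; all names are hypothetical.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) under a Zero-Inflated Poisson with zero-state probability pi."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    # Zeros arise either from the zero state or from the Poisson state.
    zero_state = pi if k == 0 else 0.0
    return zero_state + (1.0 - pi) * poisson


def site_rate(intercept, random_effect):
    """Poisson rate for one location under an assumed log link."""
    return math.exp(intercept + random_effect)
```

The extra mass at zero, `pi + (1 - pi) * exp(-lam)`, is what lets the model absorb excess zero observations that a plain Poisson would underpredict.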

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing useful work, which prevents ineffective clusterings that provide no useful result from receiving high scores. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters to categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation. This paper describes its use in the context of document clustering evaluation.
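A minimal sketch of the idea follows, using purity as the quality measure. The random baseline is built here by shuffling the cluster assignments, which keeps the cluster sizes but destroys any relationship to the documents; this construction, the trial count, and the names `purity` and `divergence_from_random` are assumptions for illustration rather than the paper's exact procedure.

```python
import random
from collections import Counter


def purity(labels, clusters):
    """Fraction of documents that carry the majority label of their cluster."""
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


def divergence_from_random(labels, clusters, measure=purity, trials=20, seed=0):
    """Score of a clustering minus the mean score of size-matched random ones."""
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(trials):
        shuffled = list(clusters)
        rng.shuffle(shuffled)  # same cluster sizes, random membership
        baseline += measure(labels, shuffled)
    return measure(labels, clusters) - baseline / trials
```

Placing every document in its own cluster gives a purity of 1.0, yet the random baseline scores 1.0 as well, so the divergence is zero; a clustering that actually matches the categories scores above its baseline.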

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Carbon dioxide (CO2), as a primary product of combustion, is a known factor affecting climate change and global warming. In Australia, CO2 emissions from biomass burning are a significant contributor to total carbon in the atmosphere, and it is therefore important to quantify the CO2 emission factors from biomass burning in order to estimate their magnitude and impact on the Australian atmosphere. This paper presents the quantification of CO2 emission factors for five common tree species found in South East Queensland forests, as well as several grasses taken from savannah lands in the Northern Territory of Australia, under controlled ‘fast burning’ and ‘slow burning’ laboratory conditions. The results showed that CO2 emission factors varied according to the type of vegetation and burning conditions, with emission factors for fast burning being 2574 ± 254 g/kg for wood, 394 ± 40 g/kg for branches and leaves, and 2181 ± 120 g/kg for grass. Under slow burning conditions, the CO2 emission factors were 218 ± 20 g/kg for wood, 392 ± 80 g/kg for branches and leaves, and 2027 ± 809 g/kg for grass.