28 results for Learning Algorithm
in Aston University Research Archive
Abstract:
The performance of feed-forward neural networks in real applications can often be improved significantly if use is made of a priori information. For interpolation problems this prior knowledge frequently includes smoothness requirements on the network mapping, and can be imposed by adding suitable regularization terms to the error function. The new error function, however, now depends on the derivatives of the network mapping, and so the standard back-propagation algorithm cannot be applied. In this paper, we derive a computationally efficient learning algorithm, for a feed-forward network of arbitrary topology, which can be used to minimize the new error function. Networks having a single hidden layer, for which the learning algorithm simplifies, are treated as a special case.
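As a rough illustration of the idea (not the paper's exact derivation), a smoothness penalty on the derivatives of the network mapping can be added to a sum-of-squares error; the sketch below approximates those derivatives by central finite differences, and the callable `net`, the weight `lam` and the step `eps` are illustrative names.

```python
import numpy as np

def regularized_error(net, X, T, lam=1e-2, eps=1e-4):
    """Sum-of-squares error plus a first-derivative smoothness penalty.

    `net` is any callable mapping an (n, d) input array to outputs;
    the derivatives of the mapping are approximated here by central
    finite differences rather than by the paper's exact algorithm.
    """
    Y = net(X)
    data_term = 0.5 * np.sum((Y - T) ** 2)

    penalty = 0.0
    for i in range(X.shape[1]):          # one finite difference per input dim
        dX = np.zeros_like(X)
        dX[:, i] = eps
        dY = (net(X + dX) - net(X - dX)) / (2 * eps)
        penalty += np.sum(dY ** 2)

    return data_term + lam * penalty
```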
Abstract:
There has been a resurgence of interest in the neural networks field in recent years, provoked in part by the discovery of the properties of multi-layer networks. This interest has in turn raised questions about the possibility of making neural network behaviour more adaptive by automating some of the processes involved. Prior to these particular questions, the process of determining the parameters and network architecture required to solve a given problem had been a time-consuming activity. A number of researchers have attempted to address these issues by automating these processes, concentrating in particular on the dynamic selection of an appropriate network architecture. The work presented here specifically explores the area of automatic architecture selection; it focuses upon the design and implementation of a dynamic algorithm based on the Back-Propagation learning algorithm. The algorithm constructs a single hidden layer as the learning process proceeds, using individual pattern error as the basis of unit insertion. This algorithm is applied to several problems of differing type and complexity and is found to produce near-minimal architectures that are shown to have a high level of generalisation ability.
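A minimal sketch of this kind of constructive scheme follows, assuming a helper `train_epoch(n_hidden)` that trains a single-hidden-layer network of the given size for one epoch and returns the per-pattern errors; the threshold rule here is a simplification of the thesis's insertion criterion.

```python
import numpy as np

def grow_network(train_epoch, max_units=50, err_threshold=0.1):
    """Grow a single hidden layer during training, one unit at a time."""
    n_hidden = 1
    while n_hidden < max_units:
        pattern_errors = train_epoch(n_hidden)
        if np.max(pattern_errors) <= err_threshold:
            break            # every individual pattern is learned well enough
        n_hidden += 1        # insert a unit while some pattern error is high
    return n_hidden
```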
Abstract:
The Vapnik-Chervonenkis (VC) dimension is a combinatorial measure of a certain class of machine learning problems, which may be used to obtain upper and lower bounds on the number of training examples needed to learn to prescribed levels of accuracy. Most of the known bounds apply to the Probably Approximately Correct (PAC) framework, which is the framework within which we work in this paper. For a learning problem with some known VC dimension, much is known about the order of growth of the sample-size requirement of the problem, as a function of the PAC parameters. The exact sample-size requirement, however, is less well known, and depends heavily on the particular learning algorithm being used. This is a major obstacle to the practical application of the VC dimension. Hence it is important to know exactly how the sample-size requirement depends on VC dimension, and with that in mind, we describe a general algorithm for learning problems having VC dimension 1. Its sample-size requirement is minimal (as a function of the PAC parameters), and turns out to be the same for all non-trivial learning problems having VC dimension 1. While the method used cannot be naively generalised to higher VC dimension, it suggests that optimal algorithm-dependent bounds may improve substantially on current upper bounds.
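For context, the standard order-of-growth result alluded to here bounds the PAC sample complexity m in terms of the VC dimension d, the accuracy parameter ε and the confidence parameter δ (constants omitted; this is the generic upper bound, not the paper's exact result):

\[ m(\epsilon, \delta) = O\!\left( \frac{d}{\epsilon} \log \frac{1}{\epsilon} + \frac{1}{\epsilon} \log \frac{1}{\delta} \right) \]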
Abstract:
We analyse the dynamics of a number of second-order on-line learning algorithms training multi-layer neural networks, using the methods of statistical mechanics. We first consider on-line Newton's method, which is known to provide optimal asymptotic performance. We determine the asymptotic generalization error decay for a soft committee machine, which is shown to compare favourably with the result for standard gradient descent. Matrix momentum provides a practical approximation to this method by allowing an efficient inversion of the Hessian. We consider an idealized matrix momentum algorithm which requires access to the Hessian and find close correspondence with the dynamics of on-line Newton's method. In practice, the Hessian will not be known on-line and we therefore consider matrix momentum using a single example approximation to the Hessian. In this case good asymptotic performance may still be achieved, but the algorithm is now sensitive to parameter choice because of noise in the Hessian estimate. On-line Newton's method is not appropriate during the transient learning phase, since a suboptimal unstable fixed point of the gradient descent dynamics becomes stable for this algorithm. A principled alternative is to use Amari's natural gradient learning algorithm and we show how this method provides a significant reduction in learning time when compared to gradient descent, while retaining the asymptotic performance of on-line Newton's method.
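Schematically (notation ours, not the paper's), the update rules being compared are

\[
\begin{aligned}
\text{gradient descent:} \quad & w_{t+1} = w_t - \eta\, \nabla E(w_t), \\
\text{on-line Newton:} \quad & w_{t+1} = w_t - \eta\, H^{-1} \nabla E(w_t), \\
\text{natural gradient:} \quad & w_{t+1} = w_t - \eta\, G^{-1} \nabla E(w_t),
\end{aligned}
\]

where H is the Hessian of the error and G is the Fisher information matrix; matrix momentum approximates the multiplication by H^{-1} without requiring an explicit inversion.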
Abstract:
Text classification is essential for narrowing down the number of documents relevant to a particular topic for further perusal, especially when searching through large biomedical databases. Protein-protein interactions are an example of such a topic, with databases being devoted specifically to them. This paper proposes a semi-supervised learning algorithm via local learning with class priors (LL-CP) for biomedical text classification, where unlabeled data points are classified in a vector space based on their proximity to labeled nodes. The algorithm has been evaluated on a corpus of biomedical documents to identify abstracts containing information about protein-protein interactions, with promising results. Experimental results show that LL-CP outperforms traditional semi-supervised learning algorithms such as SVM, and it also performs better than local learning without incorporating class priors.
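A much-simplified caricature of the local-learning step (not the LL-CP algorithm itself; the k-nearest-neighbour rule and all names here are illustrative) might weight the votes of nearby labeled points by the class priors:

```python
import numpy as np

def classify_unlabeled(X_lab, y_lab, X_unlab, priors, k=5):
    """Label each unlabeled vector from its k nearest labeled neighbours,
    with neighbour votes weighted by the class prior probabilities."""
    labels = np.unique(y_lab)
    out = []
    for x in X_unlab:
        dists = np.linalg.norm(X_lab - x, axis=1)
        nearest = y_lab[np.argsort(dists)[:k]]
        scores = {c: priors[c] * np.sum(nearest == c) for c in labels}
        out.append(max(scores, key=scores.get))
    return np.array(out)
```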
Abstract:
Bayesian algorithms pose a limit to the performance learning algorithms can achieve. Natural selection should guide the evolution of information processing systems towards those limits. What can we learn from this evolution and what properties do the intermediate stages have? While this question is too general to permit any answer, progress can be made by restricting the class of information processing systems under study. We present analytical and numerical results for the evolution of on-line algorithms for learning from examples for neural network classifiers, which may or may not include a hidden layer. The analytical results are obtained by solving a variational problem to determine the learning algorithm that leads to maximum generalization ability. Simulations using evolutionary programming, for programs that implement learning algorithms, confirm and expand the results. The principal result is not just that the evolution is towards a Bayesian limit: indeed, it is essentially reached. In addition we find that evolution is driven by the discovery of useful structures or combinations of variables and operators. In different runs the temporal order of the discovery of such combinations is unique. The main result is that combinations that signal the surprise brought by an example always arise before combinations that serve to gauge the performance of the learning algorithm. These latter structures can be used to implement annealing schedules. The temporal ordering can be understood analytically as well, by doing the functional optimization in restricted functional spaces. We also show that there is data suggesting that the appearance of these traits follows the same temporal ordering in biological systems. © 2006 American Institute of Physics.
Abstract:
It is well known that the addition of noise to the input data of a neural network during training can, in some circumstances, lead to significant improvements in generalization performance. Previous work has shown that such training with noise is equivalent to a form of regularization in which an extra term is added to the error function. However, the regularization term, which involves second derivatives of the error function, is not bounded below, and so can lead to difficulties if used directly in a learning algorithm based on error minimization. In this paper we show that, for the purposes of network training, the regularization term can be reduced to a positive definite form which involves only first derivatives of the network mapping. For a sum-of-squares error function, the regularization term belongs to the class of generalized Tikhonov regularizers. Direct minimization of the regularized error function provides a practical alternative to training with noise.
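In outline (our notation, with ν² the variance of the added input noise), the reduced form of the regularized error for a sum-of-squares loss penalises only the first derivatives of the network mapping y(x):

\[ \tilde{E} \;=\; E \;+\; \nu^{2}\, \Omega, \qquad \Omega \;=\; \frac{1}{2} \sum_{n} \sum_{k,i} \left( \frac{\partial y_k}{\partial x_i} \bigg|_{x^n} \right)^{2}, \]

which is positive definite by construction, avoiding the unbounded second-derivative term.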
Abstract:
We propose a Bayesian framework for regression problems, which covers areas which are usually dealt with by function approximation. An online learning algorithm is derived which solves regression problems with a Kalman filter. Its solution always improves with increasing model complexity, without the risk of over-fitting. In the infinite dimension limit it approaches the true Bayesian posterior. The issues of prior selection and over-fitting are also discussed, showing that some of the commonly held beliefs are misleading. The practical implementation is summarised. Simulations using 13 popular publicly available data sets are used to demonstrate the method and highlight important issues concerning the choice of priors.
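A generic Kalman-filter update for a model that is linear in its features, y = φ(x)·w + noise, gives the flavour of such an online step (this is the textbook recursion, not necessarily the paper's exact formulation; `m` and `P` are the posterior mean and covariance of the weights):

```python
import numpy as np

def kalman_regression_step(m, P, phi, t, noise_var=0.1):
    """One online Bayesian update from a new (input, target) pair."""
    s = phi @ P @ phi + noise_var     # predictive variance of the target t
    k = P @ phi / s                   # Kalman gain
    m = m + k * (t - phi @ m)         # corrected posterior mean
    P = P - np.outer(k, phi @ P)      # corrected posterior covariance
    return m, P
```

Each call refines the Gaussian posterior over the weights with one new example.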
Abstract:
In recent years there has been an increased interest in applying non-parametric methods to real-world problems. Significant research has been devoted to Gaussian processes (GPs) due to their increased flexibility when compared with parametric models. These methods use Bayesian learning, which generally leads to analytically intractable posteriors. This thesis proposes a two-step solution to construct a probabilistic approximation to the posterior. In the first step we adapt Bayesian online learning to GPs: the final approximation to the posterior is the result of propagating the first and second moments of intermediate posteriors obtained by combining a new example with the previous approximation. The propagation of functional forms is solved by showing the existence of a parametrisation of the posterior moments that uses combinations of the kernel function at the training points, transforming the Bayesian online learning of functions into a parametric formulation. The drawback is the prohibitive quadratic scaling of the number of parameters with the size of the data, making the method inapplicable to large datasets. The second step solves the problem of the exploding parameter size and makes GPs applicable to arbitrarily large datasets. The approximation is based on a measure of distance between two GPs, the KL-divergence between GPs. This second approximation is with a constrained GP in which only a small subset of the whole training dataset is used to represent the GP. This subset is called the Basis Vector (BV) set, and the resulting GP is a sparse approximation to the true posterior. As this sparsity is based on the KL-minimisation, it is probabilistic and independent of the way the posterior approximation from the first step is obtained. We combine the sparse approximation with an extension to the Bayesian online algorithm that allows multiple iterations for each input, thus approximating a batch solution. The resulting sparse learning algorithm is a generic one: for different problems we only change the likelihood. The algorithm is applied to a variety of problems, and we examine its performance both on more classical regression and classification tasks and on data assimilation and a simple density estimation problem.
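In outline (symbols ours), the parametrisation mentioned above keeps the posterior moments in the span of kernel functions evaluated at the inputs seen so far,

\[ \mu_t(x) = \sum_{i=1}^{t} \alpha_i\, k(x, x_i), \qquad \Sigma_t(x, x') = k(x, x') + \sum_{i,j=1}^{t} C_{ij}\, k(x, x_i)\, k(x_j, x'), \]

which makes clear why the parameter count (dominated by the C entries) grows quadratically with the data, and why restricting the sums to a small BV set yields the sparse approximation.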
Abstract:
A number of researchers have investigated the application of neural networks to visual recognition, with much of the emphasis placed on exploiting the network's ability to generalise. However, despite the benefits of such an approach, it is not at all obvious how networks can be developed which are capable of recognising objects subject to changes in rotation, translation and viewpoint. In this study, we suggest that a possible solution to this problem can be found by studying aspects of visual psychology and, in particular, perceptual organisation. For example, it appears that grouping together lines based upon perceptually significant features can facilitate viewpoint-independent recognition. The work presented here identifies simple grouping measures based on parallelism and connectivity and shows how it is possible to train multi-layer perceptrons (MLPs) to detect and determine the perceptual significance of any group presented. In this way, it is shown how MLPs which are trained via backpropagation to perform individual grouping tasks can be brought together into a novel, large-scale network capable of determining the perceptual significance of the whole input pattern. Finally, the applicability of such significance values for recognition is investigated, and results indicate that both the MLP and the Kohonen Feature Map can be trained to recognise simple shapes described in terms of perceptual significances. This study has also provided an opportunity to investigate aspects of the backpropagation algorithm, particularly the ability to generalise. In this study we report the results of various generalisation tests. In applying the backpropagation algorithm to certain problems, we found that there was a deficiency in performance with the standard learning algorithm. An improvement in performance could, however, be obtained when suitable modifications were made to the algorithm. The modifications and consequent results are reported here.
Abstract:
We explore the effects of over-specificity in learning algorithms by investigating the behavior of a student, suited to learn optimally from a teacher B, learning from a teacher B' ≠ B. We only considered the supervised, on-line learning scenario, with teachers selected from a particular family. We found that, in the general case, the application of the optimal algorithm to the wrong teacher produces a residual generalization error, even if the right teacher is harder. By imposing mild conditions on the form of the learning algorithm, we obtained an approximation for the residual generalization error. Simulations carried out in finite networks validate the estimate found.
Abstract:
We present CORDER (COmmunity Relation Discovery by named Entity Recognition), an unsupervised machine learning algorithm that exploits named entity recognition and co-occurrence data to associate individuals in an organization with their expertise and associates. We discuss the problems associated with evaluating unsupervised learners and report our initial evaluation experiments.
Abstract:
A formalism for describing the dynamics of Genetic Algorithms (GAs) using methods from statistical mechanics is applied to the problem of generalization in a perceptron with binary weights. The dynamics are solved for the case where a new batch of training patterns is presented to each population member each generation, which considerably simplifies the calculation. The theory is shown to agree closely with simulations of a real GA averaged over many runs, accurately predicting the mean best solution found. For weak selection and large problem size, the difference equations describing the dynamics can be expressed analytically, and we find that the effects of noise due to the finite size of each training batch can be removed by increasing the population size appropriately. If this population resizing is used, one can deduce the most computationally efficient size of training batch each generation. For independent patterns this choice also gives the minimum total number of training patterns used. Although using independent patterns is in general a very inefficient use of training patterns, this work may also prove useful for determining the optimum batch size in the case where patterns are recycled.
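A minimal sketch of one generation of such a GA, with a fresh batch of patterns drawn each generation (the truncation selection and bit-flip mutation here are generic choices, not the paper's scheme; `fitness_on_batch` and `sample_batch` are assumed helpers):

```python
import numpy as np

def ga_generation(pop, fitness_on_batch, sample_batch, n_parents=10):
    """Advance a population of binary-weight individuals by one generation."""
    batch = sample_batch()               # new training patterns each generation
    scores = np.array([fitness_on_batch(ind, batch) for ind in pop])
    parents = pop[np.argsort(scores)[-n_parents:]]        # truncation selection
    children = parents[np.random.randint(n_parents, size=len(pop))].copy()
    flips = np.random.rand(*children.shape) < 0.01        # bit-flip mutation
    children[flips] ^= 1                                  # weights stay binary
    return children
```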
Abstract:
To solve multi-objective problems, multiple reward signals are often scalarized into a single value and further processed using established single-objective problem-solving techniques. While the field of multi-objective optimization has made many advances in applying scalarization techniques to obtain good solution trade-offs, the utility of applying these techniques in the multi-objective multi-agent learning domain has not yet been thoroughly investigated. Agents learn the value of their decisions by linearly scalarizing their reward signals at the local level, while acceptable system-wide behaviour results. However, the non-linear relationship between the weighting parameters of the scalarization function and the learned policy makes the discovery of system-wide trade-offs time-consuming. Our first contribution is a thorough analysis of well-known scalarization schemes within the multi-objective multi-agent reinforcement learning setup. The analysed approaches intelligently explore the weight space in order to find a wider range of system trade-offs. In our second contribution, we propose a novel adaptive weight algorithm which interacts with the underlying local multi-objective solvers and allows for a better coverage of the Pareto front. Our third contribution is the experimental validation of our approach by learning bi-objective policies in self-organising smart camera networks. We note that our algorithm (i) explores the objective space faster on many problem instances, (ii) obtains solutions that exhibit a larger hypervolume, and (iii) achieves a greater spread in the objective space.
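The linear scalarization at the heart of this setup is simply a weighted sum of the reward signals; a minimal sketch follows (names illustrative; the paper's adaptive algorithm, not shown, adjusts the weights over time to cover more of the Pareto front):

```python
import numpy as np

def scalarize(rewards, weights):
    """Collapse a multi-objective reward vector into a single scalar."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()     # keep the weights on the simplex
    return float(np.dot(weights, np.asarray(rewards, dtype=float)))
```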
Abstract:
In this paper we present a new approach to ontology learning. Its basis lies in a dynamic and iterative view of knowledge acquisition for ontologies. The Abraxas approach is founded on three resources, a set of texts, a set of learning patterns and a set of ontological triples, each of which must remain in equilibrium. As events occur which disturb this equilibrium various actions are triggered to re-establish a balance between the resources. Such events include acquisition of a further text from external resources such as the Web or the addition of ontological triples to the ontology. We develop the concept of a knowledge gap between the coverage of an ontology and the corpus of texts as a measure triggering actions. We present an overview of the algorithm and its functionalities.