53 results for Policy parameters

in Cambridge University Engineering Department Publications Database


Relevance:

70.00%

Publisher:

Abstract:

This article presents a novel algorithm for learning parameters in statistical dialogue systems which are modeled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy that selects the system's responses based on the inferred state; and a reward function that specifies the desired behavior of the system. Ideally both the model parameters and the policy would be designed to maximize the cumulative reward. However, while many techniques are available for learning the optimal policy, no method for learning the optimal model parameters that scales to real-world dialogue systems has yet been found. The presented algorithm, called the Natural Actor and Belief Critic (NABC), is a policy gradient method that offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected cumulative reward. The resulting gradient is then used to adapt both the prior distribution of the dialogue model parameters and the policy parameters. In addition, the article presents a variant of the NABC algorithm, called the Natural Belief Critic (NBC), which assumes that the policy is fixed and only the model parameters need to be estimated. The algorithms are evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximize the expected cumulative reward result in significantly improved performance compared to the baseline hand-crafted model parameters. The algorithms are also compared to optimization techniques using plain gradients and to state-of-the-art random search algorithms. In all cases, the algorithms based on the natural gradient work significantly better. © 2011 ACM.
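
As a rough illustration of the natural-gradient update described above, the sketch below combines a REINFORCE-style gradient estimate with an empirical Fisher matrix for a log-linear policy. It is not the authors' NABC implementation; the softmax policy, the toy contextual bandit, and all names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, features):
    """Action probabilities for a log-linear (softmax) policy."""
    logits = features @ theta                 # one logit per action
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def natural_gradient_step(theta, episodes, lr=0.1, ridge=1e-3):
    """One natural policy-gradient update from a batch of episodes.

    Each episode is a list of (features, action, reward) tuples.  The
    vanilla gradient g is the REINFORCE estimate, the Fisher matrix F
    is estimated from the score vectors, and the step follows F^{-1} g.
    """
    g = np.zeros_like(theta)
    F = np.zeros((theta.size, theta.size))
    for episode in episodes:
        ret = sum(r for _, _, r in episode)   # cumulative reward of the episode
        score = np.zeros_like(theta)
        for features, action, _ in episode:
            p = softmax_policy(theta, features)
            # gradient of log pi(action | state) for the log-linear policy
            score += features[action] - p @ features
        g += ret * score
        F += np.outer(score, score)
    g /= len(episodes)
    F = F / len(episodes) + ridge * np.eye(theta.size)
    return theta + lr * np.linalg.solve(F, g)

# Toy usage on a 3-action contextual bandit with random per-action features.
dim, n_actions = 4, 3
theta = np.zeros(dim)

def run_episode(theta):
    feats = rng.normal(size=(n_actions, dim))
    a = rng.choice(n_actions, p=softmax_policy(theta, feats))
    reward = float(a == feats[:, 0].argmax())  # reward the action with the largest first feature
    return [(feats, a, reward)]

for _ in range(200):
    theta = natural_gradient_step(theta, [run_episode(theta) for _ in range(32)])
```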

Relevance:

60.00%

Publisher:

Abstract:

Reinforcement learning techniques have been successfully used to maximise the expected cumulative reward of statistical dialogue systems. Typically, reinforcement learning is used to estimate the parameters of a dialogue policy which selects the system's responses based on the inferred dialogue state. However, the inference of the dialogue state itself depends on a dialogue model which describes the expected behaviour of a user when interacting with the system. Ideally, the parameters of this dialogue model should also be optimised to maximise the expected cumulative reward. This article presents two novel reinforcement algorithms for learning the parameters of a dialogue model. First, the Natural Belief Critic algorithm is designed to optimise the model parameters while the policy is kept fixed. This algorithm is suitable, for example, in systems using a handcrafted policy, perhaps prescribed by other design considerations. Second, the Natural Actor and Belief Critic algorithm jointly optimises both the model and the policy parameters. The algorithms are evaluated on a statistical dialogue system modelled as a Partially Observable Markov Decision Process in a tourist information domain. The evaluation is performed with a user simulator and with real users. The experiments indicate that model parameters estimated to maximise the expected reward function provide improved performance compared to the baseline handcrafted parameters. © 2011 Elsevier Ltd. All rights reserved.
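
For the fixed-policy case, the following sketch adapts Dirichlet prior hyperparameters over a single row of model parameters in the spirit of a Natural-Belief-Critic-style update: model parameters are sampled from the prior, dialogues are run with the fixed policy, and the hyperparameters are moved along the natural gradient of the expected reward. This is a sketch under simplifying assumptions (one Dirichlet row, a hypothetical toy reward), not the authors' NBC algorithm.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)

def nbc_style_step(log_alpha, simulate_dialogue, n_samples=64, lr=0.05, ridge=1e-3):
    """Natural-gradient update of Dirichlet prior hyperparameters.

    `log_alpha` parameterises a Dirichlet prior over one row of the
    dialogue-model parameters; `simulate_dialogue(theta)` runs the fixed
    handcrafted policy with sampled model parameters `theta` and returns
    the cumulative reward.
    """
    alpha = np.exp(log_alpha)
    g = np.zeros_like(log_alpha)
    F = np.zeros((log_alpha.size, log_alpha.size))
    for _ in range(n_samples):
        theta = rng.dirichlet(alpha)                     # sample model parameters
        reward = simulate_dialogue(theta)
        # gradient of log Dirichlet(theta | alpha) with respect to log_alpha
        score = alpha * (digamma(alpha.sum()) - digamma(alpha) + np.log(theta))
        g += reward * score
        F += np.outer(score, score)
    g /= n_samples
    F = F / n_samples + ridge * np.eye(log_alpha.size)
    return log_alpha + lr * np.linalg.solve(F, g)

# Toy usage: the (hypothetical) reward is larger when the sampled row favours state 0.
log_alpha = np.zeros(3)                                  # start from Dirichlet(1, 1, 1)
toy_reward = lambda theta: 10.0 * theta[0] + rng.normal(scale=0.5)
for _ in range(100):
    log_alpha = nbc_style_step(log_alpha, toy_reward)
```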

Relevance:

30.00%

Publisher:

Abstract:

Although it is widely believed that reinforcement learning is a suitable tool for describing behavioral learning, the mechanisms by which it can be implemented in networks of spiking neurons are not fully understood. Here, we show that different learning rules emerge from a policy gradient approach depending on which features of the spike trains are assumed to influence the reward signals, i.e., depending on which neural code is in effect. We use the framework of Williams (1992) to derive learning rules for arbitrary neural codes. For illustration, we present policy-gradient rules for three different example codes - a spike count code, a spike timing code and the most general "full spike train" code - and test them on simple model problems. In addition to classical synaptic learning, we derive learning rules for intrinsic parameters that control the excitability of the neuron. The spike count learning rule has structural similarities with established Bienenstock-Cooper-Munro rules. If the distribution of the relevant spike train features belongs to the natural exponential family, the learning rules have a characteristic shape that raises interesting prediction problems.
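
To make the spike-count case concrete, here is a minimal reward-modulated sketch in the Williams (1992) REINFORCE framework, assuming a Poisson neuron with a bounded sigmoidal rate; the eligibility is the gradient of the log-likelihood of the observed spike count. The task, gain function, and constants are assumptions for illustration, not the paper's simulations.

```python
import numpy as np

rng = np.random.default_rng(2)

RHO_MAX = 5.0            # ceiling on the firing rate (assumption for this sketch)

def firing_rate(w, x):
    """Bounded firing rate rho = RHO_MAX * sigmoid(w . x)."""
    return RHO_MAX / (1.0 + np.exp(-w @ x))

def spike_count_update(w, x, n, rho, reward, dt=1.0, eta=0.02):
    """Reward-modulated REINFORCE update under a spike-count code.

    (n - rho*dt) * (1 - rho/RHO_MAX) * x is the gradient of log P(n | w)
    for the Poisson spike count with the bounded rate above; the weight
    change is the reward times this eligibility (Williams, 1992).
    """
    eligibility = (n - rho * dt) * (1.0 - rho / RHO_MAX) * x
    return w + eta * reward * eligibility

# Toy usage: reward spiking for pattern_a and silence for pattern_b.
w = np.zeros(5)
pattern_a, pattern_b = rng.normal(size=5), rng.normal(size=5)
for _ in range(2000):
    x = pattern_a if rng.random() < 0.5 else pattern_b
    rho = firing_rate(w, x)
    n = rng.poisson(rho)                       # sampled spike count in a unit window
    reward = 1.0 if (x is pattern_a) == (n > 0) else -1.0
    w = spike_count_update(w, x, n, rho, reward)
```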

Relevance:

20.00%

Publisher:

Abstract:

Supersonic cluster beam deposition has been used to produce films with different nanostructures by controlling the deposition parameters such as the film thickness, substrate temperature and cluster mass distribution. The field emission properties of cluster-assembled carbon films have been characterized and correlated to the evolution of the film nanostructure. Threshold fields ranging between 4 and 10 V/μm and saturation current densities as high as 0.7 mA have been measured for samples heated during deposition. A series of voltage ramps, i.e., a conditioning process, was found to initiate more stable and reproducible emission. It was found that the presence of graphitic particles (onions, nanotube embryos) in the films substantially enhances the field emission performance. Films patterned on a micrometer scale have been conditioned spot by spot by a ball-tip anode, showing that a relatively high emission site density can be achieved from the cluster-assembled material. © 2002 American Institute of Physics.

Relevance:

20.00%

Publisher:

Abstract:

This paper considers a class of dynamic spatial point processes (PPs) that evolve over time in a Markovian fashion. This Markov-in-time PP is hidden and observed indirectly through another PP via thinning, displacement and noise. This statistical model is important for multi-object tracking applications, and we present an approximate likelihood-based method for estimating the model parameters. The approach is supported by an extensive numerical study.
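
As a purely illustrative picture of this observation model, the sketch below simulates hidden targets following Gaussian random walks and generates each observed point pattern by thinning, displacement, and superposition of clutter noise. All parameter names and values are assumptions; the paper's model and its approximate-likelihood estimator are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(num_steps=20, num_targets=5, p_detect=0.9,
             motion_std=0.1, obs_std=0.05, clutter_rate=3.0, extent=10.0):
    """Simulate a hidden Markov-in-time point process and its observations.

    Hidden targets follow independent Gaussian random walks.  At each
    step the observed point pattern is obtained by thinning (each target
    detected with probability `p_detect`), displacement (Gaussian
    measurement noise) and noise (a Poisson number of uniform clutter
    points over the observation window).
    """
    targets = rng.uniform(0.0, extent, size=(num_targets, 2))
    hidden, observed = [], []
    for _ in range(num_steps):
        targets = targets + rng.normal(scale=motion_std, size=targets.shape)
        detected = targets[rng.random(num_targets) < p_detect]                   # thinning
        measured = detected + rng.normal(scale=obs_std, size=detected.shape)     # displacement
        clutter = rng.uniform(0.0, extent, size=(rng.poisson(clutter_rate), 2))  # noise
        hidden.append(targets.copy())
        observed.append(np.vstack([measured, clutter]))
    return hidden, observed

hidden, observed = simulate()
print(len(observed), observed[0].shape)   # 20 time steps of observed point patterns
```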
