29 results for reward
in Cambridge University Engineering Department Publications Database
Abstract:
Humans appear to have an inherent prosocial tendency toward one another in that we often take pleasure in seeing others succeed. This fact is almost certainly exploited by game shows, yet why watching others win elicits a pleasurable vicarious rewarding feeling in the absence of personal economic gain is unclear. One explanation is that game shows use contestants who have similarities to the viewing population, thereby kindling kin-motivated responses (for example, prosocial behavior). Using a game show-inspired paradigm, we show that interactions between the ventral striatum and anterior cingulate cortex subserve the modulation of vicarious reward by similarity. Our results support studies showing that similarity acts as a proximate neurobiological mechanism by which prosocial behavior extends to unrelated strangers.
Abstract:
Establishing a function for the neuromodulator serotonin in human decision-making has proved remarkably difficult because of its complex role in reward and punishment processing. In a novel choice task where actions led concurrently and independently to the stochastic delivery of both money and pain, we studied the impact of decreased brain serotonin induced by acute dietary tryptophan depletion. Depletion selectively impaired both behavioral and neural representations of reward outcome value, and hence the effective exchange rate by which rewards and punishments were compared. This effect was computationally and anatomically distinct from a separate effect of increasing outcome-independent choice perseveration. Our results provide evidence for a surprising role for serotonin in reward processing, while illustrating its complex and multifarious effects.
Abstract:
Theories of instrumental learning are centred on understanding how success and failure are used to improve future decisions. These theories highlight a central role for reward prediction errors in updating the values associated with available actions. In animals, substantial evidence indicates that the neurotransmitter dopamine might have a key function in this type of learning, through its ability to modulate cortico-striatal synaptic efficacy. However, no direct evidence links dopamine, striatal activity and behavioural choice in humans. Here we show that, during instrumental learning, the magnitude of reward prediction error expressed in the striatum is modulated by the administration of drugs enhancing (3,4-dihydroxy-L-phenylalanine; L-DOPA) or reducing (haloperidol) dopaminergic function. Accordingly, subjects treated with L-DOPA have a greater propensity to choose the most rewarding action relative to subjects treated with haloperidol. Furthermore, incorporating the magnitude of the prediction errors into a standard action-value learning algorithm accurately reproduced subjects' behavioural choices under the different drug conditions. We conclude that dopamine-dependent modulation of striatal activity can account for how the human brain uses reward prediction errors to improve future decisions.
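The action-value account described here can be made concrete in a few lines. Below is a minimal sketch, not the authors' exact model: a delta-rule learner with softmax choice, in which an illustrative `gain` parameter scales the reward prediction error as a crude stand-in for dopaminergic enhancement (L-DOPA) or reduction (haloperidol). All names and parameter values are assumptions.

```python
import numpy as np

def simulate_learner(rewards, n_actions=2, alpha=0.3, beta=3.0, gain=1.0, seed=0):
    """Delta-rule action-value learner with softmax choice.

    `rewards[t][a]` is the payoff of action `a` on trial `t`. `gain` scales
    the reward prediction error, a crude stand-in for enhanced (gain > 1,
    'L-DOPA') or reduced (gain < 1, 'haloperidol') dopaminergic modulation.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros(n_actions)
    choices = []
    for t in range(len(rewards)):
        p = np.exp(beta * Q)
        p /= p.sum()                         # softmax choice probabilities
        a = rng.choice(n_actions, p=p)
        delta = rewards[t][a] - Q[a]         # reward prediction error
        Q[a] += alpha * gain * delta         # modulated value update
        choices.append(a)
    return np.array(choices)
```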
Abstract:
Food preferences are acquired through experience and can exert strong influence on choice behavior. In order to choose which food to consume, it is necessary to maintain a predictive representation of the subjective value of the associated food stimulus. Here, we explore the neural mechanisms by which such predictive representations are learned through classical conditioning. Human subjects were scanned using fMRI while learning associations between arbitrary visual stimuli and subsequent delivery of one of five different food flavors. Using a temporal difference algorithm to model learning, we found predictive responses in the ventral midbrain and a part of ventral striatum (ventral putamen) that were related directly to subjects' actual behavioral preferences. These brain structures demonstrated divergent response profiles, with the ventral midbrain showing a linear response profile with preference, and the ventral striatum a bivalent response. These results provide insight into the neural mechanisms underlying human preference behavior.
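To make the modeling approach concrete, here is a minimal temporal-difference sketch over a simple within-trial state chain (cue, delay, outcome). The state spacing, parameters, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def td_conditioning(trials, n_cues=5, alpha=0.2, gamma=0.95):
    """TD(0) over a within-trial state chain: cue -> delay -> outcome.

    Each trial is (cue_id, flavor_value); the flavor's subjective value
    arrives only at the outcome step. V holds one entry per (cue, phase)
    state, so learned predictions propagate back to the cue over trials.
    """
    V = np.zeros((n_cues, 3))               # phases: 0 cue, 1 delay, 2 outcome
    for cue, value in trials:
        rewards = [0.0, 0.0, value]         # reward delivered only at outcome
        for phase in range(3):
            v_next = V[cue, phase + 1] if phase < 2 else 0.0
            delta = rewards[phase] + gamma * v_next - V[cue, phase]
            V[cue, phase] += alpha * delta  # prediction-error update
    return V
```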
Abstract:
Recent experiments have shown that spike-timing-dependent plasticity is influenced by neuromodulation. We derive theoretical conditions for successful learning of reward-related behavior for a large class of learning rules where Hebbian synaptic plasticity is conditioned on a global modulatory factor signaling reward. We show that all learning rules in this class can be separated into a term that captures the covariance of neuronal firing and reward and a second term that represents the influence of unsupervised learning. The unsupervised term, which is, in general, detrimental for reward-based learning, can be suppressed if the neuromodulatory signal encodes the difference between the reward and the expected reward, but only if the expected reward is calculated for each task and stimulus separately. If several tasks are to be learned simultaneously, the nervous system needs an internal critic that is able to predict the expected reward for arbitrary stimuli. We show that, with a critic, reward-modulated spike-timing-dependent plasticity is capable of learning motor trajectories with a temporal resolution of tens of milliseconds. The relation to temporal difference learning, the relevance of block-based learning paradigms, and the limitations of learning with a critic are discussed.
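The class of rules discussed can be sketched in a rate-based form (ignoring spike timing for brevity): a Hebbian eligibility term multiplied by reward minus a stimulus-specific expected reward, with the per-stimulus baseline playing the role of the critic. Everything below is an illustrative assumption, not the paper's derivation.

```python
import numpy as np

class RewardModulatedHebb:
    """Reward-modulated Hebbian learning with a per-stimulus critic.

    Weight change: dw = eta * (R - R_hat[stimulus]) * eligibility.
    Subtracting a stimulus-specific reward baseline suppresses the
    detrimental unsupervised term described in the abstract.
    """
    def __init__(self, n_in, n_out, n_stimuli, eta=0.01, critic_rate=0.1):
        self.w = np.zeros((n_out, n_in))
        self.r_hat = np.zeros(n_stimuli)    # critic: expected reward per stimulus
        self.eta, self.critic_rate = eta, critic_rate

    def update(self, stimulus, pre, post, reward):
        eligibility = np.outer(post, pre)   # Hebbian coincidence term
        advantage = reward - self.r_hat[stimulus]
        self.w += self.eta * advantage * eligibility
        # Running-average reward, tracked per stimulus as the abstract requires.
        self.r_hat[stimulus] += self.critic_rate * (reward - self.r_hat[stimulus])
```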
Abstract:
When a racing driver steers a car around a sharp bend, there is a trade-off between speed and accuracy, in that high speed can lead to a skid whereas a low speed increases lap time, both of which can adversely affect the driver's payoff function. While speed-accuracy trade-offs have been studied extensively, their susceptibility to risk sensitivity is much less understood, since most theories of motor control are risk neutral with respect to payoff, i.e., they only consider mean payoffs and ignore payoff variability. Here we investigate how individual risk attitudes impact a motor task that involves such a speed-accuracy trade-off. We designed an experiment where a target had to be hit and the reward (given in points) increased as a function of both subjects' endpoint accuracy and endpoint velocity. As faster movements lead to poorer endpoint accuracy, the variance of the reward increased for higher velocities. We tested subjects on two reward conditions that had the same mean reward but differed in the variance of the reward. A risk-neutral account predicts that subjects should only maximize the mean reward and hence perform identically in the two conditions. In contrast, we found that some (risk-averse) subjects chose to move with lower velocities and other (risk-seeking) subjects with higher velocities in the condition with higher reward variance (risk). This behavior is suboptimal with regard to maximizing the mean number of points but is in accordance with a risk-sensitive account of movement selection. Our study suggests that individual risk sensitivity is an important factor in motor tasks with speed-accuracy trade-offs.
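One standard way to formalize such risk attitudes is a mean-variance utility; the abstract does not commit to this exact form, so the sketch below is an assumption for illustration. It contrasts two reward streams with equal means but different variances, over which a risk-neutral agent is indifferent while risk-sensitive agents are not.

```python
import numpy as np

def risk_sensitive_utility(reward_samples, theta=0.0):
    """Mean-variance utility: U = mean(R) - theta * var(R).

    theta > 0 is risk-averse, theta < 0 risk-seeking, theta = 0 risk-neutral.
    One common formalization of risk sensitivity; illustrative only.
    """
    r = np.asarray(reward_samples, dtype=float)
    return r.mean() - theta * r.var()

# Two velocity choices with equal mean reward (50) but different variance:
low_var  = [48, 50, 52, 50]     # slow, accurate movements
high_var = [20, 80, 30, 70]     # fast movements, variable endpoint
# A risk-neutral agent (theta = 0) is indifferent; risk-sensitive agents are not.
for theta in (0.0, 0.02, -0.02):
    print(theta, risk_sensitive_utility(low_var, theta),
          risk_sensitive_utility(high_var, theta))
```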
Abstract:
This article presents a novel algorithm for learning parameters in statistical dialogue systems which are modeled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy that selects the system's responses based on the inferred state; and a reward function that specifies the desired behavior of the system. Ideally both the model parameters and the policy would be designed to maximize the cumulative reward. However, while there are many techniques available for learning the optimal policy, no good ways of learning the optimal model parameters that scale to real-world dialogue systems have been found yet. The presented algorithm, called the Natural Actor and Belief Critic (NABC), is a policy gradient method that offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected cumulative reward. The resulting gradient is then used to adapt both the prior distribution of the dialogue model parameters and the policy parameters. In addition, the article presents a variant of the NABC algorithm, called the Natural Belief Critic (NBC), which assumes that the policy is fixed and only the model parameters need to be estimated. The algorithms are evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximize the expected cumulative reward result in significantly improved performance compared to the baseline hand-crafted model parameters. The algorithms are also compared to optimization techniques using plain gradients and state-of-the-art random search algorithms. In all cases, the algorithms based on the natural gradient work significantly better. © 2011 ACM.
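The natural-gradient idea at the core of NABC can be sketched generically: estimate the policy gradient from sampled dialogues and their rewards, then precondition it with an empirical Fisher matrix built from the same score vectors. The code below is a textbook-style sketch under those assumptions, not the paper's estimator.

```python
import numpy as np

def natural_gradient_step(grads, rewards, lr=0.1, reg=1e-3):
    """One natural-gradient update from sampled dialogues.

    `grads[i]` is the score vector grad_theta log p(dialogue_i; theta)
    and `rewards[i]` its cumulative reward. The vanilla gradient is
    preconditioned by an inverse empirical Fisher matrix estimated
    from the same score vectors.
    """
    G = np.asarray(grads)                      # shape (n_episodes, n_params)
    R = np.asarray(rewards)
    g = (G * (R - R.mean())[:, None]).mean(0)  # baseline-subtracted gradient
    F = G.T @ G / len(R) + reg * np.eye(G.shape[1])  # empirical Fisher
    return lr * np.linalg.solve(F, g)          # parameter increment
```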
Abstract:
Perceptual learning improves perception through training. Perceptual learning improves with most stimulus types but fails when certain stimulus types are mixed during training (roving). This result is surprising because classical supervised and unsupervised neural network models can cope easily with roving conditions. What makes humans so inferior compared to these models? As experimental and conceptual work has shown, human perceptual learning is neither supervised nor unsupervised but reward-based learning. Reward-based learning suffers from the so-called unsupervised bias, i.e., to prevent synaptic "drift", the average reward has to be exactly estimated. However, this is impossible when two or more stimulus types with different rewards are presented during training (and the reward is estimated by a running average). For this reason, we propose that no learning occurs in roving conditions. However, roving hinders perceptual learning only for combinations of similar stimulus types but not for dissimilar ones. In this latter case, we propose that a critic can estimate the reward for each stimulus type separately. One implication of our analysis is that the critic cannot be located in the visual system. © 2011 Elsevier Ltd.
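The unsupervised-bias argument is easy to demonstrate numerically. The toy sketch below (all numbers arbitrary) tracks reward with a single running average while two stimulus types with different rewards are interleaved; the shared estimate converges to the mixture mean, whereas a per-type critic recovers each reward separately.

```python
import numpy as np

def running_average_bias(reward_a=0.8, reward_b=0.2, rate=0.05, n=2000, seed=0):
    """Illustrates the unsupervised bias under roving.

    With a single running average over mixed stimulus types, "the" reward
    estimate settles near the mixture mean (~0.5), so each type's reward is
    systematically mis-estimated; a per-type critic avoids this.
    """
    rng = np.random.default_rng(seed)
    shared_avg = 0.0
    per_type = {"A": 0.0, "B": 0.0}            # separate critic per stimulus type
    for _ in range(n):
        stim = rng.choice(["A", "B"])          # roving: types are interleaved
        r = reward_a if stim == "A" else reward_b
        shared_avg += rate * (r - shared_avg)
        per_type[stim] += rate * (r - per_type[stim])
    return shared_avg, per_type                # ~0.5 vs ~{A: 0.8, B: 0.2}
```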
Abstract:
Reinforcement techniques have been successfully used to maximise the expected cumulative reward of statistical dialogue systems. Typically, reinforcement learning is used to estimate the parameters of a dialogue policy which selects the system's responses based on the inferred dialogue state. However, the inference of the dialogue state itself depends on a dialogue model which describes the expected behaviour of a user when interacting with the system. Ideally the parameters of this dialogue model should also be optimised to maximise the expected cumulative reward. This article presents two novel reinforcement algorithms for learning the parameters of a dialogue model. First, the Natural Belief Critic algorithm is designed to optimise the model parameters while the policy is kept fixed. This algorithm is suitable, for example, in systems using a handcrafted policy, perhaps prescribed by other design considerations. Second, the Natural Actor and Belief Critic algorithm jointly optimises both the model and the policy parameters. The algorithms are evaluated on a statistical dialogue system modelled as a Partially Observable Markov Decision Process in a tourist information domain. The evaluation is performed with a user simulator and with real users. The experiments indicate that model parameters estimated to maximise the expected reward function provide improved performance compared to the baseline handcrafted parameters. © 2011 Elsevier Ltd. All rights reserved.
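For context, the dialogue-model parameters being optimised here are the ones that drive belief monitoring. Below is a textbook Bayes-filter sketch of that update for a discrete POMDP; the flat transition and observation matrices are a simplifying assumption, not the paper's factored representation.

```python
import numpy as np

def belief_update(b, action, obs, T, O):
    """Exact POMDP belief-state update for a discrete dialogue model.

    b[s] is the current belief over hidden dialogue states, T[a][s, s'] the
    transition model, and O[a][s', o] the observation model; these are the
    "dialogue model parameters" that NBC/NABC tune from observed rewards.
    """
    predicted = b @ T[action]                  # sum_s T(s'|s,a) b(s)
    updated = O[action][:, obs] * predicted    # weight by P(o|s',a)
    return updated / updated.sum()             # normalize to a distribution
```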
Abstract:
Current commercial dialogue systems typically use hand-crafted grammars for Spoken Language Understanding (SLU) operating on the top one or two hypotheses output by the speech recogniser. These systems are expensive to develop and they suffer from significant degradation in performance when faced with recognition errors. This paper presents a robust method for SLU based on features extracted from the full posterior distribution of recognition hypotheses encoded in the form of word confusion networks. Following [1], the system uses SVM classifiers operating on n-gram features, trained on unaligned input/output pairs. Performance is evaluated on both an off-line corpus and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance both in terms of accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the previous system output. © 2012 IEEE.
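A common way to lift n-gram features to the full posterior ASR output is to compute expected n-gram counts over the word confusion network and train a linear SVM per semantic item. The sketch below assumes this construction (with posteriors treated as independent across slots) and the scikit-learn API; it is not the paper's implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cnet_ngram_features(confusion_network, vocab_index, n=2):
    """Expected n-gram counts from a word confusion network.

    `confusion_network` is a list of slots, each a list of (word, posterior)
    pairs. Each n-gram's feature value is the product of its words'
    posteriors in consecutive slots, summed over all paths.
    """
    x = np.zeros(len(vocab_index))
    for i in range(len(confusion_network) - n + 1):
        slots = confusion_network[i:i + n]
        def expand(j, words, p):
            if j == n:
                key = " ".join(words)
                if key in vocab_index:
                    x[vocab_index[key]] += p   # accumulate expected count
                return
            for w, pw in slots[j]:
                expand(j + 1, words + [w], p * pw)
        expand(0, [], 1.0)
    return x

# One linear classifier per semantic slot/value, trained on these vectors:
# clf = LinearSVC().fit(X_train, y_train)
```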
Abstract:
Humans are creatures of routine and habit. When faced with situations in which a default option is available, people show a consistent tendency to stick with the default. Why this occurs is unclear. To elucidate its neural basis, we used a novel gambling task in conjunction with functional magnetic resonance imaging. Behavioral results revealed that participants were more likely to choose the default card and felt enhanced emotional responses to outcomes after making the decision to switch. We show that an increased tendency to switch away from the default during the decision phase was associated with decreased activity in the anterior insula; activation in this same area in reaction to "switching away from the default and losing" was positively related to experienced frustration. In contrast, decisions to choose the default engaged the ventral striatum, the same reward area as seen in winning. Our findings highlight aversive processes in the insula as underlying the default bias and suggest that choosing the default may be rewarding in itself.
Abstract:
Studies of human decision making emerge from two dominant traditions: learning theorists [1-3] study choices in which options are evaluated on the basis of experience, whereas behavioral economists and financial decision theorists study choices in which the key decision variables are explicitly stated. Growing behavioral evidence suggests that valuation based on these different classes of information involves separable mechanisms [4-8], but the relevant neuronal substrates are unknown. This is important for understanding the all-too-common situation in which choices must be made between alternatives that involve one or another kind of information. We studied behavior and brain activity while subjects made decisions between risky financial options, in which the associated utilities were either learned or explicitly described. We show a characteristic effect in subjects' behavior when comparing information acquired from experience with that acquired from description, suggesting that these kinds of information are treated differently. This behavioral effect was reflected neurally, and we show differential sensitivity to learned and described value and risk in brain regions commonly associated with reward processing. Our data indicate that, during decision making under risk, both behavior and the neural encoding of key decision variables are strongly influenced by the manner in which value information is presented.