Abstract:
Theories of instrumental learning are centred on understanding how success and failure are used to improve future decisions. These theories highlight a central role for reward prediction errors in updating the values associated with available actions. In animals, substantial evidence indicates that the neurotransmitter dopamine might have a key function in this type of learning, through its ability to modulate cortico-striatal synaptic efficacy. However, no direct evidence links dopamine, striatal activity and behavioural choice in humans. Here we show that, during instrumental learning, the magnitude of reward prediction error expressed in the striatum is modulated by the administration of drugs enhancing (3,4-dihydroxy-L-phenylalanine; L-DOPA) or reducing (haloperidol) dopaminergic function. Accordingly, subjects treated with L-DOPA have a greater propensity to choose the most rewarding action relative to subjects treated with haloperidol. Furthermore, incorporating the magnitude of the prediction errors into a standard action-value learning algorithm accurately reproduced subjects' behavioural choices under the different drug conditions. We conclude that dopamine-dependent modulation of striatal activity can account for how the human brain uses reward prediction errors to improve future decisions.
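For concreteness, below is a minimal sketch of the kind of standard action-value learning this abstract refers to, with the dopaminergic manipulation modeled as a scaling of the prediction error. The task structure, parameter values, and the exact form of the drug effect are illustrative assumptions, not the authors' fitted model.

```python
import numpy as np

def softmax(q, beta):
    """Turn action values into choice probabilities."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

def simulate(n_trials=60, p_reward=(0.8, 0.2), alpha=0.3, beta=3.0,
             drug_gain=1.0, seed=0):
    """Rescorla-Wagner learner on a two-armed bandit. drug_gain scales the
    prediction error (>1 mimicking L-DOPA, <1 mimicking haloperidol); this
    is one illustrative way to model the drug effect."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                          # action values
    chose_best = 0
    for _ in range(n_trials):
        a = rng.choice(2, p=softmax(q, beta))
        r = float(rng.random() < p_reward[a])
        delta = r - q[a]                     # reward prediction error
        q[a] += alpha * drug_gain * delta    # dopamine-modulated update
        chose_best += (a == 0)               # action 0 is the better option
    return chose_best / n_trials

print("L-DOPA-like:     ", simulate(drug_gain=1.3))
print("haloperidol-like:", simulate(drug_gain=0.7))
```

With a larger effective prediction error, the simulated learner converges on the more rewarding action faster, mirroring the reported behavioural difference between drug groups.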
Abstract:
Before choosing, it helps to know both the expected value signaled by a predictive cue and the associated uncertainty that the reward will be forthcoming. Recently, Fiorillo et al. (2003) found that the dopamine (DA) neurons of the SNc exhibit sustained responses related to the uncertainty that a cue will be followed by reward, in addition to phasic responses related to reward prediction errors (RPEs). This suggests that cue-dependent anticipations of the timing, magnitude, and uncertainty of rewards are learned and reflected in components of the DA signals broadcast by SNc neurons. What is the minimal local circuit model that can explain such multifaceted reward-related learning? A new computational model shows how learned uncertainty responses emerge robustly on single trials along with phasic RPE responses, such that both types of DA responses exhibit the empirically observed dependence on conditional probability, expected value of reward, and time since onset of the reward-predicting cue. The model includes three major pathways for computing: immediate expected values of cues, timed predictions of reward magnitudes (and RPEs), and the uncertainty associated with these predictions. The first two model pathways refine those previously modeled by Brown et al. (1999). A third, newly modeled, pathway is formed by medium spiny projection neurons (MSPNs) of the matrix compartment of the striatum, whose axons co-release GABA and a neuropeptide, substance P, both at synapses with GABAergic neurons in the SNr and with the dendrites (in SNr) of DA neurons whose somas are in ventral SNc. Co-release enables efficient computation of sustained DA uncertainty responses that are a non-monotonic function of the conditional probability that a reward will follow the cue. The new model's incorporation of a striatal microcircuit allowed it to reveal that variability in striatal cholinergic transmission can explain observed differences between monkeys in the amplitude of the non-monotonic uncertainty function. Involvement of matrix MSPNs and striatal cholinergic transmission implies a relation between uncertainty in the cue-reward contingency and the action-selection functions of the basal ganglia. The model synthesizes anatomical, electrophysiological, and behavioral data regarding the midbrain DA system in a novel way, by relating the ability to compute uncertainty, in parallel with other aspects of reward contingencies, to the unique distribution of SP inputs in ventral SN.
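For intuition about why the uncertainty response is non-monotonic (this arithmetic is standard reinforcement-learning bookkeeping, not part of the circuit model itself): for a cue followed by a reward of magnitude m with probability p, the reward variance m²p(1−p) peaks at p = 0.5, while the phasic RPE at reward delivery scales with (1−p) and the cue response with p.

```python
import numpy as np

p = np.linspace(0.0, 1.0, 11)  # conditional probability of reward after the cue
m = 1.0                        # reward magnitude

uncertainty = m**2 * p * (1 - p)   # Bernoulli reward variance: peaks at p = 0.5
rpe_at_reward = m * (1 - p)        # phasic positive RPE when the reward arrives
rpe_at_cue = m * p                 # phasic RPE at cue onset

for pi, u, dr, dc in zip(p, uncertainty, rpe_at_reward, rpe_at_cue):
    print(f"p={pi:.1f}  variance={u:.2f}  RPE(reward)={dr:.2f}  RPE(cue)={dc:.2f}")
```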
Abstract:
In probabilistic decision tasks, the expected value (EV) of a choice is calculated and, after the choice has been made, can be updated on the basis of a temporal difference (TD) prediction error between the EV and the reward magnitude (RM) obtained. The EV is computed as the probability of obtaining a reward multiplied by the RM. To understand the contribution of different brain areas to these decision-making processes, functional magnetic resonance imaging activations related to EV versus RM (or outcome) were measured in a probabilistic decision task. Activations in the medial orbitofrontal cortex were correlated with both RM and EV, and a conjunction analysis confirmed that they extend toward the pregenual cingulate cortex. From these representations, TD reward prediction errors could be produced. Activations in areas that receive projections from the orbitofrontal cortex, including the ventral striatum, midbrain, and inferior frontal gyrus, were correlated with the TD error. Activations in the anterior insula were correlated negatively with EV, occurring when low reward outcomes were expected, and also with the uncertainty of the reward, implicating this region in representing basic and crucial decision-making parameters: low expected outcomes and uncertainty.
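Spelled out with illustrative numbers (a standard formulation rather than anything specific to this study):

```python
# Worked example of the quantities manipulated in such tasks.
p = 0.3          # probability that the chosen option is rewarded
rm = 10.0        # reward magnitude if the reward is delivered

ev = p * rm      # expected value = probability x magnitude = 3.0

# TD prediction error at outcome: obtained minus expected.
delta_rewarded = rm - ev      # +7.0 on rewarded trials
delta_omitted = 0.0 - ev      # -3.0 on omitted-reward trials
print(ev, delta_rewarded, delta_omitted)
```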
Abstract:
Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
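A discrete-time, tabular caricature of this actor-critic arrangement is sketched below. The paper's actual model uses spiking neurons with continuous time, state, and action, so this only illustrates the signal flow: the critic's TD error plays the role of the neuromodulatory signal and trains both critic and actor. The task, parameters, and representations are placeholder assumptions.

```python
import numpy as np

# Tabular actor-critic on a 1-D track: reach the goal state on the right.
rng = np.random.default_rng(1)
n_states, goal = 10, 9
gamma, alpha_v, alpha_p = 0.95, 0.1, 0.1
V = np.zeros(n_states)                 # critic: predicted future reward
prefs = np.zeros((n_states, 2))        # actor: preferences for {left, right}

for episode in range(300):
    s = 0
    while s != goal:
        probs = np.exp(prefs[s]) / np.exp(prefs[s]).sum()
        a = rng.choice(2, p=probs)
        s_next = max(0, min(n_states - 1, s + (1 if a else -1)))
        r = 1.0 if s_next == goal else 0.0
        # Dopamine-like TD signal, broadcast to critic and actor alike:
        delta = r + gamma * V[s_next] * (s_next != goal) - V[s]
        V[s] += alpha_v * delta               # critic update
        prefs[s, a] += alpha_p * delta        # actor update (same delta)
        s = s_next

print(np.round(V, 2))   # learned values ramp up toward the goal
```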
Abstract:
Recent electrophysiological data inspired the claim that dopaminergic neurons adapt their mismatch sensitivities to reflect the variances of expected rewards. This contradicts reward prediction error theory and most basal ganglia models. Application of learning principles points to a testable alternative interpretation of the same data that is compatible with existing theory.
Abstract:
This work was developed in the context of the MIT Portugal Program, in the area of Bioengineering Systems, in collaboration with the Champalimaud Research Programme, Champalimaud Center for the Unknown, Lisbon, Portugal. The project, entitled "Dynamics of serotonergic neurons revealed by fiber photometry", was carried out at the Instituto Gulbenkian de Ciência, Oeiras, Portugal, and at the Champalimaud Research Programme, Champalimaud Center for the Unknown, Lisbon, Portugal.
Abstract:
Throughout life, the brain develops representations of its environment that allow the individual to make the most of it. How these representations develop during the pursuit of reward remains a mystery. It is reasonable to think that the cortex is the seat of these representations and that the basal ganglia play an important role in reward maximization. In particular, dopaminergic neurons appear to encode a reward prediction error signal. This thesis studies the problem by using machine learning to build a computational model that integrates a large body of neurological evidence. After an introduction to the mathematical framework and to some machine learning algorithms, an overview of learning in psychology and neuroscience, and a review of models of learning in the basal ganglia, the thesis comprises three articles. The first shows that it is possible to learn to maximize rewards while developing better representations of the inputs. The second article addresses the important, still unsolved problem of the representation of time. It demonstrates that a representation of time can be acquired automatically in an artificial neural network serving as a working memory. The representation developed by the model closely resembles the activity of cortical neurons in similar tasks. Moreover, the model shows that using the reward prediction error signal can accelerate the construction of these temporal representations. Finally, it shows that such a representation, acquired automatically in the cortex, can provide the basal ganglia with the information needed to explain the dopaminergic signal. The third article evaluates the explanatory and predictive power of the model in different situations, such as the presence or absence of a stimulus (classical or trace conditioning) while awaiting the reward. In addition to making very interesting predictions related to the interval-timing literature, the article reveals certain shortcomings of the model that will need to be addressed. In short, this thesis extends current models of learning in the basal ganglia and the dopaminergic system to the concurrent development of temporal representations in the cortex and to the interactions between these two structures.
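As context for the temporal-representation problem the thesis tackles, the sketch below uses the classic fixed tapped-delay-line ("complete serial compound") representation of time feeding a TD learner. This is the textbook scheme whose hand-crafted timing the thesis proposes to replace with representations learned by a cortical working-memory network; parameters are illustrative.

```python
import numpy as np

# TD model of classical conditioning with a fixed tapped-delay-line time
# representation: one feature per time step since cue onset.
T, cue_t, reward_t = 20, 2, 12
gamma, alpha = 0.98, 0.1
w = np.zeros(T)                        # one weight per post-cue time step

for trial in range(500):
    x_prev = np.zeros(T)
    for t in range(T):
        x = np.zeros(T)
        if t >= cue_t:
            x[t - cue_t] = 1.0         # delay line started by the cue
        r = 1.0 if t == reward_t else 0.0
        delta = r + gamma * (w @ x) - (w @ x_prev)   # dopamine-like TD error
        w += alpha * delta * x_prev
        x_prev = x

# After learning, the positive TD error has moved from reward time to cue
# onset, mirroring the phasic dopamine signal.
print(np.round(w, 2))
```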
Abstract:
Context: The aberrant processing of salience is thought to be a fundamental factor underlying psychosis. Cannabis can induce acute psychotic symptoms, and its chronic use may increase the risk of schizophrenia. We investigated whether its psychotic effects are mediated through an influence on attentional salience processing. Objective: To examine the effects of Δ9-tetrahydrocannabinol (Δ9-THC) and cannabidiol (CBD) on regional brain function during salience processing. Design: Volunteers were studied using event-related functional magnetic resonance imaging on 3 occasions after administration of Δ9-THC, CBD, or placebo while performing a visual oddball detection paradigm that involved allocation of attention to infrequent (oddball) stimuli within a string of frequent (standard) stimuli. Setting: University center. Participants: Fifteen healthy men with minimal previous cannabis use. Main Outcome Measures: Symptom ratings, task performance, and regional brain activation. Results: During the processing of oddball stimuli, relative to placebo, Δ9-THC attenuated activation in the right caudate but augmented it in the right prefrontal cortex. Δ9-THC also reduced the response latency to standard relative to oddball stimuli. The effect of Δ9-THC in the right caudate was negatively correlated with the severity of the psychotic symptoms it induced and with its effect on response latency. The effects of CBD on task-related activation were in the opposite direction of those of Δ9-THC; relative to placebo, CBD augmented left caudate and hippocampal activation but attenuated right prefrontal activation. Conclusions: Δ9-THC and CBD differentially modulate prefrontal, striatal, and hippocampal function during attentional salience processing. These effects may contribute to the effects of cannabis on psychotic symptoms and on the risk of psychotic disorders.
Abstract:
Economic theory distinguishes two concepts of utility: decision utility, objectively quantifiable by choices, and experienced utility, referring to the satisfaction derived from an obtained outcome. To date, experienced utility has typically been measured with subjective ratings. This study aimed to quantify experienced utility from global levels of neuronal activity. Neuronal activity was measured by means of electroencephalographic (EEG) responses to the gain and omission of graded monetary rewards, at the level of the EEG topography, in human subjects. A novel analysis approach allowed us to approximate psychophysiological value functions for the experienced utility of monetary rewards. In addition, we identified the time windows of the event-related potentials (ERPs), and the respective intracortical sources, in which variations in neuronal activity were significantly related to the value or valence of outcomes. The results indicate that value functions of experienced utility and regret increase disproportionately with monetary value, and thus contradict the compressive value functions of decision utility. The temporal pattern of outcome evaluation suggests an initial (∼250 ms) coarse evaluation of valence, concurrent with a finer-grained evaluation of the value of gained rewards, whereas the evaluation of the value of omitted rewards emerges later. We hypothesize that this temporal double dissociation is explained by reward prediction errors. Finally, a late, previously unreported, reward-sensitive ERP topography (∼500 ms) was identified. The sources of these topographical covariations were estimated in the ventromedial prefrontal cortex, the medial frontal gyrus, the anterior and posterior cingulate cortex, and the hippocampus/amygdala. The results provide important new evidence regarding “how,” “when,” and “where” the brain evaluates outcomes with different hedonic impact.
Abstract:
We study two problems of online learning under restricted information access. In the first problem, prediction with limited advice, we consider a game of prediction with expert advice, where on each round of the game we query the advice of a subset of M out of N experts. We present an algorithm that achieves $O(\sqrt{(N/M)\,T \ln N})$ regret on T rounds of this game. The second problem, the multi-armed bandit with paid observations, is a variant of the adversarial N-armed bandit game, where on round t of the game we can observe the reward of any number of arms, but each observation has a cost c. We present an algorithm that achieves $O((cN \ln N)^{1/3}\,T^{2/3} + \sqrt{T \ln N})$ regret on T rounds of this game in the worst case. Furthermore, we present a number of refinements that treat arm- and time-dependent observation costs and achieve lower regret under benign conditions. We present lower bounds that show that, apart from the logarithmic factors, the worst-case regret bounds cannot be improved.
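For reference, here is a minimal implementation of the standard EXP3 algorithm, the exponential-weights baseline that algorithms for such limited-feedback adversarial bandit settings typically build on. It is the textbook algorithm, not the paper's refined variants, and the demo reward function is an arbitrary assumption.

```python
import numpy as np

def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Textbook EXP3 for the adversarial multi-armed bandit.
    reward_fn(arm, t) must return a reward in [0, 1]."""
    rng = np.random.default_rng(seed)
    w = np.ones(n_arms)
    total = 0.0
    for t in range(horizon):
        p = (1 - gamma) * w / w.sum() + gamma / n_arms  # mix in exploration
        arm = rng.choice(n_arms, p=p)
        r = reward_fn(arm, t)
        total += r
        r_hat = r / p[arm]                   # importance-weighted estimate
        w[arm] *= np.exp(gamma * r_hat / n_arms)
        w /= w.max()                         # rescale to avoid overflow
    return total

# Demo: arm 2 is slightly better on average.
print(exp3(lambda a, t: 0.6 if a == 2 else 0.4, n_arms=5, horizon=5000))
```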
Abstract:
Negative anticipatory contrast (NAC) corresponds to the suppression in consumption of a first rewarding substance (e.g., saccharin 0.15%) when it is followed daily by a second, preferred substance (e.g., sucrose 32%). The NAC has been interpreted as resulting from anticipation of the impending preferred reward and its comparison with the currently available first reward [Flaherty, C.F., Rowan, G.A., 1985. Anticipatory contrast: within-subjects analysis. Anim. Learn. Behav. 13, 2-5]. In this context, one would expect that devaluation of the preferred substance after the establishment of the NAC would either reduce or abolish the contrast effect. However, contrary to this prediction, the results of the present study show that the NAC is insensitive to devaluation of the second, preferred substance, which calls that interpretation into question. The results reported here support the view that the NAC effect is controlled by a memory of the relative value of the first solution, updated daily through gustatory and/or post-ingestive comparison of the first and second solutions, together with memory of past pairings.
Abstract:
Recent modeling of spike-timing-dependent plasticity indicates that plasticity involves, as a third factor, a local dendritic potential in addition to pre- and postsynaptic firing times. We present a simple compartmental neuron model together with a non-Hebbian, biologically plausible learning rule for dendritic synapses in which plasticity is modulated by these three factors. In functional terms, the rule seeks to minimize discrepancies between somatic firings and a local dendritic potential. Such prediction errors can arise in our model from stochastic fluctuations as well as from synaptic input that directly targets the soma. Depending on the nature of this direct input, our plasticity rule subserves supervised or unsupervised learning. When a reward signal modulates the learning rate, reinforcement learning results. Hence a single plasticity rule supports diverse learning paradigms.
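A minimal, rate-based sketch of a three-factor rule in this spirit is given below: the update moves the dendritic prediction toward the somatic outcome, and a reward signal gates the learning rate. The discrete-time setup, target task, and parameters are my illustrative assumptions, not the paper's compartmental model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_syn, eta = 50, 0.05
w = rng.normal(0, 0.1, n_syn)          # dendritic synaptic weights

def phi(v):
    """Map dendritic potential to a predicted somatic firing probability."""
    return 1.0 / (1.0 + np.exp(-v))

for step in range(2000):
    x = (rng.random(n_syn) < 0.2).astype(float)   # presynaptic activity (factor 1)
    v_dend = w @ x                                 # local dendritic potential (factor 3)
    # Direct somatic input acts as a teacher here (supervised mode):
    p_soma = 0.8 if x[:5].sum() >= 2 else 0.1      # illustrative target statistics
    spike = float(rng.random() < p_soma)           # somatic firing (factor 2)
    reward = 1.0           # in reinforcement mode this would gate the rate
    # Reduce the discrepancy between somatic firing and dendritic prediction:
    w += eta * reward * (spike - phi(v_dend)) * x

print(np.round(w[:5], 2))  # synapses driving the "teacher" pattern strengthen
```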
Abstract:
Background: The brain reward circuitry innervated by dopamine is critically disturbed in schizophrenia. This study aims to investigate the role of dopamine-related brain activity during the prediction of monetary reward and loss in first-episode schizophrenia patients. Methods: We measured blood-oxygen-level-dependent (BOLD) activity in 10 patients with schizophrenia (SCH) and 12 healthy controls during dopamine depletion with α-methylparatyrosine (AMPT) and during a placebo condition (PLA). Results: AMPT reduced the activation of striatal and cortical brain regions in SCH. In SCH vs. controls, reduced activation was found in the AMPT condition in several regions during anticipation of reward and loss, including areas of the striatum and frontal cortex. In SCH vs. controls, reduced activation of the superior temporal gyrus and posterior cingulate was observed in PLA during anticipation of rewarding stimuli. Under PLA, patients had reduced activation in the ventral striatum and in frontal and cingulate cortex in anticipation of loss. The findings of reduced dopamine-related brain activity during AMPT were corroborated by reduced levels of dopamine in urine and of homovanillic acid in plasma, and by increased prolactin levels. Conclusions: Our results indicate that dopamine depletion affects the functioning of the cortico-striatal reward circuitry in SCH. The findings also suggest that neuronal functions associated with dopamine neurotransmission and the attribution of salience to reward-predicting stimuli are altered in schizophrenia.
Abstract:
The Appetitive Motivation Scale (Jackson & Smillie, 2004) is a new trait conceptualisation of Gray's (1970, 1991) Behavioural Activation System. In this experiment we explore the relationships that the Appetitive Motivation Scale and other measures of Gray's model have with Approach and Active Avoidance responses. Using a sample of 144 undergraduate students, both Appetitive Motivation and Sensitivity to Reward (from the Sensitivity to Punishment and Sensitivity to Reward Questionnaire, SPSRQ; Torrubia, Avila, Molto, & Caseras, 2001) were found to be significant predictors of Approach and Active Avoidance response latency. This confirms previous experimental validations of the SPSRQ (e.g., Avila, 2001) and provides the first experimental evidence for the validity of the Appetitive Motivation Scale. Consistent with interactive views of Gray's model (e.g., Corr, 2001), high SPSRQ Sensitivity to Punishment diminished the relationship between Sensitivity to Reward and our BAS criteria. Measures of BIS did not, however, interact in this way with the Appetitive Motivation Scale. A surprising result was the failure of any of Carver and White's (1994) BAS scales to correlate with RST criteria. Implications of these findings and potential future directions are discussed.