Model-Based and Model-Free Learning

The distinction between model-free and model-based learning is fundamentally about learning solely from the rewards one receives, without any further knowledge about the context or states in which these rewards are collected, versus developing an elaborate model of the environment and of how the states in that environment are connected. This model can then be searched during planning. Whereas model-free expected values can be learned relatively quickly, the policies associated with these values remain relatively rigid. Model-based state representations, on the other hand, take longer to build, yet once the different possible state transitions are learned, the model can be used to flexibly plan different paths through the environment. Interestingly, these model-based state representations can be learned without any rewards, whereas model-free learning can only occur when rewards are present and a reward prediction error can be computed.
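The contrast can be sketched in a few lines of code. On a hypothetical three-state chain (the states, reward, and learning rates below are illustrative, not taken from any particular experiment), a model-free TD learner caches values directly from sampled rewards, while a model-based learner stores the transition and reward structure and derives values by iterating over that stored model:

```python
import numpy as np

# Hypothetical 3-state chain: s0 -> s1 -> s2; entering s2 pays reward 1.0.
gamma, alpha = 0.9, 0.1
transitions = [(0, 1), (1, 2)]          # experienced (state, next_state) pairs
terminal = 2

# Model-free TD(0): update cached values from experienced rewards only;
# no transition model is ever stored.
V_mf = np.zeros(3)
for _ in range(500):
    for s, s_next in transitions:
        r = 1.0 if s_next == terminal else 0.0
        v_next = 0.0 if s_next == terminal else V_mf[s_next]
        delta = r + gamma * v_next - V_mf[s]    # reward prediction error
        V_mf[s] += alpha * delta

# Model-based: store the transition/reward model, then plan by iterating it.
# Learning the model itself requires no reward, and if the reward is moved,
# re-planning over the model adapts immediately -- the flexibility described above.
model = {0: (1, 0.0), 1: (2, 1.0)}      # s -> (next state, reward on entry)
V_mb = np.zeros(3)
for _ in range(50):
    for s, (s_next, r) in model.items():
        v_next = 0.0 if s_next == terminal else V_mb[s_next]
        V_mb[s] = r + gamma * v_next
```

Both learners converge to the same values here, but only the model-based learner can re-derive new values from its stored model when the environment changes, without relearning from scratch.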

One of the iconic examples of model-based learning is Tolman's (1931) latent learning experiment with rats. Three groups of rats were trained to run through a complex maze from a start box (S) to a goal box (G), in which reward could sometimes be found. The first group of rats never received a reward (no reinforcement), whereas the second group always found a food reward in the goal box (continuous reinforcement). The third group did not receive any reward during the first 11 days of the experiment, but in the latter part (days 12-30) always received a reward (delayed reinforcement, or latent learning). The key measure was the number of errors the rats made while running and exploring the maze. The "no reinforcement" group improved only very little, whereas the "continuous reinforcement" group improved throughout the experiment, but always in small incremental steps. Interestingly, the "delayed reinforcement" group only improved after the reward had been seeded in the goal box, and this improvement was dramatic and exceeded that of the "continuous reinforcement" group. The conclusion was that the latent learning rats must have learned something about the layout of the maze that they were then able to exploit in a goal-directed way once a reward was placed in the goal box. Importantly, this initial latent learning occurred without any motivating rewards.

A few years ago we designed a probabilistic two-step Markov decision-making task that subjects had to solve in order to obtain as much reward as possible. The experimental procedure closely followed that of the original Tolman study. Human subjects undergoing fMRI scanning were first guided through this "maze" without being able to make any decisions of their own. This essentially equated all participants' exposure to the maze, a factor that was uncontrolled in the original Tolman study.

Then rewards were introduced, and participants were told how much each end state was worth. They rehearsed this reward mapping for a few trials. Finally, in the second part of the experiment, subjects were free to choose whichever path they wanted to take through the maze. State transitions were still probabilistic, i.e. subjects could choose the optimal path but still end up in a different, unexpected outcome state.
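A probabilistic transition of this kind can be sketched as a small generative function. The 0.7/0.3 split, the choice labels, and the state names below are illustrative assumptions, not the values used in the actual task:

```python
import random

# Each first-stage choice leads to its "common" second-stage state most of
# the time, and to the rare alternative otherwise (values are illustrative).
COMMON = {"left": "S1", "right": "S2"}
RARE = {"left": "S2", "right": "S1"}

def second_stage(choice, rng, p_common=0.7):
    """Sample the second-stage state reached after a first-stage choice."""
    return COMMON[choice] if rng.random() < p_common else RARE[choice]
```

Under such a scheme, even a subject who always picks the optimal first-stage action still lands in the unexpected outcome state on a minority of trials, which is exactly what makes learning the transition structure worthwhile.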

We developed a computational model that combined a model-free TD learner and a model-based Forward learner into a Hybrid learner, which negotiated between the two learning algorithms through a non-linear decaying weighting function. The Hybrid learner provided the best fit to the behavioral choice data.
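The arbitration between the two learners can be sketched as a trial-dependent weighted mixture. The exponential form and the parameters w0 and k below are illustrative choices, not the fitted values from the study:

```python
import numpy as np

def hybrid_q(q_mf, q_mb, trial, w0=0.8, k=0.05):
    """Sketch of a Hybrid learner's arbitration: the weight on the
    model-based value starts high and decays non-linearly (here
    exponentially) across trials, so control drifts from model-based
    toward model-free. w0 and k are illustrative, not fitted values."""
    w = w0 * np.exp(-k * trial)         # non-linear decaying weight
    return w * q_mb + (1.0 - w) * q_mf
```

Early in the experiment the combined value leans toward the Forward learner's estimate; many trials later it is dominated by the cached TD value.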

The TD and Forward learners give rise to different prediction errors, a reward prediction error and a state prediction error, respectively, which we used in a model-based fMRI analysis. We were able to localize these different error signals in two distinct parts of the human brain: the ventral striatum for model-free reward prediction errors and the intraparietal sulcus (IPS) for model-based state prediction errors. This was the first study in what has now become a fast-moving research theme in decision neuroscience to show that the human brain maintains different prediction error representations and uses them to make more optimal decisions.
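The two error signals can be written down compactly. The reward prediction error compares received and expected reward value; the state prediction error measures how surprising an observed transition is under the current transition model. The learning rate and the decay-then-boost update below are illustrative choices (a common way to keep each row of T a probability distribution), not necessarily the exact update used in the study:

```python
import numpy as np

gamma = 0.9  # illustrative discount factor

def reward_pe(r, v_next, v_curr):
    """Model-free reward prediction error (TD): actual vs. expected value."""
    return r + gamma * v_next - v_curr

def state_pe_and_update(T, s, a, s_next, eta=0.2):
    """Model-based state prediction error under transition model T[s, a, :],
    followed by an update nudging the model toward the observation."""
    spe = 1.0 - T[s, a, s_next]         # large when the transition was unexpected
    T[s, a] *= (1.0 - eta)              # decay all transition probabilities...
    T[s, a, s_next] += eta              # ...and boost the observed one;
    return spe                          # each row still sums to 1
```

Note that the state prediction error needs no reward at all, which is why a transition model can be acquired during reward-free (latent) exploration.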


Prediction Errors during Stimulus Learning and Reward Learning

Learning which events occur in the environment and what reward is associated with them are two fundamental processes that determine the value of a particular environment. This allows you to seek out those events and situations that are highly valuable and predict a lot of reward, and to move through your environment in a goal-directed manner. In this case, learning about the frequency of an event is beneficial for learning about the reward, as the two "go together".

But what if the most frequent event in the environment is not associated with the largest amount or the highest frequency of reward? In this case, learning about the event's occurrence and seeking it out actually weakens the chances of obtaining reward. The stimulus probability is pitted against the reward probability, and the best strategy would be to avoid the stimulus or event altogether.

In cooperation with Klaus Obermayer (TU Berlin) we developed an experimental decision-making task that embodies these two situations. Subjects had to predict on which side of the screen a stimulus would occur. When they chose correctly, and only then, would the stimulus appear with a specific probability at the predicted location. If the stimulus appeared there, the subject would receive a reward, again with a certain probability. In the unbiased condition, the stimulus probability was not informative about reward occurrence (50/50%), so subjects could focus on getting the reward probability right. In the biased condition, however, the stimulus was biased to occur more frequently on one side, whereas the reward was more likely to occur after choosing the other side. This creates a conflict between stimulus and reward probability.
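The biased condition can be sketched as a generative trial function. The probabilities below (stimulus favors the left side at 0.8, reward on the left at only 0.1) are illustrative assumptions chosen to create the described conflict, not the values used in the actual experiment:

```python
import random

def biased_trial(choice, rng, p_stim_left=0.8, p_reward_left=0.1):
    """One trial of a sketch of the biased condition: the stimulus favors
    the left side while reward favors the right side (illustrative values)."""
    stim_side = "left" if rng.random() < p_stim_left else "right"
    if choice != stim_side:
        return False                     # stimulus not at predicted location
    p_reward = p_reward_left if stim_side == "left" else 1.0 - p_reward_left
    return rng.random() < p_reward       # reward is itself probabilistic
```

Under these illustrative numbers, always predicting the frequent left side yields roughly 0.8 x 0.1 = 8% reward per trial, whereas predicting the rare right side yields roughly 0.2 x 0.9 = 18%: chasing the stimulus weakens the chances of reward, exactly the conflict described above.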

Model-based fMRI analyses revealed that both the stimulus prediction error (SPE) and the classical reward prediction error (RPE) co-localized to the ventral striatum, but that the strength of these signals depended greatly on the experimental condition: in the unbiased condition, when the stimulus was uninformative about the reward location, the RPE evoked the stronger response, whereas in the biased condition, when the stimulus location was misleading with respect to the reward location, the SPE evoked the stronger response in the ventral striatum. This demonstrates that prediction error representations are highly dynamic and influenced by the experimental context.