Predictive coding is the proposition that many neural functions come down to
the minimization of prediction error on sensory information. Predictive coding
models generally posit hierarchies of bidirectional prediction layers (Rao and
Ballard, 1999; Clark, 2013; Lee and Mumford, 2003; Friston, 2005), forming
internal models which can encode and predict high-level features. Predictive
mechanisms are also suspected to play a part in action and decision making
(Kilner et al., 2007; Friston, 2005; Friston et al., 2006; Finn et al., 2016), as well
as understanding of causality and intent (which would be required for effective
long-term predictions) (Kilner et al., 2007). A notable feature of predictive
coding is that it posits that the activations of particular neurons correspond to
prediction errors (Srinivasan et al., 1982; Rao and Ballard, 1999).
This concept inspired the PredNet architecture, a convolutional long short-
term memory (LSTM) artificial neural network designed to perform unsuper-
vised learning by predicting the next frame(s) of a video based on previous ones
(Lotter et al., 2016). The features output at each layer represent the errors in
prediction at that layer. PredNet was shown to be capable of encoding object
identity, motion, and rotation in rendered faces (Lotter et al., 2016; Lotter et al., 2015).
As PredNet is based on models of the visual pathway, we expect there
to be similarity between PredNet's features and brain activity in visual areas.
The aim of this study is to test the validity of PredNet as a model of predictive
coding in the visual stream by comparing the error signals observed in PredNet
to fMRI data recorded in individuals watching natural video. If PredNet's
error responses to the same video can be used to predict fMRI responses in
particular visual areas with substantially better accuracy than the raw pixels
of the video can, this would suggest that the predictive principles used in
constructing the network have some explanatory power for human visual processing. Furthermore, an
inability to predict voxelwise fMRI responses with PredNet’s features would
indicate that at least some aspects of PredNet’s architecture are unsuited for
modeling the behavior of neurons in the analyzed visual areas. This comparison
is done with an encoding model framework (Naselaris et al., 2011). Encoding
models are built using PredNet for feature extraction from the video stimuli,
followed by linear regression to build a linear relationship between network
features (prediction errors) at each layer and the responses of individual voxels.
This is to be compared to similar encoding models built on the raw pixels of the
stimulus video, PredNet’s layer-wise predictions of said video, and PredNet’s
layer-wise representations of the causes behind the stimulus.
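As a rough sketch of the encoding-model step, assuming features have already been extracted from the stimulus (all shapes and variable names here are hypothetical, and synthetic data stands in for real recordings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T stimulus frames, F network features, V voxels.
T, F, V = 200, 50, 10
features = rng.standard_normal((T, F))         # e.g. PredNet error units per frame
weights_true = rng.standard_normal((F, V))     # synthetic ground-truth mapping
voxels = features @ weights_true + 0.1 * rng.standard_normal((T, V))

# Ridge regression in closed form: W = (X'X + aI)^-1 X'Y, fit jointly per voxel.
alpha = 1.0
XtX = features.T @ features + alpha * np.eye(F)
W = np.linalg.solve(XtX, features.T @ voxels)

# Evaluate with the voxelwise correlation between predicted and observed responses,
# the usual accuracy measure in encoding-model studies.
pred = features @ W
r = [np.corrcoef(pred[:, v], voxels[:, v])[0, 1] for v in range(V)]
print(np.round(np.mean(r), 2))
```

The same fitting and scoring procedure is then repeated with each competing feature space (raw pixels, predictions, representations), so that only the features, not the regression machinery, differ between models.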
Investigations into predictive coding are important due to the broad scope
of the proposition's implications. Computer vision, unsupervised learning,
and decision making in robotics and AI may all benefit from the implementation
of biologically inspired predictive mechanisms. Comparing predictive coding
models to observed brain activity is of special importance for understanding
the human brain, especially if the predictive coding hypothesis proves to be
as unifying as some proponents claim (Friston et al., 2016; Friston et al., 2006;
Kilner et al., 2007; Roskies, 2016).
1.1 Predictive coding
Predictive coding in neuroscience refers to the proposition that the brain is
fundamentally a prediction machine, that it formulates models of the world
around it which predict sensory inputs. The brain modifies these models to
reduce prediction error on sensory inputs (Friston, 2005; Rao and Ballard, 1999;
Lee and Mumford, 2003). It is thought by some proponents that minimizing
prediction error is the primary objective of the brain (Friston, 2005; Clark, 2013).
The idea of the brain as a prediction machine is old, dating back at least to
Helmholtz (von Helmholtz, 1860; Clark, 2013). In his Treatise on Physiological
Optics (von Helmholtz, 1860) he describes, among other things, some
predictive behaviors of the brain, namely the subconscious inference of objects
or events. Recently, interest in the idea has increased, and various mechanisms
have been proposed for how predictive models are formed, evaluated, and corrected.
1.1.1 Center-surround inhibition and retinal predictive coding
The first usage of the term predictive coding in neuroscience was in a 1982
paper by Srinivasan et al. titled 'Predictive coding: a fresh view of inhibition
in the retina' (Srinivasan et al., 1982). In this paper, predictive coding is
the name given to a model intended to describe center-surround antagonism
in retinal ganglion cells and retinal bipolar cells. Center-surround antagonism
is a longer-standing understanding of retinal processing (Barlow, 1961; Barlow
et al., 1957; Werblin, 1971), where interneurons in the retina are excited by
the sensory neurons in the center of their receptive field, but inhibited by sen-
sory neurons on the periphery of their receptive field (or vice versa). These
inhibitory connections also have an appreciable time delay (Ratliff et al., 1963;
Srinivasan et al., 1982), which allows them to encode temporal structure and
behave predictively. Later studies showed that similar mechanisms are at play
in the lateral geniculate nucleus (Dong and Atick, 1995; Dan et al., 1996).
Srinivasan et al. argue that this is equivalent to using a weighted mean of the
previous values of surrounding sensory neurons to predict the present value at
the center, passing only an error signal down the neural chain. A similar
mathematical operation was already in use in some video compression techniques
(Srinivasan et al., 1982; Oliver, 1952; Harrison, 1952; Shi and Sun, 1999).
Previously proposed cognitive benefits of center-surround inhibition included
redundancy reduction (Barlow, 1961), response range reduction (Barlow and Levick,
1976; Laughlin and Hardie, 1978), deblurring (Marčelja, 1979) and edge-finding (Ratliff, 1965).
The passing of predictive information was proposed as another. The receptive
field derived quantitatively from the predictive coding hypothesis resembled
that of X-type retinal ganglion cells.
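A toy illustration of this scheme, using a plain average of the two spatial neighbours as a crude stand-in for the weighted surround prediction (the signal and weighting here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D "retina": a spatially correlated input, modelled as a random walk.
n = 100
signal = np.cumsum(rng.standard_normal(n))

# Predict each unit's value from the mean of its two neighbours, as a crude
# stand-in for the weighted surround average of Srinivasan et al. (1982).
prediction = 0.5 * (signal[:-2] + signal[2:])
error = signal[1:-1] - prediction  # only this residual is transmitted onward

# The error channel needs a far smaller dynamic range than the raw input,
# which is the proposed benefit of predictive coding in the retina.
print(error.std() < signal.std())
```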
However, the retinal predictive coding model does not account for most
extra-classical effects, and so does not appear to generalize well beyond the retina.
An example is end-stopping, found in the visual cortex: neurons are excited
by a line in their classical receptive field until that line extends beyond the
classical receptive field, at which point their excitation drops substantially.
This shows neurons being influenced by sensory information
classically considered to be outside their receptive field. End-stopping has been
observed in a number of cortical areas, including V1, V2, V4 and MT (Rao
and Ballard, 1999; Hubel and Wiesel, 1965; Hubel and Wiesel, 1968; Bolz and
Gilbert, 1986; Hubel and Livingstone, 1987; Desimone and Schein, 1987; Allman
et al., 1985).
1.1.2 Rao-Ballard model of predictive coding in the visual cortex
Rao and Ballard proposed a more complex and ambitious prediction scheme in
1999 (Rao and Ballard, 1999). Their model is intended to account for effects
such as end-stopping by incorporating hierarchies of prediction and feedback
connections from ‘higher’ (i.e. further from the sensory neurons) layers to ‘lower’
ones (although this lower-higher distinction is not considered clear cut (Rauss
and Pourtois, 2013)). Each layer predicts the activations of the previous layer.
Feedback connections from higher layers to lower ones carry a prediction of
the lower level’s error. The feed-forward connection then carries the difference
between the predicted error and the observed error. Since higher layers have
larger receptive fields, feedback connections from higher levels to lower ones
have the potential to explain extra-classical effects such as end-stopping.
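A minimal sketch of such a hierarchy, with hypothetical dimensions and randomly initialized weights; the updates below are plain gradient descent on the summed squared prediction errors, not the exact Rao-Ballard update equations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-level hierarchy: level 1 infers a code r1 whose generative prediction
# U1 @ r1 should match the input x; level 2 predicts r1 via U2 @ r2.
# Feedback carries predictions; feedforward carries the residual errors.
U1 = 0.3 * rng.standard_normal((16, 8))   # level-1 generative weights
U2 = 0.3 * rng.standard_normal((8, 4))    # level-2 generative weights
x = rng.standard_normal(16)               # sensory input
r1 = np.zeros(8)
r2 = np.zeros(4)

lr = 0.05
for _ in range(200):
    e0 = x - U1 @ r1    # residual at the input level
    e1 = r1 - U2 @ r2   # residual between level-1 state and its top-down prediction
    r1 += lr * (U1.T @ e0 - e1)   # driven by the error below, constrained from above
    r2 += lr * (U2.T @ e1)

# The inferred code should reconstruct the input better than the zero prior.
print(np.linalg.norm(x - U1 @ r1) < np.linalg.norm(x))
```

Because the higher level constrains the lower level's state, the same mechanism gives larger effective receptive fields at higher levels, which is what lets the full model reproduce extra-classical effects.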
The model, once constructed and trained on natural images, did indeed
exhibit end-stopping similar to that observed in the visual cortex of a macaque
(Rao and Ballard, 1999; Zipser et al., 1996; Knierim and van Essen, 1992).
Additionally, cutting off the feedback connections in the model made it cease
to display end-stopping in a very similar manner to what is observed in the
visual cortex of the macaque when higher cortical areas are inactivated (Rao
and Ballard, 1999).
Rao and Ballard relate the propagation of predictions to an understanding of
the causal structure of the stimulus (Rao and Ballard, 1999). Since then, many
more potential properties of predictive coding have been suggested. Some arti-
cles relate predictive coding to understanding of the intentions of other agents
in the environment (Kilner et al., 2007; Friston, 2005; Friston et al., 2006). A
number of researchers argue that predictive coding is also the driving mech-
anism behind action (Kilner et al., 2007; Friston et al., 2011; Friston et al.,
2016). Another variation on the concept of predictive coding is the proposition
that the brain maintains multiple hypotheses about the nature of the outside
world, which are selected between based on their error (Hohwy et al., 2008; Tong et al.,
2006). Evidence for predictive coding in various areas of the brain has become
increasingly plentiful (Hesselmann et al., 2010; Wacongne et al., 2012;
Summerfield et al., 2006; Egner, 2010; Spratling, 2012). Nonetheless, there is
currently no accepted general model of visual predictive coding.
The potential applications of predictive coding principles to the field of video
prediction are clear. Multiscale predictions encoding higher-level features can
allow for more accurate predictions, as well as predictions over longer timescales
(Mathieu et al., 2015; Kalchbrenner et al., 2016).
PredNet uses this concept to predict future frames in a video based on prior
frames. PredNet’s predictive characteristics and performance suggest that it
may offer a model of visual predictive coding.
In PredNet, each prediction layer applies four convolutional filters (the four
gates of a convolutional LSTM) to an input consisting of the upsampled
representation output by the previous layer together with the layer's own LSTM
states. One combination of the filter outputs is stored in the LSTM cell (to be
used on the next iteration); another combination is also stored, then upsampled
and provided to the next layer (this is the upsampled representation of the
image features).
The upsampled representations then have another convolution applied to
them, the output of which is the predicted image for that layer. For the lowest
layer this is a prediction of the frame, for higher layers it is a prediction of the
prediction errors of the layer below. These predictions are then compared to
the actual values; the resulting error signals are stored as part of the LSTM
state. The error signals are split into two sets of positive and negative errors,
which is intended to reflect the existence of off-center and on-center neurons
in the visual cortex (Lotter et al., 2016).
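The splitting of errors into rectified positive and negative channels can be sketched as follows (the array values are hypothetical):

```python
import numpy as np

# PredNet splits each prediction error into separate rectified channels,
# mirroring on-center and off-center response types.
target = np.array([0.2, 0.8, 0.5])       # actual values at this layer
prediction = np.array([0.5, 0.6, 0.5])   # the layer's prediction of them

err_pos = np.maximum(target - prediction, 0.0)  # target exceeds prediction
err_neg = np.maximum(prediction - target, 0.0)  # prediction exceeds target
error = np.concatenate([err_pos, err_neg])      # the layer's error representation

print(error)
```

Note that both channels are non-negative, so a single error unit never has to signal the sign of its error, only its magnitude in one direction.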