Using Fast Weights to Deblur Old Memories

Geoffrey E. Hinton and David C. Plaut
Computer Science Department
Carnegie-Mellon University

Abstract

Connectionist models usually have a single weight on each connection. Some interesting new properties emerge if each connection has two weights: a slowly changing, plastic weight which stores long-term knowledge, and a fast-changing, elastic weight which stores temporary knowledge and spontaneously decays towards zero. If a network learns a set of associations and these associations are then "blurred" by subsequent learning, all of the original associations can be "deblurred" by rehearsing on just a few of them. The rehearsal allows the fast weights to take on values that temporarily cancel out the changes in the slow weights caused by the subsequent learning.

INTRODUCTION

Most connectionist models have assumed that each connection has a single weight which is adjusted during the course of learning. Despite the emerging biological evidence that changes in synaptic efficacy at a single synapse occur at many different time-scales (Kupferman, 1979; Hartzell, 1981), there have been relatively few attempts to investigate the computational advantages of giving each connection several different weights that change at different speeds. Even for phenomena like short-term memory, where fast-changing weights might seem appropriate, connectionist models have typically used the activation levels of units rather than the weights to store temporary memories (Little and Shaw, 1975; Touretzky and Hinton, 1985). We know very little about the range of potential uses of fast weights. How do they alter the way networks behave, and what extra computational properties do they provide?

In this paper we assume that each connection has both a fast, elastic weight and a slow, plastic weight. The slow weights are like the weights normally used in connectionist networks: they change slowly and they hold all the long-term knowledge of the network. The fast weights change more rapidly and they continually regress towards zero, so that their magnitude is determined solely by their recent past. The effective weight on the connection is the sum of these two. At any instant, we can think of the system's knowledge as consisting of the slow weights with a temporary overlay of fast weights. The overlay gives a temporary context: a temporary associative memory that allows networks to do more flexible information processing. Many ways of using this temporary memory have been suggested.

1. It can be used for rapid temporary learning. When presented with a new association, the network can store it in one trial, provided the storage only needs to be temporary.

2. It can be used for creating temporary bindings between features. Recent work by von der Malsburg (1981) and Feldman (1982) has shown that fast-changing weights can be used to dynamically bind together a number of different properties into a coherent whole, or to discover approximate homomorphisms between two structured domains.

3. It can be used to allow truly recursive processing. During execution of a procedure, the values of local variables and the stage reached in the procedure (the program counter) can be stored in the fast weights. This allows the procedure to call a subprocedure whose execution involves different patterns of activity in the very same units as the calling procedure.
When the subprocedure has finished using the units, the state of the calling procedure can be reconstructed from the temporary associative memory. In this way the state does not need to be stored in the activity of the units, so the very same units can be used for running the subprocedure. A working simulation of this kind is described briefly in McClelland & Kawamoto (1986).

4. It can be used to implement a learning method called "shortest descent", which is a way of minimizing the amount of interference caused by new learning. Shortest descent will be described in a separate paper and is not discussed further here.

In this paper we describe a novel use for a temporary memory stored in the fast weights: it can be used for cancelling out the interference in a set of old associations caused by more recent learning. Consider a network which has slowly and painfully learned a set of associations in its slow weights. If this network is then taught a new set of associations without any more rehearsal of the old associations, there is normally some degradation of the old associations. We show that by using fast weights it is possible to quickly restore a whole set of old associations by rehearsing on just a subset of them. The fast weights cancel out the changes in the slow weights that have occurred since the old associations were learned, so the combination of the current slow weights and the fast weights approximates the earlier slow weights. The fast weights therefore create a context in which the old associations are present again. When the fast weights decay away, the new associations return.

A DEBLURRING ANALOGY

There is an analogy between the use of fast weights to recover unretrained associations and a technique called "deblurring" which is sometimes used for cleaning up blurred images. Suppose that you are in your office and you want to take some photographs of what is on your computer screen. You set up a tripod in front of the screen, carefully focus the camera, and take one photograph. Unfortunately, before you can take any more, your office mate moves the camera to a tripod in front of his screen and refocuses it. You now move the camera back to your tripod and it seems as if you must refocus, but there is an interesting alternative. You take another photograph of the same screen without refocussing and you compare it with your first photograph. The first photo is the "desired output" of the process that maps from screens to photos, and the difference between it and the "actual output" achieved with the out-of-focus camera can be used to estimate a deblurring operator which can be applied to the actual output to convert it into the desired output. This same deblurring operator can then be applied to any image taken with the out-of-focus camera.

Focussing the camera is analogous to learning the slow weights that are required to map from input vectors (screens) to output vectors (photos). Estimating the deblurring operator is analogous to learning the fast weights that are required to compensate for the noise that has been added to the slow weights since the original learning (see figure 1).
[Figure 1: An illustration of the deblurring analogy between (a) images and (b) networks. The laws of projection (P) determine the mapping from the scene to the image. After a blurring function (B) has been applied, applying the inverse of B to the blurred image restores the original image. Similarly, the weights (W) define a mapping from input to output. After noise (N) is added, determining and applying the inverse of N allows the network to produce the original output.]

The advantage of using fast weights rather than slow ones for deblurring is that it does not permanently interfere with the new associations. As soon as the fast weights have decayed back to zero, the new knowledge is restored.

THE LEARNING PROCEDURE

We used the back-propagation learning procedure (Rumelhart et al., 1986a; 1986b) to investigate the properties of networks that learn with both fast and slow weights. We summarize the procedure below; the full mathematical details can be found in the above references.

The procedure operates on layered, feed-forward networks of deterministic, neuron-like units. These networks have a layer of input units at the bottom, any number of intermediate layers of hidden units, and a layer of output units at the top. Connections are allowed only from lower to higher layers. The aim of the learning procedure is to find a set of weights on the connections such that, when the network is presented with each of a set of input vectors, the output vector produced by the network is sufficiently close to the corresponding desired output vector. The error produced by the network is defined to be the squared difference between the actual and desired output vectors, summed over all input-output cases. The learning procedure minimizes this error by performing gradient descent in weight space. This requires adjusting each weight in the network in proportion to the partial derivative of the error with respect to that weight. These error derivatives are calculated in two passes.

The forward pass determines the output (or state) of each unit in the network. An input vector is presented to the network by setting the states of the input units. Layers are then processed sequentially, working from the bottom upward, with the states of units within a layer set in parallel. The total input to a unit is the scalar product of the states of the units in lower layers from which it receives connections and the weights on these connections. The output of a unit is a real-valued, smooth, non-linear function of its total input. The forward pass ends when the states of the output units are set.

The backward pass starts with the output units at the top of the network and works downward through each successive layer, "back-propagating" error derivatives to each weight in the network. For each layer, computing the error derivatives of the weights on incoming connections involves first computing the error derivatives with respect to the outputs, and then with respect to the inputs, of the units in that layer. The simple form of the unit input-output function makes these computations straightforward. The backward pass is complete when the derivative of the error with respect to each weight in the network has been determined.

The simplest version of gradient descent is to change each weight by an amount proportional to its error derivative, using a proportionality constant $\epsilon$. A more complicated version that usually converges faster is to also add to the current weight change a proportion, $\alpha$, of the previous weight change. This is analogous to introducing momentum to movement in weight space.

Since the effective weight on a connection is the sum of the fast and slow weights, these weights experience the same error derivative. Hence they behave differently only because they are modified using different weight change parameters $\epsilon$ and $\alpha$, and because the fast weights decay towards zero. This decay is achieved by reducing the magnitude of each fast weight by some fraction, $h$, after each weight change.
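The combined update can be sketched in a few lines of Python. This is our own illustrative code, not the original simulator, and the class name, parameter names and default values (eps_slow, eps_fast, alpha_slow, alpha_fast, decay) are assumptions; in practice the fast weights would use a larger learning rate than the slow ones.

```python
import numpy as np

class FastSlowWeights:
    """One weight matrix stored as the sum of a slow, plastic part and a
    fast, elastic part that decays towards zero (a sketch, not the paper's
    original simulator; all parameter values are illustrative)."""

    def __init__(self, n_in, n_out, eps_slow=0.02, eps_fast=0.5,
                 alpha_slow=0.9, alpha_fast=0.0, decay=0.01):
        self.slow = np.zeros((n_in, n_out))    # long-term knowledge
        self.fast = np.zeros((n_in, n_out))    # temporary overlay
        self.d_slow = np.zeros((n_in, n_out))  # previous slow-weight change
        self.d_fast = np.zeros((n_in, n_out))  # previous fast-weight change
        self.eps_slow, self.eps_fast = eps_slow, eps_fast
        self.alpha_slow, self.alpha_fast = alpha_slow, alpha_fast
        self.decay = decay                     # fraction h lost per update

    def effective(self):
        # The network always computes with the sum of the two weights.
        return self.slow + self.fast

    def update(self, dE_dw):
        # Both weights see the same error derivative, but are changed with
        # different learning rates and momentum, and only the fast weights decay.
        self.d_slow = -self.eps_slow * dE_dw + self.alpha_slow * self.d_slow
        self.d_fast = -self.eps_fast * dE_dw + self.alpha_fast * self.d_fast
        self.slow += self.d_slow
        self.fast += self.d_fast
        self.fast *= 1.0 - self.decay          # regress towards zero

# Minimal usage: one gradient step on a 3-input, 2-output weight matrix.
w = FastSlowWeights(3, 2)
w.update(np.ones((3, 2)))
print(w.effective())
```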
A SIMULATION OF THE DEBLURRING EFFECT

To demonstrate the deblurring effect, we used a simple task and a simple network. The task was to associate random binary 10-bit input vectors with random binary 10-bit output vectors. We selected 100 input vectors at random (without replacement) from the set of all 2^10 binary vectors of length 10, and for each input vector we chose a random output vector.

The network we used had three layers: 10 input units, 100 hidden units, and 10 output units. Each input unit was connected to all the hidden units, and each hidden unit was connected to all the output units. The hidden and output units also had variable biases that were modified during the learning.

We trained the network by repeatedly sweeping through the whole set of 100 associations and changing the weights after each sweep. After prolonged training (1300 sweeps with $\epsilon = 0.02$ and $\alpha = 0.9$) the network knew the associations perfectly, and all the knowledge was in the slow weights. The fast weights were very close to zero, because the errors were very small towards the end of the training, so the tiny error derivatives were dominated by the tendency of the fast weights to decay towards zero by 1% after each weight update.

Once the 100 associations were learned, we trained the network on 5 new random associations without further rehearsal on the original 100. Again, we continued the training until the new knowledge was in the slow weights (400 sweeps). We then retrained the network on only a subset of the original associations, and compared the improvement in performance on this retrained subset with the (incidental) improvement in performance on the rest of the associations. The main result, shown in figure 2, was that in the early stages of retraining the improvements in the associations that were not retrained were very nearly as large as the improvements in the associations that were explicitly retrained. This rather surprising result held even if only 10% of the old associations were retrained.

[Figure 2: Performance on retrained and unretrained subsets of 100 input-output cases after (1) learning all 100 cases, and (2) interfering with the weights by learning 5 unrelated cases. The average error of an output unit is shown for the first 20 sweeps through the retrained subset.]

The reason for this effect is that the knowledge about each association is distributed over many different connections, and when the network is retrained on some of the associations, all the weights are pushed back towards the values that they used to have when the associations were first learned. So there is improvement in the unretrained associations even though they are only randomly related to the retrained ones. Of course, if the retrained and unretrained associations share some regularities, the transfer will be even better.
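Below is a minimal numpy sketch of this experimental protocol. To keep it short it replaces the 10-100-10 back-propagation network with a single linear output layer whose original learning is done by a direct least-squares solve, and only the fast weights are adapted during retraining; all sizes, learning rates and the decay value are our own illustrative choices, so the numbers it prints will not match figure 2, but the error on the unretrained associations should typically fall along with the error on the retrained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions chosen so a single linear output layer can store the associations;
# the paper's network was 10-100-10 with non-linear hidden units.
n_in, n_out = 100, 10
X_old = rng.integers(0, 2, (100, n_in)).astype(float)   # 100 old associations
Y_old = rng.integers(0, 2, (100, n_out)).astype(float)
X_new = rng.integers(0, 2, (5, n_in)).astype(float)      # 5 interfering associations
Y_new = rng.integers(0, 2, (5, n_out)).astype(float)

def mean_error(X, Y, W):
    return np.mean((X @ W - Y) ** 2)

# 1. "Original learning": put a solution for the old associations into the slow
#    weights (a direct solve stands in for prolonged gradient descent).
slow = np.linalg.lstsq(X_old, Y_old, rcond=None)[0]
fast = np.zeros_like(slow)

# 2. Interference: learn the 5 new associations with the slow weights only.
for _ in range(400):
    slow -= 0.02 * X_new.T @ (X_new @ slow - Y_new) / len(X_new)

# 3. Deblurring: retrain only the fast weights on 10 of the old associations
#    (in the paper both sets of weights keep learning, the fast ones faster).
retrained, unretrained = slice(0, 10), slice(10, 100)
print("unretrained error before:",
      mean_error(X_old[unretrained], Y_old[unretrained], slow + fast))
for _ in range(20):
    grad = X_old[retrained].T @ (X_old[retrained] @ (slow + fast)
                                 - Y_old[retrained]) / 10
    fast -= 0.04 * grad
    fast *= 0.99                                  # fast weights decay towards zero
print("unretrained error after: ",
      mean_error(X_old[unretrained], Y_old[unretrained], slow + fast))
```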
A Geometric Explanation of the Transfer Effect

Consider the very simple network shown in figure 3. There are two input units which are directly connected to a single, linear output unit. The network is trained on two different associations, each of which maps a two-component input vector into a one-component output vector. For each of the two input vectors, there will be many different combinations of the weights $w_1$ and $w_2$ that give the desired output vector, and since the output unit is linear, these combinations form straight lines in weight space, as shown in figure 4. In general, the only combination of weights that satisfies both associations lies at the intersection of the two lines. So in this simple network, a gradient descent learning procedure will converge on the intersection point.

[Figure 3: A simple network that learns to associate $(x_1, x_2)$ with $y$ and $(x_1', x_2')$ with $y'$.]

[Figure 4: A plot of weight space for the simple network in figure 3.]

Now, suppose we add to a network that has learned the associations a noise vector of length $r$ whose orientation is selected at random from a uniform distribution over all possible orientations. The new combination of weights will be uniformly distributed over the circle shown in figure 4. If we now retrain an infinitesimal amount on association 1, we will move perpendicularly towards line 1. If we start from a point such as P that lies on one of the larger arcs of the circle, the movement towards line 1 will have a component towards line 2, whereas if we start from a point on one of the smaller arcs, the movement will have a component away from line 2. Thus, when we retrain a small amount on association 1, we are more likely than not to improve the performance on association 2, even though the associations are only randomly related to each other.

The expected positive transfer between retraining on one association and performance on the other can only be a result of starting the retraining from a point in weight space that has some special relationship to the solution point, O. We initially thought that the starting point for the retraining needed to be near the solution point, but the geometrical argument we have just given is independent of the magnitude, $r$, of the noise vector. The crucial property of the starting point is that it is selected at random from a distribution of points that are all the same distance from O, and selecting starting points in this way induces a bias towards starting points that cause positive transfer between the two associations that define the point O.

The positive transfer effect obviously disappears if the two associations are orthogonal, and so we might expect the effect to be rather small when the network is large, because randomly related high-dimensional vectors tend to be almost orthogonal. In fact, it is important to distinguish between two qualitatively different cases. If the activity levels of the input units have a mean of zero, most pairs of high-dimensional vectors will be approximately orthogonal, so the effect is small and the expected transfer from one retrained association to one unretrained one gets smaller as the dimensionality increases. If, however, the activity levels of the input units range between 0 and 1 (as in the simulation described above), random high-dimensional vectors are not close to orthogonal, and we show in the next section that the expected transfer from one retrained association to one unretrained one is independent of the dimensionality of the vectors.
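The two-dimensional argument is easy to check by simulation. The sketch below is our own code with arbitrary parameter values: it estimates the probability that an infinitesimal retraining step on association 1 also reduces the error on association 2, for noise vectors of several different lengths $r$. The estimates should all be close to the same value (roughly 0.75 for random input directions), regardless of $r$.

```python
import numpy as np

rng = np.random.default_rng(1)

def positive_transfer_prob(r, trials):
    """Probability that an infinitesimal retraining step on association 1
    also reduces the error on association 2, for a 2-weight linear unit."""
    count = 0
    for _ in range(trials):
        a1 = rng.normal(size=2)            # input vector of association 1
        a2 = rng.normal(size=2)            # input vector of association 2
        O = rng.normal(size=2)             # weights satisfying both (the intersection)
        y1, y2 = a1 @ O, a2 @ O            # targets consistent with O
        phi = rng.uniform(0, 2 * np.pi)    # noise direction, uniform over the circle
        w = O + r * np.array([np.cos(phi), np.sin(phi)])
        dw = -a1 * (a1 @ w - y1)           # gradient step direction for association 1
        err2_before = (a2 @ w - y2) ** 2
        err2_after = (a2 @ (w + 1e-6 * dw) - y2) ** 2
        count += err2_after < err2_before
    return count / trials

for r in (0.1, 1.0, 10.0):
    print(r, positive_transfer_prob(r, trials=20_000))
```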
A MATHEMATICAL ANALYSIS

It is hard to analyze the expected size of the transfer effect for networks with multiple layers of non-linear units, or for tasks with complex regularities among the input-output associations. However, an analysis of the transfer effect in simple networks learning random associations may provide a useful guide to what can be expected in more complex cases. We therefore derive the magnitude of the expected transfer from one retrained association, $A$, to one unretrained association, $B$, in a network that has no hidden units and only one linear output unit. (Footnote: For layered, feed-forward networks with no hidden units, each output unit has its own separate set of weights, so a network with many output units can be viewed as composed of many separate networks, each of which has a single output unit.)

We assume that the network has previously learned to produce the correct activity level in the output unit for two input vectors, $\mathbf{a}$ and $\mathbf{b}$, and that it is now retraining on association $A$ after independent gaussian noise has been added to each weight. The noise added to the $i$th weight is $g_i$ and is chosen from a distribution with mean 0 and standard deviation $\sigma$. We also assume that the retraining involves changing the weights by a fixed, infinitesimal amount in the direction of steepest descent in the error function $\frac{1}{2}(y_A - d_A)^2$, where $y_A$ is the actual output of the output unit and $d_A$ is the desired output that was achieved before the noise was added.

A good measure of the magnitude of the transfer effect is the ratio of two improvements: the improvement in association $B$ caused by retraining on $A$, and the improvement in association $B$ that would have been caused by retraining directly on $B$. If the retraining involves changing the weight vector by a fixed amount in the direction of steepest descent, this ratio is simply the cosine of the angle between the direction of steepest descent for association $A$ and the direction of steepest descent for association $B$. (Footnote: If the retraining involves multiplying the gradient by a coefficient, $\epsilon$, to determine the weight change, the actual ratio may differ because the magnitudes of the gradients may differ for $A$ and $B$. But this will not affect the expected ratio, so the analysis we give for the expected transfer is still valid.)

If the original learning was perfect, all of the error in the output would be caused by the noise added to the weights, so

\[
\frac{\partial E_A}{\partial y_A} = y_A - d_A = \sum_i a_i g_i
\]

where $a_i$ is the activity level of the $i$th input unit in association $A$.
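As a quick numerical check of this identity, here is a small snippet of our own (all values are made up): for a linear unit whose original weights fit association $A$ exactly, the output error after adding noise $\mathbf{g}$ to the weights equals the inner product of $\mathbf{a}$ and $\mathbf{g}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
a = rng.integers(0, 2, n).astype(float)    # input vector for association A
w_star = rng.normal(size=n)                # weights after perfect original learning
d_A = a @ w_star                           # desired output, achieved before the noise
g = rng.normal(0.0, 0.3, n)                # gaussian noise added to each weight
y_A = a @ (w_star + g)                     # actual output with noisy weights

print(y_A - d_A, a @ g)                    # the two values agree
```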
So, for a weight $w_i$, the derivatives of the error $E_A$ for association $A$ and $E_B$ for association $B$ are given by

\[
\frac{\partial E_A}{\partial w_i} = \frac{\partial y_A}{\partial w_i}\,\frac{\partial E_A}{\partial y_A} = a_i \sum_j a_j g_j ,
\qquad
\frac{\partial E_B}{\partial w_i} = \frac{\partial y_B}{\partial w_i}\,\frac{\partial E_B}{\partial y_B} = b_i \sum_j b_j g_j .
\]

Hence, the cosine of the angle, $\theta$, between the directions of steepest descent for the two associations is given by

\[
\cos\theta
= \frac{\sum_i a_i b_i \;\sum_i a_i g_i \;\sum_i b_i g_i}
       {\Bigl[\sum_i a_i^2 \,\bigl(\sum_i a_i g_i\bigr)^2 \;\sum_i b_i^2 \,\bigl(\sum_i b_i g_i\bigr)^2\Bigr]^{1/2}}
= \frac{\sum_i a_i b_i}{\bigl(\sum_i a_i^2 \,\sum_i b_i^2\bigr)^{1/2}}
  \cdot
  \frac{\sum_i a_i g_i \;\sum_i b_i g_i}{\bigl|\sum_i a_i g_i \;\sum_i b_i g_i\bigr|}
\tag{1}
\]

The Zero-One Case

The first part of equation (1) is simply the cosine of the angle between the two input vectors, and so it is independent of the weights. If the components of $\mathbf{a}$ and $\mathbf{b}$ are all 0 or 1 it can be written as

\[
\frac{n_{ab}}{(n_a n_b)^{1/2}}
\]

where $n_a$ is the number of components that have value 1 in the vector $\mathbf{a}$, and $n_{ab}$ is the number that have value 1 in both $\mathbf{a}$ and $\mathbf{b}$.

The second part of equation (1) depends on the weights. It always has a value of 1 or $-1$, depending on whether the output errors $\sum_i a_i g_i$ and $\sum_i b_i g_i$ have the same or opposite signs. Its numerator can be written as

\[
\Bigl(\sum_{i \in S_{ab}} g_i \;+\; \sum_{i \in S_{a\bar b}} g_i\Bigr)
\Bigl(\sum_{i \in S_{ab}} g_i \;+\; \sum_{i \in S_{\bar a b}} g_i\Bigr)
\]

where $S_{ab}$ is the set of input units that have value 1 in both $\mathbf{a}$ and $\mathbf{b}$, $S_{a\bar b}$ is the set that have value 1 in $\mathbf{a}$ and 0 in $\mathbf{b}$, and $S_{\bar a b}$ is the set that have value 0 in $\mathbf{a}$ and 1 in $\mathbf{b}$.
Since these sets are disjoint and the expected value of the product of independent zero-mean gaussians is 0, all the cross-products in this numerator vanish when we take expected values, except for the term

\[
\Bigl\langle \sum_{i \in S_{ab}} g_i \,\sum_{i \in S_{ab}} g_i \Bigr\rangle
= \Bigl\langle \sum_{i \in S_{ab}} g_i^2 \Bigr\rangle
= n_{ab}\,\sigma^2
\]

So the expected value of the second part of equation (1) can be written as

\[
\frac{n_{ab}\,\sigma^2}{\bigl\langle\,\bigl|\sum_i a_i g_i \;\sum_i b_i g_i\bigr|\,\bigr\rangle}
= \frac{n_{ab}\,\sigma^2}{n_a^{1/2}\sigma \; n_b^{1/2}\sigma}
= \frac{n_{ab}}{n_a^{1/2}\, n_b^{1/2}}
\]

and so the expected value of $\cos\theta$ is given by

\[
\langle \cos\theta \rangle = \frac{n_{ab}^2}{n_a\, n_b}
\tag{2}
\]

If each component of $\mathbf{a}$ and $\mathbf{b}$ has value 1 with probability 0.5 and value 0 otherwise, the expected value of $\cos\theta$ is 0.25. So if we add random noise to the weights of a simple network that has learned 100 associations and then retrain on 50 of them using very small weight changes, the ratio of the expected improvements on an unretrained and a retrained association at the start of the retraining will be

\[
\frac{50 \times 0.25}{1 + 49 \times 0.25} = 0.943
\]

This agrees well with simulations we have done using networks with no hidden units.

The 1, 0, -1 Case

When the components of the input vectors can also take on values of $-1$, a modification of the derivation used above leads to

\[
\langle \cos\theta \rangle = \frac{(n_{agree} - n_{disagree})^2}{n_a\, n_b}
\tag{3}
\]

where $n_{agree}$ is the number of input units for which $a_i b_i = 1$, $n_{disagree}$ is the number for which $a_i b_i = -1$, and $n_a$ and $n_b$ are now the numbers of non-zero components of $\mathbf{a}$ and $\mathbf{b}$. For a pair of random vectors in which each component has a probability, $p$, of being a 1 and the same probability of being a $-1$, the expected value of the numerator of equation (3) is just the expected squared length of a random walk in which each step has a probability of $2p^2$ of being to the left, $2p^2$ of being to the right, and $1 - 4p^2$ of being zero. The expected value of the numerator is therefore $4p^2 n$, where $n$ is the dimensionality of the input vectors. Since the denominator has an expected value of order $n^2$, the expected transfer effect is inversely proportional to $n$.

The decrease in the expected transfer between a single pair of vectors is balanced by the fact that larger networks can hold more associations. If the number of associations learned is proportional to the dimensionality of the input vectors, and if a constant fraction of the associations is retrained after adding noise to the weights, the ratio of the initial improvement on the unretrained associations to the improvement on the retrained associations is independent of the dimensionality.
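The two expectations used in this section can be checked numerically. The snippet below is our own; the dimensionality, noise level, value of $p$ and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, p, trials = 200, 0.5, 0.3, 20_000

# Zero-one case: <(a.g)(b.g)> should equal n_ab * sigma^2.
a = rng.integers(0, 2, n).astype(float)
b = rng.integers(0, 2, n).astype(float)
g = rng.normal(0.0, sigma, (trials, n))              # many independent noise draws
print((g @ a * (g @ b)).mean(), np.sum(a * b) * sigma**2)

# 1, 0, -1 case: <(n_agree - n_disagree)^2> should equal 4 * p^2 * n.
vals = np.array([1.0, -1.0, 0.0])
probs = np.array([p, p, 1.0 - 2.0 * p])
A = rng.choice(vals, size=(trials, n), p=probs)
B = rng.choice(vals, size=(trials, n), p=probs)
print(((A * B).sum(axis=1) ** 2).mean(), 4 * p**2 * n)
```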
How the Transfer Effect Decreases During Retraining

The analyses we have presented are for the transfer between random associations during the earliest stage of retraining. As retraining proceeds, the transfer effect diminishes. The weight-space picture of figure 4 provides a simple geometrical explanation of why this occurs. At the start of retraining, the point in weight space lies somewhere on the circle, and the probability that the transfer will be positive is simply the fraction of the circumference that lies in the two larger arcs of the circle. Retraining on association 1 moves each point on the circle perpendicularly towards line 1 by an amount proportional to its distance from line 1, so after a given amount of retraining, each point on the circle will have moved to the corresponding point on an ellipse. The probability of positive transfer is now equal to the fraction of the circumference of the ellipse that lies in the two larger arcs.

CONCLUSIONS

This paper demonstrates and analyses a powerful and surprising transfer effect that was first described (but not analysed) by Hinton and Sejnowski (1986). The analysis applies just as well if there are no fast weights and the relearning takes place in the slow weights. The advantage of fast weights is that they allow this effect to be used for temporarily deblurring old memories without causing significant interference to new memories. When the fast weights decay away, the newer memories are restored.

There are many anecdotal descriptions of phenomena that could be explained by the kind of mechanism we have proposed, but we know of no well-controlled studies. We predict that if two unrelated sets of associations are learned at the same time, and if the internal representations used for different associations share the same units, then retraining on one set of associations will substantially improve performance on the other set. One problem with this prediction is that with sufficient learning, connectionist networks tend to use different sets of units for representing unrelated items. A second problem is that even if the unretrained associations are enhanced, the effect may be masked by response competition due to facilitation of the responses to the retrained items. Nevertheless, the type of effect we have described can be very large in simulated networks, and so a well-designed experiment should be able to detect whether or not it occurs in people.