Using Fast Weights to Deblur Old Memories

Geoffrey E. Hinton and David C. Plaut
Computer Science Department
Carnegie-Mellon University

Abstract

Connectionist models usually have a single weight on each connection. Some interesting new properties emerge if each connection has two weights: a slowly changing, plastic weight which stores long-term knowledge, and a fast-changing, elastic weight which stores temporary knowledge and spontaneously decays towards zero. If a network learns a set of associations and these associations are then "blurred" by subsequent learning, all of the original associations can be "deblurred" by rehearsing on just a few of them. The rehearsal allows the fast weights to take on values that temporarily cancel out the changes in the slow weights caused by the subsequent learning.

INTRODUCTION

Most connectionist models have assumed that each connection has a single weight which is adjusted during the course of learning. Despite the emerging biological evidence that changes in synaptic efficacy at a single synapse occur at many different time-scales (Kupferman, 1979; Hartzell, 1981), there have been relatively few attempts to investigate the computational advantages of giving each connection several different weights that change at different speeds. Even for phenomena like short-term memory, where fast-changing weights might seem appropriate, connectionist models have typically used the activation levels of units rather than the weights to store temporary memories (Little and Shaw, 1975; Touretzky and Hinton, 1985). We know very little about the range of potential uses of fast weights. How do they alter the way networks behave, and what extra computational properties do they provide?

In this paper we assume that each connection has both a fast, elastic weight and a slow, plastic weight. The slow weights are like the weights normally used in connectionist networks: they change slowly and they hold all the long-term knowledge of the network. The fast weights change more rapidly and they continually regress towards zero, so that their magnitude is determined solely by their recent past. The effective weight on the connection is the sum of these two. At any instant, we can think of the system's knowledge as consisting of the slow weights with a temporary overlay of fast weights. The overlay gives a temporary context: a temporary associative memory that allows networks to do more flexible information processing. Many ways of using this temporary memory have been suggested.

1. It can be used for rapid temporary learning. When presented with a new association, the network can store it in one trial, provided the storage only needs to be temporary.

2. It can be used for creating temporary bindings between features. Recent work by von der Malsburg (1981) and Feldman (1982) has shown that fast-changing weights can be used to dynamically bind together a number of different properties into a coherent whole, or to discover approximate homomorphisms between two structured domains.

3. It can be used to allow truly recursive processing. During execution of a procedure, the values of local variables and the stage reached in the procedure (the program counter) can be stored in the fast weights. This allows the procedure to call a subprocedure whose execution involves different patterns of activity in the very same units as the calling procedure.
When the subprocedure has finished using the units, the state of the calling procedure can be reconstructed from the temporary associative memory. In this way the state does not need to be stored in the activity of the units, so the very same units can be used for running the subprocedure. A working simulation of this kind is described briefly in McClelland & Kawamoto (1986).

4. It can be used to implement a learning method called "shortest descent", which is a way of minimizing the amount of interference caused by new learning. Shortest descent will be described in a separate paper and is not discussed further here.

In this paper we describe a novel use for a temporary memory stored in the fast weights: it can be used for cancelling out the interference in a set of old associations caused by more recent learning. Consider a network which has slowly and painfully learned a set of associations in its slow weights. If this network is then taught a new set of associations without any more rehearsal of the old associations, there is normally some degradation of the old associations. We show that by using fast weights it is possible to quickly restore a whole set of old associations by rehearsing on just a subset of them. The fast weights cancel out the changes in the slow weights that have occurred since the old associations were learned, so the combination of the current slow weights and the fast weights approximates the earlier slow weights. The fast weights therefore create a context in which the old associations are present again. When the fast weights decay away, the new associations return.

A DEBLURRING ANALOGY

There is an analogy between the use of fast weights to recover unretrained associations and a technique called "deblurring" which is sometimes used for cleaning up blurred images. Suppose that you are in your office and you want to take some photographs of what is on your computer screen. You set up a tripod in front of the screen, carefully focus the camera, and take one photograph. Unfortunately, before you can take any more, your office mate moves the camera to a tripod in front of his screen and refocuses it. You now move the camera back to your tripod and it seems as if you must refocus, but there is an interesting alternative. You take another photograph of the same screen without refocussing and you compare it with your first photograph. The first photo is the "desired output" of the process that maps from screens to photos, and the difference between it and the "actual output" achieved with the out-of-focus camera can be used to estimate a deblurring operator which can be applied to the actual output to convert it into the desired output. This same deblurring operator can then be applied to any image taken with the out-of-focus camera.

Focussing the camera is analogous to learning the slow weights that are required to map from input vectors (screens) to output vectors (photos). Estimating the deblurring operator is analogous to learning the fast weights that are required to compensate for the noise that has been added to the slow weights since the original learning (see figure 1).
[Figure 1: An illustration of the deblurring analogy between (a) images and (b) networks. The laws of projection (P) determine the mapping from the scene to the image. After a blurring function (B) has been applied, applying the inverse of B to the blurred image restores the original image. Similarly, the weights (W) define a mapping from input to output. After noise (N) is added, determining and applying the inverse of N allows the network to produce the original output.]

The advantage of using fast weights rather than slow ones for deblurring is that it does not permanently interfere with the new associations. As soon as the fast weights have decayed back to zero, the new knowledge is restored.

THE LEARNING PROCEDURE

We used the back-propagation learning procedure (Rumelhart et al., 1986a; 1986b) to investigate the properties of networks that learn with both fast and slow weights. We summarize the procedure below; the full mathematical details can be found in the above references.

The procedure operates on layered, feed-forward networks of deterministic, neuron-like units. These networks have a layer of input units at the bottom, any number of intermediate layers of hidden units, and a layer of output units at the top. Connections are allowed only from lower to higher layers. The aim of the learning procedure is to find a set of weights on the connections such that, when the network is presented with each of a set of input vectors, the output vector produced by the network is sufficiently close to the corresponding desired output vector. The error produced by the network is defined to be the squared difference between the actual and desired output vectors, summed over all input-output cases. The learning procedure minimizes this error by performing gradient descent in weight space. This requires adjusting each weight in the network in proportion to the partial derivative of the error with respect to that weight. These error derivatives are calculated in two passes.

The forward pass determines the output (or state) of each unit in the network. An input vector is presented to the network by setting the states of the input units. Layers are then processed sequentially, working from the bottom upward, with the states of units within a layer set in parallel. The total input to a unit is the scalar product of the states of the units in lower layers from which it receives connections and the weights on these connections. The output of a unit is a real-valued, smooth, non-linear function of its total input. The forward pass ends when the states of the output units are set.

The backward pass starts with the output units at the top of the network and works downward through each successive layer, "back-propagating" error derivatives to each weight in the network. For each layer, computing the error derivatives of the weights on incoming connections involves first computing the error derivatives with respect to the outputs, and then with respect to the inputs, of the units in that layer. The simple form of the unit input-output function makes these computations straightforward. The backward pass is complete when the derivative of the error with respect to each weight in the network has been determined.

The simplest version of gradient descent is to change each weight by an amount proportional to its error derivative, using a proportionality constant $\epsilon$. A more complicated version that usually converges faster is to also add to the current weight change a proportion, $\alpha$, of the previous weight change. This is analogous to introducing momentum to movement in weight space.

Since the effective weight on a connection is the sum of the fast and slow weights, these weights experience the same error derivative. Hence they behave differently only because they are modified using different weight change parameters $\epsilon$ and $\alpha$, and because the fast weights decay towards zero. This decay is achieved by reducing the magnitude of each fast weight by some fraction, $h$, after each weight change.
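The combined update can be sketched in a few lines of Python. This is our own illustrative code, not the original simulator, and the class name, parameter names and default values (eps_slow, eps_fast, alpha_slow, alpha_fast, decay) are assumptions; in practice the fast weights would use a larger learning rate than the slow ones.

```python
import numpy as np

class FastSlowWeights:
    """One weight matrix stored as the sum of a slow, plastic part and a
    fast, elastic part that decays towards zero (a sketch, not the paper's
    original simulator; all parameter values are illustrative)."""

    def __init__(self, n_in, n_out, eps_slow=0.02, eps_fast=0.5,
                 alpha_slow=0.9, alpha_fast=0.0, decay=0.01):
        self.slow = np.zeros((n_in, n_out))    # long-term knowledge
        self.fast = np.zeros((n_in, n_out))    # temporary overlay
        self.d_slow = np.zeros((n_in, n_out))  # previous slow-weight change
        self.d_fast = np.zeros((n_in, n_out))  # previous fast-weight change
        self.eps_slow, self.eps_fast = eps_slow, eps_fast
        self.alpha_slow, self.alpha_fast = alpha_slow, alpha_fast
        self.decay = decay                     # fraction h lost per update

    def effective(self):
        # The network always computes with the sum of the two weights.
        return self.slow + self.fast

    def update(self, dE_dw):
        # Both weights see the same error derivative, but are changed with
        # different learning rates and momentum, and only the fast weights decay.
        self.d_slow = -self.eps_slow * dE_dw + self.alpha_slow * self.d_slow
        self.d_fast = -self.eps_fast * dE_dw + self.alpha_fast * self.d_fast
        self.slow += self.d_slow
        self.fast += self.d_fast
        self.fast *= 1.0 - self.decay          # regress towards zero

# Minimal usage: one gradient step on a 3-input, 2-output weight matrix.
w = FastSlowWeights(3, 2)
w.update(np.ones((3, 2)))
print(w.effective())
```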
A SIMULATION OF THE DEBLURRING EFFECT

To demonstrate the deblurring effect, we used a simple task and a simple network. The task was to associate random binary 10-bit input vectors with random binary 10-bit output vectors. We selected 100 input vectors at random (without replacement) from the set of all 2^10 binary vectors of length 10, and for each input vector we chose a random output vector.

The network we used had three layers: 10 input units, 100 hidden units, and 10 output units. Each input unit was connected to all the hidden units, and each hidden unit was connected to all the output units. The hidden and output units also had variable biases that were modified during the learning.

We trained the network by repeatedly sweeping through the whole set of 100 associations and changing the weights after each sweep. After prolonged training (1300 sweeps with $\epsilon = 0.02$ and $\alpha = 0.9$) the network knew the associations perfectly, and all the knowledge was in the slow weights. The fast weights were very close to zero, because the errors were very small towards the end of the training, so the tiny error derivatives were dominated by the tendency of the fast weights to decay towards zero by 1% after each weight update.

Once the 100 associations were learned, we trained the network on 5 new random associations without further rehearsal on the original 100. Again, we continued the training until the new knowledge was in the slow weights (400 sweeps). We then retrained the network on only a subset of the original associations, and compared the improvement in performance on this retrained subset with the (incidental) improvement in performance on the rest of the associations. The main result, shown in figure 2, was that in the early stages of retraining the improvements in the associations that were not retrained were very nearly as large as the improvements in the associations that were explicitly retrained. This rather surprising result held even if only 10% of the old associations were retrained.

[Figure 2: Performance on retrained and unretrained subsets of 100 input-output cases after (1) learning all 100 cases, and (2) interfering with the weights by learning 5 unrelated cases. The average error of an output unit is shown for the first 20 sweeps through the retrained subset.]

The reason for this effect is that the knowledge about each association is distributed over many different connections, and when the network is retrained on some of the associations, all the weights are pushed back towards the values that they used to have when the associations were first learned. So there is improvement in the unretrained associations even though they are only randomly related to the retrained ones. Of course, if the retrained and unretrained associations share some regularities, the transfer will be even better.
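Below is a minimal numpy sketch of this experimental protocol. To keep it short it replaces the 10-100-10 back-propagation network with a single linear output layer whose original learning is done by a direct least-squares solve, and only the fast weights are adapted during retraining; all sizes, learning rates and the decay value are our own illustrative choices, so the numbers it prints will not match figure 2, but the error on the unretrained associations should typically fall along with the error on the retrained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions chosen so a single linear output layer can store the associations;
# the paper's network was 10-100-10 with non-linear hidden units.
n_in, n_out = 100, 10
X_old = rng.integers(0, 2, (100, n_in)).astype(float)   # 100 old associations
Y_old = rng.integers(0, 2, (100, n_out)).astype(float)
X_new = rng.integers(0, 2, (5, n_in)).astype(float)      # 5 interfering associations
Y_new = rng.integers(0, 2, (5, n_out)).astype(float)

def mean_error(X, Y, W):
    return np.mean((X @ W - Y) ** 2)

# 1. "Original learning": put a solution for the old associations into the slow
#    weights (a direct solve stands in for prolonged gradient descent).
slow = np.linalg.lstsq(X_old, Y_old, rcond=None)[0]
fast = np.zeros_like(slow)

# 2. Interference: learn the 5 new associations with the slow weights only.
for _ in range(400):
    slow -= 0.02 * X_new.T @ (X_new @ slow - Y_new) / len(X_new)

# 3. Deblurring: retrain only the fast weights on 10 of the old associations
#    (in the paper both sets of weights keep learning, the fast ones faster).
retrained, unretrained = slice(0, 10), slice(10, 100)
print("unretrained error before:",
      mean_error(X_old[unretrained], Y_old[unretrained], slow + fast))
for _ in range(20):
    grad = X_old[retrained].T @ (X_old[retrained] @ (slow + fast)
                                 - Y_old[retrained]) / 10
    fast -= 0.04 * grad
    fast *= 0.99                                  # fast weights decay towards zero
print("unretrained error after: ",
      mean_error(X_old[unretrained], Y_old[unretrained], slow + fast))
```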
A Geometric Explanation of the Transfer Effect

Consider the very simple network shown in figure 3. There are two input units which are directly connected to a single, linear output unit. The network is trained on two different associations, each of which maps a two-component input vector into a one-component output vector. For each of the two input vectors, there will be many different combinations of the weights $w_1$ and $w_2$ that give the desired output vector, and since the output unit is linear, these combinations form straight lines in weight space, as shown in figure 4. In general, the only combination of weights that satisfies both associations lies at the intersection of the two lines. So in this simple network, a gradient descent learning procedure will converge on the intersection point.

[Figure 3: A simple network that learns to associate $(x_1, x_2)$ with $y$ and $(x_1', x_2')$ with $y'$.]

[Figure 4: A plot of weight space for the simple network in figure 3.]

Now, suppose we add to a network that has learned the associations a noise vector of length $r$ whose orientation is selected at random from a uniform distribution over all possible orientations. The new combination of weights will be uniformly distributed over the circle shown in figure 4. If we now retrain an infinitesimal amount on association 1, we will move perpendicularly towards line 1. If we start from a point such as P that lies on one of the larger arcs of the circle, the movement towards line 1 will have a component towards line 2, whereas if we start from a point on one of the smaller arcs, the movement will have a component away from line 2. Thus, when we retrain a small amount on association 1, we are more likely than not to improve the performance on association 2, even though the associations are only randomly related to each other.

The expected positive transfer between retraining on one association and performance on the other can only be a result of starting the retraining from a point in weight space that has some special relationship to the solution point, O. We initially thought that the starting point for the retraining needed to be near the solution point, but the geometrical argument we have just given is independent of the magnitude, $r$, of the noise vector. The crucial property of the starting point is that it is selected at random from a distribution of points that are all the same distance from O, and selecting starting points in this way induces a bias towards starting points that cause positive transfer between the two associations that define the point O.

The positive transfer effect obviously disappears if the two associations are orthogonal, and so we might expect the effect to be rather small when the network is large, because randomly related high-dimensional vectors tend to be almost orthogonal. In fact, it is important to distinguish between two qualitatively different cases. If the activity levels of the input units have a mean of zero, most pairs of high-dimensional vectors will be approximately orthogonal, so the effect is small and the expected transfer from one retrained association to one unretrained one gets smaller as the dimensionality increases. If, however, the activity levels of the input units range between 0 and 1 (as in the simulation described above), random high-dimensional vectors are not close to orthogonal, and we show in the next section that the expected transfer from one retrained association to one unretrained one is independent of the dimensionality of the vectors.
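The two-dimensional argument is easy to check by simulation. The sketch below is our own code with arbitrary parameter values: it estimates the probability that an infinitesimal retraining step on association 1 also reduces the error on association 2, for noise vectors of several different lengths $r$. The estimates should all be close to the same value (roughly 0.75 for random input directions), regardless of $r$.

```python
import numpy as np

rng = np.random.default_rng(1)

def positive_transfer_prob(r, trials):
    """Probability that an infinitesimal retraining step on association 1
    also reduces the error on association 2, for a 2-weight linear unit."""
    count = 0
    for _ in range(trials):
        a1 = rng.normal(size=2)            # input vector of association 1
        a2 = rng.normal(size=2)            # input vector of association 2
        O = rng.normal(size=2)             # weights satisfying both (the intersection)
        y1, y2 = a1 @ O, a2 @ O            # targets consistent with O
        phi = rng.uniform(0, 2 * np.pi)    # noise direction, uniform over the circle
        w = O + r * np.array([np.cos(phi), np.sin(phi)])
        dw = -a1 * (a1 @ w - y1)           # gradient step direction for association 1
        err2_before = (a2 @ w - y2) ** 2
        err2_after = (a2 @ (w + 1e-6 * dw) - y2) ** 2
        count += err2_after < err2_before
    return count / trials

for r in (0.1, 1.0, 10.0):
    print(r, positive_transfer_prob(r, trials=20_000))
```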
A MATHEMATICAL ANALYSIS

It is hard to analyze the expected size of the transfer effect for networks with multiple layers of non-linear units, or for tasks with complex regularities among the input-output associations. However, an analysis of the transfer effect in simple networks learning random associations may provide a useful guide to what can be expected in more complex cases. We therefore derive the magnitude of the expected transfer from one retrained association, $A$, to one unretrained association, $B$, in a network that has no hidden units and only one linear output unit. (Footnote: For layered, feed-forward networks with no hidden units, each output unit has its own separate set of weights, so a network with many output units can be viewed as composed of many separate networks, each of which has a single output unit.)

We assume that the network has previously learned to produce the correct activity level in the output unit for two input vectors, $\mathbf{a}$ and $\mathbf{b}$, and that it is now retraining on association $A$ after independent gaussian noise has been added to each weight. The noise added to the $i$th weight is $g_i$ and is chosen from a distribution with mean 0 and standard deviation $\sigma$. We also assume that the retraining involves changing the weights by a fixed, infinitesimal amount in the direction of steepest descent in the error function $\frac{1}{2}(y_A - d_A)^2$, where $y_A$ is the actual output of the output unit and $d_A$ is the desired output that was achieved before the noise was added.

A good measure of the magnitude of the transfer effect is the ratio of two improvements: the improvement in association $B$ caused by retraining on $A$, and the improvement in association $B$ that would have been caused by retraining directly on $B$. If the retraining involves changing the weight vector by a fixed amount in the direction of steepest descent, this ratio is simply the cosine of the angle between the direction of steepest descent for association $A$ and the direction of steepest descent for association $B$. (Footnote: If the retraining involves multiplying the gradient by a coefficient, $\epsilon$, to determine the weight change, the actual ratio may differ because the magnitudes of the gradients may differ for $A$ and $B$. But this will not affect the expected ratio, so the analysis we give for the expected transfer is still valid.)

If the original learning was perfect, all of the error in the output would be caused by the noise added to the weights, so

\[
\frac{\partial E_A}{\partial y_A} = y_A - d_A = \sum_i a_i g_i
\]

where $a_i$ is the activity level of the $i$th input unit in association $A$.
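As a quick numerical check of this identity, here is a small snippet of our own (all values are made up): for a linear unit whose original weights fit association $A$ exactly, the output error after adding noise $\mathbf{g}$ to the weights equals the inner product of $\mathbf{a}$ and $\mathbf{g}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
a = rng.integers(0, 2, n).astype(float)    # input vector for association A
w_star = rng.normal(size=n)                # weights after perfect original learning
d_A = a @ w_star                           # desired output, achieved before the noise
g = rng.normal(0.0, 0.3, n)                # gaussian noise added to each weight
y_A = a @ (w_star + g)                     # actual output with noisy weights

print(y_A - d_A, a @ g)                    # the two values agree
```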
So, for a weight $w_i$, the derivatives of the error $E_A$ for association $A$ and $E_B$ for association $B$ are given by

\[
\frac{\partial E_A}{\partial w_i} = \frac{\partial y_A}{\partial w_i}\,\frac{\partial E_A}{\partial y_A} = a_i \sum_j a_j g_j ,
\qquad
\frac{\partial E_B}{\partial w_i} = \frac{\partial y_B}{\partial w_i}\,\frac{\partial E_B}{\partial y_B} = b_i \sum_j b_j g_j .
\]

Hence, the cosine of the angle, $\theta$, between the directions of steepest descent for the two associations is given by

\[
\cos\theta
= \frac{\sum_i a_i b_i \;\sum_i a_i g_i \;\sum_i b_i g_i}
       {\Bigl[\sum_i a_i^2 \,\bigl(\sum_i a_i g_i\bigr)^2 \;\sum_i b_i^2 \,\bigl(\sum_i b_i g_i\bigr)^2\Bigr]^{1/2}}
= \frac{\sum_i a_i b_i}{\bigl(\sum_i a_i^2 \,\sum_i b_i^2\bigr)^{1/2}}
  \cdot
  \frac{\sum_i a_i g_i \;\sum_i b_i g_i}{\bigl|\sum_i a_i g_i \;\sum_i b_i g_i\bigr|}
\tag{1}
\]

The Zero-One Case

The first part of equation (1) is simply the cosine of the angle between the two input vectors, and so it is independent of the weights. If the components of $\mathbf{a}$ and $\mathbf{b}$ are all 0 or 1 it can be written as

\[
\frac{n_{ab}}{(n_a n_b)^{1/2}}
\]

where $n_a$ is the number of components that have value 1 in the vector $\mathbf{a}$, and $n_{ab}$ is the number that have value 1 in both $\mathbf{a}$ and $\mathbf{b}$.

The second part of equation (1) depends on the weights. It always has a value of 1 or $-1$, depending on whether the output errors $\sum_i a_i g_i$ and $\sum_i b_i g_i$ have the same or opposite signs. Its numerator can be written as

\[
\Bigl(\sum_{i \in S_{ab}} g_i \;+\; \sum_{i \in S_{a\bar b}} g_i\Bigr)
\Bigl(\sum_{i \in S_{ab}} g_i \;+\; \sum_{i \in S_{\bar a b}} g_i\Bigr)
\]

where $S_{ab}$ is the set of input units that have value 1 in both $\mathbf{a}$ and $\mathbf{b}$, $S_{a\bar b}$ is the set that have value 1 in $\mathbf{a}$ and 0 in $\mathbf{b}$, and $S_{\bar a b}$ is the set that have value 0 in $\mathbf{a}$ and 1 in $\mathbf{b}$.
Since these sets are disjoint and the expected value of the product of independent zero-mean gaussians is 0, all the cross-products in this numerator vanish when we take expected values, except for the term

\[
\Bigl\langle \sum_{i \in S_{ab}} g_i \,\sum_{i \in S_{ab}} g_i \Bigr\rangle
= \Bigl\langle \sum_{i \in S_{ab}} g_i^2 \Bigr\rangle
= n_{ab}\,\sigma^2
\]

So the expected value of the second part of equation (1) can be written as

\[
\frac{n_{ab}\,\sigma^2}{\bigl\langle\,\bigl|\sum_i a_i g_i \;\sum_i b_i g_i\bigr|\,\bigr\rangle}
= \frac{n_{ab}\,\sigma^2}{n_a^{1/2}\sigma \; n_b^{1/2}\sigma}
= \frac{n_{ab}}{n_a^{1/2}\, n_b^{1/2}}
\]

and so the expected value of $\cos\theta$ is given by

\[
\langle \cos\theta \rangle = \frac{n_{ab}^2}{n_a\, n_b}
\tag{2}
\]

If each component of $\mathbf{a}$ and $\mathbf{b}$ has value 1 with probability 0.5 and value 0 otherwise, the expected value of $\cos\theta$ is 0.25. So if we add random noise to the weights of a simple network that has learned 100 associations and then retrain on 50 of them using very small weight changes, the ratio of the expected improvements on an unretrained and a retrained association at the start of the retraining will be

\[
\frac{50 \times 0.25}{1 + 49 \times 0.25} = 0.943
\]

This agrees well with simulations we have done using networks with no hidden units.

The 1, 0, -1 Case

When the components of the input vectors can also take on values of $-1$, a modification of the derivation used above leads to

\[
\langle \cos\theta \rangle = \frac{(n_{agree} - n_{disagree})^2}{n_a\, n_b}
\tag{3}
\]

where $n_{agree}$ is the number of input units for which $a_i b_i = 1$, $n_{disagree}$ is the number for which $a_i b_i = -1$, and $n_a$ and $n_b$ are now the numbers of non-zero components of $\mathbf{a}$ and $\mathbf{b}$. For a pair of random vectors in which each component has a probability, $p$, of being a 1 and the same probability of being a $-1$, the expected value of the numerator of equation (3) is just the expected squared length of a random walk in which each step has a probability of $2p^2$ of being to the left, $2p^2$ of being to the right, and $1 - 4p^2$ of being zero. The expected value of the numerator is therefore $4p^2 n$, where $n$ is the dimensionality of the input vectors. Since the denominator has an expected value of order $n^2$, the expected transfer effect is inversely proportional to $n$.

The decrease in the expected transfer between a single pair of vectors is balanced by the fact that larger networks can hold more associations. If the number of associations learned is proportional to the dimensionality of the input vectors, and if a constant fraction of the associations is retrained after adding noise to the weights, the ratio of the initial improvement on the unretrained associations to the improvement on the retrained associations is independent of the dimensionality.
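The two expectations used in this section can be checked numerically. The snippet below is our own; the dimensionality, noise level, value of $p$ and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, p, trials = 200, 0.5, 0.3, 20_000

# Zero-one case: <(a.g)(b.g)> should equal n_ab * sigma^2.
a = rng.integers(0, 2, n).astype(float)
b = rng.integers(0, 2, n).astype(float)
g = rng.normal(0.0, sigma, (trials, n))              # many independent noise draws
print((g @ a * (g @ b)).mean(), np.sum(a * b) * sigma**2)

# 1, 0, -1 case: <(n_agree - n_disagree)^2> should equal 4 * p^2 * n.
vals = np.array([1.0, -1.0, 0.0])
probs = np.array([p, p, 1.0 - 2.0 * p])
A = rng.choice(vals, size=(trials, n), p=probs)
B = rng.choice(vals, size=(trials, n), p=probs)
print(((A * B).sum(axis=1) ** 2).mean(), 4 * p**2 * n)
```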
How the Transfer Effect Decreases During Retraining

The analyses we have presented are for the transfer between random associations during the earliest stage of retraining. As retraining proceeds, the transfer effect diminishes. The weight-space picture of figure 4 provides a simple geometrical explanation of why this occurs. At the start of retraining, the point in weight space lies somewhere on the circle, and the probability that the transfer will be positive is simply the fraction of the circumference that lies in the two larger arcs of the circle. Retraining on association 1 moves each point on the circle perpendicularly towards line 1 by an amount proportional to its distance from line 1, so after a given amount of retraining, each point on the circle will have moved to the corresponding point on an ellipse. The probability of positive transfer is now equal to the fraction of the circumference of the ellipse that lies in the two larger arcs.

CONCLUSIONS

This paper demonstrates and analyses a powerful and surprising transfer effect that was first described (but not analysed) by Hinton and Sejnowski (1986). The analysis applies just as well if there are no fast weights and the relearning takes place in the slow weights. The advantage of fast weights is that they allow this effect to be used for temporarily deblurring old memories without causing significant interference to new memories. When the fast weights decay away, the newer memories are restored.

There are many anecdotal descriptions of phenomena that could be explained by the kind of mechanism we have proposed, but we know of no well-controlled studies. We predict that if two unrelated sets of associations are learned at the same time, and if the internal representations used for different associations share the same units, then retraining on one set of associations will substantially improve performance on the other set. One problem with this prediction is that with sufficient learning, connectionist networks tend to use different sets of units for representing unrelated items. A second problem is that even if the unretrained associations are enhanced, the effect may be masked by response competition due to facilitation of the responses to the retrained items. Nevertheless, the type of effect we have described can be very large in simulated networks, and so a well-designed experiment should be able to detect whether or not it occurs in people.