0:00

In this video, I'm going to talk about the storage capacity of Hopfield nets.

Their ability to store a lot of memories is limited by what are called spurious memories. These occur when two nearby energy minima combine to make a new minimum in the wrong place. Attempts to remove these spurious minima eventually led to a very interesting way of doing learning in things considerably more complicated than a basic Hopfield net. At the end of the video, I'll also talk about a curious historical rediscovery, in which physicists trying to increase the capacity of Hopfield nets rediscovered the perceptron convergence procedure.

Hopfield invented Hopfield nets as memory storage devices, and the field became obsessed by the storage capacity of a Hopfield net. Using the Hopfield storage rule for a totally connected net, the capacity is about 0.15N memories. That is, if you have N binary threshold units, the number of memories you can store is about 0.15N before memories start getting confused with one another. So that's the number you can store and still hope to retrieve them sensibly.
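The storage rule itself can be sketched in a few lines of Python. This is my own minimal version, not code from the lecture; the sizes, seed and variable names are arbitrary choices. It stores a handful of random binary patterns, well below the 0.15N limit, and checks that a stored pattern is a stable state.

```python
import numpy as np

# Minimal sketch of the Hopfield storage rule (my own construction).
# Each memory is a vector of -1/+1 states; storing it adds the outer
# product of the vector with itself, keeping self-connections at zero.
rng = np.random.default_rng(0)
N = 100                                       # binary threshold units
patterns = rng.choice([-1, 1], size=(3, N))   # far below 0.15 * N memories

W = np.zeros((N, N))
for p in patterns:
    W += np.outer(p, p)
np.fill_diagonal(W, 0)

# Well below capacity, a stored pattern should be a fixed point:
# updating every unit from the others leaves the pattern unchanged.
s = patterns[0]
print(np.array_equal(np.sign(W @ s), s))
```

With only three memories in a hundred units, the signal from a pattern's own outer product dominates the crosstalk from the others, so recall is stable; pushing the count toward 0.15N is where confusions start.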

Â 1:35

This doesn't make efficient use of the bits that are required to store the weights. In other words, if you look at how many bits the computer is using to store the weights, it's using well over 0.15N² bits to store them. And therefore, this kind of distributed memory in local energy minima is not making efficient use of the bits in the computer.

We can analyze how much we should be able to store if we were making efficient use of the bits in the computer, that is, of those N² weights and biases in the net. After storing M memories, each connection weight has an integer value in the range -M to M. That's because we increase it by one or decrease it by one each time we store a memory, assuming that we use states of -1 and +1.
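That weight-range claim is easy to check numerically. Here's a small sketch (my own, with arbitrary sizes): after storing M random -1/+1 patterns with the outer-product rule, every off-diagonal weight is an integer that stays between -M and M.

```python
import numpy as np

# Check the weight-range claim: after M stored memories, each
# off-diagonal weight has moved by +1 or -1 per memory, so it is an
# integer in the range -M .. M (2M + 1 possible values per weight).
rng = np.random.default_rng(1)
N, M = 20, 7
W = np.zeros((N, N), dtype=int)
for _ in range(M):
    p = rng.choice([-1, 1], size=N)   # a random memory of -1/+1 states
    W += np.outer(p, p)
np.fill_diagonal(W, 0)

print(W.min(), W.max())               # both lie within -M .. M
print(N * N * np.log2(2 * M + 1))     # naive bit count for all the weights
```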

Now, of course, not all values will be equiprobable, so we could compress that information. But ignoring that, the number of bits it would take us to store a connection weight in a naive way is log₂(2M + 1), because that's the number of alternative connection weights, and the log is to base 2. And so the total number of bits of computer memory that we use is of the order of N² log₂(2M + 1). Notice that that scales logarithmically with M, whereas if you store things in the way that Hopfield suggests, you get this constant 0.15 instead of something that scales logarithmically. So we're not so worried about the fact that the constant is a lot less than two; what we're worried about is the logarithmic scaling. That shows we ought to be able to do something better.

If we ask what limits the capacity of a Hopfield net, what is it that causes it to break down, then it's the merging of energy minima. Each time we memorize a binary configuration, we hope that we'll create a new energy minimum. So we might have a state space, with all the states of the net depicted horizontally and the energy depicted vertically, and we might have one energy minimum for the blue pattern and another for the green pattern. But if those two patterns are nearby, what will happen is we won't get two separate minima; they'll merge to create one minimum at an intermediate location. And that means we can't distinguish those two separate memories, and indeed we'll recall something that's a blend of them rather than the individual memories. That's what limits the capacity of a Hopfield net: that kind of merging of nearby minima.

Â 4:22

One thing I should mention is that this picture is a big misrepresentation. The states of a Hopfield net are really the corners of a hypercube, and it's not very good to show the corners of a hypercube as if they were a continuous one-dimensional horizontal space.

One very interesting idea that came out of thinking about how to improve the capacity of the Hopfield net is the idea of unlearning. This was first suggested by Hopfield, Feinstein and Palmer, who suggested the following strategy.

Â 4:55

You let the net settle from a random initial state, and then you do unlearning. That is, whatever binary state it settles to, you apply the opposite of the storage rule. I think you can see that, with the previous example, if you let the net settle to that red merged minimum and did some unlearning there, you'd get back the two separate minima, because you'd pull up that red point. So by getting rid of deep spurious minima, we can actually increase the memory capacity.
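A minimal sketch of that unlearning strategy in code (my own construction; the settling schedule and the unlearning rate eta are guesses, since how much unlearning to do is exactly the open question raised later in the video):

```python
import numpy as np

# Ordinary Hopfield storage, followed by one round of unlearning:
# settle from a random state, then apply the storage rule in reverse
# to whatever state the net settled to.
rng = np.random.default_rng(2)
N = 50
patterns = rng.choice([-1, 1], size=(3, N))
W = np.zeros((N, N))
for p in patterns:
    W += np.outer(p, p)               # the usual storage rule
np.fill_diagonal(W, 0)

def settle(W, s, steps=500):
    """Asynchronous binary threshold updates starting from state s."""
    s = s.copy()
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

eta = 0.1                             # unlearning rate: an arbitrary guess
s = settle(W, rng.choice([-1, 1], size=N))
W -= eta * np.outer(s, s)             # opposite sign to the storage rule
np.fill_diagonal(W, 0)
```

If the random start settles into a spurious merged minimum, this update raises its energy; repeated too often, though, it would also erode the genuine memories, which is why the amount of unlearning matters.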

Â 5:38

Francis Crick, one of the discoverers of the structure of DNA, and Graeme Mitchison proposed that unlearning might be what's going on during REM sleep, that is, rapid eye movement sleep. The idea was that during the day you store lots of things and you get spurious minima. Then at night, you put the network in a random state, you settle to a minimum, and you unlearn what you settled to.

That actually explains a big puzzle. This is a puzzle that doesn't seem to puzzle most people who study sleep, but it ought to. Each night, you go to sleep and you dream for several hours. When you wake up in the morning, those dreams are all gone. Well, they're not quite all gone. The dream you had just before you woke up, you can get into short-term memory and you'll remember it for a while, and if you think about it, you might remember it for a long time. But we know perfectly well that if we'd woken you up at other times in the night, you'd have been having other dreams, and in the morning they're just not there. So it looks like you're simply not storing what you're dreaming about, and the question is, why? In fact, why do you bother to dream at all?

Dreaming is paradoxical in that the state of your brain looks extremely like the state of your brain when you're awake, except that it's not being driven by real input. It's being driven by a relay station just after the real input, called the thalamus.

Â 7:20

But there's another problem with unlearning, which is a more mathematical problem: how much unlearning should we do? Now, given what you've seen in the course so far, a real solution to that problem would be to show that unlearning is part of the process of fitting a model to data. If you do maximum likelihood fitting of that model, then unlearning will automatically come out of fitting the model, and what's more, you'll know exactly how much unlearning to do.

Â 8:07

Before we get to that, I want to talk a little bit about ways that physicists discovered for increasing the capacity of the Hopfield net. As I said, this was a big obsession of the field. I think it's because physicists really love the idea that math they already know might explain how the brain works. That means postdoctoral fellows in physics who can't get a job in physics might be able to get a job in neuroscience. So there were a very large number of papers published in physics journals about Hopfield nets and their storage capacity.

Eventually, a very smart student called Elizabeth Gardner figured out that there's actually a much better storage rule if you're concerned about capacity, one that would use the full capacity of the weights. And I think this storage rule will be familiar to you.

Â 8:56

Instead of trying to store vectors in one go, we're going to cycle through the training set many times. So we lose our nice online property, that you only have to go through the data once, but in return we gain more efficient storage. What we're going to do is use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in the global vector that we want to store.

So you take your net, you put it into the memory state you want to store, and then you take each unit separately and ask: would this unit adopt the state I want for it, given the states of all the other units? If it would, you leave its incoming weights alone. If it wouldn't, you change its incoming weights in the way specified by the perceptron convergence procedure. And notice, these would be integer changes to the weights.

Â 9:55

You may have to do this several times, and of course, if you give it too many memories, this won't converge. You only get convergence with the perceptron convergence procedure if there is a set of weights that will solve the problem. But assuming there is, this is a much more efficient way to store memories in a Hopfield net.

This technique was also developed in another field, statistics, where statisticians call it pseudo-likelihood. The idea is to get one thing right given all the other things. So with high-dimensional data, if you want to build a model of it, the idea is to build a model that tries to get the value on one dimension right given the values on all the other dimensions. The main difference between the perceptron convergence procedure as it's normally used and pseudo-likelihood is that, in a Hopfield net, the weights are symmetric, so we have to get two sets of gradients for each weight and average them. But apart from that, the way to use the full capacity of a Hopfield net is to use the perceptron convergence procedure and to go through the data several times.
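Putting the pieces together, here is a rough sketch of that storage procedure (all names, sizes and the number of passes are my own choices, not from the lecture): each unit gets a perceptron-style update whenever it would adopt the wrong state given the others, and every weight change is applied in both directions so the weight matrix stays symmetric, which plays the role of averaging the two gradients for each weight.

```python
import numpy as np

# Perceptron-style storage for a Hopfield net (a sketch, not the
# definitive procedure). For each pattern and each unit, ask whether
# the unit would adopt its target state given all the other units;
# if not, apply an integer perceptron update to its weights.
rng = np.random.default_rng(3)
N = 30
patterns = rng.choice([-1, 1], size=(4, N))
W = np.zeros((N, N))

for _ in range(50):                        # several passes through the data
    for p in patterns:
        for i in range(N):
            if np.sign(W[i] @ p) != p[i]:  # unit i in the wrong state
                delta = p[i] * p           # perceptron-style integer change
                delta[i] = 0               # keep the self-connection at zero
                W[i] += delta              # update the incoming weights...
                W[:, i] += delta           # ...and mirror them, so W stays
                                           # symmetric (the averaging step)

# Did every stored pattern become a fixed point?
print(all(np.array_equal(np.sign(W @ p), p) for p in patterns))
```

With only a few patterns there is almost always a set of weights that works, so the sweeps settle down; overload it and, as the lecture says, the procedure simply never converges.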
