A network is made ready for training with the resetNet command or by pressing the "Reset Network" button on the Main Window. This will randomize all non-frozen weights, clear weight update direction information, reset unit outputs to the initOutput and other unit fields to appropriate values, and set the network's totalUpdates, error, and several other values to zero.
Training is performed primarily using the train command. Most of the training parameters must be set beforehand, either using the entries in the Main Window and Object Viewer or with setObject. The arguments to train can optionally set the number of weight updates, the report interval, and the algorithm. If any of these values are left out, the previous value is reused. Training can also be done using the algorithm-specific commands steepest, momentum, dougsMomentum, and deltaBarDelta. These basically just call the generic train command with a flag for the particular algorithm. dougsMomentum is the default algorithm and the default learning rate is set accordingly. Smaller learning rates may be needed with other algorithms.
Training consists of repeatedly performing forward and backward passes on a batch of examples, accumulating link derivatives, and then updating the link weights based on the derivatives and the training algorithm. The batchSize is the number of examples processed between weight updates. A batchSize of 0 indicates full batches, which will have the same number of examples as the training set, however large that may be. The reportInterval determines the number of weight updates that occur between progress reports. This should be kept large enough that reports don't occur annoyingly often.
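This accumulate-then-update cycle can be sketched as follows. The Link structure, field names, and trainBatch function here are illustrative stand-ins for this sketch, not Lens's actual C internals:

```c
#include <stddef.h>

/* Hypothetical per-link state; Lens's real structures differ. */
typedef struct { double weight, deriv; } Link;

/* One batch: accumulate derivatives over batchSize examples, then take
   a single steepest-descent weight update. forwardBackward stands in
   for the forward and backward passes on one example. */
void trainBatch(Link *links, size_t numLinks,
                size_t batchSize, double learningRate,
                void (*forwardBackward)(Link *, size_t)) {
  for (size_t l = 0; l < numLinks; l++)
    links[l].deriv = 0.0;                 /* clear accumulated gradient */
  for (size_t e = 0; e < batchSize; e++)
    forwardBackward(links, numLinks);     /* adds this example's gradient */
  for (size_t l = 0; l < numLinks; l++)
    links[l].weight -= learningRate * links[l].deriv;
}
```

With batchSize equal to the training set size this is full-batch training; a batchSize of 1 gives fully online updates.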
Within a batch, examples are chosen based on the example set's selection mode. This can be ORDERED, RANDOMIZED, PERMUTED, PROBABILISTIC, PIPE, or CUSTOM. The selection mode is set when the example set is loaded and can be changed with the exampleSetMode command. ORDERED mode, which is the default, runs through examples in the order in which they were loaded from the example file. RANDOMIZED mode selects examples at random with equal probabilities and with replacement. PERMUTED mode selects examples at random with equal probabilities but without replacement. Thus, each example will be presented once before any are repeated.
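PERMUTED selection amounts to shuffling the example indices once per pass through the set. A minimal sketch, assuming a Fisher-Yates shuffle (not Lens's actual implementation):

```c
#include <stdlib.h>

/* Shuffle indices 0..n-1 in place. Stepping through the permuted array
   presents every example exactly once before any repeats. */
void permuteIndices(int *idx, int n) {
  for (int i = 0; i < n; i++)
    idx[i] = i;
  for (int i = n - 1; i > 0; i--) {
    int j = rand() % (i + 1);   /* pick from the not-yet-placed prefix */
    int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
  }
}
```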
PROBABILISTIC mode selects examples based on their given frequency. Frequency values specified in the example file will be normalized over all examples and this distribution used for selection. If example sets are concatenated, the distribution will be recalculated based on the specified frequencies. The default frequency for an example is 1.0. If the network's pseudoExampleFreq flag is on, pseudo example-frequencies will be used. This does not affect the rate at which examples are presented but will scale the error (and output unit error derivatives) on each example by its frequency.
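Selection by normalized frequency can be sketched like this; chooseByFrequency is a hypothetical helper for illustration, not part of Lens:

```c
/* Draw an example index from the normalized frequency distribution.
   freq[] holds each example's specified frequency; u is a uniform
   random number in [0, 1). */
int chooseByFrequency(const double *freq, int n, double u) {
  double total = 0.0;
  for (int i = 0; i < n; i++)
    total += freq[i];                 /* normalization constant */
  double r = u * total, cum = 0.0;
  for (int i = 0; i < n; i++) {
    cum += freq[i];
    if (r < cum) return i;
  }
  return n - 1;                       /* guard against rounding at the edge */
}
```

An example with frequency 3.0 is thus chosen three times as often as one with the default frequency of 1.0.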
PIPE mode is used for example sets that do not have the examples stored but read each example from an open pipe as it is needed. This is useful if you want randomly generated examples or examples generated based on the performance of the network or other factors. It is also useful if the example set is too large to load completely into memory. The example set has a flag pipeLoop that determines what happens when the pipe is exhausted. If pipeLoop is TRUE, the default, it will be reopened when the end is reached. Otherwise, the pipe will close and attempting to use the example set will cause an error. An example of reading and writing to an example-generating pipe is given in the description of the XOR network.
CUSTOM mode allows you to write a procedure that generates the index of the next example. When it's time to choose the next example, the example set's chooseExample procedure will be called. This should return an integer between 0 and one less than the number of examples, inclusive.
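For example, a chooseExample procedure that picks examples uniformly at random (mimicking RANDOMIZED mode) might be installed as shown below. The object path `trainingSet.chooseExample` and the field name `numExamples` are illustrative assumptions here; check the Object Viewer for the actual paths in your network:

```tcl
lens> setObj trainingSet.chooseExample {
  expr {int(rand() * [getObj trainingSet.numExamples])}
}
```

The body's result, an integer in the legal range, becomes the index of the next example.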
If the reportInterval is 0, no progress reports will be produced. Otherwise, progress reports are generated following the first and last weight updates of training and after each reportInterval. A series of progress reports looks like the following:
__Update____Error___UnitCost__Wgt.Cost__Grad.Lin__TimeUsed__TimeLeft__
      1)  339.271   9845.37   63.6118         -        0s    2m 27s
     40)  222.877   7869.83   67.6063   0.91978       14s     2m 9s
     80)  218.415   669.471   255.497   0.13872       27s    1m 53s
    120)  182.575   1275.39   366.961   0.10128       41s    1m 38s
    160)  155.519   1207.68   511.822   0.08328       55s    1m 24s
    200)  127.880   1344.23   699.051   0.08551    1m 10s    1m 10s
    240)  105.500   1339.00   889.689   0.00279    1m 24s       57s
    280)  93.6190   1294.91   1020.21  -0.12509    1m 39s       43s
    320)  87.4166   1251.86   1093.53  -0.22272    1m 54s       29s
    360)  83.6865   1230.53   1138.31  -0.33909    2m 10s       15s
    400)  81.8992   1228.21   1152.46  -0.46680    2m 25s        0s
The first column is the total number of weight updates that have been performed on the network across all training runs. It is taken from the network's totalUpdates field. The second is the overall network error on the last batch of training, not accumulated across all batches since the last report.
The third column is the unitCost, which is summed over all groups that have a unit output cost function and is a measure of how binary the units are.
The fourth column is the weightCost, which is the sum of the squared weight values. It will almost invariably increase unless you have some weightDecay going. The weight cost and gradient linearity are only calculated on updates for which they will be reported.
The fifth column is the Gradient Linearity. This is a measure of how consistent the gradient descent trajectory is and is equal to the negative of the cosine of the angle between the most recent link weight error derivative vector and the link weight delta vector (the amount each weight changed) from the weight update preceding that. A value of 1.0 indicates that the network is on a steady course, but the learning rate may be a bit slow. A negative value means you may be popping around randomly and might need to lower the learning rate. On the other hand, it may mean that you are just sitting in a local minimum and that raising the learning rate might jump you out and send the Gradient Linearity up towards 1.0 again.
The TimeUsed is relatively self-explanatory. The TimeLeft is an estimate of the training time remaining. It is based on the time-averaged rate at which training is progressing. If the load on the machine changes or you slow the simulator down by starting a graph or something, the time estimate will adapt over a few reports. The first couple of estimates are likely to be inaccurate, particularly on a shared machine.
Training continues until one of four things happens: numUpdates weight updates are performed, the overall accumulated error on a batch of examples is less than or equal to the network's criterion value, the group criterion is reached on a batch, or the user stops training by pressing "Stop Training" or by causing an interrupt signal.
The group criterion is met when the groupCriterionReached() is true for every group at the end of every event in every example in the batch. Each group has its own groupCriterionReached() function. The standard group criterion function is true during training if the output of every unit is within its group's trainGroupCrit of the target (or within the testGroupCrit during testing). If the group's trainGroupCrit is NaN, the network's value will be used. Clearly, this standard function is only relevant for output groups. It will be the default for output groups of continuous networks. Other networks have no groupCriterionReached() by default. If no groups have a criterion function, the group criterion will never be reached for the network.
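A sketch of the standard per-group criterion test, with illustrative names (Lens's real function operates on group and unit structures rather than flat arrays):

```c
#include <math.h>
#include <stdbool.h>

/* Standard group criterion: true when every unit's output is within
   the criterion distance of its target. criterion would be the group's
   trainGroupCrit (or testGroupCrit during testing), falling back to
   the network's value if the group's is NaN. */
bool groupCritReached(const double *output, const double *target,
                      int numUnits, double criterion) {
  for (int i = 0; i < numUnits; i++)
    if (fabs(output[i] - target[i]) > criterion)
      return false;
  return true;
}
```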
The following network-level parameters are used in training. These and other parameters may be changed in the middle of training.
The process of accumulating error derivatives over a batch of examples is the same for all training algorithms. The algorithm only affects the updating of weights given their error derivatives. The weight update equations for each link, L, using steepest descent are as follows:
    L->lastWeightDelta = -learningRate * L->deriv;
    if (weightDecay > 0.0)
      L->lastWeightDelta -= weightDecay * L->weight;
    L->weight += L->lastWeightDelta;
lastWeightDelta, as used here, actually means the current weight delta. Under momentum descent the equations are:
    L->lastWeightDelta = -learningRate * L->deriv + momentum * L->lastWeightDelta;
    if (weightDecay > 0.0)
      L->lastWeightDelta -= weightDecay * L->weight;
    L->weight += L->lastWeightDelta;
dougsMomentum is exactly like momentum with the exception that the pre-momentum weight step vector is bounded so that its length cannot exceed 1.0. After the momentum is added, the length of the resulting weight change vector can grow as high as 1 / momentum. This change allows stable behavior with much higher initial learning rates, resulting in less need to adjust the learning rate as training progresses. It is usually safe to start training with high momentum in dougsMomentum, but not in standard momentum descent.
deltaBarDelta maintains individual learning rates for each weight. These increase when consecutive weight changes are heading in the same direction and decrease otherwise. This tends to be very good at simple problems but not quite as good at complex ones where the weight changes often switch direction, causing the learning rates to become very small. The link learning rate is stored in the lastValue field of the link:
    if (sameSign(L->deriv, L->lastWeightDelta))
      L->lastValue *= rateDecrement;
    else
      L->lastValue += rateIncrement;
    L->lastWeightDelta = -L->lastValue * L->deriv + momentum * L->lastWeightDelta;
    if (weightDecay > 0.0)
      L->lastWeightDelta -= weightDecay * L->weight;
    L->weight += L->lastWeightDelta;
The test command is used to test the network. It will reset the testing set and then run a forward pass for each example in the set, accumulating the overall network error. If there is no testing set, the training set will be used. The order of the examples is determined by the example set mode of the testing set. If the mode is ORDERED or PERMUTED, each example will be presented exactly once. Otherwise, as many examples will be presented as there are examples in the set, but they will be chosen randomly.
The commands openNetOutputFile and closeNetOutputFile control the writing of examples and targets to a file during training and testing. This can be used to record the network's performance for analysis by an external program.
The network contains a number of procedure hooks that can be used to customize the behavior of the network without writing any C code. Each is a user-defined procedure that is executed at a particular time during training or testing of the network. They are as follows:
Although the names are pretty self-explanatory, these are described a bit more in the structures section.
If you wanted the network to print out the error after each example, you could do:
lens> setObj postExampleProc {puts [getObj error]}
You can also do special things by specifying example or event procs in the example set. Whether you use the network procs or the example set procs depends on whether the proc is more logically associated with the environment or the network.