Faison P. Gibson
Graduate School of Industrial Administration
Carnegie Mellon University
Pittsburgh, PA 15213-3890
gibson+@cmu.edu
David C. Plaut
Department of Psychology
Carnegie Mellon University, and
Center for the Neural Basis of Cognition
Pittsburgh, PA 15213-3890
plaut@cmu.edu
In Proceedings of the 17th Annual Conference of the Cognitive Science Society, pages 512-517. Hillsdale, NJ: Lawrence Erlbaum Associates.
Casting dynamic decision making in terms of control theory allows for the transfer of insights from other related domains (Hogarth, 1986). In motor learning, Jordan and Rumelhart (1992; Jordan, 1992, in press) address issues very similar to those addressed by Brehmer. The key to applying their approach to dynamic decision making is to divide the learning problem into two interdependent subproblems: (1) learning how actions affect the environment, and (2) learning what actions to take to achieve specific goals, given an understanding of (1). These two subproblems are solved simultaneously by two connectionist networks joined in series (see Figure 1).
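For concreteness, the two subproblems and their serial connection can be sketched as follows. This is an illustrative Python/NumPy fragment rather than the simulation reported here; the linear stand-ins for the two networks, their sizes, and the numerical values are assumptions made only for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear stand-ins for the two connectionist networks; sizes are illustrative.
W_action = rng.uniform(-0.5, 0.5, size=(2, 1))   # maps [state, goal] -> action
W_forward = rng.uniform(-0.5, 0.5, size=(2, 1))  # maps [state, action] -> predicted outcome

def action_model(state, goal):
    """Subproblem 2: what action to take to achieve the goal from this state."""
    return np.array([state, goal]) @ W_action

def forward_model(state, action):
    """Subproblem 1: how an action affects the environment (predicted outcome)."""
    return np.concatenate([[state], action]) @ W_forward

# The two networks joined in series (Figure 1): the action model's output feeds
# the forward model, which predicts the outcome that action will produce.
state, goal = 0.5, 1.0
action = action_model(state, goal)
predicted_outcome = forward_model(state, action)
```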
The task of the action model is to take as input the current state of the environment and the specific goal to achieve, and to generate as output an action that achieves that goal. This action then leads to an outcome which can be compared with the goal to guide behavior. Unfortunately, when the outcome fails to match the goal---as it generally will until learning is complete---the environment does not provide direct feedback on how to adjust the action so as to improve the corresponding outcome's match to the goal.
Such feedback can, however, be derived from an internal model of the environment, in the form of a forward model. This network takes as input the current state of the environment and an action, and generates as output a predicted outcome. This predicted outcome can be compared with the actual outcome to derive an error signal. A gradient-descent procedure, such as back-propagation (Rumelhart, Hinton, & Williams, 1986), can then be used to adjust the parameters (i.e., connection weights) of the forward model to improve its ability to predict the effects of actions on the environment. Notice that learning in the forward model is dependent on the behavior of the action model because it can learn environmental outcomes only over the range of actions actually produced by the action model.
To the extent that the behavior of the forward model approximates that of the environment, it can provide the action model with feedback for learning in the following way. The actual outcome produced by the action is compared with the goal to derive a second error signal. Back-propagation can again be applied to the forward model (without changing its own parameters) to determine how changing the action would change the error. This information corresponds to the error signal that the action model requires to determine how to adjust its parameters so as to reduce the discrepancy between the goal and the actual outcome produced by its action.
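The complete learning step can be sketched as follows. This is again an illustration rather than the reported simulation: the single sigmoid hidden layer, the layer sizes, and the toy environment function are assumptions, although the uniform +/-0.5 weight initialization and the learning rate of 0.1 match the values given later in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Net:
    """One sigmoid hidden layer feeding a single linear output unit."""
    def __init__(self, n_in, n_hidden, scale=0.5):
        self.W1 = rng.uniform(-scale, scale, (n_hidden, n_in))
        self.W2 = rng.uniform(-scale, scale, (1, n_hidden))

    def forward(self, x):
        self.x = x
        self.h = sigmoid(self.W1 @ x)
        return (self.W2 @ self.h).item()

    def grads(self, dy):
        """Given dE/d(output) = dy, return gradients w.r.t. weights and input."""
        dW2 = dy * self.h[np.newaxis, :]
        dz = (dy * self.W2.ravel()) * self.h * (1.0 - self.h)
        dW1 = np.outer(dz, self.x)
        dx = self.W1.T @ dz
        return dW1, dW2, dx

# Toy environment (an assumption, standing in for the real task): the outcome
# is an unknown nonlinear function of the current state and the chosen action.
def environment(state, action):
    return 0.8 * state + np.tanh(action)

lr = 0.1
action_net = Net(n_in=2, n_hidden=10)   # input: [state, goal] -> action
forward_net = Net(n_in=2, n_hidden=10)  # input: [state, action] -> predicted outcome

state, goal = 0.2, 0.9

# The action model proposes an action; the environment and the forward model
# each produce an outcome (actual and predicted, respectively).
action = action_net.forward(np.array([state, goal]))
outcome = environment(state, action)
predicted = forward_net.forward(np.array([state, action]))

# Error signal 1: prediction error, used to improve the forward model.
pred_err = predicted - outcome
fW1, fW2, _ = forward_net.grads(pred_err)

# Error signal 2: performance error (outcome vs. goal), back-propagated through
# the forward model WITHOUT changing its weights; only the gradient with respect
# to the action input is kept and handed to the action model.
perf_err = outcome - goal
_, _, d_input = forward_net.grads(perf_err)
d_action = d_input[1]                     # derivative w.r.t. the action input
aW1, aW2, _ = action_net.grads(d_action)

# Both models are updated simultaneously by gradient descent.
forward_net.W1 -= lr * fW1
forward_net.W2 -= lr * fW2
action_net.W1 -= lr * aW1
action_net.W2 -= lr * aW2
```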
Jordan and Rumelhart's (1992) framework provides an explicit formulation of the points left unclear in Brehmer's (1990a, 1992) original assertion that the environment model plays a central role in learning in dynamic decision-making tasks. In Jordan and Rumelhart's formulation, an internal or forward model of the environment is formed and revised on the basis of goal-directed interaction with the environment. Furthermore, the importance of this forward model resides in its role of interpreting outcome feedback as the decision maker attempts to learn what actions to take in order to achieve given goals in an evolving context.
Stanley et al. (1989) report on the performance of eleven subjects trained on this task in three sessions taking place over three weeks. Each session was divided into twenty sets of 10 trials, or time steps, during which the subjects attempted to reach and maintain a goal level of 6 thousand tons of sugar production. At the start of each set of trials, initial workforce was always set at 9 hundred and initial production was allowed to vary randomly between 1 and 12 thousand. Subjects were told to try to reach the goal production exactly; however, due to the random element in the underlying system, Stanley et al. scored subject performance as correct if it fell within +/-1 thousand tons of the goal. In addition, at the end of each set of 10 trials, subjects attempted to write down a set of instructions for yoked naive subjects to follow. The relative initial success of these yoked subjects compared with that of purely naive subjects was taken as a measure of the degree of explicit knowledge developed by the original subjects. The instruction writing also had a direct beneficial impact on the performance of the original subjects.
The Sugar Production Factory task contains all of the elements of more general dynamic decision-making environments, with the exception of time pressure. In this regard, Brehmer (1992) has observed that, although removing time pressure may lead to improved performance, the relative effects of other factors on performance are the same. Furthermore, although the task appears fairly simple, it exhibits complex behaviors that are challenging to subjects (Berry & Broadbent, 1984, 1988; Stanley et al., 1989). In particular, due to the lag term P(t), two separate, interdependent inputs are required at times t and t+1 to reach steady-state production. In addition, also due to the lag term, maintaining steady-state workforce at non-equilibrium values leads to oscillations in performance. Finally, the random element allows the system to change autonomously, forcing subjects to exercise adaptive control. The random element also bounds the expected percentage of trials at goal performance to between 11% (for randomly selected workforce values; Berry & Broadbent, 1984) and 83% (for a perfect model of the system; Stanley et al., 1989).
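These dynamics are easy to reproduce in simulation. Because Equation 1 itself is not reproduced in this excerpt, the sketch below assumes the form commonly attributed to Berry and Broadbent (1984), P(t+1) = 2 W(t+1) - P(t) + e, with workforce and production bounded between 1 and 12 task units and e a random shock of -1, 0, or +1; under that assumption, holding workforce fixed at a non-equilibrium value produces the oscillations described above, and trials can be scored with the +/-1 band used by Stanley et al. (1989).

```python
import numpy as np

rng = np.random.default_rng(2)

def sugar_factory_step(workforce, prev_production):
    """One step of the assumed dynamics: production depends on the current
    workforce, the lagged production P(t), and a random shock of -1, 0, or +1
    (all quantities in task units ranging from 1 to 12)."""
    shock = rng.choice([-1, 0, 1])
    return int(np.clip(2 * workforce - prev_production + shock, 1, 12))

goal = 6          # goal production (thousands of tons)
production = 4    # an arbitrary non-equilibrium starting value

# Holding the workforce fixed at a non-equilibrium value lets the lag term
# drive production back and forth rather than letting it settle at the goal.
trace = []
for t in range(10):
    production = sugar_factory_step(workforce=9, prev_production=production)
    trace.append(production)

# Scoring as in Stanley et al. (1989): a trial counts as on target if
# production falls within +/-1 of the goal.
hits = sum(abs(p - goal) <= 1 for p in trace)
print(trace, f"{hits}/10 trials at goal")
```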
As described earlier, the network used two different error signals to train the forward and action models. The predicted outcome generated by the forward model was subtracted from the actual (scaled) production value generated by Equation 1 to produce the error signal for the forward model. The error signal for the action model was generated by subtracting the actual production generated by Equation 1 from the goal level and multiplying the difference by the scale factor.
One training trial with the model occurred as follows. The initial input values, including the goal, were placed on the input units. These then fed forward through the action model hidden layer. A single action unit took a linear weighted sum of the action hidden unit activations, and this sum served as the model's indication of the workforce for the next time period. This workforce value was used in two ways. First, conforming to the bounds stipulated in Stanley et al.'s original experiment, the value was used to determine the next period's production using Equation 1. Second, the unmodified workforce value served as input into the forward model, along with all of the inputs to the action model except the goal. These inputs fed through the forward hidden layer. A single predicted outcome unit computed a linear weighted sum of the forward hidden unit activations, and this sum served as the model's prediction of production for the next period. It is important to note that the forward and action models were trained simultaneously.
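In outline, a single trial therefore looks like the following sketch. It is illustrative only: the hidden-layer sizes, the scale factor, the exact set of input units, the workforce bounds, and the form of Equation 1 are assumptions rather than the reported simulation details.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative weights; hidden-layer sizes and the scale factor are assumptions.
Wa_hid, Wa_out = rng.uniform(-0.5, 0.5, (10, 3)), rng.uniform(-0.5, 0.5, 10)
Wf_hid, Wf_out = rng.uniform(-0.5, 0.5, (10, 3)), rng.uniform(-0.5, 0.5, 10)
scale = 1.0 / 12.0                       # assumed scaling of task units into [0, 1]

goal, production, workforce = 6, 9, 9    # current goal and state of the task

# Action model: (current production, current workforce, goal) -> next workforce.
a_in = np.array([production, workforce, goal]) * scale
action_hidden = sigmoid(Wa_hid @ a_in)
next_workforce = float(Wa_out @ action_hidden)   # linear weighted sum: the action

# The action is mapped back to task units and clipped to the assumed legal
# workforce range before being applied to the environment (Equation 1, assumed
# here to take the form used in the earlier simulation sketch).
applied_workforce = int(np.clip(round(next_workforce / scale), 1, 12))
shock = rng.choice([-1, 0, 1])
next_production = int(np.clip(2 * applied_workforce - production + shock, 1, 12))

# Forward model: the same inputs as the action model except the goal, plus the
# unmodified workforce value, -> predicted production for the next period.
f_in = np.array([production * scale, workforce * scale, next_workforce])
predicted_production = float(Wf_out @ sigmoid(Wf_hid @ f_in))

# Error signal for the forward model: predicted minus actual (scaled) production.
forward_error = predicted_production - next_production * scale

# Error signal for the action model: goal minus actual production, times the scale.
action_error = (goal - next_production) * scale

# Both sets of weights are then updated by back-propagating these two error
# signals, as in the learning sketch above; the two models train simultaneously.
```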
The model was trained under two conditions corresponding to different assumptions about the prior knowledge and expectations that subjects bring to the task. In the first condition, corresponding to no knowledge or expectations, the connection weights of both the forward and action models were set to random initial values sampled uniformly between -0.5 and +0.5. However, using the same task but a different training regimen, Berry and Broadbent (1984) observed that naive human subjects appear to adopt an initial ``direct'' strategy of moving workforce in the same direction that they want to move production. To approximate this strategy, in the second training condition, models were pretrained for two sets of ten trials on a system in which production was commensurate with the size of the workforce, without lagged or random error terms.
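A minimal sketch of this pretraining environment follows; only the absence of the lag and random terms is taken from the text, while the one-to-one proportionality between workforce and production is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def direct_system(workforce):
    """Pretraining environment: production is simply commensurate with the
    workforce, with no lagged production term and no random shock."""
    return workforce            # assumed one-to-one proportionality, in task units

# Two sets of ten trials of pretraining experience, mirroring the text.
pretraining_trials = []
for trial_set in range(2):
    for t in range(10):
        workforce = int(rng.integers(1, 13))     # a random legal workforce value
        pretraining_trials.append((workforce, direct_system(workforce)))

# The forward and action models are then trained on these trials exactly as on
# the real task, giving the models their initial ``direct'' strategy weights.
```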
For both initial conditions, the regimen of training on the Sugar Production Factory task exactly mimicked that of Stanley et al. (1989) for human subjects, as described above, except that no attempt was made to model instruction writing for yoked subjects. In the course of training, back-propagation (Rumelhart et al., 1986) was applied and the weights of both the forward and action models were updated after each trial (with a learning rate of 0.1 and no momentum). To get an accurate estimate of the abilities of the network, 200 instances (with different initial random weights prior to any pretraining) were trained in each experiment.
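Schematically, the regimen for each condition can be summarized as in the loop below; run_one_trial is a placeholder standing in for the trial described above, while the session, set, and trial counts, the learning rate, and the number of network instances are those given in the text.

```python
N_INSTANCES = 200     # independently initialized networks trained per condition
N_SESSIONS = 3        # sessions, as in Stanley et al. (1989)
N_SETS = 20           # sets of trials per session
N_TRIALS = 10         # trials (time steps) per set
LEARNING_RATE = 0.1   # back-propagation learning rate, with no momentum

def run_one_trial(instance, learning_rate):
    """Placeholder for a single control trial: the instance's action and forward
    models are run and their weights updated, as described in the text."""
    return 0  # a real implementation would return, e.g., whether the goal was hit

scores = []
for instance in range(N_INSTANCES):
    for session in range(N_SESSIONS):
        for trial_set in range(N_SETS):
            # Each set restarts the factory with workforce 9 (hundred) and a
            # random initial production between 1 and 12 (thousand).
            for trial in range(N_TRIALS):
                scores.append(run_one_trial(instance, LEARNING_RATE))
```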
By contrast, the pretrained models perform equivalently to human subjects in the first training session, and actually learn somewhat more quickly than do subjects over the subsequent two sessions. This advantage may be due to the fact that the model is not subject to forgetting during an intervening week between each training session. The findings of the current modeling work suggest that the prior knowledge and expectations that subjects bring to the task are critical in accounting for their ability to learn the task as effectively as they do. Accordingly, the remainder of the paper presents data only from models with pretraining.
Why should pretraining, particularly on a system that differs in important respects from the Sugar Production Factory, improve learning in the task? Pretraining provides the model with a coherent set of initial parameter estimates describing the system's behavior. Although these initial parameters do not describe the true system well, the model is systematic in applying them when attempting to control the system. By contrast, models without pretraining do not have the benefit of a coherent (albeit incorrect) set of parameter estimates when starting the Sugar Production Factory task. Thus, their initial attempts to control the system do not show the same systematicity, and their learning does not have the advantage of adjusting an already coherent set of parameters.
Although there is substantial variability over the course of training, the subject appears to show a breakpoint around training set 30, when the improvement in performance is much more dramatic than at any prior or subsequent time. There is no apparent breakpoint for the model (and other models are broadly similar). One possibility is that the subject (but not the model) acquired an explicit insight into the behavior of the underlying system at the time of the breakpoint. To test this possibility, Stanley et al. (1989) analyzed the performance of the original subjects for breakpoints. They hypothesized that instructions that these subjects wrote for their naive yoked partners immediately after these breakpoints would have a significant positive impact on the naive yoked partners' performance, thereby indicating a link between explicit understanding of the system and performance. However, this hypothesis was not confirmed; instructions written just after breakpoints were no more effective than those written just prior to breakpoints in guiding yoked subjects. Thus, it appears that the breakpoints do not represent a measurable increase in subjects' verbalizable knowledge about controlling the task. Furthermore, not all subjects exhibited clear breakpoints in learning. Nonetheless, the contrast between subject and model performance suggests that human learning may be more subject to rapid transitions than model learning (but see McClelland, 1994; McClelland & Jenkins, 1990, for examples of staged learning in connectionist networks).
The initial over- and under-correction is a hallmark of the model's systematic application of its pretrained conceptualization of the system. Attempting to bring about a change in production by a commensurate change in the workforce has the effect of increasing oscillation in production at non-equilibrium values. As training progresses, the model slowly revises its internal model of the system, as represented in its parameter estimates. By training set 60, the model has overcome its tendency to over- and under-correct.
As can be seen in Figure 6, model performance in the deviations condition starts out slightly better than in the base condition in the first session, then slowly diverges over the next two sessions until it is almost a full point below the base condition in the third session. The reason for the divergence appears to be as follows. The size of the error term relative to the action the model is trying to modify is larger in the deviations condition than in the base condition. At the beginning of learning, models in both conditions are trying to produce relatively large modifications in workforce (i.e., the error term is large in both conditions), so the difference between conditions is not apparent. Later in learning, however, the modifications that both models make to the workforce levels they are learning to set become finer. It is here that the difference in the size of the error term relative to the action to be modified becomes significant and affects learning performance.
Similar effects of feedback magnitude have been found in human learning. In a repeated prediction task, Hogarth, McKenzie, Gibbs, and Marquis (1991) found that subjects' performance was influenced by the absolute scale of the feedback they received. In particular, subjects receiving feedback with low-magnitude variance tended to undercorrect, whereas those receiving feedback with high-magnitude variance tended to overcorrect.
This model's approach may be contrasted with alternatives that rely on explicit hypothesis testing or sequences of training trials to initiate learning. Explicit hypothesis testing would imply that improved verbal knowledge of the task would co-occur with improved performance. However, the results of Stanley et al. (1989) indicate that improved verbal knowledge occurs well after improved performance.
Two sets of authors present theories that require sequences of attempts at controlling the system to initiate learning. First, Mitchell and Thrun (1993) present a learner implemented as a neural network that attempts to pick the best action based on its existing model of the environment. This model is updated based on its assessed accuracy in predicting the outcome of a sequence of trials once that sequence has occurred. Second, Stanley et al. (1989) conjecture that performance in the Sugar Production Factory depends on the learner's ability to make analogies between the current situation and prior (successful) sequences of examples. Thus, in this scheme, knowledge can be said to increase every time a successful sequence is encountered and retained. The model proposed here differs fundamentally from these two approaches in that it is able to use information from both successful and unsuccessful single control trials to alter its parameters (connection weights) to reduce the error in its performance. In particular, this property of the model is critical in producing a relatively rapid decrease in production oscillations as training progresses. If implemented to perform the Sugar Production Factory task, neither Mitchell and Thrun's nor Stanley et al.'s approach seems likely to produce similarly rapid decreases in oscillations.
Clearly, the model presented here has several limitations. It does not account for meta-strategies such as planning how to learn in the task. It also does not account for how verbalizable knowledge is acquired during learning. Finally, it does not account for how relevant information presented across multiple time steps might be integrated while learning to perform in dynamic decision-making tasks. Empirical validation of the predictions made so far, together with work addressing this last limitation, is the focus of ongoing research. Even with these limitations, the model constitutes one of the first explicit computational formulations of how subjects develop and use an internal model of the environment in learning to perform dynamic decision-making tasks.
Berry, D. C., & Broadbent, D. E. (1984). On the relationship between task performance and associated verbalizable knowledge. Quarterly Journal of Experimental Psychology, 36A, 209-231.

Berry, D. C., & Broadbent, D. E. (1988). Interactive tasks and the implicit-explicit distinction. British Journal of Psychology, 79, 251-272.

Brehmer, B. (1990a). Strategies in real-time, dynamic decision making. In R. Hogarth (Ed.), Insights from decision making. Chicago: University of Chicago Press.

Brehmer, B. (1990b). Variable errors set a limit to adaptation. Ergonomics, 33, 1231-1239.

Brehmer, B. (1992). Dynamic decision making: Human control of complex systems. Acta Psychologica, 81, 211-241.

Brehmer, B. (in press). Feedback delays in complex dynamic decision tasks. In P. Frensch & J. Funke (Eds.), Complex problem solving: The European perspective. Hillsdale, NJ: Lawrence Erlbaum Associates.

Edwards, W. (1962). Dynamic decision theory and probabilistic information processing. Human Factors, 4, 59-73.

Hogarth, R. M. (1981). Beyond discrete biases: Functional and dysfunctional aspects of judgmental heuristics. Psychological Bulletin, 90, 197-217.

Hogarth, R. M. (1986). Generalization in decision research: The role of formal models. IEEE Transactions on Systems, Man, and Cybernetics, 16, 439-449.

Hogarth, R. M., McKenzie, R. M., Gibbs, B. J., & Marquis, M. A. (1991). Learning from feedback: Exactingness and incentives. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(4), 734-752.

Jordan, M. I. (1992). Constrained supervised learning. Journal of Mathematical Psychology, 36, 396-425.

Jordan, M. I. (in press). Computational aspects of motor control and motor learning. In H. Heuer & S. Keele (Eds.), Handbook of perception and action: Motor skills. New York: Academic Press.

Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3), 307-354.

McClelland, J. L. (1994). The interaction of nature and nurture in development: A parallel distributed processing perspective. In P. Bertelson, P. Eelen, & G. d'Ydewalle (Eds.), International perspectives on psychological science, Volume 1: Leading themes (pp. 57-88). Hillsdale, NJ: Lawrence Erlbaum Associates.

McClelland, J. L., & Jenkins, E. (1990). Nature, nurture, and connections: Implications of connectionist models for cognitive development. In K. VanLehn (Ed.), Architectures for intelligence (pp. 41-73). Hillsdale, NJ: Lawrence Erlbaum Associates.

Mitchell, T. M., & Thrun, S. B. (1993). Explanation-based neural network learning for robot control. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 287-294). San Mateo, CA: Morgan Kaufmann.