Downing, L. L. (1994). Criterion shaped behaviour: Pitfalls
of performance appraisal.
International Journal of Selection and Assessment, 2,
1-21.
PART 2
Leslie L. Downing
State University of New York
College at Oneonta
The Theory of Criterion Shaped Behavior (CSB) herein developed is based upon well established principles from the field of psychology. It relies extensively upon the traditional areas of learning and of measurement, and it also utilizes many of the findings in social psychology, as they relate to biases in person perception, and in cognitive psychology, especially concerning issues in information processing. The intended function of the theory is to make explicit the effects that the imposition of performance evaluation systems can be expected to have on the behaviors of those being evaluated. The Theory of Criterion Shaped Behavior is based upon a broader Universal Theory of Performance, to be developed in the next section of this paper. Specifically, it predicts behavioral consequences, derived from the Universal Theory, that are likely to result from the imposition of evaluation systems having various features and characteristics.
UNIVERSAL THEORY OF PERFORMANCE
Learning theorists, especially
the Skinnerians (e.g., Skinner, 1938, 1953), have thoroughly developed
the concept of behavior shaping as related to changing the behavioral criteria
upon which contingent reinforcement is based. The Criterion Shaped
Behavior Theory (CSB Theory) developed here simply makes more explicit,
than is usually the case, exactly how these phenomena relate to issues
involving performance evaluations of people in academic and other organizational
settings. Behavior shaping in the field of learning is the means
by which organisms, whether laboratory rats or school teachers, increase
their frequency of engaging in whatever behaviors are systematically followed
by reinforcement. The behavior expected to increase is not a sequence
of motor responses, but is rather a class of behaviors, called operants,
the occurrence of which is systematically followed by contingent reinforcement.
For a rat in an operant conditioning box, for example, the operant response
class may include all behaviors that have the effect of depressing the
lever sufficiently to trigger the switch that releases the food pellet
reinforcer into the food cup.
In other terms, the operant
class is defined by a set of performance criteria, which if satisfied result
in reinforcing consequences. The Skinnerians have frequently demonstrated
that changes in these criteria result in changes in behavior (Skinner,
1953). Specifically, behaviors are increased when they satisfy the
criterion, or set of criteria, upon which reinforcement is based.
Changes in the criterion result in shaping of, or gradual acquisition of,
whatever behaviors satisfy the new criterion, assuming that the contingent
consequences are in fact reinforcing.
Where people's performances
are being evaluated, and where subsequent administration of valued consequences
is made contingent upon some criterion of that performance, it is expected
that the class of behaviors, operants, that result in such consequences
will increase in strength, frequency, or probability of occurrence.
Few will argue with this basic prediction. In fact the entire rationale
for improving performances of teachers, etc., by making them accountable,
and by differentially rewarding individuals based upon performance scores,
is dependent upon the validity of such a prediction.
When psychologists disagree
with each other, it is not usually on the validity of the basic predictions,
but is on which theoretical system best explains and describes the mediating
effects for the performances of people in varied situations. Few
contemporary psychologists accept the Skinnerian view as applied to such
explanation, which considers only directly observable variables and negates
the relevance of internal or hypothetical constructs, such as expectancy
and cognition. Cognitive/expectancy theory (e.g., Tolman, 1932, 1948),
and social learning theory (e.g., Rotter, 1972; and Bandura, 1977a) have
led the way to most current thinking about the mechanisms responsible for
mediating the relationship between responses, contingent reinforcement,
and changing or shaping of criterion relevant to performance. In
this view, prior reinforcement has an effect because it alters the expectancy
that a response, or class of responses, will be followed by contingent-valued
outcomes. It is the change in expectancy that produces a change in
behavior as the individual attempts to maximize some hypothetical outcome
(O), expected value (EV), or subjective expected utility (SEU). An
SEU for a given response is the subjectively experienced expectation that
that response will be followed by a contingent outcome, multiplied by the
subjective value attached to that outcome. Where more than one outcome
of a response is possible, the SEU will be the sum of the separate SEUs
associated with those outcomes. The SEU associated with a given performance
(P), will be referred to as SEUp.
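The arithmetic of this definition can be illustrated with a short sketch. The function name and the numerical expectancies and values are hypothetical, chosen only for illustration; the definition above is verbal and supplies no numbers.

```python
from typing import Iterable, Tuple


def subjective_expected_utility(outcomes: Iterable[Tuple[float, float]]) -> float:
    """Sum expectancy * value over the possible outcomes of one response.

    Each pair is (subjective expectancy between 0 and 1, subjective value).
    """
    return sum(expectancy * value for expectancy, value in outcomes)


# Hypothetical performance P with two possible contingent outcomes.
seu_p = subjective_expected_utility([(0.5, 10.0), (0.25, 4.0)])
print(seu_p)  # 0.5 * 10.0 + 0.25 * 4.0 = 6.0
```

With a single possible outcome the sum reduces to one expectancy multiplied by one value, as in the basic definition.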
While most performance theories
limit themselves to describing influences surrounding a single performance,
for our purposes it is necessary to describe, also, the role of other available
response alternatives. Each non-performance (NP), will also have
a subjective expected utility associated with it (SEUnp). Unless
otherwise specified, the SEUnp is to be understood as the highest SEU available
for a non-performance alternative response that is incompatible with the
occurrence of P.
Motivation to perform (Mp), or to not perform (Mnp), will be
based upon these SEUs, but will be further conditioned by the expectation
that Effort to Perform (Ep), or to not perform (Enp), will in fact produce
the performance in question. Bandura's (1977b) concept of "self-efficacy"
likewise concerns subjective beliefs about one's ability to convert effort
into performance. Even this Motivation to Perform does not, however,
directly translate into performance, for many factors influence whether
that effort will be successful. Chief among these is Ability, but
others, including environmental factors, may impose additional Constraints.
Many versions of expectancy
theory have been developed, using some or all of the concepts described
above. The version developed here is largely an elaboration and combination
of several of these (cf. Tolman, 1948; Vroom, 1964; Porter and Lawler,
1968; Rotter, 1972; and Bandura, 1977a, 1977b), with the role of alternative
available responses owing a debt to the Thibaut and Kelley (1959) social
exchange theory concept of Comparison Level for Alternatives. The
result is what I will call a Universal Theory of Performance. The
theory is based upon a set of variables, and relationships between variables,
as delineated in Appendix 1. While complete understanding of the
Universal Theory of Performance may be dependent upon examination of these
Appendix 1 definitions and theoretical statements, it is hoped that the
general reader will be capable of understanding the implications of the
theory for Criterion Shaped Behavior from the following discussion.
Basically, the theory states that
the Motivation to perform a behavior is a direct result of one's subjectively
held expectation that trying to perform the behavior will likely result
in that behavior, which will itself be followed, with high probability,
by consequences that are highly valued. This "subjective expected
utility" of effort to perform, relative to a comparable "subjective expected
utility" of effort to do something else, determines Motivation to perform.
Actual Performance, however, only follows from Motivation, to the extent
that one possesses an Ability to perform, and to the extent one is not
subject to environmental Constraints that prevent or inhibit performance.
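One possible reading of these relationships is sketched below. The formal statements belong to Appendix 1, which is not reproduced here, so the multiplicative and subtractive forms, the function names, and all numbers are assumptions made for illustration rather than the theory's own equations.

```python
def motivation(effort_expectancy: float, seu: float) -> float:
    """Assumed form: expectancy that effort produces the behavior, times its SEU."""
    return effort_expectancy * seu


def performance(m_p: float, m_np: float, ability: float, constraint: float) -> float:
    """Assumed form: motivation relative to the best alternative, scaled by
    ability (0 to 1) and reduced by constraints (0 to 1). Never below zero."""
    relative_motivation = max(m_p - m_np, 0.0)
    return relative_motivation * ability * (1.0 - constraint)


# Hypothetical values: effort to perform is believed to succeed half the time.
m_p = motivation(effort_expectancy=0.5, seu=8.0)   # 4.0
m_np = motivation(effort_expectancy=1.0, seu=2.0)  # 2.0

print(performance(m_p, m_np, ability=0.5, constraint=0.5))  # 0.5
print(performance(m_p, m_np, ability=0.5, constraint=1.0))  # 0.0, fully constrained
print(performance(m_np, m_p, ability=1.0, constraint=0.0))  # 0.0, alternative preferred
```

Under these assumptions, high motivation yields no performance when constraints are complete, and high ability yields none when the alternative response is the more attractive one.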
The Universal Theory
of Performance, as applied to situations in which individuals are evaluated
with criterion measures of their performance, and compensated in relation
to the scores that they obtain, can be roughly stated as follows:
Individuals will be MOTIVATED to increase the rate of those behaviors that they believe will most reliably and with least effort increase their scores on Actual Criterion Measures of performance; to the extent that highly valued outcomes are expected to be contingent upon such increases; and to the extent that other, incompatible behaviors, are not perceived as being more efficient at achieving desirable outcomes.
Actual PERFORMANCE will be positively related to such motivation to the extent that ability to perform is high, and constraints preventing performance are low.
Criterion Shaped Behavior (CSB), is defined as any change in behavior that results from efforts to increase one's score on an Actual Criterion Measure of Performance. Probability of performance of such CSBs will be a function of the Motivational, Ability, and Constraint variables of the Universal Theory of Performance.
Whether the behaviors intended to increase scores will be desirable or undesirable is dependent upon the specific characteristics of the performance measure. The system we use for clarifying the important characteristics of performance evaluation systems is elaborated in the following section.
MEASUREMENT THEORY AND CRITERION SHAPED BEHAVIOR
The CSB Theory utilizes many
concepts from traditional measurement theories. The concept of the
Ultimate
Criterion (Thorndike, 1949) involves both a perfect, hypothetical,
measure of "ideal" performance, and the set of all factors (i.e., behaviors
and characteristics) assessed by such a measure. In our terms these
are respectively the Ultimate Criterion Measure and the Ultimate Criterion
Factors.
The Actual Criterion Measure is the specific instrument
used to assess performance scores of those being evaluated. Actual
Criterion Factors are all of the behaviors and characteristics of those
being evaluated that can influence scores on the Actual Criterion Measure.
Figure 1 represents these Ultimate and Actual Criterion Measures as two overlapping circles, and identifies the three resulting sectors as Criterion Contamination, Criterion Relevance, and Criterion Deficiency.
Figure 1: Overlap of the Ultimate and the Actual Criterion
Measures
Criterion Relevant Measures and Factors are those Actual Criterion Measures and Factors that overlap Ultimate Criterion Measures and Factors. Criterion Relevance is, essentially, the validity of the Actual Criterion Measure.
Criterion Contamination Measures and Factors refer to that portion of the Actual Criterion Measures and Factors that does not overlap with Ultimate Criterion Measures and Factors. Criterion Contamination is one source of invalidity of the Actual Criterion Measure.
Criterion Deficiency Measures and Factors refer to that portion of the Ultimate Criterion Measures and Factors that does not overlap with Actual Criterion Measures and Factors. Criterion Deficiency is one source of invalidity of the Actual Criterion Measure.
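The three sectors can be expressed as ordinary set operations on the two sets of factors. The sketch below is illustrative only: the function name is hypothetical, and the factor labels are borrowed loosely from examples used elsewhere in this paper.

```python
def criterion_sectors(ultimate: set, actual: set) -> dict:
    """Partition factors into the three sectors of the two overlapping circles."""
    return {
        "relevance": actual & ultimate,      # measured and part of ideal performance
        "contamination": actual - ultimate,  # measured but not part of ideal performance
        "deficiency": ultimate - actual,     # part of ideal performance but not measured
    }


# Hypothetical factor sets for a school evaluation system.
ultimate_factors = {"quantity of graduates", "quality of education"}
actual_factors = {"quantity of graduates", "rater bias"}

sectors = criterion_sectors(ultimate_factors, actual_factors)
for name, factors in sectors.items():
    print(name, sorted(factors))
```

A perfectly valid measure would leave both the contamination and the deficiency sets empty.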
For purposes of assessment
validity, some theorists argue that failure of a measure, the Actual Criterion,
to fully capture the Ultimate Criterion, i.e. Criterion Deficiency,
may not be very important (Landy & Trumbo, 1978, p. 138). If
a measure is assumed to be non-reactive, then it is true that measuring
only a part of the Ultimate Criterion is sufficient, as long as that part
is highly correlated with the Ultimate Criterion as a whole. Once
a measure becomes reactive, however, this approach is doomed to failure,
for not only will its use lead to undesirable Deficiency CSBs, to be described
below, but as a result of this effect the correlation of Actual Criterion
Measure scores with the Ultimate Criterion will steadily decrease.
Koretz (1989) has convincingly argued, without the aid of a systematic
theory of Criterion Shaped Behavior, that the previously mentioned NAEP
assessment test will be invalidated by attempts to broaden its use by rewarding
states or school systems whose students achieve the highest scores (U.S.
Department of Education, 1987). The inherent Criterion Deficiencies
in the NAEP test now in use are not much of a problem, because the test is
non-reactive: it is used only for purposes of assessment. Rewarding
high scorers will make it reactive, and problems of deficiency will result
in undesirable behaviors (e.g., teaching to the test) which will in time
render scores meaningless.
Criterion Contamination
is an acknowledged problem for assessment, for it directly invalidates
a measure by allowing factors other than performance to influence scores.
As we will see, both Contamination and Deficiency are major problems for
Feedback and Criterion Shaped Behavior functions of performance evaluation.
Criterion Shaped Behavior (CSB)
Criterion Shaped Behavior refers to any change in behavior that results from Motivation to increase one's score on an Actual Criterion Measure of Performance. In the Universal Theory of Performance, Motivation to Perform, Mp, and Performance Behavior, Bp, can refer to any behavior one cares to specify. The behavior of major interest in a performance evaluation situation is behavior that increases one's score on the Actual Criterion Measure. In fact, this is a set of three types of behaviors, any one of which may result in such increased scores. Of these three types of CSB, one is desirable, from the point of view of the organization employing the evaluation system, and two are undesirable.
Desirable CSBs
Relevant CSBs. Any changes in behavior resulting from efforts to increase scores on the Actual Criterion Measure are desirable (for the evaluator) to the extent that they increase Ultimate Criterion Factors. The only behaviors that increase both Actual and Ultimate Criterion Factors are those that affect Criterion Relevant Factors. Using the Universal Theory of Performance, we can predict that:
To the extent that a Relevant CSB is low in effort and is high in both the expectation of contingent consequences and in the subjective value placed upon those consequences, and to the extent that one believes effort to produce that behavior will be successful in doing so, the strength of motivation to enact that CSB will be high.
To the extent that any incompatible non-performance is believed to be a more efficient alternative means of achieving valued outcomes, performance motivation will be low.
Assuming a high level of performance motivation, performance behavior will be high to the extent that ability to perform is high and constraints inhibiting performance are low.
It is exactly this variety of CSBs that advocates of evaluation-based merit awards envision. What is typically assumed is that valued outcomes made contingent upon some performance will lead to increases in such performance. In fact, such increases are not expected if effort is too high, if contingent rewards are of insufficient value, if the perceived contingency between effort to perform and performance, or between performance and valued outcomes is too low, if ability to perform is too low, or environmental constraints inhibiting performance are too high. Nor will performance be increased if alternative non-performance behaviors are perceived to be more efficient at achieving valued outcomes.
This cumbersome list of limitations
is presented here to convey an important point. Though increases
in Criterion Relevant Behavior (Relevant CSBs) may occur following imposition
of a performance-based reward system, there will be numerous situations
in which such an increase is not to be expected. If it is correctly assumed,
for example, that increased teaching effectiveness behaviors are believed
to lead to highly valued contingent outcomes (e.g., merit money), then
these Relevant CSBs may increase; but not if the teacher perceives that
the effort required to increase scores on the actual criterion measure
by such relevant behaviors is too high. This may be the case if a
teacher believes that many extra hours would be required, or additional
schooling, or learning a new and possibly intimidating technology, e.g.,
computers. The incentive would not be expected to be effective if the perception
of a contingency between the desired performance and the valued outcome
were unclear or lacking in credibility, as may occur for a teacher who
has little trust in the administration; nor would it be effective for a
teacher who has little faith in his or her ability to convert effort into
successful performance, as would be expected for teachers lacking sufficient
internal locus of control, or self-esteem. Even a teacher who is
motivated to exert the effort for improved performance will fail to demonstrate
an increase if necessary abilities, such as intelligence, skills, or previously
acquired knowledge or training are lacking, or if environmental constraints
prevent effort from being converted into desired improvement in performance.
Such constraints as having inadequate facilities or equipment, ill prepared
students, or inadequate administrative support may lead current efforts
to fail, which can lay the groundwork for a loss of motivation for even
attempting improvement in the future.
Perhaps most disturbing
is the fact that incentives will only be expected to work if the expected
increase in valued outcomes for desired improvement in performance is greater
than what is perceived to be available from incompatible alternative behaviors.
If tending bar or waiting tables is believed to be a more efficient means
of increasing one's income, or if coaching Little League is a more efficient
means of achieving a valued sense of accomplishment, or if running for
the city council is an easier way to fulfill one's power or status needs,
these activities will increase at the expense of improved teaching performance.
If we assume that all of
these problems have been solved, and that increases in Relevant CSBs will
indeed occur, it is still important to note that only some of the behaviors
that constitute ideal performance have been measured, those in the Criterion
Relevance sector of Figure 1. Those desirable behaviors that have
not been measured are in the Criterion Deficiency sector, and none of these
will be expected to increase. Behaviors that have not been assessed
by the Actual Criterion Measure, cannot influence scores on that measure
and thus valued outcomes cannot be contingent on them. Therefore, the SEUs
associated with performance, or with efforts to perform such behaviors,
will not be increased by a merit system. Of course this is not a
problem if we have a perfectly valid measure with which all Ultimate Criterion
Factors are fully assessed. The fact is, however, that all Actual
Criterion Measures are deficient to some degree, and matters are made worse
by the fact that they are frequently deficient with respect to major or
even essential factors.
Assume, for example, a system
that is very good at adequately measuring quantities (e.g., How many students
graduated?). Such quantitative factors will very likely get measured,
and will fall in the Criterion Relevant sector. Now assume that it
is very bad at measuring qualities (e.g., How educated had they become?).
Such qualities are likely to not get measured at all, and so these will
very likely fall in the Criterion Deficient sector. What results
is a merit system that is quite effective at increasing the quantity of
graduates, but is less effective at increasing the quality of education.
This point is raised also in discussions of test formats, especially concerning
the widespread use of multiple choice tests (Frederiksen, 1984).
Skills of students that are readily measured by such tests do get measured,
and as a result the system responds by teaching those skills, leading to
increases, or at least prevention of decreases, in such performance over
time. Skills difficult to measure by such tests do not get measured,
hence do not get taught, and consequently performance fails to increase,
and may even decrease.
In summary, those desirable
behaviors that do get measured (Relevant CSBs) may increase as a result
of imposing a performance-based reward system, but in order for such an
increase to occur many separate conditions, involving effort, ability,
constraints, and the perceptions of values and contingencies for both performance
and for non-performance behaviors must be satisfied. Even then, no
increase should be expected for those desirable behaviors that are not
measured, those in the Criterion Deficiency section of Figure 1.
While failure of the system to increase desirable behaviors is of major
importance, a potentially more important consideration is the possibility
that performance evaluation will produce an increase in undesirable behaviors.
Undesirable CSBs
Two separate classes of Criterion Shaped Behavior, both of which are undesirable, can be defined in relation to the two non-overlapping segments of Figure 1.
Deficiency CSBs. It was shown above, as a limitation on desirable increases in Criterion Relevant Factors, that important or essential factors may fall in the Criterion Deficiency category and consequently not be subject to such increases. What often occurs, however, is that desirable behaviors that fall in the Criterion Deficient sector not only fail to increase, but will actually decrease. The major reason for this decrease is that other behaviors, those that do increase scores on the Actual Criterion Measure, will increase, leaving less time or energy available for behaviors which fail to increase these scores. In terms of the Universal Theory of Performance, any performance will decrease to the extent that some alternate non-performance is perceived to be more efficient at achieving valued outcomes. Thus any anticipated increase in Relevant CSBs, or in the yet to be described Contamination CSBs, will be expected to result in a decrease in those desirable behaviors falling in the Deficiency sector. For example, if the quantity of performance is assessed (Criterion Relevant Factor), but its quality is not (Criterion Deficient Factor), a likely result is a decrease in quality. Thus Deficiency CSBs are those which decrease desirable behaviors in the Criterion Deficiency category.
This is what is occurring in the examples given by Frederiksen (1984) concerning the decrease in teaching efforts in areas not measured by standardized tests of student achievement. If teachers' merit increases are given for increases in their students' test scores, we expect a deficiency-shaped decrease in teacher efforts, however worthy, that fail to be reflected in those scores. In fact, the greater the incentive associated with teacher performances that do get measured, the greater will be the decrease in those performances that are not measured. Critics of a state-wide performance appraisal system in the Texas schools, without the aid of a systematic theory of why or how undesirable effects might be expected, have argued that good teachers have been prevented from engaging in effective teaching by the need to enhance students' scores on state-wide standardized tests. It was contended that "teaching to the tests" resulted in a decline in many aspects of learning that were not measured by these tests. Whether or not such fears are likely to be valid would depend upon the extent to which the tests used were Criterion Deficient.
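The claim that a larger incentive for measured performance produces a larger decrease in unmeasured performance can be illustrated with a toy time budget. The proportional allocation rule, the function names, and every number below are assumptions made for illustration; the theory itself says only that relatively more efficient alternatives draw effort away from less efficient ones.

```python
def allocate_hours(payoff_per_hour: dict, budget: float) -> dict:
    """Assumed rule: split a fixed time budget in proportion to perceived payoff."""
    total = sum(payoff_per_hour.values())
    return {name: budget * payoff / total for name, payoff in payoff_per_hour.items()}


def teacher_hours(merit_incentive: float, budget: float = 40.0) -> dict:
    """Hours spent on measured and unmeasured teaching for a given incentive."""
    payoffs = {
        "measured": 1.0 + merit_incentive,  # intrinsic value plus merit reward
        "unmeasured": 1.0,                  # intrinsic value only
    }
    return allocate_hours(payoffs, budget)


print(teacher_hours(0.0))  # {'measured': 20.0, 'unmeasured': 20.0}
print(teacher_hours(3.0))  # {'measured': 32.0, 'unmeasured': 8.0}
```

Because the budget is fixed, every hour gained by measured behaviors in this sketch is an hour lost by behaviors in the Deficiency sector.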
Contamination CSBs.
Motivated increases in behaviors
which increase Actual Criterion Measure scores by increasing Criterion
Contamination Factors, called Contamination CSBs, can have multiple
undesirable effects. Contamination CSBs increase scores on the Actual
Criterion Measure without increasing the types of behaviors that increase
Relevant Factors. These are undesirable for two different types of
reasons.
First of all, they
take time away from behaviors that would produce desirable effects in both
the Criterion Relevant and the Criterion Deficient categories. One
example of Contamination is rater bias. To the extent that
the Actual Criterion Measure is influenced by a bias of an evaluator or
rater, the validity of the performance measure is reduced. In this
case, invalidity results from scores reflecting not only desired performance
behaviors, but also how much the rater likes or is otherwise biased for
or against the person being rated. Contamination shaped behavior
involves the active manipulation of the score one receives by the performance
of behaviors not included in the Ultimate Criterion. In this
case, behaviors that exploit the potential for such bias are likely to
increase. It is the motivated increase in such behaviors that I call
Contamination
CSBs. If the Actual Criterion Measure contains a Contaminating
rater bias factor, and a Relevant performance quantity factor, and is Deficient
in assessing the performance quality factor, then "buttering up the boss"
to enhance one's score through Criterion Contaminating rater bias, could
lead to a reduction in both quality, and quantity. While both "buttering
up the boss," and increasing performance quantity would increase one's
score on the Actual Criterion Measure, since both require time, any time
spent on one takes away from time available to spend on the other.
The choice of which response is to be made will depend upon their perceived
relative efficiency at achieving valued outcomes. If "buttering up
the boss" is more efficient than increasing or maintaining performance
quantity, then an actual reduction in quantity might occur. Thus,
Contamination
CSBs can result in decreases in all types of Ultimate Criterion Factors,
both those in the Criterion Relevance category and those in the Criterion
Deficiency category.
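The example just described can be put in rough numerical form. In the sketch below the per-hour rates, the starting allocation, and the rule that half of all time moves to the most score-efficient behavior are hypothetical; they show one way in which the Actual Criterion score could rise while Ultimate Criterion performance falls.

```python
# Hypothetical effect of one hour of each behavior on the two criteria.
SCORE_PER_HOUR = {"quantity": 2.0, "quality": 0.0, "flattery": 3.0}     # Actual
ULTIMATE_PER_HOUR = {"quantity": 2.0, "quality": 2.0, "flattery": 0.0}  # Ultimate


def total(hours: dict, rate: dict) -> float:
    """Weighted sum of hours by the given per-hour rate."""
    return sum(hours[name] * rate[name] for name in hours)


def shift_toward_most_efficient(hours: dict, score_rate: dict, fraction: float) -> dict:
    """Move a fraction of all time to the behavior that raises the score fastest."""
    best = max(score_rate, key=score_rate.get)
    shifted = {name: h * (1.0 - fraction) for name, h in hours.items()}
    freed = sum(hours.values()) - sum(shifted.values())
    shifted[best] += freed
    return shifted


before = {"quantity": 5.0, "quality": 5.0, "flattery": 0.0}
after = shift_toward_most_efficient(before, SCORE_PER_HOUR, fraction=0.5)

print(total(before, SCORE_PER_HOUR), total(after, SCORE_PER_HOUR))        # 10.0 20.0
print(total(before, ULTIMATE_PER_HOUR), total(after, ULTIMATE_PER_HOUR))  # 20.0 10.0
```

Here the measured score doubles while both quantity and quality receive less time, which is the pattern described for Contamination CSBs.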
Secondly, Contamination
CSBs may have other undesirable features, beyond their effects of reducing
desirable behavior. Cheating on a test, for example, a type
of Contamination CSB, may not be incompatible with desired performance,
thus may increase one's score on the Actual Criterion Measure without reducing
either Criterion Relevant or Criterion Deficiency categories of behavior.
It might be argued that such an effect is not detrimental to performance,
but simply fails to facilitate performance. One might be concerned,
however, that cheating has additional negative impact upon the system,
by creating distrust, cynicism, and feelings of inequity, especially in
those who refrained from cheating. Our point is that such an appraisal
system is not only without merit, for it fails to increase relevant behavior,
but that it may have negative side-effects which make the consequences
of using it worse than those of having done no appraisal at all.
The CSB Theory clearly delineates
three categories of behavior change that may result from imposition of
an appraisal system. These are increases in Criterion Contamination
Factors (Contamination CSBs), increases in Criterion Relevant Factors
(Relevant CSBs), and decreases in Criterion Deficiency Factors (Deficiency
CSBs). Of these, only increases in Relevant CSBs are desirable. Ultimate
Criterion Factors that are not measured are expected to decrease, and Actual
Criterion Factors that don't reflect Ultimate Criterion Factors are expected
to increase. We can even expect a decrease in those Ultimate Criterion
Factors that are assessed (i.e., Criterion Relevant Factors) to the extent
that increasing scores through Contamination CSBs is perceived to be more
efficient.
The CSB Theory can
be used to help design evaluation systems that maximize the occurrence
of desirable and minimize the occurrence of undesirable behavior change.
It can sensitize evaluators to the potential for negative as well as positive
consequences of imposing evaluation systems, and it can possibly promote
a more enlightened debate about when, where, and in what way, we should
impose evaluation systems upon individuals or institutions in educational
as well as in other organizations.
Others have at times addressed
some of the issues developed here (cf. Popham, 1983; Frederiksen, 1984;
Koretz, 1989) as they relate to systems of evaluation in the field of education.
Some of what CSB Theory predicts has been pointed out less systematically
in Kerr's (1975) discussion on, "The folly of rewarding A while hoping
for B." None of these, however, employs a theory, model, or framework
with respect to which the numerous pitfalls of evaluation systems can be
anticipated, or with the help of which better systems might be devised.
Hopefully CSB Theory as developed here will provide the necessary guidance
for designing more adequate systems for performance evaluation for the
future.