Evaluation: Principles and Pitfalls

October 1, 1973 Paul R. Lehman


Do students learn more under a pass/fail grading system, freed from the pressure of traditional grades? What does a grade of A really mean? Can essay tests be graded as reliably as objective tests? Is it fair to include optional questions on essay tests? Not one of these questions can be answered simply or definitively, yet they are representative of the many and varied questions that have confronted college and university faculty members in recent years. It is the purpose of this article to raise a few questions such as these, comment on related research or opinion, and perhaps point out a few principles which can help the individual instructor arrive at evaluation procedures to meet his needs and the needs of his students.

One of the most remarkable developments in higher education in recent years has been the widespread acceptance of pass/fail grading. According to a 1971 study, 61% of 1,301 member institutions of the American Association of Collegiate Registrars and Admissions Officers were using some form of pass/fail or credit/no-credit grading in at least some courses, and 2% were using pass/fail exclusively.¹ The advantages claimed for pass/fail are that the student is relatively free of anxiety, that he is able to explore the subject matter in his own way, and that he is able to investigate unfamiliar academic fields without risking his grade-point average (GPA). The disadvantages are that the student may do less work and that the system fails to distinguish between varying levels of achievement, except for those who fail.

Although the research evidence is still meager, there are some indications that the advantages of pass/fail have been overestimated. In one study only 9% of the students electing pass/fail courses did so in order to explore new academic fields; 87% reported that their reason was to devote more study time to other courses or to avoid the risk of a low grade in a difficult course. Lower motivation than in graded courses was reported by 32%.² Further, there is evidence that transfer students and graduate school applicants with large numbers of pass/fail courses on their records may face difficulties or delays. When some but not all of the grades of a prospective transfer student are non-traditional, 36% of the institutions surveyed accept such credit without question, while 31% request further information concerning the quality of the applicant's work, 9% place a limit on such credits, and 22% have not yet developed policies.³

The fundamental difficulty is that despite all their weaknesses grades do make it possible to make distinctions among individuals based on their relative achievement. Since these distinctions are for the most part obliterated under the sweeping designation of "pass," this mark simply fails to convey information and thus does not satisfactorily serve the function that a grade should serve. Further, many professors regard it as unfair to the student who would otherwise achieve a high grade to group him in the same category with the student who would otherwise achieve a low but passing grade.

It has been claimed that students would learn best of all in the absence of any evaluation whatever. This claim is inherently incapable of validation but is unlikely to be taken seriously.

Still, there is some merit in the rationale for pass/fail grading, and there are no doubt occasions when it may be quite sufficient. The danger is that at some future time when the individual is competing with his peers, as, for example, in applying to a graduate school, it may be found that pass/fail grades are of little help and the decision may be based partially on criteria even less valid than traditional grades. Of the 530 graduate and professional schools responding to the AACRAO survey, 26% reported that admission is jeopardized or delayed if the applicant's undergraduate transcript contains a substantial number of non-traditional grades. Another 21% reported that admission is not jeopardized or delayed, 16% reported that the policy varies among departments, and 37% reported that no policy had yet been established.⁴ If the student is in a terminal program or is unlikely to need meaningful grades to assist him in subsequent selection processes or to document his achievement, pass/fail may be adequate. But under these conditions the question itself may be superfluous.

It should be pointed out that although grades or class rank tend to be helpful in predicting future academic success, there are other types of objective and subjective evidence which also should be considered. Since each human being is a complex and unique combination of skills and abilities, strengths and weaknesses, critical decisions affecting the individual should be based on all available information and not on grades alone.

The case of the instructor who gives predominantly high grades or predominantly low grades is more complex. Some faculty give low grades as evidence of their high standards or of the academic rigor of their courses, though such results could just as well be interpreted as evidence of poor teaching. Others give high grades as a protest against the allegedly unfair, non-egalitarian, or inhumane practice of grading itself. Studies at numerous institutions have revealed that GPA's have risen in recent years. The president of one prestigious institution, noting that freshman GPA's rose from 2.49 to 3.01 in four years on his campus, asked a faculty committee to study grading practices.⁵ In some cases the general intellectual excellence of the students, as reflected by SAT or similar scores, has risen also, though not always to the same extent. In other cases the quality of the students has remained the same or actually fallen while the GPA increased.

These trends have been documented in national studies. Between 1960 and 1969 the mean undergraduate GPA at 100 institutions rose from 2.4 to 2.56 on a 4.0 scale.⁶ At one college it was reported that 40% of the grades awarded were A's while only 3% were D's or F's.⁷ The effect of such practices is to lump together all of the above-average students in the highest grade categories and, as in the case of pass/fail, not distinguish among them. If the mean GPA exceeds 3.0, which is not uncommon, then even though the institution may use pluses and minuses, it has no more than four categories (A+, A, A-, B+) with which to make fine distinctions among above-average students, while it has nine categories (B, B-, C+, C, C-, D+, D, D-, F) with which to make fine distinctions among below-average students. The contention that C represents not the average at the institution or in the class but rather the national average for similar students or for the general population at that age cannot be documented and must be regarded as a convenient fiction, particularly when no attempt is made to gather baseline data. Knowledgeable persons are well aware of the different standards represented by the same grade at different institutions and interpret transcripts accordingly.
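The category count in the preceding paragraph can be checked with a short sketch. The thirteen grade labels come from the text; the grade-point values attached to them are an assumed conventional 4.0-scale mapping, and the function name is hypothetical.

```python
# The thirteen letter-grade categories named in the text, best to worst.
# The point values are an ASSUMED conventional mapping, not taken from the article.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7, "B+": 3.3,
    "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.7, "F": 0.0,
}

def split_categories(mean_gpa):
    """Return (categories above the mean, categories at or below the mean)."""
    above = [g for g, p in GRADE_POINTS.items() if p > mean_gpa]
    at_or_below = [g for g, p in GRADE_POINTS.items() if p <= mean_gpa]
    return above, at_or_below

# A mean of 3.01 is the freshman GPA cited earlier in the article.
above, below = split_categories(3.01)
print(len(above), above)   # 4 categories: A+, A, A-, B+
print(len(below), below)   # 9 categories: B through F

# With a mean near 2.0 the split is far more even.
above_c, below_c = split_categories(2.0)
print(len(above_c), len(below_c))  # 7 and 6
```

Under these assumed values, a mean just above 3.0 leaves four categories for the upper half of the class and nine for the lower half, while a mean of 2.0 leaves seven and six.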

On the other hand, there are certain skills and knowledge in which it may be unnecessary to distinguish levels of competence once the student has achieved a critical level of competence. Some of the skills of sightsinging and ear training may fall into this category. Measurement based on a single, pre-established criterion is known as criterion-referenced measurement; it is contrasted with norm-referenced measurement, in which the standards are set by the performance of the members of the group. The justifiably high grade distribution associated with criterion-referencing is often found when an instructor adopts a contract system of grading, in which the student knows in advance precisely what is required for each grade and can choose the grade he wishes to earn.
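The contrast between the two kinds of measurement can be illustrated with a minimal sketch. The student labels, scores, cutoff, and fraction below are all hypothetical, and the function names are chosen only for this illustration.

```python
def criterion_referenced(scores, cutoff):
    """Judge each student against a fixed, pre-established standard."""
    return {name: score >= cutoff for name, score in scores.items()}

def norm_referenced(scores, top_fraction):
    """Judge each student against the group: only the top fraction qualifies."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    return {name: name in top for name in scores}

# Hypothetical sightsinging scores for five students.
scores = {"S1": 92, "S2": 88, "S3": 85, "S4": 81, "S5": 64}

criterion = criterion_referenced(scores, cutoff=80)
norm = norm_referenced(scores, top_fraction=0.4)

print(sum(criterion.values()))  # 4 students meet the fixed criterion
print(sum(norm.values()))       # 2 students fall in the top 40% of the group
```

Under the fixed criterion, any number of students may succeed, which is why a high distribution can be justified; under the group standard, the number who succeed is settled before anyone is tested.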

The purpose of a university is to facilitate learning, and the university is successful to the extent that it helps as many students as possible learn as much as possible. In this sense, its objective should be to help every student earn an A in every course, and in this sense a disproportionate number of high grades is not undesirable. However, even in criterion-referenced grading the criteria may be established somewhat arbitrarily, and the fact remains that a preponderance of high grades masks the distinctions that grades are intended to reveal.

Many music professors believe that the content of their courses is better suited to evaluation by essay tests than by objective tests. This belief is often well-founded, particularly in courses where the ability to recall facts is less important than the ability to analyze, synthesize, and apply knowledge. However, the difficulties of writing good essay items and scoring them reliably are formidable. Studies have consistently shown significant discrepancies between the grades assigned the same papers by different instructors and between the grades assigned the same papers by the same instructors at different times.⁸

What can an instructor do to improve the reliability⁹ of his essay grading? He should do at least four things:

1. He should word each question so as to define the task as clearly as possible. The question should be sufficiently structured to indicate just what is expected, but open enough to permit the student the proper degree of latitude.

2. He should outline the expected response in advance for his own guidance. Not only does this help the instructor to apply the same standard to every paper, but it may reveal weaknesses or ambiguities which can be corrected before the question is administered.

3. He should take steps to ensure the anonymity of the papers as he reads them. Though handwriting sometimes reveals the identity of the writer, the use of seat numbers, social security numbers, or other code numbers can help to eliminate the so-called halo effect. The instructor must not risk being influenced by preconceptions about the student, success or failure of the student on previous examinations, or other irrelevant factors.

4. He should grade the papers item-by-item rather than paper-by-paper. That is, he should read and grade each student's response to item one, then read and grade each response to item two, and so forth. After each question has been read independently, the results are then totaled for each student. The instructor should not know, in grading any given item, how well the student has done on any other question on the examination. This procedure can produce a significant improvement in scoring reliability.
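The third and fourth steps above can be sketched as a simple procedure. This is only an illustration under stated assumptions: the function name, the code-number format, and the `score_response` callable (standing in for the instructor's judgment) are hypothetical, and every student is assumed to answer the same number of items.

```python
import random

def grade_item_by_item(responses, score_response, seed=None):
    """Grade anonymously, one item at a time, and total only at the end.

    responses: {student_name: [answer_to_item_1, answer_to_item_2, ...]}
    score_response: callable(item_index, answer_text) -> numeric score.
        It never receives the student's identity or any other score.
    """
    rng = random.Random(seed)

    # Step 3: replace names with code numbers assigned in shuffled order.
    names = list(responses)
    rng.shuffle(names)
    codes = {name: f"P{i:03d}" for i, name in enumerate(names, start=1)}
    papers = {codes[name]: answers for name, answers in responses.items()}

    # Step 4: score every response to item one, then every response to item two, etc.
    n_items = len(next(iter(papers.values()))) if papers else 0
    item_scores = {code: [] for code in papers}
    for item in range(n_items):
        order = list(papers)
        rng.shuffle(order)  # a fresh reading order for each item
        for code in order:
            item_scores[code].append(score_response(item, papers[code][item]))

    # Only after all items are scored are totals computed and names restored.
    totals_by_code = {code: sum(s) for code, s in item_scores.items()}
    return {name: totals_by_code[codes[name]] for name in responses}

# Toy usage: the "score" here is just answer length, purely for demonstration.
demo = grade_item_by_item(
    {"Ann": ["aa", "bbb"], "Bob": ["c", ""]},
    score_response=lambda item, answer: len(answer),
    seed=1,
)
print(demo)  # {'Ann': 5, 'Bob': 1}
```

Reshuffling the reading order for each item is an added design choice, not something the list above requires; it is meant to keep a paper's position in the stack from affecting its score.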

One of the common arguments against grading is that it allegedly forces the student to work for grades rather than work to learn the subject matter. The fallacy in this argument is that it assumes there is little or no correspondence between knowing the subject matter and receiving a good grade. The grade is intended to be an assessment of how well the student has learned the subject matter. To the extent that the grade is successful in reflecting the degree of learning, the argument has no meaning. To the extent that it is unsuccessful, the grade is invalid.

A grade should represent knowledge or competence in the subject matter. Each component skill or bit of knowledge should be represented in the grade in the same proportion that it is represented in the subject matter. Skills which are easier to test should be no more numerous and no more heavily weighted than those which are difficult to test. When grades are based on other factors or influenced by extraneous considerations, they are invalid no matter how well meaning the instructor. The practice of basing grades on attendance, for example, is not justified by any reasonable criterion.

Aside from eliminating any ambiguity or lack of clarity in his test items, perhaps the most important single step an instructor can take to improve the validity of his grading is to define the objectives of his course in terms of the skills or behaviors the student is expected to exhibit at the conclusion rather than in terms of what content the course will "cover."¹⁰ There has been a large amount of literature and some controversy in recent years concerning the usefulness of behavioral objectives.¹¹ This approach is more difficult to apply in some fields than in others, but fortunately it is not necessary to choose between total acceptance and total rejection of the concept. Objectives exist along a continuum from the completely behavioral (e.g., those utilizing verbs such as "state," "list," "describe," "sing," "play") to the completely non-behavioral (e.g., those with verbs such as "know," "understand," "enjoy," "grasp the significance of"). The instructor should make certain that his objectives are stated as behaviorally as the nature of the subject matter will allow. Often this can be done by citing examples of test items or item types. In such cases the objectives and the evaluation exercises tend to merge so that there is no discrepancy between the two. This does not cause the evaluation exercises to be unduly easy for the student, as some instructors have feared, because the musical examples can always be different while the task remains the same. However, this approach can be of great help to the student because he knows just what type of task will be expected of him, and in this way it contributes significantly to the efficiency of the educational process.

What happens when the student is offered optional items on an essay examination? This practice appears to be a logical if not inevitable concomitant of individualized instruction. After all, each student, particularly at the more advanced levels, has his own specialized interests and competencies. Such reasoning is sound in theory but difficult to translate into practice. What typically happens is that, when given options, different types of students tend to be attracted to different alternatives. Students with keen analytic minds, for example, may be attracted to a given question which appeals to them because of the potential it offers for thorough analysis, for critical thinking, or for extrapolation. As a result, the general level of the responses may be higher for this item than for another item which attracts a different quality of student. If so, the brighter student suffers because he is being compared only with other bright students rather than with the class as a whole. In other words, the selection process itself introduces systematic biases, and, therefore, the distributions for the various options do not represent equivalent achievement.

The instructor presumably could account for these discrepancies by comparing the groups choosing each option on the items everyone answers and then equating the results on the optional items. But this would penalize the student who gave a good answer to a difficult question by assigning him the same grade as the student who gave an equally good answer to a somewhat less difficult question, which is also unfair. The instructor should at least be aware of this problem if he chooses to give optional items.
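One simple way such an adjustment might be carried out is sketched below. It is an assumed mean-shift approach, not a procedure given in this article: each option group's scores on the optional item are shifted so that the group's mean on that item matches its mean on the items everyone answered. The record layout, field names, and scores are hypothetical, and both scores are assumed to be on the same percentage scale.

```python
from statistics import mean

def equate_optional_scores(records):
    """Shift optional-item scores so each option group's mean on the optional
    item equals that group's mean on the common items.

    records: list of dicts with keys 'option', 'common', 'optional'.
    """
    by_option = {}
    for r in records:
        by_option.setdefault(r["option"], []).append(r)

    adjusted = []
    for option, group in by_option.items():
        shift = mean(r["common"] for r in group) - mean(r["optional"] for r in group)
        for r in group:
            adjusted.append({**r, "adjusted": r["optional"] + shift})
    return adjusted

# Hypothetical data: the stronger group chose option A and was scored harshly on it.
records = [
    {"option": "A", "common": 90, "optional": 70},
    {"option": "A", "common": 80, "optional": 60},
    {"option": "B", "common": 60, "optional": 75},
    {"option": "B", "common": 70, "optional": 65},
]
result = equate_optional_scores(records)
print([r["adjusted"] for r in result])  # [90, 80, 70, 60]
```

The sketch also shows the limitation just described: the shift depends only on which group a student joined, so it cannot tell whether an option was truly harder or was merely graded against stronger competition.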

The field of measurement and evaluation is complex. The various practices found in colleges and universities are based on certain assumptions which are more likely to be implicit than explicit. Some instructors lack the interest or background to do a technically adequate job of evaluation, though if it is to be done it should be done as well as possible. Current interest in measurement and evaluation on the part of students and critics is healthy in that it results in critical examination of existing practices. Although not every change represents progress, if professors will examine their basic premises and check to see that their practices are consistent with these premises, both their premises and their practices will be better.

¹The AACRAO Survey of Grading Policies in Member Institutions, A Report of the Ad Hoc Committee to Survey Grading Policies in Member Institutions of the American Association of Collegiate Registrars and Admissions Officers (Washington, 1971), p. 10.

²Philip T. Bain, Loyde W. Hales, and Leonard P. Rand, "Does Pass-Fail Encourage Exploration?," College and University, XLVII (1971-72), 17.

³The AACRAO Survey, p. 25.

⁴The AACRAO Survey, p. 30.

⁵The Chronicle of Higher Education, V, 36 (July 5, 1971), p. 6.

⁶The Chronicle of Higher Education, V, 21 (March 1, 1971), p. 1.

⁷The Chronicle of Higher Education, VII, 10 (November 27, 1972), p. 3.

⁸W.C. Eells, "Reliability of Repeated Grading of Essay Type Examinations," Journal of Educational Psychology, XXI (1930), 48-52; E.S. Dexter, "The Effect of Fatigue or Boredom on Teachers' Marks," Journal of Educational Research, XXVIII (1935), 664-667; P.D.M. Edwards, "The Use of Essays in Selection at 11 Plus: Essay Marking Experiments: Shorter and Longer Essays," British Journal of Educational Psychology, XXVI (1956), 128-136; Daniel Starch and Edward C. Elliott, "Reliability of the Grading of High School Work in English," School Review, XX (1912), 442-457, "Reliability of Grading Work in Mathematics," School Review, XXI (1913), 254-295, and "Reliability of Grading Work in History," School Review, XXI (1913), 676-681; W.N. Thompson, "A Study of the Grading Practices of 31 Instructors in Freshman English," Journal of Educational Measurement, LXIX (1955), 65-68; E.W. Tieg, "Educational Diagnosis," Educational Bulletin #18 (Monterey, California, 1952).

⁹"Reliability," in measurement, refers to the consistency with which a test measures. "Validity" refers to the extent to which it actually measures what it claims to measure.

¹⁰See Robert F. Mager, Preparing Instructional Objectives (Palo Alto, 1962).

¹¹Arthur W. Combs, Educational Accountability: Beyond Behavioral Objectives (Washington, 1972); W. James Popham, "Probing the Validity of Arguments Against Behavioral Goals," in Behavioral Objectives and Instruction, by Robert J. Kibler, Larry L. Barker, and David T. Miles (Boston, 1970), pp. 115-124.

Last modified on November 13, 2018