Working with the tension between language test validity and reliability

The combination of validity and reliability is the holy grail when it comes to language assessment, yet these two qualities are always in direct tension with each other. This can create a challenge when English language programs try to put in place effective measures of language learning, and especially when they have to convince their accreditors that they’ve done so. Student achievement standards are frequently not met in accreditation reviews for precisely this reason.

An assessment is valid if it measures what it is supposed to measure. So, a multiple-choice test is generally not a very valid means of testing speaking ability; nor is a gap-fill test a very valid way to determine whether a student has learned to use a grammar structure in communication. On the other hand, a student presentation might serve as a useful basis for a valid assessment of speaking ability, and a speaking or writing test that elicits a target grammar structure would bring to light a student’s ability to use that structure in communication.

An assessment is reliable if it would yield the same results for a given student when administered by a different person or in a different location. An in-class presentation or role-play assessed by the class teacher is prone to low reliability, since the test conditions would be difficult to reproduce in another class. The TOEFL iBT is probably the gold standard for test reliability, with extremely detailed protocols for ensuring the uniformity of the test-taking experience for all students, and two-rater grading of written and spoken assignments.

You can probably see the tension: the greater the validity, the harder it is to attain reliability; the greater the reliability, the harder it is to make the test valid (in the three-hour iBT, the test taker is not required to interact with a single human being).

To increase the reliability of valid assessments, programs can:

  1. use a common set of learning objectives across the program and hold teachers accountable for teaching to them
  2. use standard assessment rubrics across the program
  3. calibrate grading through teacher training
  4. have more than one person assess each student’s performance.

These measures might generate pushback among faculty in some university programs.
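
One way to check whether calibration and double-marking are actually improving reliability is to quantify how often two raters agree beyond what chance alone would produce. The following is a minimal sketch, not a prescribed method: it computes Cohen's kappa, a standard agreement statistic, for two teachers scoring the same presentations. The function name, the variable names, the 1-4 rubric bands, and the scores themselves are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equal-length lists of rubric scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of students.")
    n = len(rater_a)

    # Observed agreement: share of students given the same band by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each band, the probability that both raters
    # would pick it independently, based on how often each rater used it.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[band] * counts_b[band] for band in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical band throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical rubric bands (1-4) for ten student presentations.
teacher_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
teacher_2 = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]

kappa = cohens_kappa(teacher_1, teacher_2)
print(f"Cohen's kappa: {kappa:.2f}")  # prints "Cohen's kappa: 0.71"
```

A value of 1 means the two raters always agree, and a value near 0 means they agree no more often than chance would predict. What counts as acceptable is a judgment for each program to make; a figure that rises after a calibration session is at least some evidence that the session helped.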

Unfortunately, there aren’t any great ways to increase the validity of highly reliable achievement tests. Doing so would require standardizing the teaching – teaching directly to the test – which nobody in an intensive English program (IEP) wants, except in a course specifically for test preparation. Programs that use external standardized tests for level promotion are not using a valid means of assessing what was taught (since the test makers don’t know what was taught).

Instead of seeking the absolute standard of ‘assessments that are valid and reliable,’ we need to:

  1. start by creating assessments that are valid – that measure precisely what was taught and was supposed to be learned; and then
  2. design and implement measures to reach as high a level of reliability of those assessments as is possible and practical.

This approach recognizes that you can’t have it all, but that you can work within the tension between validity and reliability to reach a satisfactory compromise.