Saturday 29 April 2017

Measurement of Performance in Task-Based Assessment

Introduction
Task-based assessment (TBA) uses tasks to gather information about the linguistic abilities of test-takers. It employs either a performance-referenced or a construct-based approach to do this. Performance-referenced TBA examines target language use (TLU) tasks to identify which aspects of performance are to be tested, so that the test is useful for its intended purposes; this is called the work-sample approach. The construct-based approach selects a theory of language learning and use to define the underlying language abilities to be measured in the test. Whichever approach is used, once the test has been conducted, the performance must be interpreted or assessed in order to make statements about the test-taker's abilities, predict future performance, or take decisions. For this, there are four different ways.

1. Direct assessment of task outcomes
Direct assessment is useful for closed tasks, i.e. tasks that result in a solution that is either right or wrong. Scoring is therefore dichotomous. Such criterion-referenced tests let us judge directly whether the test-taker has passed or failed.
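To make dichotomous scoring concrete, here is a minimal sketch in Python. The item keys, the responses and the 70% pass mark are all invented for illustration; they are not drawn from any actual test.

# Dichotomous scoring of a closed task: each item is simply right (1) or wrong (0).
answer_key = {"item1": "B", "item2": "A", "item3": "D", "item4": "C"}
responses  = {"item1": "B", "item2": "C", "item3": "D", "item4": "C"}

scores = {item: int(responses.get(item) == key) for item, key in answer_key.items()}
total = sum(scores.values())

# Criterion-referenced decision against a hypothetical 70% cut-off.
passed = total / len(answer_key) >= 0.70
print(scores)                                   # {'item1': 1, 'item2': 0, ...}
print(f"{total}/{len(answer_key)}:", "pass" if passed else "fail")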
Advantages: very little subjectivity is involved, the result states the outcome clearly, and scoring is easy and quick.
Disadvantages: direct tests need direct observation of task performance, that is, a rater must observe each test-taker individually, which is time-consuming and often impractical. In the case of direct written tests this problem is easily solved by evaluating the scripts afterwards. Another disadvantage is the lack of clarity about whether such a test measures language ability or non-linguistic, general knowledge. This problem is more acute for direct performance-referenced tests that use the work-sample approach; construct-based direct tests do not face it.

2. Discourse Analytic Methods
As the name indicates, this method analyses the discourse produced in the test, counting specific linguistic features that appear in the task discourse. There are several ways of doing this. The first is to examine the test-taker's linguistic competence through the complexity, fluency and accuracy of the discourse. The second is to look at sociolinguistic competence, such as the use of different strategies for eliciting information from interlocutors. The third is to look at discourse competence, such as the appropriate use of connectors, topic changers, etc. The fourth is to look at strategic competence, such as the strategies used to negotiate meaning, build discourse, or overcome breakdowns in communication.
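As a concrete illustration of the first kind of analysis, the sketch below computes simple, commonly used proxies for complexity, accuracy and fluency from a coded transcript. The transcript, the clause segmentation and the error counts are invented, and real discourse-analytic coding is done by trained analysts; this is only a sketch of the arithmetic involved.

# Toy CAF (complexity, accuracy, fluency) analysis of a transcribed performance.
transcript_clauses = [
    {"text": "I went to the market", "errors": 0},
    {"text": "because I needing vegetables", "errors": 1},  # tense error
    {"text": "and I bought three tomatoes", "errors": 0},
]
speaking_time_minutes = 0.5  # length of the recorded performance

words = sum(len(c["text"].split()) for c in transcript_clauses)

fluency = words / speaking_time_minutes                          # words per minute
accuracy = sum(c["errors"] == 0 for c in transcript_clauses) / len(transcript_clauses)
complexity = words / len(transcript_clauses)                     # mean clause length

print(f"fluency: {fluency:.1f} wpm, "
      f"accuracy: {accuracy:.0%} error-free clauses, "
      f"complexity: {complexity:.1f} words/clause")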

Discourse analytic methods are generally objective and can be considered a form of direct measurement, although some do not regard them as direct, since in the real world we do not analyse discourse to see whether someone has succeeded in communicating something.

This method is used widely in task-based research. It is a time-consuming process, since the discourse has to be transcribed for analysis, and for that reason it is not widely used in task-based language teaching; in research, however, it is a very useful tool. Most of the time, discourse analysis is used alongside external ratings so that the assessments afforded by the two can be compared.

3. External Ratings
External ratings make use of an external rater or observer to assess performance; judgment comes from this rater. This is not direct assessment, however, because of the nature of the judgment made: the rater judges subjectively, whereas in direct assessment the judgment is objective, since it emerges from the performance itself.

The most common type of external rating uses rating scales, which both performance-referenced and system-referenced (construct-based) tests make use of. What do rating scales do? They specify the ability or competency being measured and describe different levels of performance as bands. Usually the scale runs from 0 to 5, with the highest level typically representing native-like proficiency.
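As an illustration, such a band scale might be represented as below; the descriptors are invented for this sketch and are not taken from any actual test.

# A hypothetical 0-5 band scale for an instruction-giving task.
band_scale = {
    5: "native-like: instructions fully clear, fluent and accurate",
    4: "instructions clear, with minor errors that do not obstruct meaning",
    3: "instructions mostly followable; occasional breakdowns",
    2: "frequent breakdowns; the listener must infer heavily",
    1: "isolated words and phrases; the task is largely unachieved",
    0: "no rateable performance",
}

rater_band = 4  # the band the rater assigns after observing the task
print(f"Band {rater_band}: {band_scale[rater_band]}")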

Another kind of scale makes use of checklists. A checklist lists different abilities or aspects of performance; the rater observes the performance and ticks off those that are present.
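A checklist rating can be sketched in the same way; the behaviours listed below are again hypothetical.

# A checklist-style external rating: the rater ticks off the behaviours observed.
checklist = [
    "greets the interlocutor appropriately",
    "asks at least one clarification question",
    "sequences instructions with connectors",
    "closes the interaction politely",
]
observed = {"greets the interlocutor appropriately",
            "sequences instructions with connectors"}

for behaviour in checklist:
    mark = "x" if behaviour in observed else " "
    print(f"[{mark}] {behaviour}")
print(f"{len(observed)}/{len(checklist)} behaviours observed")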

In TBA, there are two ways of specifying competence: in behavioural terms and in linguistic terms. The choice between them depends on whether you want to know about the test-taker's general language proficiency or a specific language proficiency.

Behavioural specification of competence uses rating scales that describe levels of test-taker performance, for example: 'instructions are given correctly and with clarity'. Such scales are essentially task-dependent: every task needs its own rating scale.

Linguistic specification of competence is problematic: how do we specify competence, and which aspects of linguistic competence must be measured? There are two choices: (a) a holistic measure, which looks at the general language proficiency needed to fulfil the required functions, without attention to specific linguistic features; and (b) an analytic measure, which looks for specific linguistic abilities in terms of the four language skills. Many tests use both holistic and analytic measures. An analytic scale requires us to specify which aspects of performance we are looking at, and these aspects need to be connected to an underlying theory of language. Skehan's framework suggests accuracy, fluency and complexity as the three dimensions; one can also use functional definitions of ability instead. The basic rule is to use a theory of communicative competence or performance to determine the competencies to be measured.

How does one determine the checklists or the performance levels used in rating scales? McNamara proposes two approaches: theoretically driven and empirically driven. The theoretically driven approach uses context-driven descriptors: criterion behaviours are specified, performance is observed to see whether they are satisfied, and descriptors are developed in terms of the linguistic skill displayed. In the empirically driven approach, the Rasch scaling procedure is used to study statistically the relationship between the different items in a test and the general abilities to be inferred, so as to form a scale. Such scales are not very common, since the process involved is complex. An alternative is to develop scales using insights from discourse analysis.
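At the heart of the Rasch procedure is a simple probabilistic model relating a person's ability to an item's difficulty, both expressed on the same scale. The sketch below shows the dichotomous Rasch model; the ability and difficulty values are invented, and real scaling would estimate them from a full response matrix, typically with dedicated software.

import math

# Dichotomous Rasch model: the probability that a test-taker with ability
# theta answers an item of difficulty b correctly.
def rasch_probability(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

abilities = {"weak": -1.0, "average": 0.0, "strong": 1.5}        # invented
item_difficulties = {"easy item": -1.0, "hard item": 1.0}        # invented

for person, theta in abilities.items():
    for item, b in item_difficulties.items():
        print(f"{person:7s} x {item}: P(correct) = {rasch_probability(theta, b):.2f}")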

The criterion level in a scale has to be defined, and this is a problematic process: most often the decision is arbitrary or relative. For example, some universities choose an IELTS score of 6 as the criterion for admission, while others choose 7 or 7.5. Such decisions are not theoretically or empirically justified.

A very important aspect of a rating scale is its interaction with the rater; this is another field of study altogether.

4. Self-Assessment
Self-assessment has both advantages and problems. The advantages are that it is easy, less time-consuming and less expensive, and that it helps learners become self-regulated, reflective learners. The disadvantages concern the reliability and validity of self-assessment: how well can learners judge their own language proficiency? Bachman has found that self-assessment is more meaningful when the questions are about actual performance or needs rather than about linguistic abilities. According to some studies, it lacks predictive and concurrent (criterion-related) validity. Reliability is high when measured as internal consistency, but not when measured as test-retest reliability.
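To make the two reliability notions concrete, the sketch below computes Cronbach's alpha (internal consistency) and a test-retest correlation on fabricated self-assessment scores; the data are invented purely for illustration.

from statistics import mean, variance

def cronbach_alpha(item_scores):
    # item_scores: one list of scores per item, columns being learners
    k = len(item_scores)
    totals = [sum(col) for col in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum(variance(i) for i in item_scores) / variance(totals))

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Invented self-ratings: three items (rows) answered by five learners.
items = [[4, 3, 5, 2, 4],
         [4, 3, 4, 2, 5],
         [5, 3, 4, 1, 4]]

time1 = [4, 3, 5, 2, 4]   # total self-rating at the first administration
time2 = [3, 4, 4, 3, 5]   # the same learners re-rated later (invented)

print(f"Cronbach's alpha (internal consistency): {cronbach_alpha(items):.2f}")   # high
print(f"test-retest correlation: {pearson_r(time1, time2):.2f}")                 # lower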
