
Journal of Taibah University Medical Sciences (2013) 8(2), 72-79


Review Article

Overview of medical student assessment: Why, what, who, and how

Omar Hasan Kasule, DrPH

Faculty of Medicine, King Fahad Medical City, Riyadh, Kingdom of Saudi Arabia

Received 18 October 2012; revised 9 November 2012; accepted 15 December 2012



Abstract

The paper discusses various issues related to assessment in the context of the Arab world and focuses on two forms of assessment: multiple choice questions (MCQs) and objective structured clinical examinations (OSCEs). The appropriate assessment system is determined by the content and method of teaching as well as the expected knowledge and skills of the final product, the health professional. Assessment motivates students to study hard. It is also used to make decisions on promotion of students. A good assessment system cannot be imported; it must be home grown, taking into consideration the cultural, linguistic, and educational background of the students. Centralized assessment not under the immediate control of teachers in contact with the students is appropriate for standardized examinations like the United States Medical Licensure Examination (USMLE) but is associated with many challenges in internal examinations within the teaching institution: complicated logistics, marginalization of the teachers, and the injustice of treating un-equals as equals. Assessment covers the two components of medicine: the science and the art (practical). The MCQ format assesses knowledge and its applications. The OSCE format assesses practical skills. Good MCQ items take a lot of effort and time to write and review but are easy to administer and score. The OSCE based on simulated patients (SP) has ably replaced the traditional long and short clinical cases but penalizes the advanced candidate who asks the SP questions off the script. I propose using SPs who actually had personal experience of the condition being tested. I also propose that some items in the OSCE that are of critical knowledge for professionals should have higher scores assigned to them. Students should be failed in the whole examination if they do not know some of these critical items.

Presented at a workshop on Assessment at the Faculty of Medicine, King Fahad Medical City, Riyadh, Saudi Arabia, on October 7, 2012.
Corresponding author: Faculty of Medicine, King Fahad Medical City, Riyadh, Kingdom of Saudi Arabia. Tel.: +966 548867916. E-mail:
Peer review under responsibility of Taibah University.

1658-3612 © 2013 Taibah University. Production and hosting by Elsevier Ltd. All rights reserved.

Keywords: Multiple choice question (MCQ); Objective structured clinical examination (OSCE); Test reliability; Test validity; Test blue print; Simulated patient


Teaching: what and how?

In order to determine our assessment we need to define our teaching system and the end product we expect to produce. Medicine is art and science.1 It started purely as art but became more scientific with the discovery of new knowledge, especially in the 19th century. The 21st century is characterized by dominance of technology (practical application of science). The art of medicine is taught as apprenticeship2 that involves passive learning with life-long impact. The role of the mentor is more important than normal teaching. We may need clinicians who are mainly teachers, see few cases, and do few procedures so that they have time to teach and are not rushed.

Teaching the science of medicine can be student-centered self-teaching or teacher-centered. Student-centered learning is appropriate for learning facts. Good teachers are needed to teach the concepts that underlie and organize the facts in the mind of the learner.

Determining the final product will help us plan our teaching and assessment better. In my view the problem-based learning (PBL) approach is well suited to training general practitioners with broad and practical/pragmatic knowledge of medicine that enables efficient solving of routine problems. I also think that the traditional method is appropriate for training medical specialists (with deep, detailed, specialized knowledge as well as skills) and researchers (with deep thinking, understanding, and inquiry that extend the frontiers of knowledge).

Our dilemma is that the examination format we are using is influenced by licensing bodies whose primary interest is practical professional skills. When we use this format we shortchange and fail to reward students whose careers will be as specialists or researchers. An additional dilemma is that our teaching also gets modified to fit the examination format and requirements.3 We end up teaching students to pass the examination instead of teaching them to learn.

Assessment: why?

There are many reasons for assessing students and we may sometimes lack clarity about which ones we are interested in. I think the main reason is motivating our students to work hard so that they can be rewarded with good scores. I have often mused about what would happen if there were no examinations. Would we have any students in classes? Other reasons for testing are:

(a) To make a diagnosis of deficiencies in our teaching and in our students' learning.

(b) To make decisions about what to do with low achievers. We normally set a standard for deciding who is a low achiever. The standard may be absolute, such as 50% of the total score. It may be relative, for example failing students who score more than 2 standard deviations below the mean score. Teachers in practice have problems using standards because student cohorts behave differently. They change the standard by awarding a few more marks to enable borderline students to pass or by shifting the curve.
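The two kinds of standard described above can give different decisions on the same cohort. The following is a minimal sketch, not a recommended policy: the function names, the invented scores, and the choice of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def failing_absolute(scores, max_score, pass_fraction=0.5):
    """Students below a fixed fraction of the total score (absolute standard)."""
    cutoff = pass_fraction * max_score
    return sorted(name for name, s in scores.items() if s < cutoff)


def failing_relative(scores, n_sd=2.0):
    """Students more than n_sd standard deviations below the cohort mean.

    Assumption: the population standard deviation is used; the text does
    not say which form of the standard deviation is intended.
    """
    values = list(scores.values())
    cutoff = mean(values) - n_sd * pstdev(values)
    return sorted(name for name, s in scores.items() if s < cutoff)


# Invented cohort, marks out of 100 (illustrative data only).
cohort = {"A": 82, "B": 45, "C": 67, "D": 71, "E": 20,
          "F": 75, "G": 69, "H": 78, "I": 73, "J": 70}

print("Absolute standard fails:", failing_absolute(cohort, max_score=100))
print("Relative standard fails:", failing_relative(cohort))
```

In this invented cohort a student scoring 45 fails under the absolute standard but passes under the relative one, which is one reason the choice of standard matters for borderline students.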

Assessment: who is assessed?

Many of our problems in assessment arise from basic incompatibilities between the underlying philosophy of the testing on one hand and the cultural, intellectual, and school education background of the students on the other. Test strategies developed in European societies (Europe, America, and Australia) with emphasis on analytical thinking cannot do well in Muslim societies in which synthetic thinking emanating from the integrative paradigm of tauhid predominates. The Muslim paradigm accepts and celebrates differences as natural while European societies aim at minimizing differences to achieve the efficiency of uniformity and standardization. Bridging these philosophical differences has not been easy. To make matters worse many of those engaged in assessment may not be aware of these differences.

The educational background of our students is very different. Their ability to read and understand English is below that of native speakers, with a different point of equilibrium between reading speed and comprehension of what is read. Our students go through more steps in understanding a question: translation from English to Arabic, reasoning out the response, translation of the response back into English, and formulating the response in English. They think in one language and answer in another, with each language using different logical structures. Their educational experience is very different; they grew up in a school system that emphasized memorization and getting authoritative knowledge. They cannot suddenly adapt to a system that emphasizes problem analysis and solving. They grew up in an analytic paradigm that recognized the dichotomy between the absolute and the relative in knowledge; they cannot suddenly adapt to a system that treats all knowledge as relative and in which they are tested on their knowledge of what information is relatively nearer the truth.

Many changes will be needed to adapt the testing strategy to our local environment. Continuous change and adaptation of test strategies from local experience is a must. There is language bias in the testing4 that must be corrected for. Simple language must be used for non-native speakers. Test strategies that do not involve sentence construction or recall of words, such as MCQs, should be used. A glossary of commonly misused or misunderstood English words must be available to the examiners as they construct test items. The sequence of the logical operations in problem solving items should use the pattern most familiar to the students: simple to complex or vice versa, summary to detail or vice versa, and easy to complex or vice versa.

Assessment: who assesses?

Traditionally the teachers prepared and scored a test for the students they taught. The teachers knew what was learned well and need not be tested. They knew what was difficult and required emphasis on the test. Traditionally the paradigm of university academic freedom respected these roles of the teachers as independent professionals who were not told what to do regarding teaching and assessment. If a common format or system was followed, it was made by unanimous input and agreement of the teachers concerned.

In our times the teachers' role has been supplanted by a remote testing system to which they contribute items but have no control over or knowledge of when and how these items turn up in the examination. Today the powerful medical education department tells the teachers what to teach, how to teach it, and in many cases scores the test. The teachers have opinions, biases, and controversies that their students know and can deal with in the test that is custom-made. The centralized test does not take into consideration these dynamics of class teaching.

The centralized and remotely constructed tests share characteristics of standardized tests like the Medical College Admission Test (MCAT) and USMLE. Standardized tests have their own techniques and approaches, being constructed to fit students with different learning experiences. They are simpler and cover only facts that are unanimous. They may not relate to all that was taught and learned. They also may not relate to tests constructed by teachers who know the students. A study5 found low correlation between school clinical test scores and USMLE 2 scores. Year 2 internal OSCE scores could not predict USMLE 2 scores.6 Tests by teachers were more predictive of future success in clinical work than USMLE 1 was.7 The score in a standardized test does not reflect knowledge only. It also reflects test 'wiseness', with a candidate able to earn scores based on experience with the examination technique acquired by practice on past papers. Commercial preparation packages and preparation courses were found, against common sense, to have no effect on USMLE 1 scores.8

I am concerned about the pervasive powers of the medical education department. It is easy to be an expert in medical education at the expense of substantive discipline knowledge, i.e. to be better at how to teach than at what to teach. The traditional system was not acceptable because it relied on teachers who had no training in teaching or assessment skills. The modern system has emphasized teaching and assessment skills but I fear that issues of substantive discipline knowledge may not have the same emphasis.

Remote control and centralization of testing have advantages and disadvantages. The advantages are more efficient human resource utilization, standardization, better quality control, and objectivity. The disadvantages are marginalization of local initiative and the injustice of treating the unequal as equal. The remote examiner would never know when the air conditioning system broke down and the students could not follow one lecture well.

Assessment: what is assessed?

Deciding what to test depends on the underlying philosophy of knowledge, education, and training. Some examiners assume that students are taught facts so that they can derive information and above that attain wisdom. The students are therefore not tested on the facts but on the wisdom they acquired from the facts. The test therefore need not be directly based on what was taught.

An easier and fairer system for students is to align curriculum objectives, the teaching, and the assessment. Each item in the test should be referred to a specific learning outcome (LO). This is achieved by using a blue print showing LOs and the testing method to ensure fair coverage of all that was taught. The blue print is a grid showing LOs against the testing method and the level of testing. The item writer should first write the LO before thinking of the details of the item.
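A blue print of this kind can be kept as a simple grid and checked mechanically. The sketch below is illustrative only: the LO codes, the level labels, the item identifiers, and the function name `check_coverage` are hypothetical and do not come from any particular curriculum.

```python
from collections import Counter

# Hypothetical blue print: each learning outcome (LO) is listed against
# the planned testing method and the level of testing.
blueprint = {
    "LO1": {"method": "MCQ", "level": "recall"},
    "LO2": {"method": "MCQ", "level": "application"},
    "LO3": {"method": "OSCE", "level": "application"},
}

# Hypothetical test items, each referred to one LO by its writer.
items = [
    {"id": "Q1", "lo": "LO1"},
    {"id": "Q2", "lo": "LO1"},
    {"id": "Q3", "lo": "LO2"},
    {"id": "Q4", "lo": "LO9"},  # refers to an LO that is not in the blue print
]


def check_coverage(blueprint, items):
    """Return items per LO, items with no valid LO, and LOs with no item."""
    counts = Counter(item["lo"] for item in items if item["lo"] in blueprint)
    unmapped = [item["id"] for item in items if item["lo"] not in blueprint]
    uncovered = [lo for lo in blueprint if counts[lo] == 0]
    return dict(counts), unmapped, uncovered


counts, unmapped, uncovered = check_coverage(blueprint, items)
print("Items per LO:", counts)
print("Items without a valid LO:", unmapped)
print("LOs not yet tested:", uncovered)
```

A check like this only confirms that every item points at an LO and every LO has at least one item; judging whether the coverage is fair remains the examiners' decision.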

Bloom's educational taxonomy and its modifications have been used to provide a paradigmatic basis for knowledge classification into various levels in diverse disciplines such as psychiatry9 and pediatrics.10 Bloom's taxonomy has 6 levels: knowledge, comprehension, application, analysis, synthesis, and evaluation.11 The taxonomy has been revised in various ways but retained its main features. The Saudi Commission for Health Specialties modified Bloom's taxonomy into two levels: K1 = recall and comprehension (25%), K2 = application and problem solving (75%).12 Our different epistemological background requires that we think of local alternatives to Bloom's taxonomy. Our ancestors wrote many treatises on classification of knowledge, tasniif al 'uluum, that we need to re-read with modern eyes. Some of the difficulties in writing items may arise from differences in epistemological assumptions and premises.

Knowledge recall is considered the simplest level and is shunned by some testers. This runs counter to what we know about medicine, which as a discipline is easy in the sense that, unlike mathematics, logic, or physics, its facts are easy to understand when presented. Students find the study of medicine difficult not because they do not understand the facts but because there is too much information to digest and retain. Quick reasoning and problem solving in medicine require consideration of many facts simultaneously. A doctor cannot have higher cognitive functioning without a lot of evidence-based facts in the head. Students who retain a lot of facts do better in their examinations and I daresay in their future practice as doctors. Scores in gross anatomy predicted USMLE 1 scores well.13 This is explained by the fact that students with mastery and recall of facts perform better. In my mind recall of essential knowledge should be a major component of any medical test.

Application/problem solving may be remembering a personal experience. We may be overdoing the teaching of problem solving skills because in real life the hospital is run on protocols and clinical practice guidelines and doctors may not always engage in problem solving; they just follow the guidelines.

We need to strike the right equilibrium between general questions and very specific ones. The former can be answered by a grasp of medical concepts or by use of logical tools such as analogies. Many medical students can answer questions using general medical knowledge they get from watching medical television shows, or following the illness of their close relatives. Very specific questions require deeper knowledge and understanding.

Items should be from well-established textbooks and the students should be told what these books are. Basing questions on research papers may be a disadvantage because students might not have had access to them. Even if students follow recent research literature, the facts may change so rapidly that it is not fair to use them as a basis for testing.

Assessment: how?

Informal assessment has little room because of fears of subjective bias. Generally formal testing methods attempt to achieve objectivity. It is virtually impossible to eliminate subjective judgments in some tests such as the OSCE. A student with a confident personality who speaks well will do better in the OSCE exam. We need to establish the right equilibrium between theoretical and practical knowledge. A major challenge to examiners is to know the difference between testing competence in learning and testing competence in taking the test. It is a disaster if a good test taker with less knowledge scores higher.

Traditional testing formats such as long essays, short cases, long cases, and log books have given way to modern tests such as MCQ and OSCE. Appendices 1 and 2 show examples of these two types of examination. The MCQ has been used with satisfaction to test various skills and areas of knowledge such as pediatric resuscitation,14 clinical nursing skills,15 dental clinical skills,16 orthopedics,17 and bronchoscopy.18 Computer based MCQs with immediate feedback can be used as a formative examination.19 Use of regular MCQ tests during training has been found useful for learning by some researchers20 and not by others.21 MCQs with some accommodation can be used with no disadvantages for students with specific learning disabilities such as dyslexia.22 MCQs can be used in more creative ways such as adding them to the OSCE examination.23

MCQs save time and money compared to traditional examinations; they can be administered on paper and corrected by computer or they can be formatted, administered, and corrected by the computer.24 MCQs are alleged to be valid, reliable, and objective but more research is needed in various testing scenarios in our environment before we can make final conclusions.25 They are easy to administer and score. They are bankable and reusable. They can incorporate multi-media (diagrams, path slides, X-ray, audio and video). They are also thought to correlate well with clinical knowledge and have been successfully incorporated in OSCE examinations.36 MCQs are good for standardized tests and are best suited to modern information technology. They can be used mainly to assess knowledge recall. Some researchers find them suitable for assessing critical thinking for large classes of students21 while others think that they hinder critical thinking.26 They have also been developed to test higher cognitive skills.27 MCQs have a memorial effect, with higher scores the second time around.28 This implies that students who practice using past questions or any type of question bank can perform better because they will come across questions they have seen before. MCQs can have the negative effect of teaching students wrong information contained in the options; this can be corrected in feedback. The positive effects of feedback (retention of information) exceed the negative ones of non-feedback (misinformation).29 The retention of knowledge is better if the feedback is delayed.30

Students can be very creative when dealing with examiners who pick up question papers after each examination so that students would not have the questions. I saw students who virtually had the school's MCQ bank. They organized themselves such that each student memorized a question from the examination and wrote it down immediately after the examination. They would look for the correct answer or ask the teachers. We discovered their question bank when they were asking for answers to difficult questions. Suspecting that our question bank had been leaked, we forced them to produce copies of their bank. We discovered that it was from their memory and many questions had been distorted by deficiencies of memorization.

OSCEs have been used for assessment in a wide range of disciplines including psychiatry,31 radiography,32 surgery,33 dentistry,34 internal medicine,35 and non-prescription medicine courses.36 OSCE can predict future clinical performance.37 Non-human stations can be constructed featuring graphs, photos, etc. OSCE has been modified to provide patient continuity, i.e. each station covers an aspect of the illness38 as if the student is examining one patient in the traditional long case scenario. When examining a large number of students OSCE can be done on different days but many stations will be needed.39 Non-native speakers had a disadvantage in OSCE40 and a way must be found to adjust for this. Recently graduated medical students can create good OSCEs for final year medical students.41 Student examiners performed as well as faculty examiners.42 OSCE scores on the same station cannot be compared across several medical schools because of local variations.43 Among the disadvantages of OSCE is inter-rater variability, which can be reduced by pre-examination training.44 OSCE sessions can be filmed and can be viewed if wide inter-rater variation is found.45 Inclusion of SP ratings improves OSCE overall assessment.46

Critical action analysis improves OSCE assessment47 and should have special consideration. A student who fails in the critical action could be failed in the whole examination whatever his or her score may be on other items. The critical action or item is especially important in the final assessment because it leads to a professional license. Failure in just one critical question should be a reason for repeating a year. Examples of critical mistakes are: finding an enlarged prostate in a female patient, prescribing nephrectomy for renal tuberculosis, prescribing rapid IV infusion of 2.0 liters of saline in congestive cardiac failure, and prescribing anti-coagulants in hemorrhage. The logic is 'since you have killed the patient, you cannot be allowed to become a doctor and kill more'.

SPs can be trained for the average student who goes through a laundry list of expected questions. The cleverer student will pick up a clue to the true diagnosis and ask the SP questions that the examiner did not anticipate. The SP can mislead the student, especially when the candidate asks questions off the prepared script. We therefore need to use SPs who are real, i.e. they experienced the conditions being examined and still remember the symptoms, signs, or associated conditions. An alternative approach is to have the examiner as the SP. When the student strays off the prepared script the examiner can still respond to his/her questions without the candidate being misled or confused. In this case the script ends up being changed automatically but the examiner must make a note of it for when the examination scores are reviewed.

Traditional long and short case final assessment relied on the student reaching the final diagnosis with no attention to how the student took the history and made the examination. OSCE on the other hand tests mostly the process and not the outcome. A clever student who clinches the diagnosis in a few questions will not go through the list of irrelevant questions that the examiner expects and will end up with a low score. An average student may just go through a check list of questions relevant to the presenting complaint and end up scoring well.

Empirical studies have been carried out on correlations among tests which can lead to conclusions about their substitutability. MCQs are superior to modified essay questions in testing higher cognition48 and are easier to construct. Performance on MCQs was correlated with performance on essay questions49 and short essay questions.50 A comparison of MCQs with narrative answers showed that the open ended narrative answering allowed students more expression of their thoughts but required more time to score while MCQs were quicker to score and to give feedback to students.51 MCQ scores were not correlated with OSCE scores, indicating that the two test strategies tested different things.52

The MCQ as an assessment standard

Types and structure of MCQs

Type A MCQs have only one correct option. Type X MCQs have multiple true and false options. Type K MCQs have a range of correct options and the candidate has to pick a 'winning' combination. Type A MCQs are becoming more popular.

The components of a structured MCQ are the stem, the question line/lead-in, and the options. The stem is a 70-100 word vignette that provides the context for the case. It must be a complete paragraph with only relevant information. It may be supplemented by diagrams such as graphs, tabulated data, images, and pictures. One stem can be used for more than one item. The question line is brief and is written separately from and below the stem. It is followed by a list of options. The question line is written in such a way that the candidate can write down the correct answer without looking at any of the options. In a type A MCQ, there is one correct option with 2-4 distractors. Three-option MCQs were found to perform as well as 4 or 5 option ones.53

The distractors must be correct statements but are not the most appropriate answer to the question. They must be similar to the correct option in grammar, length, complexity, tense, and on the same continuum. This is required to make sure that the candidate cannot use any clue to find the right answer.

Absolute statements, mutually exclusive statements, 'all of the above', 'none of the above' are not acceptable.

Construction of MCQs

Writing good MCQs is an arduous task for teachers and many of them hanker to return to the easier traditional assessment methods. The efforts of item writers produce preliminary drafts that require further refining. Reviews are needed to remove flaws. The process of constructing MCQ items is educative for the teacher and has been found useful as a revision tool for postgraduate students54 because it challenges them to think about knowledge in a critical and creative way. Students can construct a lot of MCQ items and can also review them well.55 If they can generate LOs for PBL sessions they can be trusted to produce MCQs as well that reflect the student point of view and these can be added to the bank after review.

Assessing MCQs: validity, precision, and reliability

There are several approaches to assessing MCQs that we shall mention without going into methodological details. Validity answers the question 'did we measure what we wanted to measure in terms of knowledge recall, applications, and skills?' It is measured on one item or the whole test. Reliability/consistency assesses whether the test performs in the same way when repeated with different cohorts of candidates. Reliability is measured for the whole test and not its component items. Reliability assumes that the different candidate cohorts are of similar basic ability, which is not always true. One way out of this is to correlate the candidates' scores in the first half of the test with their scores in the second half, or alternatively to correlate the same students' scores in the odd numbered items with their scores in the even numbered items. Validity is not tied to reliability. A valid test may not be reliable and vice versa.
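The odd/even correlation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: responses are scored 1 for correct and 0 for wrong, items are numbered from 1 in the order they appear, the correlation is the ordinary Pearson coefficient written out by hand, and the function names and response data are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the half-test scores does not vary")
    return sxy / sqrt(sxx * syy)


def split_half_correlation(responses):
    """Correlate each candidate's odd-item score with the even-item score.

    `responses` holds one row per candidate; each row is a list of 0/1
    marks in item order, so index 0 is item 1 (an odd numbered item).
    """
    odd_scores = [sum(row[0::2]) for row in responses]
    even_scores = [sum(row[1::2]) for row in responses]
    return pearson(odd_scores, even_scores)


# Invented marks for five candidates on an eight-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
print(round(split_half_correlation(responses), 2))
```

Because both halves are answered by the same candidates at the same sitting, this approach avoids the assumption that different cohorts have similar ability.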

Assessing the assessment: mean score and the score distribution curve

Test performance can also be assessed for the whole test or for its items. Overall test analysis is based on the mean (average) score and the distribution curve. A low mean score means that either the test is poor or the students are poor. A high score may be due to an easy test. The distribution curve of the scores provides more information about the test. It may be normal with a small standard deviation (measurement error), which is what is expected when the test is good and the students are good. Other shapes like the negative skew, the positive skew, or the bi-modal indicate problems in the students or the test that must be analyzed. When analyzing the curve we must keep in mind variability due to the teacher, the test, and the student. Very good or very bad teaching can be reflected in the shape of the curve. Biased scoring can be seen in skewed curves. Average students with a good test will generate a nice normal curve. Very bright students who are well taught may produce a curve highly skewed to the higher grades.
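The mean, spread, and direction of skew can be summarized numerically before the curve is inspected by eye. The sketch below assumes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the variance), which is one common definition among several; the function name and the scores are invented for illustration.

```python
from math import sqrt


def describe(scores):
    """Return the mean, standard deviation, and skewness of a score list."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n  # variance (population form)
    m3 = sum((s - mean) ** 3 for s in scores) / n  # third central moment
    sd = sqrt(m2)
    # Skewness is undefined when all scores are equal; report 0.0 in that case.
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return mean, sd, skew


# Invented scores bunched at the high end with a tail of low marks.
scores = [35, 62, 78, 80, 82, 84, 85, 86, 88, 90]
mean, sd, skew = describe(scores)
print(f"mean={mean:.1f} sd={sd:.1f} skew={skew:.2f}")
```

With this definition a negative value means the tail extends toward low scores (most candidates bunched at the higher grades), a positive value means the tail extends toward high scores, and a value near zero is consistent with a symmetric curve. A bi-modal curve is not detected by skewness and still needs a histogram.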

Assessing the assessment: distractors, difficulty, discrimination, and point biserial

Item analysis focuses on distractor analysis, proportion of correct responses, and the test's ability to discriminate. Weak students will mistake the distractor for a correct answer. Nonfunctional distractors are those not frequently chosen by examinees who recognize them easily as wrong. If a high proportion of students recognize a distractor, then it is not a true distractor and should not have been in the test. Very difficult or very easy items have fewer functional distractors. Item difficulty is measured as the proportion of candidates who answer the item correctly. This varies from cohort to cohort and may not be a reliable indicator. Test discrimination is the proportion of the upper third answering the item correctly minus the proportion of the lower third answering the item correctly and ranges from +1 to -1. This complicated measurement of difficulty has to be used because faculty assignment of item difficulty is subjective and is unreliable.56 The point biserial is the item correlation with the total test mark and ranges from -1 to +1. A corrected biserial is computed after removing a flawed item.
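The item statistics described above can be sketched as follows. This is only an illustration under stated assumptions: item scores are coded 1 (correct) or 0 (incorrect), all function names are hypothetical, and the 5% cut-off for flagging a nonfunctional distractor is an illustrative convention that is not specified in this paper.

```python
from collections import Counter
from math import sqrt


def item_difficulty(item_scores):
    """Proportion of candidates answering the item correctly."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(item_scores, total_scores):
    """Upper-third proportion correct minus lower-third proportion correct."""
    ranked = sorted(zip(total_scores, item_scores), key=lambda pair: pair[0])
    k = len(ranked) // 3  # group size; assumes at least 3 candidates
    lower = [item for _, item in ranked[:k]]
    upper = [item for _, item in ranked[-k:]]
    return sum(upper) / k - sum(lower) / k


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total mark."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    if var_i == 0 or var_t == 0:
        return 0.0  # illustrative convention when there is no variation
    return cov / sqrt(var_i * var_t)


def nonfunctional_distractors(choices, key, options="ABCDE", threshold=0.05):
    """Wrong options chosen by fewer than `threshold` of candidates.

    The 0.05 default is an assumption made for this sketch.
    """
    counts = Counter(choices)
    n = len(choices)
    return [o for o in options if o != key and counts[o] / n < threshold]


# Hypothetical data: 6 candidates, one item, and their total marks
item = [1, 1, 1, 0, 0, 0]
totals = [9, 8, 7, 5, 4, 2]
print(item_difficulty(item))
print(discrimination_index(item, totals))
print(round(point_biserial(item, totals), 2))
```

When several candidates share the same total mark, their placement in the upper or lower third depends on input order, so tied cohorts may need an explicit tie-breaking rule.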

Item construction flaws

Mistakes in item construction, designated as item writing flaws (IWF), are common even after several reviews. Flaws pass and fail students unfairly so we need to edit them out. High achievers were affected more by flaws than average students.57 This is because they think deeply about small points that lead them to wrong responses. Common flaws are: ambiguous or unclear options, overlapping options, negative statements (good for recall but not for application or problem solving), unrealistic distractors, providing more information than needed, long options that students tend to think are correct, providing a clue to the correct option in the stem, use of absolute terms such as 'always' or 'never', true/false items, 'all of the above' options, and 'none of the above' options. Very short options or those not consistent in grammar and tense with the stem give away the answer by elimination. A major flaw is missing the problem in the stem so that the stem and the options have no relationship. Repeating stem words and phrases in the options enables the candidate to identify the correct answer. Among IWFs is use of the terms: frequently, occasionally, rarely, usually, and commonly.

MCQ items should be reviewed to correct the flaws. The review should be made by several people experienced in item writing working in a team, because one reviewer may see a mistake that the others do not see. The writer must consider any criticism, however wrong he thinks it is, and must not be self-defensive, following the philosophy that there is no smoke without fire. Face to face review in the presence of item writers is psychologically difficult. Alternatively, review can be done by each individual separately and the results fed to a chairperson. Quality assurance is based on the following criteria: adherence to an in-house style, item proportion testing at K2 level, functioning distractor proportion, overall discrimination ratio, and IWF frequency.58


There are still many problems, challenges, and unanswered questions regarding assessment in our local environment. The assessment systems do not take into consideration the cultural, intellectual, and educational background of the students which is different from the European or American one.


The world-view, linguistic background, and epistemological paradigms of the students' society should be considered in assessment. Teachers in direct touch with students should construct and assess the examinations. Local empirical studies of the performance of MCQ and OSCE assessment systems are needed.

Appendix A. #1. Example of an MCQ examination item

Vignette: A 60-year-old former cement factory worker with a 40-year history of smoking and a 10-year history of chronic cough and difficulty breathing presents with a painless productive cough of 2 days' duration with mucoid, blood-stained sputum. Examination revealed normal temperature, BP 120/83, pulse 73/minute, and normal heart sounds and rhythm. Chest X-ray showed a 3 by 5 cm opacity in the right upper lung lobe.

Question: What is the most likely diagnosis from the information provided?


A. Pulmonary embolism.

B. Pulmonary eosinophilia.

C. Pulmonary hypertension.

D. Pulmonary adenocarcinoma.

E. Pulmonary chronic obstructive disease.

Appendix B. #2: Example of an OSCE examination

Objective: Assessing the student's ability to take a history of simple headache

Information for the simulated patient: Answer the candidate's questions using the information below. You are a 30-year-old single operation theater nurse complaining of severe headaches every day at the end of work for the last 6 months. The headache is aggravated by long surgery and is partially relieved by analgesics. It is felt at the forehead but the pain moves to the back. You have no other symptoms. You have no previous medical or surgical history.

Instructions for the candidate: The simulated patient is a 30-year-old male who presents with headache. Take a history of the presenting complaint.

Instructions to the examiner: The candidate will have 10 minutes to complete the task. You have an absolutely passive role. Do not communicate with the candidate or the simulated patient either verbally or by body language. Use the scoring sheet below to score the students. Award 2 points for a task well done. Award 1 point for a partially fulfilled task. Award zero if the student does not perform the task or performs in such a way that it is clear he has no competence.

Scoring sheet (showing expected tasks)

(1) Self-introduction, greeting the patient, explaining the purpose of the encounter and asking for permission.

(2) Asking for the patient's name, age, and residence.

(3) Asking about manner of the onset of the pain, time since onset, increase/decrease, frequency and duration of episodes.

(4) Asking about the severity of the pain and attempting to measure it objectively.

(5) Asking about the site of the pain and whether it radiates to other parts of the body.

(6) Asking about precipitating, aggravating, and relieving factors for the pain.

(7) Asking about past medical history.

(8) Asking about past surgical history.

(9) Asking about social history.

(10) Asking about patient's feelings about the illness and its causes.
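The scoring sheet above can be totalled as in the following minimal sketch. It assumes the 10 tasks and the 0/1/2 marking scheme given in the examiner instructions. The function name is hypothetical, and the choice of which tasks count as critical is left to the caller, since this example does not designate any; no pass mark is assumed.

```python
N_TASKS = 10          # tasks (1) to (10) on the scoring sheet
ALLOWED_MARKS = (0, 1, 2)


def score_station(marks, critical_tasks=()):
    """Total an OSCE station and list any critical tasks scored zero.

    marks: list of 10 marks, one per task, each 0, 1, or 2.
    critical_tasks: 1-based task numbers treated as critical
                    (hypothetical; chosen by the examining body).
    """
    if len(marks) != N_TASKS:
        raise ValueError("expected one mark for each of the 10 tasks")
    if any(m not in ALLOWED_MARKS for m in marks):
        raise ValueError("each mark must be 0, 1, or 2")
    critical_failed = [t for t in critical_tasks if marks[t - 1] == 0]
    return {
        "total": sum(marks),
        "maximum": N_TASKS * max(ALLOWED_MARKS),
        "critical_failed": critical_failed,
    }


# Hypothetical candidate who omitted task (4), treated here as critical
result = score_station([2, 2, 1, 0, 2, 2, 2, 2, 2, 2], critical_tasks=(4,))
print(result)
```

Reporting failed critical tasks separately from the total lets an examining body apply a rule under which missing a critical item fails the whole station.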


1. Dole DM, Nypaver CF. Nurse-midwifery: art and science. Nurs Clin North Am 2012; 47(2): 205-213.

2. Balmer DF, Serwint JR, Ruzek SB, Giardino AP. Understanding paediatric resident-continuity preceptor relationships through the lens of apprenticeship learning. Med Educ 2008; 42(9): 923-929.

3. Gilliland WR, La Rochelle J, Hawkins R, Dillon GF, Mechaber L, Dyrbye L, Papp KK, Durning SJ. Changes in clinical skills education resulting from the introduction of the USMLE step 2 clinical skills (CS) examination. Med Teach 2008; 30(3): 325-327.

4. Lampe S, Tsaouse B. Linguistic bias in multiple-choice test questions. Creat Nurs 2010; 16(2): 63-67.

5. Berg K, Winward M, Clauser BE, Veloski JA, Berg D, Dillon GF, Veloski JJ. The relationship between performance on a medical school's clinical skills assessment and USMLE Step 2 CS. Acad Med 2008; 83(10 Suppl): S37-S40.

6. Simon SR, Bui A, Day S, Berti D, Volkan K. The relationship between second-year medical students' OSCE scores and USMLE Step 2 scores. J Eval Clin Pract 2007; 13(6): 901-905.

7. Denton GD, Durning SJ, Wimmer AP, Pangaro LN, Hemmer PA. Is a faculty developed pretest equivalent to pre-third year GPA or USMLE step 1 as a predictor of third-year internal medicine clerkship outcomes? Teach Learn Med 2004; 16(4): 329-332.

8. Werner LS, Bull BS. The effect of three commercial coaching courses on Step One USMLE performance. Med Educ 2003; 37(6): 527-531.

9. Miller DA, Sadler JZ, Mohl PC, Melchiode GA. The cognitive context of examinations in psychiatry using Bloom's taxonomy. Med Educ 1991; 25(6): 480-484.

10. Plack MM, Driscoll M, Marquez M, Cuppernull L, Maring J, Greenberg L. Assessing reflective writing on a pediatric clerkship by using a modified Bloom's Taxonomy. Ambul Pediatr 2007; 7(4): 285-291.

11. Bloom Benjamin. Taxonomy of educational objectives, handbook I: the cognitive domain. New York: David McKay; 1956.

12. Saudi Commission for Health Specialties. The basics of assessment for licensing examinations. (manual for an assessment workshop held on 7-8 April 2012).

13. Peterson CA, Tucker RP. Medical gross anatomy as a predictor of performance on the USMLE Step 1. Anat Rec B New Anat 2005; 283(1): 5-8.

14. Duff JP, Cheng A, Bahry LM, Hopkins J, Richard M, Schexnayder SM, Carbonaro M. Development and validation of a multiple choice examination assessing cognitive and behavioural knowledge of pediatric resuscitation: a report from the EXPRESS pediatric research collaborative. For the EXPRESS investigators. Resuscitation. 2012 Jul 25. PII: S0300-9572(12)00375-9.

15. McWilliam PL, Botwinski CA. Identifying strengths and weaknesses in the utilization of Objective Structured Clinical Examination (OSCE) in a nursing program. Nurs Educ Perspect 2012; 33 (1): 35-39.

16. Ratzmann A, Wiesmann U, Korda B. Integration of an Objective Structured Clinical Examination (OSCE) into the dental preliminary exams [Article in English, German]. GMS Z Med Ausbild 2012; 29(1): Doc09.15.

17. Griesser MJ, Beran MC, Flanigan DC, Quackenbush M, Van Hoff C, Bishop JY. Implementation of an objective structured clinical exam (OSCE) into orthopedic surgery residency training. J Surg Educ 2012; 69(2): 180-189.

18. Quadrelli S, Davoudi M, Galindez F, Colt HG. Reliability of a 25-item low-stakes multiple-choice assessment of bronchoscopic knowledge. Chest 2009; 135(2): 315-321.

19. Karay Y, Schauber SK, Stosch C, Schuettpelz-Brauns K. Can computer-based assessment enhance the acceptance of formative multiple choice exams? A utility analysis. Med Teach 2012; 34(4): 292-296.

20. Mathis BR, Warm EJ, Schauer DP, Holmboe E, Rouan GW. A multiple choice testing program coupled with a year-long elective experience is associated with improved performance on the internal medicine in-training examination. J Gen Intern Med 2011; 26(11): 1253-1257.

21. Serane TV, Arun Babu T, Menon R, Devagaran V, Kothendaraman B. Improving learning during pediatric lectures with multiple choice questions. Indian J Pediatr 2011; 78(8): 983-986.

22. Ricketts C, Brice J, Coombes L. Are multiple choice tests fair to medical students with specific learning disabilities? Adv Health Sci Educ Theory Pract 2010; 15(2): 265-275.

23. Napankangas R, Harila V, Lahti S. Experiences in adding multiple-choice questions to an objective structural clinical examination (OSCE) in undergraduate dental education. Eur J Dent Educ 2012; 16(1): e146-e150.

24. Mandel A, Hornlein A, Ifland M, Lüneburg E, Deckert J, Puppe F. Cost analysis for computer supported multiple-choice paper examinations [Article in English, German]. GMS Z Med Ausbild 2011; 28(4): Doc55.

25. Considine J, Botti M, Thomas S. Design, format, validity and reliability of multiple choice questions for use in nursing research and education. Collegian 2005; 12(1): 19-24.

26. Stanger-Hall KF. Multiple-choice exams: an obstacle for higher-level thinking in introductory science classes. CBE Life Sci Educ 2012; 11(3): 294-306.

27. Khan MU, Aljarallah BM. Evaluation of Modified Essay Questions (MEQ) and Multiple Choice Questions (MCQ) as a tool for Assessing the Cognitive Skills of Undergraduate Medical Students. Int J Health Sci (Qassim) 2011; 5(1): 39-43.

28. Fazio LK, Agarwal PK, Marsh EJ, Roediger 3rd HL. Memorial consequences of multiple-choice testing on immediate and delayed tests. Mem Cognit 2010; 38(4): 407-418.

29. Butler AC, Roediger 3rd HL. Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing. Mem Cognit 2008; 36(3): 604-616.

30. Butler AC, Karpicke JD, Roediger 3rd HL. The effect of type and timing of feedback on learning from multiple-choice tests. J Exp Psychol Appl 2007; 13(4): 273-281.

31. Zahid MA, Al-Zayed A, Ohaeri J, Varghese R. Introducing the Objective Structured Clinical Examination (OSCE) in the undergraduate psychiatric curriculum: evaluation after one year. Acad Psychiatry 2011 Nov 1; 35(6): 365-369.

32. Lele SM. A mini-OSCE for formative assessment of diagnostic and radiographic skills at a dental college in India. J Dent Educ 2011; 75(12): 1583-1589.

33. Falcone JL, Schenarts KD, Ferson PF, Day HD. Using elements from an acute abdominal pain Objective Structured Clinical Examination (OSCE) leads to more standardized grading in the surgical clerkship for third-year medical students. J Surg Educ 2011; 68(5): 408-413.

34. Eberhard L, Hassel A, Baumer A, Becker F, Beck-Mußotter J, Börnicke W, Corcodel N, Cosgarea R, Eiffler C, Giannakopoulos T, Kraus T, Mahabadi J, Rues S, Schmitter M, Wolff D, Wege KC. Analysis of quality and feasibility of an objective structured clinical examination (OSCE) in preclinical dental education. Eur J Dent Educ 2011; 15(3): 172-178.

35. Yang YY, Lee FY, Hsu HC, Huang CC, Chen JW, Lee WS, Chuang CL, Chang CC, Chen HM, Huang CC. A core competence-based objective structured clinical examination (OSCE) in evaluation of clinical performance of postgraduate year-1 (PGY1) residents. J Chin Med Assoc 2011; 74(5): 198-204.

36. Hastings JK, Flowers SK, Pace AC, Spadaro D. An Objective Standardized Clinical Examination (OSCE) in an advanced nonprescription medicines course. Am J Pharm Educ 2010; 74 (6): 98.

37. Wallenstein J, Heron S, Santen S, Shayne P, Ander D. A core competency-based objective structured clinical examination (OSCE) can predict future resident performance. Acad Emerg Med 2010; 17(Suppl 2): S67-S71.

38. Hatala R, Marr S, Cuncic C, Bacchus CM. Modification of an OSCE format to enhance patient continuity in a high-stakes assessment of clinical performance. BMC Med Educ 2011 May; 24 (11): 23.

39. Schoonheim-Klein M, Muijtjens A, Habets L, Manogue M, Van der Vleuten C, Hoogstraten J, Van der Velden U. On the reliability of a dental OSCE, using SEM: effect of different days. Eur J Dent Educ 2008; 12(3): 131-137.

40. Schoonheim-Klein M, Hoogstraten J, Habets L, Aartman I, Van der Vleuten C, Manogue M, Van der Velden U. Language background and OSCE performance: a study of potential bias. Eur J Dent Educ 2007; 11(4): 222-229.

41. Rashid MS, Sobowale O, Gore D. A near-peer teaching program designed, developed and delivered exclusively by recent medical graduates for final year medical students sitting the final objective structured clinical examination (OSCE). BMC Med Educ 2011; 17 (11): 11.

42. Moineau G, Power B, Pion AM, Wood TJ, Humphrey-Murto S. Comparison of student examiner to faculty examiner scoring and feedback in an OSCE. Med Educ 2011; 45(2): 183-191. http://dx.

43. Chesser A, Cameron H, Evans P, Cleland J, Boursicot K, Mires G. Sources of variation in performance on a shared OSCE station across four UK medical schools. Med Educ 2009; 43(6): 526-532.

44. Schwartzman E, Hsu DI, Law AV, Chung EP. Assessment of patient communication skills during OSCE: examining effectiveness of a training program in minimizing inter-grader variability. Patient Educ Couns 2011; 83(3): 472-477.

45. Abe S, Kawada E. Development of computer-based OSCE reexamination system for minimizing inter-examiner discrepancy. Bull Tokyo Dent Coll 2008; 49(1): 1-6.

46. Homer M, Pell G. The impact of the inclusion of simulated patient ratings on the reliability of OSCE assessments under the borderline regression method. Med Teach 2009; 31(5): 420-425.

47. Payne NJ, Bradley EB, Heald EB, Maughan KL, Michaelsen VE, Wang XQ, Corbett Jr EC. Sharpening the eye of the OSCE with critical action analysis. Acad Med 2008; 83(10): 900-905.

48. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: modified essay or multiple choice questions? Research paper. BMC Med Educ 2007 Nov; 28(7): 49.

49. Pepple DJ, Young LE, Carroll RG. A comparison of student performance in multiple-choice and long essay questions in the MBBS stage I physiology examination at the University of the West Indies (Mona Campus). Adv Physiol Educ 2010; 34(2): 86-89.

50. Mujeeb AM, Pardeshi ML, Ghongane BB. Comparative assessment of multiple choice questions versus short essay questions in pharmacology examinations. Indian J Med Sci 2010; 64(3): 118-124.

51. Kim S, Spielberg F, Mauksch L, Farber S, Duong C, Fitch W, Greer T. Comparing narrative and multiple-choice formats in online communication skill assessment. Med Educ 2009; 43(6): 533-541.

52. Dennehy PC, Susarla SM, Karimbux NY. Relationship between dental students' performance on standardized multiple-choice examinations and OSCEs. J Dent Educ 2008; 72(5): 585.

53. Tarrant M, Ware J. A comparison of the psychometric properties of three- and four-option multiple-choice questions in nursing assessments. Nurse Educ Today 2010; 30(6): 539-543 [Epub 2010 Jan 6].

54. Bobby Z, Radhika MR, Nandeesha H, Balasubramanian A, Prerna S, Archana N, Thippeswamy DN. Formulation of multiple choice questions as a revision exercise at the end of a teaching module in biochemistry. Biochem Mol Biol Educ 2012; 40(3): 169-173.

55. Bottomley S, Denny P. A participatory learning approach to biochemistry using student authored and evaluated multiple-choice questions. Biochem Mol Biol Educ 2011; 39(5): 352-361.

56. Kibble JD, Johnson T. Are faculty predictions or item taxonomies useful for estimating the outcome of multiple-choice examinations? Adv Physiol Educ 2011; 35(4): 396-401.

57. Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ 2008; 42(2): 198-206.

58. Ware J, Vik T. Quality assurance of item writing: during the introduction of multiple choice questions in medicine for high stakes examinations. Med Teach 2009; 31(3): 238-243.