2 min read

A Better Multiple-Choice Test

Multiple-choice test scores are meant to reflect a student’s knowledge. Students pick the answer they believe is right for each question; students who understand the content pick the right answer more often than students who don’t. But multiple-choice tests force students to be 100% confident in one answer and 0% confident in the rest, when (a) it’s impossible to be 100% or 0% confident in anything,[1] and (b) it doesn’t map to how people actually think — nobody takes a test with perfect confidence in all of their answers. There’s always some uncertainty, and often quite a bit.

Example: you’re split between A and B, leaning toward A. You’re pretty sure it’s not C or D. Something like this:

A — 50% confident (1:1 odds)
B — 40% confident (2:3 odds)
C — 5% confident (1:19 odds)
D — 5% confident (1:19 odds)

The conventional advice is to just guess A. You'll get half credit in expectation, but you'll never actually get half credit — you'll only ever get full credit or no credit. And we don't test students nearly enough on the same types of problems for the law of large numbers to kick in and for the expected values to converge onto reality. There's a lot of variance that doesn't get accounted for.

Instead, we want a system for students to answer questions (and for teachers to score those answers) that would, for the above example, have the following properties: the student is rewarded with some credit if the answer is A, a little less credit if the answer is B, and a tiny amount of credit if the answer is C or D.

The system which matches those properties is:

  1. Have students give their confidence in each answer choice as a percentage (from 1% to 100%), where the percentages across all of the answer choices sum to 100%. (I.e. explicitly state the confidences from the example.)
  2. Use a proper scoring rule to award students credit.

I set up a playground on Google Sheets that you can experiment with — check it out here. It uses two common proper scoring rules (log and Brier) to evaluate an example multiple-choice test question; this extrapolates straightforwardly to a full multiple-choice test by summing the scores across questions.
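For readers who'd rather see the arithmetic than open a spreadsheet, here's a minimal sketch of both scoring rules applied to the example confidences above, assuming the correct answer turns out to be A (the confidence values are from the example; everything else is illustrative):

```python
import math

# Student's stated confidences for the example question above.
confidences = {"A": 0.50, "B": 0.40, "C": 0.05, "D": 0.05}
correct = "A"  # suppose the correct answer turns out to be A

# Log score: the log of the confidence assigned to the correct answer.
# Ranges from -infinity (0% on the right answer) up to 0 (100% on it);
# higher is better. This is why 0% is excluded as an allowed confidence.
log_score = math.log(confidences[correct])

# Brier score: squared error between the confidence vector and the truth
# vector (1 for the correct choice, 0 for the rest). Lower is better.
brier_score = sum(
    (p - (1.0 if choice == correct else 0.0)) ** 2
    for choice, p in confidences.items()
)

print(f"log score:   {log_score:.4f}")    # log(0.5) ≈ -0.6931
print(f"Brier score: {brier_score:.4f}")  # 0.25 + 0.16 + 0.0025 + 0.0025 = 0.415
```

Note how both rules deliver the properties described earlier: the student earns more credit the more confidence they placed on the true answer, and partial credit if the truth was their second choice, rather than all-or-nothing.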

I think stats teachers should strongly consider using this in their classroom. It better reflects students’ knowledge, and it also integrates stats concepts into everyday teaching. The main downside is that it’s a lot more confusing for students than just giving one answer — that’s why I’d expect it to work best in stats-heavy classrooms, where students are already learning a lot of the requisite concepts.

Edit: looks like this (sorta) exists! See Bayes-Up.


  1. At least for epistemic Bayesians! ↩︎