The absurd precision of 11-plus test scores

September is the time of year when pupils sit 11-plus tests, a stressful time for all the families involved. There is no easy way to tell a ten-year-old they have failed a test like this, and a fail score carries awful connotations, given these tests’ presumed ability to predict future academic success.

Grammar schools and local authorities operating 11-plus tests have no desire to reveal the odd processes they use to determine who is ‘suitable’ for a grammar school education. However, James Coombs, Comprehensive Future’s data specialist, understands these processes very well. He is so angry about the secrecy and flawed methodology of these tests that he has spent years fighting a legal battle to ensure transparency in the scoring process.

He describes the process of scoring the Buckinghamshire 11-plus as “absurd.” He explains that the scoring in the Bucks 11-plus, like most others, is designed to create artificial precision, producing a number that allows test providers to look “clever and scientific,” but which is mostly about helping them divide hundreds of children who are of broadly the same academic standard.

Here’s James’ explanation of the age weighting process in Buckinghamshire.

I was struck by the fact that the Bucks test provider, CEM, manages to create unique results for virtually every single one of 10,000 individual test takers. In a test of 50 questions there are about 50 possible scores. These scores combined with 12 different birth months mean there are only 50 × 12 = 600 different possible scores. I discovered through another source that CEM calculate age weighting based on each applicant’s age in days. Age weighting down to the actual day multiplies the number of different scores by roughly 30, giving 50 × 365 = 18,250 different scores.
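The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the figures 50, 12, 365 and 10,000 come from the text, while the variable names are made up for the example and nothing here reflects CEM's actual software.

```python
# Figures taken from the text; names are hypothetical, for illustration only.
RAW_SCORES = 50       # possible raw scores on a 50-question test, as stated
BIRTH_MONTHS = 12     # age weighting by month of birth
DAYS_IN_YEAR = 365    # age weighting by exact day of birth
CANDIDATES = 10_000   # approximate number of test takers

by_month = RAW_SCORES * BIRTH_MONTHS   # distinct scores with monthly weighting
by_day = RAW_SCORES * DAYS_IN_YEAR     # distinct scores with daily weighting

print(by_month)                         # 600
print(by_day)                           # 18250
print(by_month < CANDIDATES < by_day)   # True
```

With monthly weighting there are far fewer possible scores than candidates, so many ties are unavoidable. Only daily weighting produces more possible scores than test takers.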

Any applicant born at the end of August is almost a whole year younger than the oldest candidate, and age weighting adds no more than a couple of standardised marks. That works out at (2/365 =) ~0.005 extra marks per day. So why do they standardise to the exact day? In a situation where two children scored exactly the same in each of the three tests (Maths, VR and NVR), if one was just a day younger they’d get an extra ~0.005 on each test score, increasing their overall mark by ~0.015, just enough to “distinguish” [sic] between these two candidates.
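The per-day arithmetic can be reproduced as follows. The sketch assumes, as the 2/365 figure implies, that the age adjustment is spread evenly across the year and that the overall mark is a simple sum of the three test scores. Both are simplifying assumptions for illustration, not a description of the real standardisation.

```python
# Assumed inputs, taken from the text.
MAX_AGE_ADJUSTMENT = 2.0   # standardised marks between oldest and youngest
DAYS_IN_YEAR = 365
TESTS = 3                  # Maths, VR, NVR

# ASSUMPTION: the adjustment grows linearly with each day of age difference.
per_day = MAX_AGE_ADJUSTMENT / DAYS_IN_YEAR

# ASSUMPTION: the overall mark is the plain sum of the three test scores.
overall_gain = per_day * TESTS

print(round(per_day, 4))       # 0.0055
print(round(overall_gain, 3))  # 0.016
```

The text rounds these to ~0.005 and ~0.015. Either way the gain from being one day younger is a hundredth of a mark or so, which only matters if scores are reported to two decimal places.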

At a 95% confidence level, the true score of a test taker lies somewhere within ±7.5 marks of the recorded score. Recording this to two decimal places is like creating a sundial big enough to record the time in minutes … and then mounting it on a ship. “The time is 12:42 precisely, give or take an hour or so!” Grammar schools give parents absurdly precise scores because the unsuspecting public conflate precision with accuracy.
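To put the gap between precision and accuracy in numbers, the sketch below converts the ±7.5 mark interval into an implied standard error of measurement and compares the uncertainty with the 0.01 reporting step. It assumes the measurement error is normally distributed, which the text does not state, so the result is indicative only.

```python
from statistics import NormalDist

HALF_WIDTH = 7.5      # marks either side of the recorded score, from the text
CONFIDENCE = 0.95
REPORTED_STEP = 0.01  # scores are reported to two decimal places

# Two-sided critical value of the standard normal (about 1.96 for 95%).
z = NormalDist().inv_cdf(0.5 + CONFIDENCE / 2)

# Implied standard error of measurement, ASSUMING normally distributed error.
sem = HALF_WIDTH / z

# Number of reporting steps that fit inside the uncertainty on one side.
steps_within_half_width = round(HALF_WIDTH / REPORTED_STEP)

print(round(sem, 2))            # 3.83
print(steps_within_half_width)  # 750
```

Under that assumption the uncertainty on one side of a recorded score is 750 times larger than the smallest difference the reported score can show.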

If you think the idea that a school would “distinguish” on the basis of 0.01 of a mark sounds all very theoretical and unlikely to happen in real life, I can assure you it really does happen. In a short film I made about 11-plus standardisation I quoted the actual scores of two applicants to Reading Boys school for 2015 entry. Although the candidates had differing raw test marks, due to the wholly inappropriate use of precision, their final marks really were 109.99 and 110.00. The ‘cut off’ mark was 110.00. Those two boys were divided as suitable/unsuitable for this school using this 0.01 difference.

As you approach any ‘boundary condition’, such as deciding if someone is of ‘grammar school standard’ [sic] or which of two candidates is the last to be admitted in any cohort, the probability that the correct individual is chosen approaches 50%. You really would do as well to toss a coin. The technical term for this statistical/probabilistic phenomenon is ‘classification accuracy’. Director of UCL’s Centre for Education Improvement Science, Dr Rebecca Allen, provides a good explanation here.
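The coin-toss claim can be illustrated with a small calculation. The sketch below is a simplified model, not the method used by any test provider: it assumes each candidate's recorded score is their true score plus independent, normally distributed error, and it takes the error size from the ±7.5 mark interval quoted earlier (7.5 / 1.96). The function name `prob_correct_order` is hypothetical.

```python
from statistics import NormalDist

# ASSUMPTION: standard error of measurement implied by a +/-7.5 mark 95% interval.
SEM = 7.5 / 1.96


def prob_correct_order(true_gap: float, sem: float = SEM) -> float:
    """Probability that the candidate whose true score is higher by
    `true_gap` marks also receives the higher recorded score.

    ASSUMPTION: both recorded scores carry independent normal errors,
    so the difference between them has standard deviation sem * sqrt(2).
    """
    difference_sd = sem * 2 ** 0.5
    return NormalDist(mu=0.0, sigma=difference_sd).cdf(true_gap)


# The closer two candidates really are, the closer the test gets to a coin toss.
for gap in (10.0, 5.0, 1.0, 0.01):
    print(f"true gap {gap:>5} marks -> {prob_correct_order(gap):.3f}")
```

Under these assumptions a true gap of 0.01 marks gives a probability only a fraction of a percentage point above 0.5, while a gap of ten marks is ordered correctly most of the time. The test separates candidates who are far apart reasonably well and candidates near the boundary hardly at all.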

My film continues with a graph which illustrates that almost a quarter of 11-plus test takers (22%) are misallocated by the 11-plus. The point missed by most commentators, which I try to explain with some crude animation, is that from the school’s perspective a 70% correlation between the test and GCSE results is actually perfectly adequate at selecting a cohort which ensures the school does well in league tables. Those individuals who arbitrarily fall the wrong side of this line will see this statistical equivalent of tossing a coin quite differently.

In theory the 1980 Education Act enshrined parental choice. In practice, the schools are the ones doing the choosing. Revealing the raw marks would expose the issue of classification accuracy in an intuitive way which parents would immediately get without requiring a degree in statistics. That’s the primary reason the schools and test companies don’t want the public to know and understand how raw marks are standardised.

James shows that the secrecy around 11-plus test scoring helps no one but the grammar schools. The idea that an 11-plus result is meaningful and can accurately judge future potential is clearly flawed. There are more than 100,000 children sitting an 11-plus test in the next few weeks, and most of these children will have no idea that it is an inaccurate measure of potential. We know that there will be around 22,000 of these children who will go on to achieve either better or worse GCSE results than predicted by this test. There will be children celebrating success, or crying, when they receive an 11-plus verdict, who will each have scored exactly the same mark on the test day. The nonsense of this pointless school admission test really must end.