Behind the Build: Why the Assessment Scores Fluency Higher Than Accuracy (And the Story of the Decision)


When I built the fluency assessment, I made one design decision that everything else flowed from: when fluency and accuracy disagree on a sample, which one wins?
Most assessments weight accuracy higher. They were built for academic certification, where grammatical correctness is the proxy for "knows the language." We weight fluency higher. The assessment was built for working professionals, where the proxy is "can use the language to lead a meeting."
This is the story of that decision: the two weeks of arguing it out with myself, the three pieces of data that settled it, what the decision costs us, and the kind of user we under-serve as a result.
The fork in the road
A speaking assessment that uses AI scoring has to combine four weighted sub-scores into one overall CEFR level: fluency, lexical range, grammatical range, and coherence. The four sub-scores rarely agree on a single sample. A speaker who produces grammatically perfect sentences slowly, with frequent self-corrections and long pauses, will score high on grammar and low on fluency. A speaker who produces continuous, structured, persuasive speech with two or three small grammar errors will score high on fluency and lower on grammar.
The weighting decides which speaker is rated higher overall. The decision is not technical. It is editorial. It depends on what the assessment is for.
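To make the point concrete, here is a minimal sketch of how a weighted composite can flip the ranking of the two speakers described above. Everything in it is an assumption for illustration: the 0-100 sub-score scale, the two speaker profiles, and both weight sets are invented and are not the platform's actual values.

```python
# Illustrative only: the scale, the profiles, and the weights are hypothetical,
# not the assessment's real configuration.

SUB_SCORES = ("fluency", "lexical_range", "grammatical_range", "coherence")

# Two made-up weighting schemes; each sums to 1.0.
FLUENCY_FIRST = {"fluency": 0.4, "lexical_range": 0.2,
                 "grammatical_range": 0.2, "coherence": 0.2}
ACCURACY_FIRST = {"fluency": 0.2, "lexical_range": 0.2,
                  "grammatical_range": 0.4, "coherence": 0.2}

# Made-up sub-score profiles on a 0-100 scale.
careful_speaker = {"fluency": 50, "lexical_range": 70,
                   "grammatical_range": 90, "coherence": 70}
fluent_speaker = {"fluency": 85, "lexical_range": 70,
                  "grammatical_range": 65, "coherence": 75}


def composite(scores, weights):
    """Return the weighted sum of the four sub-scores."""
    return sum(weights[name] * scores[name] for name in SUB_SCORES)


for label, weights in (("fluency-first", FLUENCY_FIRST),
                       ("accuracy-first", ACCURACY_FIRST)):
    careful = composite(careful_speaker, weights)
    fluent = composite(fluent_speaker, weights)
    winner = "fluent speaker" if fluent > careful else "careful speaker"
    print(f"{label}: careful={careful:.1f}, fluent={fluent:.1f} -> {winner}")
```

The same two samples produce opposite rankings depending only on the weights, which is why the choice is editorial rather than technical.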
For an academic exam (Cambridge, IELTS, Trinity), accuracy weights higher, because the use case is "can the candidate produce correct English under controlled conditions?" For a workplace assessment, the use case is different. The question is "can this candidate function in an English-speaking professional environment?" And in that environment, fluency wins.
I knew that. The question was how strongly to weight it.
The data that decided it
Three data points settled the question.
First, the correlation in our own learner cohort. I pulled the post-programme career outcome surveys from learners who had completed at least six months on the platform. The fluency sub-score from their initial assessment correlated with reported career outcomes (promotion, role change, senior meeting confidence) at roughly twice the strength of the grammar sub-score. The lexical range sub-score correlated about as strongly as fluency. The coherence sub-score correlated about as strongly as grammar.
That alone would have been enough to weight fluency higher than grammar. It would not have told me by how much.
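The cohort comparison comes down to computing one correlation per sub-score against the same outcome measure. A minimal sketch, assuming each learner's career outcome has been reduced to a single number (the survey coding is not described here, so that step is hypothetical) and using a plain Pearson coefficient, which may not be the statistic that was actually used:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlations_by_sub_score(sub_scores, outcomes):
    """Map each sub-score name to its correlation with the outcome measure.

    sub_scores: dict of name -> list of per-learner scores (hypothetical shape)
    outcomes:   list of per-learner outcome values, in the same learner order
    """
    return {name: pearson(values, outcomes)
            for name, values in sub_scores.items()}
```

With a result `r` from `correlations_by_sub_score`, a ratio such as `r["fluency"] / r["grammatical_range"]` would be one way to express "roughly twice the strength". The ratio indicates which sub-score to favour; it does not by itself fix the size of the weights.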
Second, the workplace literature. The finding that fluency predicts perceived workplace competence more strongly than accuracy past a B1 threshold is not new. It is in the SLA literature going back decades. What is new is the willingness to act on that finding in an assessment design. Most academic assessments have not, because their use case is academic. Mine could, because the use case was not.
Third, the recruiter conversation. I spent two weeks calling recruiters and hiring managers who routinely assess non-native candidates for senior roles. I asked each of them, in different words, the same question: when you decide whether a candidate's English is good enough for a senior role, what are you actually listening for? Without exception, the answer was about flow, response speed, comfort under pressure, and ability to take the floor, not about grammar. The recruiters had been operating with an internal fluency-first model the whole time. I was just designing the assessment to match it.
The weighting
Fluency and lexical range together carry more weight than grammatical range and coherence combined. Fluency is the single highest-weighted sub-score. The two sub-scores that academic exams typically place at the top of the hierarchy (grammatical range and coherence in its academic sense) sit lower in the EFO weighting on purpose.
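The constraints in that paragraph can be written down as a check on any candidate weight set. This is a sketch under assumptions: the function name, the integer-percentage representation, and the example splits are all illustrative, and none of them is the platform's actual weighting.

```python
REQUIRED = {"fluency", "lexical_range", "grammatical_range", "coherence"}


def is_fluency_first(weights):
    """Check integer-percentage weights against the two stated constraints:
    fluency is the single highest weight, and fluency + lexical range
    outweigh grammatical range + coherence.
    """
    if set(weights) != REQUIRED:
        return False
    if sum(weights.values()) != 100 or min(weights.values()) <= 0:
        return False
    others = [value for name, value in weights.items() if name != "fluency"]
    fluency_is_top = all(weights["fluency"] > value for value in others)
    top_pair = weights["fluency"] + weights["lexical_range"]
    bottom_pair = weights["grammatical_range"] + weights["coherence"]
    return fluency_is_top and top_pair > bottom_pair


# Hypothetical splits, chosen only to exercise the check.
example_split = {"fluency": 40, "lexical_range": 25,
                 "grammatical_range": 20, "coherence": 15}
equal_split = {name: 25 for name in REQUIRED}

print(is_fluency_first(example_split))
print(is_fluency_first(equal_split))
```

A check like this is useful mainly as a guard: any later re-weighting can be tested against the editorial rule before it reaches the downstream recommendations.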
This is the single most consequential editorial decision in the entire platform. Everything downstream (the practice recommendations, the coaching focus, the Sophie session structure) is calibrated against this weighting. A learner who tests at B2 on our assessment is a learner who has demonstrated working-professional fluency, not academic precision. That is the user we know how to serve.
What this costs us
The decision has two real costs, and I think about both of them often.
The first is disagreement with other assessments. Our assessment will sometimes return a higher CEFR level for a fluent-but-slightly-inaccurate speaker than they would receive on a traditional academic exam. This confuses users who have been graded elsewhere. We explain the weighting on the result page, and a user who needs an "official" CEFR level for a visa or a tender should know that our score is not that score. We are honest about this. The free assessment is not a certificate. It is a diagnostic.
The second cost is a specific kind of user we under-rate. The very careful, very accurate, very slow speaker (often academic, often older, often from a culture where slow careful speech is the formal register) will test lower with us than they would on a traditional exam. Their grammar is excellent. Their lexical range is rich. Their fluency, by our criteria, is constrained, because the criteria are calibrated for working English where speed-of-response matters.
For most of these users, our score is the more useful one in their actual professional context. But for a minority, especially academic professionals whose role does not actually require fast conversational English, our score under-rates them. We accept this trade-off because the alternative is calibrating the entire assessment for a population we are not the right platform for.
The decision I would not change
Every six months, I look at the weighting again. The data has only ever moved in one direction: toward higher fluency weighting, not lower. Each cohort of learners reinforces that fluency is the sub-score that predicts the career outcomes the learners actually want.
If I were starting again, the weighting might be 35-30-20-15, leaning even harder on fluency. The reason I have not made that move yet is that the current weighting is the one against which two years of recommendations are calibrated. Shifting the weighting would shift every recommendation downstream of it. That recalibration is more expensive than the marginal improvement would be worth.
But the direction is clear. The assessment is fluency-first, and the further into operation we run, the more confident I am that it should be.
TL;DR
The fluency assessment weights fluency and lexical range above grammatical range and coherence, with fluency the single highest-weighted sub-score. The weighting was decided by three data points: the correlation between fluency sub-score and learner career outcomes in our own cohort; the workplace SLA literature on fluency vs accuracy past B1; and two weeks of recruiter interviews that all converged on "we are listening for flow, not grammar." The cost is disagreement with traditional CEFR-aligned exams and under-rating of the careful-accurate-slow speaker. The benefit is that the score tracks real-world professional outcomes more closely than weightings that treat fluency and accuracy as equal. The direction of every re-evaluation has been toward higher fluency weighting, not lower.
Learning Materials
Key Vocabulary
fluency
The ability to produce language continuously at the speed of conversation, with appropriate pauses and recovery.
“In professional settings, fluency predicts perceived competence more than grammar.”
accuracy
The grammatical correctness of the language a speaker produces.
“Most assessments weight accuracy higher than fluency.”
weighting
The relative importance assigned to different components of a composite score.
“The final weighting puts fluency at 30% and coherence at 20%.”
sub-score
One of several individual scores that combine to form an overall result.
“The fluency sub-score correlated twice as strongly with career outcomes as the grammar sub-score.”
correlation
A statistical relationship in which two variables move together to some degree.
“The correlation between fluency and career outcomes was the first data point that settled the question.”
predict
To indicate, on the basis of evidence, that a future outcome is likely.
“Fluency predicts perceived workplace competence more strongly than accuracy past B1.”
threshold
The level at which something begins to take effect or change in character.
“Past a B1 threshold, accuracy adds less to perceived competence than fluency.”
calibrate
To adjust a system so that its outputs align with a defined reference.
“The assessment is calibrated for working professionals, not academic exam-takers.”
marginal
Relating to a small additional change, especially one whose value is being judged against its cost.
“The marginal improvement from re-weighting is smaller than the recalibration cost.”
trade-off
A balance achieved by accepting one loss in return for another gain.
“We accept the trade-off of disagreeing with traditional CEFR exams.”
diagnostic
An assessment whose purpose is to surface a learner's profile, not to certify their level.
“The free assessment is not a certificate. It is a diagnostic.”
proxy
An indirect measure used to stand in for something that is harder to measure directly.
“In academic exams, grammatical accuracy is the proxy for knowing the language.”
validation
The process of confirming that a measurement tool produces results that match the outcomes it claims to predict.
“Internal validation against career-outcome surveys supports the fluency weighting.”
cohort
A group of people who share a defining characteristic, often tracked together over time.
“Each cohort of learners reinforces that fluency predicts the outcomes they want.”
recalibration
The act of adjusting a calibrated system again, especially in response to new data or a changed goal.
“Shifting the weighting would force a recalibration of every downstream recommendation.”
Grammar Notes
Second conditional / hypothetical 'If I were' for design counterfactuals
The second conditional ('If I were... the weighting might be...') describes an imagined, contrary-to-fact present situation. It is the natural English construction for design counterfactuals: discussing a choice that was not made, without committing to making it now. Note the subjunctive 'were' (not 'was') in formal English, and the modal 'might' in the result clause to signal that even in the counterfactual the outcome is not certain.
“If I were starting again, the weighting might be 35-30-20-15, leaning even harder on fluency.”
Common mistake: Using 'If I was starting again, the weighting will be...' mixes registers. 'Was' is acceptable in informal speech but the formal pattern is 'were'; and 'will be' in the result clause turns a counterfactual into a prediction, which is not what the writer means.
Comparative 'more X than Y' for weighting decisions
When two options are being compared on a single dimension (here, weight in a composite score), English uses a simple comparative — 'higher than', 'more than', 'less than'. Note that the second part of the comparison is often omitted when the contrast is clear from the prior sentence: 'We weight fluency higher [than other assessments do].' Skilled professional writing leaves the implied half out and lets the reader supply it.
“Most assessments weight accuracy higher. We weight fluency higher.”
Common mistake: Over-specifying the comparison: 'We weight fluency higher than we weight accuracy higher than other assessments weight fluency.' English prefers implication over repetition. When the comparison is clear, leave the second half implicit.
Present passive for system descriptions
The present passive ('is calibrated', 'is decided', 'is weighted') is the natural English voice for describing how a system behaves — the focus is on what happens to the artefact (the assessment, the score), not on who is doing it. This is the standard register for technical and product documentation. The agent is often omitted entirely because the action matters more than the actor.
“The weighting was decided by three data points.” / “The assessment is calibrated for working professionals.”
Common mistake: Forcing the active voice in system descriptions: 'I calibrate the assessment for working professionals' shifts focus onto the speaker and reads as autobiography rather than specification. Use the passive when describing the system; switch to active when describing your decision-making process about the system.
Comprehension Questions
1. What is the final weighting the assessment uses across its four sub-scores, and which two sub-scores together account for 60% of the overall score?
2. What three data points settled the decision to weight fluency higher than accuracy?
3. Why does the post argue that a fluency-first weighting is the right choice for working professionals but not necessarily for everyone?
4. What two real costs does the author say the fluency-first weighting imposes, and how does the platform respond to each?
5. If you have received an English score from an assessment platform or exam and want to work out whether it is fluency-weighted or accuracy-weighted, what concrete signals would you look for?