Is AI accurate at grading TOEFL Speaking? What 674 attempts show

ETS now scores TOEFL Speaking with AI first. Here's what the official engine measures, how accurate it really is, and where a good grader adds something on top.

ETS doesn't score TOEFL Speaking by hand anymore. Since the January 2026 reform, its automated engine — SpeechRater — is the first-pass scorer for every Speaking response, with human raters pulled back to review only the ones it flags as unusual.

So "is AI accurate at grading Speaking?" isn't a question about some startup's gadget. It's a question about the official scoring path itself.

We built our own grader on the same standard and have run it on 674 graded attempts from 101 learners. Here's what the official engine measures, how accurate it's actually known to be, and the one place a grader should add something on top.

The official standard: what SpeechRater measures

SpeechRater is ETS's automated speech-scoring engine. It runs speech recognition on your answer, then extracts more than 100 measurable features and groups them into three constructs. End to end, it looks like this:

Figure: how an AI scores a spoken TOEFL answer. Audio → speech recognition → 100+ extracted features → three constructs (Delivery, Language Use, Topic Development) → a score.

Those three constructs are the backbone of how a Speaking answer is judged. Each one opens up into the concrete sub-skills the engine actually measures:

Figure: what each construct measures. The AI rating breaks into Delivery (pace and pauses, pronunciation, rhythm and intonation), Language Use (grammar accuracy, vocabulary range, word-choice precision), and Topic Development (coherence, idea progression, elaboration).

At the feature level the same engine surfaces things like fluency, intelligibility, and — on a repeat-after-me task — repeat accuracy, the share of prompt words you reproduced correctly.
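As a rough illustration of the repeat-accuracy idea, here is a minimal sketch that computes the share of prompt words found in a transcript. The function names, the tokenization rule, and the decision to ignore word order are assumptions made for this example; they are not a description of how SpeechRater or any ETS system computes the feature.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def repeat_accuracy(prompt: str, transcript: str) -> float:
    """Share of prompt words that also appear in the transcript.

    Words are matched as a multiset, so a word spoken once cannot cover
    two occurrences in the prompt. Word order is ignored (an assumption).
    """
    prompt_words = tokenize(prompt)
    if not prompt_words:
        return 0.0
    heard = Counter(tokenize(transcript))
    matched = 0
    for word in prompt_words:
        if heard[word] > 0:
            heard[word] -= 1
            matched += 1
    return matched / len(prompt_words)


# Hypothetical prompt and speech-recognition transcript.
prompt = "The library closes at nine on weekdays."
transcript = "the library close at nine on weekdays"
print(round(repeat_accuracy(prompt, transcript), 2))  # 6 of 7 prompt words matched
```

A production feature would likely align words in order and tolerate recognition errors; this version only shows why the metric is easy for a machine to count.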

The structure matters more than the names. The official engine does not hand back a single verdict. It reads your answer along separate axes and scores each one — because a single number averages your strengths and weaknesses into one uninformative blur.
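A small sketch makes the point about a single number concrete. The two learners and their scores below are invented for illustration, and the unweighted mean is an assumption; nothing here reflects how ETS combines constructs into a reported score.

```python
from statistics import mean

# Made-up construct scores on a 0-5 scale, for illustration only.
learner_a = {"Delivery": 2.0, "Language Use": 3.0, "Topic Development": 4.0}
learner_b = {"Delivery": 3.0, "Language Use": 3.0, "Topic Development": 3.0}


def overall(scores: dict[str, float]) -> float:
    """Collapse the construct scores into one number (unweighted mean, an assumption)."""
    return mean(scores.values())


def spread(scores: dict[str, float]) -> float:
    """Gap between the strongest and weakest construct."""
    return max(scores.values()) - min(scores.values())


for name, scores in (("A", learner_a), ("B", learner_b)):
    print(name, "overall:", overall(scores), "spread:", spread(scores))
```

Both learners collapse to the same overall figure, yet only one of them has a two-point gap between constructs. That gap is the information a single verdict throws away.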

So is it accurate?

For the parts it measures, yes, and not by a small margin. ETS reports that its automated scoring agrees with human raters nearly as closely as two trained human raters agree with each other: the figures cited from its technical documentation put the human–machine correlation around 0.89, against roughly 0.96 between two humans. A machine also never drifts: the fiftieth recording of the day gets the same yardstick as the first.
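For readers who want to see what an agreement figure like this measures, here is a minimal sketch of a Pearson correlation between two sets of scores. The score lists are invented for illustration and are not ETS data. Treating the cited figures as Pearson correlations is also an assumption, since the exact statistic is not specified here.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up scores for six responses: one human rater and one automated engine.
human_scores = [2, 3, 3, 4, 4, 5]
machine_scores = [2, 3, 4, 4, 3, 5]
print(round(pearson(human_scores, machine_scores), 2))
```

A value of 1.0 would mean the two raters rank and space every response identically; the closer the human–machine figure sits to the human–human one, the less is lost by letting the machine score first.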

That reliability isn't uniform across the three constructs, though — and ETS is the source for why.

Delivery and Language Use are where automated scoring is strongest, because both reduce to signals a model can count: speaking rate, pause length and frequency, intelligibility, grammatical errors, vocabulary range. They're measurable, and they get measured the same way every time.
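To show why Delivery signals are so countable, here is a minimal sketch that derives speaking rate and pause statistics from word-level timestamps, the kind of output a speech recognizer typically provides. The `Word` type, the 0.5-second pause threshold, and the sample timings are all assumptions for illustration, not the official engine's definitions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def delivery_features(words: list[Word], pause_threshold: float = 0.5) -> dict[str, float]:
    """Compute simple pace and pause features from word timestamps.

    A silence between two words counts as a pause when it lasts at least
    `pause_threshold` seconds (an assumed cutoff).
    """
    if not words:
        return {"words_per_minute": 0.0, "pause_count": 0,
                "mean_pause_seconds": 0.0, "longest_pause_seconds": 0.0}
    duration = words[-1].end - words[0].start
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    pauses = [gap for gap in gaps if gap >= pause_threshold]
    return {
        "words_per_minute": len(words) / duration * 60 if duration > 0 else 0.0,
        "pause_count": len(pauses),
        "mean_pause_seconds": sum(pauses) / len(pauses) if pauses else 0.0,
        "longest_pause_seconds": max(pauses, default=0.0),
    }


# Hypothetical timestamps for a six-word answer.
sample = [
    Word("I", 0.0, 0.2), Word("think", 0.3, 0.6), Word("that", 1.4, 1.6),
    Word("cities", 1.7, 2.2), Word("are", 2.3, 2.5), Word("crowded", 3.5, 4.0),
]
print(delivery_features(sample))
```

Every value here comes from arithmetic on timestamps, which is why the same recording yields the same Delivery numbers on every run.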

Topic Development is the hard one. Judging whether an idea is genuinely relevant, specific, and well-reasoned is a judgment call, and ETS's own research has long treated it as the construct least suited to automation and most reliant on human raters. In plain terms: AI grades how you speak far more confidently than whether what you said was any good.

That's the honest shape of "accurate." Not omniscient — precise about the countable, and deferring on the rest.

What a grader should add on top of the standard

The standard tells you the constructs. It doesn't, on its own, tell a learner what to do next. That's the gap we built into.

We score the same constructs ETS does. Our interview engine reads Topic Development, Delivery, and Language Use; our Listen & Repeat engine reads the feature-level signals — fluency, intelligibility, repeat accuracy. Same axes the official engine uses, so practice transfers to the real thing.

What we add is twofold. First, the real test hands back a band; we expose the per-construct scores underneath it, so you can see which axis cost you the points instead of guessing. Second, and this is the important one, on Topic Development, the construct even ETS treats as hardest for a machine, we don't compress the verdict into a single decimal and pretend it's settled. Each answer is broken into its Opening, Support, and Closing, with specific, quotable feedback on whether your point actually developed. The engine scores what it measures well and shows its work on what it measures least reliably, rather than bluffing a number.

That's the line a grader earns trust on: confident where the signal is countable, transparent where it isn't.

What 674 attempts reveal about where you lose points

Here's the part the construct scores make visible. Across 148 interview answers, Topic Development is the highest-scoring construct, at 3.35 out of 5. Delivery and Language Use both sit lower, at 2.74 — and that's the real bottleneck.
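The comparison above is simple enough to sketch in a few lines. The means are the ones reported in this article; the helper names and the tie-handling tolerance are assumptions for illustration.

```python
# Mean construct scores reported in this article (0-5 scale, 148 interview answers).
reported_means = {
    "Topic Development": 3.35,
    "Delivery": 2.74,
    "Language Use": 2.74,
}


def bottlenecks(means: dict[str, float], tolerance: float = 1e-9) -> list[str]:
    """Return every construct tied for the lowest mean score."""
    lowest = min(means.values())
    return sorted(name for name, score in means.items() if score - lowest <= tolerance)


def gap_to_strongest(means: dict[str, float]) -> dict[str, float]:
    """How far each construct trails the highest-scoring one, rounded to 2 places."""
    strongest = max(means.values())
    return {name: round(strongest - score, 2) for name, score in means.items()}


print(bottlenecks(reported_means))
print(gap_to_strongest(reported_means))
```

Run on the reported means, this flags Delivery and Language Use as joint bottlenecks, each trailing Topic Development by 0.61 points on the 0–5 scale.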

Read that against the most common test-day fear. Learners worry most about what to say. The data says their ideas are mostly fine; the points are leaking out through how those ideas are delivered and through the grammar and vocabulary control underneath them.

It's a neat collision of the two halves of this piece. The constructs AI scores most confidently — Delivery and Language Use — are the very ones where learners have the most ground to gain. So practising against the machine's measurable axes isn't a workaround; it's practising the exact thing that's both reliably scored and most worth fixing.

This is also why a Speaking score so often feels stuck. You spend prep time gathering more to say, while the binding constraint is executing what you already have — clearly, and in real time. We dug into that gap separately in what counts as a good Speaking score; the construct data here is the mechanism behind it.

(One scale note, since the format is mid-transition: the Speaking section is reported on a 1–6 band aligned to the CEFR. Our Listen & Repeat scoring already reports on that band; the interview construct scores above are shown on the 0–5 construct scale the engine reads them on, and are moving to the 1–6 band next.)

What an AI score is good for — and what it isn't

An AI grade is a fast, consistent, repeatable read on the measurable parts of your speaking. That makes it an excellent training instrument: take fifty attempts, get the same yardstick every time, and watch Delivery and Language Use climb with the reps. No human rater scales to that — which is exactly why ETS automated the first pass.

What it isn't is the last word on whether your idea was clever. Treat the number as a measurement of execution, and treat the per-answer feedback as the coaching on content. The real test now draws the same line — automated scoring on the countable constructs, humans in the loop for the rest — so a grader that's honest about that line is one that's teaching you the real thing.

Accurate, in the end, doesn't mean omniscient. It means knowing precisely what you can measure, measuring it the same way every time, and being honest about the rest.

Frequently asked questions

Does the real TOEFL use AI to grade Speaking?

Yes. Since the January 2026 reform, ETS's automated engine, SpeechRater, is the first-pass scorer for every Speaking response. Human raters now play a quality-assurance role, reviewing responses the engine flags as unusual rather than scoring every answer by hand.

Is AI TOEFL Speaking scoring accurate?

On the countable constructs, Delivery and Language Use, it's highly consistent, and ETS reports its automated scores agree with human raters nearly as closely as two trained humans agree with each other. It's least confident on Topic Development, the substance of what you say, which is why human oversight stays in the loop.

What is SpeechRater?

SpeechRater is ETS's official automated speech-scoring engine for the TOEFL. It runs speech recognition on your response, then extracts more than 100 features grouped into three constructs — Delivery, Language Use, and Topic Development — to produce a score.

What can't AI judge well in TOEFL Speaking?

Topic Development — whether your idea is relevant, specific, and well-reasoned. ETS's own research treats it as the construct least suited to automation. AI judges how you say something far more confidently than whether what you said was any good.

How is TOEFL Speaking scored in 2026?

An automated engine scores each response first, across Delivery, Language Use, and Topic Development, reported on a 1–6 band aligned to the CEFR. Humans review flagged responses. See our breakdown of what counts as a good score for the full band table.