
How AI candidate scoring actually works — and where it falls apart

22 Mar 2026 · 7 min read · SunEdge AI

The phrase "AI scoring" gets thrown around with very little explanation of what's actually happening under the hood. For agency owners trying to evaluate vendors — or build their own — it's worth understanding what a defensible scoring pipeline looks like and where it gets things wrong.

Stage one: vector retrieval

Every candidate's profile and every job description gets converted into a numeric vector — a long list of numbers that represents the semantic meaning of the text. Two candidates with similar backgrounds have similar vectors. A role and a well-matched candidate have similar vectors.

This is how you go from 1,000 candidates in your pool to a manageable shortlist of 50 in milliseconds, without an LLM ever reading a single CV. The vectors are stored in a vector database, and the system runs a similarity search: find the 50 candidates whose vectors are closest to the role's vector.
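To make the similarity search concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the distance measure and toy three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and a vector database would normally use an approximate nearest-neighbour index rather than the brute-force scan shown here. The function names and candidate IDs are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(
    role_vec: List[float],
    candidate_vecs: Dict[str, List[float]],
    k: int = 50,
) -> List[Tuple[str, float]]:
    """Return the k candidates most similar to the role, best first."""
    scored = [
        (cand_id, cosine_similarity(role_vec, vec))
        for cand_id, vec in candidate_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (made up for illustration): 3-dimensional stand-ins for embeddings.
role = [0.9, 0.1, 0.0]
pool = {
    "cand_a": [0.8, 0.2, 0.0],
    "cand_b": [0.0, 0.1, 0.9],
    "cand_c": [0.7, 0.0, 0.3],
}
shortlist = top_k_candidates(role, pool, k=2)
print([cand_id for cand_id, _ in shortlist])
```

With this toy data, cand_b points in a very different direction from the role vector and drops out of the shortlist. The same thing happens to a real candidate whose CV wording lands far from the job description.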

This stage is fast, cheap, and surprisingly accurate for surface-level matching. It is also where most of the false negatives happen, because anyone who misses the top 50 is never read by the LLM in stage two. A candidate whose CV says "led the FP&A function" and a role looking for "Director of Financial Planning" will have similar vectors. A candidate whose CV is poorly written, or who's in a non-obvious adjacent role, often won't.

Stage two: LLM re-rank

The top 50 from vector retrieval go into the second stage. The LLM — typically GPT-4 or Claude Sonnet — receives the full job description and the full candidate profile, and returns a structured score: skills match, seniority fit, industry fit, location, availability, with a written rationale for each.
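One way to handle the structured score is to validate the LLM's response before anything downstream trusts it. The sketch below assumes the model returns JSON keyed by the five dimensions, each with a 0–100 score and a rationale. The field names, the 0–100 scale, and the unweighted mean used for the overall score are all assumptions for illustration, not a description of any particular vendor's schema.

```python
import json
from dataclasses import dataclass
from typing import Dict

# Dimension names follow the list above; the exact keys are an assumption.
DIMENSIONS = ("skills_match", "seniority_fit", "industry_fit", "location", "availability")


@dataclass(frozen=True)
class DimensionScore:
    score: int       # assumed 0-100 scale
    rationale: str   # written explanation for this dimension


def parse_llm_score(raw_json: str) -> Dict[str, DimensionScore]:
    """Parse and validate the model's JSON; raise ValueError on anything malformed."""
    data = json.loads(raw_json)
    parsed: Dict[str, DimensionScore] = {}
    for dim in DIMENSIONS:
        if dim not in data:
            raise ValueError(f"missing dimension: {dim}")
        score = int(data[dim]["score"])
        rationale = str(data[dim]["rationale"]).strip()
        if not 0 <= score <= 100:
            raise ValueError(f"{dim}: score {score} outside 0-100")
        if not rationale:
            raise ValueError(f"{dim}: empty rationale")
        parsed[dim] = DimensionScore(score, rationale)
    return parsed


def overall_score(scores: Dict[str, DimensionScore]) -> float:
    """Unweighted mean across dimensions (an assumption; real systems may weight them)."""
    return sum(s.score for s in scores.values()) / len(scores)


# A made-up response in the assumed shape.
example_raw = json.dumps({
    "skills_match": {"score": 95, "rationale": "7 years Python including 4 on payments backend"},
    "seniority_fit": {"score": 90, "rationale": "Senior engineer, role is senior"},
    "industry_fit": {"score": 85, "rationale": "Fintech background, role is in banking"},
    "location": {"score": 100, "rationale": "Based in the role's city"},
    "availability": {"score": 80, "rationale": "One month notice period"},
})
scores = parse_llm_score(example_raw)
print(overall_score(scores))
```

Rejecting a malformed response and retrying is usually safer than accepting a partial score, because a missing dimension would otherwise silently skew the ranking.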

This is the stage that produces the explainable output. When you see "94 — strong match, 7 years Python including 4 at Monzo on payments backend", that came from this step.

The re-rank is slow, expensive, and dramatically more accurate than vector retrieval alone for nuanced fit. It also has its own failure modes.

Where AI scoring falls apart

The honest list, from most common to least.

Scoring drift over time

The same candidate, scored against the same role, will get a slightly different score on different runs, because LLM output is not fully deterministic. The variation is usually ±3 points, occasionally ±8. We cache scores per candidate-role pair so the recruiter never sees this jitter, but it's there.
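Caching per candidate-role pair can be sketched as a small wrapper around whatever function calls the LLM. Everything here is hypothetical: the class name, the IDs, and the stand-in scorer that cycles through fixed values to imitate run-to-run jitter. Invalidating the entry when a profile or job description changes is an assumption about when a fresh score is wanted.

```python
import itertools
from typing import Callable, Dict, Tuple


class ScoreCache:
    """Remembers the first score computed for each (candidate, role) pair."""

    def __init__(self, scorer: Callable[[str, str], int]) -> None:
        self._scorer = scorer
        self._scores: Dict[Tuple[str, str], int] = {}

    def get(self, candidate_id: str, role_id: str) -> int:
        key = (candidate_id, role_id)
        if key not in self._scores:
            # Only call the (slow, non-deterministic) scorer on a cache miss.
            self._scores[key] = self._scorer(candidate_id, role_id)
        return self._scores[key]

    def invalidate(self, candidate_id: str, role_id: str) -> None:
        """Drop a cached score, e.g. after the profile or job description changes."""
        self._scores.pop((candidate_id, role_id), None)


# Stand-in for an LLM call: returns a different value on each invocation.
_fake_llm_outputs = itertools.cycle([91, 94, 88])


def jittery_scorer(candidate_id: str, role_id: str) -> int:
    return next(_fake_llm_outputs)


cache = ScoreCache(jittery_scorer)
first = cache.get("cand_42", "role_7")
second = cache.get("cand_42", "role_7")
print(first, second)  # same value both times, despite the jittery scorer
```

The cache hides the jitter rather than removing it. A re-score after invalidation can still move by a few points.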

Niche skills and recent role changes

If a candidate moved into a new role two months ago, the LLM tends to undervalue their new skills. Their headline may still say "tester" even though they have been doing dev work since the move. We mitigate this by pulling LinkedIn updates and GitHub activity, but the fix is imperfect.

Hard requirements masquerading as soft ones

An LLM can be persuaded to score a candidate based in Cork at 85 for a Dublin-only role, because all the other dimensions match. We solve this with deterministic filters that run after the LLM and zero out candidates failing hard requirements (right to work, salary band, location radius). This step is non-negotiable in any production system.
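A deterministic filter of this kind is ordinary code with no model involved. The sketch below assumes three hard requirements matching the ones named above, checked after the LLM score is produced. The field names, the use of a single salary ceiling, and the distance figure in the example are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    has_right_to_work: bool
    expected_salary: int   # same currency and period as the role's band
    distance_km: float     # distance from the role's location


@dataclass(frozen=True)
class HardRequirements:
    salary_max: int
    max_distance_km: float


def apply_hard_filters(
    llm_score: int, candidate: Candidate, req: HardRequirements
) -> Tuple[int, List[str]]:
    """Return (final_score, failed_requirements); any failure zeroes the score."""
    failures: List[str] = []
    if not candidate.has_right_to_work:
        failures.append("right_to_work")
    if candidate.expected_salary > req.salary_max:
        failures.append("salary_band")
    if candidate.distance_km > req.max_distance_km:
        failures.append("location_radius")
    return (0 if failures else llm_score), failures


# A strong candidate who is far outside the role's radius (figures are made up).
dublin_only = HardRequirements(salary_max=90_000, max_distance_km=30.0)
cork_candidate = Candidate(has_right_to_work=True, expected_salary=80_000, distance_km=220.0)
print(apply_hard_filters(85, cork_candidate, dublin_only))
```

Returning the list of failed requirements alongside the zeroed score lets the recruiter see why a candidate was removed, so the cut is explained as well as applied.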

Bias in the source data

If the role description leans on phrases that correlate with specific demographics, the LLM will pick up on those correlations. We strip identifying information from candidate profiles before the LLM sees them — name, age, photo, gender markers — and we audit scoring outputs monthly against the original pool.
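Stripping identifying information can be sketched as a redaction pass that runs before the profile is placed in the prompt. The field names and the pronoun pattern below are assumptions for illustration. This is deliberately crude: a production system would also need to handle names that appear inside free text, which this sketch does not attempt.

```python
import re
from typing import Any, Dict

# Structured fields dropped entirely (key names are hypothetical).
IDENTIFYING_FIELDS = {"name", "age", "photo_url", "gender"}

# Gendered pronouns in free text; a rough stand-in for "gender markers".
_GENDERED = re.compile(r"\b(he|she|him|her|hers|his|himself|herself)\b", re.IGNORECASE)


def redact_profile(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the profile without identifying fields or gendered pronouns."""
    redacted: Dict[str, Any] = {}
    for key, value in profile.items():
        if key in IDENTIFYING_FIELDS:
            continue
        if isinstance(value, str):
            value = _GENDERED.sub("[redacted]", value)
        redacted[key] = value
    return redacted


# A made-up profile.
profile = {
    "name": "Jane Doe",
    "age": 34,
    "photo_url": "https://example.com/photo.jpg",
    "gender": "F",
    "summary": "She led the FP&A function for four years.",
    "skills": ["FP&A", "forecasting"],
}
print(redact_profile(profile))
```

Redaction of this kind only removes direct markers. Correlated signals such as employers or universities remain in the profile, which is one reason an output audit is still needed.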

Confidently wrong rationales

Occasionally the LLM will assert something about a candidate that the CV doesn't support, for example "Strong DevOps background" when the CV mentions DevOps once in passing. We constrain the LLM to quoting from the source data and flag rationales that introduce unsupported claims.
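One simple form of this check is to require every claim in the rationale to carry a supporting quote, then verify that the quote appears in the CV. The sketch below assumes the LLM returns claims as a list of objects with "claim" and "quote" keys; that shape, the function names, and the sample CV text are all hypothetical.

```python
import re
from typing import Dict, List


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause false flags."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_claims(claims: List[Dict[str, str]], cv_text: str) -> List[str]:
    """Return the claims whose supporting quote is missing or not found in the CV."""
    source = _normalise(cv_text)
    flagged: List[str] = []
    for claim in claims:
        quote = _normalise(claim.get("quote", ""))
        if not quote or quote not in source:
            flagged.append(claim["claim"])
    return flagged


# Made-up CV text and model output.
cv_text = """7 years Python, including 4 on payments backend.
Helped the DevOps team during one release."""
claims = [
    {"claim": "7 years Python", "quote": "7 years Python"},
    {"claim": "Strong DevOps background", "quote": "led DevOps for three years"},
]
flagged = unsupported_claims(claims, cv_text)
print(flagged)
```

A verbatim check like this catches invented quotes. It does not catch an overstated conclusion drawn from a real quote, so a passing mention of DevOps could still be cited as evidence of a strong background. Catching that needs a second review step, human or automated.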

Why it still works

Despite all this, AI scoring outperforms manual screening on the metrics that matter. Recruiters using the system review 3–5× more candidates per role in the same time, and the candidates they ultimately shortlist convert to interviews at a higher rate — because the scoring catches strong candidates the recruiter wouldn't have found in a manual scan.

The right question isn't "is the AI score correct?" — it never is, completely. The right question is "does the AI score reliably surface candidates a good recruiter would want to talk to?" That bar is much more achievable, and it's the one we build to.

Want this for your agency?

Book a 20-min call. I'll show you the demo, ask about your current sourcing process, and tell you honestly whether this is a fit.
