Initial Success

After replacing keyword matching with semantic search, the initial results were genuinely impressive. Leads that previously fell through the cracks — because users phrased things differently from how branches were described — started matching correctly. The retrieval pipeline felt like a significant step forward.

The numbers supported this. Match quality improved. Manual review rates dropped. The team felt confident that the hard problem was solved and that the remaining work was refinement rather than rethinking.

That confidence didn't last long.

The worst production bugs are the ones that don't look like bugs at first. The system was working — just not in the way we thought it was.

Within weeks of running the semantic pipeline in production, patterns emerged that keyword matching had never produced. Not errors exactly — the system wasn't crashing. But leads were being associated with branches that, on closer inspection, made no business sense at all.


The False Positive Problem

False positives in semantic retrieval are fundamentally different from those in keyword matching. With keywords, a false positive usually means a word appeared in the wrong context. With embeddings, a false positive means two texts occupy nearby regions of a high-dimensional vector space, and the system treats that proximity as meaningful even when it means nothing to the business.

The problem surfaced across several categories:

  • Branches for adjacent but unrelated services appearing in top results
  • Semantically broad categories consistently outranking specific ones
  • Leads about one domain retrieving branches from a loosely related domain
  • High similarity scores on matches that any human would immediately reject

What the pipeline was returning

Query: "Looking for someone to fix a broken pipe under the kitchen sink"

  • Retrieved: Plumber (correct)
  • Also retrieved: Kitchen renovation specialist, similarity score 0.81. Semantically close, but contextually wrong for this lead.
  • Also retrieved: Home appliance repair, similarity score 0.78. Its description shares vocabulary such as "kitchen" and "fix" with the query, which pulls the two embeddings close together.

The scores looked reasonable. That was the dangerous part. Nothing was obviously broken — the model was doing exactly what it was trained to do. The issue was that what it was trained to do and what the business needed were not the same thing.
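
To make the mechanics concrete, here is a minimal sketch of score-based matching. The three-dimensional vectors, the branch vectors, and the 0.75 threshold are all made up for illustration (real embeddings have hundreds of dimensions, and none of these numbers come from our pipeline). The point is only that every candidate can clear a sensible-looking threshold at once.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen by hand for this example.
lead = [0.9, 0.4, 0.1]  # "fix a broken pipe under the kitchen sink"
branches = {
    "Plumber": [0.8, 0.5, 0.1],
    "Kitchen renovation specialist": [0.7, 0.2, 0.5],
    "Home appliance repair": [0.5, 0.6, 0.4],
}

THRESHOLD = 0.75  # illustrative cut-off, not a tuned value

scores = {name: cosine_similarity(lead, vec) for name, vec in branches.items()}

# Rank by score, then keep everything at or above the threshold.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
matches = [name for name, score in ranked if score >= THRESHOLD]

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With these toy vectors the plumber ranks first, but both wrong branches also pass the threshold. Raising the cut-off only moves the problem: a value strict enough to exclude them here may exclude the correct branch for a differently worded lead.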


Similar ≠ Relevant

This is the fundamental tension at the core of semantic retrieval systems: vector similarity is not the same as business relevance.

Embeddings capture semantic proximity — the degree to which two pieces of text share meaning, vocabulary patterns, and conceptual associations in the model's training data. They are very good at this. But business relevance introduces constraints that live entirely outside the embedding space.

Semantic similarity captures     | Business relevance requires
Shared vocabulary and concepts   | Correct service category for this specific request
Topical proximity                | Operational fit between lead intent and branch scope
Linguistic similarity            | Market-specific constraints and exclusions
General context overlap          | Distinction between adjacent but incompatible categories

A lead about water damage in a home is semantically close to content about plumbing, renovation, insurance, cleaning, and flooring — all at the same time. Embeddings cannot tell you which of these is the right match for the lead. They can only tell you which are nearby.

The core insight

Retrieval finds candidates. It does not make decisions. Treating retrieval results as decisions was the mistake — and it took production data to make that distinction clear.


Contextual Ambiguity

Beyond false positives, a second class of problem emerged that was harder to diagnose: contextual ambiguity.

Many leads contained enough information to be interpreted in multiple valid ways. The same sentence, taken literally, could reasonably map to different branches depending on context that wasn't present in the text itself.

The same query, very different correct answers

  • Lead A: "My garden needs complete work: everything is overgrown and I want it redesigned." Could be: landscaper, garden maintenance, tree surgeon, garden designer.
  • Lead B: "I need help with my roof: there's been a leak after the last storm." Could be: roofer, waterproofing specialist, general builder, insurance assessor.

For both leads, the embedding model would retrieve semantically relevant candidates. But without understanding the user's actual priority — urgency, budget, scope — it had no reliable basis for ranking them correctly.

Users rarely describe what they need in precise, service-catalogue terms. They describe situations. They describe problems. They describe outcomes they want. Embedding models, trained on general text, compress these descriptions into vector representations that capture their general meaning — but lose the specific intent that determines which branch is the right match.

The category overlap problem

A related issue appeared specifically around categories that were genuinely adjacent in the real world. In home services, the boundaries between construction, renovation, and general building work are fuzzy by nature. A lead about extending a house could legitimately involve architects, structural engineers, builders, and planning consultants — all of whom sit in nearby regions of the embedding space.

Semantic retrieval returned all of them. With similar scores. With no way to determine which was the primary match without additional context.

Production impact

When a lead matches too many branches, it either gets sent to all of them — creating noise — or it falls into an unresolved state requiring manual review. Both outcomes were worse than the keyword matching system the pipeline had replaced.
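
One way to make this failure mode visible, sketched below, is to check how far the best score sits above the runner-up. This is an illustration rather than the rule our pipeline used: the `Candidate` type, the `route` function, the thresholds, and the scores for the house-extension lead are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    branch: str
    score: float  # similarity score returned by retrieval


def route(candidates, min_score=0.75, min_margin=0.05):
    """Classify a retrieval result as no_match, single_match, or ambiguous.

    min_score and min_margin are illustrative values, not tuned ones.
    """
    eligible = sorted(
        (c for c in candidates if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    if not eligible:
        return "no_match", []
    top = eligible[0]
    if len(eligible) == 1 or top.score - eligible[1].score >= min_margin:
        return "single_match", [top.branch]
    # Several candidates sit within min_margin of the best score:
    # similarity alone gives no basis for choosing between them.
    tied = [c.branch for c in eligible if top.score - c.score < min_margin]
    return "ambiguous", tied


# Hypothetical scores for a lead about extending a house.
house_extension = [
    Candidate("Architect", 0.83),
    Candidate("Builder", 0.82),
    Candidate("Structural engineer", 0.81),
    Candidate("Planning consultant", 0.80),
]
status, tied_branches = route(house_extension)
print(status, tied_branches)
```

A check like this does not resolve the ambiguity. It only separates the leads that similarity can settle from the ones that end up as noise or in manual review.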


The Need for a Validation Layer

The conclusion from weeks of production data was uncomfortable but clear: semantic retrieval was a necessary component of the pipeline, not a sufficient one.

The retrieval stage was doing its job correctly. The problem was architectural — the system had been built as if retrieval was the final step, when in reality it needed to be the first step of a multi-stage process.

What was missing was a layer that could:

  • Evaluate candidate branches against the lead in context — not just by vector distance
  • Reason about whether a retrieved match was actually appropriate
  • Handle ambiguity by making a judgement call rather than returning all options
  • Apply business logic that lived outside the embedding space entirely

Ranking functions, threshold tuning, and metadata filters all helped at the margins. But the fundamental gap — the inability to reason about relevance rather than just measure similarity — could not be closed with retrieval techniques alone.
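
For reference, retrieval-side tuning of this kind might look like the request body below, assuming an Elasticsearch 8.x cluster and its `knn` search option. The index field names (`description_embedding`, `market`, `branch_name`, `category`), the helper name `build_knn_search`, and every numeric value are assumptions made for this sketch, not details of our index.

```python
def build_knn_search(query_vector, market, min_similarity=0.75, k=10):
    """Build a kNN search body with a metadata filter and a similarity floor.

    All field names here are hypothetical placeholders.
    """
    return {
        "knn": {
            "field": "description_embedding",  # dense_vector field (assumed name)
            "query_vector": query_vector,
            "k": k,                             # neighbours to return
            "num_candidates": k * 10,           # candidates examined per shard
            "similarity": min_similarity,       # minimum vector similarity to qualify
            "filter": {"term": {"market": market}},  # metadata filter (assumed field)
        },
        "_source": ["branch_name", "category"],
    }


body = build_knn_search([0.12, 0.48, 0.33], market="uk")
print(body["knn"]["filter"])
```

Every knob in this body narrows or reorders the candidate set. None of them asks whether a given candidate is actually the right branch for the lead, which is the gap described above.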

The system needed something that could read the lead, read the candidate, and decide. Not measure distance. Not rank by score. Decide.


Introducing LLM Validation

The solution that emerged from this analysis was to introduce an LLM-based validation layer between retrieval and final matching. The idea was straightforward in principle: after Elasticsearch returned a set of semantically similar candidates, an LLM would evaluate each candidate against the lead and determine whether the match was genuinely appropriate.

This reframed the problem in a way that better matched its actual nature. Classifying ambiguous natural language against business categories is not a similarity problem; it is a reasoning problem. And LLMs, trained on vast amounts of human-written text about language and context, are considerably better suited to that kind of reasoning than a vector-distance measure is.

The architecture that emerged from this shift — retrieval as candidate generation, LLM as validation — produced results that neither system could achieve independently. But introducing an LLM into a production pipeline created a new class of challenges that were entirely different from the ones it solved.
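
The shape of that split can be sketched in a few lines. Everything here is illustrative: the type names, the prompt wording, the branch descriptions, and the plumber's score of 0.88 are invented for the example, and `stub_validator` is a simple placeholder standing in for a real model call so that the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    branch: str
    description: str
    score: float  # similarity score from retrieval


@dataclass(frozen=True)
class Verdict:
    accept: bool
    reason: str


# A validator takes the lead text and one candidate and returns a verdict.
Validator = Callable[[str, Candidate], Verdict]


def build_prompt(lead_text: str, candidate: Candidate) -> str:
    """Hypothetical prompt wording, shown only to make the idea concrete."""
    return (
        "Lead: " + lead_text + "\n"
        "Candidate branch: " + candidate.branch + "\n"
        "Branch description: " + candidate.description + "\n"
        "Is this branch an appropriate match for the lead? "
        "Answer yes or no, with a reason."
    )


def match_lead(lead_text: str, candidates: List[Candidate], validate: Validator) -> List[Candidate]:
    """Keep only the candidates the validator accepts, in retrieval order."""
    return [c for c in candidates if validate(lead_text, c).accept]


def stub_validator(lead_text: str, candidate: Candidate) -> Verdict:
    """Placeholder for an LLM call.

    A real validator would send build_prompt(...) to a model and parse the reply.
    """
    if "pipe" in lead_text.lower() and "pipe" in candidate.description.lower():
        return Verdict(True, "description covers pipe repairs")
    return Verdict(False, "branch scope does not cover the request")


lead = "Looking for someone to fix a broken pipe under the kitchen sink"
retrieved = [
    Candidate("Plumber", "Repairs leaking or broken pipes, taps and drains", 0.88),
    Candidate("Kitchen renovation specialist", "Designs and installs new kitchens", 0.81),
    Candidate("Home appliance repair", "Repairs dishwashers, ovens and washing machines", 0.78),
]
accepted = match_lead(lead, retrieved, stub_validator)
print([c.branch for c in accepted])
```

Passing the validator in as a function keeps retrieval and validation independent: the retrieval stage can be tuned, or the model swapped, without touching the other side.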

What comes next

The next article covers the full hybrid architecture: how retrieval and LLM validation were combined, how prompts were designed to produce reliable decisions, and what happened when they didn't.


What This Stage Taught

The period between deploying semantic search and introducing LLM validation was one of the most instructive phases of the project — not because things worked, but because of exactly how they failed.

The core lesson was about the difference between capability and correctness. Embeddings are capable of capturing semantic meaning at a level that keyword systems cannot approach. But capability does not automatically produce correct outcomes in a production system with business constraints.

Building retrieval pipelines means accepting that retrieved results are hypotheses, not answers. Every system that treats retrieval output as a final decision — without a layer that evaluates those hypotheses against the real-world context — is making an assumption that the data will eventually disprove.

The false positives were not a bug in the embedding model. They were a predictable consequence of applying a general-purpose similarity tool to a problem that required domain-specific reasoning. Recognising that distinction — and building for it rather than against it — changed the direction of the entire system.