Analyzers: The Hidden Engine of Search
The Component Nobody Talks About
When teams discuss search relevance, the conversation usually jumps to ranking models, vector embeddings, or machine learning. But there's a component that sits upstream of all of them — one that silently determines whether a query will match a document or miss entirely.
Analyzers control how raw text is transformed into searchable tokens. They're the first line of relevance, and they're almost always under-configured.
An analyzer is a pipeline with three foundational stages:
- Character filters: clean the raw text stream
- Tokenizer: split the text into tokens
- Token filters: normalize and enrich the tokens
Every document field goes through this pipeline at index time. Every query goes through it at search time. If the two pipelines don't align — or if either one makes poor decisions — relevance breaks silently.
Stage 1: Character Filters
Character filters operate on the raw character stream before tokenization. They handle the kind of messy text that real-world data always contains.
Common Character Filters
- HTML Strip: Removes HTML tags from content. Essential when indexing web-scraped data or CMS content that includes markup.
- Mapping: Replaces specific characters or character sequences. For example, replacing é with e, or & with and.
- Pattern Replace: Uses regex to normalize text. Useful for stripping formatting characters, normalizing phone numbers, or removing embedded codes.
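In Elasticsearch, all three character filter types can be combined in one custom analyzer. A minimal sketch, where the analyzer name, filter names, and the pattern being stripped are illustrative:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "special_chars": {
          "type": "mapping",
          "mappings": ["& => and", "– => -"]
        },
        "strip_embedded_codes": {
          "type": "pattern_replace",
          "pattern": "\\[code:\\d+\\]",
          "replacement": ""
        }
      },
      "analyzer": {
        "cleaned_text": {
          "type": "custom",
          "char_filter": ["html_strip", "special_chars", "strip_embedded_codes"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```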
When Character Filters Matter
In one project — a real estate search platform — listing descriptions contained HTML entities, embedded CSS classes, and inconsistent use of special characters. Without proper character filtering, queries for "3-bedroom" failed to match listings described as "3–bedroom" (with an en-dash). A simple mapping character filter fixed this and immediately improved recall by 8%.
Stage 2: Tokenizers
The tokenizer splits the character-filtered text into individual tokens. This is where the most impactful decisions happen.
Tokenizer Types
| Tokenizer | Behavior | Best For |
|---|---|---|
| Standard | Splits on whitespace and punctuation, removes most punctuation | General-purpose text |
| Whitespace | Splits only on whitespace, preserves punctuation | Technical content, code |
| Keyword | Treats the entire input as a single token | Exact-match fields (SKUs, IDs) |
| N-gram | Generates overlapping substrings of configurable length | Autocomplete, partial matching |
| Edge n-gram | Generates substrings anchored to the start of each token | Type-ahead / prefix search |
| Path hierarchy | Splits on path separators (e.g., /a/b/c -> /a, /a/b, /a/b/c) | File paths, category trees |
| UAX URL Email | Preserves URLs and email addresses as single tokens | Content with embedded links |
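Elasticsearch's _analyze endpoint works without an index, which makes it easy to try any built-in tokenizer on sample input. For example, the path hierarchy tokenizer:

```json
POST /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/clothing/mens/shirts"
}
```

This returns the tokens /clothing, /clothing/mens, and /clothing/mens/shirts — each ancestor path plus the full path.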
The Tokenizer Trap
Choosing the wrong tokenizer can silently destroy relevance. A classic example:
If you use the standard tokenizer on a field containing product codes like ABC-123-XYZ, it splits on the hyphens and produces three separate tokens: ABC, 123, XYZ. A query for the exact code ABC-123-XYZ would then match any document containing any of those three tokens — including completely unrelated products.
The fix is to use a keyword tokenizer for exact-match fields and a separate analyzed field for full-text search. This is the multi-field pattern — one of the most important schema patterns in search engineering.
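A sketch of that pattern in Elasticsearch, using the un-analyzed keyword field type for the exact sub-field (index and field names illustrative):

```json
PUT /products
{
  "mappings": {
    "properties": {
      "product_code": {
        "type": "text",
        "fields": {
          "exact": { "type": "keyword" }
        }
      }
    }
  }
}
```

Exact lookups then target product_code.exact, while full-text queries use product_code.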
Stage 3: Token Filters
Token filters transform the stream of tokens produced by the tokenizer. This is where the nuance lives.
Essential Token Filters
Lowercase: Normalizes "Search" -> "search". Almost always applied. Without it, queries are case-sensitive, which is rarely what users expect.
Stemming: Reduces words to their root form. "running" -> "run", "contracts" -> "contract". Stemming improves recall (more matches) at the cost of precision (fewer exact matches).
There are two main stemming approaches:
- Algorithmic stemmers (Porter, Snowball) — Apply language-specific rules. Fast and predictable, but make mistakes with irregular words.
- Dictionary-based stemmers (Hunspell) — Look up words in a dictionary. More accurate but slower and require maintenance.
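Both approaches are available as Elasticsearch token filters; a sketch (the filter names are illustrative, and the hunspell filter additionally needs dictionary files installed under the node's config/hunspell/en_US directory):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_hunspell": {
          "type": "hunspell",
          "locale": "en_US"
        }
      }
    }
  }
}
```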
Synonyms: Map semantically equivalent terms. "car" -> ["car", "automobile", "vehicle"]. Synonyms can be applied at index time (expands what's stored) or query time (expands what's searched). Each approach has trade-offs:
- Index-time synonyms: Faster queries, but requires re-indexing when synonyms change.
- Query-time synonyms: More flexible, but adds query-time latency and can cause unexpected scoring effects.
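A query-time setup in Elasticsearch might look like the following sketch, applying a synonym_graph filter only in the search_analyzer so the index itself stays unexpanded (index, field, and filter names illustrative):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "vehicle_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["car, automobile, vehicle"]
        }
      },
      "analyzer": {
        "search_with_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "vehicle_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "search_with_synonyms"
      }
    }
  }
}
```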
Stopword removal: Filters out common words like "the," "is," "at." This reduces index size and prevents common words from diluting relevance scores. But as I've written about separately — stopwords are not as harmless as they look. In many domains, removing them breaks real queries.
Phonetic matching: Converts tokens to phonetic codes (Metaphone, Soundex) for "sounds-like" matching. Useful for names and places where spelling varies: "Smith" ≈ "Smythe".
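In Elasticsearch, phonetic filters come from the analysis-phonetic plugin. A sketch of a Metaphone filter that keeps the original token alongside its phonetic code (analyzer and filter names illustrative):

```json
PUT /people
{
  "settings": {
    "analysis": {
      "filter": {
        "name_phonetic": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "name_phonetic"]
        }
      }
    }
  }
}
```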
Decompounding: Splits compound words (critical for German, Dutch, and Scandinavian languages). "Handschuh" -> "Hand" + "Schuh" (glove = hand + shoe).
Domain-Specific Analyzer Strategy
There's no universal analyzer configuration. The right setup depends entirely on your domain, your data, and your users.
E-Commerce Search
E-commerce has unique challenges: product names are short, attributes are critical, and users mix natural language with product codes.
Recommended approach:
- Use multi-field mappings — one field with aggressive analysis (stemming, synonyms) for recall, and another with minimal analysis (keyword or whitespace) for precision.
- Apply synonym filters carefully. Map brand abbreviations ("HP" -> "Hewlett-Packard"), common misspellings, and category aliases.
- Use edge n-grams on product name fields to power autocomplete without a separate suggest component.
- Don't stem fields like brand names, model numbers, or SKUs.
Legal / Compliance Search
Legal text requires high precision. A search for "contract termination" should not loosely match documents about "contract renewal."
Recommended approach:
- Use conservative stemming or no stemming at all on primary fields.
- Enable stemming on a secondary, lower-boosted field for fallback recall.
- Implement phrase slop — allowing query phrases to match with limited word distance — rather than individual term matching.
- Build domain-specific synonym lists ("breach" -> "violation", "party" -> "signatory").
Multilingual Search
Searching across languages introduces a unique set of challenges. Each language has different tokenization rules, stemming algorithms, and stopword lists.
Recommended approach:
- Use language-specific analyzers for each language field (e.g., french, german, arabic).
- Store content in separate per-language fields (title_en, title_fr, title_de) with appropriate analyzers.
- Use ICU analysis plugins for proper Unicode normalization, especially for Arabic, CJK (Chinese-Japanese-Korean), and Indic scripts.
- For CJK text, use bigram tokenizers rather than standard tokenization, since CJK languages don't use whitespace between words.
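A sketch of the per-language field layout in Elasticsearch, using the built-in language analyzers (index name illustrative):

```json
PUT /content
{
  "mappings": {
    "properties": {
      "title_en": { "type": "text", "analyzer": "english" },
      "title_fr": { "type": "text", "analyzer": "french" },
      "title_de": { "type": "text", "analyzer": "german" }
    }
  }
}
```

Queries can then be routed to the field matching the user's language, or search all three with different boosts.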
At one client — an automotive marketplace serving French and German markets simultaneously — we ran separate analyzer chains per language with shared synonym expansion for cross-language technical terms. This improved cross-market search quality measurably, and reduced zero-result rates in the German market by 22%.
Debugging Analyzer Issues
When search relevance feels wrong, the analyzer is usually the first place to look. Here's how to debug it.
The _analyze API
Both Elasticsearch and Solr expose APIs to test what an analyzer produces:
Elasticsearch:
POST /my-index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "The New York City apartments for rent"
}
Solr:
http://localhost:8983/solr/my-core/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=The+New+York+City+apartments+for+rent
This shows you the exact tokens produced at each stage of the analysis chain. When a query doesn't match a document, compare the tokens produced for both — the mismatch will be immediately obvious.
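The Elasticsearch response lists one entry per token, with offsets and positions you can compare side by side; an abbreviated sketch of the shape (values illustrative):

```json
{
  "tokens": [
    { "token": "new",  "start_offset": 4, "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "york", "start_offset": 8, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }
  ]
}
```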
Common Debugging Patterns
- Query matches too many irrelevant results -> Your analyzer is too aggressive. Check if stemming is reducing distinct concepts to the same root.
- Query returns zero results when it shouldn't -> Your analyzer is misaligned between index-time and query-time. Or your tokenizer is splitting terms that should stay whole.
- Similar queries produce wildly different result quality -> Check synonym handling. Synonyms applied inconsistently cause unpredictable scoring.
- Exact phrases don't match -> Your tokenizer may be splitting on characters within the phrase (hyphens, apostrophes, slashes).
The Multi-Field Pattern
One of the most powerful patterns in search schema design is indexing the same source content into multiple fields with different analysis chains:
{
"title": {
"type": "text",
"analyzer": "standard_with_stemming",
"fields": {
"exact": {
"type": "text",
"analyzer": "keyword_lowercase"
},
"autocomplete": {
"type": "text",
"analyzer": "edge_ngram_analyzer"
}
}
}
}
This gives you:
- title -> Full-text search with stemming for broad recall.
- title.exact -> Exact phrase matching for precision.
- title.autocomplete -> Prefix matching for type-ahead suggestions.
You can then boost these fields differently in your query to balance precision and recall.
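The mapping above references three custom analyzers that must be defined in the index settings. One possible set of definitions, as a sketch (configuration values and the tokenizer name are illustrative):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "standard_with_stemming": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        },
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        },
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "autocomplete_edge",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```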
Best Practices
After years of tuning analyzers across industries — real estate, automotive, e-commerce, legal, and enterprise content platforms — here's what I'd recommend:
- Start with your domain. Don't copy-paste analyzer configs from blog posts. Understand your data, your users' vocabulary, and your business requirements.
- Test with real queries. Pull the top 1,000 queries from your search logs and run them through your pipeline. Check zero-result queries, low-CTR queries, and reformulation patterns.
- Don't rely on defaults. Default analyzers are a starting point, not a destination. The standard analyzer doesn't know your domain.
- Version your synonym and stopword files. Treat them as code. Review changes. Test for regressions.
- Monitor analyzer impact. When you change an analyzer, track nDCG, zero-result rate, and CTR before and after. Analyzers affect every query — regressions compound fast.
The Bottom Line
Search doesn't start with ranking models or embeddings. It starts with analyzers.
Get them right, and your scoring models have clean signal to work with. Get them wrong, and no amount of re-ranking, boosting, or machine learning will compensate.
Analyzers are the hidden engine of search relevance. If you're not actively managing them, you're leaving relevance on the table.