Information Retrieval · IRT

The Search Query Processing Lifecycle

Follow a single word as it travels from the search bar all the way to ranked results — every detection, correction, dictionary lookup and learning step in between. The journey splits into two connected engines: Search Suggest (turning raw keystrokes into a clean suggestion) and Search Result (turning that suggestion into matched inventory).

Capture · Language · Normalize · Suggestion Dictionaries · Words Dictionary · Results / Inventory · Learning Loop

PHASE A Search Suggest // keystroke → corrected, ranked suggestion

Capture user keywords

trigger: keystroke

A character (or partial word) lands in the search bar. The raw, unprocessed string enters the pipeline exactly as typed — including typos, casing and stray characters.

raw input » “sheng…”, “クラム”, “Por”, “toukyo”

Detect language

engine: Apache LangDetect

Identify the script and the language so the right rules apply. Disambiguates languages written in shared or similar scripts (Japanese vs Chinese, Hindi vs Bengali, Urdu vs Persian) and handles cross-language intent (search in CJK, results in English).

script トッキオ · とうきょう · 東京 → all resolve to “Tokyo”
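As a rough illustration of the script-detection half of this step, the sketch below guesses the dominant script from Unicode character names and then resolves known spellings to one canonical form. It is not how Apache LangDetect works internally; `dominant_script`, `resolve` and the `CANONICAL` table are hypothetical names, and the table holds only the three spellings shown above. Script alone cannot separate Japanese from Chinese when the input is all ideographs, so a real detector would add statistical language models on top.

```python
import unicodedata
from collections import Counter

# Hypothetical alias table: the three spellings from the example above.
CANONICAL = {"トッキオ": "Tokyo", "とうきょう": "Tokyo", "東京": "Tokyo"}


def dominant_script(text: str) -> str:
    """Return the most common script among the letters of `text`.

    The script is approximated by the first word of each character's
    Unicode name, e.g. "KATAKANA LETTER TO" -> "KATAKANA".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def resolve(text: str) -> str:
    """Map a known alternate spelling to its canonical form, else pass through."""
    return CANONICAL.get(text.strip(), text)


for raw in ("トッキオ", "とうきょう", "東京", "toukyo"):
    print(raw, dominant_script(raw), resolve(raw))
```

The Latin-script typo “toukyo” passes through unchanged here; repairing it is the job of the auto-correction step.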

Auto-correction (if applicable)

engine: Elasticsearch

Fix misspellings using N-Gram / Shingle similarity and Edit-Distance scoring. Four classic error types (swapped, missing, extra and wrong characters) are repaired by transposition, insertion, deletion and substitution respectively.

transpose Brelin → Berlin

insert Munchen → Muenchen

delete Toukyo → Tokyo

substitute Shenghai → Shanghai
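The four repairs above are exactly the operations counted by the optimal string alignment variant of Damerau-Levenshtein distance. The sketch below is a plain-Python illustration of that scoring, not Elasticsearch's implementation; `osa_distance`, `correct`, the tiny vocabulary and the `max_distance=1` cut-off are assumptions made for the example.

```python
from typing import Optional, Sequence


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent characters swapped, e.g. "re" <-> "er".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def correct(term: str, vocabulary: Sequence[str], max_distance: int = 1) -> Optional[str]:
    """Return the closest vocabulary word within max_distance, else None."""
    if not vocabulary:
        return None
    lowered = term.lower()
    best = min(vocabulary, key=lambda w: osa_distance(lowered, w.lower()))
    return best if osa_distance(lowered, best.lower()) <= max_distance else None


VOCABULARY = ["Berlin", "Muenchen", "Tokyo", "Shanghai"]  # hypothetical lexicon
for typo in ("Brelin", "Munchen", "Toukyo", "Shenghai"):
    print(typo, "->", correct(typo, VOCABULARY))
```

Each of the four typos is one edit away from its target, so all of them are corrected with the cut-off of 1. A linear scan over the vocabulary is only workable for small lexicons; an n-gram index is the usual way to shortlist candidates before scoring them.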

Detect abuse · root word · stop words

engine: Kafka Streams

Clean and reduce the term: strip stop words (the, to, of, and), flag slang / negative / abusive tokens, then stem to the root word so variants collapse to one canonical form.

stem running · runs → run (the irregular form ran → run needs a lemma lookup, not suffix stripping)
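A minimal sketch of this cleaning step, assuming whitespace tokenisation and a toy suffix stripper. It does not use Kafka Streams, which would only host such logic in a stream processor. `STOP_WORDS` comes from the text above, while `BLOCKLIST`, `IRREGULAR`, `naive_stem` and `normalize` are hypothetical; a production system would more likely use an established stemmer such as Porter's.

```python
from typing import List, Tuple

STOP_WORDS = {"the", "to", "of", "and"}
BLOCKLIST = {"badword"}       # hypothetical placeholder for abusive tokens
IRREGULAR = {"ran": "run"}    # irregular forms need a lookup, not suffix rules


def naive_stem(token: str) -> str:
    """Very small suffix stripper: handles -ing and plural -s only."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def normalize(text: str) -> Tuple[List[str], List[str]]:
    """Return (cleaned stems, flagged tokens) for a raw query string."""
    tokens = text.lower().split()
    flagged = [t for t in tokens if t in BLOCKLIST]
    kept = [
        naive_stem(t)
        for t in tokens
        if t not in STOP_WORDS and t not in BLOCKLIST
    ]
    return kept, flagged


print([naive_stem(w) for w in ("running", "ran", "runs")])
print(normalize("the runs and running"))
```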

Look up the Suggestion Dictionaries

3-tier cascade

The cleaned term is checked against three suggestion dictionaries in priority order. The first tier that has a match wins and returns immediately — personalised history beats regional, which beats global.

TIER 1

User history

What this person has searched & picked before.

scope · individual

TIER 2

Country history

Popular searches within the user’s region.

scope · regional

TIER 3

Global history

System-wide demand, ordered by rank & word frequency.

scope · everyone

HIT Suggestion found

Return the suggestion immediately, ranked by frequency & popularity. The suggest pipeline ends here (fast path) and the suggestion flows into Phase B.

MISS Not in any suggestion dict

No history match anywhere. Fall through to the authoritative Words Dictionary check below.
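The cascade can be sketched as an ordered scan over three lookup tables that stops at the first hit. The data shape here is an assumption for illustration: each dictionary maps a cleaned term to candidate suggestions with a frequency count. `lookup_suggestion` and the sample entries are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed shape: cleaned term -> {suggestion: frequency}
SuggestionDict = Dict[str, Dict[str, int]]


def lookup_suggestion(
    term: str,
    user_history: SuggestionDict,
    country_history: SuggestionDict,
    global_history: SuggestionDict,
) -> Optional[Tuple[str, List[str]]]:
    """Return (winning tier, suggestions by descending frequency), or None on a miss."""
    tiers = (
        ("user", user_history),
        ("country", country_history),
        ("global", global_history),
    )
    for tier_name, table in tiers:
        candidates = table.get(term)
        if candidates:  # first tier with a match wins
            ranked = sorted(candidates, key=candidates.get, reverse=True)
            return tier_name, ranked
    return None  # MISS: fall through to the Words Dictionary


user = {}
country = {"tok": {"tokyo hotels": 40, "tokyo": 90}}
world = {"tok": {"token": 500}}

print(lookup_suggestion("tok", user, country, world))
print(lookup_suggestion("zzz", user, country, world))
```

In this sample the country tier answers even though the global tier holds a far more frequent candidate, which is the priority rule described above.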

Fallback → Words Dictionary

authoritative lexicon

Is the cleaned term a real, valid word at all? The Words Dictionary is the source of truth that decides whether this becomes a brand-new suggestion or gets rejected as noise.

VALID Word exists

Promote it — add the term into all three suggestion dictionaries (user, country & global) so it’s instantly available next time, then return it as a suggestion.

INVALID Not a word

No suggestion. The input is treated as a bug / garbage / nonsense string and the suggest pipeline stops cleanly.
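A sketch of the fallback decision, assuming the Words Dictionary is a simple set of valid words and each suggestion dictionary is a term-to-count map. `fallback_to_words_dictionary` is a hypothetical name, and seeding a promoted term with a count of 1 is an assumption rather than something the pipeline specifies.

```python
from typing import Dict, Optional, Set


def fallback_to_words_dictionary(
    term: str,
    words_dictionary: Set[str],
    user_history: Dict[str, int],
    country_history: Dict[str, int],
    global_history: Dict[str, int],
) -> Optional[str]:
    """Promote a valid word into all three tiers, or reject it as noise."""
    if term not in words_dictionary:
        return None  # INVALID: no suggestion, the pipeline stops
    for table in (user_history, country_history, global_history):
        # Assumed initial count of 1; existing counts are left untouched.
        table.setdefault(term, 1)
    return term  # VALID: returned as a brand-new suggestion


words = {"shirt", "linen"}
user, country, world = {}, {}, {"shirt": 120}

print(fallback_to_words_dictionary("linen", words, user, country, world))
print(fallback_to_words_dictionary("qwzx", words, user, country, world))
print(user, country, world)
```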

SUGGESTION READY ↓ feeds the result engine

PHASE B Search Result // suggestion → tagged inventory → ranked results

Receive keyword from Search Suggest

handoff

The clean, corrected, language-aware keyword arrives from Phase A as the trusted query seed.

seed “men shirt uniqlo”

Fetch linked tag words

tag graph

Map the keyword onto the inventory’s tag vocabulary — the words products are labelled with, e.g. men, shirt, uniqlo.

B3 · B4

Match inventory & resolve context

retrieve

Fetch tagged items
Pull every product carrying the tags men + shirt + uniqlo.

Establish context
The intersection defines intent: “Uniqlo men’s shirts” inventory context.

Return the result list

render

Emit the matched list with concrete product codes / product links — the visible search results shown to the user.
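Phase B amounts to a set intersection over an inverted index from tag to product codes. The sketch below assumes that structure. The product codes and `TAG_INDEX` contents are made up for illustration, and skipping query tokens that are not tags is a design assumption, not something the pipeline states.

```python
from typing import Dict, List, Set

# Hypothetical inverted index: tag -> product codes carrying that tag.
TAG_INDEX: Dict[str, Set[str]] = {
    "men": {"P-1001", "P-1002", "P-1003"},
    "shirt": {"P-1002", "P-1003", "P-2001"},
    "uniqlo": {"P-1002", "P-1003", "P-3001"},
}


def match_inventory(keyword: str, index: Dict[str, Set[str]]) -> List[str]:
    """Return the sorted product codes that carry every known tag in the keyword."""
    tags = [t for t in keyword.lower().split() if t in index]
    if not tags:
        return []
    # Intersect the smallest posting sets first to keep intermediate results small.
    postings = sorted((index[t] for t in tags), key=len)
    return sorted(set.intersection(*postings))


print(match_inventory("men shirt uniqlo", TAG_INDEX))
```

Only the two products tagged with all three words survive the intersection, which is the “Uniqlo men’s shirts” context from the step above.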

↻ The Learning Loop // every result makes the system smarter

Search isn’t one-and-done. Each interaction quietly upgrades the dictionaries and tag graph, so the next query for everyone gets better. This is the engine behind the keyword → word → tag promotion ladder.

Increment frequency

Every searched word that returns inventory gets its usage count bumped.

Raise word rank

When frequency crosses a threshold, the word climbs to the next ranking tier.

Record breadcrumb

On a product click, log the trail into user history & link it to that user’s other picks.

Promote word → tag

If a word’s rank reaches “tag” level, attach it to the product’s tags. The vocabulary grows itself.
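The promotion ladder can be sketched as a frequency counter with a threshold per rung. The threshold of 3, the `LearningLoop` class and the choice to attach a promoted word to the products it matched are all assumptions for illustration; breadcrumb logging on product clicks is left out.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Set

RANK_LADDER = ("keyword", "word", "tag")
PROMOTE_EVERY = 3  # hypothetical threshold: one rung per 3 productive searches


@dataclass
class LearningLoop:
    frequency: Counter = field(default_factory=Counter)
    product_tags: Dict[str, Set[str]] = field(default_factory=dict)

    def rank(self, word: str) -> str:
        """Current rung of the word on the keyword -> word -> tag ladder."""
        step = min(self.frequency[word] // PROMOTE_EVERY, len(RANK_LADDER) - 1)
        return RANK_LADDER[step]

    def record_search(self, word: str, matched_products: List[str]) -> str:
        """Count a search that returned inventory and promote the word if due."""
        if not matched_products:
            return self.rank(word)  # searches with no inventory are not counted
        self.frequency[word] += 1
        rank = self.rank(word)
        if rank == "tag":
            # Assumption: the new tag is attached to the products it matched.
            for product in matched_products:
                self.product_tags.setdefault(product, set()).add(word)
        return rank


loop = LearningLoop()
ranks = [loop.record_search("linen", ["P-1"]) for _ in range(6)]
print(ranks)
print(loop.product_tags)
```

With these settings the word stays a keyword for two searches, becomes a word on the third and reaches tag level on the sixth, at which point it is written into the product's tag set.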

Search-Oriented Architecture · stack behind the pipeline

Elasticsearch: auto-correct · autocomplete · suggest

Apache LangDetect: language detection

Kafka Streams: stop words · stemming · abuse filter

Inverted Index: core retrieval structure

Graph DB: alternate spellings · domain

WordNet: synonyms · semantic relations

Apache UIMA: NLP · context analysis

Postgres FTS: suggestions / recommendations

Algorithms in play

N-Gram / Shingles: fuzzy similarity

Edit Distance: typo correction

Trie / TST: prefix autocomplete

Stemming: root-word reduction

Likelihood Model: ranking suggestions