Information Retrieval · IRT

The Search Query Processing Lifecycle

Follow a single word as it travels from the search bar all the way to ranked results — every detection, correction, dictionary lookup and learning step in between. The journey splits into two connected engines: Search Suggest (turning raw keystrokes into a clean suggestion) and Search Result (turning that suggestion into matched inventory).

Capture
Language
Normalize
Suggestion Dictionaries
Words Dictionary
Results / Inventory
Learning Loop
PHASE A Search Suggest // keystroke → corrected, ranked suggestion
A1

Capture user keywords

trigger: keystroke

A character (or partial word) lands in the search bar. The raw, unprocessed string enters the pipeline exactly as typed — including typos, casing and stray characters.

raw input » “sheng…”, “クラム”, “Por”, “toukyo”
A2

Detect language

engine: Apache LangDetect

Identify the script and the language of the input so the right rules apply. This disambiguates languages whose scripts are shared or look alike (Japanese vs Chinese, Hindi vs Bengali, Urdu vs Persian) and handles cross-language intent (search in CJK, results in English).

script トッキオ · とうきょう · 東京 all resolve to “Tokyo”
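
The script half of this step can be illustrated with a minimal sketch that uses only Python's standard `unicodedata` module. It is not how the language-detection engine named above works, and the function name `dominant_script` is hypothetical. It only labels the script; telling apart languages that share a script (for example Chinese and Japanese text written purely in ideographs) would still need a statistical detector.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Return the most common script label among the letters in `text`.

    Hypothetical helper: the label is the first word of each character's
    Unicode name, e.g. "HIRAGANA LETTER TO" -> "HIRAGANA".
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

if __name__ == "__main__":
    for sample in ("とうきょう", "東京", "クラム", "toukyo"):
        print(sample, dominant_script(sample))
```

The returned label could then select which correction and normalisation rules run in the later steps.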
A3

Auto-correction (if applicable)

engine: Elasticsearch

Fix misspellings using N-Gram / Shingle similarity and Edit-Distance scoring. Four classic error types are repaired by transposition, insertion, deletion or substitution.

transpose Brelin → Berlin
insert Munchen → Muenchen
delete Toukyo → Tokyo
substitute Shenghai → Shanghai
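
The edit-distance half of this step can be sketched as follows. This is a standalone illustration, not Elasticsearch's API or internals: it uses the optimal string alignment variant of Damerau-Levenshtein distance so that a transposition counts as one edit. The names `osa_distance` and `correct`, the `max_distance` cutoff and the small vocabulary in the usage lines are assumptions.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent characters swapped, e.g. "re" vs "er".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def correct(term: str, vocabulary: list[str], max_distance: int = 1) -> str:
    """Return the closest vocabulary word, or the term itself if none is close."""
    if not vocabulary:
        return term
    best = min(vocabulary, key=lambda w: osa_distance(term.lower(), w.lower()))
    if osa_distance(term.lower(), best.lower()) <= max_distance:
        return best
    return term

if __name__ == "__main__":
    cities = ["Berlin", "Muenchen", "Tokyo", "Shanghai"]
    for typo in ("Brelin", "Munchen", "Toukyo", "Shenghai"):
        print(typo, "->", correct(typo, cities))
```

Each of the four examples above is exactly one edit away from its target, which is why a cutoff of 1 is enough here; a production system would likely scale the cutoff with term length.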
A4

Detect abuse · root word · stop words

engine: Kafka Streams

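Clean and reduce the term: strip stop words (the, to, of, and), flag slang / negative / abusive tokens, then reduce each remaining token to its root word so variants collapse to one canonical form. Regular variants can be stemmed by suffix stripping; irregular ones such as “ran” need a lookup (lemmatization).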

stem running · ran · runs → run
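
A rough sketch of this clean-up step, written as plain Python rather than as a Kafka Streams topology. Everything named here is an assumption for illustration: the `BLOCKLIST` entry is a placeholder, the `IRREGULAR` table holds only the one irregular form from the example, and the suffix stripper is deliberately naive (a real deployment would more likely use an established stemmer or lemmatizer).

```python
STOP_WORDS = {"the", "to", "of", "and"}  # the stop words listed in the text
BLOCKLIST = {"badword"}                  # hypothetical abusive-token list
IRREGULAR = {"ran": "run"}               # irregular forms need a lookup table

def stem(token: str) -> str:
    """Naive root-word reduction: irregular lookup, then suffix stripping."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed", "s"):
        if not token.endswith(suffix) or token.endswith("ss"):
            continue
        root = token[: -len(suffix)]
        if len(root) < 3:
            continue  # too short to be a plausible root
        # Undo consonant doubling: "runn" (from "running") -> "run".
        if suffix != "s" and root[-1] == root[-2] and root[-1] not in "aeiou":
            root = root[:-1]
        return root
    return token

def normalize(text: str) -> tuple[list[str], bool]:
    """Return (cleaned root tokens, whether any abusive token was seen)."""
    tokens = text.lower().split()
    flagged = any(t in BLOCKLIST for t in tokens)
    kept = [stem(t) for t in tokens if t not in STOP_WORDS and t not in BLOCKLIST]
    return kept, flagged

if __name__ == "__main__":
    print([stem(w) for w in ("running", "ran", "runs")])
    print(normalize("running to the shops"))
```

Returning the abuse flag separately, instead of silently dropping the token, leaves the policy decision (reject, log, or continue) to the caller.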
A5

Look up the Suggestion Dictionaries

3-tier cascade

The cleaned term is checked against three suggestion dictionaries in priority order. The first tier that has a match wins and returns immediately — personalised history beats regional, which beats global.

TIER 1

User history

What this person has searched & picked before.

scope · individual
TIER 2

Country history

Popular searches within the user’s region.

scope · regional
TIER 3

Global history

System-wide demand, ordered by rank & word frequency.

scope · everyone
HIT Suggestion found

Return the suggestion immediately, ranked by frequency & popularity. The suggest pipeline ends here (fast path), and the suggestion flows into Phase B.

MISS Not in any suggestion dict

No history match anywhere. Fall through to the authoritative Words Dictionary check below.

A6

Fallback → Words Dictionary

authoritative lexicon

Is the cleaned term a real, valid word at all? The Words Dictionary is the source of truth that decides whether this becomes a brand-new suggestion or gets rejected as noise.

VALID Word exists

Promote it — add the term into all three suggestion dictionaries (user, country & global) so it’s instantly available next time, then return it as a suggestion.

INVALID Not a word

No suggestion. The input is treated as a bug / garbage / nonsense string and the suggest pipeline stops cleanly.

SUGGESTION READY ↓ feeds the result engine
PHASE B Search Result // suggestion → tagged inventory → ranked results
B1

Receive keyword from Search Suggest

handoff

The clean, corrected, language-aware keyword arrives from Phase A as the trusted query seed.

seed “men shirt uniqlo”
B2

Fetch linked tag words

tag graph

Map the keyword onto the inventory’s tag vocabulary — the words products are labelled with, e.g. men, shirt, uniqlo.

B3 · B4

Match inventory & resolve context

retrieve

Fetch tagged items
Pull every product carrying the tags men + shirt + uniqlo.

Establish context
The intersection defines intent: “Uniqlo men’s shirts” inventory context.
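
Tag matching of this kind is commonly built on an inverted index: each tag maps to the set of products that carry it, and the query is the intersection of those sets. The sketch below assumes that design; the product codes, the `acme` tag, and the choice to ignore query words that are not known tags are all hypothetical.

```python
from collections import defaultdict

def build_tag_index(inventory: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert {product code: tags} into {tag: product codes}."""
    index: dict[str, set[str]] = defaultdict(set)
    for code, tags in inventory.items():
        for tag in tags:
            index[tag].add(code)
    return index

def match(keyword: str, index: dict[str, set[str]]) -> list[str]:
    """Return product codes carrying every known tag in the keyword."""
    tags = [t for t in keyword.lower().split() if t in index]
    if not tags:
        return []
    hits = set.intersection(*(index[t] for t in tags))
    return sorted(hits)

if __name__ == "__main__":
    inventory = {
        "P-001": {"men", "shirt", "uniqlo"},
        "P-002": {"men", "shirt", "acme"},
        "P-003": {"women", "shirt", "uniqlo"},
    }
    index = build_tag_index(inventory)
    print(match("men shirt uniqlo", index))
    print(match("shirt uniqlo", index))
```

Adding a tag to the query can only shrink the result set, which is how the intersection narrows a broad word like "shirt" into a specific context.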

B5

Return the result list

render

Emit the matched list with concrete product codes / product links — the visible search results shown to the user.

The Learning Loop  // every result makes the system smarter

Search isn’t one-and-done. Each interaction quietly upgrades the dictionaries and tag graph, so the next query for everyone gets better. This is the engine behind the keyword → word → tag promotion ladder.

L1
Increment frequency

Every searched word that returns inventory gets its usage count bumped.

L2
Raise word rank

When frequency crosses a threshold, the word climbs to the next ranking tier.

L3
Record breadcrumb

On a product click, log the trail into user history & link it to that user’s other picks.

L4
Promote word → tag

If a word’s rank reaches “tag” level, attach it to the product’s tags. The vocabulary grows itself.
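
The four steps can be sketched as one small in-memory class. The threshold values, the rank that counts as “tag” level, and the choice to attach a promoted word to the product that was clicked are assumptions; the source does not give concrete numbers or say which product receives the tag.

```python
from collections import Counter, defaultdict

TIER_THRESHOLDS = (5, 20)  # hypothetical frequency cut-offs for ranks 1 and 2
TAG_RANK = 2               # hypothetical rank at which a word becomes a tag

class LearningLoop:
    def __init__(self) -> None:
        self.frequency: Counter[str] = Counter()
        self.breadcrumbs: dict[str, list[tuple[str, str]]] = defaultdict(list)
        self.product_tags: dict[str, set[str]] = defaultdict(set)

    def rank(self, word: str) -> int:
        """L2: rank is the number of thresholds the frequency has crossed."""
        return sum(self.frequency[word] >= t for t in TIER_THRESHOLDS)

    def record_search(self, word: str, returned_inventory: bool) -> None:
        """L1: count only searches that actually returned inventory."""
        if returned_inventory:
            self.frequency[word] += 1

    def record_click(self, user: str, word: str, product: str) -> None:
        """L3: log the trail; L4: promote the word to a tag once it ranks high enough."""
        self.breadcrumbs[user].append((word, product))
        if self.rank(word) >= TAG_RANK:
            self.product_tags[product].add(word)

if __name__ == "__main__":
    loop = LearningLoop()
    for _ in range(20):
        loop.record_search("linen", returned_inventory=True)
    loop.record_click("u1", "linen", "P-002")
    print(loop.rank("linen"), dict(loop.product_tags))
```

Deriving the rank from the frequency, rather than storing it, keeps L1 and L2 from drifting out of sync.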

Search-Oriented Architecture · stack behind the pipeline
Elasticsearch: auto-correct · autocomplete · suggest
Apache LangDetect: language detection
Kafka Streams: stop words · stemming · abuse filter
Inverted Index: core retrieval structure
Graph DB: alternate spellings · domain
WordNet: synonyms · semantic relations
Apache UIMA: NLP · context analysis
Postgres FTS: suggestions / recommendations
Algorithms in play
N-Gram / Shingles: fuzzy similarity
Edit Distance: typo correction
Trie / TST: prefix autocomplete
Stemming: root-word reduction
Likelihood Model: ranking suggestions
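
As an illustration of the prefix-autocomplete entry in this list, here is a minimal trie sketch. The class and method names are hypothetical, and results are returned alphabetically; a real suggester would more likely order them by frequency or a likelihood score.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.is_word = False

class Trie:
    def __init__(self, words=()) -> None:
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix: str) -> list[str]:
        """Return every stored word that starts with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored word has this prefix
        results, stack = [], [(node, prefix)]
        while stack:  # depth-first walk of the subtree below the prefix
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)

if __name__ == "__main__":
    trie = Trie(["shirt", "shanghai", "shoes", "tokyo"])
    print(trie.complete("sh"))
```

Lookup cost depends on the prefix length and the size of the matching subtree, not on the total number of stored words, which is what makes the structure attractive for per-keystroke suggestions.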