Large Language Models · LLM

The AI Lifecycle

Follow a single prompt as it travels from your message all the way to a generated answer — every stage of comprehension and generation in between. The journey splits into two connected engines: Comprehend (turning your words into meaning and grounded context) and Generate (turning that meaning into a reasoned, written answer). Same shape as search — retrieve, then produce — with one genuinely new step.

Context
Tokenize
Embed
Retrieve · RAG
Attend
Generate
Feedback Loop
PHASE A Comprehend // prompt → tokens → vectors → grounded context
A1

Assemble the context window

the model's whole world

Your message is never seen alone. It is stitched together with the system instructions and the chat history into one context window — the entire, finite input the model gets. Nothing is looked up in a database of stored answers.

PART 1

System prompt

The standing rules: who the model is and what it may do.

set by · the app
PART 2

Chat history

Everything said so far in this conversation, in order.

scope · the session
PART 3

Your message

The new prompt, the thing you just typed.

scope · this turn
your message » “explain attention simply”
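
The three parts above can be stitched together roughly as follows. This is a minimal sketch, not any particular vendor's API: the `Message` type, the `assemble_context` helper, the word-count stand-in for a tokenizer, and the "drop the oldest turns first" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def assemble_context(system_prompt, history, user_message, max_tokens=50):
    """Stitch system prompt, chat history, and new message into one window."""

    def rough_tokens(text):
        # Crude stand-in for a real tokenizer: count whitespace-separated words.
        return len(text.split())

    system = Message("system", system_prompt)
    newest = Message("user", user_message)
    # The system prompt and the new message always go in; history gets the rest.
    budget = max_tokens - rough_tokens(system.content) - rough_tokens(newest.content)

    kept = []
    # Walk the history newest-first and keep turns while they still fit.
    for msg in reversed(history):
        cost = rough_tokens(msg.content)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system, *kept, newest]


history = [
    Message("user", "what is a transformer"),
    Message("assistant", "A neural network architecture built around attention."),
]
window = assemble_context("You are a concise tutor.", history, "explain attention simply")
print([m.role for m in window])
```

With a smaller `max_tokens`, older turns fall out of the window first, which is one simple way to respect the finite budget.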
A2

Tokenize

engine: BPE tokenizer

A model never sees letters. The text is split into tokens — subword pieces from a fixed vocabulary. Common words stay whole; rarer ones split into reusable pieces, and each token becomes an integer ID.

tokens » explain · attention · simply   ids » 25, 6817, 9760, 88
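
To make the subword idea concrete, here is a simplified byte-pair-encoding style sketch. The merge list, the vocabulary, and the resulting IDs are invented for this example and do not come from any real tokenizer; production tokenizers learn tens of thousands of merges from a large corpus and usually work on bytes.

```python
def bpe_encode(word, merges, vocab):
    """Apply learned merges in priority order, then map pieces to integer IDs."""
    pieces = list(word)  # start from single characters
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(pieces) > 1:
        # Score every adjacent pair; unknown pairs get an infinite rank.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(pieces, pieces[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces, [vocab[p] for p in pieces]


# Hypothetical merges, highest priority first.
MERGES = [("l", "y"), ("s", "i"), ("m", "p"), ("si", "mp")]
# Toy vocabulary: single characters first, then one entry per merge result.
VOCAB = {tok: i for i, tok in enumerate(sorted(set("simply")) + [a + b for a, b in MERGES])}

pieces, ids = bpe_encode("simply", MERGES, VOCAB)
print(pieces, ids)
```

In this toy setup the word ends up as two reusable pieces, each with its own integer ID.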
A3

Embed

engine: embedding table

Each token ID is looked up and becomes a vector (a list of numbers), so meaning turns into geometry: similar ideas land near each other (king near queen). Semantic search rests on the same idea, although it typically embeds whole passages with a separate embedding model rather than reusing the LLM's token table.

each token → a vector » [0.12, -0.41, …] · [0.90, 0.08, …] · [-0.33, 0.71, …]
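
The lookup and the "near each other" claim can be illustrated with a tiny table. The three-dimensional vectors and token IDs below are made up for the example; real embedding tables have hundreds or thousands of dimensions and are learned during training.

```python
import math

# Hypothetical embedding table: token ID -> vector.
EMBEDDINGS = {
    0: [0.90, 0.80, 0.10],   # "king"
    1: [0.85, 0.82, 0.15],   # "queen"
    2: [0.10, 0.05, 0.95],   # "banana"
}


def embed(token_ids):
    """Look up one vector per token ID."""
    return [EMBEDDINGS[t] for t in token_ids]


def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king, queen, banana = embed([0, 1, 2])
print(cosine(king, queen) > cosine(king, banana))
```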
A4

Retrieve context · RAG (if grounded)

engine: vector search

The prompt is embedded as a single query vector and matched against a vector database to pull the most relevant documents (your docs, your code, your knowledge base), which are then injected into the context. This is search's retrieval step, reborn inside the model's input.

GROUNDED Documents found

The nearest passages are injected into the context window, so the answer is anchored to real sources, not just memory.

PARAMETRIC No retrieval

With no knowledge base attached, the model answers from its trained parameters alone — fast, but ungrounded.

injected » transformer-paper.md · attention-explained.md
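
A brute-force version of the retrieval step might look like the sketch below. The document vectors, the query vector, and the extra `release-notes.md` entry are hypothetical; a real vector database would use an approximate nearest-neighbour index instead of scoring every document.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec, index, k=2):
    """Return the names of the k documents whose vectors are closest to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical index: document name -> embedding of its content.
INDEX = {
    "transformer-paper.md": [0.9, 0.1, 0.2],
    "attention-explained.md": [0.8, 0.3, 0.1],
    "release-notes.md": [0.1, 0.9, 0.4],
}

query = [0.85, 0.2, 0.15]  # assumed embedding of "explain attention simply"
top = retrieve(query, INDEX)
# Grounding: the retrieved passages are placed in the context ahead of the question.
context = [f"[source: {name}]" for name in top] + ["explain attention simply"]
print(top)
```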
CONTEXT + MEANING READY ↓ feeds the transformer
PHASE B Generate // attention → next-token loop → tools → answer
B1

Attend

engine: transformer · self-attention

A word means nothing alone. The model weighs each token against the others in the context, all in parallel, to work out what refers to what and what matters (a generating model looks only at the tokens that came before). This is the 2017 transformer doing its work: relevance scoring, the search engineer's craft, turned inward on the sentence.

weights » “attention” looks most at “explain” and the retrieved transformer-paper
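
The weighing step is scaled dot-product attention. The sketch below is deliberately stripped down: it uses the input vectors directly as queries, keys, and values (a real transformer applies learned projections and many heads), and it assumes a causal mask, so each position sees only itself and earlier positions.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors, causal=True):
    """Return (outputs, weights) for single-head attention without projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # With a causal mask, position i may only look at positions 0..i.
        visible = vectors[: i + 1] if causal else vectors
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
outputs, weights = self_attention(vectors)
print(weights[2])
```

Each row of `weights` says how much one token "looks at" each token it is allowed to see.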
B2

Generate, token by token

★ the one new step · engine: autoregressive loop

Now the model writes. It produces a probability distribution over every token in its vocabulary, picks or samples one, appends it, and runs the whole thing again, building the answer one piece at a time. A very large, very capable autocomplete. This is the stage search never had: a search engine ranks pages that already exist; this loop generates text that did not.

streaming » “Attention lets each word look at the others …”
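
The loop itself is short. In this sketch a hand-written lookup table stands in for the trained model, and the probabilities are invented; a real model conditions on the whole context rather than on the last token, and often samples from the distribution instead of always taking the most likely token.

```python
# Hypothetical next-token table standing in for a trained model.
NEXT = {
    "<start>": {"Attention": 0.9, "The": 0.1},
    "Attention": {"lets": 0.8, "is": 0.2},
    "lets": {"each": 0.7, "every": 0.3},
    "each": {"word": 1.0},
    "word": {"look": 0.6, "<end>": 0.4},
    "look": {"<end>": 1.0},
}


def next_token_distribution(context):
    # Toy simplification: only the last token decides what comes next.
    return NEXT[context[-1]]


def generate(max_new_tokens=10):
    context = ["<start>"]
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)   # a probability per candidate
        token = max(probs, key=probs.get)          # greedy pick: most likely token
        if token == "<end>":
            break
        context.append(token)                      # append and go around again
    return " ".join(context[1:])


print(generate())
```

Swapping the greedy pick for `random.choices(list(probs), weights=list(probs.values()))[0]` would give sampled, less repetitive output.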
B3

Act with tools · MCP (if needed)

protocol: MCP

When the task needs more than text, the model calls tools — query a database, hit an API, edit a file. This is the line between a chatbot and an agent.

NEEDS TOOL Call out

The model emits a structured tool call; the host application runs it and feeds the result back into the context, and then the model keeps generating.

TEXT ONLY Answer directly

If the answer is just language, no tool is called — generation continues straight to the reply.

tool call » search_docs("self-attention") → 3 results
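
The branch between "call out" and "answer directly" can be sketched as a small dispatcher on the host side. The JSON shape, the `TOOLS` registry, and the toy corpus are assumptions for illustration; this is not the MCP wire format, which defines its own JSON-RPC based messages.

```python
import json


def search_docs(query):
    # Hypothetical tool: substring search over a tiny in-memory corpus.
    corpus = [
        "self-attention basics",
        "multi-head self-attention",
        "self-attention cost",
        "tokenizers",
    ]
    return [doc for doc in corpus if query in doc]


TOOLS = {"search_docs": search_docs}  # registry of tools the host is willing to run


def handle_model_output(output, context):
    """Run a tool call if the model asked for one; otherwise return the text."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return output  # plain language: answer directly
    if not isinstance(call, dict) or "tool" not in call:
        return output
    result = TOOLS[call["tool"]](**call["arguments"])
    # Feed the result back so the next generation step can read it.
    context.append({"role": "tool", "name": call["tool"], "content": result})
    return None


context = []
reply = handle_model_output(
    '{"tool": "search_docs", "arguments": {"query": "self-attention"}}', context
)
print(len(context[0]["content"]), "results")
```

Note that the model only emits the request; the host decides whether and how to execute it.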
B4

Review & answer

render

For multi-step work, a planner → worker → reviewer loop iterates, and a human approves anything consequential before it ships. Then the finished answer streams to your screen, token by token.

delivered » “Attention lets every word look at the others and decide which ones matter.”

The Agentic & Feedback Loop  // every answer can loop, and every preference trains the next model

A model is not one-and-done. Within a task it can loop until the work is right; across millions of tasks, your reactions quietly shape the next version. This is the engine behind plan → act → review and the preference → training ladder.

L1
Plan → act → review

Agents break the task down, do a step, check the result, and loop until it holds.

L2
Human gate

A person approves anything consequential before it ships — the loop's safety valve.

L3
Stream to screen

The answer arrives live, token by token, instead of all at once.

L4
Preference → RLHF

Your thumbs-up, edits, and retries can be collected as preference data, which RLHF uses to tune the next model toward the answers people prefer.
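
The first two rungs, plan → act → review and the human gate, can be sketched as a small orchestration loop. Everything here is a stand-in: `worker`, `reviewer`, and `approve` are hypothetical callables, and in this sketch the gate sits before a consequential step runs, which is one possible placement.

```python
def run_task(steps, worker, reviewer, approve, max_attempts=3):
    """Work through a plan step by step, retrying until the reviewer accepts."""
    log = []
    for step in steps:  # the plan: an ordered list of steps
        # Human gate: consequential steps need explicit approval.
        if step["consequential"] and not approve(step):
            log.append((step["name"], "blocked"))
            continue
        for attempt in range(1, max_attempts + 1):
            result = worker(step, attempt)       # act
            if reviewer(result):                 # review
                log.append((step["name"], f"done after {attempt} attempt(s)"))
                break
        else:
            # The loop ran out of attempts without a passing review.
            log.append((step["name"], "failed"))
    return log


steps = [
    {"name": "draft answer", "consequential": False},
    {"name": "delete old docs", "consequential": True},
]
log = run_task(
    steps,
    worker=lambda step, attempt: {"ok": attempt >= 2},  # toy worker: succeeds on retry
    reviewer=lambda result: result["ok"],
    approve=lambda step: False,                          # toy human: declines
)
print(log)
```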

The Four-Layer Stack · what runs underneath
LLM: the model · transformer weights
RAG: retrieval-augmented grounding
MCP: tool-calling protocol
Agents: plan · act · review orchestration
Vector DB: semantic retrieval store
Context Window: the finite input budget
Algorithms in play
BPE Tokenizer: text → subword tokens
Embeddings: tokens → vectors
Self-Attention: context, computed
Next-Token Prediction: autoregressive generation
ANN Search: nearest-vector retrieval
KV-Cache: fast repeated decoding