Bloom AI

Glossary

The AI vocabulary, in plain English.

Every term you actually need to follow a frontier model launch, an agent demo, or a policy fight — defined without marketing fluff and without a PhD-level prerequisite. Each entry has its own page.

By Aaron Whitfield · Updated May 24, 2026 · 79 terms

LLM (Large Language Model): An LLM is a neural network trained on a huge corpus of text to predict the next token. Modern LLMs like GPT-4, Claude, and Gemini are general-purpose enough to write code, summarize documents, answer questions, and follow multi-step instructions. Read more →
Foundation Model: A foundation model is a model trained on broad data at scale that can be adapted to many downstream tasks. The category includes LLMs, image models like Stable Diffusion and Midjourney, and multimodal systems like GPT-4o and Gemini. Read more →
Multimodal Model: A multimodal model is a model that natively processes more than one modality — text plus images, audio, or video — without bolting separate models together. The frontier is increasingly multimodal by default. Read more →
Token: A token is the chunk of text a model actually sees: roughly three-quarters of a word in English. Pricing, context windows, and throughput are all measured in tokens, not characters or words. Read more →
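As a rough illustration of the three-quarters-of-a-word rule, here is a minimal sketch that estimates token counts and cost from a word count. It is not a real tokenizer, and the function names and the price of $3 per million tokens are made-up assumptions for the example.

```python
# Rule of thumb from the entry above: one token is about 0.75 English words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def estimate_cost(text: str, usd_per_million_tokens: float) -> float:
    """Approximate input cost; the price is a hypothetical parameter."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000


sample = "word " * 750  # a 750-word document
print(estimate_tokens(sample))     # 750 / 0.75 = 1000 tokens
print(estimate_cost(sample, 3.0))  # cost at a hypothetical $3 per million tokens
```

For billing or context-window checks, a provider's own tokenizer gives exact counts; this estimate is only good for back-of-the-envelope planning.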
Context Window: A context window is the maximum number of tokens a model can consider in a single request — both input and output. Larger context windows (1M+ tokens) enable longer documents and more in-context examples, but cost and latency scale with size. Read more →
Inference: Inference is the act of running a trained model to generate outputs, as opposed to training it. Inference cost dominates the unit economics of most AI products at scale. Read more →
Training: Training is the process of updating model weights against a loss function on a dataset. Pretraining is the expensive bulk; fine-tuning and post-training are cheaper passes that shape behavior. Read more →
Fine-Tuning: Fine-tuning is continuing to train a base model on a smaller, task-specific dataset to specialize it. It is often used for tone, format, or domain knowledge — though prompting and RAG handle most cases more cheaply. Read more →
RLHF (Reinforcement Learning from Human Feedback): RLHF is a post-training technique where humans rank model outputs and the model is updated to prefer the higher-ranked answers. It is largely why ChatGPT-style assistants feel usable. Read more →
RAG (Retrieval-Augmented Generation): RAG is an architecture that retrieves relevant documents from a knowledge base and injects them into the prompt so the model can answer over fresh or proprietary data. It is the default architecture for chatbots over private data. Read more →
Embedding: An embedding is a dense vector representation of text (or images, audio) that captures semantic similarity. Embeddings are the substrate of every RAG pipeline and semantic search system. Read more →
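To make "captures semantic similarity" concrete, here is a small sketch using cosine similarity, a common way to compare embeddings. The three-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (not output from any real model).
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def most_similar(query, table):
    """Return the other key whose vector is closest to the query's vector."""
    others = (key for key in table if key != query)
    return max(others, key=lambda key: cosine_similarity(table[query], table[key]))


print(most_similar("cat", vectors))  # kitten
```

Semantic search is the same idea at scale: embed the query, then find the stored vectors with the highest similarity.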
MoE (Mixture of Experts): MoE is an architecture in which only a subset of model parameters activates per token, allowing models with hundreds of billions of total parameters to run at roughly the per-token compute cost of much smaller dense models, although all the weights must still be held in memory. Mixtral and DeepSeek-V3 are well-known examples. Read more →
Agent: An agent is a system that uses an LLM to plan and execute multi-step tasks against tools (browsers, code interpreters, APIs). Agents are easy to demo and hard to ship reliably — the gap is most of the work. Read more →
Tool Use / Function Calling: Tool use is a model's ability to emit structured calls to external functions (APIs, databases, code) and incorporate the results. It is the foundation under every useful agent. Read more →
Hallucination: A hallucination is when a model generates content that sounds plausible but is factually wrong. Hallucinations are mitigated by RAG, grounding, citations, and constraints — never eliminated. Read more →
Eval: An eval is a test that measures whether a model or AI product does what you need. Public benchmarks (MMLU, SWE-bench, GPQA) are sport; product evals are what actually matter for shipping. Read more →
Quantization: Quantization is the practice of compressing a model's weights to lower precision (8-bit, 4-bit, even lower) so it fits on smaller hardware with minimal quality loss. It is the reason capable models now run on a laptop. Read more →
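One simple scheme, shown here as a sketch, is symmetric 8-bit quantization: store each weight as a small integer plus one shared scale factor. Production methods are more elaborate (per-group scales, 4-bit formats), and the function names below are illustrative assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-12)
```

The integers take one byte each instead of two or four, which is where the memory saving comes from; the rounding error is the quality cost.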
Distillation: Distillation is training a smaller model to imitate a larger one's outputs, capturing most of the quality at a fraction of the inference cost. Read more →
Diffusion Model: A diffusion model is the class of model behind image and video generation (Stable Diffusion, Midjourney, Sora). Diffusion models generate outputs by iteratively denoising random noise toward a target. Read more →
Frontier Model: A frontier model is one of the most capable models from the leading labs at a given moment. The frontier moves every few months and dictates what's possible at the application layer. Read more →
Open Weights: Open weights means a model's trained parameters are published so anyone can run, fine-tune, or build on them. Llama, Mistral, DeepSeek, and Qwen are the recurring examples. It is distinct from 'open source,' which would also require training code and data. Read more →
Scaling Laws: Scaling laws are empirical relationships showing how model quality improves as a predictable function of parameters, data, and compute. They are the reason labs keep spending more on training runs. Read more →
Alignment: Alignment is the broad problem of getting AI systems to do what humans actually want, including under distribution shift, adversarial inputs, and increasing capability. It encompasses safety, robustness, and value loading. Read more →
Jailbreak: A jailbreak is a prompt or technique that bypasses a model's safety training to elicit content it was trained to refuse. It is an active cat-and-mouse area between users, attackers, and labs. Read more →
Pretraining: Pretraining is the expensive base phase of training a foundation model — predicting tokens over a massive web-scale corpus. A frontier pretraining run in 2026 burns tens of thousands of GPUs for months and costs hundreds of millions of dollars. Read more →
Post-Training: Post-training is everything done to a base model after pretraining to shape behavior: supervised fine-tuning (SFT), RLHF, RLAIF, DPO, constitutional AI, tool-use training, and safety tuning. Post-training is where a raw model becomes a usable assistant. Read more →
SFT (Supervised Fine-Tuning): SFT is training a model on curated input/output pairs that demonstrate the desired behavior. It is the first post-training step before RLHF or preference optimization. Read more →
DPO (Direct Preference Optimization): DPO is a simpler, more stable alternative to RLHF that optimizes directly on preference pairs without training a separate reward model. It is now the default at many labs. Read more →
RLAIF (Reinforcement Learning from AI Feedback): RLAIF is using a strong model to rank outputs in place of human labelers. It is cheaper and faster than RLHF, and the core idea behind Anthropic's constitutional AI. Read more →
Constitutional AI: Constitutional AI is Anthropic's post-training recipe where a model critiques and revises its own outputs against a written set of principles — the 'constitution' — before preference learning. Read more →
Chain of Thought (CoT): Chain of thought is prompting (or training) a model to write intermediate reasoning steps before answering. The simple version is 'think step by step'; the trained version is what powers o-series and Claude thinking modes. Read more →
Reasoning Model: A reasoning model is a model post-trained to spend variable amounts of inference compute on internal deliberation before producing an answer. OpenAI's o-series, Claude with extended thinking, DeepSeek-R1, and Gemini Thinking are examples. Read more →
Test-Time Compute: Test-time compute refers to inference-time techniques that trade more compute per query for higher quality: longer chains of thought, self-consistency, best-of-N sampling, tree search. It is the basis of the reasoning-model paradigm. Read more →
Speculative Decoding: Speculative decoding is using a small draft model to propose several tokens and a large model to verify them in parallel, so each forward pass of the large model can yield multiple tokens. It is a standard inference optimization that can speed up generation 2–3x. Read more →
KV Cache: The KV cache is the cached key/value tensors from prior tokens that let a transformer generate the next token without recomputing attention over the whole sequence. It dominates GPU memory at long context. Read more →
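The memory claim can be checked with arithmetic. The sketch below uses the standard sizing for a transformer KV cache: two tensors (keys and values) per layer, per KV head, per token. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for the example, not any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    keys_and_values = 2  # one K tensor and one V tensor
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model shape, assumed for illustration.
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=1)
long_context = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              seq_len=131_072)

print(per_token / 1024)      # 128.0 KiB per token
print(long_context / 2**30)  # 16.0 GiB for a 131,072-token sequence
```

Because the total grows linearly with sequence length and batch size, a single long-context request can take more memory than the cache for thousands of short ones.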
Prompt Caching: Prompt caching is provider-side caching of a long prefix (system prompt, retrieved docs, few-shots) so repeated requests pay full price only for the suffix. It can cut input-token cost by 50–90% on chat workloads with stable prefixes. Read more →
Function Calling: Function calling is a model API where the model emits a structured JSON call to a developer-defined function instead of free text. It is the substrate for tool use and agents. Read more →
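A minimal sketch of the developer's side of function calling: parse the model's JSON and route it to a registered function. The JSON shape with "name" and "arguments" keys, the `get_weather` function, and the `TOOLS` registry are all assumptions for illustration; each provider defines its own exact format.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical developer-defined function the model is allowed to call."""
    return f"Sunny in {city}"


# Registry of callable tools, keyed by the name the model will emit.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse a structured call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    func = TOOLS[call["name"]]  # raises KeyError for an unknown tool name
    return func(**call["arguments"])


# Stand-in for what a model might emit instead of free text.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = dispatch(model_output)
print(result)  # Sunny in Oslo
```

In a real loop the result is sent back to the model as a new message so it can use the answer in its reply.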
Structured Outputs: Structured outputs are API-enforced constraints on model output (a JSON schema, regex, or grammar) that guarantee valid structure. They drop parser-error rates by orders of magnitude versus prompting alone. Read more →
MCP (Model Context Protocol): MCP is Anthropic's open standard for connecting LLM apps to tools and data sources, often described as a USB-C for models. It was adopted across IDEs and agent products in 2025. Read more →
Vector Database: A vector database is a database optimized for nearest-neighbor search over embeddings. Pinecone, Weaviate, Qdrant, Milvus, pgvector, Turbopuffer, and LanceDB are the common choices. Read more →
Reranker: A reranker is a second-stage model that re-scores the top-k results from a retrieval system to push the truly relevant documents to the top. Cohere, Voyage, and BGE rerankers are the common picks. Read more →
Hybrid Search: Hybrid search is combining lexical (BM25, keyword) and semantic (embedding) search. It almost always beats either alone on real production corpora. Read more →
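One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. The entry above does not name a fusion method, so treat RRF as one option among several; the document IDs are made up and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword (BM25) search and an embedding search.
lexical = ["doc_a", "doc_b", "doc_c"]
semantic = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists rise to the top, and because RRF uses only ranks it avoids having to reconcile BM25 scores with cosine similarities.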
Chunking: Chunking is splitting source documents into retrieval-sized pieces. It is the unglamorous decision that quietly decides whether a RAG product is good or bad. Read more →
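The simplest version of this decision is a fixed-size window with overlap, sketched below. Splitting on whitespace and the default sizes are assumptions for the example; real pipelines often split on headings, sentences, or token counts instead.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word windows; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


document = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(document, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.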
Grounding: Grounding is constraining model output to cited source material so claims can be verified. It is the honest alternative to hoping the model doesn't hallucinate. Read more →
Guardrails: Guardrails are runtime checks on model inputs and outputs — toxicity, PII, jailbreak attempts, off-topic queries, schema violations. NeMo Guardrails, Guardrails AI, and homegrown systems are common. Read more →
Prompt Injection: Prompt injection is an attack where malicious instructions hidden in user input, retrieved content, or tool output hijack the model. It is OWASP LLM01 and the top production AI security risk. Read more →
Indirect Prompt Injection: Indirect prompt injection is prompt injection delivered through content the model fetches (a webpage, an email, a PDF) rather than typed by the user. It is much harder to defend against than the direct kind. Read more →
Red-Teaming: Red-teaming is adversarially testing a model or AI system to surface failure modes, jailbreaks, dangerous capabilities, and policy violations before launch. Read more →
AI Eval Harness: An AI eval harness is software that runs a model against a fixed set of test cases and scores the outputs. Inspect, Promptfoo, OpenAI Evals, lm-evaluation-harness, and homegrown harnesses are the common stacks. Read more →
MMLU: MMLU (Massive Multitask Language Understanding) is a 57-subject multiple-choice benchmark that became the default LLM leaderboard. It was largely saturated by 2024. Read more →
GPQA: GPQA (Graduate-level Google-Proof Q&A) is a harder reasoning benchmark designed to resist memorization, often used as the new MMLU. Read more →
SWE-bench: SWE-bench is a benchmark of real GitHub issues that the model must resolve by editing a real repository. SWE-bench Verified is the trusted subset; progress here is the headline metric for coding agents. Read more →
ARC-AGI: ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of visual puzzles designed by François Chollet to measure generalization. The frontier moved on it sharply in late 2024 with reasoning models. Read more →
HumanEval: HumanEval is an older Python coding benchmark that is now mostly saturated. It is useful as a sanity check but no longer a frontier signal. Read more →
MMMU: MMMU is a multimodal reasoning benchmark across college-level subjects with text + diagrams. It is the default vision-language leaderboard. Read more →
LMSYS Chatbot Arena: LMSYS Chatbot Arena is a live Elo leaderboard built from anonymous human pairwise preferences across chat models. It is the closest thing to a vibe benchmark with real signal. Read more →
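For a sense of how pairwise votes become a ranking, here is the textbook Elo update as a sketch. The leaderboard's actual statistical method may differ from this; the starting ratings and `k=32` are assumptions for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, a_won, k=32):
    """Return both ratings after one head-to-head vote."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; a human prefers model A's answer.
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

An upset moves ratings more than an expected result does, which is why a new model that beats established ones climbs quickly.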
Long-Context Eval: A long-context eval is a benchmark like needle-in-a-haystack, RULER, or BABILong that probes whether a model actually uses its advertised context window or degrades in the middle. Read more →
Vision-Language Model (VLM): A VLM is a model that natively accepts images and text in the same input. GPT-4o, Claude 3.5+, Gemini, Qwen-VL, and Llama Vision are the current crop. Read more →
Diffusion Transformer (DiT): A diffusion transformer is the transformer-based architecture that replaced U-Nets in modern image and video diffusion models — Stable Diffusion 3, Flux, Sora, Veo. Read more →
Text-to-Video: Text-to-video is the task of generating coherent video from a text prompt — Sora, Veo, Runway Gen-3, Kling, Pika, Hailuo, LTX. Quality crossed the production-usable threshold for short clips in 2024–2025. Read more →
Speech-to-Speech: Speech-to-speech models are end-to-end voice models that take audio in and emit audio out without a text bottleneck. OpenAI Realtime, GPT-4o voice, Hume EVI, and a growing open-weight crop (such as Moshi) are the examples. Read more →
Diarization: Diarization is identifying who-spoke-when in an audio stream — a key step in any voice agent pipeline that handles multi-speaker input. Read more →
Vision-Language-Action Model (VLA): A VLA is a model that takes pixels and instructions and outputs robot actions. RT-2, Physical Intelligence π0, and OpenVLA are the foundation-model bet for robotics. Read more →
Compute (FLOPs): Compute, measured in FLOPs (floating-point operations), is the standard unit for measuring training and inference cost. A frontier pretraining run in 2026 is roughly 1e26 FLOPs. Read more →
GPU-Hour: A GPU-hour is one hour of use of one GPU, the practical unit in which cloud bills are denominated. Training a frontier model takes millions of H100-hours. Read more →
Inference Optimization: Inference optimization refers to speed and cost tricks at serving time: speculative decoding, paged attention, continuous batching, quantization, FlashAttention, KV cache offload, prefix caching. Read more →
vLLM: vLLM is an open-source high-throughput LLM inference engine (paged attention, continuous batching). It is the default self-hosted serving stack for open-weight models. Read more →
FlashAttention: FlashAttention is an exact attention algorithm that exploits GPU memory hierarchy to run faster and use less memory. It is standard in every modern training and inference stack. Read more →
LoRA (Low-Rank Adaptation): LoRA is a fine-tuning method that trains small low-rank update matrices on top of frozen base weights. It is cheap to train, cheap to store, and easy to swap. Read more →
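The "cheap to train, cheap to store" claim is easy to see by counting parameters. LoRA replaces an update to a full d_out × d_in weight matrix with two thin matrices of rank r. The layer size (4096 × 4096) and rank (8) below are hypothetical values chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters: full matrix versus low-rank B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer size and rank, assumed for this example.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full)          # 16777216 parameters in the frozen weight matrix
print(lora)          # 65536 trainable parameters in the adapter
print(full // lora)  # 256 times fewer parameters to train and store
```

Because the base weights stay frozen, many adapters can share one copy of the base model and be swapped per task.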
QLoRA: QLoRA is LoRA on top of a quantized (typically 4-bit) base model. The original paper fine-tuned a 65B model on a single 48 GB GPU, and smaller models fit on consumer cards. It is the reason hobbyists can fine-tune at all. Read more →
Synthetic Data: Synthetic data is training data generated by another model (or pipeline). It is now a dominant fraction of post-training data at every frontier lab. Read more →
Data Mixture: Data mixture is the recipe of which datasets, in what proportions, go into pretraining. It is a closely guarded secret at every frontier lab — arguably more important than architecture. Read more →
Catastrophic Forgetting: Catastrophic forgetting is when fine-tuning on a narrow task degrades the model's general capability. It is the reason naive fine-tuning often makes models worse, not better. Read more →
Model Card: A model card is a standardized document describing a model's intended use, training data, evals, limitations, and risks. The EU AI Act requires comparable technical documentation for high-risk systems. Read more →
Frontier Model Forum: The Frontier Model Forum is an industry body of leading AI labs (Anthropic, Google, Microsoft, OpenAI, Meta, Amazon) coordinating on safety research and policy engagement. Read more →
Responsible Scaling Policy (RSP): An RSP is a lab's published commitment to capability thresholds (AI Safety Levels at Anthropic, the Preparedness Framework at OpenAI) that trigger additional safety measures. Read more →
AI Safety Institute (AISI): An AISI is a government-funded safety evaluator — US AISI, UK AISI, Japan AISI, Singapore — that tests frontier models pre-deployment under voluntary lab agreements. Read more →
Sparse Autoencoder (SAE): A sparse autoencoder is an interpretability tool that decomposes model activations into a large dictionary of monosemantic features. It is the core of modern mechanistic interpretability work at Anthropic and DeepMind. Read more →
Mechanistic Interpretability: Mechanistic interpretability is reverse-engineering the internal circuits of a neural network into human-understandable algorithms. It is the most ambitious thread in technical safety research. Read more →

Frequently asked questions

What is an LLM?: An LLM (Large Language Model) is a neural network trained on a huge corpus of text to predict the next token. Modern LLMs like GPT, Claude, and Gemini are general-purpose enough to write code, summarize documents, answer questions, and follow multi-step instructions.
What is a token in AI?: A token is the chunk of text a language model actually sees — roughly three-quarters of a word in English. Pricing, context windows, and throughput are all measured in tokens rather than characters or words.
What is RAG (Retrieval-Augmented Generation)?: RAG is an architecture where relevant documents are retrieved from a knowledge base and injected into the prompt so the model can answer over fresh or proprietary data it was not trained on. It is the default pattern for chatbots over private data.
What is the difference between fine-tuning and RAG?: Fine-tuning continues training a base model on a small task-specific dataset to change its behavior; RAG keeps the model unchanged and gives it relevant documents at query time. Fine-tuning is best for tone, format, or stable domain knowledge; RAG is best for fresh facts and traceable citations.
What is a reasoning model?: A reasoning model is a language model post-trained to spend variable amounts of inference compute on internal deliberation before producing an answer. OpenAI's o-series, Claude with extended thinking, DeepSeek-R1, and Gemini Thinking are reasoning models.
What is prompt injection?: Prompt injection is an attack where malicious instructions hidden in user input, retrieved content, or tool output hijack a model into ignoring its system prompt. It is OWASP LLM01 and currently the top production AI security risk.
What does 'open weights' mean?: Open weights means a model's trained parameters are published so anyone can download, run, fine-tune, or build on them. Llama, Mistral, DeepSeek, and Qwen are open-weight families. It is distinct from 'open source,' which would also require training code and data.