Glossary

Brief definitions for technical terms that recur across the site. Each one is meant to be friendly before it is exhaustive.

Little world: A bounded form that lets the mind see a larger order, such as a syllable, token, equation, diagram, parable, model, or controlled example.
Microcosm: A small world that helps disclose a larger one. It abstracts away complexity so pattern can be examined.
Macrocosm: The larger order that a microcosm helps us approach: language, creation, culture, reason, mathematics, relation, and the Logos.
Token: A small unit of text used by a language model. A token may be a word, part of a word, punctuation mark, or other text fragment.
Probability distribution: A mathematical assignment of likelihoods across possible outcomes, with the probabilities summing to 1. For LLMs, it usually means the probabilities assigned to each possible next token.
Embedding: A learned numerical representation of a word, token, sentence, or concept, usually stored as a vector so that items with similar meanings tend to sit close together.
Vector: A list of numbers that can represent a point or direction in space. In LLMs, vectors often represent learned linguistic features.
Matrix: A rectangular grid of numbers. Matrix multiplication is one of the core operations behind neural networks.
Parameter: A number inside the model that is adjusted during training.
Loss function: A mathematical way of measuring how wrong the model's prediction was.
Gradient: A mathematical direction showing how the loss changes as each parameter changes. It points toward increasing error, so moving parameters the opposite way reduces error.
Gradient descent: An optimization method that improves a model by repeatedly moving in the direction that reduces loss.
Attention: A mechanism that lets a model weigh which tokens matter for interpreting other tokens.
Transformer: A neural network architecture built around attention mechanisms, and the foundation of many modern LLMs.
Scaling law: An empirical mathematical relationship, often a power law, describing how model performance changes with model size, data size, and compute.
Scale: The movement from many small operations into larger powers. In LLMs, scale links tokens and probabilities to broad linguistic behavior.
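
A small worked example may help tie several of these terms together. The sketch below is a toy, not how a real LLM is built: it assumes a made-up three-token vocabulary (VOCAB) and a "model" whose only parameters are one score per token. The helper names (softmax, loss, gradient, train) are illustrative choices, not part of any library. It shows a probability distribution over next tokens, a loss function, a gradient, and gradient descent in a few lines.

```python
import math

# Hypothetical three-token vocabulary, for illustration only.
VOCAB = ["cat", "dog", "sat"]

def softmax(scores):
    """Turn raw scores into a probability distribution (non-negative, sums to 1)."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(probs, target):
    """Cross-entropy loss: large when the correct token received low probability."""
    return -math.log(probs[target])

def gradient(probs, target):
    """Gradient of the loss with respect to each score: probs minus a one-hot target."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def train(steps=100, learning_rate=0.5):
    """Gradient descent: repeatedly step each parameter against its gradient."""
    params = [0.0, 0.0, 0.0]     # parameters: one adjustable score per token
    target = VOCAB.index("sat")  # assume the correct next token is "sat"
    history = []
    for _ in range(steps):
        probs = softmax(params)
        history.append(loss(probs, target))
        grad = gradient(probs, target)
        params = [w - learning_rate * g for w, g in zip(params, grad)]
    return params, history

params, history = train()
probs = softmax(params)
print({token: round(p, 3) for token, p in zip(VOCAB, probs)})
print(f"loss went from {history[0]:.3f} to {history[-1]:.3f}")
```

The model starts by treating all three tokens as equally likely. Each step nudges the parameters in the direction that lowers the loss, so the probability of "sat" rises while the others shrink.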