-
Causal Language Model
A causal language model is a model trained to predict the next word in a sequence, using only the tokens to its left (previous context). It’s unidirectional.
For example:
Input: “The weather is” → Predict: “sunny”
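A minimal sketch of this next-token prediction using the Hugging Face transformers library and GPT-2 (the exact token predicted depends on the model):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

# Only the last position matters for the next token; the model never sees tokens to its right.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))       # the single most likely continuation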
Popular causal LMs:
- GPT-2 / GPT-3
- Gemma
- LLaMA
- Falcon
Contrast this with masked models (like BERT), which predict missing words in the middle.
-
Chain of Thought Prompting
Definition: Chain-of-thought prompting is a method where the prompt includes intermediate reasoning steps, encouraging the model to “think out loud” and break down complex problems before answering.
LLMs like GPT can often solve simple problems directly. But for multi-step or reasoning-heavy tasks (e.g., math, logic puzzles, or common-sense reasoning), they perform significantly better when prompted to generate their reasoning first before concluding.
Example (with vs. without CoT)
Question:
If there are 3 cars and each car has 4 tires, how many tires are there in total?
Without CoT:
Prompt: “If there are 3 cars and each car has 4 tires, how many tires are there?”
Model Output: “4”
With CoT:
Prompt:
“If there are 3 cars and each car has 4 tires, how many tires are there? Let’s think step by step.”
Model Output:
“There are 3 cars. Each car has 4 tires. So the total number of tires is 3 × 4 = 12.
Answer: 12”
Common CoT Prompts:
- “Let’s think step by step.”
- “First…, then…, so…”
- “Let me reason this out.”
Variants:
- Zero-shot CoT: Add only “Let’s think step by step.” to the prompt.
- Few-shot CoT: Include multiple worked examples with reasoning chains in the prompt.
- Automatic CoT: Generate reasoning steps automatically for many problems at scale.
Chain-of-thought prompting encourages the model to generate intermediate reasoning tokens that later tokens can condition on, something short, direct prompts are less likely to elicit. It mimics how humans approach complex tasks: by breaking them down.
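A minimal sketch of zero-shot CoT in code, assuming a hypothetical call_llm(prompt) helper that wraps whatever model or API you use:
def build_cot_prompt(question: str) -> str:
    # Zero-shot CoT: append a reasoning trigger to the plain question.
    return f"{question} Let's think step by step."

question = "If there are 3 cars and each car has 4 tires, how many tires are there?"
prompt = build_cot_prompt(question)
# answer = call_llm(prompt)  # hypothetical helper; the model now writes its reasoning before the answer
print(prompt)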
-
Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are by computing the cosine of the angle between them. It is widely used in machine learning, especially in text similarity, recommendation systems, and clustering.
Intuition
- If two vectors point in exactly the same direction, their cosine similarity is 1.
- If they are orthogonal (completely different), the similarity is 0.
- If they point in opposite directions, the similarity is -1.
It ignores magnitude, focusing on orientation, which makes it great for comparing text embeddings where length may vary but direction (semantic meaning) matters.
Formula
For two vectors A and B:
\[\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}\]
Where:
- $A \cdot B$ = dot product of vectors A and B
- $\|A\|$ = Euclidean norm (length) of A
- $\|B\|$ = Euclidean norm of B
Example (Python)
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])

similarity = cosine_similarity(A, B)
print(similarity)  # Output: [[0.9746]]
Use Cases
- NLP: comparing sentence or word embeddings
- Recommendation: finding similar users/items
- Clustering: grouping similar vectors
- Document similarity: e.g., search engines
-
Dependent Variable
A dependent variable is the variable being measured or predicted in an experiment or model. Its value depends on changes in one or more independent variables. In machine learning, it is often called the target or output variable, as it is the value the model aims to predict.
-
Embedding
An embedding is a learned representation of data in a lower-dimensional space. It transforms high-dimensional, discrete, or symbolic data (like words, users, or items) into dense, continuous vectors that preserve semantic or structural relationships.
Why use embeddings?
- Reduce dimensionality
- Enable similarity comparison
- Improve learning by preserving structure
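A minimal sketch with PyTorch's nn.Embedding: an embedding table maps integer IDs to dense vectors (the values are random until trained):
import torch
import torch.nn as nn

# Embedding table for a vocabulary of 10,000 items, each mapped to a 64-dimensional vector.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=64)

item_ids = torch.tensor([12, 7, 256])   # e.g., three word or item IDs
vectors = embedding(item_ids)           # dense, continuous representations
print(vectors.shape)                    # torch.Size([3, 64])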
See: Word embedding, matrix embedding
-
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent…
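For intuition, a minimal sketch of one-dimensional gradient descent on f(x) = x², stepping against the gradient with a fixed learning rate:
def f(x):
    return x ** 2        # function to minimize

def grad_f(x):
    return 2 * x         # its gradient

x = 5.0                  # starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * grad_f(x)   # move opposite to the gradient

print(x, f(x))           # x ends up very close to 0, the minimum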
-
Hallucination
The generation of output by a model that is not grounded in the input data or real-world facts.
Types of Hallucinations:
- Factual Hallucination: The model generates information that is factually incorrect, even though it may sound plausible.
Example: Saying “The Eiffel Tower is in Berlin.”
- Faithfulness Hallucination: The model’s output does not accurately reflect, or even contradicts, the input; this is especially common in summarization tasks.
Example: Summarizing a paragraph to include details not present in the original text.
- Mode Collapse or Memorized Hallucination: The model repeats phrases or inserts memorized content that is irrelevant or unrelated.
Why It Happens
- Overgeneralization from training data.
- Poor alignment with source input.
- Incomplete training data or biases.
- Lack of mechanisms for fact-checking or external grounding.
Mitigation Techniques
- Retrieval-augmented generation (RAG).
- Fact-checking pipelines.
- Reinforcement learning from human feedback (RLHF).
- Prompt engineering and input constraints.
-
Independent Variable
An independent variable is a variable that is manipulated or used as input to predict the value of the dependent variable. In machine learning, independent variables are also called features or predictors, and they provide the information used by the model to make predictions.
-
LLM Chaining
LLM chaining is the process of connecting multiple calls to a language model — each with a specific purpose — so that the output of one step becomes the input to the next.
Common Use Case Example:
Task: Generate a well-researched blog post from a user-supplied topic.
Chain:
- Prompt 1: “Summarize the key points about ‘climate change and agriculture’.”
- → Output: High-level bullet points.
- Prompt 2: “Expand each bullet point into a detailed paragraph.”
- → Output: Full article body.
- Prompt 3: “Generate a title and meta description based on this article.”
- → Output: SEO-friendly title + summary.
Each stage builds on the previous one.
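A minimal sketch of this sequential chain, assuming a hypothetical call_llm(prompt) helper around whatever model or API you use:
def call_llm(prompt: str) -> str:
    # Hypothetical stub: replace with a real call to your model or API of choice.
    return f"<model output for: {prompt[:40]}>"

topic = "climate change and agriculture"

bullets = call_llm(f"Summarize the key points about '{topic}'.")
article = call_llm(f"Expand each bullet point into a detailed paragraph:\n{bullets}")
metadata = call_llm(f"Generate a title and meta description based on this article:\n{article}")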
Why Use LLM Chaining?
- Decomposes complex tasks into manageable steps.
- Improves accuracy by isolating reasoning from generation.
- Enables control over different stages (reasoning, formatting, summarizing, etc.).
- Supports modularity — you can reuse steps across tasks.
Variants:
- Sequential Chaining: Step-by-step flow, as described above.
- Conditional Chaining: Path depends on a decision made at runtime.
- Parallel Chaining: Multiple prompts run independently, then merged.
-
Masked Model
Masked models are machine learning models, often used in natural language processing, that predict missing or masked parts of input data during training. For example, in models like BERT, random tokens in a sentence are hidden (masked), and the model learns to predict them from the surrounding context on both sides. This helps the model learn relationships in the data, improving downstream tasks such as classification or question answering.
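As an illustrative sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint), the fill-mask pipeline predicts a hidden token from context on both sides:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context to the left and right of the [MASK] token.
for prediction in fill_mask("The weather is [MASK] today."):
    print(prediction["token_str"], round(prediction["score"], 3))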
-
Matrix Embedding
A matrix embedding refers to stacking multiple embeddings into a matrix form. This is common when dealing with sequences like:
- Sentences (word embeddings stacked into a 2D matrix)
- Paragraphs (sentence embeddings stacked)
- Users/items in recommender systems
Shape example:
If you have a sentence of 10 words and each word embedding is 300-dimensional, the sentence embedding matrix is:
shape = (10, 300)
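A minimal NumPy sketch, stacking ten 300-dimensional word vectors into a sentence matrix (random values stand in for learned embeddings):
import numpy as np

word_vectors = [np.random.rand(300) for _ in range(10)]   # ten 300-dim word embeddings

sentence_matrix = np.stack(word_vectors)
print(sentence_matrix.shape)   # (10, 300)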
-
Quantization
Quantization is a technique to:
- Compress large models by reducing precision (e.g., from float32 → int8).
- Make them run faster, use less memory, and even run on CPU or mobile.
Tools like llama.cpp, ggml, and mlc-llm quantize models to make them run on M1 chips, Raspberry Pi, or Android.
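As an illustrative sketch (not the scheme of any particular tool), affine quantization maps float32 values onto the int8 range using a scale and a zero point:
import numpy as np

weights = np.random.randn(5).astype(np.float32)   # original float32 weights

# Map the observed value range onto the signed int8 range [-128, 127].
scale = (weights.max() - weights.min()) / 255.0
zero_point = np.round(-128 - weights.min() / scale)

q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
dequantized = (q.astype(np.float32) - zero_point) * scale

print(weights)
print(dequantized)   # close to the originals, at a quarter of the memory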
-
Temperature
Temperature is a scalar value (usually between 0 and 2) used during the sampling process from a probability distribution to control the level of randomness in the output.
When a model generates text, it computes a probability distribution over the possible next tokens (words, characters, etc.). The temperature modifies this distribution before sampling:
\[P_i^{(\text{adjusted})} = \frac{\exp\left(\frac{\log P_i}{T}\right)}{\sum_j \exp\left(\frac{\log P_j}{T}\right)}\]
Where:
- $P_i$ is the original probability of token i,
- $T$ is the temperature.
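A minimal NumPy sketch of this adjustment on a toy distribution (the probabilities are made up for illustration):
import numpy as np

def apply_temperature(probs, T):
    # Equivalent to dividing the logits by T before the softmax.
    logits = np.log(probs) / T
    exp = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return exp / exp.sum()

probs = np.array([0.70, 0.20, 0.10])      # original next-token probabilities
print(apply_temperature(probs, 0.2))      # sharper: almost all mass on the top token
print(apply_temperature(probs, 1.0))      # unchanged
print(apply_temperature(probs, 1.8))      # flatter: sampling becomes more random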
Example:
Prompt: “Once upon a time, in a land far away,”
- T = 0.2 → “there lived a wise old king who ruled with kindness and wisdom.”
- T = 1.0 → “a dragon taught poetry to wandering clouds.”
- T = 1.8 → “the moon whispered jellyfish secrets through laser bananas.”
Use Cases:
- Low temperature (0–0.5): Factual answers, programming help, summarization.
- Medium temperature (0.7–1.0): Creative writing, marketing copy, storytelling.
- High temperature (1.2+): Brainstorming ideas, surreal or poetic content.
-
Tensor
In machine learning (ML), a tensor is a generalization of scalars, vectors, and matrices to higher dimensions and is a core data structure used to represent and process data.
Formal Definition:
A tensor is a multidimensional array of numerical values. Its rank (or order) denotes the number of dimensions:
- 0D tensor: Scalar (e.g., 5)
- 1D tensor: Vector (e.g., [1, 2, 3])
- 2D tensor: Matrix (e.g., [[1, 2], [3, 4]])
- 3D+ tensor: Higher-dimensional arrays (e.g., a stack of matrices)
Why Tensors Matter in ML:
- Input/output representation: Data like images (3D: height × width × channels), text sequences (2D: batch × sequence length), and time series are commonly represented as tensors.
- Efficient computation: Libraries like PyTorch and TensorFlow use tensor operations heavily, leveraging GPUs/TPUs for fast computation.
- Backpropagation: Tensors support automatic differentiation, essential for training neural networks.
Example in Code (PyTorch):
import torch

# 2D tensor (matrix)
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(x.shape)  # torch.Size([2, 2])
In summary, a tensor is the fundamental building block for data in machine learning frameworks, offering a consistent and optimized structure for mathematical operations.
-
Token
In Natural Language Processing (NLP), a token is a basic unit of text used for processing and analysis. It typically represents a word, subword, character, or symbol, depending on the tokenization strategy.
Definition:
A token is a meaningful element extracted from raw text during tokenization, the process of breaking text into smaller pieces.
Common Types of Tokens:
| Token Type | Example for "I'm learning NLP!" |
| --- | --- |
| Word | ["I", "'m", "learning", "NLP", "!"] |
| Subword (e.g., BERT) | ["I", "'", "m", "learn", "##ing", "NLP", "!"] |
| Character | ["I", "'", "m", " ", "l", "e", "a", "r", "n", "i", "n", "g", " ", "N", "L", "P", "!"] |
Why Tokens Matter:
- Input to models: NLP models operate on sequences of tokens, not raw text.
- Efficiency: Tokenizing helps standardize and normalize text, aiding in tasks like classification, translation, and summarization.
- Vocabulary mapping: Tokens are converted to numerical IDs using a vocabulary (lookup table), enabling neural models to process them.
Tokenization Example (Python + NLTK):
from nltk.tokenize import word_tokenize

text = "I'm learning NLP!"
tokens = word_tokenize(text)
print(tokens)  # Output: ['I', "'m", 'learning', 'NLP', '!']
Summary:
A token in NLP is a unit of text—often a word or subword—that forms the basis for downstream processing and modeling. Tokenization strategy varies depending on the language and model architecture.
-
Word Embedding
A word embedding is a type of embedding specifically used in Natural Language Processing (NLP). It maps words (or subwords) to real-valued vectors in a continuous vector space, where semantically similar words are close together.
Example word embeddings:
- Word2Vec
- GloVe
- FastText
- BERT (contextual embeddings)
Properties:
- Vectors are typically 50 to 1,024 dimensions
- Similar meanings → similar vectors (cosine similarity)
Example:
word_vectors["king"] - word_vectors["man"] + word_vectors["woman"] ≈ word_vectors["queen"]
See: Cosine Similarity