-
Attention
The attention mechanism is a technique used in machine learning - especially in natural language processing (NLP) and computer vision - that allows models to focus on the most relevant parts of the input data when making decisions or predictions.
The core idea is that, rather than treating all parts of the input equally, attention assigns weights to different elements, indicating their importance for a given task. These weights are learned during training.
In the context of the Transformer architecture (e.g., GPT, BERT), let:
- Q: Query
- K: Key
- V: Value
The Scaled Dot-Product Attention is:
$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$
- $QK^T$: Measures relevance between query and key.
- $\sqrt{d_k}$: Scaling factor (where $d_k$ is the key dimension) that stabilizes gradients.
- $\text{softmax}$: Converts scores to probabilities (attention weights).
- The result is a weighted sum of the values V, emphasizing relevant parts.
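To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention (the shapes and random values are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance between queries and keys
    weights = softmax(scores)        # attention weights, rows sum to 1
    return weights @ V               # weighted sum of the values

# Toy example: 2 queries, 3 key/value pairs, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```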
-
Causal Language Model
A causal language model is a model trained to predict the next word in a sequence, using only the tokens to its left (previous context). It’s unidirectional.
For example:
Input: “The weather is” → Predict: “sunny”
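The "left context only" rule is typically enforced with a causal mask in attention; here is a tiny NumPy sketch (sequence length chosen arbitrarily):

```python
import numpy as np

# Causal (lower-triangular) mask for a sequence of length 5:
# position i may attend only to positions <= i, i.e., its left context.
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
```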
Popular causal LMs:
- GPT-2 / GPT-3
- Gemma
- LLaMA
- Falcon

Contrast this with masked models (like BERT), which predict missing words in the middle.
-
Chain of Thought Prompting
Definition: Chain-of-thought prompting is a method where the prompt includes intermediate reasoning steps, encouraging the model to “think out loud” and break down complex problems before answering.
LLMs like GPT can often solve simple problems directly. But for multi-step or reasoning-heavy tasks (e.g., math, logic puzzles, or common-sense reasoning), they perform significantly better when prompted to generate their reasoning first before concluding.
Example (with vs. without CoT)
Question:
If there are 3 cars and each car has 4 tires, how many tires are there in total?
Without CoT:
Prompt: “If there are 3 cars and each car has 4 tires, how many tires are there?”
Model Output: “4”
With CoT:
Prompt:
“If there are 3 cars and each car has 4 tires, how many tires are there? Let’s think step by step.”
Model Output:
“There are 3 cars. Each car has 4 tires. So the total number of tires is 3 × 4 = 12.
Answer: 12”
Common CoT Prompts:
- “Let’s think step by step.”
- “First…, then…, so…”
- “Let me reason this out.”
Variants:
- Zero-shot CoT: Add only “Let’s think step by step.” to the prompt.
- Few-shot CoT: Include multiple worked examples with reasoning chains in the prompt.
- Automatic CoT: Generate reasoning steps automatically for many problems at scale.
Chain-of-thought prompting helps the model activate latent reasoning paths in its neural structure that are less likely to be triggered by short, direct prompts. It mimics how humans approach complex tasks: by breaking them down.
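As a concrete illustration, the zero-shot and few-shot variants can be built as plain prompt strings (the worked example below is invented for illustration):

```python
# Sketch: building zero-shot and few-shot CoT prompts as plain strings.
question = "If there are 3 cars and each car has 4 tires, how many tires are there?"

# Zero-shot CoT: append a single trigger phrase.
zero_shot = f"{question} Let's think step by step."

# Few-shot CoT: prepend a worked example with its reasoning chain.
worked_example = (
    "Q: If there are 2 boxes and each box has 5 pens, how many pens are there?\n"
    "A: There are 2 boxes. Each box has 5 pens. So 2 x 5 = 10. Answer: 10.\n\n"
)
few_shot = worked_example + f"Q: {question}\nA: Let's think step by step."

print(zero_shot)
print(few_shot)
```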
-
Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are by computing the cosine of the angle between them. It is widely used in machine learning, especially in text similarity, recommendation systems, and clustering.
Intuition
- If two vectors point in exactly the same direction, their cosine similarity is 1.
- If they are orthogonal (completely different), the similarity is 0.
- If they point in opposite directions, the similarity is -1.
It ignores magnitude, focusing on orientation, which makes it great for comparing text embeddings where length may vary but direction (semantic meaning) matters.
Formula
For two vectors A and B:
$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$
Where:
- $A \cdot B$ = dot product of vectors A and B
- $\|A\|$ = Euclidean norm (length) of A
- $\|B\|$ = Euclidean norm of B
Example (Python)
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])

similarity = cosine_similarity(A, B)
print(similarity)  # Output: [[0.9746...]]
```
Use Cases
- NLP: comparing sentence or word embeddings
- Recommendation: finding similar users/items
- Clustering: grouping similar vectors
- Document similarity: e.g., search engines
-
Dependent Variable
A dependent variable is the variable being measured or predicted in an experiment or model. Its value depends on changes in one or more independent variables. In machine learning, it is often called the target or output variable, as it is the value the model aims to predict.
-
Embedding
An embedding is a learned representation of data in a lower-dimensional space. It transforms high-dimensional, discrete, or symbolic data (like words, users, or items) into dense, continuous vectors that preserve semantic or structural relationships.
Why use embeddings?
- Reduce dimensionality
- Enable similarity comparison
- Improve learning by preserving structure
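A minimal PyTorch sketch of an embedding layer (the vocabulary size and dimensionality are illustrative):

```python
import torch
import torch.nn as nn

# An embedding layer mapping a vocabulary of 10,000 discrete
# symbols to dense 64-dimensional vectors.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=64)

ids = torch.tensor([12, 7, 256])  # three symbolic inputs as integer IDs
vectors = embedding(ids)          # dense, trainable representations
print(vectors.shape)              # torch.Size([3, 64])
```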
See: Word Embedding, Matrix Embedding
-
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, i.e., opposite the gradient, with the step size controlled by a learning rate.
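A minimal sketch on a toy one-dimensional function (the function, learning rate, and starting point are arbitrary illustration choices):

```python
# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
def gradient(x):
    return 2 * (x - 3)

x = 0.0
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * gradient(x)  # step opposite the gradient

print(round(x, 4))  # ~3.0, the minimizer of f
```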
-
Hallucination
The generation of output by a model that is not grounded in the input data or real-world facts.
Types of Hallucinations:
- Factual Hallucination: The model generates information that is factually incorrect, even though it may sound plausible.
Example: Saying “The Eiffel Tower is in Berlin.”
- Faithfulness Hallucination: The model’s output does not accurately reflect, or outright contradicts, the input; this is especially common in summarization tasks.
Example: Summarizing a paragraph to include details not present in the original text.
- Mode Collapse or Memorized Hallucination: The model repeats phrases or inserts memorized content that is irrelevant or unrelated.
Why It Happens
- Overgeneralization from training data.
- Poor alignment with source input.
- Incomplete training data or biases.
- Lack of mechanisms for fact-checking or external grounding.
Mitigation Techniques
- Retrieval-augmented generation (RAG).
- Fact-checking pipelines.
- Reinforcement learning from human feedback (RLHF).
- Prompt engineering and input constraints.
-
Hard Prompt
Hard prompts are natural language strings, like those used in prompt engineering. They are human-readable and often manually crafted. Because they are plain text, they are portable and can be reused across models and tasks.
-
Independent Variable
An independent variable is a variable that is manipulated or used as input to predict the value of the dependent variable. In machine learning, independent variables are also called features or predictors, and they provide the information used by the model to make predictions.
-
LLM Chaining
LLM chaining is the process of connecting multiple calls to a language model — each with a specific purpose — so that the output of one step becomes the input to the next.
Common Use Case Example:
Task: Generate a well-researched blog post from a user-supplied topic.
Chain:
- Prompt 1: “Summarize the key points about ‘climate change and agriculture’.”
- → Output: High-level bullet points.
- Prompt 2: “Expand each bullet point into a detailed paragraph.”
- → Output: Full article body.
- Prompt 3: “Generate a title and meta description based on this article.”
- → Output: SEO-friendly title + summary.
Each stage builds on the previous one.
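A minimal sketch of this chain in Python, assuming a hypothetical llm() helper that wraps whatever model API you use:

```python
# Sequential chaining sketch. `llm` is a hypothetical stand-in for
# whatever completion call you use (API client, local model, etc.).
def llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"  # replace with a real call

def blog_post_chain(topic: str) -> dict:
    # Step 1: research summary
    bullets = llm(f"Summarize the key points about '{topic}'.")
    # Step 2: expand the summary into the article body
    body = llm(f"Expand each bullet point into a detailed paragraph:\n{bullets}")
    # Step 3: derive metadata from the finished article
    meta = llm(f"Generate a title and meta description based on this article:\n{body}")
    return {"outline": bullets, "body": body, "meta": meta}

print(blog_post_chain("climate change and agriculture"))
```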
Why Use LLM Chaining?
- Decomposes complex tasks into manageable steps.
- Improves accuracy by isolating reasoning from generation.
- Enables control over different stages (reasoning, formatting, summarizing, etc.).
- Supports modularity — you can reuse steps across tasks.
Variants:
- Sequential Chaining: Step-by-step flow, as described above.
- Conditional Chaining: Path depends on a decision made at runtime.
- Parallel Chaining: Multiple prompts run independently, then merged.
-
Masked Model
Masked models are machine learning models, often used in natural language processing, that predict missing or masked parts of input data during training. For example, in models like BERT, random tokens in a sentence are hidden (masked), and the model learns to predict them based on context. This helps the model understand relationships in data, improving tasks like text generation or classification.
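As a quick illustration, the fill-mask pipeline from Hugging Face transformers queries a masked model directly (the example sentence is arbitrary):

```python
from transformers import pipeline

# The model predicts the token hidden by [MASK] using context from both sides.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Expected top prediction: "capital"
```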
-
Massive Multitask Language Understanding
It is a benchmark in machine learning and natural language processing (NLP) used to evaluate the general language understanding and reasoning ability of large language models (LLMs).
It was introduced in 2021 by Hendrycks et al. to test a model’s performance across a broad set of knowledge-rich and reasoning-heavy tasks. It uses accuracy (percentage of correct answers) to evaluate multiple-choice questions.
MMLU is designed to go beyond simple pattern recognition and test if a model can handle:
- Factual knowledge
- Reasoning ability
- Multitask generalization across topics
It’s widely used to benchmark LLMs like GPT-4, Claude, PaLM, LLaMA, etc.
What’s in the Benchmark?
- 57 diverse tasks
- Divided into 4 main categories:
  - Humanities (e.g., history, law)
  - STEM (e.g., physics, computer science)
  - Social Sciences (e.g., economics, psychology)
  - Other (e.g., professional law, medical exams)
Each task has a training/test/dev split, but the MMLU benchmark evaluates only on the test data, using zero-shot or few-shot prompting.
Example Question (from MMLU - Physics):
What is the unit of electric resistance?
A) Volt
B) Ampere
C) Ohm
D) Watt
Correct Answer: C) Ohm
-
Matrix Embedding
A matrix embedding refers to stacking multiple embeddings into a matrix form. This is common when dealing with sequences like:
- Sentences (word embeddings stacked into a 2D matrix)
- Paragraphs (sentence embeddings stacked)
- Users/items in recommender systems
Shape example:
If you have a sentence of 10 words and each word embedding is 300-dimensional, the sentence embedding matrix is:
shape = (10, 300)
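A minimal NumPy sketch (the embeddings here are random stand-ins for real word vectors):

```python
import numpy as np

# Random stand-ins for 300-dimensional word embeddings of a 10-word
# sentence, stacked row by row into one matrix.
rng = np.random.default_rng(0)
word_embeddings = [rng.standard_normal(300) for _ in range(10)]

sentence_matrix = np.stack(word_embeddings)
print(sentence_matrix.shape)  # (10, 300)
```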
-
Prompt Engineering
Prompt engineering is a manual, human-driven approach to designing effective prompts that elicit the desired output from a pre-trained language model. This method relies on understanding the behavior and limitations of the LLM and crafting input prompts accordingly. The process is akin to writing a clever query or instruction to get the best possible result without changing the underlying model parameters.
For example, consider a sentiment analysis task. A naive prompt might be:
“The movie was okay.”
This may not give you a useful output unless you explicitly instruct the model. A prompt-engineered version would look like:
“Classify the sentiment of the following review as Positive, Negative, or Neutral: ‘The movie was okay.’”
Prompt engineering involves iterations of trial and error, understanding model quirks, and using techniques like few-shot learning (giving examples in the prompt) or zero-shot learning (giving just the instruction) to guide the model.
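As a sketch, a few-shot version of the sentiment prompt can be assembled as a plain string (the labeled examples are invented for illustration):

```python
examples = [
    ("I loved every minute of it.", "Positive"),
    ("A complete waste of time.", "Negative"),
]
review = "The movie was okay."

prompt = "Classify the sentiment of each review as Positive, Negative, or Neutral.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {review}\nSentiment:"
print(prompt)
```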
-
Prompt Tuning
Prompt tuning, in contrast, is a machine-learned, automated approach to crafting prompts. It involves training a small set of parameters (prompt tokens) that are prepended to the input. These tokens are optimized using gradient descent to perform well on a specific downstream task. The base model remains frozen; only the prompt embeddings are updated.
The primary goal of prompt tuning is to adapt a large, pre-trained language model to new tasks without updating the entire model. This method is highly efficient in terms of storage and compute, as it requires updating only a tiny fraction of the parameters.
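A minimal PyTorch sketch of the mechanics, with a frozen embedding layer standing in for the base model (all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Only `soft_prompt` is trainable; the base weights stay frozen.
num_prompt_tokens, d_model, vocab_size = 20, 768, 32_000

token_embedding = nn.Embedding(vocab_size, d_model)
token_embedding.weight.requires_grad_(False)  # base model stays frozen

soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, d_model) * 0.02)

input_ids = torch.randint(0, vocab_size, (1, 16))  # dummy input batch
input_embeds = token_embedding(input_ids)          # (1, 16, d_model)

# Prepend the learned prompt vectors to the input embeddings.
batch_prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
model_input = torch.cat([batch_prompt, input_embeds], dim=1)
print(model_input.shape)  # torch.Size([1, 36, 768])

# The optimizer updates only the prompt parameters.
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
```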
-
Quantization
Quantization is a technique to:
- Compress large models by reducing precision (e.g., from float32 → int8).
- Make them run faster, use less memory, and even run on CPU or mobile.
Tools like llama.cpp, ggml, and mlc-llm quantize models to make them run on M1 chips, Raspberry Pi, or Android.
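A minimal NumPy sketch of symmetric int8 quantization (real tools use more sophisticated schemes, e.g., per-channel scales and calibration):

```python
import numpy as np

# Quantize one weight tensor from float32 to int8 and back.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0              # map max magnitude to int8 range
q_weights = np.round(weights / scale).astype(np.int8)
dequantized = q_weights.astype(np.float32) * scale

print(np.abs(weights - dequantized).max())         # small rounding error
```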
-
Soft Prompt
Soft prompts, on the other hand, are learned vectors. They exist only in the embedding space and do not correspond to actual tokens in the model’s vocabulary. They cannot be directly interpreted by humans. These are used exclusively in prompt tuning and are optimized for task performance, often outperforming hard prompts in accuracy but at the cost of interpretability.
-
Temperature
Temperature is a scalar value (usually between 0 and 2) used during the sampling process from a probability distribution to control the level of randomness in the output.
When a model generates text, it computes a probability distribution over the possible next tokens (words, characters, etc.). The temperature modifies this distribution before sampling:
\[P_i^{(\text{adjusted})} = \frac{\exp\left(\frac{\log P_i}{T}\right)}{\sum_j \exp\left(\frac{\log P_j}{T}\right)}\]
Where:
- $P_i$ is the original probability of token i,
- $T$ is the temperature.
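A small NumPy sketch showing how temperature reshapes a toy distribution (the logits are invented for illustration):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])  # toy next-token scores

def temperature_probs(logits, T):
    scaled = logits / T
    e = np.exp(scaled - scaled.max())    # subtract max for stability
    return e / e.sum()

for T in (0.2, 1.0, 1.8):
    print(T, np.round(temperature_probs(logits, T), 3))
# Low T sharpens the distribution; high T flattens it toward uniform.
```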
Example:
Prompt: “Once upon a time, in a land far away,”
- T = 0.2 → “there lived a wise old king who ruled with kindness and wisdom.”
- T = 1.0 → “a dragon taught poetry to wandering clouds.”
- T = 1.8 → “the moon whispered jellyfish secrets through laser bananas.”
Use Cases:
- Low temperature (0–0.5): Factual answers, programming help, summarization.
- Medium temperature (0.7–1.0): Creative writing, marketing copy, storytelling.
- High temperature (1.2+): Brainstorming ideas, surreal or poetic content.
-
Tensor
In machine learning (ML), a tensor is a generalization of scalars, vectors, and matrices to higher dimensions and is a core data structure used to represent and process data.
Formal Definition:
A tensor is a multidimensional array of numerical values. Its rank (or order) denotes the number of dimensions:
- 0D tensor: Scalar (e.g., 5)
- 1D tensor: Vector (e.g., [1, 2, 3])
- 2D tensor: Matrix (e.g., [[1, 2], [3, 4]])
- 3D+ tensor: Higher-dimensional arrays (e.g., a stack of matrices)
Why Tensors Matter in ML:
- Input/output representation: Data like images (3D: height × width × channels), text sequences (2D: batch × sequence length), and time series are commonly represented as tensors.
- Efficient computation: Libraries like PyTorch and TensorFlow use tensor operations heavily, leveraging GPUs/TPUs for fast computation.
- Backpropagation: Tensors support automatic differentiation, essential for training neural networks.
Example in Code (PyTorch):
```python
import torch

# 2D tensor (matrix)
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(x.shape)  # torch.Size([2, 2])
```
In summary, a tensor is the fundamental building block for data in machine learning frameworks, offering a consistent and optimized structure for mathematical operations.
-
Token
In Natural Language Processing (NLP), a token is a basic unit of text used for processing and analysis. It typically represents a word, subword, character, or symbol, depending on the tokenization strategy.
Definition:
A token is a meaningful element extracted from raw text during tokenization, the process of breaking text into smaller pieces.
Common Types of Tokens:
| Token Type | Example for “I’m learning NLP!” |
| --- | --- |
| Word | [“I”, “’m”, “learning”, “NLP”, “!”] |
| Subword (e.g., BERT) | [“I”, “’”, “m”, “learn”, “##ing”, “NLP”, “!”] |
| Character | [“I”, “’”, “m”, “ ”, “l”, “e”, “a”, “r”, “n”, “i”, “n”, “g”, “ ”, “N”, “L”, “P”, “!”] |
Why Tokens Matter:
- Input to models: NLP models operate on sequences of tokens, not raw text.
- Efficiency: Tokenizing helps standardize and normalize text, aiding in tasks like classification, translation, and summarization.
- Vocabulary mapping: Tokens are converted to numerical IDs using a vocabulary (lookup table), enabling neural models to process them.
Tokenization Example (Python + NLTK):
```python
from nltk.tokenize import word_tokenize

text = "I'm learning NLP!"
tokens = word_tokenize(text)
print(tokens)  # Output: ['I', "'m", 'learning', 'NLP', '!']
```
Summary:
A token in NLP is a unit of text—often a word or subword—that forms the basis for downstream processing and modeling. Tokenization strategy varies depending on the language and model architecture.
-
Word Embedding
A word embedding is a type of embedding specifically used in Natural Language Processing (NLP). It maps words (or subwords) to real-valued vectors in a continuous vector space, where semantically similar words are close together.
Example word embeddings:
- Word2Vec
- GloVe
- FastText
- BERT (contextual embeddings)
Properties:
- Vectors are typically 50 to 1,024 dimensions
- Similar meanings → similar vectors (cosine similarity)
Example:
word_vectors["king"] - word_vectors["man"] + word_vectors["woman"] ≈ word_vectors["queen"]
See: Cosine Similarity