Softmax
Softmax is an activation function used in multi-class classification to convert a vector of real-valued scores (logits) into a probability distribution over classes.
Definition
Given logits $z = (z_1, z_2, \dots, z_K)$,
\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}, \quad i = 1,\dots,K\]
Logits are the raw, unnormalized outputs of a model before an activation function (such as sigmoid or softmax) is applied.
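A minimal NumPy sketch of this definition (the function name `softmax` and the max-subtraction trick for numerical stability are illustrative choices, not part of the formula itself):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map a vector of logits to a probability distribution."""
    # Subtracting the max logit does not change the result
    # (it cancels in the ratio) but prevents overflow in exp.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)
```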
Key properties
- Outputs values in (0, 1)
- Probabilities sum to 1 (checked numerically in the sketch after this list)
- Preserves ordering: larger $z_i$ ⇒ larger probability
- Smooth and differentiable (good for backpropagation)
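A quick numerical check of the first three properties, reusing the `softmax` sketch above on some arbitrary logits:

```python
z = np.array([3.0, -1.0, 0.5, 2.0])
p = softmax(z)

assert np.all((p > 0) & (p < 1))                      # outputs lie in (0, 1)
assert np.isclose(p.sum(), 1.0)                       # probabilities sum to 1
assert np.array_equal(np.argsort(z), np.argsort(p))   # ordering preserved
```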
Intuition
- Each $z_i$ is a score for class $i$
- Exponentiation emphasizes larger scores
- Normalization forces competition between classes
- Produces a probability-like output
Example: \(z = (2, 1, 0) \;\Rightarrow\; \text{softmax}(z) \approx (0.67,\;0.24,\;0.09)\)
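Reproducing this example with the `softmax` sketch from the Definition section:

```python
z = np.array([2.0, 1.0, 0.0])
print(np.round(softmax(z), 2))   # [0.67 0.24 0.09]
```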
Why softmax is used
- Enables multi-class classification
- Works naturally with cross-entropy loss (see the sketch after this list)
- Output can be interpreted as class probabilities
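A hedged sketch of the softmax/cross-entropy pairing: the loss is the negative log-probability of the true class. The helper name `cross_entropy` and the log-sum-exp formulation are illustrative choices, not a specific library API.

```python
def cross_entropy(z: np.ndarray, true_class: int) -> float:
    """Negative log-probability of the true class under softmax(z)."""
    # log softmax(z)_i = z_i - logsumexp(z); the max-shift keeps exp stable.
    m = np.max(z)
    log_sum_exp = m + np.log(np.sum(np.exp(z - m)))
    return float(log_sum_exp - z[true_class])
```

For example, with $z = (2, 1, 0)$ and true class 0, this gives $-\log(0.67) \approx 0.41$.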
Decision rule
\(\hat{y} = \arg\max_i \text{softmax}(z_i)\) (Note: this is equivalent to $\arg\max_i z_i$.)
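Because softmax preserves ordering, predicting from the logits or from the probabilities selects the same class; a small check using the `softmax` sketch above:

```python
z = np.array([2.0, 1.0, 0.0])
assert np.argmax(softmax(z)) == np.argmax(z)   # both pick class 0
```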
Softmax vs Sigmoid
| Sigmoid | Softmax |
|---|---|
| Binary classification | Multi-class classification |
| Single output | Multiple outputs |
| Independent probabilities | Competing probabilities |
| $\sigma(z)\in(0,1)$ | $\sum_i p_i = 1$ |
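A sketch contrasting the two on the same logits (the `sigmoid` helper is illustrative): sigmoid scores each class independently, so the outputs need not sum to 1, while softmax forces the classes to compete for probability mass.

```python
def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, 1.0, 0.0])
print(sigmoid(z))   # ~[0.88 0.73 0.50] -- independent scores, sum ~2.11
print(softmax(z))   # ~[0.67 0.24 0.09] -- competing probabilities, sum 1
```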