Neural Networks
Neural networks learn representations of data through stacked layers of transformations. They achieve state-of-the-art performance on unstructured data (images, text, audio) and learn features automatically, without manual feature engineering.
Architecture
A neural network consists of layers: Input Layer -> Hidden Layers -> Output Layer
Each neuron computes a linear combination followed by a non-linear activation: z = w*x + b, then a = activation(z)
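As a concrete illustration, a minimal NumPy sketch of one dense layer (the shapes and values are made up for the example):

```python
import numpy as np

# One dense layer with 3 inputs and 2 neurons (sizes chosen for illustration).
x = np.array([0.5, -1.0, 2.0])          # input vector
W = np.array([[0.1, 0.2, -0.3],
              [0.4, -0.5, 0.6]])        # one weight row per neuron
b = np.array([0.01, -0.02])             # one bias per neuron

z = W @ x + b                           # linear combination: z = w*x + b
a = np.maximum(0.0, z)                  # non-linear activation (ReLU): a = activation(z)
print(a)
```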
Activation Functions
| Function | Formula | Use Case |
|---|---|---|
| ReLU | max(0, x) | Hidden layers (default) |
| Sigmoid | 1/(1+e^-x) | Binary output |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Hidden layers |
| Softmax | e^(x_i) / Sum(e^(x_j)) | Multi-class output |
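The four activations above in NumPy (a sketch; a numerically stable softmax subtracts the maximum before exponentiating):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # 1 / (1 + e^-x)

def tanh(x):
    return np.tanh(x)                    # (e^x - e^-x) / (e^x + e^-x)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract max for numerical stability
    return e / e.sum()                   # e^(x_i) / Sum(e^(x_j))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores), softmax(scores).sum())   # probabilities summing to 1
```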
Training: Backpropagation
| Step | Description |
|---|---|
| Forward pass | Compute predictions through network |
| Compute loss | Compare predictions to targets |
| Backward pass | Compute gradients via chain rule |
| Update weights | w = w - learning_rate * gradient |
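A worked one-neuron example of the four steps (NumPy-free sketch; values are arbitrary). For a single linear neuron with squared-error loss L = (w*x + b - y)^2, the chain rule gives dL/dw = 2*(w*x + b - y)*x and dL/db = 2*(w*x + b - y):

```python
w, b = 0.5, 0.0          # initial parameters (arbitrary)
x, y = 2.0, 3.0          # one training example (arbitrary)
lr = 0.1                 # learning rate

# Forward pass: compute the prediction.
pred = w * x + b

# Compute loss: squared error against the target.
loss = (pred - y) ** 2

# Backward pass: gradients via the chain rule.
grad_w = 2 * (pred - y) * x
grad_b = 2 * (pred - y)

# Update weights: w = w - learning_rate * gradient.
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)        # loss = 4.0, then w = 1.3, b = 0.4
```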
Training step sequence:
- Zero out accumulated gradients from previous iteration
- Forward pass: compute model output for the input
- Compute loss by comparing output to target
- Backward pass: compute gradients via backpropagation
- Update weights using the optimizer
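In PyTorch, this sequence maps onto a loop like the following sketch (the model, loss function, optimizer, and data here are hypothetical placeholders chosen for the example):

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small regression model, MSE loss, and SGD.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 10)    # dummy batch
targets = torch.randn(64, 1)

for epoch in range(5):
    optimizer.zero_grad()             # zero out accumulated gradients
    outputs = model(inputs)           # forward pass: compute model output
    loss = loss_fn(outputs, targets)  # compute loss against the targets
    loss.backward()                   # backward pass: backpropagation
    optimizer.step()                  # update weights using the optimizer
```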
Regularization Techniques
| Technique | Description |
|---|---|
| Dropout | Randomly zero neurons during training |
| Batch Normalization | Normalize layer inputs |
| L2 Regularization | Weight decay penalty |
| Early Stopping | Stop when validation loss increases |
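A sketch of how the first three techniques typically appear in PyTorch (layer sizes are arbitrary); early stopping is usually a loop-level check on validation loss rather than a layer:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # batch normalization: normalize layer inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout: randomly zero neurons during training
    nn.Linear(64, 1),
)

# L2 regularization via the optimizer's weight_decay term.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```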
Common Architectures
CNN (Convolutional Neural Networks)
For images: Convolution -> Pooling -> Fully Connected
Captures local spatial patterns through learned filters.
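A minimal CNN sketch in PyTorch following the Convolution -> Pooling -> Fully Connected pattern (the input size and channel counts are assumptions for the example, e.g. 28x28 grayscale images and 10 classes):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learned local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected classifier
)

logits = cnn(torch.randn(8, 1, 28, 28))          # dummy batch of 8 images
print(logits.shape)                              # torch.Size([8, 10])
```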
RNN/LSTM (Recurrent Neural Networks)
For sequences: Hidden state carries information across time steps.
LSTM adds gating mechanisms to handle long-range dependencies.
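A sketch of an LSTM over a batch of sequences in PyTorch (sequence length, feature size, and hidden size are arbitrary); the hidden and cell states carry information across time steps:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 20, 8)          # batch of 4 sequences, 20 time steps, 8 features

outputs, (h, c) = lstm(x)          # h, c: final hidden and cell states
print(outputs.shape)               # torch.Size([4, 20, 16]) - one output per time step
print(h.shape)                     # torch.Size([1, 4, 16])  - last hidden state
```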
Transformer
For sequences: Self-attention mechanism enables parallel processing.
Attention allows any position to relate to any other position.
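A sketch of scaled dot-product self-attention, the core Transformer operation, in plain PyTorch; the dimensions are arbitrary and this omits the multi-head structure and masking:

```python
import math
import torch
import torch.nn as nn

d_model = 32
x = torch.randn(2, 10, d_model)              # batch of 2 sequences, 10 positions

# Learned projections to queries, keys, and values.
q_proj, k_proj, v_proj = (nn.Linear(d_model, d_model) for _ in range(3))
Q, K, V = q_proj(x), k_proj(x), v_proj(x)

# Every position attends to every other position.
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (2, 10, 10) attention scores
weights = torch.softmax(scores, dim=-1)                  # rows sum to 1
attended = weights @ V                                    # (2, 10, 32) weighted values
print(attended.shape)
```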
Reference
| Topic | Description |
|---|---|
| Non-linear activation necessity | Without non-linearity, stacked layers collapse to a single linear transformation. Non-linearity enables learning complex patterns. |
| Vanishing gradient problem | Gradients shrink exponentially in deep networks, preventing early layer updates. Solutions: ReLU, batch normalization, residual connections, proper initialization. |
| Dropout mechanism | Randomly zeroing neurons prevents co-adaptation. Equivalent to training an ensemble of sub-networks. |
| Architecture selection | CNN: images (local patterns). RNN: sequences with short-range dependencies. Transformer: sequences where any position relates to any other (requires sufficient data). |
| Layer/neuron count selection | Start small, increase until validation loss stops improving. Deeper networks are often more effective than wider ones but harder to train. |