Neural Networks
Neural networks learn representations of data through stacked layers of transformations. They achieve state-of-the-art performance on unstructured data (images, text, audio) and learn features automatically, without manual feature engineering.
Architecture
A neural network consists of layers: Input Layer -> Hidden Layers -> Output Layer
Each neuron computes a linear combination followed by a non-linear activation: z = w*x + b, then a = activation(z)
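As a concrete illustration, a minimal NumPy sketch of one dense layer (the shapes and values are made up for the example):

```python
import numpy as np

# One dense layer with 3 inputs and 2 neurons (sizes chosen for illustration).
x = np.array([0.5, -1.0, 2.0])          # input vector
W = np.array([[0.1, 0.2, -0.3],
              [0.4, -0.5, 0.6]])        # one weight row per neuron
b = np.array([0.01, -0.02])             # one bias per neuron

z = W @ x + b                           # linear combination: z = w*x + b
a = np.maximum(0.0, z)                  # non-linear activation (ReLU): a = activation(z)
print(a)
```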
Activation Functions
| Function | Formula | Use Case |
|---|---|---|
| ReLU | max(0, x) | Hidden layers (default) |
| Sigmoid | 1/(1+e^-x) | Binary output |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Hidden layers |
| Softmax | e^(x_i) / Sum(e^(x_j)) | Multi-class output |
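The four activations above in NumPy (a sketch; a numerically stable softmax subtracts the maximum before exponentiating):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # 1 / (1 + e^-x)

def tanh(x):
    return np.tanh(x)                    # (e^x - e^-x) / (e^x + e^-x)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract max for numerical stability
    return e / e.sum()                   # e^(x_i) / Sum(e^(x_j))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores), softmax(scores).sum())   # probabilities summing to 1
```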
Training: Backpropagation
| Step | Description |
|---|---|
| Forward pass | Compute predictions through network |
| Compute loss | Compare predictions to targets |
| Backward pass | Compute gradients via chain rule |
| Update weights | w = w - learning_rate * gradient |
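A worked one-neuron example of the four steps (NumPy-free sketch; values are arbitrary). For a single linear neuron with squared-error loss L = (w*x + b - y)^2, the chain rule gives dL/dw = 2*(w*x + b - y)*x and dL/db = 2*(w*x + b - y):

```python
w, b = 0.5, 0.0          # initial parameters (arbitrary)
x, y = 2.0, 3.0          # one training example (arbitrary)
lr = 0.1                 # learning rate

# Forward pass: compute the prediction.
pred = w * x + b

# Compute loss: squared error against the target.
loss = (pred - y) ** 2

# Backward pass: gradients via the chain rule.
grad_w = 2 * (pred - y) * x
grad_b = 2 * (pred - y)

# Update weights: w = w - learning_rate * gradient.
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)        # loss = 4.0, then w = 1.3, b = 0.4
```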
Training step sequence:
- Zero out accumulated gradients from previous iteration
- Forward pass: compute model output for the input
- Compute loss by comparing output to target
- Backward pass: compute gradients via backpropagation
- Update weights using the optimizer
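In PyTorch, this sequence maps onto a loop like the following sketch (the model, loss function, optimizer, and data here are hypothetical placeholders chosen for the example):

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small regression model, MSE loss, and SGD.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 10)    # dummy batch
targets = torch.randn(64, 1)

for epoch in range(5):
    optimizer.zero_grad()             # zero out accumulated gradients
    outputs = model(inputs)           # forward pass: compute model output
    loss = loss_fn(outputs, targets)  # compute loss against the targets
    loss.backward()                   # backward pass: backpropagation
    optimizer.step()                  # update weights using the optimizer
```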
Regularization Techniques
| Technique | Description |
|---|---|
| Dropout | Randomly zero neurons during training |
| Batch Normalization | Normalize layer inputs |
| L2 Regularization | Weight decay penalty |
| Early Stopping | Stop when validation loss increases |
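A sketch of how the first three techniques typically appear in PyTorch (layer sizes are arbitrary); early stopping is usually a loop-level check on validation loss rather than a layer:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # batch normalization: normalize layer inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout: randomly zero neurons during training
    nn.Linear(64, 1),
)

# L2 regularization via the optimizer's weight_decay term.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```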
Common Architectures
CNN (Convolutional Neural Networks)
For images: Convolution -> Pooling -> Fully Connected
Captures local spatial patterns through learned filters.
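A minimal CNN sketch in PyTorch following the Convolution -> Pooling -> Fully Connected pattern (the input size and channel counts are assumptions for the example, e.g. 28x28 grayscale images and 10 classes):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learned local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected classifier
)

logits = cnn(torch.randn(8, 1, 28, 28))          # dummy batch of 8 images
print(logits.shape)                              # torch.Size([8, 10])
```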
RNN/LSTM (Recurrent Neural Networks)
For sequences: Hidden state carries information across time steps.
LSTM adds gating mechanisms to handle long-range dependencies.
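A sketch of an LSTM over a batch of sequences in PyTorch (sequence length, feature size, and hidden size are arbitrary); the hidden and cell states carry information across time steps:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 20, 8)          # batch of 4 sequences, 20 time steps, 8 features

outputs, (h, c) = lstm(x)          # h, c: final hidden and cell states
print(outputs.shape)               # torch.Size([4, 20, 16]) - one output per time step
print(h.shape)                     # torch.Size([1, 4, 16])  - last hidden state
```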
Transformer
For sequences: Self-attention mechanism enables parallel processing.
Attention allows any position to relate to any other position.
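A sketch of scaled dot-product self-attention, the core Transformer operation, in plain PyTorch; the dimensions are arbitrary and this omits the multi-head structure and masking:

```python
import math
import torch
import torch.nn as nn

d_model = 32
x = torch.randn(2, 10, d_model)              # batch of 2 sequences, 10 positions

# Learned projections to queries, keys, and values.
q_proj, k_proj, v_proj = (nn.Linear(d_model, d_model) for _ in range(3))
Q, K, V = q_proj(x), k_proj(x), v_proj(x)

# Every position attends to every other position.
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (2, 10, 10) attention scores
weights = torch.softmax(scores, dim=-1)                  # rows sum to 1
attended = weights @ V                                    # (2, 10, 32) weighted values
print(attended.shape)
```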
Reference
| Topic | Description |
|---|---|
| Non-linear activation necessity | Without non-linearity, stacked layers collapse to a single linear transformation. Non-linearity enables learning complex patterns. |
| Vanishing gradient problem | Gradients shrink exponentially in deep networks, preventing early layer updates. Solutions: ReLU, batch normalization, residual connections, proper initialization. |
| Dropout mechanism | Randomly zeroing neurons prevents co-adaptation. Equivalent to training an ensemble of sub-networks. |
| Architecture selection | CNN: images (local patterns). RNN: sequences with short-range dependencies. Transformer: sequences where any position relates to any other (requires sufficient data). |
| Layer/neuron count selection | Start small, increase until validation loss stops improving. Deeper networks are often more effective than wider ones but harder to train. |