Neural Networks

Neural networks learn data representations through layers of transformations. They achieve state-of-the-art performance on unstructured data (images, text, audio) and learn features automatically, without manual feature engineering.

Architecture

A neural network consists of layers: Input Layer -> Hidden Layers -> Output Layer

Each neuron computes a linear combination followed by a non-linear activation: z = w*x + b, then a = activation(z)
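As a concrete illustration, here is a minimal NumPy sketch of one dense layer of such neurons; the layer sizes and the choice of ReLU as the activation are arbitrary for the example, not prescribed by the text above.

```python
# One dense layer of neurons in NumPy: z = W*x + b, then a = activation(z).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input vector (3 features, illustrative)
W = rng.normal(size=(4, 3))   # one weight row per neuron (4 neurons)
b = np.zeros(4)               # one bias per neuron

z = W @ x + b                 # linear combination
a = relu(z)                   # non-linear activation
print(a)
```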

Activation Functions

| Function | Formula | Use Case |
| --- | --- | --- |
| ReLU | max(0, x) | Hidden layers (default) |
| Sigmoid | 1 / (1 + e^-x) | Binary output |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | Hidden layers |
| Softmax | e^(x_i) / Sum_j(e^(x_j)) | Multi-class output |
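For reference, a NumPy sketch of these four activations; the max-subtraction in softmax is a standard numerical-stability detail not shown in the table.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)  # equals (e^x - e^-x) / (e^x + e^-x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtracting the max does not change the result
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), tanh(x), softmax(x))
```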

Training: Backpropagation

| Step | Description |
| --- | --- |
| Forward pass | Compute predictions through network |
| Compute loss | Compare predictions to targets |
| Backward pass | Compute gradients via chain rule |
| Update weights | w = w - learning_rate * gradient |

Training step sequence:

  1. Zero out accumulated gradients from previous iteration
  2. Forward pass: compute model output for the input
  3. Compute loss by comparing output to target
  4. Backward pass: compute gradients via backpropagation
  5. Update weights using the optimizer
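These five steps map directly onto a typical PyTorch training step. This is only a sketch under the assumption of a PyTorch workflow; the model, loss, optimizer, and dummy batch below are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)        # dummy batch of 16 examples
targets = torch.randn(16, 1)

optimizer.zero_grad()               # 1. zero accumulated gradients
outputs = model(inputs)             # 2. forward pass
loss = loss_fn(outputs, targets)    # 3. compute loss
loss.backward()                     # 4. backward pass (backpropagation)
optimizer.step()                    # 5. update weights: w = w - lr * gradient
```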

Regularization Techniques

| Technique | Description |
| --- | --- |
| Dropout | Randomly zero neurons during training |
| Batch Normalization | Normalize layer inputs |
| L2 Regularization | Weight decay penalty |
| Early Stopping | Stop when validation loss increases |
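Assuming PyTorch, the first three techniques can each be expressed as a layer or an optimizer setting, as in this illustrative sketch; early stopping is a training-loop policy rather than a layer.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # batch normalization: normalize layer inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout: randomly zero neurons during training
    nn.Linear(64, 1),
)

# L2 regularization applied as a weight decay penalty in the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Early stopping (sketch): in the training loop, track the best validation
# loss and stop after a set number of epochs without improvement.
```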

Common Architectures

CNN (Convolutional Neural Networks)

For images: Convolution -> Pooling -> Fully Connected

Captures local spatial patterns through learned filters.
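A minimal PyTorch-style sketch of the Convolution -> Pooling -> Fully Connected pattern; the 28x28 grayscale input and 10-class output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learned local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected classifier
)

logits = cnn(torch.randn(8, 1, 28, 28))          # batch of 8 dummy images
print(logits.shape)                              # torch.Size([8, 10])
```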

RNN/LSTM (Recurrent Neural Networks)

For sequences: Hidden state carries information across time steps.

LSTM adds gating mechanisms to handle long-range dependencies.
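A small PyTorch LSTM sketch; the input size, hidden size, and sequence length are illustrative, and feeding the final hidden state into a linear head is an assumed design choice, not the only option.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)

x = torch.randn(4, 20, 8)     # batch of 4 sequences, 20 time steps, 8 features
outputs, (h, c) = lstm(x)     # outputs: hidden state at every time step
prediction = head(h[-1])      # final hidden state carries the sequence summary
print(prediction.shape)       # torch.Size([4, 1])
```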

Transformer

For sequences: Self-attention mechanism enables parallel processing.

Attention allows any position to relate to any other position.
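The core computation can be sketched in NumPy as single-head scaled dot-product self-attention; the random projection matrices and sizes below are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))      # token representations

Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d_model)          # every position scores every other
weights = softmax(scores, axis=-1)           # each row sums to 1
output = weights @ V                         # weighted mix of all positions
print(output.shape)                          # (5, 16)
```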

Reference

| Topic | Description |
| --- | --- |
| Non-linear activation necessity | Without non-linearity, stacked layers collapse to a single linear transformation. Non-linearity enables learning complex patterns. |
| Vanishing gradient problem | Gradients shrink exponentially in deep networks, preventing early layer updates. Solutions: ReLU, batch normalization, residual connections, proper initialization. |
| Dropout mechanism | Randomly zeroing neurons prevents co-adaptation. Equivalent to training an ensemble of sub-networks. |
| Architecture selection | CNN: images (local patterns). RNN: sequences with short-range dependencies. Transformer: sequences where any position relates to any other (requires sufficient data). |
| Layer/neuron count selection | Start small, increase until validation loss stops improving. Deeper networks are often more effective than wider ones but harder to train. |
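As a quick check of the first row, a NumPy sketch showing that two stacked linear layers without an activation collapse to a single linear transformation; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)   # collapsed weights and bias
print(np.allclose(two_layers, one_layer))    # True
```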