Module 5: Deep Learning Architectures


Deep Learning Tutorial Series

This module introduces models for sequential data and gives a high-level overview of modern deep learning systems, including attention and transformers.


5.1 Learning Goals

By the end of this module, you should understand:

  • Why sequence models are needed
  • What sequential data is
  • Basic idea of RNNs
  • Why LSTM/GRU were introduced
  • Intuition behind attention mechanisms
  • High-level idea of Transformers
  • What pretrained models and transfer learning are
  • How modern deep learning systems are built

Deep Learning Architectures

  • Deep learning architectures (IBM blog, 2017)

    This article classifies deep learning architectures into supervised and unsupervised learning and introduces several popular ones: convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) and gated recurrent unit (GRU) networks, self-organizing maps (SOMs), autoencoders (AEs), and restricted Boltzmann machines (RBMs). It also gives an overview of deep belief networks (DBNs) and deep stacking networks (DSNs).

  • Illustrated: 10 CNN Architectures: LeNet-5, AlexNet, VGG-16, Inception-v1, Inception-v3, ResNet-50, Xception, Inception-v4, Inception-ResNets, ResNeXt-50
  • Understanding AlexNet

5.2 What is Sequential Data?

Sequential data has an order where previous elements affect future ones.

Examples:

  • Text (words in a sentence)
  • Time series (stock prices, sensor data)
  • Speech signals
  • Video frames

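Unlike a single image, where structure is spatial, the order of the elements carries meaning: shuffling the words of a sentence or the readings of a sensor changes what the data says.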
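To make "previous elements affect future ones" concrete, here is a minimal sketch that turns a time series into (past values, next value) pairs, which is one common way to frame sequence prediction. The function name `make_windows` and the price values are made up for illustration.

```python
def make_windows(series, window):
    """Split a sequence into (past values, next value) training pairs."""
    pairs = []
    for i in range(len(series) - window):
        past = series[i:i + window]    # ordered context
        target = series[i + window]    # the value that follows the context
        pairs.append((past, target))
    return pairs


prices = [10, 11, 13, 12, 15]          # made-up values, not real data
pairs = make_windows(prices, window=3)
shuffled_pairs = make_windows(sorted(prices), window=3)

print(pairs)
print(shuffled_pairs)
```

The two printed lists contain the same numbers but describe different prediction problems, which is the sense in which order matters.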


5.3 Why CNNs Are Not Enough

CNNs are good for spatial patterns but:

  • They do not naturally model time/order
  • They assume fixed-size local structure
  • They struggle with long-range dependencies

We need models that “remember” past information.
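The "local structure" point can be illustrated with a small sketch of a single 1D convolution layer. The kernel values and inputs below are arbitrary assumptions; the point is only that each output looks at a fixed window of neighbours. Stacking layers or using dilation widens that window, so this shows one layer, not a limit on CNNs in general.

```python
def conv1d(xs, kernel):
    """Valid 1D convolution (cross-correlation form): each output
    is a weighted sum of len(kernel) neighbouring inputs."""
    k = len(kernel)
    return [sum(kernel[j] * xs[i + j] for j in range(k))
            for i in range(len(xs) - k + 1)]


kernel = [0.25, 0.5, 0.25]             # arbitrary smoothing kernel
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = a[:-1] + [100.0]                   # change only the last element

out_a = conv1d(a, kernel)
out_b = conv1d(b, kernel)

# The first output only sees positions 0..2, so the distant change is invisible to it.
print(out_a[0], out_b[0])
# The last output does see the changed element.
print(out_a[-1], out_b[-1])
```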


5.4 Recurrent Neural Networks (RNNs)

RNNs process sequences step-by-step.

Core idea:

  • Take input one step at a time
  • Maintain a hidden “memory” state

x1 → x2 → x3 → x4
↓    ↓    ↓    ↓
h1 → h2 → h3 → h4   (hidden state, updated at each step)

At each step:

  • Output depends on current input + past memory
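The step-by-step update can be sketched as follows, assuming the standard vanilla RNN form h_t = tanh(W_x x_t + W_h h_(t-1) + b). The weights are hand-picked for illustration rather than learned, and the names `rnn_step` and `rnn_forward` are hypothetical.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One vanilla RNN update: h = tanh(W_x x + W_h h_prev + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))            # current input
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))  # past memory
        h.append(math.tanh(total))
    return h


def rnn_forward(xs, W_x, W_h, b):
    """Run the cell over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    states = []
    for x in xs:
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states


# Hand-picked weights: 1 input feature, 2 hidden units.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]
xs = [[1.0], [0.5], [-1.0], [2.0]]

states = rnn_forward(xs, W_x, W_h, b)
reversed_states = rnn_forward(xs[::-1], W_x, W_h, b)
print(states[-1])
print(reversed_states[-1])
```

Feeding the same inputs in reverse order gives a different final state, because each hidden state depends on everything that came before it.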

RNN problems:

  • Forget long-term information
  • Vanishing gradient issue
  • Hard to train on long sequences

Example problem:

  • Early words in a sentence are forgotten in long text
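The vanishing gradient issue can be seen in a scalar toy model. The sketch below assumes the recurrence h_t = tanh(w * h_(t-1) + x_t) and multiplies the per-step derivatives w * (1 - h_t^2) given by the chain rule; the weight 0.9 and the constant input 0.5 are arbitrary choices.

```python
import math


def gradient_through_time(w, xs):
    """Return |d h_t / d h_0| after each step for h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    history = []
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)      # chain rule: one factor per time step
        history.append(abs(grad))
    return history


history = gradient_through_time(w=0.9, xs=[0.5] * 50)
print(history[0], history[9], history[-1])
```

Every factor has magnitude below 1 here, so the influence of the earliest state on the latest one shrinks geometrically with sequence length. That is why early inputs contribute almost nothing to the training signal.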

5.6 LSTM and GRU (Improved Memory Models)

LSTM (Long Short-Term Memory) was designed to address these RNN limitations.

Key idea:

  • Add a controlled memory system
  • Decide what to remember and forget

GRU (Gated Recurrent Unit) is a simpler variant of LSTM with fewer gates.

They introduce:

  • Memory gates
  • Controlled information flow

Result:

  • Better handling of long sequences
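A minimal sketch of one LSTM step, written for a scalar state to keep it readable. It assumes the commonly used formulation with forget, input, and output gates plus a candidate value; the parameter layout (a dict of `(w_x, w_h, bias)` tuples) and the numbers are illustrative assumptions, not learned weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))         # how much old memory to keep
    i = sigmoid(pre("input"))          # how much new content to write
    o = sigmoid(pre("output"))         # how much memory to expose
    g = math.tanh(pre("candidate"))    # proposed new content
    c = f * c_prev + i * g             # updated cell (memory) state
    h = o * math.tanh(c)               # updated hidden state
    return h, c


# Illustrative parameters, identical for every gate.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
h1, c1 = lstm_step(1.0, h, c, params)
h, c = h1, c1
for x in [-1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h1, c1)
print(h, c)
```

A GRU follows the same gating idea but merges the cell and hidden state and uses two gates instead of three.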

Intuition: Why LSTM Works

Think of memory like a notebook:

  • RNN: overwrites notes constantly
  • LSTM: decides what to keep, update, or erase

This helps retain important long-term context.
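The notebook analogy can be sketched numerically. The example below is a deliberate simplification: the "gated" memory uses fixed gate values (0.999 and 0.001), whereas a real LSTM learns its gates and makes them depend on the input. Both function names are hypothetical.

```python
import math


def rnn_memory(first, later_inputs, w=0.5):
    """Plain recurrent update: the whole state is rewritten at every step."""
    h = first
    for x in later_inputs:
        h = math.tanh(w * h + x)
    return h


def gated_memory(first, later_inputs, forget=0.999, write=0.001):
    """Gated update with fixed gates: keep most of the old note, write very little."""
    c = first
    for x in later_inputs:
        c = forget * c + write * math.tanh(x)
    return c


later = [0.3, -0.2] * 25               # 50 later inputs, made up for illustration

# Start from two different "first notes" and check whether the difference survives.
rnn_gap = abs(rnn_memory(1.0, later) - rnn_memory(-1.0, later))
gated_gap = abs(gated_memory(1.0, later) - gated_memory(-1.0, later))
print(rnn_gap, gated_gap)
```

After 50 steps the plain recurrence can no longer tell the two starting notes apart, while the gated memory still can.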


Attention Mechanism (Core Idea)

Attention allows a model to focus on the most relevant parts of the input.

Instead of processing the input strictly step by step:

  • The model looks at all parts at once
  • It assigns an importance weight to each part

Input → Weighted focus on relevant parts → Output

Example (sentence):

“The cat that I saw yesterday was sleeping”

Attention helps connect “cat” with “was sleeping”.
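The "importance weights" idea can be sketched with dot-product scores followed by a softmax. The vectors below are hand-made stand-ins for three words of the sentence, not learned embeddings, and `attend` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


tokens = ["cat", "yesterday", "sleeping"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]         # toy vectors, chosen by hand
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]                                   # toy query that points toward "cat"

weights, output = attend(query, keys, values)
for token, weight in zip(tokens, weights):
    print(token, round(weight, 3))
```

The output is a blend of the values dominated by the best-matching token, which is how a distant word such as "cat" can directly influence the representation of "was sleeping".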

Why Attention is Powerful

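  • Handles long-range dependencies well, because any two positions are connected directly
  • Parallel computation across positions (typically faster to train than RNNs)
  • Attention weights can be inspected, which offers some interpretability
  • Scales well on parallel hardware, although the cost of self-attention grows quadratically with sequence length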

5.10 Transformers (Modern Standard)

Transformers are the dominant architecture in modern deep learning.

Core components:

  • Self-attention
  • Feedforward layers
  • Positional encoding
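Self-attention on its own has no notion of token order, which is why a positional signal is added. The sketch below assumes the sinusoidal scheme, which is one common choice; learned position embeddings are another. The function name and the sizes are illustrative.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: even dimensions use sin, odd dimensions use cos,
    and each sin/cos pair shares one wavelength."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Each row is added to the embedding of the token at that position, so identical words at different positions get different inputs.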

Self-Attention (Intuition)

Each token (word) looks at all other tokens and decides:

  • Which words are important
  • How strongly they are related

Example sentence:

“The dog chased the ball because it was excited”

Attention helps determine what “it” refers to.
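A minimal sketch of scaled dot-product self-attention, where queries, keys, and values are all projections of the same token vectors. The token vectors and the identity projection matrices are placeholder assumptions; in a trained model the projections are learned, and several attention heads usually run side by side.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    """Every token attends to every token of the same sequence."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]            # transpose of K
    scores = matmul(Q, K_T)                         # one score per token pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, V)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # 3 toy tokens, 2 features each
identity = [[1.0, 0.0], [0.0, 1.0]]                 # placeholder for learned projections

weights, output = self_attention(X, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Row i of `weights` shows how strongly token i relates to every token in the sequence. In a trained model, the row for "it" would be expected to put most of its weight on the word it refers to.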


Why Transformers Replaced RNNs

Transformers are preferred because:

  • Parallel processing (fast training)
  • Better long-range modeling
  • Scales well with large data
  • Works across text, images, audio

5.16 Where Each Model is Used

Model         Use Case
RNN           Basic sequence modeling (legacy)
LSTM/GRU      Time series, speech (still used)
CNN           Image tasks
Transformer   NLP, vision, multimodal systems

5.17 Key Takeaways

  • Sequential data requires memory-aware models
  • RNNs introduced sequence processing
  • LSTM/GRU improved long-term memory
  • Attention allows flexible focus on inputs
  • Transformers are the modern standard
  • Pretrained models dominate real-world applications

Acknowledgement

Parts of this content were generated by ChatGPT.
