Module 5 Deep Learning Architectures

Published: June 24, 2026

Deep Learning Tutorial Series

This module introduces models for sequential data and gives a high-level overview of modern deep learning systems, including attention and transformers.

5.1 Learning Goals

By the end of this module, you should understand:

Why sequence models are needed
What sequential data is
Basic idea of RNNs
Why LSTM/GRU were introduced
Intuition behind attention mechanisms
High-level idea of Transformers
What pretrained models and transfer learning are
How modern deep learning systems are built

Deep Learning Architectures

Deep learning architectures (IBM blog, 2017)
This article classifies deep learning architectures into supervised and unsupervised learning and introduces several popular deep learning architectures: convolutional neural networks, recurrent neural networks (RNNs), long short-term memory/gated recurrent unit (GRU), self-organizing map (SOM), autoencoders (AE) and restricted Boltzman machine (RBM). It also gives an overview of deep belief networks (DBN) and deep stacking networks (DSNs)
Illustrated: 10 CNN Architectures: LeNet-5, AlexNet, VGG-16, Inception-v1, Inception-v3, ResNet-50, Xception, Inception-v4, Inception-ResNets, ResNeXt-50
Understanding AlexNet

5.2 What is Sequential Data?

Sequential data has an order where previous elements affect future ones.

Examples:

Text (words in a sentence)
Time series (stock prices, sensor data)
Speech signals
Video frames

Unlike images, order matters.

5.3 Why CNNs Are Not Enough

CNNs are good for spatial patterns but:

They do not naturally model time/order
They assume fixed-size local structure
They struggle with long-range dependencies

We need models that “remember” past information.

5.4 Recurrent Neural Networks (RNNs)

RNNs process sequences step-by-step.

Core idea:

Take input one step at a time
Maintain a hidden “memory” state

x1 → x2 → x3 → x4
     ↓    ↓    ↓
   hidden state updates

At each step:

Output depends on current input + past memory

RNN problems:

Forget long-term information
Vanishing gradient issue
Hard to train on long sequences

Example problem:

Early words in a sentence are forgotten in long text

5.6 LSTM and GRU (Improved Memory Models)

LSTM (Long Short-Term Memory) fixes RNN limitations.

Key idea:

Add a controlled memory system
Decide what to remember and forget

GRU is a simpler version of LSTM.

They introduce:

Memory gates
Controlled information flow

Result:

Better handling of long sequences

Intuition: Why LSTM Works

Think of memory like a notebook:

RNN: overwrites notes constantly
LSTM: decides what to keep, update, or erase

This helps retain important long-term context.

Attention Mechanism (Core Idea)

Attention allows a model to focus on important parts of input.

Instead of processing sequentially:

Model looks at all parts at once
Assigns importance weights

Input → Weighted focus on relevant parts → Output

Example (sentence):

“The cat that I saw yesterday was sleeping”

Attention helps connect “cat” with “was sleeping”.

Why Attention is Powerful

Handles long-range dependencies well
Parallel computation (faster than RNNs)
Better interpretability
Scales efficiently

5.10 Transformers (Modern Standard)

Transformers are the dominant architecture in modern deep learning.

Core components:

Self-attention
Feedforward layers
Positional encoding

Self-Attention (Intuition)

Each token (word) looks at all other tokens and decides:

Which words are important
How strongly they are related

Example:

Sentence:

“The dog chased the ball because it was excited”

Attention helps determine what “it” refers to.

Why Transformers Replaced RNNs

Transformers are preferred because:

Parallel processing (fast training)
Better long-range modeling
Scales well with large data
Works across text, images, audio

5.16 Where Each Model is Used

Model	Use Case
RNN	Basic sequence modeling (legacy)
LSTM/GRU	Time series, speech (still used)
CNN	Image tasks
Transformer	NLP, vision, multimodal systems

5.17 Key Takeaways

Sequential data requires memory-aware models
RNNs introduced sequence processing
LSTM/GRU improved long-term memory
Attention allows flexible focus on inputs
Transformers are the modern standard
Pretrained models dominate real-world applications

Acknowledgement

Part of the contents are generated by ChatGPT.

Return to the Main Page of Deep Learning and Machine Learning .

Junqing Zhang