Module 4 — CNNs and Computer Vision

Published:

Deep Learning Tutorial Series

This module introduces Convolutional Neural Networks (CNNs), which are the standard approach for image-related deep learning tasks.


4.1 Learning Goals

By the end of this module, you should understand:

  • Why fully connected networks are not ideal for images
  • What a convolution operation is
  • How CNNs extract features from images
  • The role of filters and feature maps
  • Pooling layers and their purpose
  • How to build a basic CNN in PyTorch
  • How CNNs are used for image classification

4.2 Why Not Use Fully Connected Networks for Images?

Images are structured data.

Example: 28×28 image = 784 inputs

Problems with fully connected layers:

  • Too many parameters
  • Ignore spatial structure
  • Poor generalization on images

CNNs solve this by exploiting spatial patterns.


4.3 Core Idea of CNNs

CNNs learn patterns in local regions of images.

Instead of looking at all pixels at once, they look at small patches.


4.4 Convolution Operation (Intuition)

A filter (kernel) slides over an image and extracts features.

Example idea:

  • Detect edges
  • Detect corners
  • Detect textures

Operation:

Image + Filter → Feature Map

4.5 Filters (Kernels)

A filter is a small matrix (e.g., 3×3) used to scan an image.

It learns to detect specific patterns:

  • Horizontal edges
  • Vertical edges
  • Shapes

Filters are learned automatically during training.


4.6 Feature Maps

A feature map is the output of applying a filter.

  • Bright regions → strong feature presence
  • Dark regions → weak or no feature

Each convolution layer produces multiple feature maps.


4.7 Stride and Padding

Stride

How far the filter moves each step.

  • Stride = 1 → detailed scan
  • Stride = 2 → faster, smaller output

Padding

Adds borders to preserve image size.

  • “Same” padding keeps dimensions
  • “Valid” reduces dimensions

4.8 Pooling Layers

Pooling reduces spatial size while keeping important information.

Max Pooling

Keeps strongest activation in a region.

Example:

[1, 3]
[2, 4] → 4

Benefits:

  • Reduces computation
  • Improves robustness
  • Prevents overfitting

4.9 CNN Architecture Overview

A typical CNN looks like:

Input Image
   ↓
Convolution + ReLU
   ↓
Pooling
   ↓
Convolution + ReLU
   ↓
Pooling
   ↓
Fully Connected Layer
   ↓
Output

4.10 Building a CNN in PyTorch

```python id=”cnn_model” import torch import torch.nn as nn

class SimpleCNN(nn.Module): def init(self): super(SimpleCNN, self).init()

    self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
    self.relu = nn.ReLU()
    self.pool = nn.MaxPool2d(2, 2)

    self.conv2 = nn.Conv2d(16, 32, 3, padding=1)

    self.fc1 = nn.Linear(32 * 7 * 7, 128)
    self.fc2 = nn.Linear(128, 10)

def forward(self, x):
    x = self.pool(self.relu(self.conv1(x)))
    x = self.pool(self.relu(self.conv2(x)))

    x = x.view(x.size(0), -1)

    x = self.relu(self.fc1(x))
    x = self.fc2(x)

    return x ```

4.11 Dataset: CIFAR-10 (Recommended)

CIFAR-10 contains real-world images:

  • 10 classes
  • 32×32 RGB images
  • Examples: airplane, cat, dog, car

4.12 Loading CIFAR-10 Dataset

```python id=”cifar_load” import torchvision import torchvision.transforms as transforms

transform = transforms.Compose([ transforms.ToTensor() ])

train_data = torchvision.datasets.CIFAR10( root=”./data”, train=True, download=True, transform=transform )

train_loader = torch.utils.data.DataLoader( train_data, batch_size=64, shuffle=True )


---

# 4.13 Training Setup

```python id="cnn_train_setup"
import torch.optim as optim

model = SimpleCNN()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

4.14 Training Loop

```python id=”cnn_train_loop” for epoch in range(5): total_loss = 0

for images, labels in train_loader:
    outputs = model(images)

    loss = criterion(outputs, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}") ```

4.15 Model Evaluation

```python id=”cnn_eval” correct = 0 total = 0

with torch.no_grad(): for images, labels in train_loader: outputs = model(images) _, predicted = torch.max(outputs, 1)

    total += labels.size(0)
    correct += (predicted == labels).sum().item()

print(“Accuracy:”, correct / total) ```


4.16 What CNNs Learn (Intuition)

CNN layers learn hierarchical features:

Early layers:

  • Edges
  • Lines

Middle layers:

  • Shapes
  • Patterns

Deep layers:

  • Objects (cat, car, etc.)

4.17 Common Issues

  • Incorrect input shape (channel mismatch)
  • Forgetting normalization
  • Overfitting on small datasets
  • Too large learning rate

4.18 Key Takeaways

  • CNNs are designed for images
  • Convolution extracts local features
  • Pooling reduces complexity
  • CNNs learn hierarchical representations
  • They outperform fully connected networks on vision tasks

Acknowledgement

Part of the contents are generated by ChatGPT.

Return to the Main Page of Deep Learning and Machine Learning .