# YOLO11 GhostConv + Knowledge Distillation + Quantization

This notebook implements a complete model-optimization pipeline for YOLO11 targeting edge devices: a custom GhostConv architecture, knowledge distillation, and quantization.
## Table of Contents
- [Overview](#overview)
- [Notebook Structure](#notebook-structure)
- [System Requirements](#system-requirements)
- [Installation](#installation)
- [Usage Guide](#usage-guide)
- [Results](#results)
- [Technical Details](#technical-details)
- [Hyperparameters](#hyperparameters)
- [Troubleshooting](#troubleshooting)
- [References](#references)
## Overview
This notebook implements a 3-stage YOLO11 optimization pipeline:
1. **Custom Architecture (YOLO11n-GhostConv)**
   - Replace standard Conv layers with GhostConv to reduce parameters
   - Retain C3k2 and C2PSA blocks for feature extraction
   - Architecture tuned for a 5-class traffic dataset
2. **Knowledge Distillation (KD)**
   - Teacher model: YOLO11l (large)
   - Student model: YOLO11n-GhostConv (custom lightweight)
   - Techniques:
     - Feature-based distillation (MSE loss)
     - Logit-based distillation (KL divergence)
     - Temperature scaling (T=4.0)
     - Progressive KD with warmup epochs
3. **Quantization**
   - FP32 → INT8 quantization with TFLite
   - FP32 → FP16 quantization
   - Calibration dataset for INT8
   - Performance comparison: FP32 vs INT8 vs FP16
## Notebook Structure
### Section 1: Initialization
- Mount Google Drive
- Set up project directories
- Import Ultralytics modules (GhostConv, C3k2, C2PSA)
- Clone and install Ultralytics from source
### Section 2: Custom Architecture
- Define the YOLO11_TinyGhost architecture in YAML
- Backbone with GhostConv layers
- Head with a Detect layer for 5 classes
- Train the baseline model (50 epochs)
### Section 3: Knowledge Distillation
Class implementations:
- `KDConfig`: configuration for KD training
- `KnowledgeDistillationTrainer`: custom trainer inheriting from `DetectionTrainer`
  - Forward hooks to capture intermediate features
  - Feature distillation loss (normalized MSE)
  - Logit distillation loss (KL divergence with temperature)
  - Combined loss: `(1 - α - β) * L_hard + α * L_feature + β * L_logit`
Training strategy:
- Warmup phase (8 epochs): hard loss only
- After warmup: combine hard + KD losses (see the sketch below)
- KD layers: `["model.4", "model.6", "model.10"]` (P3, P4, PSA)
- Hyperparameters: α=0.3, β=0.2, T=4.0
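A minimal sketch of the progressive warmup schedule, assuming a hard cut-over at the warmup epoch (the function name is illustrative; the notebook's trainer may ramp the weights gradually instead):

```python
def kd_loss_weights(epoch: int, warmup_epochs: int = 8,
                    alpha: float = 0.3, beta: float = 0.2):
    """Return (hard, feature, logit) loss weights for the current epoch."""
    if epoch < warmup_epochs:
        # Warmup phase: train on the hard detection loss only.
        return 1.0, 0.0, 0.0
    # After warmup: blend the hard loss with both KD terms.
    return 1.0 - alpha - beta, alpha, beta
```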
### Section 4: Visualization
- Training metrics plotting (mAP, loss curves)
- F1 score tracking
- Precision/Recall curves
- Box/Class/DFL loss comparison
### Section 5: Fine-tuning
- Load the best KD checkpoint
- Fine-tune on a multi-view intersection dataset
- Freeze 3 backbone layers
- Low learning rate (1e-5) with a cosine scheduler (see the sketch below)
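A sketch of this step using standard Ultralytics `train()` arguments (`freeze`, `lr0`, `cos_lr`); the paths and epoch count are placeholders:

```python
from ultralytics import YOLO

# Load the best KD checkpoint (path is illustrative).
model = YOLO("path/to/kd_best.pt")

model.train(
    data="path/to/intersection_data.yaml",  # multi-view intersection dataset
    epochs=20,        # placeholder; pick per validation behavior
    freeze=3,         # freeze the first 3 backbone layers
    lr0=1e-5,         # low initial learning rate
    cos_lr=True,      # cosine LR scheduler
    device=0,
)
```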
### Section 6: Quantization
Export formats:
- INT8 TFLite (with calibration dataset)
- FP16 TFLite

Evaluation:
- Compare mAP50 and mAP50-95 across FP32, INT8, and FP16
- Image size: 416x416
## System Requirements
### Hardware
- GPU: CUDA-compatible (T4 or better recommended)
- RAM: 16 GB+
- Storage: 10 GB+ for datasets and models
### Software
- Python >= 3.8
- PyTorch >= 1.13
- CUDA >= 11.3
- Google Colab (recommended)
## Installation
### 1. Clone Ultralytics from source
```python
!git clone https://github.com/ultralytics/ultralytics
%cd ultralytics
!pip install -e .
```
### 2. Install dependencies
```bash
pip install torch torchvision
pip install matplotlib pandas
pip install opencv-python pillow
```
### 3. Dataset structure
```text
dataset/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
└── data.yaml
```
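A minimal `data.yaml` matching this layout; the class names below are placeholders for the dataset's 5 traffic classes:

```yaml
path: /content/drive/MyDrive/dataset/yolo_mtid_motor/dataset
train: images/train
val: images/val

nc: 5
names: [class0, class1, class2, class3, class4]  # replace with the real 5 class names
```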
## Usage Guide
### Step 1: Prepare Data
```python
PROJECT_DIR = "/content/drive/MyDrive/yolo_ghostblock"
DATASET_DIR = "/content/drive/MyDrive/dataset/yolo_mtid_motor/dataset"
```
### Step 2: Train Baseline GhostConv Model
```python
from ultralytics import YOLO

model = YOLO("yolo11_tinyghost.yaml")
model.train(
    data=f"{DATASET_DIR}/data.yaml",
    epochs=50,
    imgsz=640,
    device=0,
)
```
### Step 3: Knowledge Distillation
```python
# Load teacher and student
teacher = YOLO("path/to/teacher.pt")
student = YOLO("path/to/student.pt")

# Create KD trainer
TrainerClass = create_kd_trainer_class(
    teacher_model=teacher,
    kd_alpha=0.3,
    kd_beta=0.2,
    kd_temperature=4.0,
    kd_layers=["model.4", "model.6", "model.10"],
)

# Train with KD
trainer = TrainerClass(overrides={...})
trainer.train()
```
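`create_kd_trainer_class` is the notebook's own helper; the `overrides` dict takes standard Ultralytics training arguments. A hypothetical example, mirroring the hyperparameters listed later:

```python
# Hypothetical overrides; adjust to your dataset and hardware.
overrides = {
    "data": f"{DATASET_DIR}/data.yaml",
    "epochs": 40,
    "batch": 16,
    "imgsz": 640,
    "optimizer": "AdamW",
    "lr0": 5e-5,
    "cos_lr": True,
    "device": 0,
}
trainer = TrainerClass(overrides=overrides)
trainer.train()
```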
### Step 4: Quantization
```python
# Export INT8
model.export(
    format="tflite",
    int8=True,
    data=CALIB_YAML,
    imgsz=416,
)

# Evaluate quantized model
model_int8 = YOLO("best_int8.tflite")
metrics = model_int8.val(data=DATA_YAML, imgsz=416)
```
## Results
### Model Comparison
| Model | Parameters | Size | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLO11l (Teacher) | ~20M | ~40MB | 0.95+ | 0.80+ |
| YOLO11n-Ghost | ~2M | ~4MB | 0.92+ | 0.75+ |
| + KD | ~2M | ~4MB | 0.94+ | 0.78+ |
| + INT8 | ~2M | ~1MB | 0.93+ | 0.76+ |
### Quantization Impact
- FP32 → INT8: ~75% size reduction, ~1-2% mAP drop
- FP32 → FP16: ~50% size reduction, ~0.5% mAP drop
### Training Curves
- Box loss: converges after ~30 epochs
- mAP50: plateaus around epochs 35-40
- F1 score: 0.85-0.90 range
## Technical Details
### GhostConv Architecture
```yaml
backbone:
  - [-1, 1, GhostConv, [64, 3, 2]]
  - [-1, 1, GhostConv, [128, 3, 2]]
  - [-1, 1, C3k2, [256, False, 0.25]]
  # ...
```
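For intuition, a minimal PyTorch sketch of the ghost-module idea from GhostNet (a primary convolution produces half the output channels, then a cheap depthwise convolution generates the rest). Ultralytics ships its own `GhostConv`; this sketch is illustrative only:

```python
import torch
import torch.nn as nn

class GhostConvSketch(nn.Module):
    """Illustrative ghost module: half the output channels from a regular
    conv, the other half from a cheap depthwise conv over those features."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        # Primary convolution: produces the "intrinsic" feature maps.
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )
        # Cheap operation: 5x5 depthwise conv, far fewer FLOPs.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```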
### KD Loss Formula
```text
L_total   = (1 - α - β) * L_hard + α * L_feature + β * L_logit
L_feature = MSE(normalize(S_feat), normalize(T_feat))
L_logit   = KL(softmax(S_logit / T) || softmax(T_logit / T)) * T²
```
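The same losses in PyTorch, as a minimal sketch (tensor names and shapes are assumptions; the notebook's trainer applies the feature loss per hooked layer):

```python
import torch
import torch.nn.functional as F

def feature_kd_loss(s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
    # Normalized MSE: L2-normalize flattened feature maps before comparing.
    s = F.normalize(s_feat.flatten(1), dim=1)
    t = F.normalize(t_feat.flatten(1), dim=1)
    return F.mse_loss(s, t)

def logit_kd_loss(s_logit: torch.Tensor, t_logit: torch.Tensor,
                  temperature: float = 4.0) -> torch.Tensor:
    # Temperature-scaled KL divergence; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    log_p_s = F.log_softmax(s_logit / temperature, dim=-1)
    p_t = F.softmax(t_logit / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```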
### Quantization Config
- INT8: post-training quantization with calibration
- Calibration: 100-200 images from the training set
- Input: uint8 [0, 255] or float32 normalized
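To check which input type the exported model actually expects, the TFLite interpreter reports each tensor's dtype and (scale, zero_point); the file name here is illustrative:

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="best_int8.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
print("input :", inp["dtype"], inp["quantization"])   # (scale, zero_point)
print("output:", out["dtype"], out["quantization"])
```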
## Hyperparameters
### Training
- Epochs: 40-50
- Batch size: 16
- Image size: 640x640
- Learning rate: 5e-5 (baseline), 1e-5 (fine-tune)
- Optimizer: AdamW with cosine scheduler
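These settings map directly onto Ultralytics `train()` arguments; a sketch with a placeholder data path:

```python
from ultralytics import YOLO

model = YOLO("yolo11_tinyghost.yaml")
model.train(
    data="path/to/data.yaml",
    epochs=50,
    batch=16,
    imgsz=640,
    optimizer="AdamW",
    lr0=5e-5,      # baseline; use 1e-5 when fine-tuning
    cos_lr=True,   # cosine scheduler
)
```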
### Knowledge Distillation
- α (feature): 0.3
- β (logit): 0.2
- Temperature: 4.0
- Warmup epochs: 8
- KD layers: P3, P4, PSA output
### Quantization
- Format: TFLite
- Input size: 416x416 (edge deployment)
- Calibration samples: 100
## Troubleshooting
### Issue 1: CUDA Out of Memory
```python
# Reduce batch size
batch = 8
# Enable mixed precision
amp = True
```
### Issue 2: Feature Shape Mismatch in KD
- Check teacher and student architecture compatibility
- Verify KD layer names match between models
- Ensure input sizes are consistent
### Issue 3: INT8 Quantization Accuracy Drop
- Increase the number of calibration samples
- Use a representative calibration dataset (diverse conditions)
- Consider quantization-aware training (QAT)
## References
### Papers
- YOLO11 (Ultralytics)
- GhostNet: More Features from Cheap Operations (Han et al., 2020)
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (Jacob et al., 2018)
## Key Features
### Architecture Optimization
- GhostConv: Reduces FLOPs by ~50% compared to standard convolutions
- Lightweight backbone: Maintains accuracy while reducing parameters
- Flexible head: Supports multiple detection tasks
### Knowledge Distillation
- Multi-level distillation: Combines feature and logit knowledge transfer
- Temperature-scaled softmax: Smooths probability distributions
- Progressive training: Warmup phase for stable convergence
### Model Compression
- INT8 quantization: 4x memory reduction
- FP16 quantization: 2x memory reduction
- Edge-ready: Optimized for mobile/embedded deployment
## Best Practices
### Training
- Start with pre-trained weights when possible
- Use data augmentation (mosaic, mixup, etc.)
- Monitor validation metrics closely
- Apply early stopping (patience=10-15)
### Knowledge Distillation
- Ensure teacher model is well-trained (mAP > 90%)
- Match batch normalization statistics
- Use appropriate temperature (T=3-5 for object detection)
- Gradually increase KD loss weight
### Quantization
- Use diverse calibration dataset
- Test on representative test set
- Profile inference speed on target device
- Consider hybrid quantization (some layers FP32)
## Performance Metrics
### Speed Benchmarks
| Model | FP32 (ms) | FP16 (ms) | INT8 (ms) | Device |
|---|---|---|---|---|
| YOLO11l | 45 | 28 | N/A | T4 GPU |
| YOLO11n-Ghost | 12 | 8 | N/A | T4 GPU |
| INT8 TFLite | N/A | N/A | 25 | Edge TPU |
### Accuracy vs Efficiency
- YOLO11l: Highest accuracy, largest model
- YOLO11n-Ghost: Best accuracy/size trade-off
- + KD: Closes gap with teacher
- + INT8: Minimal accuracy loss, deployable
## Workflow Summary
```mermaid
graph LR
    A[YOLO11l Teacher] --> B[Design GhostConv Student]
    B --> C[Train Baseline]
    C --> D[Knowledge Distillation]
    D --> E[Fine-tune]
    E --> F[Quantize INT8/FP16]
    F --> G[Deploy to Edge]
```
## Deployment
### TFLite Conversion
```python
# Export to TFLite INT8
model.export(
    format="tflite",
    int8=True,
    data="calibration.yaml",
    imgsz=416,
)
```
### Inference Example
```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path="best_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess image
img = Image.open("test.jpg").resize((416, 416))
input_data = np.array(img, dtype=np.uint8).reshape(1, 416, 416, 3)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
```
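If the output tensor is quantized (int8/uint8), dequantize it with the tensor's scale and zero point before decoding boxes; a short sketch using the details fetched above:

```python
# Dequantize raw INT8 output back to float. A scale of 0 means the
# tensor is already float and needs no conversion.
scale, zero_point = output_details[0]['quantization']
if scale:
    output = (output.astype(np.float32) - zero_point) * scale
```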
## Contributing
Contributions are welcome! Areas for improvement:
- Additional distillation techniques (attention transfer, etc.)
- QAT implementation
- More lightweight architectures
- Deployment examples for different platforms
## License
This notebook follows the Ultralytics AGPL-3.0 License.
## Acknowledgments
- Ultralytics for the YOLO11 framework
- GhostNet authors for the efficient convolution design
- Google Colab for compute resources
**Note:** This notebook is designed to run on Google Colab with a GPU runtime. Adjust paths and configurations as needed for local environments.
Last Updated: January 2026
Version: v11
Compatibility: Ultralytics 8.0+