# YOLO11 GhostConv + Knowledge Distillation + Quantization

This notebook implements a complete model-optimization pipeline for YOLO11 targeting edge devices: a custom GhostConv architecture, knowledge distillation, and quantization.
## Table of Contents
- [Overview](#overview)
- [Notebook Structure](#notebook-structure)
- [System Requirements](#system-requirements)
- [Installation](#installation)
- [Usage Guide](#usage-guide)
- [Results](#results)
- [Technical Details](#technical-details)
- [Hyperparameters](#hyperparameters)
- [Troubleshooting](#troubleshooting)
- [References](#references)
## Overview
This notebook implements a 3-stage YOLO11 optimization pipeline:
1. **Custom Architecture (YOLO11n-GhostConv)**
   - Replace standard Conv layers with GhostConv to reduce parameters
   - Retain C3k2 and C2PSA blocks for feature extraction
   - Architecture tuned for a 5-class traffic dataset
2. **Knowledge Distillation (KD)**
   - Teacher model: YOLO11l (large)
   - Student model: YOLO11n-GhostConv (custom lightweight)
   - Techniques:
     - Feature-based distillation (MSE loss)
     - Logit-based distillation (KL divergence)
     - Temperature scaling (T=4.0)
     - Progressive KD with warmup epochs
3. **Quantization**
   - FP32 → INT8 quantization with TFLite
   - FP32 → FP16 quantization
   - Calibration dataset for INT8
   - Performance comparison: FP32 vs INT8 vs FP16
## Notebook Structure
### Section 1: Initialization
- Mount Google Drive
- Set up project directories
- Import Ultralytics modules (GhostConv, C3k2, C2PSA)
- Clone and install Ultralytics from source
### Section 2: Custom Architecture
- Define the YOLO11_TinyGhost architecture in YAML
- Backbone with GhostConv layers
- Head with a Detect layer for 5 classes
- Train the baseline model (50 epochs)
### Section 3: Knowledge Distillation
Class implementations:
- `KDConfig`: configuration for KD training
- `KnowledgeDistillationTrainer`: custom trainer inheriting from `DetectionTrainer`
  - Forward hooks to capture intermediate features
  - Feature distillation loss (normalized MSE)
  - Logit distillation loss (KL divergence with temperature)
  - Combined loss: `(1 - α - β) * L_hard + α * L_feature + β * L_logit`
Training strategy:
- Warmup phase (8 epochs): hard loss only
- After warmup: combine hard + KD losses (see the sketch below)
- KD layers: `["model.4", "model.6", "model.10"]` (P3, P4, PSA)
- Hyperparameters: α=0.3, β=0.2, T=4.0
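A minimal sketch of the progressive warmup schedule, assuming a hard cut-over at the warmup epoch (the function name is illustrative; the notebook's trainer may ramp the weights gradually instead):

```python
def kd_loss_weights(epoch: int, warmup_epochs: int = 8,
                    alpha: float = 0.3, beta: float = 0.2):
    """Return (hard, feature, logit) loss weights for the current epoch."""
    if epoch < warmup_epochs:
        # Warmup phase: train on the hard detection loss only.
        return 1.0, 0.0, 0.0
    # After warmup: blend the hard loss with both KD terms.
    return 1.0 - alpha - beta, alpha, beta
```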
### Section 4: Visualization
- Training metrics plotting (mAP, loss curves)
- F1 score tracking
- Precision/Recall curves
- Box/Class/DFL loss comparison
### Section 5: Fine-tuning
- Load the best KD checkpoint
- Fine-tune on a multi-view intersection dataset
- Freeze 3 backbone layers
- Low learning rate (1e-5) with a cosine scheduler (see the sketch below)
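A sketch of this step using standard Ultralytics `train()` arguments (`freeze`, `lr0`, `cos_lr`); the paths and epoch count are placeholders:

```python
from ultralytics import YOLO

# Load the best KD checkpoint (path is illustrative).
model = YOLO("path/to/kd_best.pt")

model.train(
    data="path/to/intersection_data.yaml",  # multi-view intersection dataset
    epochs=20,        # placeholder; pick per validation behavior
    freeze=3,         # freeze the first 3 backbone layers
    lr0=1e-5,         # low initial learning rate
    cos_lr=True,      # cosine LR scheduler
    device=0,
)
```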
### Section 6: Quantization
Export formats:
- INT8 TFLite (with calibration dataset)
- FP16 TFLite

Evaluation:
- Compare mAP50 and mAP50-95 across FP32, INT8, and FP16
- Image size: 416x416
## System Requirements
### Hardware
- GPU: CUDA-compatible (T4 or better recommended)
- RAM: 16 GB+
- Storage: 10 GB+ for datasets and models
### Software
- Python >= 3.8
- PyTorch >= 1.13
- CUDA >= 11.3
- Google Colab (recommended)
## Installation
### 1. Clone Ultralytics from source
```python
!git clone https://github.com/ultralytics/ultralytics
%cd ultralytics
!pip install -e .
```
### 2. Install dependencies
```bash
pip install torch torchvision
pip install matplotlib pandas
pip install opencv-python pillow
```
### 3. Dataset structure
```text
dataset/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
└── data.yaml
```
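A minimal `data.yaml` matching this layout; the class names below are placeholders for the dataset's 5 traffic classes:

```yaml
path: /content/drive/MyDrive/dataset/yolo_mtid_motor/dataset
train: images/train
val: images/val

nc: 5
names: [class0, class1, class2, class3, class4]  # replace with the real 5 class names
```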
## Usage Guide
### Step 1: Prepare Data
```python
PROJECT_DIR = "/content/drive/MyDrive/yolo_ghostblock"
DATASET_DIR = "/content/drive/MyDrive/dataset/yolo_mtid_motor/dataset"
```
### Step 2: Train Baseline GhostConv Model
```python
from ultralytics import YOLO

model = YOLO("yolo11_tinyghost.yaml")
model.train(
    data=f"{DATASET_DIR}/data.yaml",
    epochs=50,
    imgsz=640,
    device=0,
)
```
### Step 3: Knowledge Distillation
```python
# Load teacher and student
teacher = YOLO("path/to/teacher.pt")
student = YOLO("path/to/student.pt")

# Create KD trainer
TrainerClass = create_kd_trainer_class(
    teacher_model=teacher,
    kd_alpha=0.3,
    kd_beta=0.2,
    kd_temperature=4.0,
    kd_layers=["model.4", "model.6", "model.10"],
)

# Train with KD
trainer = TrainerClass(overrides={...})
trainer.train()
```
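`create_kd_trainer_class` is the notebook's own helper; the `overrides` dict takes standard Ultralytics training arguments. A hypothetical example, mirroring the hyperparameters listed later:

```python
# Hypothetical overrides; adjust to your dataset and hardware.
overrides = {
    "data": f"{DATASET_DIR}/data.yaml",
    "epochs": 40,
    "batch": 16,
    "imgsz": 640,
    "optimizer": "AdamW",
    "lr0": 5e-5,
    "cos_lr": True,
    "device": 0,
}
trainer = TrainerClass(overrides=overrides)
trainer.train()
```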
### Step 4: Quantization
```python
# Export INT8
model.export(
    format="tflite",
    int8=True,
    data=CALIB_YAML,
    imgsz=416,
)

# Evaluate quantized model
model_int8 = YOLO("best_int8.tflite")
metrics = model_int8.val(data=DATA_YAML, imgsz=416)
```
## Results
### Model Comparison
| Model | Parameters | Size | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLO11l (Teacher) | ~20M | ~40MB | 0.95+ | 0.80+ |
| YOLO11n-Ghost | ~2M | ~4MB | 0.92+ | 0.75+ |
| + KD | ~2M | ~4MB | 0.94+ | 0.78+ |
| + INT8 | ~2M | ~1MB | 0.93+ | 0.76+ |
### Quantization Impact
- FP32 → INT8: ~75% size reduction, ~1-2% mAP drop
- FP32 → FP16: ~50% size reduction, ~0.5% mAP drop
### Training Curves
- Box loss: converges after ~30 epochs
- mAP50: plateaus around epochs 35-40
- F1 score: 0.85-0.90 range
## Technical Details
### GhostConv Architecture
```yaml
backbone:
  - [-1, 1, GhostConv, [64, 3, 2]]
  - [-1, 1, GhostConv, [128, 3, 2]]
  - [-1, 1, C3k2, [256, False, 0.25]]
  # ...
```
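For intuition, a minimal PyTorch sketch of the ghost-module idea from GhostNet (a primary convolution produces half the output channels, then a cheap depthwise convolution generates the rest). Ultralytics ships its own `GhostConv`; this sketch is illustrative only:

```python
import torch
import torch.nn as nn

class GhostConvSketch(nn.Module):
    """Illustrative ghost module: half the output channels from a regular
    conv, the other half from a cheap depthwise conv over those features."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        # Primary convolution: produces the "intrinsic" feature maps.
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )
        # Cheap operation: 5x5 depthwise conv, far fewer FLOPs.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```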
### KD Loss Formula
```text
L_total   = (1 - α - β) * L_hard + α * L_feature + β * L_logit
L_feature = MSE(normalize(S_feat), normalize(T_feat))
L_logit   = KL(softmax(S_logit / T) || softmax(T_logit / T)) * T²
```
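The same losses in PyTorch, as a minimal sketch (tensor names and shapes are assumptions; the notebook's trainer applies the feature loss per hooked layer):

```python
import torch
import torch.nn.functional as F

def feature_kd_loss(s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
    # Normalized MSE: L2-normalize flattened feature maps before comparing.
    s = F.normalize(s_feat.flatten(1), dim=1)
    t = F.normalize(t_feat.flatten(1), dim=1)
    return F.mse_loss(s, t)

def logit_kd_loss(s_logit: torch.Tensor, t_logit: torch.Tensor,
                  temperature: float = 4.0) -> torch.Tensor:
    # Temperature-scaled KL divergence; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    log_p_s = F.log_softmax(s_logit / temperature, dim=-1)
    p_t = F.softmax(t_logit / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```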
### Quantization Config
- INT8: post-training quantization with calibration
- Calibration: 100-200 images from the training set
- Input: uint8 [0, 255] or float32 normalized
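To check which input type the exported model actually expects, the TFLite interpreter reports each tensor's dtype and (scale, zero_point); the file name here is illustrative:

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="best_int8.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
print("input :", inp["dtype"], inp["quantization"])   # (scale, zero_point)
print("output:", out["dtype"], out["quantization"])
```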
## Hyperparameters
### Training
- Epochs: 40-50
- Batch size: 16
- Image size: 640x640
- Learning rate: 5e-5 (baseline), 1e-5 (fine-tune)
- Optimizer: AdamW with cosine scheduler
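These settings map directly onto Ultralytics `train()` arguments; a sketch with a placeholder data path:

```python
from ultralytics import YOLO

model = YOLO("yolo11_tinyghost.yaml")
model.train(
    data="path/to/data.yaml",
    epochs=50,
    batch=16,
    imgsz=640,
    optimizer="AdamW",
    lr0=5e-5,      # baseline; use 1e-5 when fine-tuning
    cos_lr=True,   # cosine scheduler
)
```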
### Knowledge Distillation
- α (feature): 0.3
- β (logit): 0.2
- Temperature: 4.0
- Warmup epochs: 8
- KD layers: P3, P4, PSA output
### Quantization
- Format: TFLite
- Input size: 416x416 (edge deployment)
- Calibration samples: 100
## Troubleshooting
### Issue 1: CUDA Out of Memory
```python
# Reduce batch size
batch = 8
# Enable mixed precision
amp = True
```
### Issue 2: Feature Shape Mismatch in KD
- Check teacher and student architecture compatibility
- Verify KD layer names match between models
- Ensure input sizes are consistent
### Issue 3: INT8 Quantization Accuracy Drop
- Increase the number of calibration samples
- Use a representative calibration dataset (diverse conditions)
- Consider quantization-aware training (QAT)
## References
### Papers
- YOLO11 (Ultralytics)
- GhostNet: More Features from Cheap Operations (Han et al., 2020)
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (Jacob et al., 2018)
## Key Features
### Architecture Optimization
- GhostConv: Reduces FLOPs by ~50% compared to standard convolutions
- Lightweight backbone: Maintains accuracy while reducing parameters
- Flexible head: Supports multiple detection tasks
### Knowledge Distillation
- Multi-level distillation: Combines feature and logit knowledge transfer
- Temperature-scaled softmax: Smooths probability distributions
- Progressive training: Warmup phase for stable convergence
### Model Compression
- INT8 quantization: 4x memory reduction
- FP16 quantization: 2x memory reduction
- Edge-ready: Optimized for mobile/embedded deployment
## Best Practices
### Training
- Start with pre-trained weights when possible
- Use data augmentation (mosaic, mixup, etc.)
- Monitor validation metrics closely
- Apply early stopping (patience=10-15)
### Knowledge Distillation
- Ensure teacher model is well-trained (mAP > 90%)
- Match batch normalization statistics
- Use appropriate temperature (T=3-5 for object detection)
- Gradually increase KD loss weight
### Quantization
- Use diverse calibration dataset
- Test on representative test set
- Profile inference speed on target device
- Consider hybrid quantization (some layers FP32)
## Performance Metrics
### Speed Benchmarks
| Model | FP32 (ms) | FP16 (ms) | INT8 (ms) | Device |
|---|---|---|---|---|
| YOLO11l | 45 | 28 | N/A | T4 GPU |
| YOLO11n-Ghost | 12 | 8 | N/A | T4 GPU |
| INT8 TFLite | N/A | N/A | 25 | Edge TPU |
### Accuracy vs Efficiency
- YOLO11l: Highest accuracy, largest model
- YOLO11n-Ghost: Best accuracy/size trade-off
- + KD: Closes gap with teacher
- + INT8: Minimal accuracy loss, deployable
## Workflow Summary
```mermaid
graph LR
    A[YOLO11l Teacher] --> B[Design GhostConv Student]
    B --> C[Train Baseline]
    C --> D[Knowledge Distillation]
    D --> E[Fine-tune]
    E --> F[Quantize INT8/FP16]
    F --> G[Deploy to Edge]
```
## Deployment
### TFLite Conversion
```python
# Export to TFLite INT8
model.export(
    format="tflite",
    int8=True,
    data="calibration.yaml",
    imgsz=416,
)
```
### Inference Example
```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path="best_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess image
img = Image.open("test.jpg").resize((416, 416))
input_data = np.array(img, dtype=np.uint8).reshape(1, 416, 416, 3)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
```
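If the output tensor is quantized (int8/uint8), dequantize it with the tensor's scale and zero point before decoding boxes; a short sketch using the details fetched above:

```python
# Dequantize raw INT8 output back to float. A scale of 0 means the
# tensor is already float and needs no conversion.
scale, zero_point = output_details[0]['quantization']
if scale:
    output = (output.astype(np.float32) - zero_point) * scale
```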
## Contributing
Contributions are welcome! Areas for improvement:
- Additional distillation techniques (attention transfer, etc.)
- QAT implementation
- More lightweight architectures
- Deployment examples for different platforms
## License
This notebook follows the Ultralytics AGPL-3.0 License.
## Acknowledgments
- Ultralytics for the YOLO11 framework
- GhostNet authors for the efficient convolution design
- Google Colab for compute resources
**Note:** This notebook is designed to run on Google Colab with a GPU runtime. Adjust paths and configurations as needed for local environments.
Last Updated: January 2026
Version: v11
Compatibility: Ultralytics 8.0+