As deep learning models continue to deliver remarkable performance across a wide range of tasks, deploying them in resource-constrained environments such as mobile devices, edge systems, and autonomous platforms remains a challenge due to their high computational demands, memory usage, and power consumption. This has made it essential to explore strategies that can reduce these overheads while preserving performance. In this context, our project explores how techniques like quantization and knowledge distillation can be effectively used to optimize deep neural networks for deployment in such limited-resource settings.
Quantization reduces the precision of weights and activations (e.g., from 32-bit to 8-bit), cutting memory and compute costs significantly, but may introduce accuracy loss.
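To make the idea concrete, the sketch below shows one common form of this mapping: affine (asymmetric) quantization of a float tensor to 8-bit integers, followed by dequantization. It is an illustrative toy, not our benchmarked implementation; the function names and the uniform per-tensor scheme are assumptions.

```python
import torch

def quantize_tensor(x: torch.Tensor, num_bits: int = 8):
    """Affine (asymmetric) quantization of a float tensor to num_bits integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min().item(), x.max().item()
    # Scale maps the float range onto the integer range; zero_point aligns 0.0.
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(min(qmax, max(qmin, round(qmin - x_min / scale))))
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize_tensor(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Recover an approximate float tensor from its quantized representation."""
    return scale * (q.float() - zero_point)

x = torch.randn(4, 4)
q, scale, zp = quantize_tensor(x)
x_hat = dequantize_tensor(q, scale, zp)
print((x - x_hat).abs().max())  # rounding error is bounded by roughly scale / 2
```

The gap between `x` and `x_hat` is the source of the accuracy loss mentioned above: each value is rounded to one of 256 levels, trading precision for a 4x reduction in storage relative to 32-bit floats.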
Knowledge Distillation trains a smaller model (student) to mimic a larger one (teacher) using soft outputs, enabling compression with minimal performance drop, though extreme compression can hurt accuracy.
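A minimal sketch of the standard distillation objective is shown below: the student is trained on a weighted blend of a soft-target term (KL divergence against the temperature-softened teacher outputs) and the usual hard-label cross-entropy. The specific values of `temperature` and `alpha` are illustrative assumptions, not our tuned settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Blend of soft-target KL loss (teacher -> student) and hard-label cross-entropy."""
    # Soften both distributions with the temperature before comparing them.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The T^2 factor keeps the soft-target gradients on the same scale as the hard term.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

Higher temperatures expose more of the teacher's "dark knowledge" (the relative probabilities of incorrect classes), which is what lets a much smaller student approach the teacher's accuracy.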
In this work, we benchmark both techniques: we implement static post-training quantization (PTQ), both manually and with PyTorch's built-in tooling, and perform KD on GPU, and we analyze their accuracy and efficiency trade-offs for deployment in constrained environments. The code implementing our method is available at: GitHub Repository
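For reference, a PyTorch-based static PTQ pipeline typically follows the eager-mode prepare/calibrate/convert workflow sketched below. The `SmallNet` module, random calibration data, and backend choice are assumptions for illustration; our actual models and calibration sets are described later.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class SmallNet(nn.Module):
    """Toy float model wrapped with Quant/DeQuant stubs for eager-mode static PTQ."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = SmallNet().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")   # x86 backend; "qnnpack" targets ARM
prepared = tq.prepare(model)                        # insert observers for activation ranges

with torch.no_grad():
    for _ in range(32):                             # calibration pass with representative inputs
        prepared(torch.randn(8, 16))

quantized = tq.convert(prepared)                    # swap float modules for int8 kernels
print(quantized)
```

Because static PTQ fixes activation ranges from a calibration set rather than retraining, it is cheap to apply but sensitive to how representative that set is, which is one of the trade-offs our benchmarks examine.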