Fully Quantized Vision Transformers: Integer-Only Architecture for Object Detection
The French Alternative Energies and Atomic Energy Commission (CEA) is a leading research and innovation organization active in three major fields: energy, information and health technologies, and defense. Within the CEA, the Laboratory for Systems and Technology Integration (LIST), located in Saclay (Ile-de-France), drives technology transfer and innovation in embedded systems. The Embedded Artificial Intelligence Laboratory (LIAE) specializes in designing optimized AI solutions for embedded environments, with a focus on efficiency in surface, power consumption, and computing performance.
Vision Transformers (ViTs) have recently demonstrated superior performance over Convolutional Neural Networks (CNNs) in a wide range of computer vision tasks. Their strong representational capacity, however, comes at the price of significantly higher computational complexity and memory requirements. This poses a major challenge for deploying ViT-based models on resource-constrained devices such as mobile or embedded systems. To address this issue, our lab is actively exploring quantization techniques, which enable efficient inference by constraining all operations, including matrix multiplications and activation functions, to integer arithmetic. This line of research has already led to promising results in semantic segmentation with the development of I-Segmenter [1], a fully quantized ViT framework that achieves competitive accuracy while drastically reducing inference cost.
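To make the integer-only constraint concrete, the sketch below quantizes the inputs and weights of a single linear layer with symmetric per-tensor scales and performs the matrix multiplication entirely on integers, folding the two scales into one output rescaling. It is a minimal illustration under assumed settings (8-bit symmetric quantization, illustrative helper names), not the I-Segmenter implementation.

```python
# Minimal sketch of integer-only inference for one linear layer.
# Assumptions: 8-bit symmetric per-tensor quantization; helper names are
# illustrative and do not come from I-Segmenter.
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    """Map a float tensor to signed integers with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int32)
    return q, scale

x = torch.randn(4, 64)                              # activations
w = torch.randn(32, 64)                             # weights
y_float = x @ w.T                                   # float reference

q_x, s_x = quantize_symmetric(x)
q_w, s_w = quantize_symmetric(w)
# Integer accumulation (int64 here for simplicity; int32 accumulators on HW).
acc = q_x.to(torch.int64) @ q_w.to(torch.int64).T
# Both per-tensor scales fold into a single rescaling of the integer output.
y_quant = acc.to(torch.float32) * (s_x * s_w)

print("max abs error vs. float:", (y_float - y_quant).abs().max().item())
```

In a fully integer-only pipeline, the final rescaling itself is typically replaced by fixed-point multiplies and bit shifts, so that no floating-point operation remains at inference time.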
Internship Objective
The objective of this internship is to extend our ongoing work on ViTs to object detection. The project will first focus on quantizing a detection head, such as that of RT-DETR [2,3], to ensure its compatibility with an integer-only ViT encoder. The next step will be to integrate the quantized segmentation and detection heads into a unified integer-only multi-task ViT capable of handling both tasks simultaneously. Finally, the proposed system will be evaluated on PyTorch and TVM backends, with accuracy measured by mean Average Precision (mAP) and efficiency measured by throughput (FPS), parameter count, and computational cost (MACs).
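As a rough illustration of the efficiency side of this evaluation, the snippet below measures parameter count and throughput (FPS) for a placeholder PyTorch model. The architecture, input resolution, and iteration counts are stand-ins rather than the actual quantized multi-task ViT; mAP and MACs would come from dedicated tooling (e.g., COCO-style evaluation and a MAC profiler).

```python
# Illustrative measurement of parameter count and throughput (FPS).
# The model below is a placeholder, not the quantized multi-task ViT.
import time
import torch
import torch.nn as nn

model = nn.Sequential(                    # stand-in for the detector
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 91),                    # e.g. COCO-style class logits
).eval()

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.3f} M")

x = torch.randn(1, 3, 640, 640)           # one image per forward pass
with torch.no_grad():
    for _ in range(5):                    # warm-up iterations
        model(x)
    n_iters = 50
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    elapsed = time.perf_counter() - start

print(f"throughput: {n_iters / elapsed:.1f} FPS")
```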
Expected Outcomes
* A proof-of-concept implementation of a fully quantized multi-task ViT for segmentation and detection.
* Comparative benchmarks against mixed-precision or floating-point baselines.
* Results with potential for inclusion in a scientific publication.
References:
[1] J. Sassoon, M. Szczepanski, and M. Poreba. I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation. 2025 (under review).
[2] Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen. DETRs Beat YOLOs on Real-Time Object Detection. CVPR 2024.
[3] W. Lv, Y. Zhao, Q. Chang, K. Huang, G. Wang, and Y. Liu. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv preprint, 2024.
#AI #IA #edgeAI #edgeIA #ViT #CNN #Vision #VisionTransformer #detection #objectdetection #RT-DETR #quantization #integer #optimisation
Requested profile: Master's degree (BAC+5) in Computer Science, Applied Mathematics, or a related field.
Qualifications: Strong knowledge of deep learning (CNNs, ViTs) and PyTorch. Good programming skills in Python (C++ is a plus). Interest in efficient AI, embedded systems, or model quantization.
To apply: resume + cover letter + marks and ranking for the three previous academic years