By default, computations in a deep neural network are carried out with numbers represented in the 32-bit floating-point format (fp32). This format can represent a wide range of real values but requires 4 bytes per number, which can be a problem in memory-constrained environments such as embedded systems. 8-bit fixed point (int8) is a common format for deep neural network inference [1], offering substantial compression with little loss of accuracy [2]. Training a neural network in reduced precision, however, is far less common: during training, 8-bit fixed point suffers from its relatively small dynamic range, which causes a significant degradation in accuracy. To address this limitation, some authors [3, 4] proposed performing all computations of the training phase in an 8-bit floating-point format (fp8). They report that fp8 training yields networks matching the performance of full-precision training on a variety of tasks (language modelling, image classification). Yet, despite these promises, no library is publicly available for deep learning in 8 bits.

During this internship, the intern will:
- Produce a research bibliography on numerical formats for deep learning
- Develop Python deep learning modules simulating the behaviour of fp8 (see the illustrative sketch after the reference list)
- Run experiments on datasets and compare results with other numerical formats
- (optional) Implement fp8 modules in C++
- (optional) Measure energy consumption and cache miss/hit rates
- (optional) Extend the previous work to other unusual numerical formats

What comes with the offer:
- An office in Grenoble, France, a world-class nanotech hub, with high-level experts all around
- A unique quality of life, with quick access to the mountains: skiing, cycling, trail running, hiking and paragliding spots can be reached in less than 1 hour by car
- Subsidized lunches
- Employee benefits: cultural and sports events, a free-of-charge music room, subsidized activities, …

Start date is flexible: the internship may start during the second semester of the 2024-25 academic year.

References
[1] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations".
[2] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149, Feb. 2016.
[3] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, "Training Deep Neural Networks with 8-bit Floating Point Numbers".
[4] P. Micikevicius et al., "FP8 Formats for Deep Learning", arXiv:2209.05433, Sep. 2022.
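For illustration only, here is a minimal Python sketch of one way fp8 behaviour can be simulated: tensors are rounded to an 8-bit floating-point format and cast back to fp32, so that subsequent computations see fp8 precision ("fake quantization"). It assumes a recent PyTorch release (2.1 or later), which provides the float8_e4m3fn and float8_e5m2 dtypes; the function name fake_quantize_fp8 is hypothetical and not part of any existing library or of the internship deliverables.

```python
# Hedged sketch: simulate fp8 by rounding fp32 tensors to an 8-bit
# floating-point format and casting back ("fake quantization").
# Assumes PyTorch >= 2.1 for the torch.float8_e4m3fn / torch.float8_e5m2 dtypes.
import torch

def fake_quantize_fp8(x: torch.Tensor,
                      fp8_dtype: torch.dtype = torch.float8_e4m3fn) -> torch.Tensor:
    """Round x to the given fp8 format and return the result as fp32."""
    return x.to(fp8_dtype).to(torch.float32)

if __name__ == "__main__":
    x = torch.randn(4)
    print("fp32 :", x)
    print("E4M3 :", fake_quantize_fp8(x))                     # 4 exponent, 3 mantissa bits
    print("E5M2 :", fake_quantize_fp8(x, torch.float8_e5m2))  # wider dynamic range, fewer mantissa bits
```

The two formats illustrate the trade-off discussed in [4]: E4M3 keeps more mantissa bits for precision, while E5M2 trades precision for a wider dynamic range.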
Required education level: Bac+5 - Master 2
Languages: English (intermediate), French (intermediate)