
Linaro Connect 2025

LIS25-116 Supercharging Generative AI: KleidiAI, PyTorch and Arm Neoverse.

Technical presentation - 30 minutes (including Q&A)

AI/Machine Learning

  • Wednesday, 14 May 11:30 - 11:55
  • Room: Session room 3 | Opala III

Generative AI (GenAI) is rapidly reshaping our daily lives and entire industries, yet its increasing computational demands often hinder cost-effective deployment. This talk presents an end-to-end solution for accelerating GenAI workloads on Arm® Neoverse™ by combining Arm®'s seamless software-level AI acceleration with KleidiAI's cutting-edge optimizations. Specifically, we have integrated KleidiAI's highly optimized 4-bit weight-only kernels with dynamic activation quantization directly into PyTorch, making them easily accessible as part of the official PyTorch distribution.

To further simplify adoption, we have developed a new TorchAO quantizer API that serves as a standardized and easy-to-use solution for quantizing any PyTorch model, including large language models (LLMs) and other GenAI models. By coupling this integration with TorchChat for LLM serving, we empower developers to deploy resource-efficient, high-performance LLMs at scale. This comprehensive approach streamlines the 4-bit quantization process, leverages KleidiAI's advanced 4-bit matrix multiplication kernels, and delivers significant performance improvements, all while reducing computational costs on Arm® platforms. As a result of these optimizations, we achieve generation speeds of over 66 tokens per second on models such as Llama 2 (7B), compared to 12 tokens per second in their default non-quantized state. Given that human reading speed is around 5–7 tokens per second, our solution enables real-time, interactive AI applications that can efficiently serve multiple requests per second in large-scale deployments.

This substantial performance boost makes running GenAI models on Arm not just viable, but highly competitive for cloud applications. Our solution is more than just a performance tweak; it is a strategic enabler for the next generation of Arm-based cloud services. By reducing computational costs and energy consumption while boosting inference speed, we help unlock new opportunities for businesses to deploy cutting-edge AI solutions. This directly contributes to a more vibrant, competitive Arm® ecosystem.
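As a rough illustration of the workflow described above, the minimal sketch below applies 4-bit weight-only quantization with int8 dynamic activation quantization to a PyTorch model via the publicly available torchao quantize_ API. The specific quantizer API and KleidiAI-backed configuration presented in the talk may differ, and the toy two-layer model is only a stand-in for an LLM.

    # Minimal sketch (not necessarily the talk's exact API): 4-bit weight-only
    # quantization with int8 dynamic activation quantization using torchao.
    import torch
    from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

    # A toy linear stack stands in for an LLM; any torch.nn.Module works.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 11008),
        torch.nn.Linear(11008, 4096),
    )

    # Replace eligible nn.Linear layers in place with quantized equivalents:
    # packed int4 weights plus per-token int8 dynamic activation quantization.
    quantize_(model, int8_dynamic_activation_int4_weight())

    # Inference runs as usual; on Arm Neoverse with a KleidiAI-enabled PyTorch
    # build, the quantized matmuls can dispatch to the optimized kernels.
    with torch.inference_mode():
        out = model(torch.randn(1, 4096))
    print(out.shape)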

Download Slides | Google Slides


Presented by

Nikhil Gupta
Senior Software Engineer at Arm Ltd.
Nikhil Gupta is a Senior Software Engineer in the ML Infrastructure Software team, specializing in High-Performance Computing (HPC) and optimizing machine learning workloads for Arm® Neoverse™ servers...