Technical presentation - 30 minutes (including Q&A)
Generative AI (GenAI) is rapidly reshaping daily life and entire industries, yet the growing computational demands of GenAI models often hinder cost-effective deployment. This talk presents an end-to-end solution for accelerating GenAI workloads on Arm® Neoverse™ by combining Arm®'s seamless software-level AI acceleration with KleidiAI's cutting-edge optimizations. Specifically, we have integrated KleidiAI's highly optimized 4-bit weight-only kernels with dynamic activation quantization directly into PyTorch, making them easily accessible as part of the official, widely available PyTorch distribution. To further simplify adoption, we have developed a new TorchAO quantizer API that serves as a standardized, easy-to-use way to quantize any PyTorch model, including large language models (LLMs) and other GenAI models. By coupling this integration with TorchChat for LLM serving, we empower developers to deploy resource-efficient, high-performance LLMs at scale. This comprehensive approach streamlines the 4-bit quantization process, leverages KleidiAI's advanced 4-bit matrix multiplication kernels, and delivers significant performance improvements, all while reducing computational costs on Arm® platforms. As a result of these optimizations, we achieve generation speeds of over 66 tokens per second on models such as Llama 2 (7B), compared to 12 tokens per second in their default non-quantized state. Given that human reading speed is around 5–7 tokens per second, our solution enables real-time, interactive AI applications that can efficiently serve multiple requests per second in large-scale deployments. This substantial performance boost makes running GenAI models on Arm not just viable, but highly competitive for cloud applications. Our solution is more than just a performance tweak: it is a strategic enabler for the next generation of Arm-based cloud services.
By reducing computational costs and energy consumption while boosting inference speed, we help unlock new opportunities for businesses to deploy cutting-edge AI solutions. This directly contributes to a more vibrant, competitive Arm® ecosystem.
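
For readers unfamiliar with the scheme named in the abstract, the idea behind 4-bit weight-only quantization with dynamic activation quantization can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not the KleidiAI kernels or the TorchAO quantizer API: the function names, the group size of 4, and the use of symmetric (zero-point-free) quantization for both weights and activations are all assumptions made for this sketch.

```python
import random

GROUP_SIZE = 4  # hypothetical value for illustration; real kernels use larger groups


def quantize_weights_int4(row, group_size=GROUP_SIZE):
    """Symmetric 4-bit groupwise quantization of one weight row (done offline).

    Assumes len(row) is a multiple of group_size.
    Returns integers in [-8, 7] plus one float scale per group.
    """
    q_values, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def quantize_activations_int8(x):
    """Dynamic 8-bit quantization: the scale is computed from x at run time."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in x], scale


def quantized_dot(x, q_w, w_scales, group_size=GROUP_SIZE):
    """Dot product with integer accumulation per group, rescaled to float."""
    q_x, x_scale = quantize_activations_int8(x)
    total = 0.0
    for g, w_scale in enumerate(w_scales):
        lo, hi = g * group_size, (g + 1) * group_size
        acc = sum(a * b for a, b in zip(q_x[lo:hi], q_w[lo:hi]))  # integer math
        total += acc * x_scale * w_scale
    return total


if __name__ == "__main__":
    random.seed(0)
    weights = [random.uniform(-1, 1) for _ in range(8)]
    activations = [random.uniform(-1, 1) for _ in range(8)]

    q_w, w_scales = quantize_weights_int4(weights)
    exact = sum(a * w for a, w in zip(activations, weights))
    approx = quantized_dot(activations, q_w, w_scales)
    print(f"float dot product:     {exact:.4f}")
    print(f"quantized dot product: {approx:.4f}")
```

The weights are quantized once, ahead of time, which is where the memory and bandwidth savings come from. The activations are quantized on the fly for each input, so no calibration data is needed. A production kernel performs the same per-group integer accumulation using packed 4-bit storage and vectorized instructions.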