Developing low-power, high-throughput artificial intelligence chips for edge devices and real-time inference systems
Synopsis
Artificial Intelligence (AI) chips are a class of hardware accelerators specifically designed for real-time inference of AI algorithms. Digital AI chips are custom processors designed for efficient execution of Deep Neural Network (DNN) workloads. AI chips differ from general-purpose CPUs/GPUs in that their compute-engine architecture, including memory access and the organization of processing elements, is optimized to minimize power and latency. Their architecture is driven by their target workloads. Developers continually seek smarter architectures that can do more with less, optimizing area, power, and latency. Developing new architectures and circuits for the evolving demands of AI workloads is a major challenge; it is an actively explored area and is central to the functioning of AI chips. Possible architectures include: (1) conventional architectures used in CPUs/GPUs with optimizations in the access mechanisms, (2) architectures similar to FMCW radar that provide on-chip memory and high fan-in, which helps accelerate specific workloads such as DNNs, and (3) processing-in-memory architectures that improve energy efficiency by moving compute closer to memory (Lin et al., 2021; Cheng et al., 2024; Omidsajedi et al., 2024).
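As a rough illustration of the trade-off behind option (3), the sketch below models energy per multiply-accumulate for a conventional datapath that streams operands from off-chip memory versus a processing-in-memory style datapath that keeps operands local. The per-access energy figures are placeholder assumptions for illustration, not measured silicon numbers.

```python
# Back-of-the-envelope energy model: conventional datapath (operands fetched
# from off-chip DRAM per MAC) vs. processing-in-memory style datapath
# (operands resident in or next to the memory array).
# All energy figures (picojoules) are illustrative placeholders.

ENERGY_PJ = {
    "dram_access": 640.0,   # assumed off-chip access cost per operand
    "local_access": 5.0,    # assumed on-chip/local access cost per operand
    "mac_op": 1.0,          # assumed cost of one multiply-accumulate
}

def energy_per_mac(dram_fetches: int, local_fetches: int) -> float:
    """Energy (pJ) for one MAC, given where its operands are fetched from."""
    return (dram_fetches * ENERGY_PJ["dram_access"]
            + local_fetches * ENERGY_PJ["local_access"]
            + ENERGY_PJ["mac_op"])

# Conventional: both operands stream from DRAM for every MAC.
conventional = energy_per_mac(dram_fetches=2, local_fetches=0)

# PIM-style: weights stay in the memory array, activations in a local buffer.
pim_style = energy_per_mac(dram_fetches=0, local_fetches=2)

print(f"conventional: {conventional:7.1f} pJ/MAC")
print(f"PIM-style:    {pim_style:7.1f} pJ/MAC (~{conventional / pim_style:.0f}x lower)")
```

Under these assumed figures the dominant cost is data movement rather than arithmetic, which is why moving compute closer to memory improves energy efficiency.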
AI chips target a growing number of applications, each with its own workloads, which require the ability to execute diverse algorithms of differing complexity. Real-time systems that need continuous inference are critical in this era of automated processes and performance-driven machines. However, state-of-the-art AI chips, architectures, and tools struggle to reconcile the conflicting demands that the number-crunching character of inference workloads places on latency/throughput with the need for flexibility. A microarchitecture analysis of continuous inference is performed to identify workloads that stress various microarchitecture components of RAIC, including conditions that render it inefficient. This identifies opportunities to improve area-efficient AI chips. Various techniques based on hardware-centric software changes and accelerator-specific optimizations are presented to improve throughput and latency, and their efficacy is demonstrated on RAIC. With continued miniaturization, edge devices will be unable to extract sufficient electrical power from conventional battery chemistries to support growing storage, communication, and processing demands. Design choices must minimize energy consumption per operation, communication costs, and fabrication time, which will likely lead to proposals based on new material systems (Venkataramani et al., 2021; Shuvo et al., 2022; Santoso & Surya, 2024).
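One concrete form such a microarchitecture analysis can take is a roofline-style operational-intensity check that flags which DNN layers are memory-bound versus compute-bound on a given accelerator. The sketch below uses hypothetical peak-throughput and bandwidth figures and a simplified FLOP/byte count for a fully connected layer; it is illustrative only and does not model RAIC specifically.

```python
# Roofline-style classification of a fully connected (matrix-vector) layer as
# memory-bound or compute-bound on a hypothetical edge accelerator.
# Peak figures are assumptions for illustration, not vendor specifications.

PEAK_GFLOPS = 500.0      # assumed peak compute throughput (GFLOP/s)
PEAK_GBPS = 25.0         # assumed peak memory bandwidth (GB/s)

def fc_layer_profile(in_features: int, out_features: int, bytes_per_elem: int = 1):
    """FLOPs and bytes moved for one quantized matrix-vector product."""
    flops = 2 * in_features * out_features               # one MAC = 2 FLOPs
    bytes_moved = bytes_per_elem * (in_features * out_features  # weights
                                    + in_features               # input vector
                                    + out_features)             # output vector
    return flops, bytes_moved

def classify(flops: float, bytes_moved: float) -> str:
    intensity = flops / bytes_moved                      # FLOP per byte
    ridge = PEAK_GFLOPS / PEAK_GBPS                      # machine balance point
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    attainable = min(PEAK_GFLOPS, intensity * PEAK_GBPS)
    return (f"intensity={intensity:.2f} FLOP/B, ridge={ridge:.1f} FLOP/B, "
            f"{bound}, ~{attainable:.0f} GFLOP/s attainable")

for shape in [(1024, 1024), (128, 10)]:
    print(shape, "->", classify(*fc_layer_profile(*shape)))
```

Layers whose attainable throughput sits well below the compute peak point to memory-access bottlenecks, which is where hardware-centric software changes (tiling, operand reuse, quantization) tend to pay off most.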