An algorithm for the hardware adaptation of machine learning models
Efficient machine learning (ML) aims either to maximize the complexity of the ML problem that can be solved with a given compute budget or to minimize the compute budget needed to solve an ML problem of given complexity. The latter line of research includes the development of bespoke hardware architectures and of ML model compression techniques. At their intersection lies quantization, the use of low-precision numeric data types to reduce model storage requirements, increase operand bandwidth, and simplify datapath logic. The simplification of datapath logic, a common feature of application-specific integrated circuits (ASICs), constrains both operator signatures and the order in which operators can be composed; general-purpose processors, in contrast, impose little to no such constraint. Therefore, given an ML model developed for general-purpose hardware, quantizing it for deployment on a bespoke architecture requires inserting data-type-changing operators into the model so that every operand has a legal data type. In this work, I present an algorithm that, given a computational graph and the operator table of a target hardware architecture, inserts data-type-changing operators (e.g., cast, quantization) into the graph so that the output graph is guaranteed to be legal for (i.e., adapted to) the target architecture. Although the algorithm's asymptotic complexity is polynomial in the size of the graph, empirically observed properties of real-world graphs reduce it to linear, making the algorithm practical in real-world use cases. The algorithm can be integrated into ML model compression pipelines to automate quantization across models and hardware architectures. I demonstrate this claim by applying it to adapt both computer vision and language models to an energy-efficient, bespoke, heterogeneous architecture.
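To make the problem statement concrete, the following is a minimal Python sketch of the task, not the algorithm presented in this work. It assumes a graph given as a topologically ordered list of nodes, an operator table mapping each operator to its legal (input data types, output data type) signatures, and a hypothetical `cast` operator that is legal between any pair of data types. The names `Node`, `adapt`, and `op_table`, and the greedy choice of the signature with the fewest mismatched operands, are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str         # name of the value this node produces
    op: str           # operator name, looked up in the operator table
    inputs: list      # names of the values this node consumes
    dtype: str = ""   # output data type, assigned during adaptation


def adapt(graph, op_table, input_dtypes):
    """Insert casts so that every node matches a legal signature.

    graph        -- list of Node in topological order
    op_table     -- op name -> list of (input dtypes tuple, output dtype)
    input_dtypes -- name -> dtype for the graph's external inputs
    """
    dtype_of = dict(input_dtypes)
    adapted = []
    for node in graph:
        actual = tuple(dtype_of[i] for i in node.inputs)
        candidates = [s for s in op_table[node.op] if len(s[0]) == len(actual)]
        if not candidates:
            raise ValueError(f"no signature of matching arity for {node.op}")
        # Greedy assumption: pick the signature needing the fewest casts.
        sig_in, sig_out = min(
            candidates,
            key=lambda s: sum(a != b for a, b in zip(actual, s[0])),
        )
        new_inputs = []
        for src, have, want in zip(node.inputs, actual, sig_in):
            if have != want:
                cast_name = f"{src}_to_{want}"
                if cast_name not in dtype_of:  # reuse an existing cast
                    adapted.append(Node(cast_name, "cast", [src], want))
                    dtype_of[cast_name] = want
                src = cast_name
            new_inputs.append(src)
        adapted.append(Node(node.name, node.op, new_inputs, sig_out))
        dtype_of[node.name] = sig_out
    return adapted


# Hypothetical target: an int8 convolution accumulating in int32,
# followed by an activation that only accepts int8 operands.
op_table = {
    "conv": [(("int8", "int8"), "int32")],
    "relu": [(("int8",), "int8")],
}
graph = [Node("c", "conv", ["x", "w"]), Node("r", "relu", ["c"])]
adapted = adapt(graph, op_table, {"x": "float32", "w": "float32"})
for n in adapted:
    print(n.name, n.op, n.inputs, n.dtype)
```

In this toy case, casts are inserted on both convolution operands and between the convolution and the activation. The sketch makes one pass over the graph and decides each node locally, so it does not capture the choices that arise when a signature selected at one node affects which signatures remain cheap or legal downstream.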
