- Tony Wang

# How to think about deep learning inference

If you are reading this, chances are you know what a neural network is. You are also probably familiar with the term "inference", which means executing a neural network with pretrained weights to get a prediction on an input example. I'm also going to assume that you care about the efficiency of this operation, which can typically be measured on three axes: latency, throughput, and accuracy.

**Latency**: how long it takes, on average, to process a single input example. For example, if your neural network is running on a drone for obstacle detection, then latency pretty much limits how fast your drone can fly, since the drone keeps moving while each frame is being processed.

**Throughput**: how many input examples you can process per second. Say you are running an NLP model and need to process X documents. Then you will need to pay for X / throughput seconds of computation on whatever hardware you're running on, whether you count that cost in dollars or in energy. You can typically trade latency for throughput: if you can live with longer latency, you can generally get better utilization of your hardware, for example by batching your inputs. I will perhaps talk about this tradeoff curve in a later post.
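To make the X / throughput arithmetic and the batching tradeoff concrete, here is a minimal sketch. The function names, the hourly price, and the linear "fixed overhead plus per-example time" batch model are all assumptions for illustration; real hardware rarely scales this cleanly.

```python
def compute_cost_dollars(num_docs: int, throughput_per_s: float,
                         price_per_hour: float) -> float:
    """Cost of processing num_docs: (X / throughput) seconds, billed hourly."""
    seconds = num_docs / throughput_per_s
    return seconds / 3600.0 * price_per_hour


def batch_latency_and_throughput(batch_size: int, fixed_overhead_s: float,
                                 per_example_s: float) -> tuple[float, float]:
    """Toy model (an assumption): one batch takes overhead + batch_size * per_example.

    Every example in the batch waits for the whole batch to finish, so the
    batch time is also the latency each example sees.
    """
    batch_time_s = fixed_overhead_s + batch_size * per_example_s
    return batch_time_s, batch_size / batch_time_s


# Hypothetical numbers: 10 ms fixed overhead per batch, 1 ms per example.
for batch_size in (1, 8, 32):
    latency_s, throughput = batch_latency_and_throughput(batch_size, 0.010, 0.001)
    print(f"batch={batch_size:3d}  latency={latency_s * 1000:6.1f} ms  "
          f"throughput={throughput:7.1f} examples/s")
```

Under this toy model, larger batches amortize the fixed overhead, so throughput rises while every example waits longer.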

**Accuracy**: how accurate your final model is. I'm assuming here that you're allowed to change your model through popular model compression tricks like quantization and pruning. These tricks can typically shrink the model substantially (thus lowering latency and increasing throughput) at the cost of reducing the neural network's performance on its target task. Sometimes this trade makes sense, especially if the model isn't that accurate to begin with.
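As a rough illustration of where the accuracy loss comes from, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The weight values are made up, and real toolchains use more elaborate schemes (per-channel scales, calibration data); treat this as a toy, not any particular library's method.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.45, -0.61]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now fits in 8 bits instead of the 32 bits of a float32, but small weights such as 0.003 round to zero. Errors like that are what can erode accuracy on the target task.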

So the best way to think about your deep learning inference problem is as constrained optimization over a surface in 3D space, where the axes are latency, throughput, and accuracy. The surface represents the Pareto-optimal frontier achievable by current hardware, software, and compression methods. There might be additional constraints, such as hard latency cutoffs or throughput targets. You might be optimizing for a specific objective (highest accuracy given latency/throughput constraints), or a combination (e.g. the latency-throughput product).
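Here is a minimal sketch of that framing: given a handful of measured configurations, keep only the Pareto-optimal ones, then pick the most accurate one that satisfies a latency cap and a throughput floor. The configuration names and all the numbers are hypothetical, invented purely for illustration.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str
    latency_ms: float   # lower is better
    throughput: float   # examples/s, higher is better
    accuracy: float     # higher is better


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on every axis and better on at least one."""
    no_worse = (a.latency_ms <= b.latency_ms and a.throughput >= b.throughput
                and a.accuracy >= b.accuracy)
    better = (a.latency_ms < b.latency_ms or a.throughput > b.throughput
              or a.accuracy > b.accuracy)
    return no_worse and better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def best_under_constraints(configs: list[Config], max_latency_ms: float,
                           min_throughput: float) -> Optional[Config]:
    """Highest-accuracy frontier point meeting both constraints, or None."""
    feasible = [c for c in pareto_frontier(configs)
                if c.latency_ms <= max_latency_ms and c.throughput >= min_throughput]
    return max(feasible, key=lambda c: c.accuracy, default=None)


# Hypothetical measurements.
candidates = [
    Config("fp32_batch1", 20.0, 50.0, 0.92),
    Config("fp32_batch32", 120.0, 260.0, 0.92),
    Config("int8_batch1", 8.0, 125.0, 0.90),
    Config("int8_batch32", 45.0, 700.0, 0.90),
    Config("pruned_slow", 25.0, 40.0, 0.88),  # dominated by fp32_batch1
]

frontier = pareto_frontier(candidates)
choice = best_under_constraints(candidates, max_latency_ms=30.0, min_throughput=40.0)
print([c.name for c in frontier], choice)
```

Swapping the `key` in `best_under_constraints` for a different objective, or tightening the constraints, moves you to a different point on the same frontier.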