Question 32

You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?

  • A. Significantly increase the max_batch_size TensorFlow Serving parameter.
  • B. Switch to the tensorflow-model-server-universal version of TensorFlow Serving.
  • C. Significantly increase the max_enqueued_batches TensorFlow Serving parameter.
  • D. Recompile TensorFlow Serving using the source to support CPU-specific optimizations. Instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes.

