Question 203
A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The model has custom TensorFlow operations that must run partially on a CPU. You want to reduce the training time in a cost-effective manner. What should you do?
- A. Change the VM type to n2-highmem-32.
- B. Change the VM type to e2-standard-32.
- C. Train the model using a VM with a GPU hardware accelerator.
- D. Train the model using a VM with a TPU hardware accelerator.
https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
CPUs are suited to the following workloads:
- Quick prototyping that requires maximum flexibility
- Simple models that don't take long to train
- Small models with small, effective batch sizes
- Models that contain many custom TensorFlow operations written in C++
- Models that are limited by available I/O or the networking bandwidth of the host system
GPUs are suited to the following workloads:
- Models with a significant number of custom TensorFlow/PyTorch/JAX operations that must run at least partially on CPUs
- Models with TensorFlow ops that are not available on Cloud TPU
- Medium-to-large models with larger effective batch sizes
TPUs are suited to the following workloads:
- Models dominated by matrix computations
- Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
- Models that train for weeks or months
- Large models with large effective batch sizes
- Models with ultra-large embeddings common in advanced ranking and recommendation workloads
Cloud TPUs are not suited to the following workloads:
- Linear algebra programs that require frequent branching or contain many element-wise algebra operations
- Workloads that require high-precision arithmetic
- Neural network workloads that contain custom operations in the main training loop
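Per the criteria above, the scenario maps to option C: GPUs are recommended for models with a significant number of custom TensorFlow operations that must run at least partially on CPUs, TPUs explicitly exclude custom operations in the main training loop, and switching to another CPU-only machine type (A or B) would not meaningfully shorten the two-day training run. The sketch below illustrates the resulting device placement in TensorFlow on a GPU-equipped VM; the `custom_cpu_op` helper, the toy model, and the shapes are hypothetical stand-ins, not part of the question.

```python
import tensorflow as tf

# Hypothetical stand-in for a custom TensorFlow operation that must run on the CPU.
# tf.py_function always executes its Python body on the host, so it serves as a
# convenient proxy for a CPU-bound custom op in this sketch.
def custom_cpu_op(x):
    return tf.py_function(func=lambda t: t * 2.0, inp=[x], Tout=tf.float32)

# Small toy model; its dense matrix math is what benefits from the GPU.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()
gpu_available = bool(tf.config.list_physical_devices("GPU"))

@tf.function
def train_step(x, y):
    # Pin the custom operation to the CPU explicitly.
    with tf.device("/CPU:0"):
        x = custom_cpu_op(x)
        x.set_shape([None, 64])  # py_function drops static shape info; restore it
    # Run the forward/backward pass on the GPU when the VM has one attached.
    with tf.device("/GPU:0" if gpu_available else "/CPU:0"):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Toy usage with random data standing in for the real training set.
print(train_step(tf.random.normal([32, 64]), tf.random.normal([32, 1])).numpy())
```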