Question 203
A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The model has custom TensorFlow operations that must run partially on a CPU. You want to reduce the training time in a cost-effective manner. What should you do?
- A. Change the VM type to n2-highmem-32.
- B. Change the VM type to e2-standard-32.
- C. Train the model using a VM with a GPU hardware accelerator.
- D. Train the model using a VM with a TPU hardware accelerator.
https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
CPUs are suited to the following workloads:
- Quick prototyping that requires maximum flexibility
- Simple models that don't take long to train
- Small models with small, effective batch sizes
- Models that contain many custom TensorFlow operations written in C++
- Models that are limited by available I/O or the networking bandwidth of the host system
GPUs are suited to the following workloads:
- Models with a significant number of custom TensorFlow/PyTorch/JAX operations that must run at least partially on CPUs
- Models with TensorFlow ops that are not available on Cloud TPU
- Medium-to-large models with larger effective batch sizes
TPUs are suited to the following workloads:
- Models dominated by matrix computations
- Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
- Models that train for weeks or months
- Large models with large effective batch sizes
- Models with ultra-large embeddings common in advanced ranking and recommendation workloads
Cloud TPUs are not suited to the following workloads:
- Linear algebra programs that require frequent branching or contain many element-wise algebra operations
- Workloads that require high-precision arithmetic
- Neural network workloads that contain custom operations in the main training loop
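Per the criteria above, the scenario maps to option C: GPUs are recommended for models with a significant number of custom TensorFlow operations that must run at least partially on CPUs, TPUs explicitly exclude custom operations in the main training loop, and switching to another CPU-only machine type (A or B) would not meaningfully shorten the two-day training run. The sketch below illustrates the resulting device placement in TensorFlow on a GPU-equipped VM; the `custom_cpu_op` helper, the toy model, and the shapes are hypothetical stand-ins, not part of the question.

```python
import tensorflow as tf

# Hypothetical stand-in for a custom TensorFlow operation that must run on the CPU.
# tf.py_function always executes its Python body on the host, so it serves as a
# convenient proxy for a CPU-bound custom op in this sketch.
def custom_cpu_op(x):
    return tf.py_function(func=lambda t: t * 2.0, inp=[x], Tout=tf.float32)

# Small toy model; its dense matrix math is what benefits from the GPU.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()
gpu_available = bool(tf.config.list_physical_devices("GPU"))

@tf.function
def train_step(x, y):
    # Pin the custom operation to the CPU explicitly.
    with tf.device("/CPU:0"):
        x = custom_cpu_op(x)
        x.set_shape([None, 64])  # py_function drops static shape info; restore it
    # Run the forward/backward pass on the GPU when the VM has one attached.
    with tf.device("/GPU:0" if gpu_available else "/CPU:0"):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Toy usage with random data standing in for the real training set.
print(train_step(tf.random.normal([32, 64]), tf.random.normal([32, 1])).numpy())
```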