Question 203

A TensorFlow machine learning model on Compute Engine virtual machines (n2-standard-32) takes two days to complete training. The model has custom TensorFlow operations that must run partially on a CPU. You want to reduce the training time in a cost-effective manner. What should you do?

  • A. Change the VM type to n2-highmem-32.
  • B. Change the VM type to e2-standard-32.
  • C. Train the model using a VM with a GPU hardware accelerator.
  • D. Train the model using a VM with a TPU hardware accelerator.

https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus

The correct answer is C. Per the guidance below, models with custom TensorFlow operations that must run at least partially on CPUs are a fit for GPUs, while Cloud TPUs are explicitly ruled out for custom operations in the main training loop. Adding a GPU accelerator also cuts training time far more than switching between CPU-only machine types (options A and B), at a reasonable added cost.
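As a minimal sketch of why a GPU VM accommodates this workload, assuming TensorFlow 2.x: individual ops can be pinned to the CPU with tf.device while the matrix-heavy work runs on the GPU. The custom_cpu_op below is a hypothetical stand-in for the question's custom operations, not code from the question.

```python
import tensorflow as tf

def custom_cpu_op(x):
    # Hypothetical stand-in for a custom op with no GPU kernel;
    # tf.py_function always executes its body on the host CPU.
    return tf.py_function(func=lambda t: t.numpy() * 2.0, inp=[x], Tout=tf.float32)

@tf.function
def train_step(x, w):
    with tf.device("/GPU:0"):
        y = tf.matmul(x, w)   # matrix math runs on the GPU
    with tf.device("/CPU:0"):
        y = custom_cpu_op(y)  # custom op pinned to the CPU
    return y
```

TensorFlow copies tensors between the two devices automatically, which is exactly the mixed CPU/GPU execution the GPU guidance below describes; Cloud TPU offers no equivalent escape hatch inside the training loop.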

CPUs are best suited for:

  • Quick prototyping that requires maximum flexibility
  • Simple models that don't take long to train
  • Small models with small, effective batch sizes
  • Models that contain many custom TensorFlow operations written in C++
  • Models that are limited by available I/O or the networking bandwidth of the host system

GPUs are best suited for:

  • Models with a significant number of custom TensorFlow/PyTorch/JAX operations that must run at least partially on CPUs
  • Models with TensorFlow ops that are not available on Cloud TPU (see the list of available TensorFlow ops)
  • Medium-to-large models with larger effective batch sizes

TPUs are best suited for:

  • Models dominated by matrix computations
  • Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
  • Models that train for weeks or months
  • Large models with large effective batch sizes
  • Models with ultra-large embeddings common in advanced ranking and recommendation workloads

Cloud TPUs are not suited to the following workloads:

  • Linear algebra programs that require frequent branching or contain many element-wise algebra operations
  • Workloads that require high-precision arithmetic
  • Neural network workloads that contain custom operations in the main training loop
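Given these guidelines, one practical step before launching the long run on a GPU VM is to confirm TensorFlow actually sees the accelerator. A short check, assuming a GPU build of TensorFlow 2.x is installed on the VM:

```python
import tensorflow as tf

# List the accelerators visible to TensorFlow on this VM.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible: {len(gpus)}")  # expect >= 1 on a GPU-attached VM

# Let ops without a GPU kernel (e.g. custom ops) fall back to the CPU
# instead of raising a device-placement error.
tf.config.set_soft_device_placement(True)
```

If the list comes back empty, training silently runs on the CPU and none of the expected speedup materializes.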