Question 70

You lead a data science team at a large international corporation. Most of the models your team trains are large-scale models using high-level TensorFlow APIs on AI Platform with GPUs. Your team usually takes a few weeks or months to iterate on a new version of a model. You were recently asked to review your team’s spending. How should you reduce your Google Cloud compute costs without impacting the model’s performance?

  • A. Use AI Platform to run distributed training jobs with checkpoints.
  • B. Use AI Platform to run distributed training jobs without checkpoints.
  • C. Migrate to training with Kubeflow on Google Kubernetes Engine, and use preemptible VMs with checkpoints.
  • D. Migrate to training with Kubeflow on Google Kubernetes Engine, and use preemptible VMs without checkpoints.
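The difference between options C and D comes down to what happens when a preemptible VM is reclaimed mid-training. Below is a minimal, framework-agnostic sketch of checkpoint-and-resume. It is not TensorFlow or Kubeflow code: `Preempted`, `save_checkpoint`, `load_checkpoint`, `train`, and `simulate` are hypothetical names, the "model" is a single number, and preemption is simulated with an exception. In a real TensorFlow job you would typically rely on TensorFlow's own checkpoint utilities (such as `tf.train.CheckpointManager`) writing to durable storage instead.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Hypothetical stand-in for the VM being reclaimed partway through training."""


def save_checkpoint(path, state):
    # Write to a temp file, then rename, so an interruption mid-write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, ckpt_every, preempt_after=None):
    """Run toy training; return (final_state, steps_executed_this_run)."""
    state = load_checkpoint(path)
    steps_run = 0
    while state["step"] < total_steps:
        # Simulated preemption: stop after `preempt_after` steps in this run.
        if preempt_after is not None and steps_run == preempt_after:
            raise Preempted(f"reclaimed at step {state['step']}")
        state["weight"] += 0.5  # toy stand-in for one optimizer update
        state["step"] += 1
        steps_run += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state, steps_run


def simulate(total_steps=100, ckpt_every=10, preempt_after=47):
    """Train once, get preempted, restart; report the work done after restart."""
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        try:
            train(ckpt, total_steps, ckpt_every, preempt_after)
        except Preempted:
            pass  # On a real cluster, the job would be rescheduled here.
        return train(ckpt, total_steps, ckpt_every)


if __name__ == "__main__":
    state, resumed_steps = simulate()
    print(state, resumed_steps)
```

Running it prints `{'step': 100, 'weight': 50.0} 60`: the restarted job resumes from the step-40 checkpoint, so only the 7 steps since that checkpoint are repeated. Calling `simulate(ckpt_every=1000)` effectively disables checkpointing in this toy, and the restart has to redo all 100 steps, which is the wasted compute that checkpoints avoid on preemptible capacity.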

References 

https://cloud.google.com/blog/products/ai-machine-learning/reduce-the-costs-of-ml-workflows-with-preemptible-vms-and-gpus?hl=en

https://www.kubeflow.org/docs/distributions/gke/pipelines/preemptible/

https://cloud.google.com/optimization/docs/guide/checkpointing