
Lab: Improving Data Quality

Overview

Machine learning models can only consume numeric data, and categorical values are typically encoded as 1s and 0s. Data is said to be messy or untidy if it has missing attribute values, contains noise or outliers, has duplicate or incorrect records, has inconsistently cased column names, or is otherwise not ready for ingestion by a machine learning algorithm.

In this lab, you will identify and resolve some of the most common issues of untidy data. Note that other data quality problems require different methods, which are beyond the scope of this notebook.
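
Before fixing anything, it can help to measure how untidy a dataset is. The following is a minimal sketch, assuming pandas is available (as it is in the TensorFlow Enterprise notebook image). The DataFrame and its column names are hypothetical toy data, not the lab's dataset.

```python
import pandas as pd

# Hypothetical toy data: one missing date, one duplicated row,
# and column names with inconsistent casing.
df = pd.DataFrame(
    {
        "Date": ["2019-01-01", "2019-01-02", "2019-01-02", None],
        "make": ["Ford", "Tesla", "Tesla", "Ford"],
        "Vehicles": [10, 12, 12, 9],
    }
)

# Count missing values in each column.
missing_per_column = df.isnull().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# List column names that are not already lower case.
mixed_case_columns = [c for c in df.columns if c != c.lower()]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("columns to normalize:", mixed_case_columns)
```

Checks like these only report problems; how to resolve each one depends on the dataset and the model.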


Objectives

In this lab, you will learn how to:

  • Resolve missing values.

  • Convert the Date feature column to a datetime format.

  • Rename a feature column and remove a value from a feature column.

  • Create one-hot encoding features.

  • Understand temporal feature conversions.
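
The objectives above can be sketched in a few lines of pandas. This is an illustrative sketch under stated assumptions, not the notebook's solution: the toy data, the column names (`Fuel`, `Vehicles`), the `<missing>` placeholder, and the choice of mode and mean as fill values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the lab's dataset).
df = pd.DataFrame(
    {
        "Date": ["2019-01-01", "2019-01-02", None, "2019-01-02"],
        "Fuel": ["Gasoline", "Electric", "Gasoline", "<missing>"],
        "Vehicles": [10.0, np.nan, 8.0, 6.0],
    }
)

# Resolve missing values: most frequent value for Date, mean for Vehicles.
df["Date"] = df["Date"].fillna(df["Date"].mode()[0])
df["Vehicles"] = df["Vehicles"].fillna(df["Vehicles"].mean())

# Convert the Date column from strings to a datetime type.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# Rename a feature column, then drop rows holding a placeholder value.
df = df.rename(columns={"Fuel": "fuel_type"})
df = df[df["fuel_type"] != "<missing>"].reset_index(drop=True)

# One-hot encode the categorical column as 0/1 integer columns.
df = pd.get_dummies(df, columns=["fuel_type"], dtype=int)

# Temporal features: split the date into parts, then map month and day
# onto a circle so that, for example, December and January end up close.
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["day"] = df["Date"].dt.day
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
df["day_sin"] = np.sin(2 * np.pi * df["day"] / 31)
df["day_cos"] = np.cos(2 * np.pi * df["day"] / 31)

print(df.head())
```

The sine and cosine pair is one common way to encode cyclical features; whether it helps depends on the model and the data.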


Task 1. Set up your environment

Enable the Vertex AI API

  1. In the Google Cloud Console, on the Navigation menu, click Vertex AI > Dashboard.

  2. Click ENABLE ALL RECOMMENDED API.


Task 2. Launch Vertex AI Notebooks instance

  1. In the Google Cloud Console, on the Navigation menu, click Vertex AI > Workbench. Select User-Managed Notebooks.

  2. On the Notebook instances page, click New Notebook > TensorFlow Enterprise > TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.

  3. In the New notebook instance dialog, confirm the name of the deep learning VM. If you don’t want to change the region and zone, leave all settings as they are, and then click Create. The new VM will take 2-3 minutes to start.

  4. Click Open JupyterLab.
    A JupyterLab window will open in a new tab.

  5. When the “Build recommended” pop-up appears, click Build. If you see a message that the build failed, ignore it.



Task 3. Clone course repo within your Vertex AI Notebooks instance

To clone the training-data-analyst notebook in your JupyterLab instance:

  1. In JupyterLab, to open a new terminal, click the Terminal icon.

  2. At the command-line prompt, run the following command:


    git clone https://github.com/GoogleCloudPlatform/training-data-analyst


  3. To confirm that you have cloned the repository, double-click on the training-data-analyst directory and ensure that you can see its contents.
    The files for all the Jupyter notebook-based labs throughout this course are available in this directory.



Task 4. Improve data quality

  1. In the notebook interface, navigate to training-data-analyst > courses > machine_learning > deepdive2 > launching_into_ml > labs, and open improve_data_quality.ipynb.

  2. In the notebook interface, click Edit > Clear All Outputs.

  3. Carefully read through the notebook instructions and complete the code on each line marked with #TODO.


Note: Tips

  • To run the current cell, click the cell and press SHIFT+ENTER. Other cell commands are listed in the notebook UI under Run.
  • Hints may also be provided for the tasks to guide you along. Highlight the text to read the hints (they are in white text).
  • If you need more help, look at the complete solution by navigating to training-data-analyst > courses > machine_learning > deepdive2 > launching_into_ml > solutions, and open improve_data_quality.ipynb.