Lab Intro: Improve quality of your data

This lab focuses on improving data quality. Recall that machine learning models can only consume numeric data, and that categorical values are commonly encoded as ones and zeroes. Data is said to be messy or untidy if it has missing attribute values, noise or outliers, duplicates, incorrect values, or inconsistently cased column names, and is therefore not ready for ingestion by a machine learning algorithm.
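As a rough illustration of the symptoms listed above, the following sketch checks a small table for missing values, duplicate rows, and mixed-case column names. It assumes pandas is available; the DataFrame, its column names, and its values are invented for this example and are not the lab's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame(
    {
        "Date": ["2019-01-01", "2019-01-01", None, "2019-03-05"],
        "make": ["Ford", "Ford", "Kia", "Kia"],
        "Vehicles": [100.0, 100.0, np.nan, 250.0],
    }
)

# Count of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that are exact copies of an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Column names that are not fully lower case.
mixed_case_columns = [c for c in df.columns if c != c.lower()]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("mixed-case columns:", mixed_case_columns)
```

Checks like these only surface the problems; deciding how to repair each one depends on the dataset.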

This lab presents and solves some of the most common issues of untidy data. Note that different problems require different methods, and those other methods are beyond the scope of this lab. First, you resolve the missing values. Next, you convert the date feature column to a date-time format. You then rename a feature column and remove a value from a feature column. After that, you create one-hot encodings. Lastly, you see examples of temporal feature conversions.
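The steps outlined above can be sketched end to end with pandas. This is a minimal illustration under stated assumptions, not the lab's actual solution: the toy data, the column names (`Date`, `Fuel`, `Vehicles`, `vehicle_count`), the placeholder value `"Unknown"`, and the choice of median and mode for filling gaps are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame(
    {
        "Date": ["2019-01-15", "2019-02-20", None, "2019-03-05", "2019-04-10"],
        "Fuel": ["Gasoline", "Diesel", "Gasoline", "Unknown", None],
        "Vehicles": [100.0, None, 30.0, 250.0, 40.0],
    }
)

# 1. Resolve missing values (assumed strategy: median for numeric,
#    most frequent value for categorical, drop rows with no date).
df["Vehicles"] = df["Vehicles"].fillna(df["Vehicles"].median())
df["Fuel"] = df["Fuel"].fillna(df["Fuel"].mode()[0])
df = df.dropna(subset=["Date"]).reset_index(drop=True)

# 2. Convert the date column from strings to a date-time type.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# 3. Rename a feature column.
df = df.rename(columns={"Vehicles": "vehicle_count"})

# 4. Remove an unwanted value from a feature column by dropping its rows.
df = df[df["Fuel"] != "Unknown"].reset_index(drop=True)

# 5. Derive simple temporal features from the date-time column.
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["day"] = df["Date"].dt.day

# 6. One-hot encode the categorical column as ones and zeroes.
df = pd.get_dummies(df, columns=["Fuel"], dtype=int)

print(df)
```

The temporal features are extracted before one-hot encoding only for readability; the two steps are independent, so the order could be swapped.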