Improve data quality
In this course, you will learn that there are two phases in machine learning: a training phase and an inference phase. You will see that an ML problem can be thought of as being all about data.
In any ML project, after you define the best use case and establish the success criteria, the process of delivering an ML model to production involves the following steps. These steps can be completed manually or can be completed by an automated pipeline.
The first three steps deal with data, and they are where we can assess its quality. In data extraction, you retrieve data from various sources. Those sources can stream in real time or arrive in batches. For example, you may extract data from a Customer Relationship Management system, or CRM, to analyze customer behavior. This data may be structured, in a given format such as CSV, text, JSON, or XML. Or you may have unstructured source data, such as images of your customers or text comments from chat sessions with your customers. Or you may have to extract streaming data from your company's transportation vehicles, which are equipped with sensors that transmit data in real time. Other examples of unstructured data include books and journals, documents, metadata, health records, and audio and video.

In data analysis, you analyze the data you've extracted. For example, you can use Exploratory Data Analysis, or EDA, which involves using graphics and basic sample statistics to get a feeling for what information might be obtainable from your dataset. You look at various aspects of the data, such as outliers or anomalies, trends, and data distributions, all while attempting to identify the features that can increase the predictive power of your machine learning model.
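A first EDA pass like the one described above can be sketched with pandas. The dataset and column names here are hypothetical, and the outlier threshold is an illustrative choice, not a rule:

```python
import pandas as pd

# Hypothetical vehicle data, loosely modeled on the dataset discussed in this course.
df = pd.DataFrame({
    "model_year": [2004, 2005, 2006, 2006, 1998],
    "vehicles": [10, 12, 11, 13, 400],  # 400 looks suspicious
})

# Basic sample statistics: count, mean, std, min, quartiles, max per column.
print(df.describe())

# A simple outlier check: flag values far from the mean in standard-deviation
# units. With only five rows, a low threshold of 1.5 is used for illustration.
z = (df["vehicles"] - df["vehicles"].mean()) / df["vehicles"].std()
outliers = df[z.abs() > 1.5]
```

In a real project you would follow this with plots of the distributions and trends mentioned above; `describe` and a z-score are only a starting point.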
After you've extracted and analyzed your data, the next step in the process is data preparation. Data preparation includes data transformation, which is the process of changing or converting the format, structure, or values of the data you've extracted. There are many ways to prepare or transform data for a machine learning model. For example, you may need to perform data cleansing, removing superfluous and repeated records from log data. You may need to alter data types, where a feature was mistyped and you need to convert it. Or you may need to convert categorical data to numerical data. Most ML models require categorical data to be in numerical format, but some models work with either numeric or categorical features, while others can handle mixed-type features.

As a first step toward determining their data quality levels, organizations typically perform data asset inventories in which the relative accuracy, uniqueness, and validity of the data is measured. The established baseline ratings for datasets can then be compared against the data and systems on an ongoing basis to help identify new data quality issues so they can be resolved. Let's look at the attributes related to data quality.
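Two of the transformations just mentioned, removing repeated records and fixing a mistyped feature, can be sketched in pandas. The log data and column names here are made up for illustration:

```python
import pandas as pd

# Hypothetical log data with a repeated record and a mistyped numeric column.
logs = pd.DataFrame({
    "user_id": ["a1", "a1", "b2"],
    "page_views": ["3", "3", "7"],  # stored as strings, should be integers
})

# Data cleansing: remove superfluous, repeated records.
logs = logs.drop_duplicates().reset_index(drop=True)

# Type conversion: the page_views feature was mistyped and must be converted.
logs["page_views"] = logs["page_views"].astype(int)
```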
- Data accuracy relates to whether the data values stored for an object are the correct values. To be correct, a data value must be the right value and must be represented in a consistent and unambiguous form. For example, does the data match up with the real-world object or event it describes, enabling correct conclusions to be drawn from it? Data consistency asks whether the same information is represented the same way across multiple datasets. Timeliness can be measured as the time between when information is expected and when it is readily available for use. For example, how long is the delay between the real-world event and its capture in the data?
- Data completeness relates to whether all the intended data in the dataset is present, or whether any of it is missing. Let's look at some examples. What are ways to improve data quality? You can resolve missing values; convert date/time features to a date/time format if they are not in that format already; parse date/time features into temporal features that give you more insight into your data; remove unwanted values from a feature column; and convert categorical feature columns to one-hot encodings. Missing values can skew your data. In this example, date, zip_code, and model_year have two rows with missing values, while fuel, make, light_duty, and vehicles show three rows with missing values. We can pick one column, date, and see that it has a NaN, or not a number, at index row three. We can confirm this by checking that row, which also shows true, indicating a missing value. We can run code to show us all the unique and missing values for our features; date, model_year, and several other features have missing values. This is an example of messy, or untidy, data. Another example of untidy data is a feature loaded in the wrong format. In this example, date is listed as a non-null object. Date should not be an object and should be converted to a date/time data type. We can convert the date feature to the date/time data type with the to_datetime function in pandas. We should also consider parsing the date feature into three distinct feature columns: year, month, and day. This would allow you to look at the seasonality of your data, to spot trends, and to perform time-series-related predictions.
- Another data quality issue is unwanted characters in a column. Here, the intent of the less-than sign is valid: the researcher wants to indicate models earlier than 2006. But we cannot leave the less-than sign in our feature column. There are many ways to deal with this; we could create year buckets, for example. In this case, we can simply remove the less-than sign, given that the number of vehicles with this model year is small. Removing the rows instead would mean losing data. The strategic decisions you will need to make regarding how to handle this type of problem are beyond the scope of this introduction, but please see our readings for more resources.
- Another aspect to consider when improving data quality is examining your categorical feature columns. In this example, we highlight the column Light_duty. Light_duty refers to the vehicle type, which can either be light duty or not: yes or no. The Light_duty feature holds the yes and no string values that make it a categorical column. How do we deal with this type of feature? We cannot just put words into a machine learning model, can we? You will typically deal with categorical features by employing a process called one-hot encoding. In one-hot encoding, you convert the categorical data into a vector of numbers. You do this because machine learning algorithms can't work with categorical data directly. Instead, you generate one Boolean column for each category or class. For each sample, only one of these columns takes the value one, which explains the term one-hot encoding. In this example, taken from the California Housing Dataset used in our Keras lab, the feature called Ocean_Proximity contains text values. Because we can't work with these strings directly, a Boolean column is generated for each category class, holding either a one or a zero. A row will never be all zeroes; exactly one column holds the value one for each row. There are many ways to convert categorical features. Our lab will show you one way, but there are also other methods.
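The cleaning steps walked through in the bullets above can be sketched end to end in pandas. The small DataFrame below is an invented stand-in for the vehicle dataset; the column names (date, model_year, light_duty) mirror the lecture, but the values and the choice to drop incomplete rows are illustrative assumptions:

```python
import pandas as pd

# Hypothetical slice of the vehicle dataset; values are made up.
df = pd.DataFrame({
    "date": ["2021-01-15", None, "2021-07-03"],
    "model_year": ["2007", "2008", "<2006"],
    "light_duty": ["Yes", "Yes", "No"],
})

# 1. Resolve missing values: here we simply drop incomplete rows.
df = df.dropna().reset_index(drop=True)

# 2. Convert the date feature to a datetime dtype with pandas to_datetime.
df["date"] = pd.to_datetime(df["date"])

# 3. Parse it into distinct year/month/day features to expose seasonality.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

# 4. Remove the unwanted "<" character so model_year can become numeric.
df["model_year"] = df["model_year"].str.replace("<", "", regex=False).astype(int)

# 5. One-hot encode the categorical light_duty feature: one Boolean column
#    per class, with exactly one "hot" column per row.
df = pd.get_dummies(df, columns=["light_duty"])
```

Dropping rows with missing values is only one strategy; as noted above, imputation or bucketing may be better choices depending on how much data you can afford to lose.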
Improving data quality can be done before and after data exploration. It is not uncommon to load the data and begin some type of descriptive analysis. We can explore and clean data iteratively, as you will see in the lab; the process does not have to be sequential. Many times it helps to improve the quality of our data before we can really explore it. The importance of data quality cannot be overemphasized.
Recall that machine learning is a way to use standard algorithms to derive predictive insights from data and make repeated decisions. Thus, in machine learning, your original source data will typically be split into a training set, a validation set, and a test set. The quality of your source data will influence the predictive value of your model.
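The training/validation/test split just described can be sketched with pandas. The 80/10/10 ratio, the column names, and the shuffle seed below are illustrative assumptions; in practice you might use a dedicated utility such as scikit-learn's train_test_split:

```python
import pandas as pd

# Hypothetical dataset of 100 labeled examples.
df = pd.DataFrame({"feature": range(100), "label": [i % 2 for i in range(100)]})

# Shuffle once so the split is random, then slice into three disjoint sets.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
train = shuffled.iloc[:80]        # 80% for training
validation = shuffled.iloc[80:90]  # 10% for validation
test = shuffled.iloc[90:]          # 10% for final testing
```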