Data analysis and visualization

The purpose of exploratory data analysis (EDA) is to find insights that guide data cleaning, preparation, and transformation, so that the resulting data can ultimately be used in a machine learning algorithm. We use data analysis and data visualization at every step of the machine learning process: data exploration, data cleaning, model building, and presenting results. All of these steps will belong to one notebook. Let's have a look at some examples.

A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sampled data. In this example, we use seaborn's distplot function to plot a histogram of the feature median house value.

Another commonly used plot type is the simple scatter plot. Instead of points being joined by line segments, as in a line plot, here each point is drawn individually as a dot, circle, or other shape. A scatter plot is a graph in which the values of two variables are plotted against two axes, and the pattern of the resulting points reveals any correlation that may be present. In this example, we use matplotlib's pyplot module to plot a scatter plot. By plotting housing location latitude on the X axis and longitude on the Y axis, we can see that the resulting pattern traces the shape of the state of California.
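A minimal sketch of such a scatter plot with `matplotlib.pyplot` might look like the following. The coordinates are an assumption: they are random points drawn from a rough bounding box around California, so they will fill a rectangle and will not reproduce the state's outline the way the real housing locations do.

```python
import matplotlib.pyplot as plt
import numpy as np

# Random coordinates inside an approximate bounding box (assumption, not real data).
rng = np.random.default_rng(seed=0)
latitude = rng.uniform(32.5, 42.0, size=500)
longitude = rng.uniform(-124.4, -114.1, size=500)

fig, ax = plt.subplots(figsize=(6, 6))
# Same axis assignment as in the text: latitude on X, longitude on Y.
# Each observation is drawn as its own small, semi-transparent marker.
points = ax.scatter(latitude, longitude, s=8, alpha=0.5)
ax.set_xlabel("Latitude")
ax.set_ylabel("Longitude")
```

A low `alpha` value helps with dense data, since overlapping markers then show up as darker regions.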

In this example, we use seaborn's heatmap function to show correlations. A heatmap is a graphical representation of data that uses a system of color coding to represent different values. Here, it shows the correlation between every pair of features in the dataset. In this plot, the lighter the shade, the stronger the correlation. This is a quick and easy way to see which features may influence your target. If you think about it, a heatmap plots multiple variables at once, so it can be thought of as an example of multivariate graphical analysis, another area of exploratory data analysis.
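The correlation heatmap can be sketched roughly as follows. The data frame is synthetic and its column names are assumptions chosen for illustration; the target column is deliberately built from `median_income` so that one strong correlation appears. The default seaborn colormap is kept, in which higher values are drawn in lighter shades.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic features (assumption): the target depends on income, not on age.
rng = np.random.default_rng(seed=0)
n = 500
median_income = rng.normal(4.0, 1.5, size=n)
housing = pd.DataFrame({
    "median_income": median_income,
    "housing_median_age": rng.uniform(1, 52, size=n),
    "median_house_value": 40000 * median_income + rng.normal(0, 20000, size=n),
})

# Pairwise Pearson correlation between all numeric columns.
corr = housing.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes the correlation coefficient inside each cell.
sns.heatmap(corr, annot=True, fmt=".2f", ax=ax)

# Rank the features by the strength of their correlation with the target.
target = "median_house_value"
ranked = corr[target].drop(target).abs().sort_values(ascending=False)
print(ranked)
```

The `ranked` series gives the same information as one row of the heatmap in sorted form, which can be easier to read when a dataset has many features.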

To summarize, data analysis, the second step in the ML pipeline, is a crucial milestone and must be used to prepare the data before model training. The purposes of exploratory data analysis include gaining maximum insight into the dataset and its underlying structure, creating a list of outliers or other anomalies, and, most importantly, identifying the most influential features. There are many more ways to explore, analyze, and plot data, so make it a goal to expand your knowledge of them. Have fun.