
How is EDA used in machine learning

How is EDA used in machine learning? As we mentioned, the exploratory data analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit it. For exploratory data analysis, the focus is on the data: its structure, its outliers, and the models it suggests. Although there are other methods, exploratory data analysis is typically performed using the following methods.

Univariate analysis is the simplest form of analyzing data. "Uni" means one, so in other words, you analyze only one variable at a time. Unlike regression, it doesn't deal with causes or relationships, and its major purpose is to describe: it takes the data, summarizes it, and finds patterns in it. In this example, you see two types of univariate data, categorical and continuous. With the categorical feature type, you can perform numerical EDA using Pandas' crosstab function, and you can perform visual EDA using Seaborn's countplot function. With the continuous feature type, you can perform numerical EDA using Pandas' describe function, and you can visualize boxplots, distribution plots, and kernel density estimation plots (KDE plots) in Python, using Matplotlib or Seaborn. There are many other EDA tools at your disposal, but covering them is beyond the scope of this lesson. In this univariate data example, there is just one feature, ocean proximity, with five categories. You can use Seaborn's countplot function to count the number of observations in each category. Our visualization is a simple bar chart.
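As a rough illustration of the univariate workflow just described, here is a minimal sketch. The `housing` DataFrame, its values, and the `median_income` column are made up for illustration and are not the dataset from the lesson; only the `ocean_proximity` feature name comes from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: one categorical and one continuous feature.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN",
                        "NEAR OCEAN", "ISLAND", "<1H OCEAN", "INLAND"],
    "median_income": [2.5, 8.3, 3.1, 5.6, 4.2, 2.7, 6.0, 1.9],
})

# Numerical EDA, categorical feature: frequency table via crosstab.
category_counts = pd.crosstab(index=housing["ocean_proximity"], columns="count")
print(category_counts)

# Numerical EDA, continuous feature: count, mean, std, quartiles, min, max.
income_summary = housing["median_income"].describe()
print(income_summary)

# Visual EDA: a countplot for the categorical feature,
# a boxplot for the continuous one.
fig, (ax_count, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(data=housing, x="ocean_proximity", ax=ax_count)
sns.boxplot(data=housing, x="median_income", ax=ax_box)
fig.tight_layout()
fig.savefig("univariate_eda.png")
```

The countplot draws one bar per category, which is the simple bar chart mentioned above; the output file name is arbitrary.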

Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis, and is used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y. We can analyze bivariate data and multivariate data in Python, using Matplotlib or Seaborn, and there are other tools as well. One of the most powerful features of Seaborn is the ability to easily build conditional plots. This lets us see what the data looks like when segmented by one or more variables. The easiest way to do this is through the factorplot function (renamed catplot in Seaborn 0.9), which is used to draw a categorical plot onto a FacetGrid. Seaborn's jointplot function draws a plot of two variables with bivariate and univariate graphs. The map method of a FacetGrid can draw a KDE, distribution, or boxplot chart on each facet. A common plot of bivariate data is the simple line plot. In this example, we use Seaborn's regplot function to visualize a linear relationship between two features. In this case, trip distance, our X label, and fare amount, our target, appear to have a linear relationship. Note that although the majority of the data tends to group together in a linear fashion, there are also outliers present.
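The regplot example can be sketched roughly as follows. The `trips` DataFrame and its values are invented for illustration (including one deliberate outlier) and are not the data from the lesson; the column names `trip_distance` and `fare_amount` are assumed from the feature names mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: fare grows roughly linearly with distance,
# except for the last row, which is a deliberate outlier.
trips = pd.DataFrame({
    "trip_distance": [0.5, 1.2, 2.0, 3.1, 4.5, 6.0, 8.2, 10.0, 1.0],
    "fare_amount":   [4.0, 6.5, 9.0, 12.5, 17.0, 22.0, 29.5, 35.0, 52.0],
})

# Numerical check: Pearson correlation between the two features.
correlation = trips["trip_distance"].corr(trips["fare_amount"])
print(f"Pearson correlation: {correlation:.2f}")

# Visual check: scatter plot with a fitted linear regression line.
fig, ax = plt.subplots(figsize=(6, 4))
sns.regplot(data=trips, x="trip_distance", y="fare_amount", ax=ax)
fig.tight_layout()
fig.savefig("bivariate_regplot.png")
```

In the saved plot, the outlier sits well above the fitted line, which is the kind of point the regplot view makes easy to spot.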