
Both missing values and outliers are a concern for machine learning models because they can distort the results; outliers in particular tend to pull the fit toward extreme values. For more information, and for functions you can use, read the beginner's guide to exploratory data analysis.
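As a minimal sketch of how one might check for both issues in R, the snippet below counts missing values per column and flags outliers with the common 1.5 * IQR rule. The data frame `housing_toy`, the `AreaIncome` column and the helper `iqr_outliers` are hypothetical stand-ins made up for illustration; only the `Price` column name comes from the case study.

```r
# Hypothetical toy data standing in for the housing data.
# Only the Price column name comes from the article; AreaIncome is made up.
housing_toy <- data.frame(
  AreaIncome = c(52000, 61000, NA, 58000, 64000, 57000, 60000, 59000),
  Price      = c(410000, 480000, 455000, 430000, 2500000, 445000, 470000, 460000)
)

# Count missing values in each column.
missing_counts <- colSums(is.na(housing_toy))

# Hypothetical helper: flag values outside the 1.5 * IQR fences.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Keep only the prices that fall outside the fences.
price_outliers <- housing_toy$Price[iqr_outliers(housing_toy$Price)]

print(missing_counts)
print(price_outliers)
```

The IQR rule is only one of several reasonable choices; whether a flagged value is a data error or a genuine extreme observation still needs a judgment call.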

Here we use a case study to help you understand the linear regression algorithm. In the case study below, we use USA housing data to predict the price. Let us look at the top six observations of the USA housing data.

Exploratory data analysis (EDA) is an approach to understanding and summarizing the main characteristics of a given dataset. Mostly, this involves slicing and dicing the data at different levels, and the results are often presented visually. Done correctly, it can reveal many aspects of the data that help you build better models.

Every dataset is different, so it is hard to list a fixed set of steps for data exploration. The key to a successful EDA is to keep asking the questions you believe will help solve the business problem, or to put forward hypotheses and then test them with appropriate statistical tests. In other words, try to figure out whether there is a statistically significant relationship between the target and the independent variables. What drives the target variable?

Below are a few things worth exploring from a statistical point of view:

1. Checking the distribution of the target variable – First, you should always try to understand the nature of your target variable. To do this, we draw a histogram with a density plot.

library(ggplot2)
ggplot(data = housing, aes(x = Price)) +
  geom_histogram(aes(y = after_stat(density)), fill = "orange") +  # histogram scaled to density
  geom_density()  # density curve overlaid on the histogram

The price variable follows a normal distribution, which is good from a linear regression perspective. If you are wondering why, don't worry: we cover that in the coming sections.

2. Analyzing summary statistics – Here, we simply create summary statistics for all the variables to understand the behavior of the independent variables. This also provides information about missing values or outliers, if any.
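The exploration steps described here (looking at the top six observations, creating summary statistics, and testing for a statistically significant relationship) could be sketched in R roughly as follows. The data are simulated, and `housing_toy` and `AreaIncome` are hypothetical names chosen for illustration; the Pearson correlation test is just one possible test, not necessarily the one used in the case study.

```r
set.seed(42)

# Simulated stand-in for the housing data (hypothetical values and columns).
n <- 200
area_income <- rnorm(n, mean = 68000, sd = 10000)
housing_toy <- data.frame(
  AreaIncome = area_income,
  Price      = 20 * area_income + rnorm(n, mean = 0, sd = 100000)
)

# Top six observations.
top_six <- head(housing_toy)
print(top_six)

# Summary statistics for every variable: min, quartiles, median, mean, max,
# plus a count of NA values for any column that has them.
print(summary(housing_toy))

# Pearson correlation test between one independent variable and the target.
ct <- cor.test(housing_toy$AreaIncome, housing_toy$Price)
print(ct$estimate)  # estimated correlation
print(ct$p.value)   # a small p-value suggests a significant linear relationship
```

With real data you would repeat the test for each independent variable, keeping in mind that a correlation test only captures linear association.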

In the above equation, the coefficient β_0 represents the intercept and each coefficient β_i represents the slope for the corresponding independent variable.
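To make the roles of the intercept and the slopes concrete, here is a small sketch using R's `lm()`. It assumes the equation has the usual form y = β_0 + β_1 x_1 + ... + β_n x_n. The data are generated without noise from known coefficients, so the fit should recover them; the data frame `toy` and the columns `AreaIncome` and `Rooms` are hypothetical.

```r
# Hypothetical noise-free data generated from
#   Price = 50000 + 12 * AreaIncome + 30000 * Rooms
area_income <- c(40000, 50000, 60000, 70000, 80000)
rooms       <- c(5, 7, 6, 8, 6)
toy <- data.frame(
  AreaIncome = area_income,
  Rooms      = rooms,
  Price      = 50000 + 12 * area_income + 30000 * rooms
)

# Fit a linear regression of Price on both independent variables.
fit <- lm(Price ~ AreaIncome + Rooms, data = toy)

# "(Intercept)" is the estimate of beta_0; the other entries are the slopes beta_i.
coefs <- coef(fit)
print(coefs)
```

Each slope is the expected change in `Price` for a one-unit increase in that variable while the other variables are held fixed; the intercept is the predicted `Price` when all independent variables are zero.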
