This article demonstrates how to explore data with R. Exploring the data before building a predictive model is an important step: it reveals the structure of the dataset, such as the number of continuous and categorical variables and the number of observations (rows).
A snapshot of the dataset used in this tutorial is shown below. It has five variables: Q1, Q2, Q3, Q4 and Age. The variables Q1-Q4 represent responses to a survey questionnaire, each ranging from 1 to 6. The variable Age represents the age group of the respondent and ranges from 1 to 3: 1 represents Generation Z, 2 represents Generation X and Y, and 3 represents Baby Boomers.
[Image: Sample Data]
mydata <- read.csv("C:/Users/Deepanshu/Documents/Book1.csv", header=TRUE)
You can also create sample data to use with the data exploration techniques demonstrated below. The program below generates 100 random observations for each variable by sampling with replacement.
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)
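Because sample() draws random values, every run of the program above produces different data. The following is a minimal sketch of one way to make the sample data reproducible. The helper name make_survey and the seed value 123 are assumptions chosen for illustration, not part of the original tutorial.

```r
# Hypothetical helper: builds the survey data after fixing the random seed,
# so the same seed always yields the same data frame.
make_survey <- function(n_obs = 100, seed = 123) {
  set.seed(seed)  # the seed value itself is arbitrary
  data.frame(
    Q1  = sample(1:6, n_obs, replace = TRUE),
    Q2  = sample(1:6, n_obs, replace = TRUE),
    Q3  = sample(1:6, n_obs, replace = TRUE),
    Q4  = sample(1:6, n_obs, replace = TRUE),
    Age = sample(1:3, n_obs, replace = TRUE)
  )
}

mydata <- make_survey()

# Calling the helper again with the same seed reproduces the data exactly.
same_again <- identical(mydata, make_survey())
print(same_again)
```

Any integer works as a seed; what matters is that the same value is used each time the results need to match.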
summary(mydata)
To summarize a particular column by position, say the third column, use the following syntax:
summary(mydata[3])
To summarize a particular column by name, use the following syntax:
summary(mydata$Q1)
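The two forms above return different kinds of objects, which matters when you want to reuse a statistic rather than just print it. The sketch below compares them on the third column; the seed and the variable names by_position and by_name are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

# Single-bracket indexing keeps a one-column data frame,
# so summary() returns a formatted table of text labels.
by_position <- summary(mydata[3])

# $ extracts the column as a vector,
# so summary() returns named numeric values.
by_name <- summary(mydata$Q3)

# Individual statistics can be pulled out of the vector form by name.
q3_mean <- by_name[["Mean"]]
print(q3_mean)
```

The vector form is usually more convenient when a value such as the mean or median feeds into a later calculation.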
names(mydata)
> names(mydata)
# [1] "Q1" "Q2" "Q3" "Q4" "Age"
nrow(mydata)
> nrow(mydata)
# [1] 100
ncol(mydata)
> ncol(mydata)
# [1] 5
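The row and column counts can also be retrieved in a single call with dim(). This is a small sketch on the same kind of sample data; the variable name dims is an assumption for illustration.

```r
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

# dim() returns an integer vector: rows first, then columns.
dims <- dim(mydata)
n_rows <- dims[1]
n_cols <- dims[2]
print(dims)
```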
str(mydata)
> str(mydata)
'data.frame': 100 obs. of 5 variables:
 $ Q1 : int 1 5 3 1 6 2 2 1 4 1 ...
 $ Q2 : int 3 3 3 1 1 4 2 2 6 1 ...
 $ Q3 : int 4 2 1 4 3 6 1 4 4 4 ...
 $ Q4 : int 3 5 1 1 3 4 2 2 5 1 ...
 $ Age: int 3 1 1 1 1 1 3 1 2 3 ...
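The structure shows Age stored as an integer, although it encodes age groups rather than a quantity. One possible way to make that explicit is to convert it to a labelled factor, sketched below. The column name AgeGroup is an assumption for illustration; the labels follow the coding described at the start of this article.

```r
set.seed(2)  # arbitrary seed, only to make the sketch repeatable
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

# Hypothetical new column: Age recoded as a factor with readable labels.
mydata$AgeGroup <- factor(
  mydata$Age,
  levels = 1:3,
  labels = c("Generation Z", "Generation X and Y", "Baby Boomers")
)

# For a factor, table() reports the number of respondents per group.
age_counts <- table(mydata$AgeGroup)
print(age_counts)
```

With the factor in place, summary(mydata) reports counts per age group for AgeGroup instead of a mean and quartiles.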
head(mydata)
  Q1 Q2 Q3 Q4 Age
1  1  3  4  3   3
2  5  3  2  5   1
3  3  3  1  1   1
4  1  1  4  1   1
5  6  1  3  3   1
6  2  4  6  4   1
The code below selects the first 5 rows of the dataset.
head(mydata, n=5)
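A negative n drops rows from the end instead. The code below returns all rows except the last one.
head(mydata, n = -1)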
tail(mydata)
The code below selects the last 5 rows of the dataset.
tail(mydata, n=5)
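A negative n drops rows from the start instead. The code below returns all rows except the first one.
tail(mydata, n = -1)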
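To see what a negative n does in head() and tail(), it can help to check the row counts and row names of the result. This is a minimal sketch; the variable names all_but_last and all_but_first are assumptions for illustration.

```r
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

# n = -1 in head() keeps everything except the last row.
all_but_last <- head(mydata, n = -1)

# n = -1 in tail() keeps everything except the first row.
all_but_first <- tail(mydata, n = -1)

print(nrow(all_but_last))            # one row fewer than mydata
print(rownames(all_but_first)[1])    # original row names are kept
```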
library(dplyr)
sample_n(mydata, 5)
If the dplyr package is not already installed, install it before running the above script with the command install.packages("dplyr").
library(dplyr)
sample_frac(mydata, 0.1)
In this case, it randomly selects 10% of the rows from the mydata data frame.
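If installing dplyr is not an option, the same kind of random row selection can be approximated in base R by sampling row indices. The sketch below is an alternative to the dplyr functions above, not a description of how they work internally. The seed and the variable names random_five and ten_percent are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

# Pick 5 distinct row indices at random, then subset the data frame.
random_five <- mydata[sample(nrow(mydata), 5), ]

# Pick 10% of the rows: compute the count first, then sample that many indices.
n_ten_percent <- round(0.1 * nrow(mydata))
ten_percent <- mydata[sample(nrow(mydata), n_ten_percent), ]

print(random_five)
```

sample() draws without replacement by default, so no row appears twice in either result.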
The function below returns the number of missing values in each variable of a dataset.
colSums(is.na(mydata))
It can also be written as:
sapply(mydata, function(y) sum(is.na(y)))
sum(is.na(mydata$Q1))
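The randomly generated sample data contains no missing values, so the commands above all return zero. To see them report something, a few values can be blanked out on a copy of the data, as in this sketch. The copy name with_gaps and the choice of which cells to set to NA are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed, only to make the sketch repeatable
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

# Work on a copy and introduce missing values in arbitrary cells.
with_gaps <- mydata
with_gaps$Q1[c(2, 10, 25)] <- NA
with_gaps$Q3[5] <- NA

# Missing values per variable.
na_counts <- colSums(is.na(with_gaps))

# Number of rows with no missing value in any variable.
complete_rows <- sum(complete.cases(with_gaps))

print(na_counts)
print(complete_rows)
```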