
This article demonstrates how to explore data with R. Exploring the data before building a predictive model is important because it reveals the structure of the dataset, such as the number of continuous and categorical variables and the number of observations (rows).

**Dataset**

A snapshot of the dataset used in this tutorial is shown below. It has five variables: Q1, Q2, Q3, Q4 and Age. The variables Q1-Q4 represent survey responses to a questionnaire, and each response lies between 1 and 6. The variable Age represents the age group of the respondent and lies between 1 and 3: 1 represents Generation Z, 2 represents Generation X and Y, and 3 represents Baby Boomers.

Figure: Sample Data

**Import data into R**

The **read.csv()** function is used to import a CSV file into R. The argument header = TRUE tells R that the first row of the file contains the column names.

mydata <- read.csv("C:/Users/Deepanshu/Documents/Book1.csv", header=TRUE)
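To illustrate what header = TRUE changes, here is a minimal sketch. It writes a tiny CSV to a temporary file and reads it back both ways. The file and its three rows are made up for this example and are not the tutorial's Book1.csv.

```r
# Hypothetical example data written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Q1,Q2,Age", "3,5,1", "6,2,3"), tmp)

# header = TRUE: the first line supplies the column names
with_header <- read.csv(tmp, header = TRUE)
names(with_header)     # "Q1" "Q2" "Age"
nrow(with_header)      # 2 data rows

# header = FALSE: the first line is read as data and
# the columns get default names V1, V2, V3
without_header <- read.csv(tmp, header = FALSE)
names(without_header)  # "V1" "V2" "V3"
nrow(without_header)   # 3 rows, including the header line

# Remove the temporary file
unlink(tmp)
```

If the column names show up as the first data row after an import, a wrong header setting is a likely cause.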

**1. Calculate basic descriptive statistics**

summary(mydata)

Figure: Data Exploration with R

To calculate the summary of a particular column, say the third column, you can use the following syntax:

summary(mydata[3])

To calculate the summary of a particular column by its name, you can use the following syntax:

summary(mydata$Q1)
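Because Age is stored as the codes 1 to 3, summary() treats it as a number and reports quartiles and a mean. The sketch below is one possible way to get counts per age group instead, by converting the column to a factor. The data is simulated, the seed is arbitrary, and the labels follow the Age coding given in this tutorial.

```r
# Simulated data in the same shape as the tutorial's dataset (values are random)
set.seed(1)  # arbitrary seed, only to make the example repeatable
mydata <- data.frame(Q1  = sample(1:6, 15, replace = TRUE),
                     Age = sample(1:3, 15, replace = TRUE))

# Numeric summary: minimum, quartiles, median, mean, maximum
summary(mydata$Age)

# Treat Age as categorical, using the coding described in this tutorial
age_group <- factor(mydata$Age, levels = 1:3,
                    labels = c("Generation Z", "Generation X and Y", "Baby Boomers"))

# For a factor, summary() reports the number of rows in each level
summary(age_group)
```

The original mydata is left unchanged here. Whether to overwrite the Age column with the factor depends on what the later modelling step expects.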

**2. List the names of variables in a dataset**

names(mydata)

**3. Calculate the number of rows in a dataset**

nrow(mydata)

**4. Calculate the number of columns in a dataset**

ncol(mydata)

**5. List the structure of a dataset**

str(mydata)
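The row and column counts can also be obtained in a single call with dim(), and the class of each column can be listed with sapply(). The following is a small sketch on simulated data, not the tutorial's CSV.

```r
# Simulated data in the same shape as the tutorial's dataset (values are random)
mydata <- data.frame(Q1  = sample(1:6, 15, replace = TRUE),
                     Q2  = sample(1:6, 15, replace = TRUE),
                     Age = sample(1:3, 15, replace = TRUE))

# dim() returns the number of rows followed by the number of columns
dims <- dim(mydata)
dims

# The same two numbers from nrow() and ncol()
c(nrow(mydata), ncol(mydata))

# Class of each column, a compact complement to str()
col_classes <- sapply(mydata, class)
col_classes
```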

**6. See the first 6 rows of a dataset**

head(mydata)

**7. First n rows of a dataset**

In the code below, we select the first 5 rows of the dataset.

head(mydata, n=5)

**8. All rows but the last row**

head(mydata, n= -1)

**9. Last 6 rows of a dataset**

tail(mydata)

**10. Last n rows of a dataset**

In the code below, we select the last 5 rows of the dataset.

tail(mydata, n=5)

**11. All rows but the first row**

tail(mydata, n= -1)
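A negative n is easier to follow on a tiny example. The sketch below uses a made-up data frame with an id column from 1 to 10, so each result can be checked by eye. The names df and id are illustrative only.

```r
# Small hypothetical data frame with 10 rows
df <- data.frame(id = 1:10)

first_three    <- head(df, n = 3)$id    # ids 1 to 3
all_but_last   <- head(df, n = -1)$id   # ids 1 to 9: drops the last row
all_but_last_3 <- head(df, n = -3)$id   # ids 1 to 7: drops the last 3 rows

last_three      <- tail(df, n = 3)$id   # ids 8 to 10
all_but_first   <- tail(df, n = -1)$id  # ids 2 to 10: drops the first row
all_but_first_3 <- tail(df, n = -3)$id  # ids 4 to 10: drops the first 3 rows
```

In short, a positive n says how many rows to keep, and a negative n says how many rows to drop from the opposite end.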

**12. Number of missing values**

The code below returns the number of missing values in each variable of a dataset.

colSums(is.na(mydata))

**13. Number of missing values in a single variable**

sum(is.na(mydata$Q1))
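The two counts above are shown here on a small hand-made data frame, with NA values placed deliberately so the results can be verified. The proportion of missing values and the number of complete rows are extras that are not part of the original steps.

```r
# Hypothetical data with a few missing values inserted by hand
mydata <- data.frame(Q1  = c(1, NA, 3, 4, NA),
                     Q2  = c(2, 5, NA, 6, 1),
                     Age = c(1, 2, 3, 1, 2))

# Missing values per column: Q1 = 2, Q2 = 1, Age = 0
na_per_column <- colSums(is.na(mydata))

# Missing values in a single column
na_in_q1 <- sum(is.na(mydata$Q1))

# Proportion of missing values per column: Q1 = 0.4, Q2 = 0.2, Age = 0
na_share <- colMeans(is.na(mydata))

# Number of rows with no missing value in any column (rows 1 and 4)
rows_complete <- sum(complete.cases(mydata))
```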

To create the data used in this tutorial, use the following command:

mydata <- data.frame(Q1 = sample(1:6, 15, replace = TRUE),
                     Q2 = sample(1:6, 15, replace = TRUE),
                     Q3 = sample(1:6, 15, replace = TRUE),
                     Q4 = sample(1:6, 15, replace = TRUE),
                     Age = sample(1:3, 15, replace = TRUE))
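Because sample() draws values at random, every run of the command above produces a different dataset, and none of them will match the snapshot exactly. If repeatable output matters, one option is to fix the random seed first. In the sketch below, the helper make_data() and the seed value 123 are hypothetical choices for illustration.

```r
# Hypothetical helper: builds the simulated survey data for a given seed
make_data <- function(seed) {
  set.seed(seed)  # fixing the seed makes sample() repeatable
  data.frame(Q1  = sample(1:6, 15, replace = TRUE),
             Q2  = sample(1:6, 15, replace = TRUE),
             Q3  = sample(1:6, 15, replace = TRUE),
             Q4  = sample(1:6, 15, replace = TRUE),
             Age = sample(1:3, 15, replace = TRUE))
}

mydata <- make_data(123)   # 123 is an arbitrary seed
again  <- make_data(123)

# The same seed gives the same data frame
identical(mydata, again)
```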