1. Replacing / Recoding values
By 'recoding', it means replacing existing value(s) with the new value(s).
Create Dummy Data
mydata = data.frame(State = ifelse(sign(rnorm(25))==-1,'Delhi','Goa'), Q1= sample(1:25))In this example, we are replacing 1 with 6 in Q1 variable
mydata$Q1[mydata$Q1==1] <- 6In this example, we are replacing "Delhi" with "Mumbai" in State variable. We need to convert the variable from factor to character.
mydata$State = as.character(mydata$State)In this example, we are replacing 2 and 3 with NA values in whole dataset.
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
mydata[mydata == 2 | mydata == 3] <- NAAnother method
You have to first install the car package.
# Install the car packageRecoding a given range
install.packages("car")
# Load the car package
library("car")
# Recode 1 to 6
mydata$Q1 <- recode(mydata$Q1, "1=6")
# Recoding 1 through 4 to 0 and 5 and 6 to 1
mydata$Q1 <- recode(mydata$Q1, "1:4=0; 5:6=1")
The lo keyword tells recode to start the range at the lowest value.
The hi keyword tells recode to end the range at the highest value.
# Recoding lowest value through 4 to 0 and 5 to highest value to 1
mydata$Q1 <- recode(mydata$Q1, "lo:4=0; 5:hi=1")
You can specify else condition in the recode statement. It means how to treat remaining values that was not already recoded.
# Recoding lowest value through 4 to 0, 5 and 6 to 1, remaining values to 3,
mydata$Q1 <- recode(mydata$Q1, "lo:4=0; 5:6=1;else = 3")
# Create a new column called Ques1Note : Make sure you have installed and loaded "car" package before running the above syntax.
mydata$Ques1<- recode(mydata$Q1, "1:4=0; 5:6=1")
How to use IF ELSE Statement
Sample Data
samples = data.frame(x =c(rep(1:10)), y=letters[1:10])If a value of variable x is greater than 6, create a new variable called t1 and write 2 against the corresponding values else make it 1.
samples$t1 = ifelse(samples$x>6,2,1)How to use AND Condition
samples$t3 = ifelse(samples$x>1 & samples$y=="b" ,2,1)How to use NESTED IF ELSE Statement
samples$t4 = ifelse(samples$x>=1 & samples$x<=4,1,ifelse(samples$x>=5 & samples$x<=7,2,3))
To rename variables, you have to first install the dplyr package.
# Install the plyr package
install.packages("dplyr")
install.packages("dplyr")
# Load the plyr package
library(dplyr)
library(dplyr)
# Rename Q1 variable to var1
mydata <- rename(mydata, var1 = Q1)
4. Keeping and Dropping Variables
In this example, we keep only first two variables .
mydata1 <- mydata[1:2]
In this example, we keep first and third through sixth variables .
mydata1 <- mydata[c(1,3:6)]In this example, we select variables using their names such as v1, v2, v3.
newdata <- mydata[c("v1", "v2", "v3")]
Deleting a particular column (Fifth column)
mydata [-5]Dropping Q3 variable
mydata$Q3 <- NULL
Deleting multiple columns
mydata [-(3:4) ]Dropping multiple variables by their names
df = subset(mydata, select = -c(x,z) )
5. Subset data (Selecting Observations)
By 'subsetting' data, it implies filtering rows (observations).
Create Sample Data
mydata = data.frame(Name = ifelse(sign(rnorm(25))==-1,'ABC','DEF'), age = sample(1:25))
Selecting first 10 observations
newdata <- mydata[1:10,]Selecting values wherein age is equal to 3
mydata<-subset(mydata, age==3)
Copy data into a new data frame called 'newdata'
newdata<-subset(mydata, age==3)
Conditional Statement (AND) while selecting observations
newdata<-subset(mydata, Name=="ABC" & age==3)
Conditional Statement (OR) while selecting observations
newdata<-subset(mydata, Name=="ABC" | age==3)
Greater than or less than expression
newdata<-subset(mydata, age>=3)
Keeping only missing records
newdata<-subset(mydata, is.na(age))
Keeping only non-missing records
newdata<-subset(mydata, !is.na(age))
6. Sorting
Sorting is one of the most common data manipulation task. It is generally used when we want to see the top 5 highest / lowest values of a variable.
Sorting a vector
x= sample(1:50)The function sort() is used for sorting a 1 dimensional vector. It cannot be used for more than 1 dimensional vector.
x = sort(x, decreasing = TRUE)
Sorting a data frame
mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT= sample(1:25))
Sort gender variable in ascending order
mydata.sorted <- mydata[order(mydata$Gender),]
Sort gender variable in ascending order and then SAT in descending order
mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]Note : "-" sign before mydata$SAT tells R to sort SAT variable in descending order.
7. Value labeling
Use factor() for nominal data
mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male", "female"))Use ordered() for ordinal data
mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree", "Somewhat agree", "Somewhat disagree", "Strongly disagree"))
8. Dealing with missing data
Number of missing values in a variable
colSums(is.na(mydata))Number of missing values in a row
rowSums(is.na(mydata))List rows of data that have missing values
mydata[!complete.cases(mydata),]Creating a new dataset without missing data
mydata1 <- na.omit(mydata)Convert a value to missing
mydata[mydata$Q1==999,"Q1"] <- NA
9. Aggregate by groups
The following code calculates mean for variable "x" by grouped variable "y".
samples = data.frame(x =c(rep(1:10)), y=round((rnorm(10))))
mydata <- aggregate(x~y, samples, mean, na.rm = TRUE)
10. Frequency for a vector
11. Merging (Matching)
mydata <- merge(mydata1, mydata2, by=c("ID"))Detailed Tutorial : Joining and Merging
12. Removing Duplicates
data = read.table(text="
X Y Z
6 5 0
6 5 0
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header=TRUE)
In the example below, we are removing duplicates in a whole data set. [Equivalent to NODUP in SAS]
mydata1 <- unique(data)In the example below, we are removing duplicates by "Y" column. [Equivalent to NODUPKEY in SAS]
mydata2 <- subset(data, !duplicated(data[,"Y"]))
13. Combining Columns and Rows
If the columns of two matrices have the same number of rows, they can be combined into a larger matrix using cbind function. In the example below, A and B are matrices.
If the columns of two matrices have the same number of rows, they can be combined into a larger matrix using cbind function. In the example below, A and B are matrices.
newdata<- cbind(A, B)Similarly, we can combine the rows of two matrices if they have the same number of columns with the rbind function. In the example below, A and B are matrices.
newdata<- rbind(A, B)
14. Combining Rows when different set of columns
install.packages("gtools") #If not installed
library(gtools)
mydata <- smartbind(mydata1, mydata2)
Next Step :
Learn Data Manipulation with dplyr Package
Hi Folk,
ReplyDeleteI tried to replace one of the attribute of some variable by using the below code
D22[D22$Q1==1]=6.
Unfortunately it ddnt work. I dont have access to install car package.
Could you please help in this..
D22$Q1[D22$Q1==1]=6 should work. Thanks!
DeleteHi Folk,
ReplyDeleteI am working on some project on R,
I need to identify outliers, I tried to use the quantile colde,
But it didnt work,
Could you please help me on the same
If you want to see outliers for a particular column in the dataset, you can plot boxplot for that column. There are some standard ways to identify outliers based on Inter Quartile Range (IQR)
Delete
ReplyDelete# how to work on percentiles
duration=mydata_cluster_1$Q4_GPRS_VOLUME
#how to get the quantile forspecific variable
quant=quantile(mydata_cluster_1,c(.05,.1,.15,.2,.25,.30,.35,.40, .9, .95, .98)) )
The above code I wrote on one single variable, what if I have to run it on all variables
good
ReplyDeleteAmazing article.. Please publish much more articles for beginners. And specially articles like knit and shiny (GUI BASED).
ReplyDeleteWhat function should we use To convert HTML file into data frame
ReplyDelete(Note: once you get the HTML file in a list format , how will you convert to a data frame )
please describe the logic of this command "State = ifelse(sign(rnorm(25))==-1"
ReplyDelete