**1. Replacing / Recoding values**

**By 'recoding', it means replacing existing value(s) with the new value(s).**

**Create Dummy Data**

mydata = data.frame(State = ifelse(sign(rnorm(25))==-1,'Delhi','Goa'), Q1= sample(1:25))

*In this example, we are replacing 1 with 6 in Q1 variable*

mydata$Q1[mydata$Q1==1] <- 6In this example, we are replacing "Delhi" with "Mumbai" in State variable. We need to convert the variable from factor to character.

mydata$State = as.character(mydata$State)In this example, we are replacing 2 and 3 with NA values in whole dataset.

mydata$State[mydata$State=='Delhi'] <- 'Mumbai'

mydata[mydata == 2 | mydata == 3] <- NA

**Another method**

You have to first install the

**car**package.

# Install the car package

install.packages("car")

# Load the car package

library("car")

# Recode 1 to 6

mydata$Q1 <- recode(mydata$Q1, "1=6")

**Recoding a given range**

# Recoding 1 through 4 to 0 and 5 and 6 to 1

mydata$Q1 <- recode(mydata$Q1, "1:4=0; 5:6=1")

**You don't need to specify lowest and highest value of a range.**

Thelokeyword tells recode to start the range at the lowest value.

Thehikeyword tells recode to end the range at the highest value.

# Recoding lowest value through 4 to 0 and 5 to highest value to 1

mydata$Q1 <- recode(mydata$Q1, "lo:4=0; 5:hi=1")

**You can specify else condition in the recode statement. It means how to treat remaining values that was not already recoded.**

# Recoding lowest value through 4 to 0, 5 and 6 to 1, remaining values to 3,

mydata$Q1 <- recode(mydata$Q1, "lo:4=0; 5:6=1;else= 3")

**2. Recoding to a new column**

# Create a new column called Ques1

mydata$Ques1<- recode(mydata$Q1, "1:4=0; 5:6=1")

**Note :**Make sure you have installed and loaded "car" package before running the above syntax.

**How to use IF ELSE Statement**

**Sample Data**

samples = data.frame(x =c(rep(1:10)), y=letters[1:10])

**If a value of variable x is greater than 6, create a new variable called t1 and write 2 against the corresponding values else make it 1.**

samples$t1 = ifelse(samples$x>6,2,1)

**How to use AND Condition**

samples$t3 = ifelse(samples$x>1 & samples$y=="b" ,2,1)

**How to use NESTED IF ELSE Statement**

samples$t4 = ifelse(samples$x>=1 & samples$x<=4,1,ifelse(samples$x>=5 & samples$x<=7,2,3))

**3. Renaming variables**

To rename variables, you have to first install the

**dplyr**package.

# Install the plyr package

install.packages("dplyr")

install.packages("dplyr")

# Load the plyr package

library(dplyr)

library(dplyr)

# Rename Q1 variable to var1

mydata <- **rename**(mydata, var1 = Q1)

**4.**

**Keeping and Dropping Variables**

In this example, we keep only first two variables .

mydata1 <- mydata[1:2]

In this example, we keep first and third through sixth variables .

mydata1 <- mydata[c(1,3:6)]In this example, we select variables using their names such as v1, v2, v3.

newdata <- mydata[c("v1", "v2", "v3")]

*Deleting a particular column (Fifth column)*mydata [-5]

**Dropping Q3 variable**

mydata$Q3 <- NULL

*Deleting multiple columns*mydata [-(3:4) ]

*Dropping multiple variables by their names*df = subset(mydata, select = -c(x,z) )

**5.**

**Subset data (Selecting Observations)**

By 'subsetting' data, it implies filtering rows (observations).

**Create Sample Data**

mydata = data.frame(Name = ifelse(sign(rnorm(25))==-1,'ABC','DEF'), age = sample(1:25))

**Selecting first 10 observations**

newdata <- mydata[1:10,]

**Selecting values wherein age is equal to 3**

mydata<-subset(mydata, age==3)

**Copy data into a new data frame called 'newdata'**

newdata<-subset(mydata, age==3)

**Conditional Statement (AND) while selecting observations**

newdata<-subset(mydata, Name=="ABC"&age==3)

**Conditional Statement (OR) while selecting observations**

newdata<-subset(mydata, Name=="ABC"|age==3)

**Greater than or less than expression**

newdata<-subset(mydata, age>=3)

**Keeping only missing records**

newdata<-subset(mydata, is.na(age))

**Keeping only non-missing records**

newdata<-subset(mydata, !is.na(age))

**6.**

**Sorting**

Sorting is one of the most common data manipulation task. It is generally used when we want to see the top 5 highest / lowest values of a variable.

**Sorting a vector**

x= sample(1:50)The function

x = sort(x, decreasing = TRUE)

**sort()**is used for sorting a 1 dimensional vector. It cannot be used for more than 1 dimensional vector.

**Sorting a data frame**

mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT= sample(1:25))

Sort gender variable in ascending order

mydata.sorted <- mydata[order(mydata$Gender),]

Sort gender variable in ascending order and then SAT in descending order

mydata.sorted1 <- mydata[order(mydata$Gender, -mydata$SAT),]

**Note :**"-" sign before mydata$SAT tells R to sort SAT variable in descending order.

**7.**

**Value labeling**

Use factor() for nominal data

mydata$Gender <- factor(mydata$Gender, levels = c(1,2), labels = c("male", "female"))Use ordered() for ordinal data

mydata$var2 <- ordered(mydata$var2, levels = c(1,2,3,4), labels = c("Strongly agree", "Somewhat agree", "Somewhat disagree", "Strongly disagree"))

**8. Dealing with**

**missing data**

**Number of missing values in a variable**colSums(is.na(mydata))

**Number of missing values in a row**rowSums(is.na(mydata))

**List rows of data that have missing values**mydata[!complete.cases(mydata),]

*Creating a new dataset without missing data*mydata1 <- na.omit(mydata)

*Convert a value to missing*mydata[mydata$Q1==999,"Q1"] <- NA

**9. Aggregate by groups**

The following code calculates mean for variable "x" by grouped variable "y".

samples = data.frame(x =c(rep(1:10)), y=round((rnorm(10))))

mydata <- aggregate(x~y, samples, mean, na.rm = TRUE)

**10. Frequency for a vector**

**table()**function.

**11. Merging (Matching)**

mydata <- merge(mydata1, mydata2, by=c("ID"))

**Detailed Tutorial : Joining and Merging**

**12. Removing Duplicates**

data = read.table(text="

X Y Z

6 5 0

6 5 0

6 1 5

8 5 3

1 NA 1

8 7 2

2 0 2", header=TRUE)

In the example below, we are removing duplicates in a whole data set. [Equivalent to

**NODUP in SAS**]

mydata1 <- unique(data)In the example below, we are removing duplicates by

**"Y"**column. [Equivalent to

**NODUPKEY in SAS**]

mydata2 <- subset(data, !duplicated(data[,"Y"]))

**13. Combining Columns and Rows**

**If the columns of two matrices have the same number of rows, they can be combined into a larger matrix using**

**cbind**function. In the example below, A and B are matrices.

newdata<- cbind(A, B)Similarly, we can combine the rows of two matrices if they have the same number of columns with the

**rbind**function. In the example below, A and B are matrices.

newdata<- rbind(A, B)

**14**

**. Combining Rows when different set of columns**

install.packages("gtools") #If not installed

library(gtools)

mydata <- smartbind(mydata1, mydata2)

**Next Step**

**:**

**Learn Data Manipulation with dplyr Package**

Hi Folk,

ReplyDeleteI tried to replace one of the attribute of some variable by using the below code

D22[D22$Q1==1]=6.

Unfortunately it ddnt work. I dont have access to install car package.

Could you please help in this..

D22$Q1[D22$Q1==1]=6 should work. Thanks!

DeleteHi Folk,

ReplyDeleteI am working on some project on R,

I need to identify outliers, I tried to use the quantile colde,

But it didnt work,

Could you please help me on the same

If you want to see outliers for a particular column in the dataset, you can plot boxplot for that column. There are some standard ways to identify outliers based on Inter Quartile Range (IQR)

Delete

ReplyDelete# how to work on percentiles

duration=mydata_cluster_1$Q4_GPRS_VOLUME

#how to get the quantile forspecific variable

quant=quantile(mydata_cluster_1,c(.05,.1,.15,.2,.25,.30,.35,.40, .9, .95, .98)) )

The above code I wrote on one single variable, what if I have to run it on all variables

good

ReplyDeleteAmazing article.. Please publish much more articles for beginners. And specially articles like knit and shiny (GUI BASED).

ReplyDelete