**dplyr R Tutorial**

The dplyr package is one of the most powerful and popular packages in R. It was written by Hadley Wickham, a well-known R programmer who has also written many other useful R packages such as ggplot2 and tidyr. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data. It's a complete tutorial on data manipulation and data wrangling with R.

**What is dplyr?**

dplyr is a powerful R package to manipulate, clean and summarize data stored in data frames. In short, it makes data exploration and data manipulation easy and fast in R.

**What's special about dplyr?**

**dplyr vs. Base R Functions**

dplyr functions are often faster than their base R equivalents because they were written in a computationally efficient manner. Their syntax is also more consistent, and they are designed around data frames rather than vectors.

**SQL Queries vs. dplyr**

People have been using SQL to analyze data for decades. Most modern data analysis tools, such as Python, R and SAS, support SQL commands. But SQL was never designed to perform data analysis. It was designed for querying and managing data. There are many data analysis operations where SQL fails or makes simple things difficult, for example calculating the median for multiple variables or converting wide format data to long format. The dplyr package, in contrast, was designed for data analysis.

The names of dplyr functions are similar to SQL commands, such as select() for selecting variables, group_by() for grouping data by a grouping variable, and the join functions, such as inner_join() and left_join(), for joining two data sets. It also supports sub queries, for which SQL is popular.

**How to install and load dplyr package**

To install the dplyr package, type the following command.

install.packages("dplyr")

To load the dplyr package, type the command below.

library(dplyr)

**Important dplyr Functions to remember**

dplyr Function | Description | Equivalent SQL |
---|---|---|
select() | Selecting columns (variables) | SELECT |
filter() | Filter (subset) rows. | WHERE |
group_by() | Group the data | GROUP BY |
summarise() | Summarise (or aggregate) data | - |
arrange() | Sort the data | ORDER BY |
inner_join(), left_join() etc. | Joining data frames (tables) | JOIN |
mutate() | Creating New Variables | COLUMN ALIAS |

**Data : Income Data by States**

**Note :** This dataset does not contain actual income figures of the states.

This dataset contains 51 observations (rows) and 16 variables (columns). A snapshot of a few rows and columns of the dataset is shown below.

**Download the Dataset**

**How to load Data**

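Submit the following code. **Change the file path in the code below.**

mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv", stringsAsFactors = TRUE)

The stringsAsFactors = TRUE argument stores text columns such as Index and State as factors, which the factor-based examples later in this tutorial rely on. From R 4.0.0 onwards this is no longer the default in read.csv().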

**Example 1 : Selecting Random N Rows**

The **sample_n** function selects random rows from a data frame (or table). The second parameter of the function tells R the number of rows to select.

sample_n(mydata,3)

**Example 2 : Selecting Random Fraction of Rows**

The **sample_frac** function returns a random N% of rows. In the example below, it returns a random 10% of rows.

sample_frac(mydata,0.1)
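In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(). The sketch below shows the newer spelling on a small made-up data frame. The `toy` data frame and the seed are assumptions for illustration only, and your dplyr version needs to be at least 1.0.0.

```r
library(dplyr)

# Hypothetical 20-row data frame standing in for mydata
toy <- data.frame(id = 1:20, value = (1:20) * 10)

set.seed(1)                                   # make the random draw repeatable
three_rows <- slice_sample(toy, n = 3)        # counterpart of sample_n(toy, 3)
ten_pct    <- slice_sample(toy, prop = 0.1)   # counterpart of sample_frac(toy, 0.1)

nrow(three_rows)   # 3
nrow(ten_pct)      # 2, i.e. 10% of 20 rows
```

Which rows are drawn depends on the seed, so only the row counts are predictable here.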

**Example 3 : Remove Duplicate Rows based on all the variables (Complete Row)**

The **distinct** function is used to eliminate duplicates.

x1 = distinct(mydata)

**Example 4 : Remove Duplicate Rows based on a variable**

The **.keep_all** argument is used to retain all other variables in the output data frame.

x2 = distinct(mydata, Index,.keep_all= TRUE)

**Example 5 : Remove Duplicate Rows based on multiple variables**

In the example below, we are using two variables - **Index, Y2010** to determine uniqueness.

x2 = distinct(mydata,Index,Y2010, .keep_all= TRUE)
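To see how the three variants of distinct() differ, here is a minimal sketch on a small made-up data frame. The `toy` data and its values are assumptions for illustration, not rows from the income dataset.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 are exact duplicates
toy <- data.frame(Index = c("A", "A", "B", "B"),
                  Y2010 = c(10, 10, 20, 30))

all_cols <- distinct(toy)                                  # drops exact duplicate rows
by_index <- distinct(toy, Index, .keep_all = TRUE)         # keeps the first row per Index
by_both  <- distinct(toy, Index, Y2010, .keep_all = TRUE)  # first row per Index/Y2010 pair

nrow(all_cols)   # 3
nrow(by_index)   # 2
nrow(by_both)    # 3
```

Note that when uniqueness is based on a subset of variables, the first matching row is the one that is kept.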

**select( ) Function**

It is used to select only desired variables.

select() syntax : select(data , ....)

data : Data Frame

.... : Variables by name or by function

**Example 6 : Selecting Variables (or Columns)**

Suppose you are asked to select only a few variables. The code below selects the variable "Index" and the columns from "State" to "Y2008".

mydata2 = select(mydata, Index, State:Y2008)

**Example 7 : Dropping Variables**

The **minus sign** before a variable tells R to drop the variable.

mydata_dropped = select(mydata, -Index, -State)

The above code can also be written like :

mydata_dropped = select(mydata, -c(Index,State))

**Example 8 : Selecting or Dropping Variables that start with 'Y'**

The **starts_with()** function is used to select variables whose names start with a given prefix.

mydata3 = select(mydata, starts_with("Y"))

Adding a negative sign before starts_with() drops the variables that start with 'Y'.

mydata33 = select(mydata, -starts_with("Y"))

*The following functions help you select variables based on their names.*

Helpers | Description |
---|---|
starts_with() | Starts with a prefix |
ends_with() | Ends with a suffix |
contains() | Contains a literal string |
matches() | Matches a regular expression |
num_range() | Numerical range like x01, x02, x03. |
one_of() | Variables in character vector. |
everything() | All variables. |

**Example 9 : Selecting Variables that contain 'I' in their names**

mydata4 = select(mydata, contains("I"))
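The selection helpers are easiest to compare side by side. The following is a small sketch on a made-up one-row data frame whose column names mimic those of mydata. The `toy` data frame is an assumption for illustration.

```r
library(dplyr)

# Hypothetical one-row data frame with column names in the style of mydata
toy <- data.frame(Index = "A", State = "Alabama",
                  Y2002 = 1, Y2003 = 2, Y2004 = 3)

names(select(toy, starts_with("Y")))            # "Y2002" "Y2003" "Y2004"
names(select(toy, ends_with("4")))              # "Y2004"
names(select(toy, contains("I")))               # "Index" (matching ignores case by default)
names(select(toy, matches("^Y200[23]$")))       # "Y2002" "Y2003"
names(select(toy, num_range("Y", 2002:2003)))   # "Y2002" "Y2003"
names(select(toy, State, everything()))         # "State" first, then the remaining columns
```

Because the helpers ignore case by default, contains("I") would also match a lower-case "i" anywhere in a column name.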

**Example 10 : Reorder Variables**

The code below keeps the variable **'State'** in front and the remaining variables follow it.

mydata5 = select(mydata, State, everything())

**rename( ) Function**

It is used to change variable names.

rename() syntax : rename(data , new_name = old_name)

data : Data Frame

new_name : New variable name you want to keep

old_name : Existing Variable Name

**Example 11 : Rename Variables**

The rename function can be used to rename variables.

In the following code, we are renaming the **'Index'** variable to **'Index1'**.

mydata6 = rename(mydata, Index1=Index)


**filter( ) Function**

It is used to subset data with matching logical conditions.

filter() syntax : filter(data , ....)

data : Data Frame

.... : Logical Condition

**Example 12 : Filter Rows**

Suppose you need to subset data. You want to filter rows and retain only those rows in which Index is equal to 'A'.

mydata7 = filter(mydata, Index == "A")

**Example 13 : Multiple Selection Criteria**

The **%in%** operator can be used to select multiple items. In the following program, we are telling R to select rows where the column 'Index' is 'A' or 'C'.

mydata7 = filter(mydata, Index %in% c("A", "C"))

**Example 14 : 'AND' Condition in Selection Criteria**

Suppose you need to apply an 'AND' condition. In this case, we are picking data for 'A' and 'C' in the column 'Index' and income greater than or equal to 1.3 million in Year 2002.

mydata8 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000)

**Example 15 : 'OR' Condition in Selection Criteria**

The '|' denotes OR in the logical condition. It means that a row is kept if either of the two conditions holds.

mydata9 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)

**Example 16 : NOT Condition**

The "!" sign is used to reverse the logical condition.

mydata10 = filter(mydata, !Index %in% c("A", "C"))

**Example 17 : CONTAINS Condition**

The **grepl** function is used for pattern matching. In the following code, we are looking for records wherein the column **State** contains **'Ar'** in its name.

mydata10 = filter(mydata, grepl("Ar", State))
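The filter conditions above can be checked on a handful of rows. Here is a minimal sketch, assuming a made-up data frame `toy` with the same column names as mydata. The values are invented for illustration.

```r
library(dplyr)

# Hypothetical rows in the style of mydata
toy <- data.frame(Index = c("A", "A", "C", "D"),
                  State = c("Alabama", "Arkansas", "Colorado", "Delaware"),
                  Y2002 = c(1500000, 1200000, 1400000, 1100000))

both    <- filter(toy, Index %in% c("A", "C") & Y2002 >= 1300000)  # AND: Alabama, Colorado
either  <- filter(toy, Index %in% c("A", "C") | Y2002 >= 1300000)  # OR: Alabama, Arkansas, Colorado
neither <- filter(toy, !Index %in% c("A", "C"))                    # NOT: Delaware
has_ar  <- filter(toy, grepl("Ar", State))                         # pattern match: Arkansas
```

grepl() is case sensitive by default, which is why "Delaware" (lower-case "ar") is not matched by the pattern "Ar".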

**summarise( ) Function**

It is used to summarize data.

summarise() syntax : summarise(data , ....)

data : Data Frame

..... : Summary Functions such as mean, median etc

**Example 18 : Summarize selected variables**

In the example below, we are calculating mean and median for the variable Y2015.

summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))


**Example 19 : Summarize Multiple Variables**

In the following example, we are calculating number of records, mean and median for variables Y2005 and Y2006. The **summarise_at** function allows us to select multiple variables by their names.

summarise_at(mydata, vars(Y2005, Y2006), funs(n(), mean, median))


**Example 20 : Summarize with Custom Functions**

We can also use custom functions in the summarise function. In this case, we are computing the number of records, number of missing values, mean and median for variables Y2011 and Y2012. The **dot (.)** denotes each variable specified in the second argument of the function.

summarise_at(mydata, vars(Y2011, Y2012),

funs(n(), missing = sum(is.na(.)), mean(., na.rm = TRUE), median(., na.rm = TRUE)))
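If you are on a recent dplyr release, note that funs() has been deprecated since dplyr 0.8.0 and that across(), available from dplyr 1.0.0, supersedes summarise_at() and its relatives. The following sketch shows one possible across() spelling of the same kind of summary. The `toy` data frame is made up for illustration, and the output column names follow the default across() naming scheme rather than the names funs() produced.

```r
library(dplyr)

# Hypothetical data with one missing value per column
toy <- data.frame(Y2011 = c(1, 2, NA), Y2012 = c(4, NA, 6))

res <- summarise(
  toy,
  n = n(),                                        # number of records
  across(c(Y2011, Y2012),
         list(missing = ~ sum(is.na(.x)),         # count of missing values
              mean    = ~ mean(.x, na.rm = TRUE),
              median  = ~ median(.x, na.rm = TRUE)))
)

names(res)
# "n" "Y2011_missing" "Y2011_mean" "Y2011_median"
# "Y2012_missing" "Y2012_mean" "Y2012_median"
```

Here .x plays the role that the dot (.) plays inside funs().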


**How to apply Non-Standard Functions**

Suppose you want to subtract the mean from each original value and then calculate the variance of the result.

set.seed(222)

simdata <- data.frame(X1=sample(1:100,100), X2=runif(100))

summarise_at(simdata,vars(X1,X2),function(x) var(x - mean(x)))

X1 X2

1 841.6667 0.08142161

**Example 21 : Summarize all Numeric Variables**

The **summarise_if** function allows you to summarise conditionally.

summarise_if(mydata, is.numeric, funs(n(),mean,median))

**Alternative Method :**

**First,** store data for all the numeric variables

numdata = mydata[sapply(mydata,is.numeric)]

**Second,** the **summarise_all** function calculates summary statistics for all the columns in a data frame

summarise_all(numdata, funs(n(),mean,median))

**Example 22 : Summarize Factor Variable**

We are checking the **number of levels/categories** and **count of missing observations** in a categorical (factor) variable.

summarise_all(mydata["Index"], funs(nlevels(.), sum(is.na(.))))
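nlevels() only counts categories when the column is stored as a factor. If the column is a character vector (which is what read.csv() returns by default from R 4.0.0 onwards), nlevels() returns 0. A minimal sketch of an alternative that works for both types, using n_distinct(), is shown below. The `toy` data frame is an assumption for illustration.

```r
library(dplyr)

# Hypothetical character column with one missing value
toy <- data.frame(Index = c("A", "A", "C", NA), stringsAsFactors = FALSE)

nlevels(toy$Index)   # 0, because a character vector has no levels

res <- summarise(toy,
                 categories = n_distinct(Index, na.rm = TRUE),  # distinct non-missing values
                 missing    = sum(is.na(Index)))                # count of missing values
res$categories   # 2
res$missing      # 1
```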

**arrange() function :**

**Use :** Sort data

**Syntax**

arrange(data_frame, variable(s)_to_sort)

or

data_frame %>% arrange(variable(s)_to_sort)

To sort a variable in descending order, use **desc(x)**.

**Example 23 : Sort Data by Multiple Variables**

The default sorting order of the **arrange()** function is ascending. In this example, we are sorting data by multiple variables.

arrange(mydata, Index, Y2011)

Suppose you need to sort one variable in descending order and another variable in ascending order.

arrange(mydata,desc(Index), Y2011)

**Pipe Operator %>%**

It is important to understand the pipe (%>%) operator before learning the other functions of the dplyr package. dplyr uses the pipe operator from another package **(magrittr)**.

It allows you to chain operations step by step, much like sub-queries in SQL.

**Note :** All the functions in the dplyr package can be used **without** the pipe operator. The question arises: **"Why use the pipe operator %>%?"** The answer is that it lets you chain multiple functions together, so that the output of one function becomes the first argument of the next.

**Syntax :**

filter(data_frame, variable == value)

or

data_frame%>%filter(variable == value)

*The %>% is NOT restricted to filter function. It can be used with any function.*

**Example :**

The code below demonstrates the usage of the pipe %>% operator. In this example, we are selecting 10 random observations of two variables, "Index" and "State", from the data frame "mydata".

dt = sample_n(select(mydata, Index, State),10)

or

dt = mydata%>%select(Index, State)%>%sample_n(10)
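Because the random sample above changes from run to run, the equivalence of the nested and piped forms is easier to verify with deterministic steps. The sketch below uses the built-in mtcars dataset. The particular columns and the filter condition are arbitrary choices for illustration.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- arrange(filter(select(mtcars, mpg, cyl), cyl == 4), desc(mpg))

# Piped form: read from top to bottom
piped <- mtcars %>%
  select(mpg, cyl) %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg))

identical(nested, piped)   # TRUE
```

Both forms run exactly the same function calls. The pipe only changes the order in which you read them.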


**group_by() function :**

**Use :** Group data by categorical variable

**Syntax :**

group_by(data, variables)

or

data %>% group_by(variables)

**Example 24 : Summarise Data by Categorical Variable**

We are calculating count and mean of variables Y2011 and Y2012 by variable Index.

t = summarise_at(group_by(mydata, Index), vars(Y2011, Y2012), funs(n(), mean(., na.rm = TRUE)))

The above code can also be written like

t = mydata %>% group_by(Index) %>%

summarise_at(vars(Y2011, Y2012), funs(n(), mean(., na.rm = TRUE)))
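On dplyr 1.0.0 or later, a grouped summary like this can also be written with across(). The following is a sketch on the built-in mtcars dataset, with cyl standing in for Index and mpg and hp standing in for the income columns. These substitutions are assumptions made so that the block runs without the income file.

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(),                                   # rows per group
            across(c(mpg, hp), ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"))            # mpg_mean, hp_mean

res$n   # 11 7 14 for cyl = 4, 6, 8
```

Unlike the funs() version, the record count appears once per group instead of once per summarised variable.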

**do() function :**

**Use :** Compute within groups

**Syntax :**

do(data_frame, expressions_to_apply_to_each_group)

**Note :** The **dot (.)** is required to refer to a data frame.

**Example 25 : Filter Data within a Categorical Variable**

Suppose you need to pull the top 2 rows from the 'A', 'C' and 'I' categories of the variable Index.

t = mydata %>% filter(Index %in% c("A", "C","I")) %>%group_by(Index)%>%

do(head( . , 2))


**Example 26 : Selecting 3rd Maximum Value by Categorical Variable**

We are calculating third maximum value of variable Y2015 by variable Index. The following code first selects only two variables Index and Y2015. Then it filters the variable Index with 'A', 'C' and 'I' and then it groups the same variable and sorts the variable Y2015 in descending order. At last, it selects the third row.

t = mydata %>% select(Index, Y2015) %>%

filter(Index %in% c("A", "C","I")) %>%

group_by(Index) %>%

do(arrange(.,desc(Y2015))) %>%slice(3)

The **slice()** function is used to select rows by position.

**Using Window Functions**

Like SQL, dplyr provides window functions that can be used to subset data within a group. A window function returns a vector of values. We could use the **min_rank()** function, which calculates rank, in place of sorting and slicing in the preceding example:

t = mydata %>% select(Index, Y2015) %>%

filter(Index %in% c("A", "C","I")) %>%

group_by(Index) %>%

filter(min_rank(desc(Y2015)) == 3)
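The slice() approach and the min_rank() approach agree when there are no ties, but they can differ when values repeat. The sketch below illustrates this on a made-up single-group data frame. The `toy` data and the use of dense_rank() as a third option are assumptions added for illustration.

```r
library(dplyr)

# Hypothetical group in which the value 9 appears twice
toy <- data.frame(Index = "A", Y2015 = c(10, 9, 9, 7))

# Third row after sorting in descending order
by_position <- toy %>%
  group_by(Index) %>%
  arrange(desc(Y2015), .by_group = TRUE) %>%
  slice(3)

# min_rank() gives ranks 1, 2, 2, 4, so no row has rank 3
by_rank <- toy %>%
  group_by(Index) %>%
  filter(min_rank(desc(Y2015)) == 3)

# dense_rank() gives ranks 1, 2, 2, 3, so the third distinct value is kept
by_dense_rank <- toy %>%
  group_by(Index) %>%
  filter(dense_rank(desc(Y2015)) == 3)

by_position$Y2015     # 9
nrow(by_rank)         # 0
by_dense_rank$Y2015   # 7
```

Which of the three you want depends on whether "third maximum" means the third row or the third distinct value.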

**Example 27 : Summarize, Group and Sort Together**

In this case, we are computing the mean of variables Y2014 and Y2015 by the variable Index. Then we sort the result by the calculated mean of Y2015 in descending order.

t = mydata %>%

group_by(Index)%>%

summarise(Mean_2014 = mean(Y2014, na.rm=TRUE),

Mean_2015 = mean(Y2015, na.rm=TRUE)) %>%

arrange(desc(Mean_2015))

**mutate() function :**

**Use :** Creates new variables

**Syntax :**

mutate(data_frame, expression(s) )

or

data_frame %>% mutate(expression(s))

**Example 28 : Create a new variable**

The following code divides Y2015 by Y2014 and names the result "change".

mydata1 = mutate(mydata, change=Y2015/Y2014)

**Example 29 : Multiply all the variables by 1000**

It creates new variables and names them with the suffix "_new".

mydata11 = mutate_all(mydata, funs("new" = .* 1000))


The output shown in the image above is truncated due to high number of variables.

**Note -** The above code returns the following warning messages -

**Warning messages:**

1: In Ops.factor(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, :

‘*’ not meaningful for factors

2: In Ops.factor(1:51, 1000) : ‘*’ not meaningful for factors

It implies you are multiplying string (character) values, which are stored as factor variables, by 1000. These variables are 'Index' and 'State'. It does not make sense to apply a multiplication operation to character variables. For these two variables, the newly created variables contain only NA.

**Solution :** See **Example 45** - Apply multiplication on only numeric variables

**Example 30 : Calculate Rank for Variables**

Suppose you need to calculate rank for variables Y2008 to Y2010.

mydata12 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(.)))

By default, **min_rank()** assigns 1 to the smallest value and a high number to the largest value. In case you need to assign rank 1 to the largest value of a variable, use **min_rank(desc(.))**

mydata13 = mutate_at(mydata, vars(Y2008:Y2010), funs(Rank=min_rank(desc(.))))

**Example 31 : Select the State that generated the highest income within each 'Index' group**

out = mydata %>% group_by(Index) %>% filter(min_rank(desc(Y2015)) == 1) %>%

select(Index, State, Y2015)

**Example 32 : Cumulative Income of 'Index' variable**

The **cumsum** function calculates the cumulative sum of a variable. With the **mutate** function, we insert a new variable called 'Total' which contains the cumulative income within each level of the variable Index.

out2 = mydata %>% group_by(Index) %>% mutate(Total=cumsum(Y2015)) %>%

select(Index, Y2015, Total)
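The effect of grouping on cumsum() is easy to see on a few rows. Here is a minimal sketch with a made-up data frame. The `toy` values are invented for illustration.

```r
library(dplyr)

# Hypothetical data: two rows per Index
toy <- data.frame(Index = c("A", "A", "C", "C"),
                  Y2015 = c(1, 2, 3, 4))

out <- toy %>%
  group_by(Index) %>%
  mutate(Total = cumsum(Y2015)) %>%   # running total restarts for each Index
  ungroup()

out$Total   # 1 3 3 7
```

Without group_by(), the running total would continue across groups and give 1, 3, 6, 10.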

**join() function :**

**Use :** Join two datasets

**Syntax :**

inner_join(x, y, by = )

left_join(x, y, by = )

right_join(x, y, by = )

full_join(x, y, by = )

semi_join(x, y, by = )

anti_join(x, y, by = )

**x, y -** datasets (or tables) to merge / join

**by -** common variable (primary key) to join by.

**Example 33 : Common rows in both the tables**

Let's create two data frames say df1 and df2.

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),

w = c('a', 'b', 'c', 'd', 'e'),

x = c(1, 1, 0, 0, 1),

y=rnorm(5),

z=letters[1:5])

df2 <- data.frame(ID = c(1, 7, 3, 6, 8),

a = c('z', 'b', 'k', 'd', 'l'),

b = c(1, 2, 3, 0, 4),

c =rnorm(5),

d =letters[2:6])

**INNER JOIN** returns rows when there is a match in both tables. In this example, we are merging df1 and df2 with ID as common variable (primary key).

df3 = inner_join(df1, df2, by = "ID")


If the primary key does not have the same name in both the tables, map the names in the by argument. For example, if the key were called ID in the first table and ID1 in the second:

inner_join(df1, df2, by = c("ID"="ID1"))

**Example 34 : Applying LEFT JOIN**

**LEFT JOIN :**It returns all rows from the left table, even if there are no matches in the right table.

left_join(df1, df2, by = "ID")
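Since the image outputs are not reproduced here, the sketch below compares the row counts returned by each join type. It uses simplified copies of the two data frames above, keeping only the ID column and one other column. The names `left_tbl` and `right_tbl` are assumptions for illustration.

```r
library(dplyr)

# Simplified versions of df1 and df2: only IDs 1 and 3 appear in both
left_tbl  <- data.frame(ID = c(1, 2, 3, 4, 5), w = c("a", "b", "c", "d", "e"))
right_tbl <- data.frame(ID = c(1, 7, 3, 6, 8), a = c("z", "b", "k", "d", "l"))

n_inner <- nrow(inner_join(left_tbl, right_tbl, by = "ID"))  # 2: matching IDs only
n_left  <- nrow(left_join(left_tbl, right_tbl, by = "ID"))   # 5: every row of left_tbl
n_right <- nrow(right_join(left_tbl, right_tbl, by = "ID"))  # 5: every row of right_tbl
n_full  <- nrow(full_join(left_tbl, right_tbl, by = "ID"))   # 8: all IDs from both tables
n_semi  <- nrow(semi_join(left_tbl, right_tbl, by = "ID"))   # 2: left_tbl rows with a match
n_anti  <- nrow(anti_join(left_tbl, right_tbl, by = "ID"))   # 3: left_tbl rows without a match
```

semi_join() and anti_join() only filter the first table. They never add columns from the second one.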


**Combine Data Vertically**

**intersect(x, y)** Rows that appear in both x and y.

**union(x, y)** Rows that appear in either or both x and y.

**setdiff(x, y)** Rows that appear in x but not y.

**Example 35 : Applying INTERSECT**

**Prepare Sample Data for Demonstration**

mtcars$model <- rownames(mtcars)

first <- mtcars[1:20, ]

second <- mtcars[10:32, ]

**INTERSECT** selects unique rows that are common to both the data frames.

intersect(first, second)

**Example 36 : Applying UNION**

**UNION** displays all rows from both the tables and removes duplicate records from the combined dataset. The **union_all** function, in contrast, keeps duplicate rows in the combined dataset.

x=data.frame(ID = 1:6, ID1= 1:6)

y=data.frame(ID = 1:6, ID1 = 1:6)

union(x,y)

union_all(x,y)

**Example 37 : Rows that appear in one table but not in the other table**

setdiff(first, second)
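The four set operations can be compared on two tiny data frames whose overlap is easy to see. This is a minimal sketch. The data frames `a` and `b` are made up for illustration.

```r
library(dplyr)

# Hypothetical data: ids 3 and 4 appear in both data frames
a <- data.frame(id = 1:4)
b <- data.frame(id = 3:6)

common   <- intersect(a, b)   # ids 3, 4
combined <- union(a, b)       # ids 1 to 6, duplicates removed
stacked  <- union_all(a, b)   # all 8 rows, duplicates kept
only_a   <- setdiff(a, b)     # ids 1, 2
```

The order of the arguments matters for setdiff(): setdiff(b, a) would return ids 5 and 6 instead.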

**Example 38 : IF ELSE Statement**

**Syntax :**

if_else(condition, true, false, missing = NULL)

true : Value if the condition is met

false : Value if the condition is not met

missing : Value used to replace missing values (Default : NULL)

df <- c(-10,2, NA)

if_else(df < 0, "negative", "positive", missing = "missing value")

**Create a new variable with IF_ELSE**

If a value is less than 5, add 1 to it, and if it is greater than or equal to 5, add 2 to it. Missing values are set to 0.

df =data.frame(x = c(1,5,6,NA))

df %>% mutate(newvar=if_else(x<5, x+1, x+2,0))


**Nested IF ELSE**

Multiple IF ELSE statements can be written by nesting if_else() calls. See the example below -

mydf =data.frame(x = c(1:5,NA))

mydf %>% mutate(newvar=if_else(is.na(x),"I am missing",

if_else(x==1,"I am one",

if_else(x==2,"I am two",

if_else(x==3,"I am three","Others")))))

**Output**

x newvar

1 I am one

2 I am two

3 I am three

4 Others

5 Others

NA I am missing

**SQL-Style CASE WHEN Statement**

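We can use the **case_when()** function to write nested if-else queries. In dplyr versions before 0.7.0, variables cannot be used directly within the case_when() wrapper, so they are written like **.$x**, which is equivalent to **mydf$x**. From dplyr 0.7.0 onwards, this workaround is no longer necessary and the variable name alone also works. **TRUE** refers to the ELSE statement.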

mydf %>% mutate(flag =case_when(is.na(.$x) ~ "I am missing",

.$x == 1 ~ "I am one",

.$x == 2 ~ "I am two",

.$x == 3 ~ "I am three",

TRUE ~ "Others"))
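For readers on dplyr 0.7.0 or later, the same logic can be written without the .$ prefix. The sketch below repeats the example with bare variable names. The result name `res` is an assumption for illustration.

```r
library(dplyr)

mydf <- data.frame(x = c(1:5, NA))

res <- mydf %>%
  mutate(flag = case_when(
    is.na(x) ~ "I am missing",   # checked first so that NA is handled explicitly
    x == 1   ~ "I am one",
    x == 2   ~ "I am two",
    x == 3   ~ "I am three",
    TRUE     ~ "Others"          # fallback, like ELSE in SQL
  ))

res$flag
# "I am one" "I am two" "I am three" "Others" "Others" "I am missing"
```

Conditions are evaluated in order, and each row takes the value of the first condition that is TRUE.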

**Important Point**

Make sure you set the is.na() condition at the beginning of a nested if-else. Otherwise, it would not be executed.

**Example 39 : Apply ROW WISE Operation**

Suppose you want to find the maximum value in each row across the variables Y2012, Y2013, Y2014 and Y2015. The **rowwise()** function allows you to apply functions to rows.

df = mydata %>%

rowwise() %>% mutate(Max = max(Y2012, Y2013, Y2014, Y2015)) %>%

select(Y2012:Y2015, Max)
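A point worth checking in row-wise code: inside max(), an expression such as Y2012:Y2015 builds a numeric sequence from the first value to the last rather than comparing the four columns. The sketch below shows two ways to compare the columns themselves, assuming dplyr 1.0.0 or later for c_across(). The `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical data: the row maxima are 9 and 8
toy <- data.frame(Y2012 = c(5, 1), Y2013 = c(9, 2),
                  Y2014 = c(3, 8), Y2015 = c(4, 6))

# Option 1: rowwise() with c_across() to collect a range of columns per row
with_rowwise <- toy %>%
  rowwise() %>%
  mutate(Max = max(c_across(Y2012:Y2015))) %>%
  ungroup()

# Option 2: vectorised pmax(), no rowwise() needed
with_pmax <- toy %>%
  mutate(Max = pmax(Y2012, Y2013, Y2014, Y2015))

with_rowwise$Max   # 9 8
with_pmax$Max      # 9 8
```

pmax() is usually the faster choice on large data because it avoids processing one row at a time.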


**Example 40 : Combine Data Frames**

Suppose you are asked to combine two data frames. Let's first create two sample datasets.

df1=data.frame(ID = 1:6, x=letters[1:6])

df2=data.frame(ID = 7:12, x=letters[7:12])

The **bind_rows()** function combines two datasets by rows. So the combined dataset would contain **12 rows (6+6) and 2 columns.**

xy = bind_rows(df1,df2)

It is equivalent to base R function rbind.

xy = rbind(df1,df2)

The **bind_cols()** function combines two datasets by columns. So the combined dataset would contain **4 columns and 6 rows.**

xy = bind_cols(x,y)

or

xy = cbind(x,y)

**Example 41 : Calculate Percentile Values**

The **quantile()** function is used to determine the Nth percentile value. In this example, we are computing percentile values by the variable Index.

mydata %>% group_by(Index) %>%

summarise(Percentile_25=quantile(Y2015, probs=0.25),

Percentile_50=quantile(Y2015, probs=0.5),

Percentile_75=quantile(Y2015, probs=0.75),

Percentile_99=quantile(Y2015, probs=0.99))

The **ntile()** function is used to divide the data into N bins.

x = data.frame(N = 1:10)

x = mutate(x, pos = ntile(N, 5))

**Example 42 : Automate Model Building**

This example explains the advanced usage of the **do()** function. In this example, we are building a linear regression model for each level of a categorical variable. There are 3 levels in the variable cyl of the dataset mtcars.

length(unique(mtcars$cyl))

**Result : 3**

by_cyl <- group_by(mtcars, cyl)

models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))

summarise(models, rsq = summary(mod)$r.squared)

models %>% do(data.frame(

var = names(coef(.$mod)),

coef(summary(.$mod)))

)
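In recent dplyr releases, do() is marked as superseded. One possible alternative for the R-squared part of this example is group_modify(), sketched below. The result name `rsq_by_cyl` and the choice of group_modify() over other approaches are assumptions, not part of the original example.

```r
library(dplyr)

# Fit mpg ~ disp separately for each value of cyl and keep only R-squared
rsq_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ data.frame(rsq = summary(lm(mpg ~ disp, data = .x))$r.squared))

nrow(rsq_by_cyl)   # 3, one row per level of cyl
```

Inside group_modify(), .x refers to the rows of the current group, much like the dot (.) inside do().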


**if() Family of Functions**

It includes functions like select_if, mutate_if, summarise_if. They come into action only when a logical condition is met. See examples below.

**Example 43 : Select only numeric columns**

The **select_if()** function returns only those columns where the logical condition is TRUE. Passing **is.numeric** retains only numeric variables.

mydata2 = select_if(mydata, is.numeric)

Similarly, you can use the following code for selecting factor columns -

mydata3 = select_if(mydata, is.factor)

**Example 44 : Number of levels in factor variables**

Like the select_if() function, the summarise_if() function lets you summarise only those variables for which the logical condition holds.

summarise_if(mydata, is.factor, funs(nlevels(.)))

It returns 19 levels for variable Index and 51 levels for variable State.

**Example 45 : Multiply by 1000 to numeric variables**

mydata11 = mutate_if(mydata, is.numeric, funs("new" = .* 1000))
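On dplyr 1.0.0 or later, the same idea can be expressed with across() and where(), which also avoids the deprecated funs(). Here is a minimal sketch on a made-up data frame. The `toy` data and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: one text column and two numeric columns
toy <- data.frame(State = c("Alabama", "Alaska"),
                  Y2014 = c(1.5, 2),
                  Y2015 = c(3, 4))

out <- mutate(toy,
              across(where(is.numeric), ~ .x * 1000,
                     .names = "{.col}_new"))   # adds Y2014_new and Y2015_new

names(out)      # "State" "Y2014" "Y2015" "Y2014_new" "Y2015_new"
out$Y2015_new   # 3000 4000
```

The text column is left untouched, so no warnings about factors or characters are raised.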

**Example 46 : Convert value to NA**

In this example, we are converting "" to NA using the **na_if()** function.

k <- c("a", "b", "", "d")

na_if(k, "")

**Result :** "a" "b" NA "d"

**Endnotes**

There are hundreds of packages that are dependent on this package. The main benefits it offers are to take away the fear of R programming, make coding effortless and lower processing time. However, some R programmers prefer the **data.table** package for its speed. I would recommend learning both packages. Check out the **data.table tutorial**. The data.table package wins over dplyr in terms of speed if the data size is greater than 1 GB.

I followed along your script step by step and got a warning message in Example 29 : Multiply all the variables by 1000 as follows:

1: In Ops.factor(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, :

‘*’ not meaningful for factors

2: In Ops.factor(1:51, 1000) : ‘*’ not meaningful for factors

What did it mean? Could you please give me some explanation. Thanks.

This error says 'multiplying 1000 on factor(string) variables' does not make sense. Run this command - str(mydata[,1:2])

First two variables in the dataframe mydata are strings that are stored as factor variables.

I got it. Thanks. I think I'm ready to go next to your another tutorial - data.table. It's quite interesting to learn R from your blog posts.

Example#22 - gives incorrect number of levels as 0 - fix is below

Why doesn't nlevels() work?

> summarise_all(dt["Index"], funs(nlevels(.), sum(is.na(.))))

# A tibble: 1 × 2

nlevels sum

1 0 0

===============

The fix is change nlevels() to length(unique(.)) as below

summarise_all(dt["Index"], funs(length(unique(.)), sum(is.na(.))))

# A tibble: 1 × 2

length sum

1 19 0

It works fine at my end. Check out the code below -

library(dplyr)

mydata = read.csv("C:\\Users\\Deepanshu\\Documents\\sampledata.csv")

summarise_all(mydata["Index"], funs(nlevels(.), sum(is.na(.))))

Example #21

Alternatively, we can use the following:

mydata %>% summarise_if(is.numeric, funs(n(),mean,median))

Thank you for posting alternative method. I have added it to the tutorial. Cheers!

This is a great tutorial. A doubt that crept to me when I tries to mix multiple functions. Any reason why the following is not working:

DF <- mutate_if(mydata, is.numeric & contains('Y2015'), funs('new' = *.100));

but this works:

DF <- mutate_if(mydata, is.numeric, funs('new' = *.100));

what if I want to mutate to add only a column for Y2015?

Thanks


W. r. t. chapter 'SQL-Style CASE WHEN Statement': The workaround .$ is not necessary anymore from dplyr version 0.7.0