This tutorial explains how to transpose a data frame in R using tidyr
package. It helps to convert data from long to wide format or wide to long format.
Transpose Data with R |
The code below creates a sample data frame that would be used to explain examples in this tutorial.
data <- read.table(text="X Y Z ID12 2012-06 566 ID1 2012-06 10239 ID6 2012-06 524 ID12 2012-07 2360 ID1 2012-07 13853 ID6 2012-07 2352 ID12 2012-08 3950 ID1 2012-08 14738 ID6 2012-08 4104",header=TRUE)
Converting Data from Long to Wide Format
Suppose you have data containing three variables such as X, Y and Z. The variable 'X' contains IDs and the variable 'Y' contains dates and the variable 'Z' contains income. The data is structured in a long format and you need to convert it to wide format. The snapshot of data and desired output is shown below -
R : Convert Long to Wide Format |
pivot_wider()
In tidyr package, pivot_wider( )
function converts data from long to wide Format. See the arguments of this function below:
pivot_wider( data, names_from, values_from, values_fill = NULL, id_cols = everything(), names_sep = "_" )
data
: Data frame to be transposed.names_from
: Column(s) to be used for naming columns in the output.values_from
: Column(s) to get the values from.values_fill
: If specified, this value will replace any missing values.id_cols
: Identifier columns. By default, it takes all columns except for the columns in the 'names_from' and 'values_from' arguments.names_sep
: Separator to use between values in the resulting column names.
You can install 'tidyr' package by using the command : install.packages('tidyr')
library(tidyr) mydt <- pivot_wider(data, names_from = Y, values_from = Z)
spread()
There is one more function in tidyr named spread()
that widens data by increasing the number of columns while decreasing the number of rows from the input to the transposed output data.
The syntax of spread() function is as follows :
spread(data, key, value, fill = NA)
data
: data frame name.key
: Column name which will be used for column headings.value
: Column name which will be used for filling the rows.fill
: If used, missing values will be replaced with this value.
mydt <- spread(data, Y, Z)
# A tibble: 3 × 4
X `2012-06` `2012-07` `2012-08`
1 ID12 566 2360 3950
2 ID1 10239 13853 14738
3 ID6 524 2352 4104
pivot_wider vs. spread
: We recommend using the pivot_wider() function since it is actively maintained by developers and it's easier to use and has more features than the spread() function.In this section, we will see how to treat missing values in the output data. In the data below, we have information about income generated from 3 products - Product A, B and C, reported semi-annually. We have limited data about Product C.
df <- read.table(text="Year SemiYear Product Income 1 1 ProductA 13377 1 2 ProductA 14069 1 1 ProductB 11426 1 2 ProductB 11750 1 2 ProductC 11750 2 1 ProductA 11122 2 2 ProductA 11202 2 1 ProductB 14712 2 2 ProductB 10169 ",header=TRUE)
pivot_wider()
library(tidyr) mydt_pivot <- pivot_wider(df, names_from = Product, values_from = Income)
# A tibble: 4 × 5
Year SemiYear ProductA ProductB ProductC
1 1 1 13377 11426 NA
2 1 2 14069 11750 11750
3 2 1 11122 14712 NA
4 2 2 11202 10169 NA
As you can see in the output, we have NAs in the 'ProductC' column. To replace them with 0, we can use the argument values_fill
in the 'pivot_wider' function.
mydt_pivot <- pivot_wider(df,
names_from = Product,
values_from = Income,
values_fill = 0)
spread()
We can use the argument fill
in the 'spread' function to replace missing values with 0.
library(tidyr) mydt_spread <- spread(df, Product, Income, fill = 0)
# A tibble: 4 × 5
Year SemiYear ProductA ProductB ProductC
1 1 1 13377 11426 0
2 1 2 14069 11750 11750
3 2 1 11122 14712 0
4 2 2 11202 10169 0
Converting Data from Wide to Long Format
Suppose you have data containing information of species and their sepal length. The data of sepal length of species are in columns.
Wide to Long Format |
mydata = read.table(text= "ID setosa versicolor virginica 1 5.1 NA NA 2 4.9 NA NA 3 NA 7 NA 4 NA 6.4 NA 5 NA NA 6.3 6 NA NA 5.8 ", header=TRUE)
pivot_longer()
The pivot_longer()
function transforms data from wide to long format. See the arguments below :
pivot_longer( data, cols, names_to = "name", values_to = "value", names_prefix = NULL, names_sep = NULL, values_drop_na = FALSE )
data
: Data frame to be reshaped to long format.cols
: Column(s) to be transformed into longer format.names_to
: New column to be created from the columns specified in the 'cols' argument.values_to
: Name of the column having values in the output.names_prefix
: Pattern used to remove matching text from the start of each column name.names_sep
: Separator used in column names.values_drop_na
: Removing rows containing only NAs in the 'value_to' column.
mydt_longer <- pivot_longer( mydata, cols = c(setosa, versicolor, virginica), names_to = "Species", values_to = "Sepal.Length", values_drop_na = TRUE )
gather()
The syntax of the gather()
function is as follows:
gather(data, key, value, ..., na.rm = FALSE)
data
: Data framekey
: Name of new key columnvalue
: Name of new value column...
: Columns to gather.na.rm
: na.rm=TRUE means removing rows from the output where the value is missing
mydt_gather <- gather(mydata, Species, Sepal.Length, c("setosa", "versicolor","virginica" ), na.rm = TRUE)
# A tibble: 6 × 3
ID Species Sepal.Length
1 1 setosa 5.1
2 2 setosa 4.9
3 3 versicolor 7
4 4 versicolor 6.4
5 5 virginica 6.3
6 6 virginica 5.8
values_drop_na
to TRUE.
Hi Deepashu,
ReplyDeleteThanks for sharing such a wonderful article on Transposing data.
However, I still have some doubt associated with, if I will compare it with SAS.
The variable which you are saying that its an "ID" variable, by sas I can say that it a "BY" variable because that's something which we are keeping it on the left hand side and rest we are "ID" by sas.
please reply to clear my small dubiousness.
Thanks
Rishabh
One of the best overall tutorials I have seen. I promise this is getting bookmarked for my reference.
ReplyDelete