Data Types and Structures in R

Deepanshu Bhalla

Unlike SAS and SPSS, R has several different data types (structures) including vectors, factors, data frames, matrices, arrays and lists. The data frame structure is more like a spreadsheet in MS Excel.

1. Vectors

A vector is an object that contains a set of values called its elements.

Numeric vector
x <- c(1,2,3,4,5,6)

The assignment operator <- is equivalent to the "=" sign here.

Character vector
State <- c("DL", "MU", "NY", "DL", "NY", "MU")
R is a case-sensitive language. This means that uppercase and lowercase letters in variable names, function names and data structures are not considered the same. For example, "State", "state" and "STATE" are three separate vectors in R.

To calculate the frequency of each value in the State vector, you can use the table function.
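As a minimal sketch, the call below counts how often each value occurs in the State vector defined earlier. The variable name freq is only illustrative and is not part of the original example.

```r
# Character vector from the example above
State <- c("DL", "MU", "NY", "DL", "NY", "MU")

# Count how many times each distinct value occurs
freq <- table(State)
print(freq)

# Look up the count for a single value by its name
freq["DL"]   # DL occurs 2 times
```

Each of the three states appears twice in this vector, so all counts are 2.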


To calculate the mean of a vector, you can use the mean function.

x <- c(1,2,3,NA,5,6)
mean(x)
# Output : NA

Since the above vector contains an NA (not available) value, the mean function returns NA.

To calculate the mean of a vector excluding NA values, you can include the na.rm = TRUE parameter in the mean function.

mean(x, na.rm=TRUE)
# Output : 3.4

You can use square brackets [element_position] to access elements of a vector.

my_vector <- c(4,2,1,3,6,5)

my_vector[c(1,4)] # 1st and 4th position
# Output : 4 3

my_vector[2:4] # 2nd to 4th position
# Output : 2 1 3
x <- c(1,2,3,4,5,6)

sum(x[c(3,5)]) returns the sum of the elements in x at positions 3 and 5, which is 3 + 5 = 8 for the vector above.

2. Factors

R has a special data structure, the factor, for storing categorical variables. Making a variable a factor tells R that it is nominal or ordinal.

Simplest form of the factor() function
gender <- c(1,2,1,2,1,2)
gender <- factor(gender)
How to label factors

The factor function takes three main arguments:

  1. Vector name (x)
  2. Values (levels, optional)
  3. Value labels (labels, optional)
gender <- c(1,2,1,2,1,2,1,2)
gender <- factor(gender, 
                 levels = c(1,2),
                 labels = c("male","female"))

In this example, the 'gender' vector will be a factor with levels "male" and "female" and the numeric values 1 and 2 will be mapped to these levels.


Now you will see the labels in the output generated by 'table()' function.
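A short sketch of the labelled factor in use, reusing the gender example above: levels() lists the labels and table() reports the counts under those labels instead of under 1 and 2.

```r
# Numeric codes, as in the example above
gender <- c(1,2,1,2,1,2,1,2)

# Map code 1 to "male" and code 2 to "female"
gender <- factor(gender,
                 levels = c(1,2),
                 labels = c("male","female"))

levels(gender)   # "male" "female"
table(gender)    # 4 observations for each label
```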

3. Matrix

All values in a matrix must have the same mode (numeric, character, etc.), and all columns must have the same length.

The cbind() function joins vectors together as the columns of a matrix. See the usage below:

x <- c(1,2,3,4,5)
y <- c(1,3,5,7,9)
z <- c(1,2,5,4,7)
mymatrix <- cbind(x,y,z)
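The same-mode rule for matrices can be seen by binding a numeric vector and a character vector together. This is a small sketch with made-up vector names (num, chr, mixed): R converts every value to character so that the matrix has a single mode.

```r
# Hypothetical vectors of two different modes
num <- c(1, 2, 3)
chr <- c("a", "b", "c")

# cbind() must pick one mode for the whole matrix
mixed <- cbind(num, chr)

mode(mixed)        # "character"
mixed[1, "num"]    # "1", now stored as text rather than a number
```

If columns of different modes need to be kept as they are, a data frame is the better fit.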

You can also use the matrix() function for creating a matrix in R. The syntax of the matrix function is as follows:

matrix(data, nrow = ..., ncol = ...)
# Create a matrix with 3 rows and 2 columns
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

To see the dimensions of a matrix, you can use the dim() function.
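For the matrix built with cbind() above, a minimal sketch of dim() looks like this. The result is a vector holding the number of rows followed by the number of columns.

```r
x <- c(1,2,3,4,5)
y <- c(1,3,5,7,9)
z <- c(1,2,5,4,7)
mymatrix <- cbind(x,y,z)

dim(mymatrix)    # 5 3 (5 rows, 3 columns)
nrow(mymatrix)   # 5
ncol(mymatrix)   # 3
```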


To see the correlations between the columns of a matrix, you can use the cor() function.
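As a sketch, calling cor() on the matrix from the cbind() example returns a square matrix of pairwise correlations between its columns (Pearson correlation by default). The name cor_matrix is only illustrative.

```r
x <- c(1,2,3,4,5)
y <- c(1,3,5,7,9)
z <- c(1,2,5,4,7)
mymatrix <- cbind(x,y,z)

# 3 x 3 matrix: one row and one column per variable
cor_matrix <- cor(mymatrix)
print(cor_matrix)

# x and y are perfectly linearly related (y = 2*x - 1),
# so their correlation is 1 up to floating-point rounding
cor_matrix["x", "y"]
```

The diagonal is always 1, since each column is perfectly correlated with itself.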


You can use square brackets [row_position,column_position] to select specific rows or columns.

mymatrix[3,] # 3rd row of the matrix
mymatrix[1,3] # element in row 1, column 3
mymatrix[1:2,2:3] # rows 1 and 2 of columns 2 and 3

When a matrix is printed, the labels in brackets on the left side are the row numbers. A label of the form [1, ] refers to row number one, and the blank after the comma means that all columns of that row are shown.

4. Arrays

Arrays are similar to matrices but can have more than two dimensions.

# Creating an array
my_array <- array(1:12, dim = c(2, 3, 4))  # 2 x 3 x 4 array; 1:12 is recycled to fill all 24 cells
my_array

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 4

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
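Elements of an array are selected in the same way as matrix elements, with one index per dimension. The sketch below reuses my_array from above; the values in the comments can be read off the printed output.

```r
my_array <- array(1:12, dim = c(2, 3, 4))

dim(my_array)        # 2 3 4

# Row 1, column 2 of the 3rd slice
my_array[1, 2, 3]    # 3

# Leaving the first two indices blank returns a whole slice as a 2 x 3 matrix
my_array[, , 2]
```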
5. Data Frames

A data frame is similar to SAS and SPSS datasets. It contains variables and records.

It is more general than a matrix in that different columns can have different modes (numeric, character, factor, etc.).

The data.frame() function is used to combine variables (vectors and factors) into a data frame.

x <- c(1,2,3,4,5)
y <- c(1,3,5,7,9)
z <- c(1,2,5,4,7)
gender <- c("m","f","m","m","f")
mydata <- data.frame(x,y,z,gender)

You can also specify the columns within the 'data.frame()' function.

mydata <- data.frame(x = c(1,2,3,4,5),
                     y = c(1,3,5,7,9),
                     z = c(1,2,5,4,7),
                     gender = c("m","f","m","m","f"))
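Once a data frame exists, a few base R functions help to inspect it. This is a minimal sketch using the mydata example above; str() prints each column with its type, and the $ and square-bracket forms pull out columns and cells.

```r
mydata <- data.frame(x = c(1,2,3,4,5),
                     y = c(1,3,5,7,9),
                     z = c(1,2,5,4,7),
                     gender = c("m","f","m","m","f"))

str(mydata)        # one line per column, showing its type and first values
dim(mydata)        # 5 4 (5 records, 4 variables)

mydata$gender      # a single column, selected by name
mydata[2, "y"]     # row 2 of column y: 3
```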

To convert a column "x" to factor, you can use the function as.factor()

mydata$x = as.factor(mydata$x)

To convert a column "y" to character, you can use the function as.character()

mydata$y = as.character(mydata$y)

To convert a column "y" to numeric, you can use the function as.numeric()

mydata$y = as.numeric(mydata$y)
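To confirm that the conversions above had the intended effect, the class of each column can be checked. The sketch below also shows one caveat worth knowing: calling as.numeric() directly on a factor returns its internal level codes, so converting through as.character() first is the usual way to recover the original numbers. The factor f is a made-up example.

```r
mydata <- data.frame(x = c(1,2,3,4,5),
                     y = c(1,3,5,7,9))

mydata$x <- as.factor(mydata$x)
mydata$y <- as.character(mydata$y)
sapply(mydata, class)          # x is "factor", y is "character"

mydata$y <- as.numeric(mydata$y)
class(mydata$y)                # "numeric"

# Caveat: a factor stores integer codes behind its labels
f <- factor(c(10, 20, 10))     # hypothetical factor for illustration
as.numeric(f)                  # 1 2 1 (level codes)
as.numeric(as.character(f))    # 10 20 10 (original values)
```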
6. Lists

A list allows you to store a variety of objects.

mylist <- list(x,y,z,gender,mydata)

You can use double square brackets [[n]] to extract an element from a list, where 'n' is the index of the element you want to extract.
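A small sketch of list extraction, using a shorter list than the one above so that the block runs on its own. Double brackets return the element itself, while single brackets return a list containing it.

```r
x <- c(1,2,3,4,5)
gender <- c("m","f","m","m","f")
mylist <- list(x, gender)

mylist[[2]]        # the character vector gender
mylist[[1]][3]     # 3rd value of the first element: 3

class(mylist[[2]]) # "character"
class(mylist[2])   # "list", because single brackets keep the list wrapper
```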

7. tibble

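A tibble is a modern version of a data frame. It is provided by the tibble package, which is part of the tidyverse. It behaves much like a data frame created with data.frame() but is stricter and prints more compactly.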

# Creating a tibble having sample data
library(tibble)  # requires the tibble package (part of the tidyverse)
my_tibble <- tibble(
  name = c("Dave", "Sandy", "Tim"),
  age = c(25, 30, 35)
)

tibble() and data.frame() have many similarities. If you are already used to data.frame(), there are few cases where you would need tibble() instead. However, tibble comes with several benefits, such as more compact printing and no silent changes to column names or types. It also integrates smoothly with the tidyverse packages.

How to know the data type of a column

1. 'class' is a property assigned to an object that determines how generic functions operate with it. It is not a mutually exclusive classification.

class(mydata)
# [1] "data.frame"

2. 'mode' is a mutually exclusive classification of objects according to their basic structure. The 'atomic' modes are numeric, complex, character and logical.

> x <- 1:16
> x <- factor(x)
> class(x)
[1] "factor"
> mode(x)
[1] "numeric"
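To check the type of a single column rather than a whole object, class() can be applied to the column itself, or to every column at once with sapply(). This is a minimal sketch on a small made-up data frame; stringsAsFactors = FALSE is set explicitly so that the text column stays a character vector on any R version.

```r
mydata <- data.frame(x = c(1,2,3),
                     gender = c("m","f","m"),
                     stringsAsFactors = FALSE)

class(mydata$x)         # "numeric"
class(mydata$gender)    # "character"

# Class of every column in one call
sapply(mydata, class)
```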