Data Reshaping in R – Popular Functions to Organise Data

In this TechVidvan tutorial, discover why data reshaping is required in R and learn how to use different functions in R to do this.

For an analysis project, the gathered data is messy and unstructured most of the time. It is collected from different sources, has different variables, and has irregular formats.

With this tutorial, learn how to change the gathered data and conform it to our desired format in R. We will take a look at the functions in R that allow us to do this and much more. So, get ready for the ride!

What is Data Reshaping in R?

Before we can perform any kind of analysis, we first need to shape the gathered data into a regular and processable format. We need to ensure that all of the data fits into proper variables.

We also need to take care of missing values and put placeholders in their place that our analysis tools can understand.

This is the first step in any analysis project. We call this process data reshaping.
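As a small illustration of the missing-value step, the sketch below assumes a hypothetical vector of readings in which -999 was used as a sentinel for "no reading". Both the vector and the sentinel are made up for this example. Replacing the sentinel with NA, R's built-in missing-value marker, lets functions such as mean() skip it through na.rm = TRUE.

```r
# Hypothetical temperature readings; -999 is an assumed sentinel for "missing"
readings <- c(21.5, -999, 19.0, 23.5, -999)

# Replace the sentinel with NA, R's built-in missing-value marker
readings[readings == -999] <- NA

# Count the missing values and average the remaining ones
n_missing <- sum(is.na(readings))
avg_reading <- mean(readings, na.rm = TRUE)

n_missing
avg_reading
```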

Now let’s learn popular functions used for data reshaping in R.

The cbind(), rbind(), and t() Functions

There are many functions in R that allow us to manipulate data objects in many ways.

cbind(), rbind(), and t() are the most commonly used functions for data reshaping. We will be going through them one-by-one:

1. cbind(): The cbind() function allows us to join objects as columns. We can combine matrices, data-frames, vectors, or any combination of these.

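Code:

# Sample objects used throughout this section
vec1 <- c(1, 2, 3, 4, 5)
vec2 <- c(6, 7, 8, 9, 10)
mat1 <- matrix(1:15, nrow = 5, ncol = 3)
mat2 <- matrix(16:30, nrow = 5, ncol = 3)
df1 <- data.frame(matrix(1:30, nrow = 5))
df2 <- data.frame(matrix(31:60, nrow = 5))

# Two vectors give a 5 x 2 matrix
cbind(vec1, vec2)

Code:

# Two matrices give a 5 x 6 matrix
cbind(mat1, mat2)

Code:

# Two data frames give a data frame with 12 columns
cbind(df1, df2)

Code:

# A vector and a matrix give a 5 x 4 matrix
cbind(vec1, mat2)

Code:

# A vector and a data frame give a data frame
cbind(vec2, df1)

Code:

# A matrix and a data frame give a data frame
cbind(mat1, df2)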

Note: The matrices and data-frames being combined must have the same number of rows for the cbind() function to work, and vectors should have that same length.
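To see what the note above means in practice, here is a minimal sketch with two hypothetical matrices (m_three and m_two are names invented for this example). It compares the row counts first and uses tryCatch() to capture the error that cbind() raises when they differ.

```r
# Hypothetical matrices with different row counts
m_three <- matrix(1:6, nrow = 3)   # 3 rows, 2 columns
m_two   <- matrix(1:4, nrow = 2)   # 2 rows, 2 columns

# Check the dimensions before combining
rows_match <- nrow(m_three) == nrow(m_two)

# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  cbind(m_three, m_two),
  error = function(e) conditionMessage(e)
)

rows_match
result

# Matrices with matching row counts combine cleanly
combined <- cbind(m_three, matrix(7:12, nrow = 3))
dim(combined)
```

Checking nrow() up front is usually cheaper than debugging a failed cbind() call later in a script.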

2. rbind(): The rbind() function allows us to join objects as rows.

Code:

# Two vectors give a 2 x 5 matrix, one row per vector
rbind(vec1, vec2)

Code:

# A vector and a data frame give a data frame
rbind(vec1, df2)

Note: The matrices and data-frames being combined must have the same number of columns for the rbind() function to work.
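For data frames, rbind() also expects the column names to line up. The sketch below uses small hypothetical data frames (q1, q2, and other are invented for this example) to show one call that stacks cleanly and one that fails. The failure is captured with tryCatch() and replaced by a short message of our own.

```r
# Hypothetical data frames sharing the same column names
q1 <- data.frame(month = c("jan", "feb"), temp = c(4.6, 19.7))
q2 <- data.frame(month = "mar", temp = 36.7)

# Same names: the rows are stacked
stacked <- rbind(q1, q2)
nrow(stacked)

# Different names: rbind() signals an error, captured here with tryCatch()
other <- data.frame(city = "delhi", temp = 30.1)
result <- tryCatch(
  rbind(q1, other),
  error = function(e) "column names do not match"
)
result
```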

3. t(): The t() function transposes a matrix, that is, it turns the rows into columns and the columns into rows.

Code:

# mat1 is 5 x 3, so its transpose is 3 x 5
t(mat1)

Code:

t(mat2)
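A few properties of t() are worth checking by hand. The sketch below uses a hypothetical 2 x 3 matrix m and a small mixed-type data frame df (both invented for this example). It shows that the dimensions swap, that transposing twice returns the original matrix, and that a data frame is converted to a matrix before being transposed.

```r
m <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns
dim(t(m))                     # dimensions are swapped

# Transposing twice gives back the original matrix
identical(t(t(m)), m)

# A data frame is converted to a matrix before it is transposed
df <- data.frame(a = 1:2, b = c("x", "y"))
t_df <- t(df)
is.matrix(t_df)
typeof(t_df)   # mixed column types are coerced to a single common type
```

Because a matrix holds a single type, transposing a data frame with both numeric and text columns yields a character matrix, so it is usually best to transpose only the numeric part.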

The Tidyr package

The tidyr package is one of the most commonly used R packages for data reshaping. tidyr helps you tidy your data.

It allows you to convert your data into the desired format, which makes it easier to process and analyze.

To install tidyr, use the following command:

install.packages("tidyr")

Once installed, load it into your current R session by using the library() command:

library(tidyr)

The following tidyr functions are very useful for reshaping data and keeping it tidy:

  1. gather()
  2. spread()
  3. unite()
  4. separate()

Let’s look at these functions and their usage:

1. gather() Function

The gather() function helps us in reshaping wide-format data-frames to long-format.

Sometimes, a dataset has values of the same kind spread across several columns, for example one temperature column per city. This creates unnecessary variables. Such a dataset is said to be in the wide-format.

It is often more convenient to stack these similar values together, turning the dataset into long-format. The gather() function allows us to do that.

Code:

month <- month.abb[1:3]
# Three random temperatures per city, drawn from -5 to 47 in steps of 0.01
delhi <- sample(seq(-5, 47, by = 0.01), 3, replace = TRUE)
mumbai <- sample(seq(-5, 47, by = 0.01), 3, replace = TRUE)
chennai <- sample(seq(-5, 47, by = 0.01), 3, replace = TRUE)
bangalore <- sample(seq(-5, 47, by = 0.01), 3, replace = TRUE)
kolkata <- sample(seq(-5, 47, by = 0.01), 3, replace = TRUE)
data <- data.frame(month, delhi, mumbai, bangalore, chennai, kolkata)
data

Code:

gathered_data <- gather(data,key="city",value="avg.temp",-month)
gathered_data

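In tidyr 1.0.0 and later, gather() is marked as superseded in favour of pivot_longer(). It still works, but new features go into pivot_longer(). Here is a minimal sketch of the same kind of reshape, assuming tidyr 1.0.0 or later is installed. It uses a small hypothetical data frame, temps_wide, with fixed values instead of random ones.

```r
library(tidyr)

# Hypothetical wide-format temperatures (fixed values, for illustration only)
temps_wide <- data.frame(
  month  = c("Jan", "Feb"),
  delhi  = c(14.2, 17.5),
  mumbai = c(24.8, 25.3)
)

# Stack every column except month into city / avg.temp pairs
temps_long <- pivot_longer(
  temps_wide,
  cols      = -month,
  names_to  = "city",
  values_to = "avg.temp"
)

temps_long
```

The names_to and values_to arguments play the roles of key and value in gather().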

2. spread() Function

The spread() function is the complement to the gather() function. It spreads long-format data-frames to wide-format.

Code:

spread_data <- spread(gathered_data, key = "city", value = "avg.temp")
spread_data
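In tidyr 1.0.0 and later, the counterpart of spread() is pivot_wider(), and spread() itself is marked as superseded. The sketch below assumes tidyr 1.0.0 or later and uses a hypothetical long-format data frame, temps_long, with fixed values.

```r
library(tidyr)

# Hypothetical long-format temperatures (fixed values, for illustration only)
temps_long <- data.frame(
  month    = c("Jan", "Jan", "Feb", "Feb"),
  city     = c("delhi", "mumbai", "delhi", "mumbai"),
  avg.temp = c(14.2, 24.8, 17.5, 25.3)
)

# One column per city, filled with the matching avg.temp values
temps_wide <- pivot_wider(
  temps_long,
  names_from  = city,
  values_from = avg.temp
)

temps_wide
```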

3. unite() Function

Take a look at the following dataset:

S.no  Month  Year  Temp.
1     jan    2018   4.64
2     feb    2018  19.68
3     jan    2019   2.56
4     mar    2019  36.74

In the dataset, the month and year have separate columns. It looks inefficient, doesn’t it?

The two variables month and year can be in the same column without affecting the information conveyed by the data. This is exactly what the unite() function does.

Code:

months <- c("jan","feb","jan","mar")
year <- c("2018","2018","2019","2019")
temp <- c(4.64,19.68,2.56,36.74)
delhi_temp <- data.frame(months,year,temp)
delhi_temp


Code:

united_delhi_temp <- unite(delhi_temp,"interval",months,year)
united_delhi_temp

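By default, unite() joins the values with an underscore and drops the source columns. The sketch below shows the sep and remove arguments, which change both behaviours. It rebuilds delhi_temp so that it runs on its own; united_dash is a name chosen for this example.

```r
library(tidyr)

# Same layout as delhi_temp above, rebuilt here so the block runs on its own
delhi_temp <- data.frame(
  months = c("jan", "feb", "jan", "mar"),
  year   = c("2018", "2018", "2019", "2019"),
  temp   = c(4.64, 19.68, 2.56, 36.74)
)

# sep sets the separator; remove = FALSE keeps the original columns as well
united_dash <- unite(delhi_temp, "interval", months, year,
                     sep = "-", remove = FALSE)
united_dash
```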

4. separate() Function

The separate() function is the complement to the unite() function. It separates values into separate columns.

Code:

sep_delhi_temp <- separate(united_delhi_temp, interval, c("month", "year"))
sep_delhi_temp
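When no separator is given, separate() splits on any non-alphanumeric character and leaves the new columns as text. A minimal sketch of two optional arguments follows: sep names the delimiter explicitly, and convert = TRUE asks tidyr to convert the new columns to more specific types where possible. The data frame united is a hypothetical stand-in with the same shape as united_delhi_temp.

```r
library(tidyr)

# Hypothetical united data, in the same shape as united_delhi_temp
united <- data.frame(
  interval = c("jan_2018", "feb_2018", "jan_2019", "mar_2019"),
  temp     = c(4.64, 19.68, 2.56, 36.74)
)

# sep names the delimiter; convert = TRUE turns "2018" into an integer
separated <- separate(united, interval, into = c("month", "year"),
                      sep = "_", convert = TRUE)
str(separated)
```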

The Reshape2 Package

Reshape2 is another R package that is used for data reshaping. It can be considered a predecessor of the tidyr package.

The development of the reshape2 package has stopped, but it is still widely found in existing code. Its best-known function is melt(). This section also covers merge(), which is a base R function rather than part of reshape2 but is often used alongside it.

1. melt() Function

The melt() function is very similar to the gather() function from the tidyr package. It melts the input data frame and converts wide-format data into long-format. For example:

Code:

library(reshape2)  # install first with install.packages("reshape2") if needed
mdata <- melt(data, id.vars = "month", variable.name = "city", value.name = "avg.temp")
mdata
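The reverse of melt() in reshape2 is dcast(), which casts long-format data back into a wide data frame. The sketch below assumes reshape2 is installed and uses a small hypothetical data frame, temps_wide, with fixed values so that the round trip is easy to follow.

```r
library(reshape2)

# Hypothetical wide-format temperatures (fixed values, for illustration only)
temps_wide <- data.frame(
  month  = c("Jan", "Feb"),
  delhi  = c(14.2, 17.5),
  mumbai = c(24.8, 25.3)
)

# Wide to long with melt()
temps_long <- melt(temps_wide, id.vars = "month",
                   variable.name = "city", value.name = "avg.temp")

# Long back to wide with dcast(): one row per month, one column per city
temps_back <- dcast(temps_long, month ~ city, value.var = "avg.temp")
temps_back
```

In the formula month ~ city, the left side names the columns that identify each row and the right side names the column whose values become new column headers.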

2. merge() Function

The merge() function, which is part of base R, joins two data frames horizontally by matching rows on one or more common columns. The data frames must therefore share at least one column name to match on. For example:

Code:

months2 <- c("apr", "mar", "feb", "jun")
year2 <- c("2018","2018","2019","2019")
temp2 <- c(38.75,37.68,28.56,41.74)
delhi_temp2 <- data.frame(months2,year2,temp2)
colnames(delhi_temp2) <- c("months","year","temp")
delhi_temp2


Code:

merge_delhi_temp <- merge(delhi_temp,delhi_temp2,by="year")
merge_delhi_temp

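By default, merge() keeps only the rows whose key values appear in both data frames. The all, all.x, and all.y arguments change that. The sketch below uses two hypothetical data frames, temps and rain, that share the key columns months and year. The rainfall values are made up for illustration.

```r
# Hypothetical data frames that share the key columns months and year
temps <- data.frame(
  months = c("jan", "feb", "mar"),
  year   = c("2018", "2018", "2018"),
  temp   = c(4.64, 19.68, 36.74)
)
rain <- data.frame(
  months   = c("jan", "feb", "apr"),
  year     = c("2018", "2018", "2018"),
  rainfall = c(19.2, 21.5, 6.8)    # made-up values
)

# Inner join (default): only key combinations present in both inputs
inner <- merge(temps, rain, by = c("months", "year"))

# Left join: every row of temps, with NA where rain has no match
left <- merge(temps, rain, by = c("months", "year"), all.x = TRUE)

# Full outer join: every row of both inputs
full <- merge(temps, rain, by = c("months", "year"), all = TRUE)

nrow(inner)
nrow(left)
nrow(full)
```

Merging on both months and year avoids pairing every row of one year with every row of the same year in the other data frame, which is what happens when year is the only key.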

Summary

Data reshaping is the first step of any data analysis project. It is closely related to data formatting and data cleaning.

In this article, we looked at the functions in base R that allow us to reshape and transform our data.

We also looked at tidyr, a popular data processing and transformation package. The tidyr package is part of the tidyverse collection of R packages and is useful for beginner, intermediate, and advanced R programmers alike.

Still have doubts about data reshaping in R? Ask us, and our TechVidvan experts will be happy to help you.
