Data Reshaping in R – Popular Functions to Organise Data
In this TechVidvan tutorial, discover why data reshaping is required in R and learn how to use different functions in R to do this.
For an analysis project, the gathered data is messy and unstructured most of the time. It is collected from different sources, has different variables, and has irregular formats.
With this tutorial, learn how to change the gathered data and conform it to our desired format in R. We will take a look at the functions in R that allow us to do this and much more. So, get ready for the ride!
What is Data Reshaping in R?
Before we can perform any kind of analysis, we first need to shape the gathered data into a regular and processable format. We need to ensure that all of the data fits into proper variables.
We also need to take care of missing values and put placeholders in their place that our analysis tools can understand.
This is the first step in any analysis project. We call this process as data reshaping.
Now let’s learn popular functions used for data reshaping in R.
The cbind(), rbind(), and t() Functions
There are many functions in R that allow us to manipulate data objects in many ways.
cbind(), rbind(), and t() are the most commonly used functions for data reshaping. We will be going through them one-by-one:
1. cbind(): The cbind() function allows us to join objects as column. We can combine matrices, data-frames, vectors or any combination of these.
vec1 <- c(1,2,3,4,5) vec2 <- c(6,7,8,9,10) mat1 <- matrix(c(1:15),c(5,3)) mat2 <- matrix(c(16:30),c(5,3)) df1 <- data.frame(matrix(c(1:30),nrow=5)) df2 <- data.frame(matrix(c(31:60),nrow=5)) cbind(vec1,vec2)
Note: The number of rows should be the same for the
cbind() function to work.
Wait! before proceeding ahead get a clear understanding of user-defined functions in R.
2. rbind(): The
rbind() function allows us to join objects as rows.
Note: The number of columns should be the same for the
rbind() function to work.
3. t(): The
t() function transposes a matrix that is it turns the rows into columns and columns into rows.
The Tidyr package
tidyr package is the most commonly used R package for data reshaping in R.
tidyr helps you tidy your data.
It allows you to convert it into the desired format and make it easier to process and analyze. Tidyr simplifies the process of data reshaping.
tidyr, use the following command:
Once installed, include it into your current R session by using the
These functions of
tidyr are very useful in data reshaping and keeping your data tidy.
Let’s look at these functions and their usage:
1. gather() Function
gather() function helps us in reshaping wide-format data-frames to long-format.
Sometimes, datasets have attributes of common concern spread across different columns. This creates unnecessary variables. Such a dataset is said to be in the wide-format.
It would be more efficient to stack similar attributes together and turning the dataset into long-format. The
gather() function allows us to do that.
month <- month.abb[1:3] delhi <- sample(seq(-5,47,by=0.01),3,rep=TRUE) mumbai <-sample(seq(-5,47,by=0.01),3,rep=TRUE) chennai <-sample(seq(-5,47,by=0.01),3,rep=TRUE) bangalore <- sample(seq(-5,47,by=0.01),3,rep=TRUE) kolkata <- sample(seq(-5,47,by=0.01),3,rep=TRUE) data <- data.frame(month,delhi,mumbai,bangalore,chennai,kolkata) data
gathered_data <- gather(data,key="city",value="avg.temp",-month) gathered_data
2. spread() Function
spread() function is the complement to the
gather() function. It spreads long-format data-frames to wide-format.
spread_data <- spread(gathered_data,key="city",value="avg.temp")
3. unite() Function
Take a look at the following dataset:
In the dataset, the month and year have separate columns. It looks inefficient, doesn’t it?
The two variables month and year can be in the same column without affecting the information conveyed by the data. This is exactly what the
unite() function does.
months <- c("jan","feb","jan","mar") year <- c("2018","2018","2019","2019") temp <- c(4.64,19.68,2.56,36.74) delhi_temp <- data.frame(months,year,temp) delhi_temp
united_delhi_temp <- unite(delhi_temp,"interval",months,year) united_delhi_temp
4. separate() Function
separate() function is the complement to the
unite() function. It separates values into separate columns.
sep_delhi_temp <- separate(united_delhi_temp, interval,c("month","year")) sep_delhi_temp
The Reshape2 Package
Reshape2 is another R package that is used for data reshaping. Reshape2 can be considered as an older version of the
The development of the reshape2 package has stopped. The most commonly used reshape2 functions are the
1. melt() Function
melt() function is very similar to the
gather() function from the
tidyr package. It melts the input data frame and converts wide-format data into long-format. For example:
mdata <- melt(data,id=c("month"),variable.name="city", value.name="avg.temp") mdata
2. merge() Function
merge() function can merge data frames. The function merges the input data frames horizontally, therefore, the data frames must have the same variable that is column names. For example:
months2 <- c("apr", "mar", "feb", "jun") year2 <- c("2018","2018","2019","2019") temp2 <- c(38.75,37.68,28.56,41.74) delhi_temp2 <- data.frame(months2,year2,temp2) colnames(delhi_temp2) <- c("months","year","temp") delhi_temp2
merge_delhi_temp <- merge(delhi_temp,delhi_temp2,by="year") merge_delhi_temp
Data reshaping is the first step of any data analysis project. It is also called data formatting and data cleaning.
In this article, we looked at the functions in base R that allow us to reshape and transform our data.
We also looked at the most popular data processing and transformation package, the
tidyr package. The
tidyr package is a part of the tidyverse collection of R packages. It is a must-have for beginner, intermediate and advanced R programmers.
Still, have some doubts about the data reshaping in R? Ask us and our TechVidvan experts will be happy to help you.
Keep Visiting 🙂