Data manipulation: dplyr. Part 1: select, count, arrange

dplyr package in R

Data manipulation is a process of cleaning, subtracting, adding, altering, or arranging data. In the era of clouds, big data, and enormously large datasets, data manipulation is an extremely useful skill for scientists, data engineers, or data analysts. There are variety of tools which can be used to manipulate data, such as pandas (Python), SQL, dplyr and tidyr (R), or good old Excel. Data manipulation is usually the first step of our adventure with dataset because good analysis cannot be done without tidy and clean dataset.

The first part of this tutorial is based on dplyr package for R. We will cover a few basic functions such as selecting columns, counting observations, and adding additional column with proportions or percentage.

If you do not already have a dplyr package on your computer, use the function install.packages() and library() to install and load the package:

First of all, we need to have a dataset. We will use dataset called ‘games.’ You can download it from Kaggle or from my github (https://raw.githubusercontent.com/statsoffnow/videogames/main/vgsales.csv). We will assign browser address as ‘url’, and next we will read the csv file as ‘games.’ Once it is done, we can take a look at the dataset with the function View(games).

Usually, the first question which needs to be addressed is how many observations we have. To check it we can look at the environment in R. Also, we can just write games which gives us a size of a tibble: 16,598 × 11, therefore we know we have 16,598 observations. Finally, we can use dplyr and write the following code:

This sign %>% is called the pipe operator. What it does is basically forwarding the value to the next function, thanks to this we do not have to use the name of our dataset and more complex codes look a lot cleaner.

When we operate on big data, rarely we need the whole dataset, more often we use just a fraction of all data. Therefore, our next function would be called select() which enables us to select those columns we need.

There are 11 columns in our initial dataset:

[1] "Rank" [2]"Name" [3]"Platform"[4]"Year" [5] "Genre" [6]"Publisher"[7]"NA_Sales" [8]"EU_Sales" [9] "JP_Sales" [10]"Other_Sales" [11]"Global_Sales"

Let’s assume that we need only column with a name of each game. To do so, we use the following code:

This gives us only one column. If you want to create a new dataset with only one column, you only need to add a name of your new dataset at the beginning. For instance, to name your dataset game_names, should you use the code:

It’s easy to find out, that if you need to select two variables, you only add another variable to the previous code. Like this:

Select function has one more useful option to use, namely starts_with(). Imagine that you have a dataset with questionnaire items, for example MMPI. Columns with the results are named MMPI_1, MMPI_2, MMPI_3, MMPI_4 and so on. In addition, you have variables with demographic information, other scales, etc. MMPI consists of over 500 items, you would not want to write ‘MMPI’ 500 times, would you? Here starts_with() function comes to save us. This function selects every column that starts with given characters, for instance starts_with(‘MMPI’) gives us all columns that has MMPI at the beginning.

In our dataset we can apply this function to choose columns which start with a letter ‘p’. Here is how it works:

As you can see, every time the output gives us only 10 first observations. We can change it by adding print(). For example, we want to see the first 17 observations; hence we use:

We know now how to select variables. Let’s do some math. Imagine that we want to count game platforms in our dataset. In other words, we want to answer the question ‘How many times each platform is used in our dataset?’

This code helps us to do so:

In the output, there is an additional column named ‘n’. This column indicates how many times the platform is included in this dataset. However, the output looks a little messy, platforms are presented in alphabetical order. This would be better to present them according to the number it was used. To do this, we make use of the column ‘n’ and the function arrange(). By default, arrange() function gives us ascending order, to do the opposite, we must add ‘desc’ command. The complete code is as follows: