Data manipulation: dplyr. Part 2: mutate, group_by, summarise

dplyr package in R

We will work on a database named ‘games’ which can be downloaded from here https://raw.githubusercontent.com/statsoffnow/videogames/main/vgsales.csv.

To use it in R, it’s enough to copy this code:

We already know dplyr functions such as select, count, and arrange (read about it here). It’s time to combine them with a few more functions. In addition to the crude quantity, it is good to know the proportion and percentage of observations. When we know the ‘n’, meaning number of times the platform appeared in the dataset, we can easily calculate both the proportion and percentage. We need to apply a function called mutate() which adds another column to our output. If we want to add proportion, we combine it with a basic R function prop.table() and the column ‘n’. The whole code looks like this:

However, it looks much better to use % instead of proportion. Percentage is just the proportion multiplied by 100, therefore we need to add ‘*100’ to the existing code:

What may bother us, is the name of the column with percentage. `prop.table(n) * 100` is not the most useful name, so we are going to change it. We will use the same code as above but this time we will add ‘percent’ before the prop.table() function. Look at the output to see what happens now:

Finally, we can arrange this output using the arrange() function we utilized earlier. This time, however, we want to arrange platform by the column we created which is ‘percent’. To do so, we need to add one more line to our previous code. In addition, we may use the function print() to see all platforms. There are 31 platforms, so we can use print(n=31)

Now all platforms add up to 100% and we know that DS and PS2 account for 26% of all platforms. However, imagine that we want the same output, but we want it to be presented for each publisher separately. Let’s do this!

The majority of this code will look the same, we only need to add the grouping function which is ‘group_by()’. We can also skip the arrange() and print() functions for now. Take a look, the number and the percentage of platforms are given for each publisher separately.

Another function crucial in data manipulation is a summarise() function which is used to summarize various quantities for given variables. Imagine that we want to know the average global sales for each platform. Firstly, we need to group our dataset by 31 platforms we have, and then add variable which gives us the mean global sales. We will name this new variable ‘Sales.’

We have a new variable named ‘Sales’ which shows us the mean global sales for each platform. If it’s convenient, we can arrange this variable to easily see which platform has the highest global sales. We’ll utilize the arrange() function, which we already know.

Now we can answer that GB (Game Boy) has the highest mean global sales, compared to all other platforms.