
Visualising Economic Indicators with ggplot2

04 Dec 2018
Welcome to my first post! In this post, I will demonstrate the versatility of one of my favourite R packages, ggplot2, for showing time series data, using two economic indicators:


  1. Air passengers carried (domestic and international aircraft passengers of air carriers registered in a country)

  2. GDP per capita (current US$)



Before plotting, I will show how to prepare your data using the tidyverse. I assume a basic knowledge of R.

We get the economic indicators data from the World Bank. The World Bank is a fantastic source for many datasets on a whole range of economic measures and indicators. We use the datasets GDP per capita (current US$) and Air transport passengers carried. You can download the datasets as csv files from the World Bank website. I got these datasets in November 2018.

All the code (and more!) used in this post can be found in this GitHub repository.

Throughout this post we will deal with some large numbers (over 150,000,000). To prevent R from displaying these using scientific notation we will put the following code snippet at the top of our script.

options(scipen = 999)
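
As a quick check of what this option changes, the sketch below formats the same number with R's default setting and then with scipen = 999. The variable names are illustrative only and are not part of the original script.

```r
# illustrative value, of the same magnitude as the numbers in this post
big_number <- 150000000

# with the default penalty (scipen = 0) R prefers the shorter scientific form
options(scipen = 0)
default_format <- format(big_number)

# with a large penalty R prefers fixed notation
options(scipen = 999)
fixed_format <- format(big_number)

print(default_format)
print(fixed_format)
```

Note that the sketch leaves scipen at 999, as in the rest of the post.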


Read in the Data


Firstly, we will read in our two csv files using the read_csv() function from the readr package. Note that I have already renamed the csv files.

library(readr)
# read in data
air_transport_passengers_carried_raw <-
read_csv("data/air_transport_passengers_carried.csv", skip = 4)
gdp_per_capita_raw <- read_csv("data/gdp_per_capita.csv", skip = 4)


We specify the argument skip = 4 in order to skip the first four lines of the csv files and read in the data from line 5. This is so that we only read in the data that we want. Note that it’s always a good idea to store the initial dataframe that you read into your workspace as a source, or raw, dataframe and not to overwrite it.
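
To see the effect of skip on its own, here is a minimal sketch that writes a small stand-in file to a temporary location and reads it back. The file contents are invented for illustration (they are not World Bank values), and the four metadata lines are simplified so that none of them is blank.

```r
library(readr)

# write a stand-in csv: four metadata lines, then the header row and the data
# (all contents are made up for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Data Source,Example",
  "Last Updated Date,2018-11-01",
  "Note,first made-up metadata line",
  "Note,second made-up metadata line",
  "Country Name,Country Code,1970,1971",
  "Aland,ALD,100,110",
  "Borduria,BRD,200,220"
), tmp)

# skip the four metadata lines so that line 5 is used as the header
example_raw <- read_csv(tmp, skip = 4)

names(example_raw)
nrow(example_raw)
```

Without skip = 4, the first metadata line would be treated as the header and the columns would not line up with the data.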

We now have our two dataframes and are ready to prepare the data and then visualise.

Prepare our Data


Before we begin plotting the data, we need to tidy our dataframes so that they can be plotted as time series. As both of our csv files are in the same format, we perform the same data preparation steps on both:


  1. We remove superfluous columns: the column which gives the economic indicator code and the final column which is a blank column. We do this by using dplyr::select().

  2. We rename some of the columns so that they are easier to use in our R session. I prefer all lower case with underscores for column names.

  3. At this point, we have a dataframe with 264 observations and 61 variables, which is too large for plotting. Here we reduce the size of our dataframe:


    1. Firstly, there appear to be a lot of NAs in our dataframe, especially for the 1960s. We use dplyr::select_if() to eliminate columns where every observation in the column is NA. We negate all(is.na) with “!” to select only those columns where there is at least one observation that is not NA.

    2. We can still see plenty of NAs in our data. We previously looked on a column basis, but now we look on a row basis. We now use tidyr::drop_na() to filter out any rows that contain at least one NA.



Note, we could have taken the approach of imputing missing data for some of the observations, instead of filtering out data, but this is beyond the scope of this post.

It is also worth noting that the order of our data cleansing steps matters. If we had used tidyr::drop_na() before select_if(~!all(is.na(.))), we would have been left with an empty dataframe: because there was no data in the 1960s, every row contained at least one NA and so every row would have been dropped.
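
The effect of the ordering can be seen on a small toy dataframe. This is only a sketch: the data and the names toy, rows_first and columns_first are invented for illustration.

```r
library(dplyr)
library(tidyr)

# toy data: the 1960 column is entirely NA, and country "C" is missing 1971
toy <- tibble(
  country = c("A", "B", "C"),
  `1960`  = c(NA_real_, NA_real_, NA_real_),
  `1970`  = c(1, 2, 3),
  `1971`  = c(4, 5, NA)
)

# rows first: every row has an NA in the 1960 column, so nothing survives
rows_first <- toy %>%
  drop_na()

# columns first: the empty 1960 column is removed, then only "C" is dropped
columns_first <- toy %>%
  select_if(~ !all(is.na(.))) %>%
  drop_na()

nrow(rows_first)
nrow(columns_first)
```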

Now, we want to transform our dataframe from wide to long. To plot in ggplot2 it is ideal to have tidy data, where rows are observations and columns are variables.

To make our data tidy, we use tidyr::gather(). As per the documentation gather() “takes multiple columns and collapses into key-value pairs”. We pass the names of the new columns to key and value, in this case key = year and value = passengers and we finally pass the columns we want to populate as values, in this case the values representing passengers in each year.
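
For readers on a newer version of tidyr (1.0.0 or later), pivot_longer() is the suggested replacement for gather(). The sketch below reshapes an invented toy dataframe both ways; it assumes a tidyr version that provides pivot_longer() and is not the code used for the plots in this post.

```r
library(dplyr)
library(tidyr)

# toy wide data, one column per year (values are invented)
wide_toy <- tibble(
  country = c("A", "B"),
  `1970`  = c(10, 20),
  `1971`  = c(11, 21)
)

# the approach used in this post
long_gather <- gather(wide_toy, key = year, value = passengers, `1970`:`1971`)

# the equivalent call with pivot_longer()
long_pivot <- pivot_longer(wide_toy,
                           cols = `1970`:`1971`,
                           names_to = "year",
                           values_to = "passengers")

long_gather
long_pivot
```

Both results have the same columns and values. Only the row order differs: gather() stacks one year after another, whereas pivot_longer() keeps the rows for each country together.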

We do all of this in one pipe. For the Air Passengers data this looks like:
# clean up air_transport_passengers_carried
library(dplyr)
library(tidyr)
air_transport_passengers_carried <- air_transport_passengers_carried_raw %>%
# remove superfluous columns
select(-c("Indicator Code", "X63")) %>%
# rename columns
rename(country = "Country Name", country_code = "Country Code",
indicator = "Indicator Name") %>%
# remove columns where all rows are na
select_if(~!all(is.na(.))) %>%
drop_na() %>% # drop any rows where there are any na
# transform our data from wide to long
gather(key = year, value = passengers, `1970`:`2017`)


and for the GDP per capita data this looks like:
library(dplyr)
library(tidyr)
# clean up dataframes, gdp_per_capita
gdp_per_capita <- gdp_per_capita_raw %>%
select(-c("Indicator Code", "X63")) %>%
rename(country = "Country Name",
country_code = "Country Code", indicator = "Indicator Name") %>%
select_if(~!all(is.na(.))) %>%
drop_na() %>%
gather(key = year, value = gdp_per_capita, `1960`:`2017`)


We are now ready to plot our data!

Plot 1: Economic Indicators Over Time, Coloured by Country



For all our plots, we are going to look at a subset of countries: France, India, Italy, Singapore and the UK. We do this by using dplyr::filter() with %in% and passing a character vector of countries. We then pipe this result (not storing it) into a ggplot2 expression.

We want to create a coloured line by country and we do this by using the group = country aesthetic, and colouring these groups by country using the col = country aesthetic. We then specify the geoms (i.e., the type of chart we want) with geom_line() and geom_point().

At this stage, our code looks like:
library(dplyr)
library(ggplot2)
air_transport_passengers_carried %>%
filter(country %in% c("France",
"India",
"Italy",
"Singapore",
"United Kingdom")) %>%
# the group aesthetic is important to use geom_line and geom_point together
ggplot(aes(x = year, y = passengers, group = country, col = country)) +
# create a line plot with geom_line and geom_point
geom_line() +
geom_point()
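
One detail behind the group aesthetic: gather() stores the year as a character column, so ggplot2 treats the x-axis as discrete and needs to be told which points belong to the same line. An alternative, sketched below on invented toy data, is to convert the year to an integer so that the x-axis is continuous. This is an optional variation rather than the approach used in the rest of this post.

```r
library(dplyr)
library(ggplot2)

# toy long-format data shaped like the output of gather() (values invented)
toy_long <- tibble(
  country    = rep(c("A", "B"), each = 3),
  year       = rep(c("1970", "1971", "1972"), times = 2),
  passengers = c(10, 12, 15, 20, 21, 25)
)

# convert the character year to an integer so that the x-axis is continuous
toy_numeric <- toy_long %>%
  mutate(year = as.integer(year))

# with a continuous x-axis, the colour aesthetic alone defines the groups
p <- ggplot(toy_numeric, aes(x = year, y = passengers, col = country)) +
  geom_line() +
  geom_point()

p
```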


We then add a title and axis labels, and format the y-axis labels so that our large numbers are displayed with commas. We do the latter by using scales::comma. Our code now looks like:

library(dplyr)
library(ggplot2)
air_transport_passengers_carried %>%
filter(country %in% c("France",
"India",
"Italy",
"Singapore",
"United Kingdom")) %>%
# the group aesthetic is important to use geom_line and geom_point together
ggplot(aes(x = year, y = passengers, group = country, col = country)) +
# create a line plot with geom_line and geom_point
geom_line() +
geom_point() +
ggtitle("Number of Air Transport Passengers per Year (1970 - 2017)") +
ylab("Number of Passengers") +
xlab("Year") +
# use the scales package to display the y axis labels with comma
scale_y_continuous(labels = scales::comma)
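
To see what scales::comma does on its own, it can be called directly on a vector of numbers. The values below are invented for illustration.

```r
library(scales)

# illustrative passenger counts
passenger_counts <- c(1500, 2500000, 150000000)

# comma() returns character labels with a comma as the thousands separator
comma_labels <- comma(passenger_counts)
comma_labels
```

Passing the function itself (labels = scales::comma, without brackets) lets ggplot2 apply it to whatever axis breaks it chooses.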


We now add theme elements to prettify our plot. Theme elements format all non-data display. We can use any of the available ggplot2 themes. I personally like theme_minimal(), which is very minimalistic and increases the data-to-ink ratio. After using this theme, we add some further theme elements to:


  1. rotate the x-axis labels by 45 degrees

  2. centre the plot title

  3. move the legend to the bottom of the plot



After the theme elements our final plot code looks like:

library(dplyr)
library(ggplot2)
# plot data, line graph over time
air_transport_passengers_carried %>%
filter(country %in% c("France",
"India",
"Italy",
"Singapore",
"United Kingdom")) %>%
# the group aesthetic is important to use geom_line and geom_point together
ggplot(aes(x = year, y = passengers, group = country, col = country)) +
# create a line plot with geom_line and geom_point
geom_line() +
geom_point() +
ggtitle("Number of Air Transport Passengers per Year (1970 - 2017)") +
ylab("Number of Passengers") +
xlab("Year") +
# use the scales package to display the y axis labels with comma
scale_y_continuous(labels = scales::comma) +
# choose the ggplot theme minimal
theme_minimal() +
# rotate the x-axis labels by 45 degrees
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
# centre the plot title
theme(plot.title = element_text(hjust = 0.5)) +
# position the legend at the bottom of the plot
theme(legend.position = "bottom")


Observe that we have to place the theme_minimal() code first and then add the other theme elements afterwards. If we put the theme_minimal() code last, then this would overwrite all our other theme elements.
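
The ordering can be checked without drawing a plot, because theme objects can be added together directly. The sketch below compares the two orders; the object names are illustrative.

```r
library(ggplot2)

# complete theme first, tweak second: the tweak is kept
theme_then_tweak <- theme_minimal() + theme(legend.position = "bottom")

# tweak first, complete theme second: the complete theme replaces the tweak
tweak_then_theme <- theme(legend.position = "bottom") + theme_minimal()

theme_then_tweak$legend.position
tweak_then_theme$legend.position
```

In the second case the legend position falls back to whatever theme_minimal() defines, which is why the complete theme needs to come first.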

We have the same code structure for the GDP per capita plot.

library(dplyr)
library(ggplot2)
# plot data, line graph over time
gdp_per_capita %>%
filter(country %in% c("France",
"India",
"Italy",
"Singapore",
"United Kingdom")) %>%
ggplot(aes(x = year, y = gdp_per_capita, group = country, col = country)) +
geom_line() +
geom_point() +
ggtitle("GDP per Capita (1970 - 2017)") +
ylab("GDP per Capita (US$)") +
xlab("Year") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "bottom")


And our two graphs look like:




Instead of putting all our five time series (one for each country) on each graph, we can show them on five different graphs. We don’t have to create five graphs though, as we can use the facet_grid() function to create five different plots (facets), one for each country, using the same ggplot2 code.

At the end of our ggplot2 code we use + facet_grid(rows = vars(country), scales = "free_y"). The scales option allows us to specify that we do not want the same axis range for each of the facets. In other words, the axis scales are “free”.

Now, our total ggplot2 code is:

library(dplyr)
library(ggplot2)
# instead use facet_grid to show on different graphs
# air transport passenger numbers
air_transport_passengers_carried %>%
filter(country %in% c("France",
"India",
"Italy",
"Singapore",
"United Kingdom")) %>%
# a constant colour is set outside aes(), directly in each geom
ggplot(aes(x = year, y = passengers, group = country)) +
geom_line(col = "red") +
geom_point(col = "red") +
ggtitle("Number of Air Transport Passengers per Year (1970 - 2017) by Country") +
ylab("Number of Passengers") +
xlab("Year") +
scale_y_continuous(labels = scales::comma) +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "none") +
facet_grid(rows = vars(country), scales = "free_y")


Notice that here, I have decided to use the theme_light() theme, as I think it makes a better looking plot when used in combination with facet_grid().



We also do the same for GDP per capita (for this code, see the GitHub repository).



Plot 2: Plot Both Indicators Together, per Country


So far we have had one plot for Air Passengers, and another for GDP per capita, faceting by country. We now look to put both economic indicators on the same plot for a country. However, as the two economic indicators differ considerably in absolute value, we cannot sensibly plot them on the same chart (not even using a dual axis). Therefore, we will set the 1970 value for both indicators to 1, so that we can plot them on the same chart. The plots will then show the change (growth) relative to 1970, rather than the absolute value.

The steps are the same for each of the countries. In this post we will only show the steps for the UK, but the code for all five countries is available in the GitHub repository.

We start by filtering the two dataframes for the UK:

library(dplyr)
# united kingdom
air_transport_passengers_carried_uk <- air_transport_passengers_carried %>%
filter(country == "United Kingdom")

gdp_per_capita_uk <- gdp_per_capita %>%
filter(country == "United Kingdom")


We now use dplyr::inner_join() to join the two _uk dataframes together. We use inner_join() so that we only have years in the dataframe where we have values for both indicators.

library(dplyr)
# join dataframes
uk_indicators <- air_transport_passengers_carried_uk %>%
inner_join(., gdp_per_capita_uk, by = c("year" = "year")) %>%
select(year, passengers, gdp_per_capita)
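
To illustrate why inner_join() only keeps the years that both indicators share, here is a minimal sketch on two invented toy dataframes whose years only partly overlap.

```r
library(dplyr)

# toy data: the two indicators cover different, overlapping years
passengers_toy <- tibble(
  year       = c("1970", "1971", "1972"),
  passengers = c(10, 12, 15)
)
gdp_toy <- tibble(
  year           = c("1971", "1972", "1973"),
  gdp_per_capita = c(2500, 2700, 3000)
)

# only the years present in both dataframes survive the join
joined_toy <- inner_join(passengers_toy, gdp_toy, by = "year")
joined_toy
```

Here 1970 and 1973 are dropped, because each appears in only one of the two dataframes.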


We now create the index. Here we use dplyr::mutate() and create the index using the value of the indicator at each row, divided by the value of the indicator at row 1 (1970).

library(dplyr)
# create index
uk_indicators <- uk_indicators %>%
mutate(passengers.index = passengers/passengers[1],
gdp_per_capita.index = gdp_per_capita/gdp_per_capita[1])
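
Dividing by the value in row 1 relies on 1970 being the first row. A slightly more defensive variation, sketched below on invented toy data, sorts by year and picks the base value by its year rather than by its position. This is an optional alternative, not the code used for the plots in this post.

```r
library(dplyr)

# toy data, deliberately out of chronological order (values invented)
toy_indicators <- tibble(
  year       = c("1971", "1970", "1972"),
  passengers = c(150, 100, 200)
)

indexed <- toy_indicators %>%
  # sort chronologically first
  arrange(year) %>%
  # divide by the 1970 value, selected by year rather than by row position
  mutate(passengers.index = passengers / passengers[year == "1970"])

indexed
```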


As before, for plotting in ggplot2, we want to transform our dataframe from wide to long. To make this easier, we first only select the columns we want (the year column and the index columns).

library(dplyr)
library(tidyr)
# transform dataframe
uk_indicators <- uk_indicators %>%
# select the year and index columns
select(year, passengers.index, gdp_per_capita.index) %>%
# transform the dataframe from wide to long
gather(key = indicator, value = index, -year)


Finally, for completeness, we add an additional column to the dataframe, which specifies the country (this will be of particular use later on).

uk_indicators$country <- "United Kingdom"


We repeat these steps for each of our chosen five countries. We have five dataframes:
uk_indicators, france_indicators, italy_indicators, india_indicators and singapore_indicators.
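
Rather than copying the steps for each country by hand, they could be wrapped in a small function. The sketch below is one possible way to do this and is not the code from the repository: make_indicators() is a hypothetical helper, and the two toy dataframes stand in for the cleaned air passenger and GDP per capita dataframes.

```r
library(dplyr)
library(tidyr)

# toy stand-ins for the two cleaned, long-format dataframes (values invented)
air_toy <- tibble(
  country    = rep(c("France", "Italy"), each = 2),
  year       = rep(c("1970", "1971"), times = 2),
  passengers = c(100, 120, 50, 75)
)
gdp_toy <- tibble(
  country        = rep(c("France", "Italy"), each = 2),
  year           = rep(c("1970", "1971"), times = 2),
  gdp_per_capita = c(2000, 2200, 1000, 1100)
)

# hypothetical helper: filter, join, index and reshape for a single country
make_indicators <- function(country_name, air, gdp) {
  air_one <- air %>%
    filter(country == country_name) %>%
    select(year, passengers)
  gdp_one <- gdp %>%
    filter(country == country_name) %>%
    select(year, gdp_per_capita)

  air_one %>%
    inner_join(gdp_one, by = "year") %>%
    mutate(passengers.index = passengers / passengers[1],
           gdp_per_capita.index = gdp_per_capita / gdp_per_capita[1]) %>%
    select(year, passengers.index, gdp_per_capita.index) %>%
    gather(key = indicator, value = index, -year) %>%
    mutate(country = country_name)
}

# apply the helper to each country and stack the results
countries <- c("France", "Italy")
all_toy <- bind_rows(lapply(countries, make_indicators,
                            air = air_toy, gdp = gdp_toy))
all_toy
```

The trade-off is that separate per-country dataframes are no longer created, so this suits the combined plots better than the single-country ones.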

We plot these indexes on the same ggplot2 graph, using the same code structure as before. However, this time, the group and col aesthetics are indicator, and we add a plot subtitle. As a result we add the code to centre the plot subtitle (theme(plot.subtitle = element_text(hjust = 0.5))), so that it is consistent with the title.

library(dplyr)
library(ggplot2)
# plot index
# uk
uk_indicators %>%
ggplot(aes(x = year, y = index, group = indicator, col = indicator)) +
geom_point() +
geom_line() +
ggtitle("Number of Air Transport Passengers and GDP per Capita UK (1970 - 2017)",
subtitle = "Set 1970 to 1") +
ylab("Index") +
xlab("Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
theme(legend.position = "bottom")




Plot 3: Plot Both Indicators and All Countries


To plot all indicators and countries together we use the five dataframes we previously created (uk_indicators, france_indicators, italy_indicators, india_indicators and singapore_indicators) and join them all together using dplyr::bind_rows().

library(dplyr)
# Show all indexes
all_indicators <- uk_indicators %>%
bind_rows(., france_indicators,
india_indicators,
italy_indicators,
singapore_indicators)


This gives us a dataframe with the indexes (as created earlier) for every country in a single dataframe. We can now use this to plot the indexes for all countries on one graph. Again, we use the same ggplot2 code structure as before, and add the code + facet_grid(rows = vars(country)) to split the graph by country.

library(dplyr)
library(ggplot2)
# plot all indicators
# fixed axis
all_indicators %>%
ggplot(aes(x = year, y = index, col = indicator, group = indicator)) +
geom_point() +
geom_line() +
ggtitle("Air Transport Passengers and GDP per Capita Over Time (1970 - 2017)",
subtitle = "Set 1970 to 1") +
ylab("Index") +
xlab("Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom") +
facet_grid(rows = vars(country))




However, as we have the same, fixed y-axis range for all five countries, we lose detail on the France, Italy and UK plots. A fixed y-axis does allow us to compare between countries (it is clear that passenger numbers have grown strongly for India, and that both air passengers and GDP per capita have grown for Singapore). To recover the detail, we change our facet_grid() code to facet_grid(rows = vars(country), scales = "free_y") to allow the y-axis range to differ between the country plots.

library(dplyr)
library(ggplot2)
# free y axis for all countries
all_indicators %>%
ggplot(aes(x = year, y = index, col = indicator, group = indicator)) +
geom_point() +
geom_line() +
ggtitle("Air Transport Passengers and GDP per Capita Over Time (1970 - 2017)",
subtitle = "Set 1970 to 1") +
ylab("Index") +
xlab("Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom") +
facet_grid(rows = vars(country), scales = "free_y")




On observation, it may be better to compare indexes between countries. We can easily do this by faceting by indicator, which only requires a slight change in our ggplot2 code. We change our col and group aesthetics to country, and change our facet_grid() code to facet_grid(rows = vars(indicator), scales = "free_y").

Then we have the following nice looking plot:



which is built using the following code:

library(dplyr)
library(ggplot2)
# now change the facet from country to indicator
all_indicators %>%
# group and col aesthetics are country
ggplot(aes(x = year, y = index, col = country, group = country)) +
geom_point() +
geom_line() +
ggtitle("Air Transport Passengers and GDP per Capita Over Time (1970 - 2017)",
subtitle = "Set 1970 to 1") +
ylab("Index") +
xlab("Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom") +
# we facet by indicator
facet_grid(rows = vars(indicator), scales = "free_y")


This plot lets us compare the change in each economic indicator between countries, relative to a common starting point. The previous plots instead compared the two economic indicators within each country. To choose the right plot, you first need to determine what you need to show.

Plot 4: Plot Percentage Change as a Line Graph


For the final selection of plots, we change our focus to looking at percentage change in the value of an economic indicator each year.

We again start by preparing our data so that we can easily plot using ggplot2. Once again, the steps are the same for each country. Here we only show the steps for the UK, but all the code is available on GitHub.


  1. We start by taking the Air Transport Passengers dataframe we created and inner join this to the GDP per capita dataframe.

  2. We select the columns that we want.

  3. We then use dplyr::mutate() to create two new columns, a percentage change for passengers and GDP per capita. We use the useful dplyr::lag() function here to calculate the percentage change between years. It is important that your data is ordered correctly (here in chronological year order) when using the lag function.

  4. We select the year and our newly created _perc columns.

  5. As usual, we transform our data into tidy data (transforming our dataframe from wide to long).



library(dplyr)
library(tidyr)
uk_indicators_perc <- air_transport_passengers_carried_uk %>%
# inner join dataframes together, explicitly stating the join
inner_join(., gdp_per_capita_uk, by = c("year" = "year")) %>%
select(year, passengers, gdp_per_capita) %>%
# use lag with the option n = 1 to get the value in the previous row,
# and multiply by 100 so that the result is a percentage
mutate(passengers_perc = (passengers/lag(passengers, 1) - 1) * 100,
gdp_per_capita_perc = (gdp_per_capita/lag(gdp_per_capita, 1) - 1) * 100) %>%
select(year, passengers_perc, gdp_per_capita_perc) %>%
# transform dataframe from wide to long
gather(key = indicator, value = perc_change, -year)
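
Because lag() works on row position, the point about ordering is worth a small check. The sketch below uses invented toy data that is deliberately out of order, and sorts it before computing the year-on-year change. The change is left as a proportion here to keep the arithmetic easy to follow; multiplying by 100 gives a percentage.

```r
library(dplyr)

# toy data, deliberately out of chronological order (values invented)
toy_years <- tibble(
  year       = c("1972", "1970", "1971"),
  passengers = c(121, 100, 110)
)

toy_change <- toy_years %>%
  # sort chronologically so that lag() really refers to the previous year
  arrange(year) %>%
  # the first year has no previous value, so its change is NA
  mutate(change = passengers / lag(passengers, 1) - 1)

toy_change
```

Without the arrange() step, the change for 1970 would be computed against 1972, which is not what we want.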



  6. And finally, we add another column, which specifies the country.



uk_indicators_perc$country <- "United Kingdom"


Once we have done this for all countries, we again union the dataframes using bind_rows:

library(dplyr)
# Show all indexes perc facet_grid
all_indicators_perc <- uk_indicators_perc %>%
bind_rows(., france_indicators_perc,
india_indicators_perc,
italy_indicators_perc,
singapore_indicators_perc)


We now have the yearly percentage change, for both indicators, for all countries, in one dataframe.

We can now plot the data using ggplot2, following the same code structure as before, with a free y-axis.

library(dplyr)
library(ggplot2)
# plot all indicators
# free y axis
all_indicators_perc %>%
ggplot(aes(x = year, y = perc_change, col = indicator, group = indicator)) +
geom_point() +
geom_line() +
ggtitle("Air Transport Passengers and GDP per Capita Over Time (1970 - 2017)") +
ylab("Percentage Change (%)") +
xlab("Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom") +
facet_grid(rows = vars(country), scales = "free_y")




Note that when running this code in RStudio you should get the following warning message:

Warning messages:
1: Removed 10 rows containing missing values (geom_point).
2: Removed 2 rows containing missing values (geom_path).


This is because, by definition, we have no percentage change values for 1970 (in the dataframe the values are NA), but we have left those rows in our dataframe. You should always pay attention to warning messages (and not suppress them), but in this case, we can read the warning message and continue with no further action.
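
If you would rather not see the warning at all, one option is to drop the NA rows explicitly before plotting, so that the removal is visible in your own code. The sketch below shows this on an invented toy dataframe; toy_perc is a stand-in and not one of the dataframes created above.

```r
library(dplyr)
library(tidyr)

# toy data: the first year has no percentage change (values invented)
toy_perc <- tibble(
  year        = c("1970", "1971", "1972"),
  indicator   = "passengers_perc",
  perc_change = c(NA, 0.10, 0.05)
)

# drop only the rows where perc_change is missing
toy_perc_complete <- toy_perc %>%
  drop_na(perc_change)

toy_perc_complete
```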

Plot 5: Plot Percentage Change as a Bar Chart


It is always worthwhile trying a few different plots and changing a couple of things to get the best plot. The line graph we created above is a very useful chart, but if we want to compare the difference in percentage changes by country for each year, we may prefer a bar chart.

To use a bar chart, we first filter our all_indicators_perc dataframe by an indicator and then pipe this result into our ggplot2 code:

library(dplyr)
all_indicators_perc %>%
filter(indicator == "gdp_per_capita_perc")


We change our geom and use geom_col(). As we are no longer showing a line graph, we no longer need to specify the group or col aesthetics, but instead need to specify the fill aesthetic.

Our full ggplot2 code is:

library(dplyr)
library(ggplot2)
# bar chart fixed y axis
# gdp_per_capita_perc
all_indicators_perc %>%
filter(indicator == "gdp_per_capita_perc") %>%
ggplot(aes(x = year, y = perc_change, fill = perc_change)) +
geom_col() +
ggtitle("GDP per Capita Over Time (1970 - 2017)") +
ylab("Percentage Change (%)") +
xlab("Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none") +
facet_grid(rows = vars(country))




I have chosen a fixed y-axis for all five graphs, which allows us to see in which year, and for which country, the percentage changes were the biggest.

And doing the same for Air Transport Passengers:

library(dplyr)
library(ggplot2)
# air_transport_passengers_perc
all_indicators_perc %>%
filter(indicator == "passengers_perc") %>%
ggplot(aes(x = year, y = perc_change, fill = perc_change)) +
geom_col() +
ggtitle("Air Transport Passengers Over Time (1970 - 2017)") +
ylab("Percentage Change (%)") +
xlab("Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none") +
facet_grid(rows = vars(country))




For these five countries, air transport passenger numbers have grown in most years.

Conclusion


In this post we’ve used two economic indicators to demonstrate a few ways of plotting the time series using ggplot2, demonstrating ggplot2’s versatility and ease of use. In the course of doing so, we’ve also seen how to prepare our data using tidyverse to make plotting in ggplot2 possible.

If you see any errors in the code, please let me know. If you have any comments, then please leave them below, and don't forget that you can find all the code for this post (and more!) in the post's GitHub repository. You can also find me on Twitter @stevo_marky.

Session and Package Information


# get session and package info
sessionInfo()


R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.2

other attached packages:
[1] bindrcpp_0.2.2 ggplot2_3.1.0 dplyr_0.7.6 readr_1.1.1 tidyr_0.8.1


Notes


The World Bank data is used in compliance with their Terms of Use. Datasets used are: The World Bank: GDP per capita (current US$) and The World Bank: Air transport, passengers carried.
