Introduction

This training material provides an introduction to trade analysis in R. There are 8 sections in total. The first four sections cover the basics of R, while Sections 5 to 8 focus on trade analysis. If you already know how to program in R and are interested in trade analysis, we suggest you start from Section 5.

International trade is a key ingredient of economic growth, employment and poverty reduction. Being able to analyse trade data sets can help countries better understand their trade with other countries and devise better policies. With the development of technology, more data is now available and easier to access, which means that policymakers can analyse more aspects of trade than before. In this guide we explain how to perform analysis in R for gravity models (Section 5), non-tariff measure (NTM) indicators (Sections 6 and 7) and regulatory distance (Section 8). Basic statistical knowledge is required to follow these sections.

Gravity models are the backbone of applied international trade analysis and have been used in countless research papers. They are of particular interest for policy researchers because they capture the relationship between trade flows, economic size and trade costs, allowing them to explain patterns in international trade and production. Policy researchers can use them to estimate the trade impacts of various trade-related policies, from traditional tariffs to new “behind-the-border” measures. If you are interested in learning more about gravity models, we strongly suggest reviewing ESCAP’s original guide in STATA or the recent version in R.
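As a point of reference, one common textbook formulation of the gravity equation (not specific to this guide) expresses bilateral exports in log-linear form as a function of the economic size of the two countries and the distance between them:

```latex
\ln X_{ij} = \beta_0 + \beta_1 \ln GDP_i + \beta_2 \ln GDP_j + \beta_3 \ln dist_{ij} + \varepsilon_{ij}
```

Here X_ij denotes exports from country i to country j; the distance coefficient is typically estimated to be negative, reflecting that trade falls as trade costs rise.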

NTMs have increased in number over the past decade and currently have a bigger influence on trade than tariffs. Therefore, understanding how they work and being able to assess how much influence they have on a country’s trade is crucial for policymakers. The indicators can be computed over all imports or per product type. An example of an NTM is a temporary geographic prohibition of poultry or cattle imports from disease-affected countries. In this guide we will show how to compute three indicators using R: the coverage ratio, the frequency index and the prevalence score. The coverage ratio (CR) measures the percentage of trade value subject to NTMs, the frequency index (FI) indicates the percentage of products to which NTMs apply, and the prevalence score (PS) is the average number of NTMs per product. We discuss these in further detail in Sections 6 and 7.
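To preview how these indicators work in practice, here is a minimal sketch in R on a made-up toy data set (the column names and numbers are hypothetical; Sections 6 and 7 use the real data and definitions):

```r
# Toy data: one row per product, with import value and NTM count (hypothetical)
ntm_data <- data.frame(
  product = c("p1", "p2", "p3", "p4"),
  value   = c(100, 50, 200, 150),  # import value per product
  n_ntm   = c(2, 0, 1, 0)          # number of NTMs applied to the product
)

affected <- ntm_data$n_ntm > 0  # products subject to at least one NTM

# Coverage ratio: share of import value subject to NTMs (in %)
CR <- 100 * sum(ntm_data$value[affected]) / sum(ntm_data$value)

# Frequency index: share of products subject to NTMs (in %)
FI <- 100 * sum(affected) / nrow(ntm_data)

# Prevalence score: average number of NTMs per product
PS <- mean(ntm_data$n_ntm)
```

In this toy example, 300 of the 500 units of import value fall on NTM-affected products, so the coverage ratio is 60%.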

Regulatory distance (Section 8) is an indicator which measures the similarity between NTM policies across member States and sectors. This helps policymakers to assess the status quo of NTM-related regional integration and to benchmark progress. We display this information using multidimensional scaling, which shows how far or close countries are according to the regulatory distance. This helps visualize the distance between regional trade agreements.

What is R?

Let’s first begin with a short introduction to R.

R is an object-oriented programming language and free software environment. You can use it for data analytics, statistical and econometric modelling, and visualization. Object-oriented programming is a type of computer programming in which:
- the programmer defines the type of an element or data set
- the programmer defines the functions to be applied to the element or data set
We will explain the different types of elements in R in Section 1. Every time we introduce a new function, we will highlight its explanation in blue.

To download R, go to the CRAN website and click on the version that applies to you: Linux, Mac or Windows.

We choose Windows:

Once this is downloaded, click “Next” through everything and “Finish” at the end, and you’re done!

We can open R and run the code print("Hello!") by writing it in the console and pressing “Enter” like this:

You can see the output below the code.

We can open a new script by clicking on “File” and choosing “New script” from the menu. We can print the phrase “Hello!” by typing print("Hello!") and right-clicking “Run line or selection”, or using the shortcut “Control + R” (“Command + Enter” on Mac). The output will show up in the console in red.

A better program for working with R is RStudio, which we can download from the RStudio website. For the moment you can just choose the first option, which is free. We will use this version for the entire guide. Once again, select the version for your operating system; we choose Windows.

Below is a simple demonstration of how to open a new script in R and run a sample code. As you can see the output shows up in the console (bottom left corner).

You can also run code by highlighting it with your mouse and pressing “Control + Enter” or “Command + Enter” for Mac.

Below is a small demonstration on creating variables, how to view plots, packages, help files and animations.

A package in R is a collection of functions which the user can load and use. R’s base package is always loaded automatically; packages developed by external users, however, have to be downloaded and installed. Furthermore, at the beginning of each new R session, they have to be loaded from the library. Section 4.2 will demonstrate how to download the ggplot2 package, a very powerful package for making various types of plots and animations.
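As a sketch, the two steps for an external package look like this (using ggplot2, covered in Section 4.2, as the example):

```r
install.packages("ggplot2")  # download and install (only needed once per computer)
library(ggplot2)             # load the package at the start of each R session
```

Note that install.packages requires an internet connection, while library only works after the package has been installed.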

To load data, we can click on the “Import data set” in the upper right corner and select the type of file we want to upload. This will prompt a window to pop up where we select the folder in which the data is stored:

The usual way, however, is to set a working directory and load files from there. We will learn more about this in Section 2.

Now, let’s get into some code. No video is needed; you can copy the code straight away!

You will find quizzes in each section. They are not graded, but are simply designed to help you train and learn. Take your time with each section and make sure to complete all quiz questions. We strongly encourage you to complete them, as the questions resemble the survey questions you will answer at the end for your certificate.

All the R code in the text is meant for you to run on your own, provided that you have changed the necessary paths. We recommend completing each subsection in one sitting so as not to lose track. If, however, you decide to spread a section or subsection over two sessions, we recommend saving your script and working environment. Each section introduces new commands and uses new data sets. The entire guide can take from two weeks to a month if you devote a few hours every day, depending on your level (e.g. whether or not you have programming experience).

When you run code from the guide, save it in a script! This way it is easy to pick up where you left off! You can use this file, which contains all the R code from this guide.

We have started an R cheat sheet for you with the main commands; do add more commands to it, as this will be especially helpful in the beginning so you do not forget what each command does! We will highlight in blue the explanation of any new command introduced.

1. R Basics

1.1 Declaring variables and arithmetics in R

Let’s start with basic arithmetic operations like addition, subtraction, multiplication and division.

Let’s first create some variables, in this case “a” and “j”, with values 34 and 88, respectively. We run this in R:

a <- 34
j <- 88

(Remember you run code either by writing it in the console and pressing “Enter”, or by writing it in a script, highlighting it and pressing “Control + Enter”, or “Command + Enter” on Mac.)

Notice that we use <- to say that “a” is equal to 34.

The equal sign = works the same as the <- operator. It is recommended, however, to use <- in order not to confuse = with ==. == is a logical operator (we will use these in Section 2.3), whereas = is for variable or data creation. (There are more reasons why it is advised to use <- when creating variables or data sets. To see the differences between <- and =, [this link](https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators-in-r) presents a good discussion.)

a <- 34
a = 34 

We can do basic arithmetic operations with our variables, such as addition and division shown below:

a + j
## [1] 122
a/j
## [1] 0.3863636

Notice that after each line you see the answer after ## [1]. For the first chunk of code we do not get any output, as we just declare two variables; but when we estimate models or run checks on data sets without assigning the result to an object, we will get output. Let’s continue, and don’t worry, this will become clearer as we go!

Attention! If you happen to run the following code in the console, R will expect something after the plus sign and keep showing a + prompt. To escape this, press the “Escape”/“Esc” key.

a + 

If we want to access the value of a variable, we can do so by just running its name.

a
## [1] 34

We can take logarithms with the log command:

log(a)
## [1] 3.526361

The default logarithm function in R is the natural logarithm.
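If you need a different base, log accepts it as a second argument, and exp is the inverse of the natural logarithm:

```r
log(100, base = 10)  # logarithm base 10
## [1] 2
exp(log(34))         # exp() undoes the natural logarithm
## [1] 34
```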

Quiz 1.1

Let’s now practice! Make sure to have R open when doing these quizzes, so that if you are not sure about the correct answer, you can always test it in R! We will not score you on these quizzes; the idea is to train!

What is the correct way to declare a variable called “export” equal to the logarithm of 1000 (log())? (Hint: Look above how we declared a new variable and then combine this with how we take logarithm.)

A. log(1000)

#Answer A is incorrect because it only gives the log value of 1000, but we want to declare the variable export as the logarithm of 1000, therefore the correct answer is C. export <- log(1000). 

B. export log(1000)

#Answer B is incorrect because it only lists the variable next to the log(1000), however it does not tell R that export is equal to log(1000). In order to tell R that export is equal to log(1000), we must include "<-", therefore the correct answer is C. export <- log(1000). 

C. export <- log(1000)

#Answer C is correct. Good job! :) 

1.2 Class types in R

Variables in R can be of different types, for example a number such as “1011” or a character string such as “abc”. The type of a variable is referred to as its class in R. We check the class of a variable using the class command. The different classes in R have different properties.

The variable a we create is a number and therefore numeric:

a <- 34
class(a)
## [1] "numeric"

We check the help file of a function to know what it does with a “?” in front like this:

?Classes_Details

You can find all the different classes in R by running the code and scrolling down to “Basic classes”.

Remember the help files can be found on the right of your console.

We suggest always having a look at the options of a new command, found in its help file. This makes it clear what the new command can do and shows you whether you may want to change any default settings.

We create a new variable j and assign it a country name; the class in this case is “character”.

j <- "China, PRC"
class(j)
## [1] "character"

If we now try to add the two variables with a + j, we get an error:

a + j

Error in a + j : non-numeric argument to binary operator

Error messages in R can be very helpful because they indicate why the code is not running and help us build better code. For instance, in the case above, R tells you non-numeric argument to binary operator, which signifies that you cannot mathematically add a “non-numeric” object (in this case j <- "China, PRC"). If you are uncertain about what an error message means, you can always google it and add “stackoverflow”.

1.3 Generating distributions

We can get integers from number A to number B such as below:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
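As an aside, if you need a step other than 1, the seq function generalizes the colon operator:

```r
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
```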

We can also generate numbers from the random normal distribution with the rnorm command. We can check the help file once again like this:

?rnorm

Remember to check by yourself the help file of new functions in R. As we go along with this guide, we will explain what new functions do, but for details it is advised to always check the help file.

We can use the above function rnorm to generate 100 numbers with mean of 1 and a standard deviation of 5.

y <- rnorm(100,1,5)

We can represent this in a histogram with the hist command.

hist(y)

We can create an additional variable “x”, which multiplies “y” by 2 and adds 100 random numbers with a mean of 0 and a standard deviation of 1.

x <- y * 2 + rnorm(100,0,1)

We can visualize the relationship between our two variables “x” and “y” by creating a plot with the plot command. We use the main parameter to add a title, col is for the color of the observations and option 2 is for red.

plot(x,y, main="My first R plot", col=2)

We can also create a prettier colorful plot, worthy of journal submission. By setting col to 1:10, we select all colors with codes from 1 until 10.

plot(x,y, main="My first R plot", col=1:10)

The following plots are examples with different colors. Note that “#000000” is the hex code for black.

plot(x,y, col="red")

plot(x,y, col="#000000") #black

#black is referred to as a comment. Comments are helpful for giving extra explanations or notes. When the computer runs the code, it ignores them. You can make comments by starting them with a “#”.

Quiz 1.2

Let’s practice now what we learned so far!

Which answer shows two new variables x and y being generated, and a third line plotting them in green with the title “My second R plot”? (Hint: Check how we did this just now. You can use col="green".)

A. x <- rnorm(100,2,5)
    y <- x + rnorm(100,1,5)
    plot(x,y, main="My second R plot", col="green")

#This is correct. Good job!

B. x
    y
    plot(x,y, main="My second R plot", col="green")

#Answer B is incorrect because it does not specify what x and y should be equal to. The code for the plot is correct though. Therefore the correct answer is A.

C. x <- rnorm(100,2,5)
    y <- x + rnorm(100,1,5)
    plot(x,y)

#Answer C is incorrect because the plot does not specify the title and the points are black, therefore the correct answer is A.

2. Loading data and subsetting

In this section we will learn how to load and work with data. Later, Section 3 will expand on subsetting data sets and show you how to write loops.

2.1 Basic data loading

For this section you can download the data here.

We can load the CSV data from the website link with:

trade_data <- read.csv("https://r.tiid.org/R_tute_data2007-2017.csv")

It is advised to work in R with a working directory set. Set the working directory from which your data frames will be imported and to which the resulting files will be exported by using the function setwd. This is a very important step, as it will allow you to import files from this location without having to indicate the full path to the directory, and RStudio will save all output and backup files into that directory automatically.

You can also copy the path from your “File explorer”. Note for PC users: when inputting the path to any directory in RStudio, it is necessary to replace all backslashes \ with forward slashes /, or you will get an error message.

setwd("C:/path/to_your/working_directory")

You can check your working directory using getwd.

getwd()

## [1] "C:/path/to_your/working_directory"

Once you have set your working directory, we can read the CSV file into R using the command read.csv by just giving the file name. (Remember you can run ?read.csv to see all the options of this R function.)

trade_data <- read.csv("R_tute_data2007-2017.csv")

It is possible to run into the following error. If so, don’t forget to include “.csv” in the file name, and make sure the location of your file is correct.

Error in file(file, “rt”) : cannot open the connection
In addition: Warning message:
In file(file, “rt”) :
cannot open file ‘C:/path/to_your/working_directory/r_tute_data2007-2017.csv’: No such file or directory

The trade_data file we loaded into R is a spreadsheet-like object, called a “data.frame” in R.

class(trade_data)
## [1] "data.frame"

We check the variable names of the data set with the names command:

names(trade_data)
## [1] "reporter" "flow"     "partner"  "year"     "value"

We can also open the data set by clicking on it. Remember from the Introduction, your variables and data sets are located in the upper right corner.

From the names function we know that our trade_data has a reporter column, which shows the country reporting the trade values, and a partner column, which is the destination country. flow specifies whether the value is an export or an import, year is the year the trade value was recorded, and value is the trade value itself.

Let’s now get into some data subsetting!

2.2 Basic data subsetting

We check the first few rows of a data set with the head command:

head(trade_data)
##   reporter flow partner year      value
## 1      213    I     186 2008  156130202
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 5      213    I     181 2008    2407260
## 6      213    I     182 2008   80414681

The head function is very useful when we just want a small glimpse of the data, for instance to see which variables are included. If you need to see more than just the first rows, you can also open the data set by clicking on its name.

If we want to check a specific number of rows, we can do so by specifying this number as in below:

head(trade_data,10)
##    reporter flow partner year       value
## 1       213    I     186 2008   156130202
## 2       213    E     174 2008   117661208
## 3       213    E     172 2008    31986252
## 4       213    E     134 2008  1507966544
## 5       213    I     181 2008     2407260
## 6       213    I     182 2008    80414681
## 7       213    I     676 2008      991884
## 8       213    E     258 2008   107246580
## 9       614    I     110 2008 15302709034
## 10      213    I     311 2008

This way we can create a smaller subset of the data with the first 10 rows to perform initial tests:

first10 <- head(trade_data,10)

There is another way to access row data. We can select the entire first row like this:

first10[1,]
##   reporter flow partner year     value
## 1      213    I     186 2008 156130202

Notice that we use square brackets because this is a data set. If you run the above code but with () you will get an error:

first10(1,)

Error in first10(1, ) : could not find function “first10”.

This is because round brackets () are used to pass arguments to functions, while square brackets [] are used to index data sets.

If we want to select the first five rows, both ways give the same result. Notice in the second way 1:5 means from row 1 to row 5 inclusive.

head(first10, 5)
##   reporter flow partner year      value
## 1      213    I     186 2008  156130202
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 5      213    I     181 2008    2407260
first10[1:5,]
##   reporter flow partner year      value
## 1      213    I     186 2008  156130202
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 5      213    I     181 2008    2407260

We can also get specific rows, by specifying them inside c(). For instance the first and the fifth row only:

first10[c(1,5),]
##   reporter flow partner year     value
## 1      213    I     186 2008 156130202
## 5      213    I     181 2008   2407260

In fact, when indexing the data set like this, the part before the comma specifies the row(s) and the part after the comma specifies the column(s).

For example this gives you the first column:

first10[,1]
##  [1] 213 213 213 213 213 213 213 213 614 213

We can get the first and third column (reporter and partner columns respectively) by index:

first10[,c(1,3)]
##    reporter partner
## 1       213     186
## 2       213     174
## 3       213     172
## 4       213     134
## 5       213     181
## 6       213     182
## 7       213     676
## 8       213     258
## 9       614     110
## 10      213     311

Since the columns have names, we can replace the indices with the column/variable names:

first10[,c("reporter","partner")]
##    reporter partner
## 1       213     186
## 2       213     174
## 3       213     172
## 4       213     134
## 5       213     181
## 6       213     182
## 7       213     676
## 8       213     258
## 9       614     110
## 10      213     311

There is a third way to select a column from a data set: by specifying the name of the data set and including a $ before the name of the column. For example, in this case our data set is called first10 and we want the column named “flow”:

first10$flow
##  [1] I E E E I I I E I I
## Levels: E I

To summarize, these three ways of getting the reporter column are all equivalent:

first10[,1]
##  [1] 213 213 213 213 213 213 213 213 614 213
first10[,"reporter"]
##  [1] 213 213 213 213 213 213 213 213 614 213
first10$reporter
##  [1] 213 213 213 213 213 213 213 213 614 213

Which of these you use is a matter of personal preference; they are all equivalent. In cases where it is helpful to show the variable names in the code, the second and third ways are preferred.

If we had a column/variable name with a space in it (e.g. from a CSV file), we would have to include quotation marks:

first10$'reporter code' 

These two lines are equivalent:

first10$'flow'
##  [1] I E E E I I I E I I
## Levels: E I
first10$"flow"
##  [1] I E E E I I I E I I
## Levels: E I

We can also check the class for a column:

class(first10$flow)
## [1] "factor"

In this case “flow” is a factor. A factor is a variable in R with a limited number of distinct values, referred to as “levels” in R. A factor is also known as a categorical variable.
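A quick illustration with made-up flow values shows how a factor stores its levels:

```r
f <- factor(c("E", "I", "E", "I", "I"))
levels(f)      # the distinct categories
## [1] "E" "I"
as.integer(f)  # internally, each value is an index into the levels
## [1] 1 2 1 2 2
```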

Let’s combine the knowledge about selecting rows and columns. We can select just the first five reporters:

first10[1:5,"reporter"]
## [1] 213 213 213 213 213

Or the first 5 reporters and partners, the two codes are equivalent:

first10[1:5,c(1,3)]
##   reporter partner
## 1      213     186
## 2      213     174
## 3      213     172
## 4      213     134
## 5      213     181
first10[1:5,c("reporter", "partner")]
##   reporter partner
## 1      213     186
## 2      213     174
## 3      213     172
## 4      213     134
## 5      213     181

Quiz 2.1

Let’s do a quick practice! Remember to keep R open so you can run the possible answers and see which one is correct.

Which answer:
- creates a data set called mydata with the first 20 rows selected from the trade_data
- and then creates an object called element which chooses the 5th row element from the 2nd column?

(Hint: Check how we got the first10 data and how to select elements. Hint 2: Remember, when selecting elements from a data frame - row comes before column, hence if we want row a and column b, this would be: data[a,b])

A. mydata <- head(trade_data,20)
    element <- mydata[2,5]

#Answer A is incorrect because it chooses the element from the second row, fifth column `mydata[2,5]`, instead of the fifth row, second column `mydata[5,2]`. Hence the correct answer is B.

B. mydata <- head(trade_data,20)
    element <- mydata[5,2]

#Answer B is correct! 

C. mydata <- head(trade_data)
    element <- mydata[7,2]

#Answer C is incorrect because it does not specify 20 rows in the head() command. Running the head function without specifying the number of rows gives us only six rows. Hence the correct answer is B.

2.3 Logical operators

Logical operators are used in programming to check whether statements are TRUE or FALSE. They will be very helpful when we expand on data subsetting in the next section.

If we want to check if a value is equal to another value, we use ==. Let’s check if the value of “a” (which we just declared above) is 34:

a == 34
## [1] TRUE

If we check for 35, it gives FALSE.

a == 35 
## [1] FALSE

We can also check if a value is not equal with !=:

a != 34
## [1] FALSE

Notice how for 34, we get FALSE as the value of a is 34. For 35, we get TRUE because our value is different from 35.

a != 35
## [1] TRUE

Other logical operators are >: greater than, <: less than, >=: greater than or equal to, and <=: less than or equal to.

a < 34
## [1] FALSE
a <= 34
## [1] TRUE

We can also use logical operators with vectors of data:

one2five <- 1:5
three2seven <- 3:7

We can now check, element by element, whether each number in a sequence is equal to one:

one2five == 1
## [1]  TRUE FALSE FALSE FALSE FALSE
three2seven == 1
## [1] FALSE FALSE FALSE FALSE FALSE

If we use the equality operator to compare the two sequences we created, it gives us all FALSE. This is because it compares them element by element, from the first to the last number.

one2five==three2seven
## [1] FALSE FALSE FALSE FALSE FALSE

Fortunately, there is another way to check whether each of the numbers in one2five appears anywhere in three2seven: the %in% operator.

one2five %in% three2seven
## [1] FALSE FALSE  TRUE  TRUE  TRUE

Logical operators are very helpful when subsetting data. For example, let’s select only the row where the reporter value is 614 from our previous data set first10. We do so by writing the name of the data set we want to subset followed by square brackets, in this case first10[]. Inside the brackets we specify the condition and then put a comma, which means the selection applies to all columns. If we wanted the values of a particular column only, we could specify it after the comma.

first10[first10$reporter==614,]
##   reporter flow partner year       value
## 9      614    I     110 2008 15302709034

Just by itself, this chunk of code evaluates, row by row, whether the statement is true for the column named in the condition, in this case the reporter column.

first10$reporter==614
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

We can also select only the rows with exports (flow==“E”).

first10[first10$flow=="E",]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 8      213    E     258 2008  107246580

We can have the criteria above and also select rows with value > 100,000,000 by using &:

first10[first10$flow=="E"&
          first10$value>100000000,]
##      reporter flow partner year value
## NA         NA <NA>      NA   NA  <NA>
## NA.1       NA <NA>      NA   NA  <NA>
## NA.2       NA <NA>      NA   NA  <NA>
## NA.3       NA <NA>      NA   NA  <NA>

The & operator stands for “and”.

Make sure the & operator comes at the end of the first line of code; otherwise R does not know that the next line is part of the same expression.

This gives us rows of NA (and a warning) because the column “value” is a factor, so the numeric comparison is not meaningful.

class(first10$value)
## [1] "factor"

We can change the value column from being factor to numeric like this:

first10$value <- as.numeric(as.character(first10$value))

When converting from factor to numeric, we first need to convert to the character class and then to numeric; otherwise R returns the index in the factor’s levels rather than the actual value. We convert to the character class using the as.character command, and then to numeric using as.numeric.
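A small demonstration with toy values shows why the intermediate as.character step matters:

```r
v <- factor(c("300", "100", "200"))
as.numeric(v)                # wrong: returns the level indices
## [1] 3 1 2
as.numeric(as.character(v))  # right: returns the actual values
## [1] 300 100 200
```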

Now the class is numeric:

class(first10$value)
## [1] "numeric"

So we can subset by value with the > operator:

first10[first10$flow=="E"&
          first10$value>100000000,]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544
## 8      213    E     258 2008  107246580

We can also subset exports with a value of more than 100 million for partner 174:

first10[first10$flow=="E"&
          first10$value>100000000&
          first10$partner==174,]
##   reporter flow partner year     value
## 2      213    E     174 2008 117661208

Same as above, but partner is 174 or 134:

first10[first10$flow=="E"&
          first10$value>100000000&
          first10$partner %in% c(174,134),]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544

Select any flows of any value of partner 174 or 134, where “|” is the “or” operator:

first10[first10$partner==174|first10$partner==134,]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544

The “or” (|) operator means a row is selected if either condition holds, so we get both partners where they exist.

In fact, these two lines of code are equivalent in this case:

first10[first10$partner %in% c(174,134),]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544
first10[first10$partner==174|first10$partner==134,]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544

Quiz 2.2

Okay, that was a lot. This is a very important part, so let’s try to practice it!

Which answer correctly applies all the conditions listed below:
- import “I” flow
- trade values larger than (>) 1000
- both partners with codes 186 and 181?

(Hint: We just did something similar, look at the code!)

A. first10[first10$flow=="I"&
                    first10$value>1000&
                    first10$partner %in% c(186,181),]

#Answer A is correct. Good job! :). 

B. first10[first10$partner==186|first10$partner==181,]

#Answer B is incorrect because it only selects the partners correctly (186 and 181), but does not select the import ("I") flow and the trade value to be higher than 1000. Hence, answer A is the correct answer. 

This concludes Section 2! From now on we will work with different data sets, and in order to free up memory in R, you can always run the rm command to delete data sets that are no longer in use from your environment. In this case we delete the first10 data set:

rm(first10)

3. More data subsetting and loops in R

3.1 Data subsetting continued

In this section we will dig into more data subsetting and work with the trade_data from the previous section. You will also learn how to make loops!

Once again (if not already done), we load the data here:

trade_data <- read.csv("R_tute_data2007-2017.csv")

Make sure you are working in your previous directory! You can check this with the command getwd.

We can check the number of rows with nrow.

nrow(trade_data)
## [1] 1244330

In this case there are 1244330 rows.

An important note: if you have CSV files this big, do not open and save them in Excel, as Excel’s limit is 1,048,576 rows and the full file will not be loaded. If you save the file in Excel and then open it in R, R will only load those 1,048,576 rows.

Once again we check the variables in this data set:

names(trade_data)
## [1] "reporter" "flow"     "partner"  "year"     "value"

We can check the first values for a specific column:

head(trade_data$year)
## [1] 2008 2008 2008 2008 2008 2008

We can also see the unique years in the data set with unique:

unique(trade_data$year)
##  [1] 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

In the country codes used in this data set, the world is “1” and China is “924”.

Let’s find the row that shows the total Chinese exports to the world in 2016:

trade_data[trade_data$flow=="E"& #selects exports
trade_data$reporter==924&        #selects China as reporter
trade_data$year==2016&           #selects the year as 2016
trade_data$partner==1, ]         #selects World as partner
##         reporter flow partner year       value
## 1019644      924    E       1 2016 2.13659E+12

We can set a global option to turn off scientific notation by running the following command:

options(scipen = 999)

We can make the trade value numeric:

trade_data$value <- as.numeric(as.character(trade_data$value))

A warning is issued that missing values have been introduced. Some functions cannot work with missing values, so it is essential to either substitute them with estimated values or exclude them. Given the limited scope of this guide, we will ignore the missing values for now and exclude them. (If you are interested in this topic, there are methods such as imputation, where under the right circumstances we can “impute”, or estimate, the missing values.)

We can exclude them from our data set in the following way:

trade_data <- trade_data[!is.na(trade_data$value),]

Notice how in the above line of code we first specify our data set, because this is what we want to change, and then apply the command is.na to the variable value. is.na returns TRUE wherever a value is missing. We put a ! in front of is.na to negate it, so we keep only the rows where value is not missing.
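To see how is.na and ! behave, here is a minimal example on a small toy vector (v is made up purely for illustration):

```r
v <- c(10, NA, 30)

is.na(v)      # TRUE where a value is missing: FALSE  TRUE FALSE
!is.na(v)     # negated:                        TRUE FALSE  TRUE
v[!is.na(v)]  # keeps only the non-missing values: 10 30
```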

We can now check the value in billion by dividing the previous function by a billion. Notice after the comma we specify the value column.

trade_data[trade_data$flow=="E"&
trade_data$reporter==924&
trade_data$year==2016&
trade_data$partner==1,"value"
]/1000000000
## [1] 2136.59

We can also sum all the Chinese export values to partners other than the world, to see whether the total matches the number above (2136.59). We first need to subset the data to exclude the world, which we do by requiring the partner to be not equal (!=) to the world.

head(trade_data[trade_data$flow=="E"&
trade_data$reporter==924&
trade_data$year==2016&
trade_data$partner!=1,
])
##         reporter flow partner year        value
## 1016163      924    E     293 2016   6081495000
## 1016164      924    E     672 2016   1236859000
## 1016166      924    E     744 2016   1320129000
## 1016184      924    E     399 2016    217025000
## 1016193      924    E     156 2016  27854748000
## 1016255      924    E     405 2016 102783000000

Using this we can create a subset data with export values in billion:

CN_X_2016<-trade_data[trade_data$flow=="E"&
trade_data$reporter==924&
trade_data$year==2016&
trade_data$partner!=1,
"value" #added value because that's what we're interested in
]
sum(CN_X_2016)/1000000000
## [1] 8012.502

Notice that this number is much greater than what we got above (2136.59). This is because the partner list includes economic regions as well as countries, so country values are counted twice or more. To know which codes refer to single countries and which to economic regions, we can use the IMF’s data dictionary. Click here to download.

codes <- read.csv("imf_codes.csv")
head(codes)
##   code iso2                  country isCountry                 region
## 1  110 XR29 Advanced Economies (IMF)         0
## 2  512   AF              Afghanistan         1           Asia Pacific
## 3  605   F1                   Africa         0
## 4  799  F19     Africa not specified         1 Africa and Middle East
## 5  914   AL                  Albania         1                 Europe
## 6  612   DZ                  Algeria         1 Africa and Middle East
##          income_level                 AP_region LLDC LDC SIDS
## 1                                                 NA  NA   NA
## 2          low_income South and South-West Asia    1   1    0
## 3                                                 NA  NA   NA
## 4                                                  0   0    0
## 5 upper_middle_income                              0   0    0
## 6 upper_middle_income                              0   0    0

In this case we need only the columns: code, country, isCountry, region.

We can overwrite the original codes data frame with only those columns:

codes <- codes[, c("country","code", "isCountry", "region" )]

We can also change the order:

codes <- codes[, c("code","country", "isCountry", "region" )]

Now, let’s change trade_data so it only includes countries and no economic regions.

How can we do so? There is a column called isCountry which, when equal to 1, signifies that the code belongs to a country. We would also like to keep the world trade values. How can we combine those two conditions?

  1. First we filter: codes$isCountry==1 by selecting only rows where isCountry==1.
  2. Then we add the “or” condition ( | ) where we select rows with code column equal to 1.
  3. After the comma, we select only the code column from the data frame. (If left empty, we will get the whole data set - try it! codes[codes$isCountry==1|codes$code==1,])

We use this and create a new data subset of the data set codes:

cntry_w <- codes[codes$isCountry==1|codes$code==1,"code"]

So cntry_w is a vector of IMF codes that are real countries (not aggregates like advanced economies), plus the code 1 representing the world total.

We can use this vector to extend the filters of CN_X_2016 and compose a data set of Chinese exports in 2016 to countries and the world only. We do so by specifying trade_data$partner %in% cntry_w, meaning we keep only partners whose code appears in cntry_w.

CN_X_2016<-trade_data[trade_data$flow=="E"& #select only exports
trade_data$reporter==924& #from china
trade_data$year==2016& #in 2016
trade_data$partner!=1& #where country is not world (code for world is 1)
trade_data$partner %in% cntry_w, #which is in the list of codes that we filtered earlier
"value" #variable we're interested in
]
sum(CN_X_2016)/1000000000
## [1] 2135.245

Now we get 2135.25, which is much closer to 2136.59 than 8012.50 was. The remaining difference is due to world regions that are not classified as countries.

Quiz 3.1

Time for a quick recap! Don’t forget to run the codes in R to check.

What is the correct line of code to select all the exports from China (924) to the world (1) for 2016? (Hint: Look at what we did previously for 2016.)

A. trade_data[trade_data$reporter==924&
    trade_data$year==2016&
    trade_data$partner==1,
    "value"]

#Answer A is incorrect because we did not filter on exports (`trade_data$flow=="E"&`). Answer D includes this, hence it is the correct answer. 

B. trade_data[trade_data$flow="E"&
    trade_data$reporter=924&
    trade_data$year=2016&
    trade_data$partner=1,
    "value"]

#Answer B is incorrect because it uses "=" instead of "==". Remember, "==" is for comparing values whereas "=" is for assigning them. (For example, trade = 100 creates a new variable trade with value 100, whereas trade == 100 checks whether the existing variable trade equals 100.) Hence, Answer D is the correct answer. 

C. trade_data[flow=="E"&
    reporter==924&
    year==2016&
    partner==1,
    "value"]

#Answer C is incorrect because the column variables of our dataset are not objects in R, therefore R cannot find them just by name - we have to specify they are from the trade_data dataset - e.g. trade_data$flow. Hence Answer D is correct.

D. trade_data[trade_data$flow=="E"&
    trade_data$reporter==924&
    trade_data$year==2016&
    trade_data$partner==1,
    "value"]

#Answer D is correct. Good job! :). 

3.2 Creating plots

Now let’s try to plot only the year and value from the Chinese exports data set.

First, we create a data set containing China’s total exports to the world (1) for all years available:

CN_X <- trade_data[trade_data$flow=="E"&
trade_data$reporter==924&
trade_data$partner==1,
]
head(CN_X)
##        reporter flow partner year         value
## 24180       924    E       1 2008 1428870000000
## 148613      924    E       1 2009 1202050000000
## 273046      924    E       1 2010 1578440000000
## 397479      924    E       1 2011 1899280000000
## 521912      924    E       1 2012 2050090000000
## 646345      924    E       1 2013 2210660000000

We can use the plot command to show China’s exports to the world per year:

plot(CN_X[,c("year", "value")])

Change the values into trillions:

CN_X$value <- CN_X$value/1000000000000

Looks better now:

plot(CN_X[,c("year", "value")])

Let’s add a title and y-axis title:

plot(CN_X[,c("year", "value")],
main="China's exports",
xlab="",
ylab="USD, trillions")

Notice, we use the main argument to add a title; xlab and ylab name the x and y axes respectively.

Points don’t make sense, let’s use a line instead:

plot(CN_X[,c("year", "value")],
main="China's exports",
xlab="",
ylab="USD, trillions",
type="l") #type line

We use a line by setting type="l" (a lowercase “l” as in line, not the digit 1); the default is "p" for points. (You can check ?plot for details.)

We can also look at China’s imports and plot this alongside with a legend. Have a look at the code first, the explanation is below.

CN_M <- trade_data[trade_data$flow=="I"&
trade_data$reporter==924&
trade_data$partner==1,
]
CN_M$value <- CN_M$value/1000000000000


plot(CN_X[,c("year", "value")],
main="China's exports and imports",
xlab="",
ylab="USD, trillions",
type="l",
ylim=c(1,2.5), #adding y axis limits
col=2, #red color
lty=1 #solid line, see ?par for details
)
lines(CN_M[,c("year", "value")],col=4, lty=2) #blue color, dashed line

legend("bottomleft",
legend = c("Exports", "Imports"),
col=c(2,4),
lty=c(1,2)
)

We first create the data set for Chinese imports CN_M. We filter the trade_data by selecting only imports: trade_data$flow=="I", the reporter as China: trade_data$reporter==924 and the partner as the world: trade_data$partner==1. We put the values of this data set in trillions: CN_M$value <- CN_M$value/1000000000000.

We combine both exports CN_X and imports CN_M into one plot by specifying the options for the first line and afterwards calling lines(), which adds a second line to the existing plot. We first plot CN_X, selecting the year and trade value variables: CN_X[,c("year", "value")]. We include a title main="China's exports and imports" and a y-axis label ylab="USD, trillions", and specify that we want a line with type="l". We can tell R what range to use for the x-axis and y-axis with xlim and ylim, respectively. We set the color with col and a solid line with lty=1. ?par lists all the plot options available. (?plot will also show you some options, but the full list is in par.)

For the second line we specify the data set variables to be used: CN_M[,c("year", "value")], we make the line blue: col=4 and dashed: lty=2.

We add a legend with the command legend. We specify where in our plot we want the legend (the bottom left in this case), provide the names of the items in the legend: legend = c("Exports", "Imports"), and repeat the specifications from the plot calls for the colors: col=c(2,4) and the line types: lty=c(1,2). Notice we combine all of the listed specifications with c(), as we want them to appear together. In Section 4, we will delve deeper into plots.

Let’s now show in a plot the top 10 countries China exported to in 2016.

We start by creating a table with Chinese exports in 2016 to individual countries:

CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==924&
trade_data$year==2016&
trade_data$partner!=1,]

head(CN_X_countries)
##         reporter flow partner year        value
## 1016163      924    E     293 2016   6081495000
## 1016164      924    E     672 2016   1236859000
## 1016166      924    E     744 2016   1320129000
## 1016184      924    E     399 2016    217025000
## 1016193      924    E     156 2016  27854748000
## 1016255      924    E     405 2016 102783000000

We want to have the names of the countries in our plot. To do this, we merge the CN_X_countries data with the codes data by using the merge command.

CNX_merged <- merge(CN_X_countries,
codes, by.x="partner",
by.y="code")
CNX_merged <- CNX_merged[, c("country", "value", "isCountry","region")]

head(CNX_merged)
##                    country         value isCountry        region
## 1    Export earnings: fuel  167185000000         0
## 2 Export earnings: nonfuel 1969410000000         0
## 3 Advanced Economies (IMF) 1403500000000         0
## 4            United States  389714000000         1 North America
## 5           United Kingdom   56645733000         1        Europe
## 6                  Austria    2245591000         1        Europe

To use the merge command we first specify the data sets we want to merge, in this case CN_X_countries with codes. We then specify the names of the columns by which the data sets should be merged, since they have different names but essentially contain the same categorical data: by.x refers to the first data set and by.y to the second. (If the key column had the same name in both data sets, we could simply use by.)
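As a minimal sketch with made-up toy data frames, merging on differently named key columns works like this:

```r
exports <- data.frame(partner = c(1, 924), value = c(100, 50))
lookup  <- data.frame(code = c(1, 924), country = c("World", "China"))

# partner (in exports) and code (in lookup) hold the same IMF codes,
# so we match them with by.x and by.y
merged <- merge(exports, lookup, by.x = "partner", by.y = "code")
merged
##   partner value country
## 1       1   100   World
## 2     924    50   China
```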

Finally, we keep only the variables country, value, isCountry and region. We use isCountry to exclude non-countries and region to give countries in the same region the same color.

We exclude non-countries:

CNX_merged <- CNX_merged[CNX_merged$isCountry==1,]

We sort the data with the order command. This sorts the data frame based on value in ascending order:

CNX_merged <- CNX_merged[order(CNX_merged$value),]
head(CNX_merged)
##                 country   value isCountry                      region
## 65           Montserrat  110000         1 Latin America and Carribean
## 60            Greenland  993000         1               North America
## 11           San Marino 1506000         1                      Europe
## 178               Nauru 1644000         1                Asia Pacific
## 174       Faroe Islands 1737000         1                      Europe
## 68  St. Kitts and Nevis 4850000         1 Latin America and Carribean

This sorts the data in descending order:

CNX_merged <- CNX_merged[order(-CNX_merged$value),]
head(CNX_merged)
##                country        value isCountry        region
## 4        United States 389714000000         1 North America
## 101   Hong Kong, China 293997000000         1  Asia Pacific
## 19               Japan 129617000000         1  Asia Pacific
## 105 Korea, Republic of  95815599000         1  Asia Pacific
## 10             Germany  66044117000         1        Europe
## 116           Viet Nam  62040928000         1  Asia Pacific

We subset the top 10 export countries in CNX_merged_top10. We illustrate the data with a pie chart with the pie command.

CNX_merged_top10 <- head(CNX_merged,10)

pie(CNX_merged_top10$value,
CNX_merged_top10$country)

We can also make a bar plot with the barplot command. We specify the color to be the same for the same regions: col=CNX_merged_top10$region.

barplot(CNX_merged_top10$value,
names=CNX_merged_top10$country,
col=CNX_merged_top10$region)

Let’s do one more example before moving to the loop subsection. We can create the same graphics for the Philippines (566) for 2016.

We first subset the Philippines’ 2016 exports, excluding the world total.

PH_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==566&
trade_data$year==2016&
trade_data$partner!=1,]

head(PH_X_countries)
##         reporter flow partner year       value
## 1081095      566    E     112 2016   475800328
## 1081100      566    E     158 2016 11674107669
## 1081102      566    E     542 2016  2095041077
## 1081103      566    E     544 2016      684932
## 1081108      566    E     343 2016      429046
## 1081109      566    E     433 2016    14130931

We merge the exports to countries with the codes data set:

PHX_merged <- merge(PH_X_countries,
codes, by.x="partner",
by.y="code")
PHX_merged <- PHX_merged[PHX_merged$isCountry==1, c("country", "value", "isCountry","region")]

head(PHX_merged)
##          country      value isCountry        region
## 4  United States 8670654131         1 North America
## 5 United Kingdom  475800328         1        Europe
## 6        Austria   67115695         1        Europe
## 7        Belgium  399641671         1        Europe
## 8        Denmark   35623543         1        Europe
## 9         France  726782604         1        Europe

We sort the data frame based on value in descending order:

PHX_merged <- PHX_merged[order(-PHX_merged$value),]

head(PHX_merged)
##              country       value isCountry        region
## 18             Japan 11674107669         1  Asia Pacific
## 4      United States  8670654131         1 North America
## 97  Hong Kong, China  6582766881         1  Asia Pacific
## 190            China  6192432905         1  Asia Pacific
## 109        Singapore  3700590527         1  Asia Pacific
## 10           Germany  2293130299         1        Europe

We select the top 10 export destinations and create a bar plot.

PHX_merged_top10 <- head(PHX_merged,10)

barplot(PHX_merged_top10$value,
names=PHX_merged_top10$country,
col=PHX_merged_top10$region,
main="PHL")

We can select the country name of the Philippines based on its code 566:

codes[codes$code==566,"country"]
## [1] Philippines
## 245 Levels: Advanced Economies (IMF) Afghanistan Africa ... Zimbabwe

So now we can include the country name in the bar plot:

barplot(PHX_merged_top10$value,
names=PHX_merged_top10$country,
col=PHX_merged_top10$region,
main=codes[codes$code==566,"country"],
las=2) # makes label text perpendicular to axis

The option las = 2 controls the orientation of the axis labels; 2 means “always perpendicular to the axis”. Once again, more options can be found in ?par.

The country names do not fit very well, so we call par with mar, which controls the plot margins (bottom, left, top, right). You may have to experiment with the values until everything fits.

par(mar=c(11,8,1,1))
barplot(PHX_merged_top10$value,
names=PHX_merged_top10$country,
col=PHX_merged_top10$region,
main=codes[codes$code==566,"country"],
las=2)

We can save this as a png file with the png command.

png(filename="exports.png") #give the file a name
par(mar=c(11,8,1,1)) #specify margins
barplot(PHX_merged_top10$value,
names=PHX_merged_top10$country,
col=PHX_merged_top10$region,
main=codes[codes$code==566,"country"],
las=2) # makes label text perpendicular to axis

dev.off()
## png
##   2

The dev.off() call is very important, as without it your plot will not be saved. (In fact, the file is not written to your working directory until you run dev.off().)

Quiz 3.2

Practice by making a top 10 export destination bar plot of your own country. Add a title, change units, add y axis name and try rotating country names 90 degrees. (Hint: Check the “codes” data set to locate your country and check how we did this for China.)

Your plot should resemble the plot for the Philippines.

#First, check "codes" for your country's code. For the moment we assume your country is ABC and the code is XYZ. If you want to replicate this, make sure you replace ABC with your country's name and XYZ - with the code.
ABC_X_countries <- trade_data[trade_data$flow=="E"&
trade_data$reporter==XYZ&
trade_data$year==2017&
trade_data$partner!=1,]

# we merge the trade data with the codes data set in order to have the trade value matched to the reporter's country name, region and isCountry
ABC_X_merged <- merge(ABC_X_countries,
codes, by.x="partner",
by.y="code")
ABC_X_merged <- ABC_X_merged[ABC_X_merged$isCountry==1, c("country", "value", "isCountry","region")]

# order all data in descending order; remember NOT to forget the "-" (minus), otherwise you will get the data in ascending order
ABC_X_merged <- ABC_X_merged[order(-ABC_X_merged$value),]

# we get the top 10 countries
ABC_X_merged_top10 <- head(ABC_X_merged,10)

# we construct the bar plot, don't forget to put the name of your country instead of "ABC"
par(mar=c(11,8,1,1))
barplot(ABC_X_merged_top10$value,
names=ABC_X_merged_top10$country,
col=ABC_X_merged_top10$region,
main="ABC", las=2) 

3.3 Loop with countries

This is a loop:

for (c in 1:3){
print(c*5)
}
## [1] 5
## [1] 10
## [1] 15

We specify that a variable c runs from 1 to 3 inclusive. We then multiply this value by 5; each time this is done, the loop moves to the next item. Each pass is called an iteration. When c=1, the loop multiplies c by 5, prints the result (i.e. 5) and then iterates to c=2, and so on until c=3, the upper bound we defined, is multiplied by 5.

Study well the syntax and the usage of brackets. Once done, do the quiz below!

Quiz 3.3

What is the correct specification if we want to loop through the numbers from 2 to 5 and print out those values times 2? (Hint: Study this code first and compare it with the code above. Hint 2: Run each one to see what happens.)

A. for (c in 2:5){
    print(c)
    }

#Answer A is incorrect because we only print the c value, without multiplying it by 2 as required. Therefore answer B is the correct answer. 
B. for (c in 2:5){
    print(c*2)
    }

#Answer B is correct. Good job! 
C. for (c in 2:5){
    print(c*2)
#Answer C is incorrect because we are missing a bracket at the end! Notice the syntax of a for loop: for (condition) {code to be executed}. Hence, the correct answer is Answer B.
D. for c in 2:5 {
    print(c*2)
    }

#Answer D is incorrect for a similar reason as C, but here we have failed to put the condition inside brackets. Answer B is the same but in brackets, hence it is the correct answer. 

3.4 Loops continued

We can use a loop to create bar plots for all countries all at once! This is great because it saves us time and space! In fact, that is why loops are very popular.

So let’s make a loop that creates bar plots for different countries all at once. We will use the exact same code as the China and Philippines examples. The Philippines example is shown below. Check the comments in the code to remind yourself how we make the bar plot!

# create data subset with Philippine exports to countries in 2016
PH_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==566&
trade_data$year==2016&
trade_data$partner!=1,]

# give the barplot a name and save as png
png(filename="exports.png") #give the file a name
par(mar=c(11,8,1,1)) #specify margins
#bar plot specifications
barplot(PH_X_countries$value,
names=PH_X_countries$partner,
main=codes[codes$code==566,"country"],
las=2) # makes label text perpendicular to axis

dev.off()

So in order to automate this to produce several bar plots we need to make sure in each loop iteration we 1) create a new data subset and 2) update the name of the new bar plot.

Let’s inspect these two points separately, starting with the first one. What would a loop that creates a different subset of data for each country look like? Let’s look at the case below for the Philippines:

PH_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==566&
trade_data$year==2016&
trade_data$partner!=1,]

The line of code selecting the country (trade_data$reporter==566) is the one telling R to make a subset for that country, therefore we need to change this line in order to have a new subset of data in each iteration.

Since we want a different country each time, we need a list of all the unique country codes. We take the unique values of the country codes from the codes data set.

unique_ctry_codes <- codes[codes$isCountry==1,"code"]
head(unique_ctry_codes)
## [1] 512 799 914 612 859 614

Now let’s learn how to iterate to create different data sets!

Let’s inspect the code below line by line. We do this for the first five codes, for(i in 1:5), and iterate over i. This way we create five CN_X_countries data sets. Inside the specification for CN_X_countries, we select exports: trade_data[trade_data$flow=="E". We then iterate on i in trade_data$reporter==unique_ctry_codes[i]: in each iteration the reporter is matched against the i-th code in unique_ctry_codes, and the CN_X_countries data set is rebuilt for that country. The square brackets after the vector unique_ctry_codes give us access to its i-th value. Finally, we select the year 2016, trade_data$year==2016, and include every partner apart from the world, trade_data$partner!=1.

for(i in 1:5){
CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==unique_ctry_codes[i]& #
trade_data$year==2016&
trade_data$partner!=1,]
}

So now that we know how to iterate to create different data sets, let’s find out how to iterate in order to get different bar plots!

Once again, this is the code for the bar plot.

png(filename="exports.png")
par(mar=c(11,8,1,1))
barplot(PH_X_countries$value,
names=PH_X_countries$partner,
main=codes[codes$code==566,"country"],
las=2)

dev.off()

What do we have to change in order to have a different bar plot for each different data set?

We need to change the file name (filename="exports.png"), as we want to save the bar plots under different names, and the title (main=codes[codes$code==566,"country"]) as well. The rest of the parameters are taken from the data subset, so they update automatically when the subset changes. (Hence, we do not need to change them.)

Let’s start with the second one - changing the bar plot title. We can change this the same way we changed trade_data$reporter==566 to trade_data$reporter==unique_ctry_codes[i]. In this case we change main=codes[codes$code==566,"country"] to main=codes[codes$code==unique_ctry_codes[i],"country"].

For the file name, it is a bit more complicated as we not only need to have the name of the country (codes[codes$code==unique_ctry_codes[i],"country"]) but also .png.

We can use paste0 to join the country name with “.png” into a single character string. This changes the code from filename="exports.png" to filename=paste0( codes[codes$code==unique_ctry_codes[i],"country"], ".png" ).
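A quick illustration of paste0 (the country name here is just an example):

```r
country_name <- "Philippines"
paste0(country_name, ".png")  # joins its arguments with no separator
## [1] "Philippines.png"
```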

We combine the subset code with the new code for the bar plot and test our loop:

for(i in 1:5){
CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==unique_ctry_codes[i]&
trade_data$year==2016&
trade_data$partner!=1,]

png(filename=paste0(
codes[codes$code==unique_ctry_codes[i],"country"], ".png"
))
par(mar=c(11,8,1,1))
barplot(CN_X_countries$value,
names=CN_X_countries$partner,
main=codes[codes$code==unique_ctry_codes[i],"country"],
las=2)

dev.off()
}

Error in plot.window(xlim, ylim, log = log, …) :
need finite ‘xlim’ values
In addition: Warning messages:
1: In min(w.l) : no non-missing arguments to min; returning Inf
2: In max(w.r) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf

This gives an error, which most likely means that at least one of the data sets created is empty, so its graph cannot be drawn. Don’t worry! We can fix this. Let’s start by inspecting the different data sets created. How can we check if a data set is empty? We can check the number of rows of each data set with nrow and print this using the print command:

for(i in 1:5){
CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==unique_ctry_codes[i]&
trade_data$year==2016&
trade_data$partner!=1,]

print(nrow(CN_X_countries))
}
## [1] 135
## [1] 0
## [1] 130
## [1] 137
## [1] 98

We see that the second data set has zero rows. When we inspect the codes, we notice that this one corresponds to “Africa not specified”, which is not a country and therefore has no rows in our trade_data. When a code with no matching rows is encountered, the data set CN_X_countries is still created but is empty (nrow(CN_X_countries) is 0). To skip such a data set and move on, we can use an if statement to check whether the data set is empty: if (nrow(CN_X_countries)==0){...}. If this is true, we jump to the next value in line: if (nrow(CN_X_countries)==0){next}. In R, the next command skips the rest of the current loop iteration and goes straight to the next one.
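Here is a minimal stand-alone example of next: the iteration where i equals 2 is skipped entirely.

```r
for (i in 1:4) {
  if (i == 2) {next}  # skip the rest of this iteration
  print(i)
}
## [1] 1
## [1] 3
## [1] 4
```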

We can update the above loop as such:

for(i in 1:5){
CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==unique_ctry_codes[i]&
trade_data$year==2016&
trade_data$partner!=1,]

if (nrow(CN_X_countries)==0){next}

print(nrow(CN_X_countries))
}
## [1] 135
## [1] 130
## [1] 137
## [1] 98

Now all the data sets have observations, so we can add the if statement to the loop where we subset the data and create bar plots:

for(i in 1:5){
CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==unique_ctry_codes[i]&
trade_data$year==2016&
trade_data$partner!=1,]

if (nrow(CN_X_countries)==0){next}

png(filename=paste0(
codes[codes$code==unique_ctry_codes[i],"country"], "2.png"
))
par(mar=c(11,8,1,1))
barplot(CN_X_countries$value,
names=CN_X_countries$partner,
main=codes[codes$code==unique_ctry_codes[i],"country"],
las=2)

dev.off()
}

What if we want bar plots with the top 10 export destinations? The logic is the same, but there are two more steps: we have to merge the trade values with the names and regions of the partners, and sort the trade values in descending order.

Let’s do this for only the first 5 elements of unique_ctry_codes. We begin by setting up the loop and creating a CN_X_countries data set in each iteration. We merge this data set with the codes data set to get the country names and regions, order the merged data, select the 10 export destinations with the highest values, and finally create a bar plot.

Let’s see the loop first and the created bar plots. Don’t worry, a more detailed explanation follows after.

f <- 5
for(i in 1:f){

CN_X_countries <-
trade_data[trade_data$flow=="E"&
trade_data$reporter==unique_ctry_codes[i]&
trade_data$year==2016&
trade_data$partner!=1,]
if (nrow(CN_X_countries)==0){next}

#we merge reporter countries' exports to the codes data set

CNX_merged <- merge(CN_X_countries,
codes, by.x="partner",
by.y="code")
CNX_merged <- CNX_merged[CNX_merged$isCountry==1, c("country", "value", "isCountry","region")]

#sort the data in descending order

CNX_merged <- CNX_merged[order(-CNX_merged$value),]

#select top 10 export destinations

CNX_merged_top10 <- head(CNX_merged,10)

png(filename=paste0(
codes[codes$code==unique_ctry_codes[i],"country"], ".png"
))
par(mar=c(11,8,1,1))
barplot(CNX_merged_top10$value,
names=CNX_merged_top10$country,
col=CNX_merged_top10$region,
main=codes[codes$code==unique_ctry_codes[i],"country"],
las=2)
dev.off()
}

Let’s inspect the code in detail.

Once again our loop runs from 1 to 5 (f <- 5; for(i in 1:f){...}), but we only get 4 bar plots. This is because one of our five elements is not a country. To handle empty data sets we use an if statement and move to the next iteration whenever an empty data set is found: if (nrow(CN_X_countries)==0){next}.

From then on, we proceed as before. We merge the exports data with the codes data in order to have the corresponding country names: CNX_merged <- merge(CN_X_countries, codes, by.x="partner", by.y="code"). We select the variables we want: CNX_merged <- CNX_merged[CNX_merged$isCountry==1, c("country", "value", "isCountry","region")] , order the data: CNX_merged <- CNX_merged[order(-CNX_merged$value),] and choose the top 10 country export destinations: CNX_merged_top10 <- head(CNX_merged,10).

If you want bar plots for all countries available, set f <- length(unique_ctry_codes). (Since unique_ctry_codes is a vector, we use length rather than nrow. Storing the loop’s upper bound in a variable lets us change it easily.)

Good job! In the Section 4 we will learn how to make even better plots!

Quiz 3.4

Which lines would you change in the above loop to have 15 elements through which we want to loop and show top 5 export destinations? (Hint: have a good look at the code above! Hint 2: try out the options below and see what you get!)

A. f <- 15
    CNX_merged_top5 <- head(CNX_merged,5)

#Answer A is incorrect because we forgot to change the specification of the barplot. Answer C is the same as A, except it adds the correct barplot specifications, hence the correct answer in this case.  

B. CNX_merged_top5 <- head(CNX_merged,5)
    barplot(CNX_merged_top5$value,
    names=CNX_merged_top5$country,
    col=CNX_merged_top5$region,
    main=codes[codes$code==unique_ctry_codes[i],"country"],
    las=2)

#Answer B is incorrect because we forgot to change the value for f! Answer C has this, hence it is the correct answer. 

C. f <- 15
    CNX_merged_top5 <- head(CNX_merged,5)
    barplot(CNX_merged_top5$value,
    names=CNX_merged_top5$country,
    col=CNX_merged_top5$region,
    main=codes[codes$code==unique_ctry_codes[i],"country"],
    las=2)

#Answer C is correct. Good job!

4. Graphs

4.1 Plot basics

For this session, please download this. This is the link to the original data.

Let’s load the data set. Since it is a CSV file, we use the read.csv command. We add the argument stringsAsFactors = FALSE because we do not want string values read in as factors. (Remember you can use ?read.csv to see all possible arguments.)

df <- read.csv("sample_data.csv",
              stringsAsFactors = FALSE) #add this so that characters are not read as factors

Let’s see the column names and the class of each.

names(df)
## [1] "iso3"           "year"           "country"        "gdp_per_capita"
## [5] "pop"            "fertility"
class(df$country)
## [1] "character"
head(df)
##   iso3 year   country gdp_per_capita      pop fertility
## 1  ARG 1968 Argentina       6434.954 23261278     3.049
## 2  ARG 1969 Argentina       6954.764 23605987     3.056
## 3  ARG 1970 Argentina       7056.848 23973058     3.073
## 4  ARG 1971 Argentina       7335.759 24366439     3.104
## 5  ARG 1972 Argentina       7329.921 24782949     3.148
## 6  ARG 1973 Argentina       7407.367 25213388     3.203

We can see that this data set contains GDP/capita (gdp_per_capita), population (pop) and fertility rate (fertility) for each country per year.

Let’s begin with the relationship between GDP/capita and population by making a scatter plot. Remember that in plot(x,y), the first argument is plotted on the x-axis.

plot(df$pop,df$gdp_per_capita)

Quiz 4.1

How do we get the latest year in the data and how do we plot the relationship between the GDP/capita and population for only the latest year?

Your graph should look like this:

#We can find the latest year by running the code:
max(df$year)

#This gives 2016. To get the plot for 2016 for population and GDP/capita we first subset the data for the year 2016:
df2016 <- df[df$year==2016,]

#Then we run the following code:
plot(df2016$pop,df2016$gdp_per_capita)

4.2 Plots continued

When a variable spans very different scales, we take its logarithm. Here we use base 10, as it makes the graphs easier to interpret:

df2016 <- df[df$year==2016,]
plot(log(df2016$pop, base=10),df2016$gdp_per_capita)

A better package to use in R for making graphs is ggplot2:

#install.packages("ggplot2")
library("ggplot2")

The first line, which is commented out (#install.packages("ggplot2")), is the code to install ggplot2. If you do not have the package, uncomment the line by deleting the # and run install.packages("ggplot2"). R comes with quite a few packages preinstalled, but ggplot2 is not among them.
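If you are not sure whether a package is already installed, you can check first and install only when needed (an illustrative base-R snippet, not part of the guide's main code):

```r
# install ggplot2 only if it is not already available on this machine
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
```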

The second line - library("ggplot2") - loads the package into the current session. Do not forget to run this line, for any package you have installed yourself, in every new session. Otherwise you will get an error like:

Error in ggplot(XXX) : could not find function “ggplot”

Let’s now try to create a plot with ggplot.

We first specify in ggplot the data set to be used, this case df. The next option aes is for specifying the variables and any other options related to them. geom_point is used for a scatter plot. Here is an example of a ggplot with the population and GDP/capita:

ggplot(df,aes(pop,gdp_per_capita)) + # says which data set and variables to use
  geom_point() #what to do with that data and variables

The same but for 2016 only:

ggplot(df[df$year==2016,],aes(pop,gdp_per_capita)) +
  geom_point()

Notice what the difference is with the previous code. (Answer: df[df$year==2016,])

We can also plot GDP/capita against fertility rate in 2016:

ggplot(df[df$year==2016,],aes(gdp_per_capita,fertility)) +
geom_point() 

We can add a fitted line with the command geom_smooth and specify the method for the fitted line. In this case we add a linear model lm with the formula “y = x + c” or y~x:

ggplot(df[df$year==2016,],aes(gdp_per_capita,fertility)) +
  geom_point() +
geom_smooth(method='lm',formula=y~x) 

We plot the logarithm of the GDP/capita against fertility rate in 2016 to control for skewness:

ggplot(df[df$year==2016,],aes(log(gdp_per_capita),fertility)) +
  geom_point()

(A reminder: we take the logarithm of the GDP/capita variable to compress the differences between large values, i.e. to respond to skewness towards large values. At other times, logs are used to show percent changes.)

The problem is the x-axis units are now not easily interpreted. We can fix this by scaling the x-axis to log 10:

ggplot(df[df$year==2016,],aes(gdp_per_capita,fertility)) +
  geom_point() +
  scale_x_log10()

We can omit the scientific notation using options(scipen = 999):

options(scipen = 999)

ggplot(df[df$year==2016,],aes(gdp_per_capita,fertility)) +
  geom_point() +
  scale_x_log10()

(If you have already used options(scipen = 999) in your current session, you will not see any changes because you have already told R to omit the scientific notations.)

We can scale the points proportional to the population size with the specification size=pop:

ggplot(df[df$year==2016,],
       aes(gdp_per_capita,fertility,size=pop)) +
  geom_point() +
  scale_x_log10()

Make the color different for countries with color=iso3:

ggplot(df[df$year==2016,],
       aes(gdp_per_capita,fertility,size=pop,colour=iso3)) +
  geom_point() +
  scale_x_log10()

Quiz 4.2

A good exercise to do by yourself would be to match the ISO3 codes with the regions (eg. Europe, Africa, etc.) and make the color per region, rather than country. (Hint: use the data set codes from Section 3 and the function merge.)

Your plot should look like this:

#In the "codes" data set we can find the code of a country alongside its region. So let's add the region column to our `df` data set like this:
df_merged <- merge(df, codes)

#Then we just run the ggplot code:
ggplot(df_merged[df_merged$year==2016,],
       aes(gdp_per_capita,fertility,size=pop,colour=region)) +
  geom_point() +
  scale_x_log10()

4.3 Further specifications for graphs: adding axes titles and animation

Let’s continue! We can exclude the legend and add axes titles by specifying labs:

ggplot(df[df$year==2016,],
       aes(gdp_per_capita,fertility,size=pop,colour=iso3)) +
  geom_point(show.legend = FALSE) + #exclude the legend
  scale_x_log10() +
  labs(x="GDP per capita (constant 2010 USD)",       #add x axis title
       y="Fertility rate, total (births per woman)") #add y axis title

We add the time dimension by specifying the entire data set ‘df’ (and not just for the year 2016):

ggplot(df,
       aes(gdp_per_capita,fertility,size=pop,colour=iso3)) +
  geom_point(show.legend = FALSE) +
  scale_x_log10() +
  labs(x="GDP per capita (constant 2010 USD)",
       y="Fertility rate, total (births per woman)")

What is more, with ggplot we can also make graph animations. We need to install a few additional packages, listed below. (Make sure to uncomment the install lines if you do not have them.)

#install.packages('gifski')
library("gifski")
#install.packages('tweenr')
library("tweenr")
# install.packages("devtools")
# devtools::install_github('thomasp85/transformr')
# devtools::install_github('thomasp85/gganimate')
library("gganimate")
#install.packages('png')
library("png")

Making an animated graph and saving it as a GIF:

animated_ggplot <- ggplot(df,
       aes(gdp_per_capita,fertility,size=pop,colour=iso3)) + # adding size=pop
  geom_point(show.legend = FALSE) +
  scale_x_log10() +
  labs(x="GDP per capita (constant 2010 USD)",
       y="Fertility rate, total (births per woman)") +
  transition_time(year)

anim_save("animated_ggplot.gif", animated_ggplot)

If you run this in RStudio, don’t forget to click on the “Viewer” tab on the right side to see it!

The command transition_time(year) makes an animation that shows the different states of the data during the different years.

We can add a title (Year) to the graph by specifying in labs - title="Year: {frame_time}":

animated_ggplot2 <- ggplot(df,
       aes(gdp_per_capita,fertility,size=pop,colour=iso3)) + # adding size=pop
  geom_point(show.legend = FALSE) +
  scale_x_log10() +
  labs(x="GDP per capita (constant 2010 USD)",
       y="Fertility rate, total (births per woman)",
       title="Year: {frame_time}") +
  transition_time(year)

anim_save("animated_ggplot2.gif", animated_ggplot2)

We can set up the height and width options:

options(gganimate.dev_args = list(width = 780, height = 440))

You can take a gganimate object and render it into an animation with the command animate. We can see all the options of animate like this:

?animate 

Or you can also google the question, adding “stackoverflow” to your search.

This is an example where we slow down animated_ggplot2:

animate(animated_ggplot2, nframes=350, fps=20) # to make it go slower

#increase number of total frames or decrease frames per second

Quiz 4.3

Let’s do one last practice!

Make a gganimate graph but with region as the color. (Hint: Check what we did for Quiz 4.2!)

Your gganimate plot should look like this:

# we merge our trade data with the codes data
df_merged <- merge(df, codes)

# construct the animated ggplot but we specify region as our color
animated_ggplot3 <- ggplot(df_merged,
       aes(gdp_per_capita,fertility,size=pop,colour=region)) +
  geom_point(show.legend = FALSE) +
  scale_x_log10() +
  labs(x="GDP per capita (constant 2010 USD)",
       y="Fertility rate, total (births per woman)",
       title="Year: {frame_time}") +
  transition_time(year)

# put the same specifications for the animation as before
animate(animated_ggplot3, nframes=350, fps=20) 

5. Gravity models

Please keep in mind that from this section onward we will learn how to analyse trade data in R. Specifically, this section is designed to teach you how to estimate gravity models in R and will thus assume you are familiar with the theory behind them and their use. If you are interested in an in-depth paper about the different types of gravity models - check ESCAP’s original guide in STATA or the recent version in R. UN ESCAP also has an online gravity tool.

Let’s quickly remind ourselves the use of gravity models:

Gravity models link trade flows with economic size and inversely with trade costs. Hence, gravity models can capture patterns in international trade and production. Policy researchers can use them to estimate the trade impacts of various trade-related policies, from traditional tariffs to new “behind-the-border” measures. (From ESCAP’s guide on gravity models found here.)

In this chapter we will cover the most basic form of the gravity model - the linear model. We will show you how to obtain STATA-like robust clustered standard errors. We will finish the analysis with the Poisson Pseudo-Maximum Likelihood Estimator (PPML).

5.1 Linear gravity model

5.1.1 Linear regression

Let’s start with the most basic gravity model: linear regression using the lm command. (lm stands for linear model.)

Let’s first generate an x and y value by using the rnorm function. (Check Section 1.2 to remind yourself what this function does or run the code ?rnorm.)

y <- rnorm(100,1,5)
x <- y * 2 + rnorm(100,0,1)

Inside the parentheses, on the left side of the tilde (~) is the output variable, and on the right side the input variable(s). In this case we estimate the equation y = x + c, where c is a constant (the intercept) that is found once the model is run. reg1 is the name we have given our regression.

reg1 <- lm(y~x)
reg1
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##     0.02308      0.49186

We can access regression output with the function summary. This gives us all the basic output one needs from a regression, such as the values of the estimates, the standard errors, p-values, etc. (Remember summary can be used on both data sets and model objects in R.)

Let’s run summary for reg1. The first line of output gives us the model specification, e.g. Call: lm(formula = y ~ x). Afterwards, descriptive statistics are provided for the Residuals, such as the minimum (Min) and maximum values (Max), the median (Median) and the first (1Q) and third quartile (3Q). For instance, the Median value in this case is 0.05095.

Next we get the Coefficients. We have the coefficient value under Estimate, followed by the standard error (Std. Error), the t-statistic (t value) and the p-value (Pr(>|t|)). For reg1 the coefficient of the intercept is 0.023083 and the x coefficient is 0.491862. The x coefficient is significant, as shown under the column Pr(>|t|).

Below the Coefficients information is a legend for the significance levels (Signif. codes). For example, ‘***’ next to a coefficient means that its p-value is below 0.001.

Finally, we have information on the Residual standard error, the R squared (Multiple R-squared and Adjusted R-squared), and the F-statistic and its p-value. For reg1 the R squared is 0.9898.

(The summary output varies with the type of model used. Generally speaking, the more complicated the model, the more information is provided.)

summary(reg1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.35458 -0.26922  0.05095  0.31059  0.99177
##
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)
## (Intercept) 0.023083   0.051397   0.449               0.654
## x           0.491862   0.005053  97.342 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5133 on 98 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9897
## F-statistic:  9475 on 1 and 98 DF,  p-value: < 0.00000000000000022

The attributes function shows you the full list of attributes you can pull out from the summary function used on reg1:

attributes(summary(reg1))
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients"
##  [5] "aliased"       "sigma"         "df"            "r.squared"
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
##
## $class
## [1] "summary.lm"

For example, you can pull out just the \(R^2\):

summary(reg1)$r.squared
## [1] 0.9897634
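In the same way we can pull out the whole coefficient table; the snippet below is an illustration using the reg1 object defined above:

```r
# the coefficient matrix: estimates, standard errors, t- and p-values
coefs <- summary(reg1)$coefficients
coefs[, "Std. Error"]   # just the standard errors
coefs["x", "Estimate"]  # the estimate for x alone
```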

Quiz 5.1

Let’s practice!

Which answer correctly:
- declares two new variables with specifications x <- rnorm(100,2,5) and y <- x + rnorm(100,1,5)
- calculates the regression equation y = x + c
- asks R for the regression’s output summary?

(Hint: y = x + c, where c is a constant which we will find out once the model is run; don’t worry about it, it should not be in your code!)

A. x <- rnorm(100,2,5)
    y <- x + rnorm(100,1,5)
    reg2 <- lm(y~x)
    summary(reg2)

#This is correct. Good job!

B. reg2 <- lm(y~x)
    summary(reg2)

#Answer B is incorrect because it fails to include the declaration of x and y, therefore the correct answer is A.

C. x <- rnorm(100,2,5)
    y <- x + rnorm(100,1,5)

#Answer C is incorrect because it only declares the variables x and y but does not specify the regression nor expression for the summary, therefore the correct answer is A.

5.1.2 Linear gravity model continued

Let’s work with some real trade data! We begin by importing the data in R with read.csv. You can download it here.

gd <- read.csv("short_data.csv",
               stringsAsFactors = FALSE)

We see the list of variables in the data set with the names command.

names(gd)
##  [1] "X"                 "reporter"          "partner"
##  [4] "year"              "export_ij"         "import_ij"
##  [7] "trade_ij"          "export_i"          "import_i"
## [10] "trade_i"           "export_j"          "import_j"
## [13] "trade_j"           "ctcij"             "ntcij"
## [16] "tariffrateij_sa"   "tariffrateij_wa"   "tariff_ij_sa"
## [19] "tariff_ij_wa"      "tariffrateji_sa"   "tariffrateji_wa"
## [22] "tariff_ji_sa"      "tariff_ji_wa"      "ga_tariff_ijji_sa"
## [25] "ga_tariff_ijji_wa" "dist"              "contig"
## [28] "comlang_off"       "comlang_ethno"     "colony"
## [31] "comcol"            "curcol"            "col45"
## [34] "smctry"            "landlocked_i"      "landlocked_j"
## [37] "landlocked_ij"     "rta"               "gdp_i"
## [40] "gdpgrowth_i"       "gdppc_i"           "gdppcgrowth_i"
## [43] "rgdp_i"            "rgdppc_i"          "gdp_j"
## [46] "gdpgrowth_j"       "gdppc_j"           "gdppcgrowth_j"
## [49] "rgdp_j"            "rgdppc_j"          "dtf_i"
## [52] "tab_i"             "startbiz_i"        "credit_i"
## [55] "invest_i"          "tax_i"             "contract_i"
## [58] "dtf_j"             "tab_j"             "startbiz_j"
## [61] "credit_j"          "invest_j"          "tax_j"
## [64] "contract_j"        "lsci_i"            "lsci_j"
## [67] "lpi_i"             "lpi_j"

Gravity models link trade flows with economic size and inversely with trade costs. For our first gravity model in R, let’s check the relationship between the exports (export_ij) and distance between the exporting and importing countries (dist):

lm(gd$export_ij ~ gd$dist)
##
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
##
## Coefficients:
## (Intercept)      gd$dist
##  1094335800       -73723

As we saw in the beginning of Section 5.1, we can create an object in R to store the results:

reg1 <- lm(gd$export_ij ~ gd$dist)

If you just call reg1 by itself, you won’t see much. Remember, for more detailed information we use summary:

summary(reg1)
##
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
##
## Residuals:
##          Min           1Q       Median           3Q          Max
##  -1090653864   -758408424   -507452840   -200191922 409695392687
##
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)
## (Intercept) 1094335800   16216752   67.48 <0.0000000000000002 ***
## gd$dist         -73723       1932  -38.16 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5307000000 on 374236 degrees of freedom
##   (137806 observations deleted due to missingness)
## Multiple R-squared:  0.003875,   Adjusted R-squared:  0.003872
## F-statistic:  1456 on 1 and 374236 DF,  p-value: < 0.00000000000000022

Trade values frequently contain zeros. How do we see how many rows in gd have zero in the column export_ij?

nrow shows the total number of rows:

nrow(gd)
## [1] 512044

The code below is meant to give us the rows whose value is zero, but because export_ij also contains missing values (NA), it returns the number of rows with zeros and missing values combined:

nrow(gd[gd$export_ij==0,])
## [1] 120815

A better way is to use length, which gives the length of a vector:

length(which(gd$export_ij==0))
## [1] 6409

We use the function which to signify that we want to count only the observations where gd$export_ij==0 is TRUE.

length(which(gd$export_ij==0)) is better than nrow(gd[gd$export_ij==0,]) because it counts only the zeros and excludes the missing values (NA). This is why length returns a smaller number of observations.
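An equivalent idiom (shown here only as an aside) uses sum, since TRUE counts as 1 when summed:

```r
# na.rm = TRUE drops the missing values before summing the TRUEs
sum(gd$export_ij == 0, na.rm = TRUE)
```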

We can find the length of gd:

length(gd)
## [1] 68

Since gd is a data frame, length returns the number of columns: 68 variables.

The number of observations is:

length(gd$export_ij)
## [1] 512044

We can check the number of missing values:

length(gd$export_ij[is.na(gd$export_ij)])
## [1] 114406

Let’s assume the missing values of export_ij are zero. How do we change their value from missing to zero?

Remember we need to select the correct condition, which in this case is is.na(gd$export_ij) but also tell R for which variable it should change the values - export_ij:

gd[is.na(gd$export_ij),"export_ij"] <- 0

Double checking this worked:

length(gd$export_ij[is.na(gd$export_ij)])
## [1] 0

Yes it did! Now we have 0 rows in export_ij which are missing.

Let’s regress once again distance with exports to compare:

reg2 <- lm(gd$export_ij ~ gd$dist)
summary(reg2)
##
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
##
## Residuals:
##          Min           1Q       Median           3Q          Max
##   -939688372   -624180945   -404594794   -132356225 409761225877
##
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)
## (Intercept) 943289085   13155595   71.70 <0.0000000000000002 ***
## gd$dist        -65972       1497  -44.07 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4698000000 on 478830 degrees of freedom
##   (33212 observations deleted due to missingness)
## Multiple R-squared:  0.004039,   Adjusted R-squared:  0.004037
## F-statistic:  1942 on 1 and 478830 DF,  p-value: < 0.00000000000000022

There is a difference!

We can regress the logs of the same variables in order to interpret the relationship in percentage terms:

reg3 <- lm(log(gd$export_ij) ~ log(gd$dist))

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, …) : NA/NaN/Inf in ‘y’

We get an error because we cannot take the logarithm of zero. For the purposes of the exercise, we replace all zero export_ij values with 0.0001. (A quick reminder: the log function in R returns the natural logarithm unless another base is specified!)
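To see the default base, and the source of the error, here is a quick illustrative check (not part of the analysis):

```r
log(100)             # natural logarithm, about 4.605
log(100, base = 10)  # base-10 logarithm, exactly 2
log(0)               # -Inf, which is why lm() complains about Inf in 'y'
```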

gd[gd$export_ij==0,"export_ij"] <- 0.0001
reg3 <- lm(log(gd$export_ij) ~ log(gd$dist))
summary(reg3)
##
## Call:
## lm(formula = log(gd$export_ij) ~ log(gd$dist))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -32.756  -3.341   3.907   7.381  19.156
##
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)
## (Intercept)  36.37332    0.15809   230.1 <0.0000000000000002 ***
## log(gd$dist) -3.09398    0.01816  -170.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.51 on 478830 degrees of freedom
##   (33212 observations deleted due to missingness)
## Multiple R-squared:  0.05717,    Adjusted R-squared:  0.05717
## F-statistic: 2.904e+04 on 1 and 478830 DF,  p-value: < 0.00000000000000022

This result says that a 1% increase in distance is associated with a 3.09% decrease in export_ij. (This is also known as the elasticity of exports with respect to distance.)
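For small changes this reading is exact; for a larger change, the implied effect can be computed directly from the estimated coefficient (an illustrative calculation):

```r
beta <- -3.09398   # the distance elasticity estimated in reg3
# implied effect of a 10% increase in distance on exports:
(1.10)^beta - 1    # about -0.255, i.e. roughly a 25.5% fall in exports
```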

We can reproduce the above, but instead of prefixing every variable with gd$, we can specify the data source as one of the parameters of the lm function. Both ways work and provide the same results; which one to use is a matter of personal preference.

reg4 <- lm(log(export_ij) ~ log(dist), data = gd)
summary(reg4)
##
## Call:
## lm(formula = log(export_ij) ~ log(dist), data = gd)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -32.756  -3.341   3.907   7.381  19.156
##
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)
## (Intercept) 36.37332    0.15809   230.1 <0.0000000000000002 ***
## log(dist)   -3.09398    0.01816  -170.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.51 on 478830 degrees of freedom
##   (33212 observations deleted due to missingness)
## Multiple R-squared:  0.05717,    Adjusted R-squared:  0.05717
## F-statistic: 2.904e+04 on 1 and 478830 DF,  p-value: < 0.00000000000000022

Quiz 5.2

So far so good. What is the coefficient for log(export_ij) if we run log(trade_ij) ~ log(dist) + log(export_ij)? (Hint: Make sure the new variable that is included - trade_ij does not have any missing values or zeros as you will take the logarithm of it. If it is the case, do the same as we did with export_ij.)

A. 0.30
B. 0.43
C. 0.55
D. 0.67

#We can see that `trade_ij` has zeros. We can change that the same way we changed it for `export_ij`:
gd[gd$trade_ij==0,"trade_ij"] <- 0.0001

#Now we can run the following gravity model:
lm(log(trade_ij) ~ log(dist) + log(export_ij), data = gd)

#We can see that the coefficient for log(export_ij) is 0.2976 or 0.30, hence A is the correct answer.

5.1.3 Linear Gravity model with more variables and interaction terms

Let’s create a new regression with more variables:

reg5 <- lm(log(export_ij) ~ log(dist) +
             log(gdp_i) +
             log(gdp_j) +
             rta +
             log(ga_tariff_ijji_sa)
             , data = gd)
summary(reg5)
##
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) +
##     rta + log(ga_tariff_ijji_sa), data = gd)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -33.334  -1.688   1.713   4.409  19.737
##
## Coefficients:
##                          Estimate Std. Error  t value            Pr(>|t|)
## (Intercept)            -62.352427   0.273433 -228.036 <0.0000000000000002 ***
## log(dist)               -2.516327   0.018142 -138.703 <0.0000000000000002 ***
## log(gdp_i)               2.489102   0.006209  400.867 <0.0000000000000002 ***
## log(gdp_j)               1.381105   0.006204  222.598 <0.0000000000000002 ***
## rta                      1.322291   0.042079   31.424 <0.0000000000000002 ***
## log(ga_tariff_ijji_sa)  -0.224746   0.242981   -0.925               0.355
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.107 on 301412 degrees of freedom
##   (210626 observations deleted due to missingness)
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4224
## F-statistic: 4.409e+04 on 5 and 301412 DF,  p-value: < 0.00000000000000022

We can export the coefficients to the clipboard so you can paste them (“Ctrl/Cmd + V”) into Excel:

write.table(
  summary(reg5)$coefficients # selecting just the coefficient table from the summary
  , "clipboard", sep="\t", row.names=FALSE)
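Note that the "clipboard" connection is Windows-specific. A portable alternative (the filename below is just an example) is to write the coefficient table to a CSV file and open it in Excel:

```r
# writes estimates, standard errors, t- and p-values to a file
write.csv(summary(reg5)$coefficients, "reg5_coefficients.csv")
```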

We can also add interaction terms, in order to capture joint variable effects, by wrapping the interaction in the I() operator:

reg6 <- lm(log(export_ij) ~ log(dist) +
             log(gdp_i) +
             log(gdp_j) +
             rta +
             I(rta*log(gdp_j)) + #put interaction in I operator
             I(dist^2) + #second-order polynomial term - a non-linear transformation
             log(ga_tariff_ijji_sa)
           , data = gd)
summary(reg6)
##
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) +
##     rta + I(rta * log(gdp_j)) + I(dist^2) + log(ga_tariff_ijji_sa),
##     data = gd)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -34.160  -1.726   1.697   4.412  20.456
##
## Coefficients:
##                                 Estimate        Std. Error  t value
## (Intercept)            -63.9832459652895   0.3271008450554 -195.607
## log(dist)               -2.9782272151947   0.0263709439899 -112.936
## log(gdp_i)               2.5488480880571   0.0062645188343  406.871
## log(gdp_j)               1.5342760391309   0.0067795539273  226.309
## rta                     23.4357915076500   0.4081369407485   57.421
## I(rta * log(gdp_j))     -0.8966090092956   0.0163094395377  -54.975
## I(dist^2)                0.0000000062601   0.0000000002558   24.469
## log(ga_tariff_ijji_sa)  -1.1094688930275   0.2419365488330   -4.586
##                                    Pr(>|t|)
## (Intercept)            < 0.0000000000000002 ***
## log(dist)              < 0.0000000000000002 ***
## log(gdp_i)             < 0.0000000000000002 ***
## log(gdp_j)             < 0.0000000000000002 ***
## rta                    < 0.0000000000000002 ***
## I(rta * log(gdp_j))    < 0.0000000000000002 ***
## I(dist^2)              < 0.0000000000000002 ***
## log(ga_tariff_ijji_sa)           0.00000452 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.063 on 301410 degrees of freedom
##   (210626 observations deleted due to missingness)
## Multiple R-squared:  0.4295, Adjusted R-squared:  0.4295
## F-statistic: 3.242e+04 on 7 and 301410 DF,  p-value: < 0.00000000000000022

5.1.4 STATA-like robust clustered standard errors

Estimates from ordinary least squares (OLS) are consistent, unbiased, and efficient if these three necessary and sufficient conditions are met:
1). The errors are homoskedastic.
2). The errors are orthogonal to (uncorrelated with) one another.
3). None of the explanatory variables is a linear combination of the other explanatory variables.

In practice, however, it is not uncommon to have heteroskedastic errors. In order to control for this we use robust standard errors. Robust standard errors allow for arbitrary patterns of heteroskedasticity in the data and are therefore a simple and effective way of dealing with violations of the first OLS condition.

In the gravity model context, errors are likely to be correlated by country pair. So it is important to allow for clustering by country pair. To do this, it is necessary to specify a clustering variable that separately identifies each country pair independently of the direction of trade. An example is distance, which is unique to each country pair but is identical for both directions of trade.
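In the code below we cluster on distance; if you prefer an explicit pair identifier, one can be built from the reporter and partner codes (an illustrative snippet, where pair_id is a new column we invent for the example):

```r
# pmin/pmax order the two country codes alphabetically, so the id is the
# same regardless of which country reports the flow
gd$pair_id <- paste(pmin(gd$reporter, gd$partner),
                    pmax(gd$reporter, gd$partner),
                    sep = "_")
```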

(Once again for the full details of this model - check ESCAP’s gravity model guide in R.)

Having this in mind, for linear gravity models, robust clustered standard errors are a way to obtain better estimates. In R we can use the function lm_robust from the estimatr package.

Let’s first install and then activate the package. Uncomment the code to install it by removing the # and running install.packages("estimatr").

#install.packages("estimatr")
library(estimatr)

You can obtain STATA-like robust clustered standard errors by using the lm_robust function. We specify the equation log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + rta + log(ga_tariff_ijji_sa) as usual. What is different from the lm function is that here we can specify which variable is our cluster and the type of standard errors we want. In the example below, we choose distance as our cluster (clusters = dist) and ask for STATA-like robust standard errors (se_type = "stata"). (As usual, for the full details of the specifications of lm_robust run ??lm_robust and select the help page titled estimatr::lm_robust.)

reg_robust <- lm_robust(log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + rta + log(ga_tariff_ijji_sa),
          data = gd, clusters = dist, se_type = "stata")
summary(reg_robust)
##
## Call:
## lm_robust(formula = log(export_ij) ~ log(dist) + log(gdp_i) +
##     log(gdp_j) + rta + log(ga_tariff_ijji_sa), data = gd, clusters = dist,
##     se_type = "stata")
##
## Standard error type:  stata
##
## Coefficients:
##                        Estimate Std. Error  t value
## (Intercept)            -62.3524    0.86939 -71.7201
## log(dist)               -2.5163    0.05655 -44.4980
## log(gdp_i)               2.4891    0.01948 127.7587
## log(gdp_j)               1.3811    0.01766  78.2119
## rta                      1.3223    0.10906  12.1240
## log(ga_tariff_ijji_sa)  -0.2247    0.64297  -0.3495
##                                                      Pr(>|t|) CI Lower CI Upper
## (Intercept)            0.000000000000000000000000000000000000  -64.057  -60.648
## log(dist)              0.000000000000000000000000000000000000   -2.627   -2.405
## log(gdp_i)             0.000000000000000000000000000000000000    2.451    2.527
## log(gdp_j)             0.000000000000000000000000000000000000    1.346    1.416
## rta                    0.000000000000000000000000000000001213    1.109    1.536
## log(ga_tariff_ijji_sa) 0.726687161813804882726230971456971020   -1.485    1.036
##                           DF
## (Intercept)            12607
## log(dist)              12607
## log(gdp_i)             12607
## log(gdp_j)             12607
## rta                    12607
## log(ga_tariff_ijji_sa) 12607
##
## Multiple R-squared:  0.4224 ,    Adjusted R-squared:  0.4224
## F-statistic:  4266 on 5 and 12607 DF,  p-value: < 0.00000000000000022

Let’s compare these results to reg5, which has exactly the same specification except that its standard errors are neither robust nor clustered. The significance of the coefficients remains the same under both methods. Note, however, that lm_robust produces larger standard error values than lm, as they are robust to heteroskedasticity and clustering.

summary(reg5)
##
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) +
##     rta + log(ga_tariff_ijji_sa), data = gd)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -33.334  -1.688   1.713   4.409  19.737
##
## Coefficients:
##                          Estimate Std. Error  t value            Pr(>|t|)
## (Intercept)            -62.352427   0.273433 -228.036 <0.0000000000000002 ***
## log(dist)               -2.516327   0.018142 -138.703 <0.0000000000000002 ***
## log(gdp_i)               2.489102   0.006209  400.867 <0.0000000000000002 ***
## log(gdp_j)               1.381105   0.006204  222.598 <0.0000000000000002 ***
## rta                      1.322291   0.042079   31.424 <0.0000000000000002 ***
## log(ga_tariff_ijji_sa)  -0.224746   0.242981   -0.925               0.355
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.107 on 301412 degrees of freedom
##   (210626 observations deleted due to missingness)
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4224
## F-statistic: 4.409e+04 on 5 and 301412 DF,  p-value: < 0.00000000000000022

Quiz 5.3

What is the standard error of log(export_ij) if we run log(trade_ij) ~ log(dist) + log(export_ij) with STATA-like robust standard errors? (Hint: Check how we ran the command above and change the equation accordingly!)

A. 0.0015
B. 0.0197
C. 0.0201
D. 0.1756

# We run the command:
lm_robust(log(trade_ij) ~ log(dist) + log(export_ij), data = gd, clusters = dist, se_type = "stata")

# The standard error for log(export_ij) is 0.001470578, or 0.0015, so answer A is correct.

5.2 Poisson Pseudo-Maximum Likelihood Estimator (PPML)

The PPML model is often used in trade analysis because it can handle zero trade values, accommodates fixed effects, and its coefficients are easy to interpret. We will go over the basics of the PPML model in R. Once again, see ESCAP’s gravity model guide in R for more information.
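As a rough sketch of what PPML estimation involves, the example below fits a Poisson regression with a log link on synthetic data using base R’s glm(). The variable names mimic our data set, but every number is made up:

```r
# PPML-style estimation on synthetic data: the dependent variable enters
# in levels (zeros are allowed), the regressors in logs.
set.seed(42)
n     <- 1000
dist  <- exp(rnorm(n, 8, 1))
gdp_i <- exp(rnorm(n, 10, 1))
gdp_j <- exp(rnorm(n, 10, 1))
mu    <- exp(-3 - 1 * log(dist) + 0.8 * log(gdp_i) + 0.8 * log(gdp_j))
trade <- rpois(n, mu)   # synthetic trade flows

ppml_fit <- glm(trade ~ log(dist) + log(gdp_i) + log(gdp_j),
                family = quasipoisson(link = "log"))
round(coef(ppml_fit), 2)   # close to the true values -1, 0.8, 0.8
```

The gravity package’s ppml() function follows the same logic; quasipoisson gives the same point estimates as poisson while avoiding warnings about non-integer outcomes.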

The PPML function in R can be found in the gravity package:

#install.packages("gravity")
library(gravity)

Here we use export_ij as an output variable, we specify the distance variable in our data (dist in this case) and add any additional variables (e.g. partner GDP: gdp_j and reporter GDP: gdp_i).

ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i' ), data=gd)
##
## Call:  y_ppml ~ dist_log + gdp_j + gdp_i
##
## Coefficients:
##         (Intercept)             dist_log                gdp_j
## 28.0022547089188549  -1.0674026797574185   0.0000000000003008
##               gdp_i
##  0.0000000000002836
##
## Degrees of Freedom: 435085 Total (i.e. Null);  435082 Residual
##   (43746 observations deleted due to missingness)
## Null Deviance:       1423000000000000
## Residual Deviance: 724400000000000   AIC: NA

As mentioned, one of the advantages of PPML is that it can handle zeros so let’s change export_ij back to zero:

gd[gd$export_ij==0.0001,"export_ij"] <- 0
ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i' ), data=gd)
##
## Call:  y_ppml ~ dist_log + gdp_j + gdp_i
##
## Coefficients:
##         (Intercept)             dist_log                gdp_j
## 28.0022547089189793  -1.0674026797574607   0.0000000000003008
##               gdp_i
##  0.0000000000002836
##
## Degrees of Freedom: 435085 Total (i.e. Null);  435082 Residual
##   (43746 observations deleted due to missingness)
## Null Deviance:       1423000000000000
## Residual Deviance: 724400000000000   AIC: NA

In this case the coefficients of the PPML model barely change between export values of 0.0001 and 0. This is most likely because the subsample of zero trade flows is relatively small.

Quiz 5.4

Let’s do a quick practice! We run the exact same model as above, but add rta to the list of variables. What is the coefficient for rta? Make sure you have run all the necessary code from this section!

A. 1.549107
B. 25.417498
C. -0.795194
D. 1.103045

# Run the following code if it has not been run already:
gd[gd$export_ij==0.0001,"export_ij"] <- 0

# Next we specify the PPML model as follows:
ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i', 'rta' ), data=gd)

# Once you run the code above, you see that the coefficient of `rta` is 1.103045, so D is the correct answer.

6. Non-Tariff Measures (NTMs) and incidence indicators, a simple case

Non-tariff measures (NTMs) are defined as policy measures other than ordinary customs tariffs that can potentially have an economic effect on international trade in goods, changing quantities traded, prices or both. NTMs include technical measures, such as sanitary or environmental protection measures, as well as others traditionally used as instruments of commercial policy, e.g. quotas, price control, exports restrictions, contingent trade protective measures, etc.

Here you can find UNCTAD’s classification of NTMs (version 2012). The classification encompasses 16 chapters (A to P), and each chapter is divided into groupings with a depth of up to four levels. Although a few chapters reach the four-digit level of disaggregation, most stop at three digits. For analyses mixing three- and four-digit codes, we append a zero after the third digit, e.g. “A11” becomes “A110”. All chapters reflect the requirements of the importing country on its imports, with the exception of chapter P, which covers measures imposed on exports by the exporting country. A table with all the chapters is provided below.
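This padding step is easy to script. The helper below, pad_ntm, is a hypothetical function name used only for illustration:

```r
# Append a trailing zero to three-character NTM codes so that all codes
# have four characters; longer codes are returned unchanged.
pad_ntm <- function(code) ifelse(nchar(code) == 3, paste0(code, "0"), code)

pad_ntm(c("A11", "A140"))
## [1] "A110" "A140"
```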

The simplest way to study NTMs is to calculate incidence indicators based on the intensity of the policy instruments, and measure the degree of regulation without considering its impact on trade or the economy. Three commonly used incidence indicators are:
- coverage ratio
- frequency index
- prevalence score

These indicators are based upon inventory listing of observed NTMs. The coverage ratio (CR) measures the percentage of trade subject to NTMs, the frequency index (FI) indicates the percentage of products to which NTMs apply, and the prevalence score (PS) is the average number of NTMs applied to products. In this section we will study how to calculate them for all imports and per NTM type.
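Before we compute these indicators on real data, here is a tiny made-up example (three products, invented values) showing all three definitions at once:

```r
# Toy data: three imported products, two of which face NTMs.
toy <- data.frame(
  hs6   = c(101, 102, 103),
  value = c(100, 300, 600),  # import values
  n     = c(2, 0, 1)         # number of NTMs per product
)
NTMik <- as.integer(toy$n > 0)      # 1 if the product has at least one NTM
Dik   <- as.integer(toy$value > 0)  # 1 if the product is imported

CR <- sum(NTMik * toy$value) / sum(toy$value) * 100  # coverage ratio: 70
FI <- sum(NTMik * Dik) / sum(Dik) * 100              # frequency index: 66.7
PS <- sum(toy$n * Dik) / sum(Dik)                    # prevalence score: 1
```

So 70% of import value and two thirds of product types are covered by NTMs, with on average one NTM per imported product.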

You can find here UNCTAD’s book where the indicators (coverage ratio, frequency index and prevalence score) are explained on page 92. Data for NTMs can be found here.

(The most recent UNCTAD publication (2019) presents different options for calculating the NTM indicators. We use the simplest one, but if interested you can find all the alternatives here.)

For now, we will use Lao PDR’s NTM data. We use Lao PDR as an example because all of its imports and NTMs are reported vis-à-vis the world. This makes the calculations easier, since only Lao PDR’s total imports from its trade partners are required. When NTMs apply only to some individual countries or groups, the same formulas hold but the analysis is a bit more involved; we defer that case to Section 7, where we study Viet Nam.

In the table below you can find the two data sets we will use. The link to download is in the third column.

Description File name Link to download
List of all the products of Lao PDR which have an NTM lao_ntm.csv here
List of all the imports of Lao PDR lao_imp.csv here

Let’s first load into R the NTM data for Lao PDR. (This is the original source.) This data represents a list of all the products of Lao PDR which have an NTM.

lao_ntm <- read.csv("lao_ntm.csv",
                    stringsAsFactors = FALSE)

Let’s install the dplyr package, if not yet done, and load it:

#install.packages("dplyr") #uncomment if not already installed
library("dplyr")

If you get a warning here (you likely will), it is because some function names exist in multiple loaded packages, and the most recently loaded package masks the others, for example:

The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union

In case you need to call a specific function from a specific package, use ::, like this:

base::intersect()

dplyr is a very useful and powerful package when performing data processing in R. A very good and intuitive introduction to this package can be found on this link.

The as_data_frame command from dplyr converts a data frame into a tibble, a tidier table format. (In recent versions of the tidyverse, as_tibble is the preferred name for the same operation.) It does not change anything in the data; the object simply prints more nicely when called by itself.

lao_ntm <- as_data_frame(lao_ntm)

Let’s check how the first few rows look like in Lao PDR’s NTM data:

head(lao_ntm)
## # A tibble: 6 x 5
##   reporter partner ntmcode   hs6 ntm_1_digit
##   <chr>    <chr>   <chr>   <int> <chr>
## 1 LAO      WLD     A140    10121 A
## 2 LAO      WLD     A830    10121 A
## 3 LAO      WLD     A840    10121 A
## 4 LAO      WLD     C300    10121 C
## 5 LAO      WLD     F610    10121 F
## 6 LAO      WLD     F650    10121 F

(Remember you can open the entire data set by locating the data set name on your right in the section “Environment” and clicking it.)

All the reporter values are LAO and all the partner values are WLD, for world. As already mentioned, Lao PDR’s NTM data covers only imports from the world. Next we have the four-digit NTM codes (ntmcode), e.g. A140, a special authorization requirement for sanitary and phytosanitary (SPS) reasons, and the product code (hs6), e.g. 10121, which is live horses. This tells us that when Lao PDR imports live horses, it needs an authorization. The final column holds the 1-digit NTM code, e.g. A for SPS measures. (HS 6-digit codes, or product codes, refer to the Harmonized System (HS) for classifying goods in a six-digit format. You can find more about the HS system here.)

Since HS 6-digit codes are product codes and to make the analysis easier to follow, we will use the terms “HS6 code” and “product code” interchangeably everywhere.

Let’s now load in R Lao PDR’s imports:

lao_imp <- read.csv("lao_imp.csv", stringsAsFactors = FALSE)

Let’s once again have a look at the first rows.

head(lao_imp)
##   Partner.ISO Commodity.Code Trade.Value..US..
## 1         WLD          10121              1430
## 2         WLD          10190              5100
## 3         WLD          10221            243284
## 4         WLD          10229          32125452
## 5         WLD          10231             22480
## 6         WLD          10239           8630620

In this case we have the partner, the Commodity.Code or product code, and the trade value. WLD is the only partner.

unique(lao_imp$Partner.ISO)
## [1] "WLD"

Let’s rename the variables to something easier to use:

names(lao_imp) <- c("partner","hs6","value")

As already mentioned, Lao PDR’s import data is all from the world, so each product code appears only once, as is also evident below:

length(unique(lao_imp$hs6))
## [1] 3132
nrow(lao_imp)
## [1] 3132

(Notice that the number of unique product codes equals the number of rows, hence each product code in lao_imp appears only once.)

Let’s now use these data sets to calculate the incidence indicators, starting with the coverage ratio.

6.1 Coverage ratio

Let’s begin by calculating the coverage ratio, which shows the percentage of import value subject to NTMs for a specific country. The formula for the coverage ratio is as follows:

\(CR_i = \frac{\sum^{hs}_{k=1} NTM_{ik} X_{ik}}{\sum^{hs}_{k=1} X_{ik}} 100\).

where \(NTM_{ik}\) is a binary variable, which shows if there is an NTM or not for a traded good \(k\) and \(X_{ik}\) is trade value of good \(k\) for country \(i\).

We need values for only two terms in this formula - \(NTM_{ik}\) and \(X_{ik}\). We already have the trade value \(X_{ik}\) in the lao_imp data set, so no further calculations are needed for it. For \(NTM_{ik}\) (the indicator of whether an imported product has an NTM or not), we will extract all the product codes (hs6) from the lao_ntm data and count the number of times each product code has an NTM. We set \(NTM_{ik} = 1\) if the NTM count is bigger than 0, and \(NTM_{ik} = 0\) otherwise.

Let’s do this step by step.

So we need to get \(NTM_{ik}\), which signifies if a product code has an NTM or not. We first use the lao_ntm data. This summarizes the number of measures for each product (hs6) code:

lao_ntm %>% group_by(hs6) %>% count()
## # A tibble: 5,205 x 2
## # Groups:   hs6 [5,205]
##      hs6     n
##    <int> <int>
##  1 10121    10
##  2 10129    13
##  3 10130    13
##  4 10190    13
##  5 10221    10
##  6 10229    13
##  7 10231    10
##  8 10239    13
##  9 10290    13
## 10 10310    10
## # ... with 5,195 more rows

Operators like %>% from dplyr are called pipes. x %>% f() means “apply the function f to x”. Pipes are helpful because they let you chain operations in natural reading order.

So this is all lao_ntm %>% group_by(hs6) %>% count() does (it doesn’t change the data, it only displays the result - we have not assigned it with <- yet):
1. takes the lao_ntm data set
2. group_by(hs6): groups it by hs6 codes , i.e. multiple rows with the same product code (hs6) are grouped together
3. count(): counts the number of rows where the same product (hs6) code is present

We use count to tally how many times the same product shows up. Notice that there is nothing between the brackets of count this time: with no arguments, count counts the rows within each group created by group_by, here the hs6 groups, so we do not have to refer to the variable again.
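For intuition, here is the same group_by() plus count() pattern on a tiny made-up tibble:

```r
library(dplyr)

# Two rows share hs6 = 101, so they collapse into one row with n = 2.
toy_ntm <- tibble(hs6     = c(101, 101, 102),
                  ntmcode = c("A140", "C300", "A140"))
toy_ntm %>% group_by(hs6) %>% count()
```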

Let’s now create a data set:

ntm_hs_freq <- lao_ntm %>% group_by(hs6) %>% count()

head(ntm_hs_freq)
## # A tibble: 6 x 2
## # Groups:   hs6 [6]
##     hs6     n
##   <int> <int>
## 1 10121    10
## 2 10129    13
## 3 10130    13
## 4 10190    13
## 5 10221    10
## 6 10229    13

At this stage, ntm_hs_freq is not correct because it includes type “P” NTMs, i.e. export-related measures. (Remember, we are interested in Lao PDR’s imports, not the goods it exports, so export-related measures are not needed here.)

We can subset lao_ntm to exclude the “P” measures and count the product codes with count:

ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit!="P",] %>%  #as long as %>% ends the line, R continues reading across multiple lines
  group_by(hs6) %>%
  count()

head(ntm_hs_freq)
## # A tibble: 6 x 2
## # Groups:   hs6 [6]
##     hs6     n
##   <int> <int>
## 1 10121     6
## 2 10129     9
## 3 10130     9
## 4 10190     9
## 5 10221     6
## 6 10229     9

If you start a line with %>%, the code will not run; but if you end a line with %>%, R waits for the next line to complete the expression, like an open bracket.

So let’s quickly sum up what we did so far. We created a data set ntm_hs_freq which contains all the product codes which have an NTM in column hs6. Column n contains the number of NTMs applied to a specific product code. Now we need to match these observations to trade values. To do this we merge lao_imp with ntm_hs_freq.

We merge lao_imp with ntm_hs_freq by the product codes. Because the column names match, we can just specify by:

coverage <- merge(lao_imp,ntm_hs_freq,by="hs6",
                  all.x = TRUE)
head(coverage)
##     hs6 partner    value n
## 1 10121     WLD     1430 6
## 2 10190     WLD     5100 9
## 3 10221     WLD   243284 6
## 4 10229     WLD 32125452 9
## 5 10231     WLD    22480 6
## 6 10239     WLD  8630620 9

Adding all.x = TRUE ensures that every row of lao_imp is kept; otherwise we would drop imports that have zero NTMs.
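The effect of all.x = TRUE is easiest to see on a toy example (invented codes and values):

```r
# Product 102 has no NTMs, so it appears only in the imports table.
imports <- data.frame(hs6 = c(101, 102, 103), value = c(5, 10, 20))
ntms    <- data.frame(hs6 = c(101, 103), n = c(2, 1))

merge(imports, ntms, by = "hs6")                # inner join: 102 is dropped
merge(imports, ntms, by = "hs6", all.x = TRUE)  # left join: 102 kept, n is NA
```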

We will use the columns of the coverage data set to calculate the coverage ratio. Remember for the formula we need the value (\(X_{ik}\)) and the dummy variable \(NTM_{ik}\), signifying whether there is a NTM or not. We will use the number of NTMs n to signify whether there is an NTM or not.

We create a new column called NTMik (\(NTM_{ik}\)) with zeros:

coverage$NTMik <- 0

Remember, \(NTM_{ik} = 0\) if no NTMs are found for a product, and \(NTM_{ik} = 1\) if there is at least one.

Hence if \(n > 0\), this means at least one NTM is used for that product code, so we change the value of NTMik to 1:

coverage[coverage$n>0,"NTMik"] <- 1
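As a side note, the two steps above (a column of zeros, then overwriting where n > 0) can be collapsed into one line; the snippet below demonstrates this on a small vector:

```r
# as.integer(n > 0) maps positive counts to 1 and zeros to 0 directly.
n <- c(6, 9, 0, 3)
NTMik <- as.integer(n > 0)
NTMik
## [1] 1 1 0 1
```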

In the first part of the formula, we multiply the dummy NTMik by value (or \(X_{ik}\) in the formula):

head(coverage$NTMik*coverage$value)
## [1]     1430     5100   243284 32125452    22480  8630620

We sum those up:

sum(coverage$NTMik*coverage$value)
## [1] 4107068464

And divide by the total sum of value (\(X_{ik}\)) then multiply by 100 to get the percentage:

sum(coverage$NTMik*coverage$value)/sum(coverage$value)*100
## [1] 100

This means that 100% of Lao PDR’s import value is subject to at least one NTM.

Now that you understand the code, let’s go over the steps we took:
1. Define ntm_hs_freq.
2. Merge ntm_hs_freq with lao_imp.
3. Create NTMik.
4. Calculate the formula.

6.1.1 Coverage ratio for SPS measures

Let’s calculate one more coverage ratio to make sure you understand it. Let’s calculate the coverage ratio for SPS (A) measures.

We first define the ntm_hs_freq data. We specify this time that we are only interested in SPS (A) measures, so we change !="P" to =="A":

ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit=="A",] %>%  #as long as %>% ends the line, R continues reading across multiple lines
  group_by(hs6) %>%
  count()

ntm_hs_freq
## # A tibble: 1,139 x 2
## # Groups:   hs6 [1,139]
##      hs6     n
##    <int> <int>
##  1 10121     3
##  2 10129     4
##  3 10130     4
##  4 10190     4
##  5 10221     3
##  6 10229     4
##  7 10231     3
##  8 10239     4
##  9 10290     4
## 10 10310     3
## # ... with 1,129 more rows

(Small reminder: count() with no arguments counts rows within the current grouping; since we used group_by(hs6), we are counting how many times each product code appears.)

We merge lao_imp with ntm_hs_freq using hs6 code just as before:

coverage <- merge(lao_imp,ntm_hs_freq,
                  by="hs6",
                  all.x = TRUE)
head(coverage)
##     hs6 partner    value n
## 1 10121     WLD     1430 3
## 2 10190     WLD     5100 4
## 3 10221     WLD   243284 3
## 4 10229     WLD 32125452 4
## 5 10231     WLD    22480 3
## 6 10239     WLD  8630620 4

Adding all.x = TRUE ensures that every row of lao_imp is kept; otherwise we would drop imports that have zero SPS measures.

We can create a new column NTMik of zeros, where we will include the NTM dummy variable values.

coverage$NTMik <- 0

We want to change the value of the new variable to 1 where n > 0, meaning the product has at least one SPS measure. This time the assignment fails because n contains NA (missing) values: since we merged with all.x = TRUE, any product without an SPS (type “A”) measure is absent from ntm_hs_freq and gets NA after the merge.

coverage[coverage$n>0,"NTMik"] <- 1

Error in `[<-.data.frame`(`*tmp*`, coverage$n > 0, "NTMik", value = 1) : missing values are not allowed in subscripted assignments of data frames

So we replace the missing values in n with zero by selecting the rows where is.na is TRUE, and only the n column:

coverage[is.na(coverage$n),"n"] <- 0
head(coverage)
##     hs6 partner    value n NTMik
## 1 10121     WLD     1430 3     0
## 2 10190     WLD     5100 4     0
## 3 10221     WLD   243284 3     0
## 4 10229     WLD 32125452 4     0
## 5 10231     WLD    22480 3     0
## 6 10239     WLD  8630620 4     0
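If the subsetting syntax is new to you, here is the same NA-to-zero replacement on a minimal vector:

```r
# Replace missing counts with zero.
n <- c(3, NA, 4, NA)
n[is.na(n)] <- 0
n
## [1] 3 0 4 0
```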

So now when we check the n column, there are no rows with missing values:

coverage[is.na(coverage$n), ]
## [1] hs6     partner value   n       NTMik
## <0 rows> (or 0-length row.names)

We assign a value of 1 for all \(n > 0\).

coverage[coverage$n>0,"NTMik"] <- 1

Now we can calculate the coverage ratio for SPS measures once again by summing the product of NTMik and value (\(X_{ik}\)) and dividing by the total sum of value. We then multiply by 100 to get the percentage!

sum(coverage$NTMik*coverage$value)/
  sum(coverage$value)*100
## [1] 15.57878

This number means that 15.6% of trade value that Lao PDR imported had at least one SPS measure.

Quiz 6.1

Try now by yourself to calculate the coverage ratio for technical measures. (Hint: the code is similar to above but technical measures belong to Chapter B.)

Which answer is the correct coverage ratio?

(Hint 2: You need to change lao_ntm[lao_ntm$ntm_1_digit=="A",] to lao_ntm[lao_ntm$ntm_1_digit=="B",] in ntm_hs_freq.)

A. 22%

#In order to get the B measures, we filter on "ntm_1_digit" and set it equal to "B". 22% is incorrect because running the rest of the code as it is gives 32%. Hence Answer B is correct!

#Correct code to run:
ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit=="B",] %>%
     group_by(hs6) %>%
     count()

coverage <- merge(lao_imp,ntm_hs_freq,
                  by="hs6",
                  all.x = TRUE)

coverage$NTMik <- 0
coverage[is.na(coverage$n),"n"] <- 0
coverage[coverage$n>0,"NTMik"] <- 1

sum(coverage$NTMik*coverage$value)/
  sum(coverage$value)*100
## [1] 32.20915

B. 32%

#Answer B is correct. Good job!   

6.2 Frequency Index

The frequency index is somewhat similar to the coverage ratio, except that instead of the trade value we multiply by a dummy signifying whether the product is traded at all. Hence, the percentage we get shows the share of imported product types subject to NTMs. The formula is as follows:

\(FI_i = \frac{\sum^{hs}_{k=1} NTM_{ik} D_{ik}}{\sum^{hs}_{k=1} D_{ik}} 100\).

where \(NTM_{ik}\) is a binary variable, which shows if there is an NTM or not for a traded good \(k\) and \(D_{ik}\) is also a binary variable showing whether there is any trade value associated with a product \(k\) or not.

We set up the NTM product code frequencies. Once again we exclude export related measures (“P”).

ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit!="P",] %>%
  group_by(hs6) %>%
  count()

In order to get the frequency index, we merge the lao_imp data with ntm_hs_freq.

freq_index <- merge(lao_imp,
                    ntm_hs_freq,
                    by="hs6",
                    all.x = TRUE)

We create a new variable Dik which equals 1 when country \(i\) imports any quantity of product \(k\) and 0 otherwise:

freq_index$Dik <- 0

How do we populate this value with 1 if Lao PDR imported anything and zero otherwise?

Dik should be 1 if Lao PDR imported at least 1 dollar.

Let’s start with finding the opposite - all the rows where the value is NA (missing) meaning Lao PDR didn’t import anything:

head(freq_index[is.na(freq_index$value),])
## [1] hs6     partner value   n       Dik
## <0 rows> (or 0-length row.names)

To show the rows that are not missing, we write “!” in front of is.na:

head(freq_index[!is.na(freq_index$value),])
##     hs6 partner    value n Dik
## 1 10121     WLD     1430 6   0
## 2 10190     WLD     5100 9   0
## 3 10221     WLD   243284 6   0
## 4 10229     WLD 32125452 9   0
## 5 10231     WLD    22480 6   0
## 6 10239     WLD  8630620 9   0

We assign 1 only for those with non missing values:

freq_index[!is.na(freq_index$value),"Dik"] <- 1

By definition we cannot assign a 1 for missing trade values even if \(n > 0\) because we do not know their values.

Now let’s calculate \(NTM_{ik}\). Remember, NTMik (\(NTM_{ik}\)) shows whether there is at least one NTM on that product code. Once again we create a new variable called NTMik, fill it initially with zeros, and set it to 1 only where \(n > 0\).

freq_index$NTMik <- 0

freq_index[freq_index$n>0,"NTMik"] <- 1

We divide by the total sum of Dik then multiply by 100.

sum(freq_index$NTMik*freq_index$Dik)/
  sum(freq_index$Dik)*100
## [1] 100

This means that 100% of imported products are covered by at least one NTM.

6.2.1 Frequency Index for SPS measures

Next, we calculate the Frequency Index for SPS measures. We first create the ntm_hs_freq data set and merge it with the lao_imp. We create and set Dik to 0.

ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit=="A",] %>%
     group_by(hs6) %>%
     count()

freq_index <- merge(lao_imp,
                    ntm_hs_freq,
                    by="hs6",
                    all.x = TRUE)

freq_index$Dik <- 0

How do we populate this variable with 1 if Lao PDR imported something, leaving it zero otherwise?

\(D_{ik}\) is 1 if Lao PDR imported at least 1 dollar.

We assign \(D_{ik}\) a value of 1 only for rows whose trade values are not missing:

freq_index[!is.na(freq_index$value),"Dik"] <- 1

NTMik shows us if there is at least one NTM on that code. We create and set the value to 0 for NTMik.

freq_index$NTMik <- 0

This is different from the above example with the frequency index calculated for all NTMs because some products now have no SPS measures so n (number of NTMs per product code) is missing (NA). Hence we replace the NAs with zero and equate NTMik to 1 only when \(n > 0\).

freq_index[is.na(freq_index$n),"n"] <- 0

freq_index[freq_index$n>0,"NTMik"] <- 1

We calculate the formula.

sum(freq_index$NTMik*freq_index$Dik)/
  sum(freq_index$Dik)*100
## [1] 18.3908

Therefore 18.39% of imported product types face at least one SPS measure.

Quiz 6.2

Let’s do a quick practice before going to prevalence scores.

What is the frequency index for Lao for NTMs B and C?

A. 31%
B. 54%
C. 12%
D. 17%

#We only need to change the line of the code with which we build the "ntm_hs_freq" data. We change it to:
ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit=="B"|lao_ntm$ntm_1_digit=="C",] %>%
     group_by(hs6) %>%
  count()

#Afterwards, we run everything as it is and it gives us 31.41762 or 31%, which means that Answer A is the correct answer.
freq_index <- merge(lao_imp,
                    ntm_hs_freq,
                    by="hs6",
                    all.x = TRUE)

freq_index$Dik <- 0
freq_index[!is.na(freq_index$value),"Dik"] <- 1
freq_index$NTMik <- 0
freq_index[is.na(freq_index$n),"n"] <- 0
freq_index[freq_index$n>0,"NTMik"] <- 1

sum(freq_index$NTMik*freq_index$Dik)/
  sum(freq_index$Dik)*100

6.3 Prevalence score for SPS measures

The prevalence score (PS) is the average number of NTMs applied to products:

\(PS_i = \frac{\sum^{hs}_{k=1} \#NTM_{ik} D_{ik}}{\sum^{hs}_{k=1} D_{ik}}\).

where \(\#NTM_{ik}\) is the number of NTMs per product \(k\) and \(D_{ik}\) is a binary variable showing whether there is any trade value associated with a product \(k\) or not.

We will use the freq_index from the previous section on frequency index as we already have n which is \(\#NTM_{ik}\) and Dik - \(D_{ik}\). Remember, SPS measures are type “A”.

# create NTM product frequencies
ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit=="A",] %>%
     group_by(hs6) %>%
     count()

# merge with imports
freq_index <- merge(lao_imp,
                    ntm_hs_freq,
                    by="hs6",
                    all.x = TRUE)
#create Dik
freq_index$Dik <- 0
freq_index[!is.na(freq_index$value),"Dik"] <- 1

#define missing n as 0
freq_index[is.na(freq_index$n),"n"] <- 0

The formula is as follows:

sum(freq_index$n*freq_index$Dik)/
  sum(freq_index$Dik)
## [1] 0.8125798

This means that on average 0.8 SPS measures apply to each imported product in Lao PDR.

UNCTAD provides all the indicators we just calculated, per country, here: UNCTAD link

Quiz 6.3

What is the prevalence score for all types of NTMs (except export related)?

A. 30
B. 20
C. 3.8
D. 1.8

#Once again we can use the frequency index code data, but this time we change the `ntm_hs_freq` to:
ntm_hs_freq <- lao_ntm[lao_ntm$ntm_1_digit!="P",] %>%
     group_by(hs6) %>%
  count()

#We can use the frequency code as before, but use the prevalence formula at the end:
freq_index <- merge(lao_imp,
                    ntm_hs_freq,
                    by="hs6",
                    all.x = TRUE)

freq_index$Dik <- 0
freq_index[!is.na(freq_index$value),"Dik"] <- 1
freq_index$NTMik <- 0
freq_index[is.na(freq_index$n),"n"] <- 0
freq_index[freq_index$n>0,"NTMik"] <- 1


sum(freq_index$n*freq_index$Dik)/sum(freq_index$Dik)

#We get 3.818008 or 3.8, meaning imported goods in Lao PDR are on average subject to 3.8 NTMs, hence Answer C is correct.

7. Coverage ratio, Frequency index and Prevalence score for Viet Nam

In this section we will calculate once again the NTM indicators - coverage ratio, frequency index and prevalence score for Viet Nam.

Let’s first activate the packages we need. There is commented code in case you have not installed them already.

#install.packages("dplyr")
library("dplyr")
#install.packages("readstata13")
library("readstata13")
#install.packages("tidyr")
library("tidyr")

Let’s download all the data we need first. Once again for the calculation of the incidence indicators for NTMs we only need the import data for Viet Nam and the list with all NTMs applied to products. In the table below you can find the two data sets we will use. The link to download is in the third column.

Description File name Link to download
WITS trade data for Viet Nam 2015-2017 trade_data_VNM_WITS_2015-2017.csv here
List of all products which have an NTM ntm_hs6_2016 v.12.dta here

7.1 Uploading and cleaning the WITS trade data for Viet Nam in R

Let’s upload the data in R. (We downloaded this data from the WITS website.)

trade_data_VNM <- read.csv("trade_data_VNM_WITS_2015-2017.csv",
                     stringsAsFactors = FALSE)
head(trade_data_VNM)
##   reporter partner   hs6 year flow   value
## 1      VNM     AUS 10129 2016    5 211.878
## 2      VNM     AUS 10129 2017    5 681.523
## 3      VNM     MYS 10129 2017    5  49.167
## 4      VNM     WLD 10129 2015    5 249.600
## 5      VNM     WLD 10129 2016    5 899.994
## 6      VNM     WLD 10129 2017    5 730.690

We only keep products with codes not higher than 980000, as HS chapters 98 and 99 are not harmonized, meaning they are reserved for national use only and therefore not comparable between countries. To apply this filter, we first make the product code variable numeric.

trade_data_VNM$hs6 <- as.numeric(trade_data_VNM$hs6)
trade_data_VNM <- trade_data_VNM[!(trade_data_VNM$hs6 > 980000),]

We change the reporter name for Romania from “ROM” to “ROU” in order to be coherent with the NTMs data. We will later in Section 7.4 merge the two data sets and calculate the coverage ratio, frequency index and prevalence score.

trade_data_VNM$reporter[trade_data_VNM$reporter=="ROM"] <- "ROU"

7.2 Uploading and cleaning the Non-Tariff Measures (NTMs) data for Viet Nam

We load the NTM data and select the reporter to be Viet Nam. (If interested, you can find the original NTMs data here.) We select the columns with countries imposing an NTM, the partner affected, the NTM code at 4 digits and 1 digit, and the product (HS) code. We rename the NTM code from ntmcode to NTM and change StartDate to year for easier use.

Keep in mind, this is a big file, so it might take a few minutes to import.

ntms_VNM <- read.dta13("ntm_hs6_2016 v.12.dta")
ntms_VNM <- ntms_VNM[ntms_VNM$reporter=="VNM", ] ## select Viet Nam as reporter
ntms_VNM <-  ntms_VNM %>% select("partner", "ntmcode",  "hs6", "StartDate", "ntm_1_digit") %>% # select only certain variables
  rename(NTM = ntmcode, year = StartDate) #and rename names
head(ntms_VNM)
##      partner  NTM    hs6 year ntm_1_digit
## 2525     WLD A140 010121 2004           A
## 2754     WLD A150 010121 2004           A
## 3219     WLD A310 010121 2004           A
## 3514     WLD A810 010121 2004           A
## 6095     WLD A890 010121 2004           A
## 7193     WLD P130 010121 2004           P

We drop product codes beyond 980000, just as we did for the trade data. The reason is that HS chapters 98 and 99 are not harmonized; they are reserved for national use only and therefore are not comparable between countries.

ntms_VNM$hs6 <- as.numeric(ntms_VNM$hs6)
ntms_VNM <- ntms_VNM[!(ntms_VNM$hs6 > 980000),]

7.2.1 Make the NTM data bilateral by separating WLD and EUN into individual countries

Next, we create the variable EUN, containing the ISO codes of all the European Union member States, and All_members, for all the partner countries in the trade data. We use these variables to turn the NTM data into bilateral country-pair relationships.

We start with the European Union member States and create the variable manually: we write out all the EU country codes and, with paste, join them into a single string separated by commas.

EUN <- c('AUT', 'BEL', 'BGR', 'HRV', 'CYP', 'CZE', 'DNK', 'EST', 'FIN', 'FRA', 'DEU', 'GRC', 'HUN', 'IRL', 'ITA', 'LVA', 'LTU', 'LUX', 'MLT', 'NLD', 'POL', 'PRT', 'ROM', 'SVK', 'SVN', 'ESP', 'SWE', 'GBR')
EUN <- paste(EUN,collapse=",")
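To see exactly what paste(..., collapse = ",") produces, here is a minimal example with three codes:

```r
# collapse = "," joins the elements into one comma-separated string.
codes <- c("AUT", "BEL", "BGR")
paste(codes, collapse = ",")
## [1] "AUT,BEL,BGR"
```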

We find the unique partners from the trade data, call the result All_members, and exclude the ISO3 codes “WLD” and “EUN”, as we want only individual member States.

All_members <- unique(trade_data_VNM$partner)
All_members <- All_members[!(All_members %in% c("WLD","EUN"))]
All_members <- paste0(All_members, collapse=",")

All_members now contains all the available trade world members and we can replace “WLD” with it.

We replace the ISO3 codes “WLD” and “EUN” with the lists of individual member States. We replace “WLD” with All_members and “EUN” with all the countries in the European Union EUN. We do this with gsub which replaces all matches. We separate the different lists of countries into rows using separate_rows.

ntms_VNM$partner  <- gsub("WLD", All_members, ntms_VNM$partner)
ntms_VNM$partner  <- gsub("EUN", EUN, ntms_VNM$partner)

ntms_VNM <- ntms_VNM %>% separate_rows(partner, sep = ",")
nrow(ntms_VNM)
## [1] 2726605
head(ntms_VNM)
##   partner  NTM   hs6 year ntm_1_digit
## 1     AUS A140 10121 2004           A
## 2     MYS A140 10121 2004           A
## 3     ZAF A140 10121 2004           A
## 4     LAO A140 10121 2004           A
## 5     NZL A140 10121 2004           A
## 6     THA A140 10121 2004           A

Since All_members contains "VNM", we remove the rows where the partner equals "VNM".

ntms_VNM <- ntms_VNM[ntms_VNM$partner !="VNM",]

We now keep imports only and hence exclude "P", which stands for export-related NTMs.

ntms_VNM <- ntms_VNM[ntms_VNM$ntm_1_digit !="P",]
head(ntms_VNM)
##   partner  NTM   hs6 year ntm_1_digit
## 1     AUS A140 10121 2004           A
## 2     MYS A140 10121 2004           A
## 3     ZAF A140 10121 2004           A
## 4     LAO A140 10121 2004           A
## 5     NZL A140 10121 2004           A
## 6     THA A140 10121 2004           A

Next, we remove the duplicates and keep only the unique combinations of NTM, product code and partner. We follow the usual approach: first we group by NTM, product code and partner and count how many duplicates there are. As we are not interested in the number of duplicates per combination, we then select only the NTM, product code and partner columns.

The method below is much quicker than using unique. We will later use this data to calculate the coverage ratio, frequency index and prevalence score.

ntms_VNM <- ntms_VNM %>% group_by(NTM, hs6, partner) %>% count() %>%
  select(NTM, hs6, partner)
head(ntms_VNM)
## # A tibble: 6 x 3
## # Groups:   NTM, hs6, partner [6]
##   NTM     hs6 partner
##   <chr> <dbl> <chr>
## 1 A140  10121 AGO
## 2 A140  10121 AND
## 3 A140  10121 ARE
## 4 A140  10121 ARG
## 5 A140  10121 AUS
## 6 A140  10121 AUT

7.3 Create import data

We subset the trade data, selecting trade flow 5 so that we only take imports. We keep only the partners that appear in the NTM data, and we check the number of rows.

VNM_import <- subset(trade_data_VNM,
                     trade_data_VNM$flow==5&
                     trade_data_VNM$partner %in% unique(ntms_VNM$partner),
                     select = c(partner, hs6, year, value))
nrow(VNM_import)
## [1] 208941
head(VNM_import)
##   partner   hs6 year    value
## 1     AUS 10129 2016  211.878
## 2     AUS 10129 2017  681.523
## 3     MYS 10129 2017   49.167
## 7     ZAF 10129 2015  249.600
## 8     ZAF 10129 2016  688.116
## 9     AUS 10221 2015 8301.276

We use the above data set to calculate the average value for combinations of the partner, the product code and the year. The function summarize(value = mean(value)) in this case takes the average value of any duplicates of the combinations and includes this number under “value”.

VNM_import <- VNM_import %>% group_by(partner, hs6, year) %>% summarize(value = mean(value))
nrow(VNM_import)
## [1] 208941
head(VNM_import)
## # A tibble: 6 x 4
## # Groups:   partner, hs6 [5]
##   partner    hs6  year   value
##   <chr>    <dbl> <int>   <dbl>
## 1 AGO      30749  2017  154.
## 2 AGO      50790  2017    7.84
## 3 AGO     120799  2017   12.6
## 4 AGO     150420  2016  207.
## 5 AGO     230120  2015  601
## 6 AGO     230120  2016 2308.

The number of rows does not change, so we can assume there were no duplicates.

Before we identify the years with missing values, we store the available years in a variable, as we will need it later when we gather the data back together.

VNM_import_years <- unique(VNM_import$year)
head(VNM_import_years)
## [1] 2017 2016 2015

We next want to include zeros instead of the missing years in order to calculate the average value over years for a product code and partner country. We do this to adjust for year-specific fluctuations. We first spread the values across the three available years using spread. Whenever we have a missing value, we include a 0 instead.

VNM_import <- spread(VNM_import, year, value)
VNM_import[is.na(VNM_import)] <- 0
head(VNM_import)
## # A tibble: 6 x 5
## # Groups:   partner, hs6 [6]
##   partner    hs6 `2015` `2016`   `2017`
##   <chr>    <dbl>  <dbl>  <dbl>    <dbl>
## 1 AGO      30749      0     0    154.
## 2 AGO      50790      0     0      7.84
## 3 AGO     120799      0     0     12.6
## 4 AGO     150420      0   207.     0
## 5 AGO     230120    601  2308. 13380.
## 6 AGO     230400      0     0    544.

We now have all the years present for all HS codes. We make year and value separate columns with gather. The function gather collects a set of column names and places them into a single “key” column.

VNM_import <- gather(VNM_import, year, value, as.character(VNM_import_years), factor_key=FALSE)
head(VNM_import)
## # A tibble: 6 x 4
## # Groups:   partner, hs6 [6]
##   partner    hs6 year     value
##   <chr>    <dbl> <chr>    <dbl>
## 1 AGO      30749 2017    154.
## 2 AGO      50790 2017      7.84
## 3 AGO     120799 2017     12.6
## 4 AGO     150420 2017      0
## 5 AGO     230120 2017  13380.
## 6 AGO     230400 2017    544.
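Note that spread and gather have been superseded in newer versions of tidyr by pivot_wider and pivot_longer. A sketch of the equivalent steps, using a toy data set with the same column layout (partner, hs6, year, value) rather than the actual import data:

```r
library(dplyr)
library(tidyr)

toy <- tibble(partner = "AGO", hs6 = 30749,
              year = c(2016, 2017), value = c(100, 154))

# Equivalent of spread(toy, year, value) with zeros for missing cells
wide <- toy %>% pivot_wider(names_from = year, values_from = value,
                            values_fill = 0)

# Equivalent of gathering the year columns back into year/value pairs
long <- wide %>% pivot_longer(cols = -c(partner, hs6),
                              names_to = "year", values_to = "value")
nrow(long) # 2
```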

We then take an average over the years.

VNM_import <- VNM_import %>% group_by(partner, hs6) %>% summarize(value=mean(value))
head(VNM_import)
## # A tibble: 6 x 3
## # Groups:   partner [1]
##   partner    hs6   value
##   <chr>    <dbl>   <dbl>
## 1 AGO      30749   51.4
## 2 AGO      50790    2.61
## 3 AGO     120799    4.18
## 4 AGO     150420   69.1
## 5 AGO     230120 5429.
## 6 AGO     230400  181.

Now we have the import data ready, so we can go onto calculating the coverage ratio, frequency index and prevalence score.

7.4 Calculating the Coverage ratio, Frequency index and Prevalence score

Let’s now calculate the coverage ratio, frequency index and prevalence score for Viet Nam.

We create a summary data matrix where we will store the latest year for which Viet Nam’s imports are recorded in WITS and the coverage ratios, frequency indices and prevalence scores as a whole and by sector. The sectors we use are agriculture, natural resources and manufacturing. “agro” stands for agriculture, “NR” for natural resources and “manu” for manufacturing, e.g. CR_agro is the coverage ratio for agriculture for Viet Nam.

We include in the data set the latest year, as defined earlier.

summary_data_VNM <- data.frame(matrix(vector(),1,13,
                                  dimnames = list(c(),
                                                  c("latest_year",
                                                    "CR",
                                                    "FI",
                                                    "PS",
                                                    "CR_agro",
                                                    "CR_NR",
                                                    "CR_manu",
                                                    "FI_agro",
                                                    "FI_NR",
                                                    "FI_manu",
                                                    "PS_agro",
                                                    "PS_NR",
                                                    "PS_manu"
                                                    )
                                  )),
                           stringsAsFactors=F)

summary_data_VNM[1,"latest_year"] <- max(VNM_import_years) 

7.4.1 Coverage ratio

Let’s now calculate the coverage ratio. Remember this is the formula:

\(CR_i = \frac{\sum^{hs}_{k=1} NTM_{ik} X_{ik}}{\sum^{hs}_{k=1} X_{ik}} 100\).

where \(NTM_{ik}\) indicates if there is an NTM or not for a traded good and \(X_{ik}\) is the trade value.
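To make the formula concrete, here is a toy illustration with three hypothetical products, two of which face an NTM:

```r
value <- c(100, 50, 50) # trade value X_ik per product (hypothetical)
ntm   <- c(1, 1, 0)     # NTM_ik: 1 if an NTM applies to the product

CR_toy <- sum(ntm * value) / sum(value) * 100
CR_toy # 75
```

That is, 150 of the 200 units of trade value are subject to NTMs, giving a coverage ratio of 75%.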

We start by just selecting the product (hs6) code and the partner from the NTM data. Then we group all combinations where the product code and partner are the same and we keep only the product code and partner name. (We exclude the count “n” as we do not need it in this case.) We add a column called NTM with values for all rows as 1 - this signifies that there is an NTM for each combination of HS 6-digit code and partner.

VNM_ntm_import_CR <- data.frame(ntms_VNM) %>% select(hs6, partner)

VNM_ntm_import_CR <- VNM_ntm_import_CR %>% group_by(hs6, partner) %>% summarise(count = n()) %>% select(hs6, partner)

VNM_ntm_import_CR$NTM <- 1

head(VNM_ntm_import_CR)
## # A tibble: 6 x 3
## # Groups:   hs6 [1]
##     hs6 partner   NTM
##   <dbl> <chr>   <dbl>
## 1 10121 AGO         1
## 2 10121 AND         1
## 3 10121 ARE         1
## 4 10121 ARG         1
## 5 10121 AUS         1
## 6 10121 AUT         1

Next we combine the data set we just created above with VNM_import in order to include the trade values. If an NTM is missing, we put the value 0.

CR_ntm_data <- merge(VNM_import,VNM_ntm_import_CR, by=c("partner", "hs6"), all.x = TRUE )

CR_ntm_data$NTM[is.na(CR_ntm_data$NTM)] <- 0

head(CR_ntm_data)
##   partner    hs6       value NTM
## 1     AGO  30749   51.361000   1
## 2     AGO  50790    2.613333   0
## 3     AGO 120799    4.183333   1
## 4     AGO 150420   69.090000   1
## 5     AGO 230120 5429.471667   1
## 6     AGO 230400  181.450000   1

Notice, this time for the overall coverage ratio, we do include all.x = TRUE, because some HS codes do not have an NTM.

Finally, we sum all the values where an NTM is present and we divide by the total sum of all the values. We multiply by 100 to have a percentage. We include this number in our summary data set.

CR <- sum(CR_ntm_data$NTM*CR_ntm_data$value)/sum(CR_ntm_data$value)*100
CR
## [1] 62.06449
summary_data_VNM[1, "CR"] <- CR

7.4.2 Frequency Index

Remember from Section 6, the frequency index is similar to the coverage ratio, except that instead of multiplying by the trade value, we multiply by a dummy variable set to show whether there was any trade value or not (\(D_{ik}\)):

\(FI_i = \frac{\sum^{hs}_{k=1} NTM_{ik} D_{ik}}{\sum^{hs}_{k=1} D_{ik}} 100\).

where \(NTM_{ik}\) is a binary variable, which shows if there is an NTM or not for a traded good \(k\) and \(D_{ik}\) is also a binary variable showing whether there is any trade value associated with a product \(k\) or not.

Hence, we can use the above created data but add a new dummy variable representing whether there was trade or not per HS code.

Let’s copy this data set and create a new variable called “valueDummy” to represent \(D_{ik}\). We equate it to 1 if there is a value, otherwise it is 0.

FI_ntm_data <- CR_ntm_data
FI_ntm_data$valueDummy <- 0
FI_ntm_data[!is.na(FI_ntm_data$value),"valueDummy"] <- 1
head(FI_ntm_data)
##   partner    hs6       value NTM valueDummy
## 1     AGO  30749   51.361000   1          1
## 2     AGO  50790    2.613333   0          1
## 3     AGO 120799    4.183333   1          1
## 4     AGO 150420   69.090000   1          1
## 5     AGO 230120 5429.471667   1          1
## 6     AGO 230400  181.450000   1          1

The formula is the same as the one for the coverage ratio, except now we substitute the trade value with the trade dummy.

FI <- sum(FI_ntm_data$NTM*FI_ntm_data$valueDummy)/sum(FI_ntm_data$valueDummy)*100
FI
## [1] 50.81422
summary_data_VNM[1, "FI"] <- FI

This means that about 51% of imported products (partner-product combinations) are covered by at least one NTM.

7.4.3 Prevalence score

Let’s now calculate the prevalence score. Remember this is the formula:

\(PS_i = \frac{\sum^{hs}_{k=1} \#NTM_{ik} D_{ik}}{\sum^{hs}_{k=1} D_{ik}}\).

where \(\#NTM_{ik}\) is the number of NTMs per product \(k\) and \(D_{ik}\) is a binary variable showing whether there is any trade value associated with a product \(k\) or not.
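As a toy illustration with three hypothetical products, two of which are actually traded:

```r
n_ntms <- c(3, 0, 1) # number of NTMs per product, #NTM_ik (hypothetical)
traded <- c(1, 1, 0) # D_ik: 1 if the product is imported, 0 otherwise

PS_toy <- sum(n_ntms * traded) / sum(traded)
PS_toy # 1.5
```

The traded products carry 3 and 0 NTMs, so on average each traded product faces 1.5 NTMs.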

First, we count the number of NTMs per combination of HS 6-digit code and partner.

VNM_ntm_PS <- ntms_VNM %>% group_by(hs6, partner) %>% summarise(count = n())
head(VNM_ntm_PS)
## # A tibble: 6 x 3
## # Groups:   hs6 [1]
##     hs6 partner count
##   <dbl> <chr>   <int>
## 1 10121 AGO         5
## 2 10121 AND         5
## 3 10121 ARE         5
## 4 10121 ARG         5
## 5 10121 AUS         5
## 6 10121 AUT         5

Next, we merge this data set with VNM_import according to the HS6 code and partner, and include the mean trade value. Afterwards, for any missing value in “count”, we put a 0.

PS_ntm_data <- merge(VNM_ntm_PS, VNM_import, by=c("hs6", "partner"), all.y = TRUE)

head(PS_ntm_data)
##     hs6 partner count      value
## 1 10129     AUS    NA  297.80033
## 2 10129     MYS    NA   16.38900
## 3 10129     ZAF    NA  312.57200
## 4 10221     AUS     5 5002.01867
## 5 10221     NZL     5   17.00333
## 6 10221     THA     5  393.93000
PS_ntm_data$count[is.na(PS_ntm_data$count)] <- 0

We substitute “value” with 1 if the value is greater than zero. If missing or 0, we put a 0.

PS_ntm_data$value[!(is.na(PS_ntm_data$value))&(PS_ntm_data$value>0)] <- 1
PS_ntm_data$value[is.na(PS_ntm_data$value)|PS_ntm_data$value==0] <- 0
head(PS_ntm_data)
##     hs6 partner count value
## 1 10129     AUS     0     1
## 2 10129     MYS     0     1
## 3 10129     ZAF     0     1
## 4 10221     AUS     5     1
## 5 10221     NZL     5     1
## 6 10221     THA     5     1

Next, we sum the product of the value dummy and the count, and divide by the sum of the value dummies. We store this in our summary data.

PS <- sum(PS_ntm_data$value*PS_ntm_data$count)/sum(PS_ntm_data$value)
PS
## [1] 1.86705
summary_data_VNM[1, "PS"] <- PS

7.4.4 Coverage ratios, Frequency index and Prevalence scores by broad sectors

Broad product groups are defined by the Harmonized System (HS) at the 2-digit level: Agriculture corresponds to HS 1-24, Natural Resources to HS 25-27, and Manufacturing to HS 28-97. Study and run the code below and practice at the end with the quiz.

summary_data_VNM[1, "CR_agro"] <- sum(CR_ntm_data$NTM[CR_ntm_data$hs6<250000]*CR_ntm_data$value[CR_ntm_data$hs6<250000])/sum(CR_ntm_data$value[CR_ntm_data$hs6<250000])*100
summary_data_VNM[1, "CR_NR"] <- sum(CR_ntm_data$NTM[CR_ntm_data$hs6>=250000&CR_ntm_data$hs6<280000]*CR_ntm_data$value[CR_ntm_data$hs6>=250000&CR_ntm_data$hs6<280000])/sum(CR_ntm_data$value[CR_ntm_data$hs6>=250000&CR_ntm_data$hs6<280000])*100
summary_data_VNM[1, "CR_manu"] <- sum(CR_ntm_data$NTM[CR_ntm_data$hs6>=280000]*CR_ntm_data$value[CR_ntm_data$hs6>=280000])/sum(CR_ntm_data$value[CR_ntm_data$hs6>=280000])*100

summary_data_VNM[1, "FI_agro"] <- sum(FI_ntm_data$NTM[FI_ntm_data$hs6<250000]*FI_ntm_data$valueDummy[FI_ntm_data$hs6<250000])/sum(FI_ntm_data$valueDummy[FI_ntm_data$hs6<250000])*100
summary_data_VNM[1, "FI_NR"] <- sum(FI_ntm_data$NTM[FI_ntm_data$hs6>=250000&FI_ntm_data$hs6<280000]*FI_ntm_data$valueDummy[FI_ntm_data$hs6>=250000&FI_ntm_data$hs6<280000])/sum(FI_ntm_data$valueDummy[FI_ntm_data$hs6>=250000&FI_ntm_data$hs6<280000])*100
summary_data_VNM[1, "FI_manu"] <- sum(FI_ntm_data$NTM[FI_ntm_data$hs6>=280000]*FI_ntm_data$valueDummy[FI_ntm_data$hs6>=280000])/sum(FI_ntm_data$valueDummy[FI_ntm_data$hs6>=280000])*100

rm(CR, FI) # removes the CR and FI objects, which are no longer needed
write.csv(summary_data_VNM, "summary_data_VNM.csv", row.names = FALSE)
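The repetitive subsetting above can also be wrapped in a small helper function. This is only a sketch (the function name and the toy inputs are our own, not part of the guide's data); pass the trade value as the weight for a coverage ratio, or the trade dummy for a frequency index:

```r
# Computes a CR/FI-style ratio for the HS-code range [lo, hi)
sector_ratio <- function(ntm, weight, hs6, lo, hi) {
  s <- hs6 >= lo & hs6 < hi
  sum(ntm[s] * weight[s]) / sum(weight[s]) * 100
}

# Toy vectors: one product per broad sector (hypothetical values)
hs6   <- c(10121, 250100, 290110)
ntm   <- c(1, 0, 1)
value <- c(100, 50, 50)

sector_ratio(ntm, value, hs6, 0, 250000)      # agriculture: 100
sector_ratio(ntm, value, hs6, 250000, 280000) # natural resources: 0
```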

Quiz 7

Time for some practice! Now having seen how we calculate the coverage ratios and frequency indices, calculate the prevalence score for the same sectors (e.g. Agriculture corresponds to HS 1-24, Natural Resources to HS 25-27, and Manufacturing to 28-97).

What is the correct answer?

A. PS_agro = 14.36, PS_NR = 4.14, PS_manu = 0.51
B. PS_agro = 16.36, PS_NR = 4.40, PS_manu = 3.99
C. PS_agro = 5.36, PS_NR = 4.14, PS_manu = 5.51
D. PS_agro = 11.36, PS_NR = 1.40, PS_manu = 0.99

# We do the same as in the above section but using the prevalence score formula
summary_data_VNM[1, "PS_agro"] <- sum(PS_ntm_data$value[PS_ntm_data$hs6<250000]*PS_ntm_data$count[PS_ntm_data$hs6<250000])/sum(PS_ntm_data$value[PS_ntm_data$hs6<250000])
summary_data_VNM[1, "PS_NR"] <- sum(PS_ntm_data$value[PS_ntm_data$hs6>=250000&PS_ntm_data$hs6<280000]*PS_ntm_data$count[PS_ntm_data$hs6>=250000&PS_ntm_data$hs6<280000])/sum(PS_ntm_data$value[PS_ntm_data$hs6>=250000&PS_ntm_data$hs6<280000])
summary_data_VNM[1, "PS_manu"] <- sum(PS_ntm_data$value[PS_ntm_data$hs6>=280000]*PS_ntm_data$count[PS_ntm_data$hs6>=280000])/sum(PS_ntm_data$value[PS_ntm_data$hs6>=280000])

#Therefore Answer D is the correct answer.

8. Regulatory Distance

8.1 Calculating the regulatory distance

In this section we will calculate the regulatory distance between different country pairs. Regulatory distance is an indicator that measures the similarity of NTM policies across member States; it can, for instance, reveal regulatory harmonization among members of regional trade agreements (RTAs). We calculate the regulatory distance between two countries by counting the product-NTM combinations that one country applies but the other does not, and dividing by the total number of product-NTM combinations that exist:

\(rd_{ij} = \frac{1}{N} \sum_k \sum_l |n_{ilk} - n_{jlk}|\).

In the formula above, \(N\) is the total number of unique combinations of product codes and NTMs; \(n_{ilk}\) is a dummy variable equal to 1 if country \(i\) applies NTM \(l\) to product \(k\), and \(n_{jlk}\) is the analogous dummy for country \(j\). (For more information on regulatory distance for NTMs, you can check this UNCTAD paper.)
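A toy illustration of the formula: suppose there are \(N = 4\) unique NTM-product combinations in total, country A applies the first three and country B the last two (hypothetical data):

```r
country_a <- c(1, 1, 1, 0) # n_ilk: combinations applied by country A
country_b <- c(0, 0, 1, 1) # n_jlk: combinations applied by country B

rd_toy <- sum(abs(country_a - country_b)) / length(country_a)
rd_toy # 0.75
```

The two countries differ on three of the four combinations, so their regulatory distance is 0.75.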

The purpose of this subsection is to create a regulatory distance matrix which will contain all the paired distances between ESCAP countries. To do this, we will complete the following steps:

Step 1: Create the list with ESCAP member countries which have available data.
Step 2: Create an empty regulatory distance matrix, by including ESCAP country ISO3 codes as column and row names.
Step 3: Calculate \(N\) (the total number of unique combinations of product codes and NTMs).
Step 4: Create a nested for loop to calculate the regulatory distance per country pair.

Let’s first begin by activating the necessary packages we need. There is commented code in case you have not installed them already.

#install.packages("dplyr")
library("dplyr")
#install.packages("readstata13")
library("readstata13")

Next, let’s get the data. In the table below you can find the two data sets we will use; the link to download is in the third column. We will once again use the same NTM data as in Section 7.

Description | File name | Link to download
List of all products which have an NTM | ntm_hs6_2016 v.12.dta | here
List of all ESCAP member States | Country Categories.csv | here

Let’s now load the list of all products which have an NTM. Keep in mind this might take a few minutes.

ntm_data <- read.dta13("ntm_hs6_2016 v.12.dta")

Once the data is loaded, we exclude NTM codes which are missing. We only need the reporter, NTM code and product codes (HS 6-digit codes).

ntm_data <- ntm_data[!is.na(ntm_data$ntmcode)&ntm_data$ntmcode!="",]
ntm_data <- ntm_data[,c("reporter", "ntmcode", "hs6")]

We group the data by reporter, NTM and product code (hs6) and count the number of combinations in a new variable called count.

ntm_data <- ntm_data %>% group_by(reporter, ntmcode, hs6) %>%
  summarise(count = n())
head(ntm_data)
## # A tibble: 6 x 4
## # Groups:   reporter, ntmcode [1]
##   reporter ntmcode hs6    count
##   <chr>    <chr>   <chr>  <int>
## 1 AFG      A110    010110     1
## 2 AFG      A110    010190     1
## 3 AFG      A110    010210     1
## 4 AFG      A110    010290     2
## 5 AFG      A110    010410     1
## 6 AFG      A110    010420     1

We prepare the regulatory matrix by creating a list of countries for which we want the regulatory distance. Remember the regulatory matrix shows the distance between two countries and has as column and row names the ISO3 codes of the countries.

For the moment we will do this exercise only for ESCAP member States, as it is quite time consuming. Readers interested in running the analysis for all available countries can simply keep avail_iso3s <- unique(ntm_data$reporter) and skip the filtering below.

We load the file with countries ISO3 codes and select only the ESCAP member States.

countries <- read.csv("Country Categories.csv", stringsAsFactors = FALSE)

ESCAP <- countries$iso3[countries$escap==1]

Then we select the unique reporter ISO3 country codes from the NTM data. This list represents all the country codes for which we have available NTM data.

avail_iso3s <- unique(ntm_data$reporter)
head(avail_iso3s)
## [1] "AFG" "ARE" "ARG" "ATG" "AUS" "BEN"

We use intersect to find the codes that appear in both avail_iso3s and ESCAP. Notice that the resulting list contains only ESCAP members, as expected.

avail_iso3s <- intersect(avail_iso3s,ESCAP)
head(avail_iso3s)
## [1] "AFG" "AUS" "BRN" "CHN" "HKG" "IDN"

We create an empty regulatory distance matrix. For column size we use the length of avail_iso3s and add 1 for the reporter column. We populate the column names with reporter and the ISO3 codes with the option dimnames.

regulatory_distance_matrix <- data.frame(matrix(vector(),0,length(avail_iso3s)+1,
                                   dimnames = list(c(), c("reporter", avail_iso3s )
                                   )),
                            stringsAsFactors=F)

Now we can move on to calculating the regulatory distance formula:

\(rd_{ij} = \frac{1}{N} \sum_k \sum_l |n_{ilk} - n_{jlk}|\).

As \(N\) is a constant, let's start by calculating it outside of the loop. Remember, \(N\) is the total number of unique NTM-product code combinations in the entire NTM data set; hence we can use group_by and count. We store this number as N.

N <- ntm_data %>% group_by(ntmcode, hs6) %>% count()
N <- nrow(N)

We are ready to fill in our regulatory distance matrix with values. In order to do so, we loop through each country in ESCAP and fill in the regulatory_distance_matrix created above. Let’s go through the nested loop together.

We first record all the NTMs (ntmcode) and product codes (hs6) for the first reporter country country_a <- ntm_data[ntm_data$reporter==avail_iso3s[g],c("ntmcode", "hs6")]. We create an additional column where we mark all these combinations with 1: country_a$country_a <- 1. We include the ISO3 country code of the current country under the reporter column: regulatory_distance_matrix[g,"reporter"] <- avail_iso3s[g].

We create a second for loop because we want all possible combinations of country A with the remaining countries. In the inner loop, we record all the NTMs (ntmcode) and product codes (hs6) for the second country: country_b <- ntm_data[ntm_data$reporter==avail_iso3s[k],c("ntmcode", "hs6")]. Again, we create an additional column where we mark all these combinations with 1: country_b$country_b <- 1.

We merge both data sets, keeping all rows from both: merged <- merge(country_a, country_b, by=c("ntmcode", "hs6"), all = TRUE). If one country has a combination which the other does not, the merge produces a missing value, which we replace with 0: merged[is.na(merged)] <- 0.

We then create a variable for the absolute difference between the two countries' NTM indicators: merged$abs_diff <- abs(merged$country_a-merged$country_b). We sum the absolute differences and divide by N (the total number of NTM-product code combinations in the NTM data set): rd <- sum(merged$abs_diff)/N. The resulting value is the regulatory distance rd, which we store in our regulatory distance matrix: regulatory_distance_matrix[g,avail_iso3s[k]] <- rd.

To make this code more efficient, we include an if statement which skips calculating the distance between A and B if the distance between B and A has already been calculated: if (!is.na(regulatory_distance_matrix[k,avail_iso3s[g]])){next }. (This way we end up with an upper triangular matrix.)

Once you are ready, run the codes below. Keep in mind that it might take a few minutes.

for (g in 1:length(avail_iso3s)){
  country_a <- ntm_data[ntm_data$reporter==avail_iso3s[g],c("ntmcode", "hs6")]
  country_a$country_a <- 1
  regulatory_distance_matrix[g,"reporter"] <- avail_iso3s[g]

  for (k in 1:length(avail_iso3s)){

     if (!is.na(regulatory_distance_matrix[k,avail_iso3s[g]])){next }

  country_b <- ntm_data[ntm_data$reporter==avail_iso3s[k],c("ntmcode", "hs6")]
  country_b$country_b <- 1
  merged <- merge(country_a, country_b, by=c("ntmcode", "hs6"), all = TRUE)
  merged[is.na(merged)] <- 0
  merged$abs_diff <- abs(merged$country_a-merged$country_b)
  rd <- sum(merged$abs_diff)/N
  regulatory_distance_matrix[g,avail_iso3s[k]] <- rd

   }
}

Now we fill in the missing values and create a CSV file. If the value for country A and country B is missing, if (is.na(regulatory_distance_matrix[k,avail_iso3s[g]])), we fill it in with the value for country B and country A: regulatory_distance_matrix[k,avail_iso3s[g]] <- regulatory_distance_matrix[g,avail_iso3s[k]].

for (g in 1:length(avail_iso3s)){
  for (k in 1:length(avail_iso3s)){
    if (is.na(regulatory_distance_matrix[k,avail_iso3s[g]])){
      regulatory_distance_matrix[k,avail_iso3s[g]] <- regulatory_distance_matrix[g,avail_iso3s[k]]
      }
  }
}

write.csv(regulatory_distance_matrix, "regulatory_distance_matrix_ESCAP.csv")

Quiz 8

Let’s do a quick practice! Make sure you have run the code on your own.

What is the regulatory distance between Kazakhstan (KAZ) and China (CHN)?

A. 0.12
B. 0.22
C. 0.32
D. 0.02

# You need to run the code and change the path for the "write.csv" command. Once you open the file, you can see that KAZ and CHN have a regulatory distance of 0.32280137848764, or 0.32, hence Answer C is the correct answer. You can find the full table here: https://r.tiid.org/regulatory_distance_matrix_ESCAP.csv

8.2 Multidimensional scaling

Using multidimensional scaling (MDS) we can display the information from the regulatory distance matrix by performing principal coordinates analysis. We can therefore show how far or close countries are according to the regulatory distance, which can, for example, illustrate clustering among members of regional trade agreements (RTAs). Explaining MDS and principal coordinates analysis goes beyond the scope of this guide. As the commands are ready-made and easy to use in R, it is not necessary to have prior knowledge of the methodology. However, readers further interested in MDS and principal coordinates analysis are encouraged to check the help page of the cmdscale function for useful references and more information.

First, we download and activate the necessary packages as shown below.

#install.packages("magrittr")
library(magrittr)
#install.packages("dplyr")
library(dplyr)
#install.packages("ggpubr")
library(ggpubr)

We first need to render the regulatory distance matrix in the correct format. We copy the regulatory distance matrix and call it regulatory_distance_matrix_MDS. Then we make the row names equal to the reporter column and exclude the reporter column.

regulatory_distance_matrix_MDS <- regulatory_distance_matrix
rownames(regulatory_distance_matrix_MDS) <- regulatory_distance_matrix_MDS$reporter
regulatory_distance_matrix_MDS <- regulatory_distance_matrix_MDS %>% select (-c(reporter))

We perform principal coordinates analysis with the command cmdscale and make a “tibble”. We call the new columns Dim.1 and Dim.2.

regulatory_distance_matrix_MDS <- regulatory_distance_matrix_MDS %>% cmdscale() %>% as_tibble()

colnames(regulatory_distance_matrix_MDS) <- c("Dim.1", "Dim.2")

head(regulatory_distance_matrix_MDS)
## # A tibble: 6 x 2
##     Dim.1    Dim.2
##     <dbl>    <dbl>
## 1  0.0168  0.0158
## 2 -0.0163 -0.00568
## 3  0.0149  0.0111
## 4 -0.271  -0.0340
## 5  0.0159  0.0107
## 6  0.0123  0.0171

Dim.1 and Dim.2 represent the countries' positions on a two-dimensional map, such that the distances between the countries are approximately equal to their dissimilarities. The actual coordinate values do not mean anything by themselves; what matters are the distances between countries.
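One way to gauge how faithful the two-dimensional map is (a sketch, using a toy symmetric distance matrix rather than the actual regulatory distances) is to compare the distances implied by the MDS coordinates with the original ones:

```r
# Toy symmetric distance matrix for three countries (hypothetical values)
d <- matrix(c(0.0, 0.2, 0.4,
              0.2, 0.0, 0.3,
              0.4, 0.3, 0.0), nrow = 3)

coords   <- cmdscale(d, k = 2)      # principal coordinates in 2 dimensions
approx_d <- as.matrix(dist(coords)) # distances implied by the 2-D map

round(approx_d[1, 2], 2) # 0.2 - three points embed (almost) exactly in 2-D
```

For many countries the map is only an approximation, so larger discrepancies between approx_d and the original matrix are to be expected.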

We can plot this and see the relative distance between countries.

We use ggscatter to plot a scatter plot. We add a horizontal line with geom_hline and a vertical line with geom_vline to show the horizontal and vertical axes, respectively.

scatterplot <- ggscatter(regulatory_distance_matrix_MDS, x = "Dim.1", y = "Dim.2",
          label = regulatory_distance_matrix$reporter,
          size = 1,
          repel = TRUE)
scatterplot + geom_hline(yintercept=0, linetype="dashed", color = "red") +
  geom_vline(xintercept = 0, linetype="dotted",
                color = "blue")

Naturally, most of the ASEAN countries are close to each other (Myanmar, Brunei Darussalam, Malaysia, Thailand and Singapore), as are Russia, Kazakhstan and Kyrgyzstan, which are members of the Eurasian Economic Union (EAEU). We can also see that China is quite far away, which is expected, as China has one of the highest numbers of NTM regulations in the world.

Conclusion

This concludes our R guide. Remember: practice, practice, practice! :)

In this training module we demonstrated how to use R to conduct analysis of various aspects of international trade and trade policy, as well as how to evaluate intensity of NTM use by countries. NTMs, as policy instruments, can directly contribute to sustainable development. The analysis presented in this recent ESCAP report shows that almost half of NTMs in Asia and the Pacific directly address SDGs.

R can be used to identify those NTMs that directly address the Sustainable Development Goals (SDGs). To learn more about the methodology of linking NTMs to SDGs and about the R code that can be used to identify NTMs that directly address SDGs, please refer to the ESCAP’s training module titled R for linking non-tariff measures to the Sustainable Development Goals. After you go through it, you may try to utilize knowledge acquired in the present training module to calculate NTM intensity indicators (coverage ratio, frequency index and prevalence score), as well as regulatory distance index only for those NTMs that directly address SDGs.

Acknowledgements

The United Nations ESCAP Training on using R for trade analysis was prepared under the overall supervision of Mia Mikic, Director, Trade, Investment and Innovation Division (TIID) of the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP) and Yann Duval, Chief of Trade Policy and Facilitation (TPFS), TIID, ESCAP. This training module was developed by Alexey Kravchenko, Economic Affairs Officer, TPFS, TIID, ESCAP and Hrisyana Doytchinova, Consultant, TPFS, TIID, ESCAP. The authors are grateful for the extensive review and translation of the module provided by Maria Semenova, and for the testing feedback received from Quynh Ahn Luu, Pedro Romão, Pascal Mann and Andrea Dalla Rosa.