Introduction

This training is titled R for linking non-tariff measures to the Sustainable Development Goals and is second in the two-module E-learning series on R for trade and trade policy analysis published by the Trade, Investment and Innovation Division of the Economic and Social Commission for Asia and the Pacific (TIID ESCAP).

Non-Tariff Measures (NTMs) are policy measures - other than ordinary customs tariffs - that can potentially have an economic effect on international trade in goods, changing quantities traded, or prices or both.1 By definition they are not inherently protectionist, as while they can impose additional costs on the movement of goods, they also purport to address specific and legitimate non-trade objectives, some of which are relevant to the achievement of thе long-standing internationally agreed upon development priorities, which were solidified in the Sustainable Development Goals (SDGs) in 2015.2

In an effort to understand and describe the linkages that exist between NTMs imposed on internationally traded goods and the SDGs, ESCAP and UNCTAD jointly developed an SDG-HS-NTM concordance matrix based on their methodology outlined in the ESCAP working paper ‘Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific’.3 The methodology and the excel file containing the concordance matrix are available here.

The matrix can be used to conduct analysis of global, regional or country-level data on non-tariff measures collected and stored in the UNCTAD’s global database on NTMs (TRAINS) available at https://trains.unctad.org/. The database contains information on more than 65,000 regulations imposed by over 100 countries. The data was collected following UNCTAD Guidelines to collect data on official non-tariff measures 2016 4 and classified in accordance with the International Classification of Non-Tariff Measures (ICNTM;5 version of 2012),6 developed by the Multi-Agency Support Team (MAST) Group.7

You can download an Excel file with a complete NTM data set dated as of May 2019 from here. An example of such analysis was conducted by ESCAP and its results were presented in Asia-Pacific Trade and Investment Report 20198 and in the accompanying country briefs for Armenia, Azerbaijan and Tajikistan available from here(see bottom of the webpage).

This training module describes how to use R to transform the SDG-HS-NTM concordance matrix into a tool to conduct such an analysis of NTM data from TRAINS database of NTMs.

In particular, it describes the following:

  • what data is used to conduct the analysis, where to download it from and how to import it into R;
  • how to extract necessary data from the imported data frames;
  • how to prepare the data for analysis by putting it into the correct format, removing redundant data, unnecessary symbols or white spaces, filtering, arranging and replacing the data as necessary;
  • what code algorithm is necessary to conduct the matching of data in TRAINS database to the concordance strings in the matching matrix and how to record, save and export the results of such matching;
  • how to review and interpret the results of the matching, and to generate illustrative graphs in R or in Excel.

The material contains the R code, explanatory notes, preview of interim results and links to data sources. Complete code algorithm that you can open in RStudio and run with or without further editing is available in a separate script file (see Data section).

This training module also contains quizzes and practice sessions that intend to highlight a few useful things and to help some of the new knowledge to sink in better.

Note that this training module is part of an e-learning series and is provided together with the training module titled Training on using R for trade analysis, which is available in English and in Russian languages. Training on using R for trade analysis is provided together with a rated quiz, satisfactory passing of which entitles students to a certificate of completion. On the contrary, this module R for linking non-tariff measures to the Sustainable Development Goals does not have any rated quiz. So read on and practice at your own pace and for your own pleasure!

Data

All necessary data are described in the table below.9 You can download them into the directory on your PC that you will use as your working directory in RStudio for this training module.

What Training file name Original source
UNCTAD’s TRAINS database on NTMs UNCTAD_i-tip_report_ALL_Measures_x.xlsx here
SDG-HS-NTM concordance matrix _SDG-HS-NTM-Concordance[v1.1 Sep 2019]0.xlsx here
UN Comtrade Commodity Classification HSCodesAll.csv here
Table of concordance between different versions of the Harmonized Commodity Description and Coding System HS_concordance_full.csv here
International Classification of non-tariff measures MAST_all.csv here
ISO 3 country codes and binary data on whether a country has NTM data recorded in TRAINS and on whether it belongs to a certain grouping countries_UNCTAD_iso3.csv here
Complete code algorithm complete_matching_code.R here
Files with matching results per country countries.zip here

First two are the main data files that are used to analyze non-tariff measures and their linkages to the SDGs. Open these two Excel files to familiarize yourself with the content and the setup of these two data sets.

UNCTAD’s TRAINS database contains the following information for each recorded NTM: imposing country, partner affected, NTM code (as per ICNTM 2012), measure description, affected HS codes, description of affected goods, source and national legal basis. In steps below, we will combine four columns titled measure description, description of affected goods, source and national legal basis into one column titled description.

SDG-HS-NTM concordance matrix contains the following information for each concordance string: SDG, SDG Targets, product group description, description of HS code list, list of HS codes, HS version, NTM code (ICNTM), list of keywords, list of negative keywords, and additional notes. You will see that some rows are colored grey and have 0 in column CODE 0/1. These are concordance strings that describe potential relationships that exists between regulating trade in certain goods and achievement of certain SDGs. Due to the setup of the TRAINS database of NTMs, these concordance strings cannot be used in the analysis below. In steps below we will filter these rows out. Filtering will leave us with the concordance strings that describe intended direct (and positive) impact on the achievement of SDGs.

A specific HS-NTM code pair was considered to have a direct linkage to an SDG Target if (1) it has a clearly stated SDG Target-related objective (supported by relevant keywords in the descriptive information present in the TRAINS database), or (2) the examined HS-NTM code combination is not likely to have any objective other than the one that is relevant to an SDG (e.g. trade in hazardous chemicals and waste, endangered species of flora and fauna, cultural heritage items, arms and weapons, etc.).10

The code algorithm, described in detail below, eventually allows to check each NTM entry in the TRAINS database for presence of simultaneous matches with at least one HS code, one NTM code and one keyword described in a given concordance string in SDG-HS-NTM concordance matrix.

The rest of the data frames are used to filter, clean, convert, reorganize and otherwise prepare the data in the two main data frames for subsequent analysis. Further details on these data sets are provided in sections below.

complete_matching_code.R contains the complete code algorithm, and countries.zip has a set of 88 files that contain results of matching HS and NTM codes to SDGs (more on these two in sections below).

Downloading R and RStudio

R is a free software environment for statistical computing and graphics, which runs on a variety of platforms, while RStudio, that we use for the purpose of this training, is an open source software that provides a very convenient console for inputting and running the code, previewing the generated outputs and saving them in different formats.

Find download instructions for PC and Mac in Annex 1 below, or refer to Training on using R for trade analysis for more detailed instructions.

Code

Now we will go through the code step by step, provide necessary explanations and show the interim results. Should you need more information on various packages and functions run help(package = "package_name") for a given package or ?function_name for a given function. Help information will be displayed in the bottom right corner of your RStudio window in tab Help.

The full code algorithm is available in “complete_matching_code.R” file (see Data section). You can open it in your RStudio or is a .txt file. You can edit it or run it as is, having only revised the first line of code setting the working directory (we will explain how to do that below). However, do not rush to use this file. First go through the steps of this training module and familiarize yourself with what each line of code intends to do.

General suggestions and recommendations

Going through this training module may take 1-2 weeks depending on your learning pace and on your familiarity with R code. Doing the training module titled Training on using R for trade analysis first, should make going through this learning module easier and faster. Anyhow the process will take a few separate sessions.

To make sure that no progress is lost, follow the following recommendations:

  • When you start this training, open, name and save a new script in RStudio. As you go through the module it is necessary to paste and run all code from the code chunks in this module, both for training, and for quizzes and training sessions, in one script. You can use # sign to insert comments to easily differentiate between training code and quiz code. Important: Code chunks are page-wide grey boxes that contain the code strings that are described in this module. Code in small grey boxes within the paragraphs of text are to be examined, but should not be copied into your R script. Page-wide white boxes with text preceded by ## contain the output, that is the result of running the code string just above it - this is what you should see when you run the code in your RStudio.

  • Execution of any chunk of the training code, quiz code and some practice code is dependent on prior execution of all earlier chunks of code (including those that set the working directory and load all function packages and data files). So, as you progress through this training module, quizzes and practice sessions, make sure to copy all code into your R script, run it and save it regularly. To save the script press Ctrl+S at any convenient time. To save all the objects with variables and data frames (the workspace) make sure to press “Save” in the pop-up window asking whether you want to save the workspace image, which appears when you close the RStudio window. Next time you open the session you can continue from where you stopped.

  • If your start getting errors after restarting your session, just run all the code from all previous study sessions, and then try to run the code from the new session once again. To launch execution of multiple lines of code at once, select them all, hit Ctrl+Enter and wait for a few moments. To run one string of code place a cursor inside that string and press Ctrl+Enter.

  • If you previously ran the code that had a mistake and that overwrote an existing object or its elements (such code string would contain <- or =) that are used in the following lines of code, running those following lines of code may result in output mistakes or error messages. In this case, after you revised the faulty code, it is better to retrace your steps to where the objects involved were originally created, and rerun that code and all code that follows. Alternatively, in some cases it may be feasible to rerun all code from the very beginning.

  • Warning message that may be shown, usually do not stop execution of code. They are intended to draw your attention to some characteristics of code execution. It still may be useful to attention to them.

  • Error messages state that something went wrong and stop execution of code. The text of the error message indicates what type of error that may be: mistake in the code, absence of certain data objects in the workspace, error in a link to a location on your PC, wrong working directory set in earlier steps, failure to install and/or load the function packages. Error message has to be reviewed and the issues resolved, before you can move forward. In some cases updating R and all function packages, or even uninstalling and reinstalling of R and RStudio, may help, as the software is constantly updated by the developers. At the time of publishing of this training module, the current code worked well for the tasks described.

  • Should you need more information on various packages and functions run help(package = “package_name”) for a given package or ?function_name for a given function. If you run into a function that was introduced earlier in a module, you can search in the module web-page by pressing Ctrl+S and entering the name of the function. This way you will find its earlier mentioning and example if use to refresh your recollection.

  • Notice links to footnotes and external web-resources included throughout the training module. Those may contain additional short clarifications or provide additional useful reference material on NTMs, SDGs and their linkages, as well as on R language and functions.

Quizzes and practice

As you go through the code and instructions below, you will be offered a few practice sessions as well as a few quizzes that intend to help you get better grasp of the workings of R and its code. You are not rated on these quizzes. They are only there for you to learn.

1 Preparatory steps

1.1 Start RStudio, set work directory and install code packages

Start your RStudio, and open your new R script by selecting File/New file/R script or by pressing Ctrl+Shift+N. Then save the new working file by pressing File/Save as…, entering the preferred file name and choosing the preferred directory on your computer.

Set your working directory from which your data frames will be imported and to which the resulting files will be exported by using function setwd. This is a very important step, as it will allow you import files from this location without having to indicate the full path to this directory. And RStudio will save all output and backup files into that directory automatically.

Note for PC users: when inputting the path to any directory in RStudio, it is necessary to replace all backslashes \ with forward slashes /, or you will get an error message and code will not run.

setwd("C:/path/to_your/working_directory")

You can check your working directory using getwd.

getwd()

## [1] "C:/path/to_your/working_directory"

Download the data files listed in section Data above into your working directory.

Before proceeding further, it is necessary to load the following code packages that we will use later. Details on each package can be easily found online. Alternatively, use help(package = "package_name"), to review package information in Help tab in the lower left corner of your RStudio window.

library("dplyr")
library("readxl")
library("reshape2")
library("stringr")
library("stringi")
library("tidyr")
library("ggplot2")

If you get an error message indicating that the required code package does not exist, try installing the package by running code install.packages("package_name") first, and only then run library("package_name").

Warning messages are usually not a problem.

1.2 Import the SDG-HS-NTM concordance matrix into RStudio

If you open the SDG-HS-NTM-Concordance[v1.1 Sep 2019]_0.xlsx file, you will see that the concordance matrix is contained in Sheet 2 titled “Concordance Matrix”. Hence, this is the sheet that needs to be imported by using function read_excel. Note that to be able to further manipulate the data frame it is necessary to create an R object, rather than just import the data file. So below we use matching <-, where matching is the name of the new object, and the arrow <- denotes the action of assignment (alternatively, you can use a single = instead of <-).

Also note that apart from the name of the file, which has to include file extension, the command also contains argument sheet=2 indicating sheet 2 of the imported excel file, argument col_names=TRUE indicating that the first row is used as column names, and argument trim_ws=TRUE indicating that all leading and trailing white spaces should be trimmed.

matching <- read_excel("SDG-HS-NTM-Concordance[v1.1 Sep 2019]_0.xlsx", sheet=2, col_names = TRUE, trim_ws = TRUE)

You can preview the elements of the imported data frame by doing the following:

  • by using function names to preview the names of the columns or function head to preview the first few rows (6 rows by default, which you can change by adding an alternative value for the relevant argument).

Note: In the output after calling function head under the column names you can see text within <>. This indicates class of data stored in each column. <dbl> stands for numeric data, and <chr> stands for character strings. Other possible options are for categorical factors, and for logical. Indicating the right class of data may have impact on the results of data processing.

names(matching)
##  [1] "CODE 0/1      (0: A, IND; 1: C,CwK)"
##  [2] "SDG"
##  [3] "Target"
##  [4] "Product description"
##  [5] "List description"
##  [6] "HS"
##  [7] "HS_version"
##  [8] "NTM"
##  [9] "Keywords"
## [10] "neg_Keywords"
## [11] "A/C/CwK/IND"
## [12] "Notes"
head(matching,10)
## # A tibble: 10 x 12
##    `CODE 0/1      ~ SDG   Target `Product descri~ `List descripti~ HS
##               <dbl> <chr> <chr>  <chr>            <chr>            <chr>
##  1                0 SDG_1 NA     Potentially all~ NA               NA
##  2                0 SDG_1 NA     All products an~ NA               NA
##  3                0 SDG_1 NA     All products an~ NA               NA
##  4                0 SDG_2 NA     Agricultural pr~ HS chapters 01-~ HS c~
##  5                0 SDG_2 NA     Technologies, m~ Relevant 6- dig~ Rele~
##  6                0 SDG_2 NA     Fertilizers and~ Relevant 6- dig~ Rele~
##  7                0 SDG_2 Targe~ All intermediat~ See relevant se~ See ~
##  8                0 SDG_2 Targe~ Any product pro~ See relevant se~ See ~
##  9                0 SDG_2 Targe~ Endangered spec~ See relevant se~ See ~
## 10                1 SDG_2 Targe~ (BASIC)   Agric~ [HS-WTO_Aggri H~ 0101~
## # ... with 6 more variables: HS_version <chr>, NTM <chr>, Keywords <chr>,
## #   neg_Keywords <chr>, `A/C/CwK/IND` <chr>, Notes <chr>
  • by clicking on the name of your object in the Environment tab in the upper right section of the RStudio window, which will open a separate tab with the following preview of your data frame.
The preview of the “matching” data frame object.

The preview of the “matching” data frame object.

Quiz 1



Question 1.1: What output is generated by calling function head(data_frame)? Note: data_frame is just a made up name of a data frame object.

A. Preview of data_frame in a separate tab in RStudio.

# Answer A is incorrect. 

B. Preview of the first 10 rows of data_frame.

# Answer B is incorrect. 

C. Preview of the names of the columns of data_frame.

# Answer C is incorrect.

D. Preview of the first 6 rows of data_frame.

# Answer D is correct! 



Question 1.2: Which of the lines of code below will successfully create an object containing a data frame imported from an Excel file’s sheet 1? Note: data.xlsx is just a made up name of a data file.

A. object <- read("data.xlsx", sheet=1, col_names = TRUE)

# Answer A is incorrect. Need to use function `read_excel`.

B. object = read_excel("data.xlsx", sheet=1, col_names = TRUE)

# Answer B is correct! `<-` and `=` are equivalent.

C. object < read_excel("data.xlsx", sheet=1, col_names = TRUE,)

# Answer C is incorrect. There is one redundant `,` within the round brackets.

D. object <- read_excel(data.xlsx, sheet=1, col_names = TRUE)

# Answer D is incorrect. File name has to be specified within quotation marks.

1.3 Clean up the matching data frame

For the purpose of further analysis we need to remove those rows in matching that contain concordance strings that are marked as Ambiguous and Indirect, so that we only preserve those that are Clear and Clear with keywords, as per the indication in A/C/CwK/IND column. We do that by rewriting object matching with itself, excluding the rows that contain strings “A” and “IND” in column A/C/CwK/IND.11

To do that we use code matching <- matching[!(matching$A/C/CwK/IND%in% c("A", "IND")),]. Square brackets []are used to reference or extract a subset from the existing matching data frame, while operator $ in combination with the data frame object’s name is used to reference specific column by its name. Notice, that logical operator ! is used for negation and operator %in% is used to identify, if an element belongs to a certain group of data elements, in our case values “A” and “IND” in column A/C/CwK/IND.

Before we run the code in the chunk below to subset our matching data frame, let’s first practice different ways of data frame subsetting.

Practice subsetting data frames



Let’s practice a bit with some more examples to get better hang of it. All the code mentioned in the following paragraphs is also included in the code chunk below. Note that this code only prints the results of subsetting in the RStudio console. It does not generate any new objects and does not change anything in the existing objects, as we do not use functions <- or =.

matching$NTM or
matching$'NTM' or
matching$"NTM"

These above all do the same thing and call the content of column NTM as a vector of values. Drop quotation marks for simplicity and use them only if the column name contains symbols that may have a function associated with them, just like we did with matching$'A/C/CwK/IND'. Another way to call contents of the column as a vector is to use matching[[8]], where 8 is the number of the column we are calling.

matching[, "NTM"] subsets matching data frame into a smaller data frame containing only one column NTM and all 119 rows. Subsetting matching to contain two columns is achieved with matching[, c(6,8)] or matching[, c("HS", "NTM")].

matching[,8] creates data frame with only one column 8, matching[8,] creates data frame with only one row 8, and matching[8,8] creates data frame with data from only one cell from 8th row and 8th column. To coerce the one-row-one-column data frame to a vector, you need to add [[1]], like so matching[8,8][[1]].12

You can extract the values from a given cell by using matching$NTM[[8]] as a vector.

matching[1:8,2:8] extracts from matching rows from 1 to 8 and columns from 2 to 8. Thus, the value or the function within square brackets before , references the rows by number, name or by certain criteria, and the value after , references the columns by number, name or by certain criteria. Notice, that we preserve ,, even if we do not set forth any criteria for rows, or columns. Technically calling matching[,] is also correct - we extract all rows and columns, i.e. the complete data frame as is.

matching$'A/C/CwK/IND'=="A" checks if any cells in column A/C/CwK/IND contain “A” and essentially creates a vector of logical values TRUE or FALSE. Note that, if the value after == is a character string, we should use quotation marks. If that value is numeric, we drop the quotation marks, as in matching[[1]]==1, which checks if any cells in column 1 of matching contains value 1.

If value in a cell is NA, we cannot use ==NA or =="NA". We should use function is.na instead. For example, is.na(matching$Keywords) checks whether any cell in column Keyword contain value NA. Subsetting with matching[!is.na(matching$Keywords),] will get you all rows of matching data frame, excluding those that have value NA in column Keyword. See more on ! below in this section.

matching[matching$'A/C/CwK/IND'=="A",] extracts from matching only those rows that contain “A” in column A/C/CwK/IND. Alternatively, matching[matching$'A/C/CwK/IND'!="A",] or matching[!matching$'A/C/CwK/IND'=="A",] extract from matching only those rows that do not contain “A” in column A/C/CwK/IND. Note that in these code lines != stands for “not equal to” and !x stands for “not x”.

For more details on various operators see here. We will use most of them in the sections below.

Notice, that we use == for the logical operator of equality, as = is used for assigning values to an object and is equivalent to ``.

If you have more than one criterion for the rows you can go one of the following ways:

  • matching[matching$'A/C/CwK/IND'!="A"& matching$'A/C/CwK/IND'!="IND", ], where we use logical operator & to set forth two criteria for the rows to be extracted. You can set up as many criteria as necessary.

  • matching[!(matching$'A/C/CwK/IND' %in% c("A", "IND")),] does the same thing as the above, but uses operator %in% and c() to indicate that content in the cells in column A/C/CwK/IND should not be the one present in vector c("A", "IND").

All of the code described above is saved in the code chunk below. Paste it into your RStudio and run after importing the SDG-HS-NTM concordance matrix (done in code chunk above). Look at the output in the console. Running this code does not do anything to the objects saved in you environment, as we do not store this output into any of the existing or new objects by using <- or =. Assign the generated output to a test object, so you can examine it in detail by using such functions as name, head, summary, attributes, length, nrow, ncol.

Note that the first 5 lines of code generate vectors of data from the specified column in a data frame, while the rest of them generate a data frames with a given number of rows and columns. Also note, that each opening bracket has to be paired by a closing bracket!

# vectors
matching$NTM
matching$`NTM`
matching$"NTM"
matching$`A/C/CwK/IND`
matching[[8]]

# data frames
matching[, "NTM"]
matching[, c(6,8)]
matching[, c("HS", "NTM")]
matching[,8]
matching[8,]
matching[8,8]
matching[8,8][[1]]
matching$NTM[[8]]
matching[1:8,2:8]
matching[,]
matching$`A/C/CwK/IND`=="A"
matching[[1]]==1
matching[matching$`A/C/CwK/IND`=="A",]
matching[matching$`A/C/CwK/IND`!="A",]
matching[!matching$`A/C/CwK/IND`=="A",]

matching[matching$`A/C/CwK/IND`!="A"&
                     matching$`A/C/CwK/IND`!="IND", ]

matching[!(matching$`A/C/CwK/IND` %in% c("A", "IND")),]

Quiz 2



Question 2.1: What line of code will open documentation on a function? Note: function_name is just a made up name of a function.

A. help(function_name)

# Answer A is incorrect. 

B. ?function_name

# Answer B is correct!

C. help(function = "function_name")

# Answer C is incorrect. 

D. ?'function_name'

# Answer D is incorrect. 



Question 2.2: What line of code creates an object with the content of column Target of matching data frame?

A. target <- matching$Target

# Answer A is correct!

B. <- matching$Target

# Answer B is incorrect. No object is created. 

C. target <- matching%in%Target

# Answer C is incorrect. Should use `matching$Target` to reference content of column __Target__.

D. target <- matching["Target"]

# Answer D is incorrect. Should use `matching$Target` to reference content of column __Target__.



Question 2.3: What line of code subsets a new data frame object from all rows of matching object that have character string “HS4” in column HS_version?

A. HSversion <- matching[matching$HS_version="HS4",]

# Answer A is incorrect! Need to use `==` to indicate equality.

B. HSversion <- matching[matching$HS_version==HS4,]

# Answer B is incorrect! Character strings should be specified within quotation marks.

C. HSversion <- matching[matching$HS_version=="HS4"]

# Answer C is incorrect! Missing `,` within square brackets.

D. HSversion <- matching[matching$"HS_version"=="HS4",]

# Answer D is correct!



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. These objects you will need for further quizzes and practice sessions.

HSversion <- matching[matching$"HS_version"=="HS4",]

practice_matching <- matching

1.3 Clean up the matching data frame (continued)

Now, back to our code, where we rewrite matching with itself, excluding the rows that contain strings “A” and “IND” in column A/C/CwK/IND.

matching <- matching[!(matching$`A/C/CwK/IND` %in% c("A", "IND")),]

In further analysis we will only need columns Target, HS, HS_version, NTM, Keywords, neg_Keywords.13 Therefore, we rewrite matching with a data frame only containing these specific columns.

matching <- matching [, c("SDG", "Target", "HS", "HS_version", "NTM", "Keywords", "neg_Keywords")]

In columns Keywords and neg_Keywords a few cells contain “NA” as a character string, which should be replaced with NA as a value.

As you can see, we use square brackets to select for replacement only those cells in columns Keywords and neg_Keywords that contain a character string “NA”. Notice that we use == as a logical operator of equality rather than =, which is used to assign a value or a set of characteristics to an object.

matching$Keywords[matching$Keywords=="NA"] <- NA
matching$neg_Keywords[matching$neg_Keywords=="NA"] <- NA

We then also can add an additional column containing row numbers for convenience. To do that, we reference column row in object matching by using operator $. Since this column does not yet exist, assigning values to be entered into that column will create a new one with a specified name. We use <- to store values in that new column. We use : to generate a sequence of numbers from 1 till number of rows in matching data frame, which is determined with the use of function nrow.

matching$row <- 1:nrow(matching)

Now we can preview the resulting matching data frame and check its various attributes to see whether we are well on track.

names(matching)  # names of the columns
## [1] "SDG"          "Target"       "HS"           "HS_version"
## [5] "NTM"          "Keywords"     "neg_Keywords" "row"
head(matching)  # first 6 rows
## # A tibble: 6 x 8
##   SDG   Target    HS       HS_version NTM     Keywords  neg_Keywords    row
##   <chr> <chr>     <chr>    <chr>      <chr>   <chr>     <chr>         <int>
## 1 SDG_2 Target 2~ 010110;~ HS3        A1; A5~ pest; pe~ pesticide; M~     1
## 2 SDG_2 Target 2~ 0101; 0~ HS3        A1; A5~ pest; pe~ pesticide; M~     2
## 3 SDG_3 Target 3~ 263628;~ HS3        N; B31~ !TRIPS!;~ <NA>              3
## 4 SDG_3 Target 3~ 300220;~ HS3        B14; B~ medicine~ <NA>              4
## 5 SDG_3 Target 3~ 263628;~ HS5        B31; B~ generic;~ <NA>              5
## 6 SDG_3 Target 3~ 121130;~ HS5        B11; B~ dangerou~ <NA>              6
matching$row    # values in column 'row'
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
nrow(matching)  # number of rows
## [1] 42
ncol(matching)  # number of columns
## [1] 8
row.names(matching)   # names of rows
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
attributes(matching)  # names of columns, names of rows and class of object
## $names
## [1] "SDG"          "Target"       "HS"           "HS_version"
## [5] "NTM"          "Keywords"     "neg_Keywords" "row"
##
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
##
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

Quiz 3



Question 3.1: Now you want use practice_matching data frame, which is the copy of the original imported matching data frame, to create a new data frame healthSDG, which contains only the rows for SDG 3. Which line of code below will achieve that?

A. healthSDG <- practice_matching[practice_matching$SDG=="SDG_3"]

# Answer A is incorrect. `,` is missing within square brackets.

B. "healthSDG" <- practice_matching[practice_matching$SDG=="SDG_3", ]

# Answer B is incorrect. Redundant quotation marks around the name of a newly created object.

C. healthSDG <- practice_matching[practice_matching$SDG=="SDG_3", ]

# Answer C is correct.

D. healthSDG <- practice_matching[practice_matching$SDG="SDG_3", ]

# Answer D is incorrect. Should use `==` to indicate equality.



Question 3.2: In healthSDG data frame that you created above you want to add an additional column subject, into the cells of which you want to save character string “Good health and wellbeing”. Choose the correct and the most efficient (shortest) line of code from options below?

A. healthSDG$subject <- "Good health and wellbeing"

# Answer A is correct.

B. healthSDG$subject[healthSDG$SDG=="SDG_3"] <- "Good health and wellbeing"

# Answer B is incorrect. `[healthSDG$SDG=="SDG_3"]` is redundant, as the current version of object `healthSDG` only contains concordance strings for SDG 3.

C. healthSDG$subject[healthSDG$SDG==SDG_3] <- "Good health and wellbeing"

# Answer C is incorrect. `[healthSDG$SDG==SDG_3]` is both redundant and wrong. Character string `SDG_3` should be within quotation marks. Current version of object `healthSDG` only contains concordance strings for SDG 3.

D. healthSDG$subject[SDG=="SDG_3"] <- "Good health and wellbeing"

# Answer D is incorrect. `[SDG=="SDG_3"]` is both redundant and wrong. `healthSDG$SDG` should be used to reference column __SDG__ of `healthSDG`. Current version of object `healthSDG` only contains concordance strings for SDG 3.



Question 3.3: In healthSDG data frame you want to rewrite cells in column A/C/CwK/IND that contain “A” and “IND” with character string “omit”. Which line of code below would achieve that?

A. healthSDG$HS_version[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- "omit"

# Answer A is incorrect. `healthSDG$HS_version` should be replaced with `healthSDG$"A/C/CwK/IND"`.

B. healthSDG$"A/C/CwK/IND"[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- omit

# Answer B is incorrect. Character string "omit" should be indicated within quotation marks.

C. healthSDG$"A/C/CwK/IND"[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- "omit"

# Answer C is correct.

D. healthSDG$"A/C/CwK/IND"[healthSDG$"A/C/CwK/IND" %in% c(A, IND)] <- "omit"

# Answer D is incorrect. Character strings "A" and "IND" should be indicated within quotation marks.



Question 3.4: Now you want to drop rows that are marked with string “omit” in omit column. Which line of code below would achieve that?

A. healthSDG <- healthSDG[healthSDG$"A/C/CwK/IND"=="omit", ]

# Answer A is incorrect. `!` is missing within square brackets.

B. healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"=="omit", ]

# Answer B is correct. 

C. healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"!="omit", ]

# Answer C is incorrect. There are two `!` within square brackets. Only one should be used.

D. healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"=="omit" ]

# Answer D is incorrect. There is `,` missing within square brackets.



Question 3.5: In healthSDG data frame you want to add an additional column row containing row numbers. Which line of code below would achieve that?

A. healthSDG$row <- 1:nrow(healthSDG)

# Answer A is correct.

B. healthSDG$row <- 1:ncol(healthSDG)

# Answer B is incorrect. Need to use function `nrow`, not `ncol`.

C. $row <- 1:nrow(healthSDG)

# Answer C is incorrect. Need to use `healthSDG$row` to indicate column __row__ of `healthSDG`.

D. healthSDG$row <- 1-nrow(healthSDG)

# Answer D is incorrect. Need to use `:` to indicate a range.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these updated objects you will need for further quizzes and practice sessions.

healthSDG <- practice_matching[practice_matching$SDG=="SDG_3", ]

healthSDG$subject <- "Good health and wellbeing"

healthSDG$`A/C/CwK/IND`[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- "omit"

healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"=="omit", ]

healthSDG$row <- 1:nrow(healthSDG)

1.3 Clean up the matching data frame (continued)

Some strings in cells in columns NTM, HS, Keywords, neg_Keywords, and Target contain an extra “;” at their end. For example, see the first cell in the NTM column currently containing the string “A1; A5; A8; C1; C3; P6;”. The same is true for values in row 1 of columns HS, Keywords, and neg_Keywords.

We will use a for loop function to go through each cell in relevant columns one by one to check whether there is an extra semicolon present and to remove it. The for (i in 1:nrow(matching) line below sets the sequence from which to derive index i indicating a specific row in a given column.

The loop then does the following for NTM column:

  1. creates a temporary object temp that contains a string in the ith cell in NTM column
  2. trims all leading and trailing white spaces with function trimws
  3. creates a temporary object last_letter containing only the last symbol of the character string stored in temp by using function substr to extract a substring with given parameters and function nchar to determine the number of the last character in a given character string
  4. uses a conditional if...else statement, which reads as if the symbol in last_letter object is “;” then overwrite temp with itself, excluding the last symbol. Notice, that we use nchar(temp)-1 to exclude the last character in a string stored in temp
  5. overwrites the ith cell in NTM column with the modified string from temp

Open Help file for substr function to see that it takes the following three arguments: substr(x, start, stop), where x is text or a character vector, start is first element to be extracted/replaced, and stop is the last element to be extracted/replaced. start and stop can be specified as numbers or as functions that generate numbers based on certain criteria. In our case number for stop is generated with function nchar(temp) that gives the total number of characters in a string in temp object. nchar(temp)-1 gives the total number of characters in a string in temp object minus 1.

The for loop then goes on to do the same for , HS, Keywords, neg_Keywords, and Target columns.

Note that for columns Keywords and neg_Keywords we introduce additional conditional if...else statement, which reads as if temp is not empty, then run the code that follows. This is due to the fact that some cells in these two columns are empty, and we don’t need to change them. To check, whether temp is empty, we use function is.na.

With the last line of code using function rm we remove all redundant objects.

for (i in 1:nrow(matching)){
  # for NTM
  temp <- matching$NTM[i]
  temp <- trimws(temp)
  last_letter <- substr(temp, nchar(temp), nchar(temp))
  if(last_letter==";"){
    temp <- substr(temp, 1, nchar(temp)-1)
  }
  matching$NTM[i] <- temp

  # for HS
  temp <- matching$HS[i]
  temp <- trimws(temp)
  last_letter <- substr(temp, nchar(temp), nchar(temp))
  if (last_letter==";"){
    temp <- substr(temp, 1, nchar(temp)-1)
  }
  matching$HS[i] <- temp

  # for KWs
  temp <- matching$Keywords[i]
  ## do the same as for NTM and HS columns, only if "temp" contains a string
  if (!is.na(temp)){
    temp <- trimws(temp)
    last_letter <- substr(temp, nchar(temp), nchar(temp))
    if(last_letter==";"){
      temp <- substr(temp, 1, nchar(temp)-1)
    }
    matching$Keywords[i] <- temp
  }

  # for neg KWs
  temp <- matching$neg_Keywords[i]
  if (!is.na(temp)){
    temp <- trimws(temp)
    last_letter <- substr(temp, nchar(temp), nchar(temp))
    if (last_letter==";"){
      temp <- substr(temp, 1, nchar(temp)-1)
    }
    matching$neg_Keywords[i] <- temp
  }

  # for Targets
  temp <- matching$Target[i]
  temp <- trimws(temp)
  last_letter <- substr(temp, nchar(temp), nchar(temp))
    if(last_letter==";"){
      temp <- substr(temp, 1, nchar(temp)-1)
    }
    matching$Target[i] <- temp
}

# remove temporary objects
rm(temp, last_letter, i)

Quiz 4



Question 4.1: You want to print the first 3 characters from a string in a cell in row 8 of column SDG in practice_matching data frame. Which line of code below would achieve that?

A. substr(practice_matching[8, SDG], 1, 3)

# Answer A is incorrect. Name of a column should be indicated within quotation marks.

B. substr(practice_matching[8, "SDG"], 1, 3)

# Answer B is correct.

C. substr(practice_matching[8, SDG], 1:3)

# Answer C is incorrect. Name of a column should be indicated within quotation marks. Start and stop character numbers should be indicated as separate arguments and not as a range.

D. substr(practice_matching[8, "SDG"], 1:3)

# Answer D is incorrect. Start and stop character numbers should be indicated as separate arguments and not as a range.



Question 4.2: In practice_matching data frame you now want to overwrite strings in SDG column with only numbers of these SDGs. Which line of code below would achieve that?

A. substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))

# Answer A is incorrect. Need to include `practice_matching$SDG <-` to indicate, what is being overwritten.

B. practice_matching$SDG <- substr(practice_matching$SDG, 5, nchar(SDG))

# Answer B is incorrect. Should use `nchar(practice_matching$SDG)` as an argument for stop character number.

C. practice_matching$SDG < substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))

# Answer C is incorrect. Should use `<-` and not `<`.

D. practice_matching$SDG <- substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))

# Answer D is correct.



Question 4.3: Take a look at HSversion data frame that we created earlier. You now want to use a simple for loop to overwrite “HS4” in column HS_version with “2012”. Which line of code below would achieve that?

A. for(i in 1-nrow(HSversion)){HSversion$HS_version[[i]] <-2012}

# Answer A is incorrect. Should use `:` to indicate a range.

B. for(i %in% 1:nrow(HSversion)){HSversion$HS_version[[i]] <-2012}

# Answer B is incorrect. Should use `in` and not `%in%` to indicate a sequence for `for loop`.

C. for(i in 1:nrow(HSversion)){HSversion$HS_version[[i]] <-2012}

# Answer C is correct.

D. for(i in 1:nrow(HSversion)){HS_version[[i]] <-2012}

# Answer D is incorrect. Should use `HSversion$HS_version` to reference column of an object.



Question 4.4: Take a look at HSversion data frame again. You now want to use a for loop with nested if..else statement to overwrite character string “NA” in column Keywords with value NA. Which line of code below would achieve that?

A. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]=="NA"){
HSversion$Keywords[[i]]<-NA
}
}

# Answer A is correct.

B. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]==NA){
HSversion$Keywords[[i]]<-NA
}
}

# Answer B is incorrect. Where `NA` is a character string, it should be indicated within quotation marks.

C. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]=="NA"){
HSversion$Keywords[[i]]<-NA
}

# Answer C is incorrect. One closing curly bracket is missing.

D. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]="NA"){
HSversion$Keywords[[i]]=NA
}
}

# Answer D is incorrect. `==` should be used for equality. `=` can be used for assigning value instead of `<-`.



Question 4.5: How many cells in column Keywords in HSversion object are empty? Use is.na and sum functions to check.

A. 4

# Answer A is incorrect.

B. 10

# Answer B is incorrect.

C. 15

# Answer C is incorrect.

D. 16

# Answer D is correct.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

substr(practice_matching[8, "SDG"], 1, 3)

practice_matching$SDG <- substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))

for(i in 1:nrow(HSversion)){
  HSversion$HS_version[[i]] <-2012
}

for(i in 1:nrow(HSversion)){
  if(HSversion$Keywords[[i]]=="NA"){
    HSversion$Keywords[[i]]<-NA
  }
}

sum(is.na(HSversion$Keywords))

1.3 Clean up the matching data frame (continued)

Recalling the string from the first cell in the NTM column now gives us “A1; A5; A8; C1; C3; P6”. The semicolon has been removed!

Now we can expand matching data frame into a long format by column Target by using function separate_rows, trim any white spaces that may have been introduced into the character strings in column Target and renumber the rows in column row.

If a variable contains observations with multiple delimited values, separate_rows function from tidyr package separates the values and places each one in its own row.

Open Help file for separate_rows function to see that it takes the following basic arguments: separate_rows(data, ..., sep = "[^[:alnum:].]+"), where data is name of a data frame object, ... is for selection of columns to be separated, and sep = "[^[:alnum:].]+" is used to indicate which symbol in a string should be treated as a separator delimiting collapsed values, in our case ";".

matching <- separate_rows(matching, Target, sep = ";")
matching$Target <- trimws(matching$Target)
matching$row <- 1:nrow(matching)

Quiz 5



Question 5.1: Use separate_rows function to split the current version of practice_matching data frame by column Target with ; treated as a separator, use trimws to remove any leading or trailing white spaces in column Target, and then use nrow function to determine number of rows in the resulting data frame and to choose the right answer below.

A. 231

# Answer A is incorrect.

B. 211

# Answer B is correct. The correct code is below for reference.

C. 121

# Answer C is incorrect.

D. 196

# Answer D is incorrect.



Question 5.2: Since in practice_matching we did not remove trailing ; in column Target, separate_rows function generated extra rows with empty cells in column Target (i.e. space after ; is treated as an independent string, even though it is empty). Use code to remove all rows for which cell in Target column is empty, and then use nrow to determine number of rows in the resulting data frame and to choose the right answer below. Hint: Use =="" to describe empty string.

A. 201

# Answer A is incorrect.

B. 196

# Answer B is incorrect.

C. 211

# Answer C is incorrect.

D. 204

# Answer D is correct. The correct code is below for reference.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these updated objects you will need for further quizzes and practice sessions.

# Question 1
practice_matching <- separate_rows(practice_matching, Target, sep = ";")
practice_matching$Target <- trimws(practice_matching$Target)
nrow(practice_matching)

# Question 2
practice_matching <- practice_matching[!practice_matching$Target=="",]
nrow(practice_matching)

1.4 Import and arrange auxiliary data

In the HS column of matching some cells contain “All HS”, indicating that all HS codes may be targeted by relevant regulations - for these we will need to include all 6-digit HS codes. Additionally, some cells contain 2- or 4-digit HS codes. For these we will need to include all underlying 6-digit codes. And lastly, as per the values in HS_version column, some HS codes are from different HS versions (2007, 2012 and 2017). For these we will need to convert HS codes to 2012 version, as this version is the one that is used in the TRAINS database of NTMs.

To do this we will need to use data frames generated from the UN Comtrade Commodity Classification table and from the table of concordance between different versions of the Harmonized Commodity Description and Coding System, referenced Data section above.

First, we create object hs_codes into which we read the .csv file containing the UN Comtrade Commodity Classification.

hs_codes <- read.csv("HSCodesAll.csv", stringsAsFactors = FALSE)

If we look at the unique values in column Classification by using command unique(), we will see the following values BE, H0, H1, H2, H3, H4, H5, S1, S2, S3, S4, representing different types and versions of commodity classifications. We are only interested in HS versions of 2007 (H3), 2012 (H4) and 2017 (H5), as these are used in matching data frame.

So now we create three objects that only include rows for which value in column Classification is “H4”, “H5” or“H3”and the value in column Level is “6” (6-digit). And we are only including columns Code and Description. In the last three lines of code, we rename the columns in the resulting new objects for convenience.

hs_codes12 <- hs_codes[hs_codes$Classification=="H4"&hs_codes$Level==6, c("Code", "Description")]
hs_codes17 <- hs_codes[hs_codes$Classification=="H5"&hs_codes$Level==6, c("Code", "Description")]
hs_codes07 <- hs_codes[hs_codes$Classification=="H3"&hs_codes$Level==6, c("Code", "Description")]

names(hs_codes12) <- c("hs6", "description")
names(hs_codes17) <- c("hs6", "description")
names(hs_codes07) <- c("hs6", "description")

Then, in each object we create additional two columns with 2- and 4-digit HS codes by substringing the character stings in column hs6 with the use of function substr. Numbers 1 and 2 or 1 and 4 indicate the first and the last characters to be extracted.

hs_codes12$hs4 <- substr(hs_codes12$hs6, 1, 4)
hs_codes12$hs2 <- substr(hs_codes12$hs6, 1, 2)

hs_codes17$hs4 <- substr(hs_codes17$hs6, 1, 4)
hs_codes17$hs2 <- substr(hs_codes17$hs6, 1, 2)

hs_codes07$hs4 <- substr(hs_codes07$hs6, 1, 4)
hs_codes07$hs2 <- substr(hs_codes07$hs6, 1, 2)

Lastly, we convert values in columns hs6, hs4 and hs2 in all three objects from characters into numeric values by using function as.numeric. We need to do this to keep it consistent with the HS concordance table that has this data recorded as numeric values (omitting the first zero of the string), as we will see below.

hs_codes12$hs2 <- as.numeric(hs_codes12$hs2)
hs_codes12$hs4 <- as.numeric(hs_codes12$hs4)
hs_codes12$hs6 <- as.numeric(hs_codes12$hs6)

hs_codes17$hs2 <- as.numeric(hs_codes17$hs2)
hs_codes17$hs4 <- as.numeric(hs_codes17$hs4)
hs_codes17$hs6 <- as.numeric(hs_codes17$hs6)

hs_codes07$hs2 <- as.numeric(hs_codes07$hs2)
hs_codes07$hs4 <- as.numeric(hs_codes07$hs4)
hs_codes07$hs6 <- as.numeric(hs_codes07$hs6)

We can now see what new objects look like. For example, object hs_codes12 below. Notice, that we use function head in which we indicate that we want to see first 10 rows instead of the default 6 rows. We also use function class to check the class of data in hs6 column.

head(hs_codes12,10)
##         hs6                                          description hs4 hs2
## 25830 10121             Horses; live, pure-bred breeding animals 101   1
## 25831 10129  Horses; live, other than pure-bred breeding animals 101   1
## 25832 10130                                          Asses; live 101   1
## 25833 10190                              Mules and hinnies; live 101   1
## 25835 10221             Cattle; live, pure-bred breeding animals 102   1
## 25836 10229  Cattle; live, other than pure-bred breeding animals 102   1
## 25837 10231            Buffalo; live, pure-bred breeding animals 102   1
## 25838 10239 Buffalo; live, other than pure-bred breeding animals 102   1
## 25839 10290  Bovine animals; live, other than cattle and buffalo 102   1
## 25841 10310              Swine; live, pure-bred breeding animals 103   1
class(hs_codes12$hs6)
## [1] "numeric"

Quiz 6



Question 6.1: In HS_version you want to check the number of unique character strings stored in column SDG. Use unique function nested within length function to achieve that. What number did you get?

A. 9

# Answer A is incorrect.

B. 11

# Answer B is incorrect.

C. 15

# Answer C is correct.

# The correct code is below for reference.

D. 17

# Answer D is incorrect.



Question 6.2: In HS_version data frame you want to replace column names by their numbers. Which code will achieve that?

A. All

# Answer A is correct. All lines of code are correct and achieve the same result. However, it is always better to use shorter code. 

B. names(HSversion) <- 1:ncol(HSversion)

# Answer B is not entirely correct.  

C. names(HSversion) <- 1:ncol(HSversion[, 1:12])

# Answer C is not entirely correct.   

D. names(HSversion)<- 1:12

# Answer D is not entirely correct.  



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

# Question 1
length(unique(HSversion$SDG))

# Question 2
names(HSversion) <- 1:ncol(HSversion)
names(HSversion)<- 1:12
names(HSversion) <- 1:ncol(HSversion[, 1:12])

1.4 Import and arrange auxiliary data (continued)

Now let’s create object hs_concordance into which we read .csv file containing concordance table for all existing versions of the HS from 1992 till 2017 and then take a look at it.

hs_concordance <- read.csv(file = "HS_concordance_full.csv", stringsAsFactors = FALSE)
head(hs_concordance)
##    HS92  HS96  HS02  HS07  HS12  HS17
## 1 10111 10111 10110 10110 10121 10121
## 2 10111 10111 10110 10110 10130 10130
## 3 10120 10120 10110 10110 10121 10121
## 4 10120 10120 10110 10110 10130 10130
## 5 10119 10119 10190 10190 10129 10129
## 6 10119 10119 10190 10190 10130 10130

Since the matching data frame only contains 2007, 2012 and 2017 versions of the HS, we only need one concordance between 2007 and 2012, and the second concordance between 2017 and 2012. Let’s create two objects containing only two columns of relevant concordances. We can do this by subsetting the hs_concordance data frame as below.

hs_concordance0712 <- hs_concordance[, c("HS07", "HS12")]
hs_concordance1712 <- hs_concordance[, c("HS17", "HS12")]

head(hs_concordance0712)
##    HS07  HS12
## 1 10110 10121
## 2 10110 10130
## 3 10110 10121
## 4 10110 10130
## 5 10190 10129
## 6 10190 10130

The matching data frame in its NTM column contains NTM classification codes at 1-, 2-, 3- and 4-digit levels. To account for possible variation in assignment of NTM codes to NTM entries in TRAINS database, any 1-, 2- or 3-digit codes need to be expanded to include themselves and all further disaggregated codes. To be able to do that, we need to import a data frame that contains the codes of the International Classification of NTMs referenced above in section Data. Use function head to see what it looks like.

MAST_codes <- read.csv("MAST_all.csv", stringsAsFactors = FALSE)
head(MAST_codes)
##   Code level level1 level2 level3 level4
## 1    A     1      A
## 2   A1     2      A     A1
## 3  A11     3      A     A1    A11
## 4  A12     3      A     A1    A12
## 5  A13     3      A     A1    A13
## 6  A14     3      A     A1    A14

Another data set that we will need at a later stage is “countries_UNCTAD_iso3.csv”, which contains full names of the countries (or economies), ISO3 country codes, a column marking whether or not a given country has NTM data available in TRAINS14, and columns marking whether or not a given country/economy belongs to any classification or geographical grouping (e.g. high income countries, low income countries, land-locked developing countries, South-East Asia, ESCAP, SPECA, etc.). Note, that this data set was prepared by ESCAP in May 2019, and it may need updating with time.

To clean up the data frame a bit, we keep only rows that are marked as having NTM data in TRAINS database (contain value 1 in column has_ntms) and only columns country, iso3, has_ntms and escap. If for your analysis you may need any other additional columns from “countries_UNCTAD_iso3.csv”, make sure to include names of those columns.

countries <- read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)
countries <- countries[countries$has_ntms==1, c("country", "iso3", "has_ntms", "escap")]

Now we can take a look at countries data frame.

head(countries, 3)
##                country iso3 has_ntms escap
## 1          Afghanistan  AFG        1     1
## 4              Algeria  DZA        1     0
## 11 Antigua and Barbuda  ATG        1     0

Quiz 7



Question 7.1: You want to create a new data frame object country_inc by importing file “countries_UNCTAD_iso3.csv”. Which line of code below will achieve that?

A. countries_inc <- read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE

# Answer A is incorrect. Closing bracket is missing. 

B. countries_inc <- read.csv(file = "countries_UNCTAD_iso3", stringsAsFactors = FALSE)

# Answer B is incorrect. File extension is missing from the file name.

C. countries_inc = read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)

# Answer C is correct.

D. countries_inc <- read_excel(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)

# Answer D is incorrect. `read.csv` should be used to import .csv file.



Question 7.2: Now subset country_inc to only include the rows that have 1 in column has_ntms and only keep columns country, iso3, lower_middle_income, upper_middle_income, low_income, high_income. How many rows will the subset data frame country_inc have?

A. 86

# Answer A is incorrect.

B. 74

# Answer B is incorrect.

C. 88

# Answer C is correct.

# The correct code is below for reference.

D. 77

# Answer D is incorrect.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

# Question 1
countries_inc = read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)


# Question 2
countries_inc <- countries_inc[countries_inc$has_ntms==1, c("country", "iso3", "lower_middle_income", "upper_middle_income", "low_income", "high_income")]
nrow(countries_inc)

1.5 Create SDG-HS-NTM matching list

Now we can use the imported auxiliary data frames to finalize preparation of data from matching data frame. We will create a list object using matching data frame that will be utilized to analyze NTM entries from the TRAINS database of NTMs.

First, we use function paste to overwrite those cells in column HS of matching data frame that contain character string “All HS” with all of the strings contained in column hs6 in hs_codes12 data frame. We use argument collapse = ";" to combine all the separate character strings from column hs6 into one long character string, where the 6-digit HS codes are separated from each other by a semicolon.

matching$HS[matching$HS=="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )  

Now let’s create a list object matching.list using matching data frame. It will contain the same number of list items as the number of rows in the matching data frame. The columns will transform into sub-items within each of the list items, containing the strings from the cells in the corresponding rows and columns of the matching data frame.

To do that we use function split to split a data frame into a list of items (based on the number of rows in it), which in turn contain the list of data elements that were stored in each of the rows of that data frame.

It’s first argument contains name of an object being split, and second argument defines the grouping and uses nested functions seq and nrow to generate a sequence of list items based on the number of rows in matching.

nrow by itself generates number of rows in matching data frame, which is 67. seq creates a sequence of 67 items from 1 to 67, i.e. 67 distinct items of the list object that is being created. Essentially, seq(nrow(matching)) does the same thing as 1:nrow(matching). Run these two lines of code to see what happens, and then run the code in the chunk below.

matching.list <- split(matching, seq(nrow(matching)))

Picture below shows what item 2 of the new matching.list looks like. You can see that a list item contains a tibble with 1 row and 8 columns extracted from matching data frame. You can use print function to print list item 2 of matching.list to see just that.

print(matching.list[[2]])
## # A tibble: 1 x 8
##   SDG   Target  HS        HS_version NTM   Keywords    neg_Keywords     row
##   <chr> <chr>   <chr>     <chr>      <chr> <chr>       <chr>          <int>
## 1 SDG_2 Target~ 0101; 01~ HS3        A1; ~ pest; pest~ pesticide; MR~     2
The preview of item 2 of the “matching.list”.

The preview of item 2 of the “matching.list”.

Now below we include a for loop that will do the following things:

  • transform content of the cells that have a string of characters separated by a semicolon into a list vector of separate character strings or numeric values. This is true for such sub-items as Target, HS, NTM, Keywords, neg_Keywords

  • For sub-item HS :

  • codes longer than 6 digits are truncated to the first 6-digit;

  • 2- ad 4-digit codes are expanded to include all of the underlying 6-digit codes of the same HS version that is indicated in HS_version

  • HS codes of the HS versions other than 2012 - the default version of the TRAINS database - are converted to 2012 version

  • For sub-item NTM, all 1-, 2- or 3-digit NTM codes are expanded to include themselves and all underlying NTM codes at further levels of disaggregation

  • For sub-items Keywords and neg_Keywords all substrings that do not contain “!”, are formatted to lower case. Those that contain “!” are kept as all caps. This is done, as keyword matching with the data in the TRAINS database is generally case insensitive, with the exception of certain abbreviations that may match other words or their parts. Example: Letter combination “ODS”, which as an abbreviation that stands for “ozone depleting substances”, may be found in words “goods”, “rods”, “foods”, etc.

Notice again, that round brackets after for are used to define the sequence for which the loop is run. Round brackets are followed by curly brackets, within which all recycled functions are included. If the loop is very long, the closing curly bracket may be separated from the opening one by many lines of code. It is important not to lose it, otherwise the loop will not run.

The for loop below contains a number of nested conditional if...else statements and another for loop. As it is quite long and complex, we break it up below into its components and examine them step by step for sub-item 2 only. To do that we put a hash sign is front of the line of the for loop, which sets the sequence for which to run the loop (number of rows in matching data frame, from which matching.list was created) and in front of the closing “}” at the very end of the loop. Instead we set i equal to 2 to reference sub-item 2 of matching.list only. Later, to run a complete loop, you will need to remove the hash signs or just copy the full loop from the complete code script provided in “complete_matching_code.R” file (see Data section).

Quiz 8



Question 8.1: Look at healthSDG data frame that we created earlier. For Target 3.9 “All HS” codes are indicated in HS column. You need to replace these with all of the 6-digit HS codes from hs_codes12 data frame. Which line of code below will achieve that?

A. healthSDG$HS[healthSDG$HS=='All HS'] <- paste(hs_codes12$hs6, collapse = ";" )

# Answer A is correct.

B. healthSDG[healthSDG$HS=="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )

# Answer B is incorrect. Need to use `healthSDG$HS` to reference column __HS__ of `healthSDG` object.

C. healthSDG$HS[healthSDG$HS="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )

# Answer C is incorrect. `==` should be used to indicate equality.

D. healthSDG$HS[healthSDG$HS=="All HS"] <- paste(hs_codes12$hs2, collapse = ";" )

# Answer D is incorrect. We are pasting 6-digit HS codes from column __hs6__, and not 2-digit HS codes from column __hs2__.



Question 8.2: Overwrite new healthSDG object from the current iteration of matching data frame that only contains rows for SDG 3. How many rows does this new data frame have?

A. 22

# Answer A is incorrect.

B. 27

# Answer B is correct.

# The correct code is below for reference.

C. 25

# Answer C is incorrect.

D. 29

# Answer D is incorrect.



Question 8.3: Now you want to create a list object from healthSDG data frame. Which line of code below will achieve that?

A. health.list <- split(healthSDG, seq(nrow(healthSDG))

# Answer A is incorrect. One closing bracket is missing. 

B. health.list <- split(healthSDG, seq(ncol(healthSDG)))

# Answer B is incorrect. Nested function `nrow` should be used instead of `ncol`.

C. health.list <- separate_rows(healthSDG, seq(nrow(healthSDG))

# Answer C is incorrect. Function `split` should be used here, and not `separate_rows`.

D. health.list <- split(healthSDG, seq(nrow(healthSDG)))

# Answer D is correct.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

# Question 1
healthSDG$HS[healthSDG$HS=="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )

# Question 2
healthSDG <- matching[matching$SDG=="SDG_3", ]
nrow(healthSDG)

# Question 3
health.list <- split(healthSDG, seq(nrow(healthSDG)))

1.6 Truncate long HS codes to 6-digit codes

First, we create an object x by copying a long string from item 2 sub-item HS and splitting it into a vector of character strings separated by “;”. To do this we use function strsplit and indicate ";" as a separator.

Then, we rewrite x by using nested functions to consecutively trim leading and trailing white spaces, substring only the first 6 symbols of the character strings in the vector and turn the character strings into numeric values. And lastly we remove any duplicate codes in object x by using unique function.

Note the following:

  • If we use strsplit(matching.list[[i]]$HS, ";") without adding [[1]], this generates a list object with one component that is a vector of split character strings. To extract split character strings as an atomic vector only, we need to add [[1]], which coerces a list vector to a simpler data object, that is an atomic vector. Run two versions of the code and inspect the generated object to see the difference between them.

  • After running x <- strsplit(matching.list[[i]]$HS, ";")[[1]] notice that some of the HS codes within quotation marks contain a leading white space. Trimming white spaces is crucial, as they are counted just like other characters. If we do not trim them, extracting the first 6 characters will result in extracting a white space and 5 first characters.

  • We convert HS codes to numeric values, which removes the leading “0”, as we will need to use HS concordance data frames to convert HS codes to 2012 version, and those concordance data frames store HS codes in numeric format.

  • Using cat("--------- ", i, " HS --------- \n"), which will simply print --------- i ---------, where i is the number from the loop sequence, will make it easier to view the progress of loop execution in the Console in the bottom left section of RStudio window.

i <- 2
#for (i in 1:nrow(matching)){
  cat("--------- ", i, " --------- \n")
## ---------  2  ---------
  x <- strsplit(matching.list[[i]]$HS, ";")[[1]]

  x <- as.numeric(substr(trimws(x), 1, 6))
  x <- unique(x)

Now x contains 1130 strings containing unique HS codes that are not longer than 6-digit. Of them 1130 are 4-digit HS codes. We need to expand them to 6-digit HS codes, while taking into account HS version marked in sub-item HS_version, which in this case is HS 2007.

So first we need to check the version of HS in which HS codes are recorded and treat them accordingly. Here we use a series of conditional if...else statements. The first one reads as if the value in sub-item HS_version of item 2 of matching.list is equal to “HS4” then check the following condition (this leads to the next if...else statement within), otherwise skip to the next step.

Since HS codes in item 2 are recorded in version HS 2007(HS3), we are directed to the next step. If HS version was 2012, then the code chunk below would consecutively check, whether there are any codes that are 4 or 2 digits long, and, if so, it would then use earlier created object hs_codes12 to expand them to the underlying 6-digit codes. We will examine these actions in more detail in the steps below, as they recur.

  # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
  if(matching.list[[i]]$HS_version=="HS4"){
      if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

  # 2. test for presence of 2-digit codes and expand them to 6-digit using the right HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }
  }

Practice subsetting a vector of numeric values



Before we go to the next chunk of our for loop, let’s practice subsetting a vector of numeric values. As was the case with data frames above, we can do this by using square brackets to describe criteria, which will be used to extract or reference elements in a vector. Paste the code below into your RStudio console to view the output.

First we create two atomic vector objects y and z by taking the long character strings from practice_matching from cells in column HS rows 87 and 147. We use nested functions to consecutively split one strings into several (“;” is treated as a separator), trim white spaces and convert the strings to numeric. For convenience, let’s join them into one vector yz containing all of the strings from y and z, and print it to view.

y <- as.numeric(trimws(strsplit(practice_matching$HS[[87]], ";")[[1]]))
z <- as.numeric(trimws(strsplit(practice_matching$HS[[147]], ";")[[1]]))
yz <- c(y,z)
print(yz)
##   [1]   8401   8402   8403   8404   8405   8406   8407   8408   8409   8410
##  [11]   8411   8412   8413   8414   8415   8416   8417   8418   8419   8420
##  [21]   8421   8422   8423   8424   8425   8426   8427   8428   8429   8430
##  [31]   8431   8432   8433   8434   8435   8436   8437   8438   8439   8440
##  [41]   8441   8442   8443   8444   8445   8446   8447   8448   8449   8450
##  [51]   8451   8452   8453   8454   8455   8456   8457   8458   8459   8460
##  [61]   8461   8462   8463   8464   8465   8466   8467   8468   8469   8470
##  [71]   8471   8472   8473   8474   8475   8476   8477   8478   8479   8480
##  [81]   8481   8482   8483   8484   8486   8487   8501   8502   8503   8504
##  [91]   8505   8506   8507   8508   8509   8510   8511   8512   8513   8514
## [101]   8515   8516   8517   8518   8519   8521   8522   8523   8525   8526
## [111]   8527   8528   8529   8530   8531   8532   8533   8534   8535   8536
## [121]   8537   8538   8539   8540   8541   8542   8543   8544   8545   8546
## [131]   8547   8548 392210 392220 392290 691010 691090 732410 732421 732429
## [141] 732490 381300     84     85     86     87     88     89

Now let’s use subsetting with square brackets to check for the strings that are 2-, 4-, 6- and more than 6 digits. Since our HS codes are all numeric values and since the leading zeros of HS codes got dropped, we can use the following criteria:

Number of digits in HS code Likely numbers in our HS code vector Criteria for selection to be used in R
2 digits 1 to 99 [yz<=99]
4 digits 100 to 9999 [yz>99&yz<=9999]
6 digits 10000 to 999999 [yz>9999&yz<=999999]
more than 6 digits 1000000 and above [yz>999999]
yz[yz<=99]
## [1] 84 85 86 87 88 89
yz[yz>99&yz<=9999]
##   [1] 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 8414
##  [15] 8415 8416 8417 8418 8419 8420 8421 8422 8423 8424 8425 8426 8427 8428
##  [29] 8429 8430 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440 8441 8442
##  [43] 8443 8444 8445 8446 8447 8448 8449 8450 8451 8452 8453 8454 8455 8456
##  [57] 8457 8458 8459 8460 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470
##  [71] 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480 8481 8482 8483 8484
##  [85] 8486 8487 8501 8502 8503 8504 8505 8506 8507 8508 8509 8510 8511 8512
##  [99] 8513 8514 8515 8516 8517 8518 8519 8521 8522 8523 8525 8526 8527 8528
## [113] 8529 8530 8531 8532 8533 8534 8535 8536 8537 8538 8539 8540 8541 8542
## [127] 8543 8544 8545 8546 8547 8548
yz[yz>9999&yz<=999999]
##  [1] 392210 392220 392290 691010 691090 732410 732421 732429 732490 381300
yz[yz>999999]
## numeric(0)

To count how many 2-, 4- and 6-digit HS codes are present in vector yz, we combine the above code with length function.

length(yz[yz<=99])
## [1] 6
length(yz[yz>99&yz<=9999])
## [1] 132
length(yz[yz>9999&yz<=999999])
## [1] 10
length(yz[yz>999999])
## [1] 0

We can use function any to check whether any of the elements in a vector conform to a certain criterion. We will use this function later on.

any(yz[yz<=99])
## [1] TRUE
any(yz[yz>99&yz<=9999])
## [1] TRUE
any(yz[yz>9999&yz<=999999])
## [1] TRUE
any(yz[yz>999999])
## [1] FALSE

You can use function which to show the elements of a vector that conform to a given criterion, or to negate it to show those elements that do not conform to it.

yz[which(yz>9999&yz<=999999)]
##  [1] 392210 392220 392290 691010 691090 732410 732421 732429 732490 381300
yz[-which(yz>9999&yz<=999999)]
##   [1] 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 8414
##  [15] 8415 8416 8417 8418 8419 8420 8421 8422 8423 8424 8425 8426 8427 8428
##  [29] 8429 8430 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440 8441 8442
##  [43] 8443 8444 8445 8446 8447 8448 8449 8450 8451 8452 8453 8454 8455 8456
##  [57] 8457 8458 8459 8460 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470
##  [71] 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480 8481 8482 8483 8484
##  [85] 8486 8487 8501 8502 8503 8504 8505 8506 8507 8508 8509 8510 8511 8512
##  [99] 8513 8514 8515 8516 8517 8518 8519 8521 8522 8523 8525 8526 8527 8528
## [113] 8529 8530 8531 8532 8533 8534 8535 8536 8537 8538 8539 8540 8541 8542
## [127] 8543 8544 8545 8546 8547 8548   84   85   86   87   88   89

With function which we can set forth criteria for selecting elements from a vector. In code chunk below we first create an object a, which only contains those elements of yz that correspond to 2-digit HS codes. And then we create object b, which contains all elements of yz excluding those that correspond to 2-digit HS codes. We will use this function later on.

Use the last line to remove all the object that were created in this practice session and that are redundant now.

a <- yz[which(yz<=99)]
a
## [1] 84 85 86 87 88 89
b <- yz[-which(yz<=99)]
b
##   [1]   8401   8402   8403   8404   8405   8406   8407   8408   8409   8410
##  [11]   8411   8412   8413   8414   8415   8416   8417   8418   8419   8420
##  [21]   8421   8422   8423   8424   8425   8426   8427   8428   8429   8430
##  [31]   8431   8432   8433   8434   8435   8436   8437   8438   8439   8440
##  [41]   8441   8442   8443   8444   8445   8446   8447   8448   8449   8450
##  [51]   8451   8452   8453   8454   8455   8456   8457   8458   8459   8460
##  [61]   8461   8462   8463   8464   8465   8466   8467   8468   8469   8470
##  [71]   8471   8472   8473   8474   8475   8476   8477   8478   8479   8480
##  [81]   8481   8482   8483   8484   8486   8487   8501   8502   8503   8504
##  [91]   8505   8506   8507   8508   8509   8510   8511   8512   8513   8514
## [101]   8515   8516   8517   8518   8519   8521   8522   8523   8525   8526
## [111]   8527   8528   8529   8530   8531   8532   8533   8534   8535   8536
## [121]   8537   8538   8539   8540   8541   8542   8543   8544   8545   8546
## [131]   8547   8548 392210 392220 392290 691010 691090 732410 732421 732429
## [141] 732490 381300

As a final thing, let’s use function any to check if any elements of vector b are present among the elements of vector a. Given the criteria that were used to create these two vector objects, the result is naturally FALSE.

Use the last line to remove all the object that were created in this practice session and that are redundant now.

any(b %in% a)
## [1] FALSE
rm(y,z,yz,a,b)

1.6 Truncate long HS codes to 6-digit codes (continued)

Now let’s get back to our for loop.

The chunk below uses the if...else statement to check, whether the HS codes extracted from item 2 are marked as HS version of 2007. Since it is, the code leads us to the following if...else statements within. Statement 1 checks, whether vector x contains any strings that are 3 or 4 digits long.15

If so, the code then creates object x4 by implementing a series of transformations with object x described in the code string x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]:

  • it uses function data.frame and argument x=x[x>99&x<=9999] to extract 3- and 4-digit long strings from x and turn them into a x4 data frame object, consisting of one column x containing the extracted HS codes
  • it uses function merge to merge the resulting x4 data frame with another data frame created in an earlier step - hs_codes07, which contains 2-, 4- and 6-digit HS codes of HS 2007. These two data frames are merged by those columns of each containing 4-digit HS codes. This essentially adds to data frame x4 a new hs6 column that contains 6-digit HS codes that correspond to the 4-digit HS codes already present in x4.
  • it uses a further specification in square brackets at the very end, signaling to only extract 6-digit HS codes recorded in the newly added hs6 column, and to rewrite x4 object with them (this turns x4 back into a vector containing new 6-digit HS codes).

In the next code string, we use negation of function which to remove 3- and 4-digit long strings from x.16. And then we rewrite x by joining objects x and x4.

Statement 2 checks, whether vector x contains any strings that are 2 or less digits long. 17 If there are, the code takes us through a series of transformations that follow the same logic as the ones described for transformation of 4-digit codes above. Since, there are no 2-digit HS codes in the examined strings, the code takes us to the following step - conversion of HS codes to HS 2012 version.

First, we remove any duplicates from x using unique. Then, we turn x into a data frame and merge it with a data frame created in an earlier step hs_concordance0712, which contains a table of corresponding 6-digit codes from versions of 2007 and 2012. Note argument all.x=TRUE in function merge. It results in extra rows being added to the output, one for each row in x that has no matching row in y. This way if there is no value in HS07 column of hs_concordance0712 that matches a value in x column in the original object x, then the corresponding cell in column HS12 in the resulting object x will be left empty. We then overwrite these empty cells with the corresponding values from column x of the resulting object x, to make sure none of the HS codes are lost.

In the next step we overwrite x with only the unique HS codes from column HS12 of the resulting data frame.

 if(matching.list[[i]]$HS_version=="HS3"){
  # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

  # 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes07[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }

   # 3. convert HS 2007 codes to HS 2012
    x <- unique(x)
    x <- data.frame(x)
    x <- merge(x,hs_concordance0712, by.x="x", by.y="HS07", all.x=TRUE)
    x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
    x <- unique(x$HS12)

  }

The following code chunk tests, whether the examined string of HS codes is marked as HS 2017 version. If it is, then in takes us through a series of transformations that follow the same logic as the ones described immediately above for HS codes marked as HS 2007. But since our examined string is not HS 2017, the code will skip to the step that follows after this code chunk (see last line of code below), which converts the resulting atomic vector ‘x’ to a list vector with function list and writes it to sub-item HS of item 2 of matching.list.

 if(matching.list[[i]]$HS_version=="HS5"){
    # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes17[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

    # 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes17[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }

    # 3. convert HS 2017 codes to HS 2012

    x <- unique(x)
    x <- data.frame(x)
    x <- merge(x,hs_concordance1712, by.x="x", by.y="HS17", all.x=TRUE)
    x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
    x <- unique(x$HS12)

  }
# 4. write the resulting HS codes into the corresponding subitem in "matching.list"
matching.list[[i]]$HS <- list(x) 

A quick look with length(matching.list[[i]]$HS[[1]]) at sub-item HS of item 2, tells us that it now contains 4857 strings, as opposed to 1130 it had before. And using unique(nchar(matching.list[[i]]$HS[[1]])) tells us that sub-item HS of item 2 only contains HS codes that are 5- and 6-digit long.

You will notice that in matching two rows are marked to likely have a mix of HS codes from versions of 2012 and 2017. For those cases at the research stage it was not possible to determine, the exact version of the HS (for details see the ESCAP working paper ‘Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific’.18) All those codes are 6 digit, and for the purpose of this analysis they are assumed as 2012 version of the HS. The code keeps them as they are.

Quiz 9



Question 9.1: Create a new object a and use a succession of functions strsplit, trimws, as.numeric and length (or nest these functions) to determine the number of unique HS codes stored in row 1 column HS of a current iteration of healthSDG data frame. What is the number?

A. 305

# Answer A is correct.

# The correct code is below for reference.

B. 525

# Answer B is incorrect.

C. 258

# Answer C is incorrect.

D. 369

# Answer D is incorrect.



Question 9.2: Use code to check, whether object a has any values that correspond to 2-digit HS codes. How many are they?

A. 57

# Answer A is incorrect.

B. 39

# Answer B is incorrect.

C. 19

# Answer C is incorrect.

D. 0

# Answer D is correct.

# The correct code is below for reference.



Question 9.3: a now contains HS codes that are marked as HS3 (2007). Write code to convert the codes to 2012 version of HS.

A. 230

# Answer A is incorrect.

B. 310

# Answer B is correct.

# The correct code is below for reference.

C. 218

# Answer C is incorrect.

D. 256

# Answer D is incorrect.



The code stored under the button below is not necessary for further training and quizzes. The chunk includes the correct code strings from the questions above, and the last line removes the redundant objects.

# Question 1
a <- strsplit(healthSDG$HS, ";")[[1]]
a <- trimws(a)
a <- as.numeric(a)
length(a)

# Question 2
length(a[a<=99])

# Question 3
    a <- data.frame(a)
    a <- merge(a,hs_concordance0712, by.x="a", by.y="HS07", all.x=TRUE)
    a$HS12[is.na(a$HS12)] <- a$a[is.na(a$HS12)]
    a <- unique(a$HS12)
    length(a)

    rm(a)

1.7 Expand NTM codes to include themselves and all underlying codes at more disaggregated levels of ICNTM

For this step we will manipulate strings in sub-item NTM of item 2 of matching.list and use object MAST_codes that we created earlier to extract the corresponding NTM codes from.

First, we create an object temp_NTM by copying a long string in sub-item NTM item 2, splitting it into a vector of character strings separated by “;” and trimming all leading and trailing white spaces from each substring. To do this we use strsplit and trimws. We then create object temp_NTM_full, which is a copy of temp_NTM. temp_NTM with serve as a source from which to derive individual NTM codes, and temp_NTM_full will be used for gradually populating the expanded NTM codes.

Here we will use a for loop for the sequence that is a number of character strings in temp_NTM.

To help keep better track of what is being done, the loop first prints the nth NTM code from temp_NTM. For item 2 the first NTM code is “A1”. Then the code saves the NTM code into a new object this_mst, and saves a number of characters in this_mst as a numeric object n_char_numeric, which is “2” in the examined case.

Then the code creates an object this_level where it pastes together the word “level” and the value that is found in the MAST_codes data frame in the cell of the column level that corresponds to the NTM code currently stored in this_mst. This way we get “level2” stored in object this_level, which coincides with one of the column titles in MAST_codes data frame.

To extract the NTM subcodes that are under code “A1”, we create a temporary data frame object MAST_codes_temp, which contains all rows of MAST_codes, the column this_level of which contains the code that is stored in this_mst. Then we overwrite this_mst with unique NTM codes contained in column Code of MAST_codes_temp that are longer than the value stored in n_char_numeric based on the corresponding value in column level. Them we overwrite temp_NTM_full by combining values from this_mst and temp_NTM_full. The latter now contains the 7 NTM codes that fall under the first NTM code of temp_NTM, as well as the remaining not yet expanded NTM codes of temp_NTM. Those will be expanded in the next cycles of this loop execution. The last line of code in this for loop removes the temporary objects n_char_numeric, this_level, and MAST_codes_temp.

You may wonder why we need to create so many strangely named objects and stuff the code with them. It is because each cycle (or iteration) of the loop uses functions that take multiple arguments, specific values of which are dependent of the NTM code that is being processed. In paragraph above we described processing of NTM code A1. One of the following iterations of the loop will likely process NTM code B851. Try running the code within the loop’s curly brackets for that code only and see how the values of each variable change.

Once this for loop runs all of its cycles for the number of NTM codes stored in temp_NTM, the code will overwrite temp_NTM_full with itself keeping only the unique NTM codes. And then all the resulting NTM codes are stored into sub-item NTM of item 2 of matching.list as a list vector. Them we use rm to remove the rest of the temporary objects created in this for loop. Since this loop is part of the bigger for loop the code will return in due course to this loop to expand all NTM codes contained in all items of matching.list.

  temp_NTM <- trimws(strsplit(matching.list[[i]]$NTM, ";")[[1]])
  temp_NTM_full <- temp_NTM

  for (n in 1:length(temp_NTM)){
    print(temp_NTM[n])
    this_mst <- temp_NTM[n]
    n_char_numeric <- as.numeric(nchar(this_mst))
    this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
    MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
    this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
    temp_NTM_full <- c(this_mst, temp_NTM_full)
    rm(n_char_numeric, this_level, MAST_codes_temp)
  }
  temp_NTM_full <- unique(temp_NTM_full)
  matching.list[[i]]$NTM <- list(temp_NTM_full)
  rm(temp_NTM, n,temp_NTM_full)

Quiz 10



Question 10.1: Write code to expand NTM code E3. How many unique codes will be generated? As the last line of code remove all created temporary objects.

A. 10

# Answer A is incorrect.

B. 11

# Answer B is incorrect.

C. 13

# Answer C is correct.

# The correct code is below for reference.

D. 15

# Answer D is incorrect.



Question 10.2: Write nested for loops to expand all NTM codes in all items of health.list. How many NTM codes will be recorded in item 23 after such transformation?

A. 34

# Answer A is correct.

# The correct code is below for reference.

B. 25

# Answer B is incorrect.

C. 18

# Answer C is incorrect.

D. 41

# Answer D is incorrect.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and some of these objects you will need for further quizzes and practice sessions.

# Question 1
    this_mst <- "E3"
    n_char_numeric <- as.numeric(nchar(this_mst))
    this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
    MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
    this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
    this_mst <- unique (this_mst)
    length(this_mst)

    rm(this_mst, n_char_numeric, this_level, MAST_codes_temp)


# Question 2
for(i in 1:nrow(healthSDG)){

  temp_NTM <- trimws(strsplit(health.list[[i]]$NTM, ";")[[1]])
  temp_NTM_full <- temp_NTM

  for (n in 1:length(temp_NTM)){
    #print(temp_NTM[n])
    this_mst <- temp_NTM[n]
    n_char_numeric <- as.numeric(nchar(this_mst))
    this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
    MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
    this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
    temp_NTM_full <- c(this_mst, temp_NTM_full)
    rm(n_char_numeric, this_level, MAST_codes_temp)
  }
  temp_NTM_full <- unique(temp_NTM_full)
  health.list[[i]]$NTM <- list(temp_NTM_full)
  rm(temp_NTM, n,temp_NTM_full, this_mst)
}
    length(health.list[[23]]$NTM[[1]])

1.8 Prepare keywords and negative keywords for matching

The last section of the big for loop finalizes preparation of keywords and negative keywords in matching.list.

We use conditional if...else statement to check if the sub-item Keywords in item 2 contains any strings. If it does, the statement takes us through the rest of the code within the curly brackets.

First we create a temporary object kw by copying a long string in sub-item Keywords of item 2, splitting it into a vector of character strings separated by “;” and trimming all leading and trailing white spaces from each substring.

The next line, which uses function tolower, overwrites kw with itself by formatting all substrings that do not contain “!” in them to lower case. To specify, which strings need to be formatted, we use code kw[!grepl("!", kw)], in which ! before grepl signals negation and function grepl searches for a specific pattern in the substrings in kw. As a result, those keywords, which are abbreviations recorded as all caps and marked with “!” will remain as all caps and will be used for case sensitive keyword matching. The rest of keywords will be used for case insensitive matching. This is done as certain abbreviations may match other words or their parts. Example: Letter combination “ODS”, which as an abbreviation stands for “ozone depleting substances”, may be found in words “goods”, “rods”, “foods”, etc.

Then all the resulting keywords are stored into sub-item Keywords of item 2 of matching.list as a list vector. Then we use rm to remove the temporary object kw.

The same is done for negative keywords.

# Keywords
if(!is.na(matching.list[[i]]$Keywords)){
  kw <- trimws(strsplit(matching.list[[i]]$Keywords, ";")[[1]])
  kw[!grepl("!", kw)] <- tolower(kw[!grepl("!", kw)])
  matching.list[[i]]$Keywords <- list(kw)
  kw
  rm(kw)
}
  #Negative keywords
if(!is.na(matching.list[[i]]$neg_Keywords)){
  nkw <- trimws(strsplit(matching.list[[i]]$neg_Keywords, ";")[[1]])
  nkw[!grepl("!", nkw)] <- tolower(nkw[!grepl("!", nkw)])
  matching.list[[i]]$neg_Keywords <- list(nkw)
  nkw
  rm(nkw)
  }
#}

Quiz 11



Question 11.1: We want to see how many strings in row 40 of matching data frame are abbreviations marked with !. Which line of code will achieve that? Note that ; can be used to put two lines of code in one line.

A. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[grepl("!")])

# Answer A is incorrect. `grepl` function is missing an argument indicating a character vector or an object that should be searched for matches.

B. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(grepl("!", KeyW))

# Answer B is incorrect. We use `grepl` function to subset character vector `KeyW`, thus we need to use square brackets for subsetting like so: `KeyW[grepl("!")]`.

C. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[grepl("!", KeyW)])

# Answer C is correct.

D. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[!grepl("!", KeyW)])

# Answer D is incorrect. Unnecessary negation is indicated with `!` within the square brackets in front of `grepl`.



Question 11.2: Write code to check the number of non-abbreviations in row 38 of matching data frame. Choose the correct number below.

A. 63

# Answer A is correct.

# The correct code is below for reference.

B. 67

# Answer B is incorrect. 

C. 54

# Answer C is incorrect. 

D. 58

# Answer D is incorrect. 



Once you are done with the quiz, you can check the correct code under this button.

# Question 1

KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[grepl("!", KeyW)])

# Question 2

KeyW <- trimws(strsplit(matching$Keywords[[38]], ";")[[1]]); length(KeyW[!grepl("!", KeyW)])

rm(KeyW)

1.9 Review the resulting matching.list object

So, we examined all of the code chunks within the large for loop that finalizes the setting up of the matching.list. Here is what item 2 of the matching.list looks like after all the transformations.

The preview of item 2 of the “matching.list” after the transformations.

The preview of item 2 of the “matching.list” after the transformations.

If you run a complete loop for the complete sequence of i (currently, inactivated by hash symbols in code chunk above), all relevant sub-items within all items of matching.list will be transformed this way.

In fact the code for the complete loop is under this button. Make sure to paste it in RStudio and run it before you move on with the training. It will take a few minutes. Look through it. Do you understand what each line of code does?

# creating `matching.list` once again. we need to do this once again, as we need `matching.list` objects unchanged, so as to demonstrate the workings of the complete full loop

matching.list <- split(matching, seq(nrow(matching)))

for (i in 1:nrow(matching)){
#cat("--------- ", i, " --------- \n")
# HS
x <- strsplit(matching.list[[i]]$HS, ";")[[1]]
x <- as.numeric(substr(trimws(x), 1, 6))
x <- unique(x)

# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if(matching.list[[i]]$HS_version=="HS4"){
  if (length(x[x>99&x<=9999])>0){
    x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
    x <- x[-which(x>99&x<=9999)]
    x <- c(x,x4)
  }

  # 2. test for presence of 2-digit codes and expand them to 6-digit using the right HS version
  if (length(x[x<=99])>0){
    x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
    x <- x[-which(x<=99)]
    x <- c(x,x2)
  }
}

if(matching.list[[i]]$HS_version=="HS3"){
  # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
  if (length(x[x>99&x<=9999])>0){
    x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
    x <- x[-which(x>99&x<=9999)]
    x <- c(x,x4)
  }

  # 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
  if (length(x[x<=99])>0){
    x2 <- merge(data.frame(x = x[x<=99]), hs_codes07[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
    x <- x[-which(x<=99)]
    x <- c(x,x2)
  }

  # 3. convert HS 2007 codes to HS 2012
  x <- unique(x)
  x <- data.frame(x)
  x <- merge(x,hs_concordance0712, by.x="x", by.y="HS07")
  x <- unique(x$HS12)

}

if(matching.list[[i]]$HS_version=="HS5"){
  # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
  if (length(x[x>99&x<=9999])>0){
    x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes17[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
    x <- x[-which(x>99&x<=9999)]
    x <- c(x,x4)
  }

  # 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
  if (length(x[x<=99])>0){
    x2 <- merge(data.frame(x = x[x<=99]), hs_codes17[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
    x <- x[-which(x<=99)]
    x <- c(x,x2)
  }

  # 3. convert HS 2017 codes to HS 2012
  x <- data.frame(x)
  x <- merge(x,hs_concordance1712, by.x="x", by.y="HS17")
  x <- unique(x$HS12)

}
# 4. write the resulting HS codes into the corresponding subitem in "matching.list"
matching.list[[i]]$HS <- list(x)

# NTMs
temp_NTM <- trimws(strsplit(matching.list[[i]]$NTM, ";")[[1]])
temp_NTM_full <- temp_NTM

for (n in 1:length(temp_NTM)){
#print(temp_NTM[n])
this_mst <- temp_NTM[n]
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
temp_NTM_full <- c(this_mst, temp_NTM_full)
rm(n_char_numeric, this_level, MAST_codes_temp)
}
temp_NTM_full <- unique(temp_NTM_full)
matching.list[[i]]$NTM <- list(temp_NTM_full)
rm(temp_NTM, n,temp_NTM_full)

# Keywords
if(!is.na(matching.list[[i]]$Keywords)){
  kw <- trimws(strsplit(matching.list[[i]]$Keywords, ";")[[1]])
  kw[!grepl("!", kw)] <- tolower(kw[!grepl("!", kw)])
  matching.list[[i]]$Keywords <- list(kw)
  kw
  rm(kw)
}

#Negative keywords
if(!is.na(matching.list[[i]]$neg_Keywords)){
  nkw <- trimws(strsplit(matching.list[[i]]$neg_Keywords, ";")[[1]])
  nkw[!grepl("!", nkw)] <- tolower(nkw[!grepl("!", nkw)])
  matching.list[[i]]$neg_Keywords <- list(nkw)
  nkw
  rm(nkw)
}
}
rm(x,x2,x4,i,this_mst)

Practice writing code with nested for loop’s and if...else statements



Try writing a long for loop to modify health.list in the same way that we modified matching.list in the sections above. If the task seems too intimidating to do from scratch, refer to the complete for loop in a file with a complete code algorithm (see Data section).

Important: In one of the quizzes above we already implemented such transformation for NTM codes in health.list. You can see that by clicking onto health.list object in the Environment tab in the top right corner and then on any of the list items. You will see that NTM subitem is already stored as a list. And we need to achieve the same result for all other subitems of each item. Thus, the loop that you will put together should not have the code for processing NTM codes, otherwise an error message will be generated, and the code will stop running.

First, try to put together or edit the code from the complete algorithm by yourself. Once you gave it a chance click the button below to see the complete code. Look through it. Do you understand what each line of code does?

for (i in 1:nrow(healthSDG)){
  cat("--------- ", i, " --------- \n")
  # HS
  x <- strsplit(health.list[[i]]$HS, ";")[[1]]
  x <- as.numeric(substr(trimws(x), 1, 6))
  x <- unique(x)

  # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
  if(health.list[[i]]$HS_version=="HS4"){
    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

    # 2. test for presence of 2-digit codes and expand them to 6-digit using the right HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }
  }

  if(health.list[[i]]$HS_version=="HS3"){
    # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

    # 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes07[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }

    # 3. convert HS 2007 codes to HS 2012

    x <- unique(x)
    x <- data.frame(x)
    x <- merge(x,hs_concordance0712, by.x="x", by.y="HS07", all.x=TRUE)
    x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
    x <- unique(x$HS12)

  }

  if(health.list[[i]]$HS_version=="HS5"){
    # 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes17[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

    # 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes17[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }

    # 3. convert HS 2017 codes to HS 2012

    x <- unique(x)
    x <- data.frame(x)
    x <- merge(x,hs_concordance1712, by.x="x", by.y="HS17", all.x=TRUE)
    x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
    x <- unique(x$HS12)

  }
  # 4. write the resulting HS codes into the corresponding subitem in "health.list"
  health.list[[i]]$HS <- list(x)

   # Keywords
  if(!is.na(health.list[[i]]$Keywords)){
    kw <- trimws(strsplit(health.list[[i]]$Keywords, ";")[[1]])
    kw[!grepl("!", kw)] <- tolower(kw[!grepl("!", kw)])
    health.list[[i]]$Keywords <- list(kw)
    kw
    rm(kw)
  }

  #Negative keyworkds
  if(!is.na(health.list[[i]]$neg_Keywords)){
    nkw <- trimws(strsplit(health.list[[i]]$neg_Keywords, ";")[[1]])
    nkw[!grepl("!", nkw)] <- tolower(nkw[!grepl("!", nkw)])
    health.list[[i]]$neg_Keywords <- list(nkw)
    nkw
    rm(nkw)
  }
}
rm(x,x2,x4,i,this_mst)

Since all the transfomation has been already been implemented to all items of matching.list, we can get the properly transformed health.list by just subsetting the matching.list. What code could work for that?

health.list <- matching.list[3:29]; names(health.list) <-1:27

1.10 Import excel file with UNCTAD TRAINS database of NTMs into R and prepare it for matching

Now we need to import the data frame that contains the data from the UNCTAD’s TRAINS database of NTMs into R, and them clean up and organize the data. Similarly to the instances above, we use read_excel to do that. However, this time we need to include argument skip=9 to skip the first nine rows of the sheet titled as “new”, as they do not contain any data relevant to us.

Please note that the excel file with NTM data has been modified to include columns with translations of non-English descriptions into English.19

ntms_all <- read_excel("UNCTAD_i-tip_report_ALL_Measures_x.xlsx",  skip=9, sheet="new")

If you look at the column names with names(ntms_all), you will see that some of the columns need to be removed, namely: Category, In force, Withdrawn, Also domestic, …17. We can do this by using pipe operator %>% and function select of dplyr package to select specific columns that we want to keep, as is shown below. Using pipe operator %>% instead of nested functions within multiple round brackets makes the code look more readable.

Note, that a string of code can use a few %>% and be quite long. In this case you can break it up into a few lines. Each of such lines should end in %>%, otherwise the code won’t run.

ntms_all <-  ntms_all %>% select("Country imposing",
                                 "Partner affected",
                                 "NTM Code",
                                 "HS",
                                 "Measure description",
                                 "National legal basis" ,
                                 "Product description",
                                 "Source",
                                 "Measure description_English",
                                 "Product description_English",
                                 "Source_English",
                                 "National legal basis_English")

Keyword and negative keyword matching will be implemented for all descriptive columns of ntms_all. So, for convenience we merge all four descriptive columns: first, the original ones and then the translated ones (the latter are marked with _"English"). For that we use function unite of tidyr package and include the following arguments:

  • ntms_all to specify the source data frame
  • "Description" to specify the name of the new column
  • c() with the names of the columns to be united
  • sep=" " to indicate that a space should be inserted between the combined character strings
  • remove=TRUE to indicate that the source columns should be removed from the data frame
ntms_all <- unite(ntms_all, "Description", c(
  `Measure description`,
  `National legal basis` ,
  `Product description`,
  `Source`
), sep = " ", remove = TRUE)

ntms_all <- unite(ntms_all, "Description_English", c(
  "Measure description_English",
  "Product description_English",
  "Source_English",
  "National legal basis_English"
), sep = " ", remove = TRUE)

For any of the empty united columns, the string “NA” was included in the new column. And if all four columns were empty, four consecutive “NA”s were recorded into the resulting column, like so “NA NA NA NA”. Now, those cells which in column Descriptions do not contain “NA NA NA NA” and for which the corresponding cells in column Description_English do not contain “NA NA NA NA”, need to be overwritten with the content of the corresponding cells in column Description_English. That is, we need to overwrite descriptions in language other than English with the descriptions translated into English.

Having done that, we need to delete now redundant column Description_English. See the relevant code strings below.

ntms_all[ntms_all$Description_English!="NA NA NA NA","Description"] <-
  ntms_all[ntms_all$Description_English!="NA NA NA NA","Description_English"]

ntms_all <- ntms_all[,-which(names(ntms_all)=="Description_English")]

And finally, we use pipe operator %>% and function rename of dplyr package to rename some of the columns for more convenient reference. We can also remove some of the objects that we won’t need from workspace, to make the it less cluttered. As the last step you can take a look at ntms_all with head function.

ntms_all <- ntms_all %>% rename(reporter = `Country imposing`, partner = `Partner affected`, NTM = `NTM Code`, description="Description")
rm(hs_codes07, hs_codes17, hs_concordance, hs_concordance0712, hs_concordance1712)

head(ntms_all)
## # A tibble: 6 x 5
##   reporter      partner      NTM   HS                description
##   <chr>         <chr>        <chr> <chr>             <chr>
## 1 Venezuela, B~ Argentina, ~ B81   300310, 300320, ~ Resolution MERCOSUR /~
## 2 Afghanistan   All Members  P11   0102, 2903, 2904~ Livestock is banned f~
## 3 Afghanistan   All Members  P11   4403              Cuts from all kind of~
## 4 Afghanistan   All Members  P13   06, 07, 08, 0909~ To export from Afghan~
## 5 Afghanistan   All Members  P13   27, 71, 72        You can only export m~
## 6 Afghanistan   All Members  P13   27, 71, 72        In addition to the mi~

Quiz 12



Question 12.1: You want to rename columns in HSversion object we created earlier and to keep only columns 2 through 10. Take a look at that object. Which line of code will achieve that?

A.

HSversion <- HSversion %>%
rename(sdg='2', target='3', product_description='4', list_description='5', hs='6', hsversion='7', ntm='8', kw='9', nkw='10') %>%
select(2,10)

# Answer A is incorrect. In our case the argument in `select` function should include a range of column numbers from 2 through to 10, like so: `select(2:10)`. `select (2, 10)` extracts only two columns: 2 and 10.

B.

HSversion <- HSversion %>%
select(2:10) %>%
rename(sdg='2', target='3', product_description='4', list_description='5', hs='6', hsversion='7', ntm='8', kw='9', nkw='10')

# Answer B is correct.

C. Both.

# Answer C is incorrect.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code string from the question above, and the object you will need for further quizzes and practice sessions.

HSversion <- HSversion %>% rename(sdg='2', target='3', product_description='4', list_description='5', hs='6', hsversion='7', ntm='8', kw='9', nkw='10') %>% select(2:10)

2 SDG-HS-NTM-Keywords matching

2.1 Brief overview

Below we include a new long for loop that carries out a number of actions to implement matching of data available for China in TRAINS database of NTMs, now stored in ntms_all, against the SDG-HS-NTM Concordance Matrix, now stored in matching.list.

As was done with previous long loop, we will break it up into steps that will contain other smaller loops. For demonstration purposes, we will only run the loop for a limited subset of NTM data. To deactivate the loop we pasted hash sign in front of fors and in front of the closing curly brackets of the loops. The complete code for the loop is available in “complete_matching_code.R” file (see Data section).

2.2 Select a country and create a subset of ntms_all data frame

The list of countries, for which NTM data is available in the TRAINS database of NTMs is stored in countries data frame along with the ISO3 codes for countries. The for loop below will run the code for all countries for which NTM data is available, as is determined by the opening line for (c in 1:nrow(countries)), But for demonstration purposes we only select China, which is in row 20 of the countries data frame. So we assign value of 20 to object c. Then we create two objects this_country and this_countryiso3 to store name of the country “China” and its ISO3 code “CHN” respectively.

Then with the last line of code below we subset a new data frame ntms that contains all NTM data for the cth reporter, which is China in our case, that are available in ntms_all data frame.

Note: if you want to run analysis of all NTMs for only one country, you can set the sequence of for loop only for that country like so: for (c=20) for China. The full index of countries in available in countries data frame.

Additionally, countries_UNCTAD_iso3.csv, from which we generated object countries contains columns with integer values 1 or 0 indicating that some countries are part of certain classification, geographical or economic integration groupings. Therefore, this loop below may be run for such country groupings as well. For example, if you would like to run the code for SPECA countries20 for which NTM data is available in TRAINS database, you could use the following code lines to set up the for loop.

# import the source excel file
# countries <- read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)

# subset the data frame to only contain rows for countries that are part of SPECA and columns with country names and ISO3 codes
# speca <- countries[countries$speca==1, c(1,2) ]

# set up the `for loop`
# for (c in 1:nrow(speca)){rest of the loop}

Back to our code:

c <-20  # 20 for China
#for (c in 1:nrow(countries)){

  this_countryiso3 <- countries$iso3[c]
  this_country <- countries$country[c]
  cat( "------ ",this_countryiso3 ," ------\n"); flush.console()    
## ------  CHN  ------
  ntms <- ntms_all[ntms_all$reporter==this_country, ] 

Now, we create a new object ntms.list based on the ntms data frame with number of items equal to the number of rows of ntms and the sub-items containing the data from the individual columns of the corresponding rows.

ntms.list <- split(ntms, seq(nrow(ntms)))

2.3 Truncate long HS codes to 6-digit codes

The following steps of the loop will be examined for China’s NTM stored in item 5998 of ntms.list, which has the following description:

print(ntms.list[[5998]]$description)
## [1] "The following documents should be provided to the administration of environmental protection of the State Council for the application for exporting hazardous waste: […] 7. Users of the importing country (region) validate certification or permit of hazardous waste treatment or utilization. [...] 11. Business permit of exporters. If exports specify in the collection, storage, treatment or utilization of hazardous waste, they must provide the permit for hazardous waste business. <U+5371><U+9669><U+5E9F><U+7269><U+51FA><U+53E3><U+6838><U+51C6><U+7BA1><U+7406><U+529E><U+6CD5>(<U+7B2C>47<U+53F7><U+4EE4>) Hazardous Waste listed in Annexes I, II, VIII, IX of the Basel Convention, refering to Annex 8 Ministry of environmental Protection of the People's Republic of China - <U+5371><U+9669><U+5E9F><U+7269><U+51FA><U+53E3><U+6838><U+51C6><U+7BA1><U+7406><U+529E><U+6CD5>(<U+7B2C>47<U+53F7><U+4EE4>)"

As was already done before for matching.list, we check if the HS codes stored in item 5998 are longer than 6-digits and truncate those to 6 first digits. We also expand all codes that are 4 and 2 digit long to the underlying 6-digit HS codes using 2012 version of the HS (within the TRAINS database of NTMs HS codes are assumed to be registered in HS 2012).

For this purpose we use a nested for loop for the sequence of rows present in object ntms. As before, we use hash sign before the opening and the closing lines of the loop to deactivate it and, instead, set we i as equal to 5998. To run the loop for all rows of ntms you will need to remove the hash signs.

i<-5998
 # 1. loop through each row of NTM data in the ntms.list
  #for (i in 1:nrow(ntms)){

# 2. split the string stored in subitem "HS" of the ith item into a vector or separate character strings stored in object "x"
    x <- list(strsplit(ntms.list[[i]]$HS, ",")[[1]])[[1]]

# 3. overwrite x with itself removing potentially empty strings
    x <- x[!is.na(x)]

# 4. truncate any HS codes longer than 6-digit, and overwrite x as a numeric vector removing any duplicating strings
    x <- unique(as.numeric(substr(trimws(x), 1, 6)))

# 5. test for presence of 4-digit codes and expand them to 6-digit using the 2012 HS version

    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

# 6. test for presence of 2-digit codes and expand them to 6-digit using the 2012 HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }

# 7. remove all duplicates and write the resulting set of HS codes to subitem "HS" of the ith item as a list vector

    x <- unique(x)
    ntms.list[[i]]$HS <- list(x)
    #cat( "------ ",i ," ------\n");flush.console()
  #}

# 8. remove redundant objects
  rm(x,x2,x4,i)

2.4 Create empty data frame for populating results of matching

In the next step we create an empty data frame df that will be used to populate results of matching for an individual country - China in our case.

The df will contain 13 columns that will eventually contain the following information about each NTM linked to an SDG:

  1. measure measure number for a given country in ntms.list
  2. matching number of a matched row in matching
  3. nt_length number of HS codes regulated by a given NTM that match HS codes in one of the concordance strings in matching.list
  4. matchhs6share share of HS codes regulated by a given NTM that match HS codes in one of the concordance strings in matching.list
  5. hs6s the list of HS codes regulated by a given NTM that match HS codes in one of the concordance strings in matching.list
  6. matchNTM logical value for whether an NTM code of a given NTM matched any of the NTM codes stored in one of the concordance strings in matching.list
  7. NTM NTM code of a given NTM that matched any of the NTM codes stored in one of the concordance strings in matching.list
  8. matchkw keywords that were found present in a given NTM description
  9. matchkwlength number of keywords that were found present in a given NTM description
  10. matchkwshare share of keywords from a given concordance strings that were found present in a given NTM description
  11. matchnkw negative keywords that were found present in a given NTM description
  12. matchnkwlength number of negative keywords that were found present in a given NTM description
  13. matchnkwshare share of negative keywords from a given concordance strings that were found present in a given NTM description

Note, that for each column we include an argument specifying the class of the data to be stored.

df <- data.frame(measure=integer(),
                   matching=integer(),
                   nt_length = integer(),
                 matchhs6share = numeric(),
                   hs6s = character(),

                   matchNTM = logical(),
                   NTM = character(),

                   matchkw = character(),
                   matchkwlength = integer(),
                   matchkwshare = numeric(),

                   matchnkw = character(),
                   matchnkwlength = integer(),
                 matchnkwshare = numeric(),

                   stringsAsFactors=FALSE) 

2.5 Matching of NTM entries in ntms.list against the SDG-HS-NTM concordance strings in matching.list

Now we finally come to the steps when we implement the actual matching of NTMs from TRAINS database against the SDG-HS-NTM concordance strings. Below n stands for the item number in the ntms.list, while m is the item number in matching.list. For our example, we continue with the NTM number 5998, as above, and use item number 36 from matching.list which references use of NTMs to regulate trans-boundary movement of hazardous chemicals and waste.

The first nested loop runs for the sequence of rows in ntms data frame and creates the following temporary objects from ntms.list:

  • nt contains HS codes for nth NTM of country c stored in ntms.list as an atomic vector
  • this_mst contains NTM code for nth NTM of country c stored in ntms.list expanded to include itself and all underlying NTM codes at all further disaggregated levels of ICNTM (same as was done earlier for matching.list). This is done to eliminate cases where matching fails due to certain NTMs being registered in TRAINS database using NTM codes at a higher level of disaggregation than is envisaged in matching.list. This processing is similar to what have been done earlier in section 1.7 above, which describes processing of NTM codes in matching.list
  • description contains the text from description column formatted to lower case21
  • description_caps contains the text from description column without formatting to lower case

Notice, the following 2 lines of code below:

description <- gsub("[\r\n]", "", description)
description <- iconv(description, from = 'UTF-8', to = 'ASCII//TRANSLIT')

In very few cases, NTM description contains text pasted from a table and preserves some symbols that relate to the source formatting. Such symbols may result in error messages and halt execution of code. These two lines remove such symbols.

The first line uses gsub function to find and remove all line breaks and newlines, which are referenced with regular expressions /r and /n. For more details on basic regular expressions in R used for manipulating character strings look, for example, here.

The second line uses function iconv to convert character encoding from UTF-8 to ASCII//TRANSLIT, which removes accents, but keeps the letters.

 n=5998; m=36

  #for (n in 1:nrow(ntms)){
    nt <- ntms.list[[n]]$HS[[1]]

    this_mst <- ntms.list[[n]]$NTM      # NTM code of the nth measure from country c
    n_char_numeric <- as.numeric(nchar(this_mst))
    this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
    MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
    this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
    rm(n_char_numeric, this_level, MAST_codes_temp)

    description  <- ntms.list[[n]]$description
    description <- gsub("[\r\n]", "", description)
    description <- iconv(description, from = 'UTF-8', to = 'ASCII//TRANSLIT')

    description_caps <- description
    description <- tolower(description)

The further nested loop does the following for the sequence of rows in matching data frame:

  • creates temporary object mt from matching.list which contains all HS codes from mth item of matching.list as an atomic vector
  • creates temporary object ntn which contains those HS codes from nt that are present in mt: in our examined case, all 754 HS codes present in the 5998th NTM of China match the HS codes present in the 36th item of the matching.list
  • creates temporary object matchhs6share, which contains the value for share of a country c’s 6-digit HS codes stored in nt that match codes stored in mt
 #for (m in 1:nrow(matching)){
      mt <- matching.list[[m]]$HS[[1]]
      ntn <- nt[nt %in% mt]
      matchhs6share <- length(ntn)/length(nt)*100
  • creates temporary object matchNTM which stores a logical value for whether or not NTM code present in the nth NTM of country c matches NTM code in the mth item in matching.list, which in our case is TRUE
  • creates temporary object NTM which contains the NTM code present in the nth NTM of country c
    matchNTM <- any(this_mst %in% matching.list[[m]]$NTM[[1]])
    NTM <- ntms.list[[n]]$NTM
  • kw contains keywords from mth item of matching.list as an atomic vector with keywords marked with exclamation marks removed22
  • kw_caps contains only the keywords marked with exclamation marks (the exclamations marks themselves are removed)
  • nkw and nkw_caps similarly to the above

Once again, this is done to separate keywords that are abbreviations (all-caps character strings) and that will be used for case-sensitive matching, from all other keywords that will be used for case-insensitive matching. Notice, that to create a list of all-caps keywords we use function grepl to include criterion to find and extract only those keywords that have ! in them. And later, we use function gsub on kw_caps vector to find and remove all ! from the character strings. And then, we use grepl function once again to remove from kw vector all keywords that contain !.

      kw <- matching.list[[m]]$Keywords[[1]]

      kw_caps <- kw[grepl("!", kw)]
      kw_caps <- gsub("!","",kw_caps)
      kw   <- kw[!grepl("!", kw)]

      nkw <- matching.list[[m]]$neg_Keywords[[1]]
      nkw_caps <- nkw[grepl("!", nkw)]
      nkw_caps <- gsub("!","",nkw_caps)
      nkw   <- nkw[!grepl("!", nkw)]

For some of the items in matching.list the keywords are not necessary and are not provided. For them the matching is considered complete if there are only 1 matching HS code and 1 matching NTM code present in a given NTM. So below we include a conditional if...else statement that prescribes two alternative treatments of a country’s NTM description, depending on availability of keywords in matching.list.

If length of vector kw is equal to 1 and is empty, then the following temporary objects should be created: * matchkw with value “NO KW INCL”; * matchkwlength with value 1, * matchkwshare with value 100.

Otherwise, the following objects should be created:

  • matchkw with strings that contain those words and words combinations from an NTM description that strictly match the words and words combinations that are present in description and description_caps
  • matchkwlength with a value equal to the length of vector matchkw indicating a number of matching expressions
  • matchkwshare with a value for the share of keywords provided in mth item of matching.list present in the description of nth NTM of country c

Notice the use of function stri_detect_fixed which is used to look for case-sensitive matches for a certain pattern (second argument) in a specified characters string (first argument). Case-sensitivity is a default setting for this function.

The following if...else statements indicates, that if matchkwlength is equal to 0, then matchkw should be made empty.

      if(length(kw)==1&(is.na(kw[[1]]))){

        matchkw <- "NO KW INCL"
        matchkwlength <- 1
        matchkwshare <- 100

      } else
      {
        matchkw <- kw[stri_detect_fixed(description,kw)]
        matchkw_caps <- kw_caps[stri_detect_fixed(description_caps,kw_caps)]
        matchkw <- c(matchkw, matchkw_caps)
        matchkwshare <- length(matchkw)/length(kw)*100
       matchkwlength <- length(matchkw)
        if(length(matchkw)==0){matchkw <- ""}

      }

Absence of negative keywords in a given concordance string of matching.list does not affect anything. In fact most of the items in matching.list do not include any negative keywords. So, we create objects matchnkwlength, matchnkwshare and matchnkw with default values of 0, 0 or ``.

However, to take into account those concordance strings of matching.list that do contain negative keywords, we use the following conditional if...else statement, that is similar to the one used for keywords above.

      matchnkwlength  <- 0
      matchnkwshare <- 0
      matchnkw <- ""

      if (length(nkw)>0&!(is.na(nkw[[1]]))){
        matchnkw <- nkw[stri_detect_fixed(description,nkw)]
        matchnkw_caps <- nkw_caps[stri_detect_fixed(description,nkw_caps)]
        matchnkw <- c(matchnkw, matchnkw_caps)
        matchnkwshare <- length(matchnkw)/length(nkw)*100
        matchnkwlength <- length(matchnkw)
        if(length(matchnkw)==0){matchnkw <- ""}

      }

Now let’s store the results in the currently empty data frame df, by creating an additional empty row in the data frame using argument [(nrow(df)+1),] and populating it with the values from the temporary objects generated above.

As the loops take each of the multiple entries of a country’s NTMs through each of the 67 concordance strings, printing the line __ helps to keep track of the progress made.

Function flush.console is not necessary, but it ensures that the display of output in the R console is current, even if output buffering is on.

     df[(nrow(df)+1),] <-
        c(
          n,
          m,
          length(ntn),
          matchhs6share,
          paste(ntn,collapse=";"),

          matchNTM,
          NTM,

          paste(matchkw,collapse=";"),
          matchkwlength,
          matchkwshare,

          paste(matchnkw,collapse=";"),
          matchnkwlength,
          matchnkwshare
        )

      cat( "------ c: ", this_countryiso3, ", n: ", n ,", m: ", m ," ------\n");

      flush.console()
    #}
 #}

Below we add new columns or overwrite the existing ones with the following values:

  • column hs_match will contain an integer value of 1 if the corresponding value in column matchhs6share is above 0;
  • column kw_match will contain an integer value of 1 if the corresponding value in column matchkwshare is above 0;
  • column matchNTM will be overwritten with integer value of 1 if it contains a logical value of TRUE;
  • column nkw_match will contain an integer value of 1 if the corresponding value in column matchnkwshare is above 0;
  • column index will sum up the values in the first three columns described above and subtract the value in the latter column.

Note, that we are specifically interested in those NTMs that contain a simultaneous match to one of the possible values outlined in concordance strings in matching.list, while lacking any of the negative keywords (if they are provided in matching.list). Such NTMs in column index would contain value 3. The rest we can remove, with last line of code in the vignette below, which indicates that in df we only keep those rows that contain value 3 in column index.

  df$hs_match <- as.integer(df$matchhs6share > 0)
  df$kw_match <- as.integer(df$matchkwshare > 0)
  df$matchNTM <- as.integer(as.logical(df$matchNTM))
  df$nkw_match <- as.integer(df$matchnkwshare > 0)

  df$index <- df$hs_match + df$kw_match + df$matchNTM - df$nkw_match

  df <- df[df$index==3, ]

As the result of the actions above we know that China’s 5998th NTM is a full match to a concordance string 36, which assigns this NTM to SDG Target 12.4 of responsible management of hazardous chemicals and waste, as it contains 754 HS codes, 3 keywords and an NTM code that matched the possible values stored in the concordance string 36, while no negative keywords were listed for concordance string 36 of matching.list.

To finalize analysis summary for our sample, we create a new data frame df2, in which we merge df by column matching and matching by column row. This way the resulting df2 includes information on which SDG target each matched NTM is linked to.

Now we can export the summary df2 by using write.csv(), in the argument of which we indicate what to export (df2), how to name the exported file (paste0("countries/", "detailed", this_countryiso3, ".csv")), not to include row names(row.names = FALSE) and what to store in empty cells (na="").

Take a close look at the first argument in the nested paste0 function. It has a forward slash in it - "countries/". This is actually a name of the sub-folder in our working directory into which we are saving the generated country files. This sub-folder has to be created in advance, otherwise you will get an error message. The following two arguments are elements of the file name, the second part of which will depend on the string stored in this_countryiso3 object at the start of a given loop cycle. And the last mandatory element is the file extension ".csv".

And then we remove all now redundant objects.

  df2 <- merge(df, matching[c("row", "Target")], by.x="matching", by.y="row")

  write.csv(df2, paste0("countries/", "detailed", this_countryiso3, ".csv"), row.names = FALSE, na="")

  rm(c, description, description_caps, kw, kw_caps,m, matchkw, matchkw_caps, matchkwlength, matchnkw, matchnkw_caps, matchnkwlength, matchNTM, mt, n, nkw, nkw_caps, nt, NTM, ntms, ntn, this_country, this_countryiso3,this_mst, matchhs6share, matchkwshare, matchnkwshare, matchHS, hs6s, ntms.list)

#}

As mentioned earlier, the complete matching loop for all countries for which data is available in ntms_all data frame is provided in the R file with the complete code algorithm and also under the button below. Take a look at it first. Do you understand what each line of code does?

Running the code for all the countries and even just for a selected group of countries may take a long time, depending of the power of your PC. Alternatively, you can revise the first line of the loop for (c in 1:nrow(countries)){ to for (c in 1:1){ and only run the complete loop for Afghanistan. It should take just a couple of minutes, and you will be able to see the complete country file generated by this loop and saved in folder countries in your working directory.

For the purpose of this training and for your convenience we have already run the complete loop for all 88 countries. The complete set of country files is available here in a zip file. Download the zip file and extract it into the countries subfolder of your working directory (make sure to overwrite all files that are already present in that folder). That is where the output files would have been saved if we ran the complete loop for all countries. Let’s pretend that that is what just happened ;-)

## ----for loop setup 1 ----

for (c in 1:nrow(countries)){

  this_countryiso3 <- countries$iso3[c]
  this_country <- countries$country[c]
  cat( "------ ",this_countryiso3 ," ------\n");flush.console()

  ntms <- ntms_all[ntms_all$reporter==this_country, ]

## ----for loop setup 2 ----
  ntms.list <- split(ntms, seq(nrow(ntms)))

## ----for loop setup 3 ----

  # 1. loop through each row of NTM data in the ntms.list
  for (i in 1:nrow(ntms)){

    # 2. split the string stored in subitem "HS" of the ith item into a vector or separate character strings stored in object "x"
    x <- list(strsplit(ntms.list[[i]]$HS, ",")[[1]])[[1]]

    # 3. overwrite x with itself removing potentially empty strings
    x <- x[!is.na(x)]

    # 4. truncate any HS codes longer than 6-digit, and overwrite x as a numeric vector removing any duplicating strings
    x <- unique(as.numeric(substr(trimws(x), 1, 6)))

    # 5. test for presence of 4-digit codes and expand them to 6-digit using the 2012 HS version

    if (length(x[x>99&x<=9999])>0){
      x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
      x <- x[-which(x>99&x<=9999)]
      x <- c(x,x4)
    }

    # 6. test for presence of 2-digit codes and expand them to 6-digit using the 2012 HS version
    if (length(x[x<=99])>0){
      x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
      x <- x[-which(x<=99)]
      x <- c(x,x2)
    }

    # 7. remove all duplicates and write the resulting set of HS codes to subitem "HS" of the ith item as a list vector
    x <- unique(x)
    ntms.list[[i]]$HS <- list(x)
    cat( "------ ",i ," ------\n");flush.console()
  }

  # 8. remove redundant objects
  rm(x,x2,x4,i)

## ----for loop setup 4 ----

  df <- data.frame(measure=integer(),
                   matching=integer(),
                   nt_length = integer(),
                   matchhs6share = numeric(),
                   hs6s = character(),

                   matchNTM = logical(),
                   NTM = character(),

                   matchkw = character(),
                   matchkwlength = integer(),
                   matchkwshare = numeric(),

                   matchnkw = character(),
                   matchnkwlength = integer(),
                   matchnkwshare = numeric(),

                   stringsAsFactors=FALSE)


## ----for loop matching 1 ----

  for (n in 1:nrow(ntms)){
    nt <- ntms.list[[n]]$HS[[1]]

    this_mst <- ntms.list[[n]]$NTM      # NTM code of the nth measure from country c
    n_char_numeric <- as.numeric(nchar(this_mst))
    this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
    MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
    this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
    rm(n_char_numeric, this_level, MAST_codes_temp)

    description  <- ntms.list[[n]]$description
    description <- gsub("[\r\n]", "", description)
    description <- iconv(description, from = 'UTF-8', to = 'ASCII//TRANSLIT')
    description_caps <- description

    description <- tolower(description)

        ## ----for loop matching 2 ----
    for (m in 1:nrow(matching)){
      mt <- matching.list[[m]]$HS[[1]]
      ntn <- nt[nt %in% mt]
      matchhs6share <- length(ntn)/length(nt)*100

      ## ----for loop matching 3 ----

      matchNTM <- any(this_mst %in% matching.list[[m]]$NTM[[1]])
      NTM <- ntms.list[[n]]$NTM



      ## ----for loop matching 4 ----

      kw <- matching.list[[m]]$Keywords[[1]]

      kw_caps <- kw[grepl("!", kw)]
      kw_caps <- gsub("!","",kw_caps)
      kw   <- kw[!grepl("!", kw)]

      nkw <- matching.list[[m]]$neg_Keywords[[1]]
      nkw_caps <- nkw[grepl("!", nkw)]
      nkw_caps <- gsub("!","",nkw_caps)
      nkw   <- nkw[!grepl("!", nkw)]

      ## ----for loop matching 5 ----

      if(length(kw)==1&(is.na(kw[[1]]))){

        matchkw <- "NO KW INCL"
        matchkwlength <- 1
        matchkwshare <- 100

      } else
      {
        matchkw <- kw[stri_detect_fixed(description,kw)]
        matchkw_caps <- kw_caps[stri_detect_fixed(description_caps,kw_caps)]
        matchkw <- c(matchkw, matchkw_caps)
        matchkwshare <- length(matchkw)/length(kw)*100
        matchkwlength <- length(matchkw)
        if(length(matchkw)==0){matchkw <- ""}

      }

      ## ----for loop matching 6 ----

      matchnkwlength  <- 0
      matchnkwshare <- 0
      matchnkw <- ""

      if (length(nkw)>0&!(is.na(nkw[[1]]))){
        matchnkw <- nkw[stri_detect_fixed(description,nkw)]
        matchnkw_caps <- nkw_caps[stri_detect_fixed(description,nkw_caps)]
        matchnkw <- c(matchnkw, matchnkw_caps)
        matchnkwshare <- length(matchnkw)/length(nkw)*100
        matchnkwlength <- length(matchnkw)
        if(length(matchnkw)==0){matchnkw <- ""}

      }

      ## ----for loop matching 7 ----

      df[(nrow(df)+1),] <-
        c(
          n,
          m,
          length(ntn),
          matchhs6share,
          paste(ntn,collapse=";"),

          matchNTM,
          NTM,

          paste(matchkw,collapse=";"),
          matchkwlength,
          matchkwshare,

          paste(matchnkw,collapse=";"),
          matchnkwlength,
          matchnkwshare
        )

      cat( "------ c: ", this_countryiso3, ", n: ", n ,", m: ", m ," ------\n");

      flush.console()
    }
  }

## ----for loop matching 8 ----

  df$hs_match <- as.integer(df$matchhs6share > 0)
  df$kw_match <- as.integer(df$matchkwshare > 0)
  df$matchNTM <- as.integer(as.logical(df$matchNTM))
  df$nkw_match <- as.integer(df$matchnkwshare > 0)

  df$index <- df$hs_match + df$kw_match + df$matchNTM - df$nkw_match

  df <- df[df$index==3, ]

## ----for loop matching 9 ----

  df2 <- merge(df, matching[c("row", "Target")], by.x="matching", by.y="row")

  write.csv(df2, paste0("countries/", "detailed", this_countryiso3, ".csv"), row.names = FALSE, na="")

  rm(c, description, description_caps, kw, kw_caps,m, matchkw, matchkw_caps, matchkwlength, matchnkw, matchnkw_caps, matchnkwlength, matchNTM, mt, n, nkw, nkw_caps, nt, NTM, ntms, ntn, this_country, this_countryiso3,this_mst, matchhs6share, matchkwshare, matchnkwshare, matchHS, hs6s, ntms.list)

}

Quiz 13



Question 13.1: In healthSDG we want to replace targets with their numbers. What line of code below would achieve that?

A. healthSDG$Target < gsub("Target", "", healthSDG$Target)

# Answer A is incorrect. `<-` should be used instead of `<`.

B. healthSDG$Target <- gsub("Target", "", Target)

# Answer B is incorrect. Should use `healthSDG$Target` to reference column __Target__ of object `healthSDG`.

C. healthSDG$Target <- gsub("Target", , healthSDG$Target)

# Answer C is incorrect. Replacement character string should be indicated within quotation marks, even if it is empty.

D. healthSDG$Target <- gsub("Target", "", healthSDG$Target)

# Answer D is correct.



Question 13.2: We want to see if HS code 263628 is associated with any of the SDG targets. What line of code would achieve that?

A. grepl(263628, healthSDG$HS)

# Answer A is incorrect. Character string should be indicated within quotation marks. 

B. grepl("263628", healthSDG$HS)

# Answer B is correct. 

C. grepl("263628", "HS")

# Answer C is incorrect. `healthSDG$HS` should be used to reference column __HS__ of object `healthSDG`.

D. grepl("263628" %in% healthSDG$HS)

# Answer D is incorrect. Arguments of function `grepl` are indicated incorrectly.



Question 13.3: Now we want to see what unique SDG targets have HS code 263628 associated with them. Please write code and choose the correct answer below.

A. " 3.1" " 3.2" " 3.3" " 3.5" " 3.7" " 3.8" " 3.b"

# Answer A is incorrect. 

B. " 3.1" " 3.2" " 3.3" " 3.4" " 3.7" " 3.9" " 3.b"

# Answer B is incorrect. 

C. " 3.1" " 3.2" " 3.3" " 3.4" " 3.7" " 3.8" " 3.b"

# Answer C is correct.

# The correct code is below for reference.

D. " 3.1" " 3.2" " 3.3" " 3.4" " 3.7" " 3.8" " 3.a"

# Answer D is incorrect. 



Question 13.4: Use functions strsplit, trimws, length and operator %in% to check if any of the HS codes in row 20 of healthSDG are present among HS codes recorded in row 1 of healthSDG. You can either nest the functions to make one string of code, or write separate lines of code for each of separate steps. What is the correct answer?

A. 0

# Answer A is incorrect. 

B. 36

# Answer B is incorrect. 

C. 85

# Answer C is correct.

# The correct code is below for reference.

D. 51

# Answer D is incorrect. 



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

# Question 1
healthSDG$Target <- gsub("Target", "", healthSDG$Target)

# Question 2
grepl("263628", healthSDG$HS)

# Question 3
unique(healthSDG$Target[grepl("263628", healthSDG$HS)])

# Question 4

length(trimws(strsplit(healthSDG$HS[[20]], ";")[[1]]) %in% trimws(strsplit(healthSDG$HS[[1]], ";")[[1]]))

3 Review the results of SDG-HS-NTM matching

3.1 Brief overview of matching results

Now we have a set of 88 .csv files with the results of SDG-HS-NTM matching for all of the countries present in TRAINS database of NTMs. The .csv files only contain data on NTMs that were matched to various lines of matching.list. The files have been stored in countries folder inside your working directory (we set it in section 1.1 of this module) . We can now do a number of things that will help us review the results of matching and to create illustrative plots.

However, first let’s take a look at matching results for all of China’s NTMs. Having run the full loop for all China we can see that the total of 2662 unique NTMs are a full match to various concordance strings of matching.list.

You will notice that the resulting data frame for China contains 6635 rows corresponding to each instance of matched NTM, while we analyzed 7357 NTMs of which only 2662 unique NTMs are in fact found to match SDG targets.

This discrepancy is due to a number of reasons:

  • some concordance strings in matching data frame included more than one target. For examples refer to a screenshot of the unedited imported matching data frame in section Import the SDG-HS-NTM concordance matrix into RStudio above. We later expanded matching data frame into long format by column Target. Therefore, when summarizing the results and preparing graphs we will use code to take this into account by removing duplicate rows in summary data frames.

  • some concordance strings contained alternative sets of HS codes yielded by research. These were expanded into long format manually in the source Excel files during the research stage. For example, see concordance strings for Target 12.4 in matching. Again, when summarizing the results and preparing plots we will use code to take this into account by removing duplicate rows in summary data frames.

  • some of the NTMs may match more than one target, because, as per their description, they intend to address more than one issue. This is especially true for regulation of trade in so called sensitive goods, the broad group of which may include: hazardous substances and waste, medicines, toxins, viruses, microorganisms, chemicals, wastes, narcotics, living modified organisms, alien species, radioactive substances, guns, weapons, explosives and their precursors, dual use technologies, tobacco, rough diamonds, endangered species of flora and fauna, and cultural and historical heritage items, etc. In this case simultaneous matching with a few concordance strings cannot be considered as duplication.

3.2 Create a data frame with .csv file names and ISO3 codes for countries

Let’s first review in detail what .csv files we have generated as a result of matching. We can use list.files function to retrieve a vector of .csv file names.

files_csv <- list.files("countries", pattern = "\\.csv$")

In some cases it may be useful to indicate relative path to a file or a folder, i.e. relative to the preset working directory. In this case the first line of code would look like files_csv <- list.files("./countries", pattern = "\\.csv$"), where ./ refers to the working directory and can be used by itself.

We then use data.frame function to turn it into a data frame containing a file name and a ISO3 country codes. The latter we derive from the names of the files by using substr function to extract the last three symbols before the file extension (all characters from 9th till 11th).

Notice, argument pattern = "\\.csv$" used with list.files function. It uses regular expression "\\.csv$" indicating, that we are selecting files what have ‘.csv’ at the end of their names. If the folder does not contain any files with extensions other than .csv, we can actually drop that argument.

Note , that when working with character strings in R, . is used to represent any character, except new line. Therefore, if we want to detect a period in a character string, we need to include \\ right before it. $ is used to indicate that the character combination is to be found at the end of the file names. For more details on basic regular expressions in R used for manipulating character strings refer, for example, to https://stringr.tidyverse.org/articles/regular-expressions.html

files_csv <- data.frame(files_csv, stringsAsFactors = FALSE)
files_csv$ISO3 <- substr(files_csv$files_csv, 9, 11)
names(files_csv) <-c("file_name", "ISO3")
head(files_csv)
##         file_name ISO3
## 1 detailedAFG.csv  AFG
## 2 detailedARE.csv  ARE
## 3 detailedARG.csv  ARG
## 4 detailedATG.csv  ATG
## 5 detailedAUS.csv  AUS
## 6 detailedBEN.csv  BEN

Note that since ISO3 code for EU is a two character string, the corresponding string in column 2 now contains a ‘.’, that we need to remove.

print(files_csv[29,2])
## [1] "EU."

We can use gsub function to find “.” in the resulting strings in ISO3 and to remove them.23

files_csv$ISO3 <- gsub("\\.", "", files_csv$ISO3)
print(files_csv[29,2])
## [1] "EU"

Quiz 14



Question 14.1: You want to retrieve all files and folders from your working directory into a data frame that has file names in column 1, which you want to name as __names_. Which line of code will achieve that?

A.
files <- data.frame(list.files(getwd), stringsAsFactors = FALSE)
names(files) <- "names"

# Answer A is incorrect. `getwd()` should be used instead of `getwd`.

B.
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE)
names(files) <- "names"

# Answer B is correct.

C.
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE) %>% names("names")

# Answer C is incorrect. Argument of function `names` should contain the name of the data frame object it is applied to. Code overwriting the name of the column is also missing.

D.
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE)
names <- "names"

# Answer D is incorrect. Argument of function `names` should contain the name of the data frame object it is applied to.



Question 14.2: In files data frame you want to keep only those rows that contain names of the files, and not the folders (file names have file extensions that have a dot in them), and then to check the number of the resulting rows. Which line of code will achieve that?

A.
files == files[grepl("\\.", files$names), ,drop=FALSE]
nrow(files)

# Answer A is correct.

B.
files <- files[grepl(".", files$names), ,drop=FALSE]
nrow(files)

# Answer B is incorrect. `.` in R represents any character except a newline. So if we need to indicate a literal `.`, it has to be preceeded by two escape signs like so `\\.`.

C.
files <- files[grepl("\\.", names), ,drop=FALSE];
nrow(files)

# Answer C is incorrect. `files$names` should be used to reference column __names__ of object `files`.

D.
files <- files[grepl("\\.", files$names), ,drop=FALSE]
nrow()

# Answer D is incorrect. Argument specifying the name of a data frame object has to be indicated for function `nrow()`.



Question 14.3: To reduce code you want to retrieve the number of all files (and not folders) from your working directory in one code string. Which line of code will achieve that?

A. nrow(data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE))

# Answer A is not entirely correct.  

B. All

# Answer B is correct.

C. data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE) %>% nrow()

# Answer C is not entirely correct.  

D. data.frame(list.files("./", pattern="\\."), stringsAsFactors = FALSE) %>% nrow()

# Answer D is not entirely correct.  



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above.

# Question 1
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE); names(files) <- "names"

# Question 2
files <- files[grepl("\\.", files$names), ,drop=FALSE]; nrow(files)

# Question 3
nrow(data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE))
data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE) %>% nrow()
data.frame(list.files("./", pattern="\\."), stringsAsFactors = FALSE) %>% nrow()

3.3 Populate analysis results into ntms_all data frame

ntms_all data frame that we created earlier contains data on all of the NTMs present in TRAINS database of NTMs. We can now populate it with the data from the .csv files that we generated, so that the ntms_all contains data on all of the NTMs: those that were found to be linked to SDGs and those that were found not to have such linkage.

In ntms_all let’s now add a column numbering all the measures within each individual reporter country. Later, this will allow us merging ntms_all and the generated .csv files by both country and NTM number.

First, we use function unique to extract the list of all unique countries present in reporter column of ntms_all. And then we use for loop, which for each of the unique countries stored in unique_countries object adds measure column to ntms_all with measure indices from 1 up to a total number of NTMs of a given country stored in ntms_all.

unique_countries <- unique(ntms_all$reporter)
for (j in 1:length(unique_countries)){
  ntms_all[ntms_all$reporter==unique_countries[j],"measure"] <- 1:nrow(ntms_all[ntms_all$reporter==unique_countries[j],])
}

For further convenience we add an iso3 column to ntms_all by merging it with countries data frame. We use left_join function, which does the same thing as merge function that we used earlier. The first two arguments indicate the names of the objects to merge, and argument by=c() specifies the column names of the two data frames, by which both objects should be merged.

Notice, that both in left_join and in merge you can indicate a specific subset of a data frame to be merged. In our case, we use countries[,c("country", "iso3")] to indicate that we merge ntms_all with a subset of countries data frame that has only two columns: countries and iso3.

ntms_all <- left_join(ntms_all, countries[,c("country", "iso3")], by=c("reporter" = "country"))

Before we populate ntms_all with the results of SDG-HS-NTM matching, let’s add five more columns with empty cells as below.

ntms_all$matching <- NA
ntms_all$NTM_temp <- NA
ntms_all$Target <- NA
ntms_all$hs6s <- NA
ntms_all$matchkw <- NA

Below we include for loop what will populate ntms_all with data from the generated files, the list of which is stored in object files_csv.

For the sequence of all the generated .csv files, it does the following:

  1. Creates temp_file which imports the fth .csv file from the specified directory
  2. Overwrites temp_file with itself preserving only rows for which values in column matchnkwshare are 0, only keeping columns matching, measures, NTM, hs6s, matchkw, and Target
  3. Creates temporary object ntms_all_temp only containing NTMs for the fth country, and removing columns matching, NTM_temp, hs6s, matchkw and Target
  4. Clears ntms_all of all the rows for country f
  5. Joins ntms_all_temp with the temp_file by column measure of each
  6. Adds the rows of ntms_all_temp to the bottom of ntms_all by creating the necessary number of rows at the bottom of ntms_all and pasting ntms_all_temp data into them
  7. Removes redundant objects to keep the workspace clean
for (f in 1:nrow(files_csv)){
  temp_file <- read.csv(paste0("countries/", files_csv$file_name[f]), stringsAsFactors = FALSE)
  temp_file <- temp_file[temp_file$matchnkwshare==0, c("matching", "measure",  "NTM", "Target", "hs6s", "matchkw")]
  ntms_all_temp <- ntms_all[ntms_all$iso3==files_csv$ISO3[f], -c(which(names(ntms_all)%in%c("matching", "NTM_temp", "Target", "hs6s", "matchkw")))]
  ntms_all <- ntms_all[ntms_all$iso3!=files_csv$ISO3[f],]
  ntms_all_temp <- left_join(ntms_all_temp, temp_file, by="measure")

  nextrow <- nrow(ntms_all)+1
  ntms_all[nextrow:(nextrow+nrow(ntms_all_temp)-1),] <- ntms_all_temp
    rm(ntms_all_temp)

    # cat( "(f:)",f ,"\n");flush.console()
}

rm(f, nextrow, unique_countries,j, df, df2, MAST_codes, hs_codes, hs_codes12, matching.list)

write.csv(ntms_all, "ALL_NTMs_after_matching_191217.csv", row.names = FALSE, col.names = TRUE)
#ntms_all <- read.csv("ALL_NTMs_after_matching_191217.csv", stringsAsFactors = FALSE, header = TRUE)

Look below at the two lines of code extracted from the code chunk above, and specifically at nrow(ntms_all)+1 and nextrow:(nextrow+nrow(ntms_all_temp)-1). Note: Don’t paste these two lines of code into your R script, as they are already present in the code chunk above.

  nextrow <- nrow(ntms_all)+1
  ntms_all[nextrow:(nextrow+nrow(ntms_all_temp)-1),] <- ntms_all_temp

If you already went through the Training on using R for trade analysis you may remember that R can be used for arithmetic operations both with numbers, and objects that have numbers stored in them, and functions that generate numbers. This is exactly what is done here.

First, object nextrow is created that stores a sum of a number for the amount of rows in ntms_all data frame in a given cycle of for loop execution and 1. In the first run of the loop that number would be 61652+1=61653. We need +1, as we need to post the rows from ntms_all_temp into the next row created after the existing last row in ntms_all.

To set a range of numbers we use :, as in 1:10, which refers to a range of numbers from 1 through to 10. Likewise, in [nextrow:(nextrow+nrow(ntms_all_temp)-1),] we refer to all rows from a number stored in nextrow through to a number that is the sum of a number stored in nextrow plus the number of rows in ntms_all_temp at a given cycle of for loop execution minus 1. In the first run of the loop that would be 61653+285-1=61937. So nextrow:(nextrow+nrow(ntms_all_temp)-1) sets the range of rows from 61653 through to 61937 into which to paste the data.

Since these rows do not exist yet, referencing them and writing data into them creates them at the bottom of ntms_all.

Now ntms_all contains data on all of the NTMs in TRAINS database: those that were found to be matched to any of the list items of matching.list and those that were found not to have any matches.

Target column contains information on what SDG target a given NTM was linked to. Some of the Targets have more than one dedicated concordance strings in matching.list, so we have a very detailed presentation of data. We can see, which HS codes, NTM codes and keywords of a given NTM matched to what list item of matching.list.

We exported this data frame as an .csv file into the working directory using write.csv function, as per the code above. However, as we continue our data review and move on to creating graphs, we will need to remove duplicates as necessary.

Quiz 15



Question 15.1: In practice_matching data frame there is a total of 204 rows. The rows are assigned index of 0 or 1 in its first column. You want to create a new column index_number numbering all the measures within each of the individual categories. Which of the lines of code will achieve that?

A.

unique_index <- unique(practice_matching[[1]])

for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}

# Answer A is correct.

B.

unique_index <- unique(practice_matching[[1]])

for (j in 1:nrow(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}

# Answer B is incorrect. Function `length` should be used to set the sequence of `for loop` in this case, as object `unique_index` is a vector.

C.

unique_index <- unique(practice_matching[[1]])

for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}

# Answer C is incorrect. Need to indicate a range of numbers to be assigned to the specified rows, like so: `1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])}`.

D.

unique_index <- unique(practice_matching[[1]])

for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],index_number] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}

# Answer D is incorrect. Name of the column in square brackets has to be indicated within quotation marks. 



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the question above, and the object you will need for further quizzes and practice sessions.

# Question 1
unique_index <- unique(practice_matching[[1]])

for (j in 1:length(unique_index)){
   practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}

rm(unique_index)

3.4 NTMs distribution by SDG targets

To create a bar chart for distribution of NTMs addressing at least one SDG target, as the one shown below, we can use ntms_all to create a data frame ntms_all_Target, in which we remove columns matching, hs6s and matchkw and keep only the unique rows (i.e. remove duplicate rows).

ntms_all_Target <- ntms_all[, -which(names(ntms_all) %in% c("matching", "hs6s", "matchkw"))]
ntms_all_Target <- unique(ntms_all_Target)

We can now use functions from dplyr package to group NTMs by Target, remove row for which value in Target is not available, and then we can use barplot function to create a bar plot. Run ?barplot to see details on the purpose of various arguments of barplot function. In our case we specified such arguments as values to plot, title of the plot, color theme for the bars, source for bar labels, orientation of text in bar labels, size of text in bar labels and in label of axis y, text to label axis y. You can use more or less arguments to customize your bar chart.

Note, that when you are using a function from a particular package, it may sometimes be useful to specify that package in the code, like we did below by appending dplyr:: in front of functions count and group_by. It is important to do, when you have loaded more than one package that contain functions that share one name.

By_target <- ntms_all_Target %>% dplyr::group_by(Target) %>% dplyr::count() %>% filter(!is.na(Target))

barplot(By_target$n, main="NTMs distribution by SDG targets", col=rainbow(25),names.arg = By_target$Target, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Number of NTMs directly addressing SDGs", ylim=c(0,10000))

It would be more informative to put SDG targets in their order. For that we add a new column Number to By_target into which we extract the target numbers converted to numbers. We use function nchar to generate argument for the end symbol within function substr.

Notice that some of the Target numbers contain letters in them, such as 3.a. These cannot be converted to numeric format, and are left empty. We can replace NA with a number that would put the target in the correct row once we sort the data frame. In our case we use 3.91 to put this row behind Target 3.9. Then we use order function to order the data frame by column Number, and run the barplot function once again.

Now all SDG targets are in the right order and look nice on the bar plot. To make it more informative we can add a column with brief titles for the targets that reflect their sense, and use those as labels. We will do similar task for other graphs below.

If R functionality for making charts seems too confusing or limited, you can always use function write.table to copy the data to clipboard and then to paste it into excel file for further manipulation. We will do that for one of the charts below.

By_target$Number <- as.numeric(substr(By_target$Target, 8, nchar(By_target$Target)))
which <- which(By_target$Target=="Target 3.a")
By_target$Number[[which]] <- 3.91
By_target <- By_target[order(By_target$Number),]

barplot(By_target$n, main="NTMs distribution by SDG targets", col=rainbow(25),names.arg = By_target$Target, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Number of NTMs directly addressing SDGs", ylim=c(0,10000))

3.5 NTMs distribution by subject areas based on SDG targets

However, one needs to bear in mind that at the target level there is a lot of duplication as the result of the methodology used to generate the concordance strings. Some of the concordance strings in the original matrix are assigned to a group of targets within the same Goal. At the same time only certain targets within only 9 SDGs have a concordance strings dedicated to them, so it is good to see what specific objectives NTMs linked to SDGs aim to address.

In this connection, it may be more useful to group NTMs linked to SDG targets by their subject areas. Based on the methodology, we suggest using the groupings as per the table below.

Goal Subject area Rows in matching df
Goal 2 Food security 1-2
Goal 3 Medicines & medical technologies 3:21
Goal 3 Food safety 25
Goal 3 Good nutrition 26
Goal 3 Drugs & psychotropics 22
Goal 3 Alcohol consumption 27
Goal 3 Vehicle safety 28
Goal 3 Tobacco use 23, 24
Goal 3 Poison exposure 29
Goal 6 Water efficiency 30, 31
Goal 7 Energy efficiency 32
Goal 11 Air & noise pollution 33, 34
Goal 11 Cultural and historical heritage 35
Goal 12 Waste & hazardous chemicals 36:51
Goal 12 Sustainable processes and production methods (PPMs) 52:54
Goal 14 Illegal, unregulated and unreported fishing (IUU fishing) 55, 56
Goal 15 Illegal timber 57
Goal 15 Endangered species of flora and fauna 58
Goal 15 Living modified organisms (LMOs) 59
Goal 16 Arms and weapons 60:64
Goal 16 Conflict minerals 65, 66
Goal 16 Counterfeit goods 67

Thus, we add an empty column titled area to ntms_all and fill it with the subject area labels by using the data from matching column. Notice, that we add \n to wrap the longer labels.

ntms_all$area <- NA

ntms_all$area[ntms_all$matching %in% c(1,2)] <- "Food\nsecurity"
ntms_all$area[ntms_all$Target %in% c(3:21)] <- "Medicines &\ntechnologies"
ntms_all$area[ntms_all$matching %in% c(25)] <- "Food\nsafety"
ntms_all$area[ntms_all$matching %in% c(26)] <- "Good\nnutrition"
ntms_all$area[ntms_all$matching %in% c(28)] <- "Vehicle\nsafety"
ntms_all$area[ntms_all$matching %in% c(27)] <- "Alcohol\nconsumption"
ntms_all$area[ntms_all$matching %in% c(22)] <- "Narcotics &\npsychotropics"
ntms_all$area[ntms_all$matching %in% c(29)] <- "Poison\nexposure"
ntms_all$area[ntms_all$matching %in% c(23, 24)] <- "Tobacco\nuse"
ntms_all$area[ntms_all$matching %in% c(30, 31)] <- "Water\nefficiency"
ntms_all$area[ntms_all$matching %in% c(32)] <- "Energy\nefficiency"
ntms_all$area[ntms_all$matching %in% c(33, 34)] <- "Air & noise\npollution"
ntms_all$area[ntms_all$matching %in% c(35)] <- "Cultural\nheritage"
ntms_all$area[ntms_all$matching %in% c(36:51)] <- "Waste &\nchemicals"
ntms_all$area[ntms_all$matching %in% c(52:54)] <- "Sustainable\nPPMs"
ntms_all$area[ntms_all$matching %in% c(55:56)] <- "IUU fishing"
ntms_all$area[ntms_all$matching %in% c(57)] <- "Illegal\ntimber"
ntms_all$area[ntms_all$matching %in% c(58)] <- "Endangered\nspecies"
ntms_all$area[ntms_all$matching %in% c(59)] <- "LMOs"
ntms_all$area[ntms_all$matching %in% c(60:64)] <- "Arms &\nweapons"
ntms_all$area[ntms_all$matching %in% c(65:66)] <- "Conflict\nminerals"
ntms_all$area[ntms_all$matching %in% c(67)] <- "Counterfeit\ngoods"

We also add a column Goal in which we generate information about the Goal number a given NTM is related to by using the data in column Target. We first use substr to extract 8th and 9th symbol from strings stored in Target. We then use gsub to find “.” in the resulting strings in Goal and to remove them.24 This way we get just the Goal number. Then in a separate column Goaln we paste the numbers from Goal converted to numbers. And in Goal we use paste0 to paste together “Goal” and the goal number. Data in Goaln will be uses later for sorting the rows, and data in Goal will be used for labels.

ntms_all$Goal <- substr(ntms_all$Target, 8,9)
ntms_all$Goal <- gsub("\\.", "", ntms_all$Goal)
ntms_all$Goaln <- as.numeric(ntms_all$Goal)
ntms_all$Goal <- paste0("Goal ", ntms_all$Goal)

Now we create a data frame object named ntms_all_area based on ntms_all in which we remove columns matching, hs6s, matchkw, and Target, and then we use unique to only keep the unique rows.

ntms_all_area <- ntms_all[, -which(names(ntms_all) %in% c("matching", "hs6s", "matchkw", "Target"))]
ntms_all_area <- unique(ntms_all_area)

We can now use functions from dplyr package to group NTMs by Target, remove row for which value in Target is not available. We then sort the rows by column Goaln and use barplot function to create a bar plot.

By_area <- ntms_all_area %>% dplyr::group_by(area, Goal, Goaln) %>% dplyr::count() %>% dplyr::filter(!is.na(area))
By_area <- By_area[order(By_area$Goaln),]

barplot(By_area$n, main="Distribution of NTMs matching SDG Targets by subject areas", col="sky blue",names.arg = By_area$area, las=2, cex.names= 0.8, cex.axis = 0.8, xlim = c(0,24), width=1, ylab = "Number of NTMs directly addressing SDGs", ylim = c(0, 10000))

Note, you can save a graph as a new R object, or export a graph as a .png file. After you ran all the code in the chunk below, check your working directory for the .png files with the name “plotbyarea.png”.

# saving bar plot as an object
plot_byarea <-barplot(By_area$n, main="Distribution of NTMs matching SDG Targets by subject areas", col="sky blue",names.arg = By_area$area, las=2, cex.names= 0.8, cex.axis = 0.8, xlim = c(0,24), width=1, ylab = "Number of NTMs directly addressing SDGs", ylim = c(0, 10000))

# export bar plot as a .png file
png(filename="plotbyarea.png")
barplot(By_area$n, main="Distribution of NTMs matching SDG Targets by subject areas", col="sky blue",names.arg = By_area$area, las=2, cex.names= 0.8, cex.axis = 0.8, xlim = c(0,24), width=1, ylab = "Number of NTMs directly addressing SDGs", ylim = c(0, 10000))
dev.off()

We can also use write.table function, as below, to copy to clipboard and then paste the data from By_area data frame into Excel where we can create even nicer graph, which would also label SDGs to which the plotted subject areas are related. See the resulting graph below.

 write.table(head(By_area, nrow(By_area)), "clipboard", sep="\t", row.names=FALSE)
Distribution of NTMs matching SDG Targets by subject areas.

Distribution of NTMs matching SDG Targets by subject areas.

Before we proceed further we remove some redundant objects.

rm(By_area, By_target, files_csv, matching, ntms_all_area, ntms_all_Target)

3.6 NTMs distribution by Goals

Now let’s summarize the NTMs linked to SDGs by Goal rather than by Target. We create a data frame object named ntms_all_Goal based on ntms_all in which we remove columns matching, hs6s, matchkw, area and Target, and then we use unique to only keep the unique rows. We can now use functions from dplyr package to group NTMs by Goal, keeping column Goaln as well, remove row for which value in Goaln is “NA”, and arrange the rows by Goaln. And then we use barplot function to create a bar plot.

ntms_all_Goal <- ntms_all[, -which(names(ntms_all) %in% c("matching", "hs6s", "matchkw", "Target", "area"))]
ntms_all_Goal <- unique(ntms_all_Goal)

By_goal <- ntms_all_Goal %>% dplyr::group_by(Goal, Goaln) %>% dplyr::count() %>% filter((!is.na(Goaln))) %>% arrange(Goaln)

barplot(By_goal$n, main="Distribution of NTMs by Goals", col="sky blue",names.arg = By_goal$Goal, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Number of NTMs directly addressing SDGs", ylim=c(0,12000))

To make the bar chart more informative we can add a column with titles of the Goals, and use those as bar labels. Notice, that we add \n to wrap the longer labels.

SDG_titles <- c("ZERO\nHUNGER", "GOOD HEALTH\n& WELL-BEING", "CLEAN WATER\n&SANITATION", "CLEAN &\nAFFORDABLE\nENERGY", "SUSTAINABLE\nCITIES &\nCOMMUNITIES", "RESPONSIBLE\nCONSUMPTION\n& PRODUCTION", "LIFE UNDER\nWATER", "LIFE ON\nLAND", "PIECE,\nJUSTICE &\nSTRONG\nINSTITUTIONS")
By_goal$Title <- SDG_titles

barplot(By_goal$n, main="Distribution of NTMs by Goals", ylab = "Number of NTMs directly addressing SDGs", col="sky blue",names.arg = By_goal$Title, las=2, cex.names= 0.7, cex.axis = 0.8, ylim=c(0,12000))

Quiz 16



Question 16.1: You want to create new object ntmsall_hi by extracting from ntms_all only rows for economies that are marked as high income economies in countries_inc. Which code will achieve that?

A. All

# Answer A is correct.

B.
hi_c <- countries_inc[countries_inc$high_income==1, 1:2]
ntmsall_hi <- ntms_all[ntms_all$reporter %in% hi_c$country,]

# Answer B is not entirely correct.  

C. ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income!=0],]

# Answer C is not entirely correct.  

D. ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income==1],]

# Answer D is not entirely correct.  



Question 16.2: Now you want to create object ntmsall_hi_goal that would copy ntmsall_hi excluding columns “matching”, “hs6s”, “matchkw”, “Target”, “area” and then keep only the unique rows.

A.

ntmsall_hi_goal <- ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area"))] %>% ntmsall_hi_goal <- unique(ntmsall_hi_goal)

# Answer A is incorrect. Incorrect use of pipe operator `%>%`.

B.

ntmsall_hi_goal <- unique(ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area")])

# Answer B is incorrect. Closing bracket for function `which` is missing within square brackets.

C.

ntmsall_hi_goal <- ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area"))] %>% unique()

# Answer C is correct.

D. All

# Answer D is incorrect. Not all answers are correct.



Question 16.3: You now want to use dplyr functions to overwrite ntmsall_hi_goal by consecutively grouping data into rows with the same value in columns Goal and Goaln, counting the number of rows with the same values, filtering out the row that has NA value in column Goaln and then arranging the rows in ascending order by column Goaln. Is the code below complete?

Code:

ntmsall_hi_goal %>% dplyr::group_by(Goal, Goaln) %>% dplyr::count() %>% filter((!is.na(Goaln))) %>% arrange(Goaln)

A. Yes, it is complete.

# Answer A is incorrect.

B. It is missing a bracket.

# Answer B is incorrect.

C. It does not generate or overwrite any object.

# Answer C is correct.

D. Argument for filter has a mistake.

# Answer D is incorrect.



Question 16.4: You used the code below to create a bar plot of distribution of NTMs imposed by high income economies and directly addressing SDGs by Goals. What argument in barplot function you need to adjust to shorten y axis?

Code:

barplot(hi_by_goal$n, main="Distribution of NTMs by Goals in high income economies", ylab = "Number of NTMs directly addressing SDGs", col="sky blue",names.arg = hi_by_goal$Goal, las=2, cex.names= 0.7, cex.axis = 0.8, ylim=c(0,12000))

A. cex.axis

# Answer A is incorrect. `cex.axis` regulates size of the numeric axis labels.

B. ylab

# Answer B is incorrect. `ylab` specifies label for y axis.

C. ylim

# Answer C is correct.

D. cex.names

# Answer D is incorrect. `cex.names` regulates size of bar labels. 



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

# Question 1

  #1
  hi_c <- countries_inc[countries_inc$high_income==1, 1:2]; ntmsall_hi <- ntms_all[ntms_all$reporter %in% hi_c$country,]

  #2
  ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income!=0],]

  #3
  ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income==1],]

# Question 2
  #3
  ntmsall_hi_goal <- ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area"))] %>% unique()

# Question 3

hi_by_goal <- ntmsall_hi_goal %>% dplyr::group_by(Goal, Goaln) %>% dplyr::count() %>% filter((!is.na(Goaln))) %>% arrange(Goaln)

# Question 4

barplot(hi_by_goal$n, main="Distribution of NTMs by Goals in high income economies", ylab = "Number of NTMs directly addressing SDGs", col="sky blue",names.arg = hi_by_goal$Goal, las=2, cex.names= 0.7, cex.axis = 0.8, ylim=c(0,5000))

3.7 NTMs directly addressing SDGs by groups of NTM types

Another way to summarize NTMs that directly address SDGs is by their types as per the taxonomy laid out in the International Classification of Non-tariff Measures (version of 2012; for details see MAST_all.csv that we utilized earlier). There are the following distinct groups of NTMs:

Broad groups NTM types NTM letter (ICNTM Chapter)
Technical SPS A
Technical TBT B
Technical Pre-shipment inspection C
Non-technical Quantity control E
Non-technical Contingent trade protective D
Non-technical Price control F
Non-technical Other G,H,I,J,K,L,M,N,O
Export All export P

Using ntms_all_Goal data frame we create a new one ntms_Goal in which we remove all rows, for which data in column Goaln is not available, i.e. we remove all rows for NTMs that were not found to be linked to SDGs. We then use substr function to create column NTM_letter which only contains the first letter from NTM_temp. This is the letter that indicates the chapter of ICNTM to which a given NTM belongs.

With the following lines of code we create two additional columns broad and NTMtype to which we add strings describing NTM types and broad NTM groups based on the chapter letter in column NTM_letter. Notice, that we add \n to wrap the longer labels.

ntms_Goal <- ntms_all_Goal[!is.na(ntms_all_Goal$Goaln),]

ntms_Goal$NTM_letter <- substr(ntms_Goal$NTM_temp, start = 1, stop = 1)

ntms_Goal$broad <- NA
ntms_Goal$broad[ntms_Goal$NTM_letter %in% c("A", "B", "C")] <- "Technical"
ntms_Goal$broad[ntms_Goal$NTM_letter %in% c("P")] <- "Export"
ntms_Goal$broad[!ntms_Goal$NTM_letter %in% c("A", "B", "C", "P")] <- "Non-technical"
ntms_Goal$NTMtype <- NA
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("A")] <- "SPS"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("B")] <- "TBT"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("C")] <- "Pre-shipment\ninspection"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("E")] <- "Quantity\ncontrol"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("D")] <- "Contingent\ntrade\nprotective"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("F")] <- "Price\ncontrol"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("G","H","I","J","K","L","M","N","O")] <- "Other"
ntms_Goal$NTM_letter[ntms_Goal$NTMtype=="Other"] <- "G"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("P")] <- "All export"

We can now use functions from dplyr package to group NTMs by NTMtype, keeping columns NTM_letter and broad, and arrange the rows by NTM_letter. We then add column share in which we store the results of calculating a share of a given type of NTMs within the total types of NTMs linked to SDGs. And then we use barplot function to create a bar plot.

By_ntm <- ntms_Goal %>% group_by(NTMtype, NTM_letter, broad) %>% count() %>% arrange(NTM_letter)
By_ntm$share <- By_ntm$n/sum(By_ntm$n)*100
barplot(By_ntm$share, main="NTMs directly addressing SDGs by NTM types", col=c("red", "orange", "yellow", "green", "blue", "purple" ),names.arg = By_ntm$NTMtype, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Share of NTMs directly addressing SDGs, %", ylim=c(0,60))

We can also use write.table function, as below, to copy and paste the data from By_ntm data frame into Excel where we can create even nicer graph, which would also label broad NTM groups to which the plotted NTM types belong. See the resulting graph below.

write.table(head(By_ntm, nrow(By_ntm)), "clipboard", sep="\t", row.names=FALSE)
NTMs directly addressing SDGs by NTM types.

NTMs directly addressing SDGs by NTM types.

Quiz 17



Question 17.1: Refer to the code provided in this section and write code that will transform data in object ntmsall_hi_goal generated earlier by removing all rows that have NA value in column Goaln, add column NTM_letter that will contain NTM letter extracted from values in column NTM_temp, add column broad that will contain broad NTM groups, and add column NTMtype that will contain NTM types, the latter two both based on the values in column NTM_letter, and save it into object ntms_hi_goal.

Then use code to check the number of NTMs that are marked as SPS and TBT in column NTMtype. What is that number?

A. 9416

# Answer A is incorrect.

B. 10560

# Answer B is incorrect.

C. 9832

# Answer C is incorrect.

D. 10959

# Answer D is correct.

# The correct code is below for reference.



Question 17.2: Refer to the code provided in this section and write code that will transform data in object ntms_hi_goal and generate a bar plot for distribution of NTMs imposed by high income economies and directly addressing SDGs by NTM types.

What is the share of “All export” measures?

A. Approx. 18

# Answer A is incorrect.

B. Approx. 15

# Answer B is incorrect.

C. Approx. 10

# Answer C is correct.

# The correct code is below for reference.

D. Approx. 5

# Answer D is incorrect.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.

# Question 1
ntms_hi_goal <- ntmsall_hi_goal[!is.na(ntmsall_hi_goal$Goaln),]

ntms_hi_goal$NTM_letter <- substr(ntms_hi_goal$NTM_temp, start = 1, stop = 1)

ntms_hi_goal$broad <- NA
ntms_hi_goal$broad[ntms_hi_goal$NTM_letter %in% c("A", "B", "C")] <- "Technical"
ntms_hi_goal$broad[ntms_hi_goal$NTM_letter %in% c("P")] <- "Export"
ntms_hi_goal$broad[!ntms_hi_goal$NTM_letter %in% c("A", "B", "C", "P")] <- "Non-technical"
ntms_hi_goal$NTMtype <- NA
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("A")] <- "SPS"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("B")] <- "TBT"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("C")] <- "Pre-shipment\ninspection"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("E")] <- "Quantity\ncontrol"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("D")] <- "Contingent\ntrade\nprotective"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("F")] <- "Price\ncontrol"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("G","H","I","J","K","L","M","N","O")] <- "Other"
ntms_hi_goal$NTM_letter[ntms_hi_goal$NTMtype=="Other"] <- "G"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("P")] <- "All export"

nrow(ntms_hi_goal[ntms_hi_goal$NTMtype=="SPS"|
               ntms_hi_goal$NTMtype=="TBT",])

# Question 2

hi_ntm <- ntms_hi_goal %>% group_by(NTMtype, NTM_letter, broad) %>% count() %>% arrange(NTM_letter)
hi_ntm$share <- hi_ntm$n/sum(hi_ntm$n)*100
hi_ntm_graph <- barplot(hi_ntm$share, main="NTMs directly addressing SDGs by NTM types", col=c("red", "orange", "yellow", "green", "blue", "purple" ),names.arg = hi_ntm$NTMtype, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Share of NTMs directly addressing SDGs, %", ylim=c(0,60))

hi_ntm[hi_ntm$NTMtype=="All export", "share"][[1]]

3.8 Shares of NTMs directly addressing SDGs within all of NTMs, by country and in comparison with regional and global averages

Currently, we have data on NTM linkages to SDGs for 88 countries. We can calculate shares of NTMs directly addressing SDGs for each country and compare those to regional (ESCAP) and global averages.

For that we will reuse ntms_Goal and countries data frames.

First, we create a new By_country data frame by using functions from dplyr package to group NTMs by reporter country, and we use as.numeric to make sure that values in columns n and share are numeric.

And then we use the following lines of code and a for loop to create column share and populate it with a calculated value for the share of NTMs directly addressing SDGs within a total of all NTMs in a given country. Notice, that we use ntms_all data frame and functions unique and length to get the value for the total number of NTMs for a given country present in By_cGoal.

By_country <- ntms_Goal  %>% group_by(reporter) %>% count()
By_country$n <- as.numeric(By_country$n)
By_country$share <- 0

for(i in 1:nrow(By_country)){
   By_country$share[[i]] <- round(By_country$n[[i]]/length(unique(ntms_all$measure[ntms_all$reporter==By_country$reporter[[i]]]))*100, digits = 2)
  }

In the lines of code below we calculate values for number of NTMs addressing SDGs globally and in ESCAP region, for total number of NTMs globally and in ESCAP region, and the shares of NTMs addressing SDGs in the worlds and in ESCAP region, and then populate the relevant values into By_cGoal.

Notice that we use merge function and refer to columns country and escap of ‘countries’ data frame to extract values for countries in ESCAP region only.

We then use write.table function, as below, to copy and paste the data from By_country data frame into Excel where we can create a graph, which plots bars for each individual country and lines for regional and global averages. See the resulting graph below.

# World
sdgtotal <- sum(By_country$n)
total <- as.numeric(nrow(unique(ntms_all[,c("reporter", "measure")])))
share <- round(sdgtotal/total*100, digits = 2)
By_country[(nrow(By_country)+1),] <- c("World",
                                   sdgtotal,
                                   share)

# ESCAP
sdgtotal <- merge(By_country, countries[countries$escap==1, c("country", "escap")], by.x = "reporter", by.y = "country")
sdgtotal <- sum(as.numeric(sdgtotal$n))

total <- merge(ntms_all, countries[countries$escap==1, c("country", "escap")], by.x = "reporter", by.y = "country")
total <- as.numeric(nrow(unique(total[,c("reporter", "measure")])))
share <- round(sdgtotal/total*100, digits = 2)
By_country[(nrow(By_country)+1),] <- c("ESCAP",
                                   sdgtotal,
                                   share)

write.table(head(By_country, nrow(By_country)), "clipboard", sep="\t", row.names=FALSE)
Shares of NTMs directly addressing SDGs by country vs regional and global averages. Note that not all 88 countries fit in this graph. You can select a shorter list of countries to be shown in the graph or try some other pragh type

Shares of NTMs directly addressing SDGs by country vs regional and global averages. Note that not all 88 countries fit in this graph. You can select a shorter list of countries to be shown in the graph or try some other pragh type

Practice arranging data and creating bar plots



Try using data frames ntms_all_Goal and countries_inc and code to prepare the NTM data and then to create a bar plot showing the shares of NTMs directly addressing SDGs, which are imposed by countries that belong to different income groups and the EU.

Once you finish your attempt, refer to the code below that does just that. Look through it. Do you understand what each line of code does?

#

ntms_all_Goal <- merge(ntms_all_Goal, countries_inc, by.x="reporter", by.y="country", all.x = TRUE)

ntms_all_Goal$income_level[ntms_all_Goal$high_income==1] <- "High income"
ntms_all_Goal$income_level[ntms_all_Goal$low_income==1] <- "Low income"
ntms_all_Goal$income_level[ntms_all_Goal$upper_middle_income==1] <- "Upper middle income"
ntms_all_Goal$income_level[ntms_all_Goal$lower_middle_income==1] <- "Lower middle income"
ntms_all_Goal$income_level[ntms_all_Goal$reporter=="European Union"] <- "EU"
# any(is.na(ntms_all_Goal$income_level))

by_inc_all <- unique(ntms_all_Goal[, c("reporter", "measure", "income_level")]) %>% dplyr::group_by(income_level) %>% count(); names(by_inc_all) <- c("inc", "total")
by_inc_sdg <- ntms_all_Goal %>% filter(is.na(ntms_all_Goal$Goaln)) %>% dplyr::group_by(income_level) %>% count(); names(by_inc_sdg) <- c("inc", "sdg")

by_inc <- merge(by_inc_sdg, by_inc_all, by="inc")
by_inc$share <- round(by_inc$sdg/by_inc$total*100, digit=2)

barplot(by_inc$share, main="Share of NTMs directly addressing SDGs by country income groups", col=c("red", "orange", "yellow", "green", "blue", "purple" ),names.arg = by_inc$inc, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Share of NTMs directly addressing SDGs, %", ylim=c(0,100))

Quiz 18



Question 18.1: Refer to object ntms_all_Goal You want to see what unique SDGs are associated with HS code 841810 (Refrigerators and freezers; etc.) and NTM code B11 (Prohibition for TBT reasons). Write code and choose the correct answer.

A. “Goal 12” “Goal 11” “Goal 9” “Goal 16” “Goal 7”

# Answer A is incorrect.

B. “Goal 12” “Goal 11” “Goal 3” “Goal 16” “Goal 7”

# Answer B is correct.

# The correct code is below for reference.

C. “Goal 12” “Goal 11” “Goal 6” “Goal 16” “Goal 7”

# Answer C is incorrect.

D. “Goal 12” “Goal 11” “Goal 15” “Goal 16” “Goal 7”

# Answer D is incorrect.



Question 18.2: Now you want to get the descriptions of all unique measures that are associated with HS code 841810 (Refrigerators and freezers; etc.), NTM code B11 (Prohibition for TBT reasons) and Goal 12 (Responsible consumption and production). Write the code that will achieve that. How many are such unique measures?

A. 28

# Answer A is incorrect.

B. 29

# Answer B is incorrect.

C. 30

# Answer C is correct.

# The correct code is below for reference.

D. 31

# Answer D is incorrect.



Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training.

# Question 1
unique(ntms_all_Goal$Goal[!is.na(ntms_all_Goal$Goaln)&
                            grepl("B11", ntms_all_Goal$NTM_temp)&
                            grepl("841810", ntms_all_Goal$HS)])

# Question 2
unique(ntms_all_Goal$description[!is.na(ntms_all_Goal$Goaln)&
                            grepl("B11", ntms_all_Goal$NTM_temp)&
                            grepl("841810", ntms_all_Goal$HS)&
                              grepl("Goal 12", ntms_all_Goal$Goal)])

Note: When working with the HS codes in the resulting summary, note that in entries that have only one HS code listed, the initial zero is omitted. So it is necessary to use code to correct those.

3.9 Shares of NTMs directly addressing SDGs by Goal among NTMs addressing SDGs for selected countries in comparison with regional and global averages

We may want to see in more detail the shares of NTMs addressing SDGs by Goal in some selected countries as opposed to the global or regional averages.

For that we again reuse ntms_Goal and countries data frames. First, we create a new By_cGoal data frame by using functions from dplyr package to group NTMs by reporter country and Goal, while preserving column Goaln as well, and we use arrange to sort the resulting data frame by name of the reporting countries and goal numbers. We then only extract the data for the countries, that we want to review in more details, for example, China, in our case.

To extract the relevant values for World, we again use ntms_Goal and dplyr package functions to create By_cGoal_WORLD data frame by grouping NTMs by Goal, while keeping column Goaln as well, and by arranging the resulting data frame by Goaln. We also add an additional column reporter filled with string “World”.

To extract values for countries in ESCAP region, we create By_cGoal_ESCAP by merging ntms_Goal with the subset of countries data frame, defined as countries[countries$escap==1, c("country", "escap")] by columns reporter and country. And then we use dplyr package functions to create By_cGoal_ESCAP data frame by grouping NTMs by Goal, while keeping column Goaln as well, and by arranging the resulting data frame by Goaln. We also add an additional column reporter filled with string “ESCAP”.

Then we use rbind function to combine all three data frames into one By_cGoal_summary, to which we add two additional columns total and share.

To calculate the total number of NTMs linked to SDGs for the World we use sum function on By_cGoal_summary data frame to sum all values in column n associated with reporter “World”. We do the same for ESCAP and China. We then calculate the share of NTMs linked to individual SDGs within total number of NTMs linked to any SDGs per individual reporter.

Now to build the stacked bar chart as below, we use ggplot function from ggplot2 package. To build the chart properly we need to convert values in Goaln to factors, so these are treated as discrete numeric data.25 For more details on what each element of ggplot function is used for, please refer to ‘Data Visualization with ggplot2 Cheat Sheet’ at https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf.

By_cGoal <- ntms_Goal  %>% group_by(reporter, Goal, Goaln) %>% count() %>% arrange(reporter, Goaln)
By_cGoal <- By_cGoal[By_cGoal$reporter=="China", ]

By_cGoal_WORLD <- ntms_Goal %>% group_by(Goal, Goaln) %>% count() %>% arrange(Goaln)
By_cGoal_WORLD$reporter <- "World"

By_cGoal_ESCAP <-merge(ntms_Goal, countries[countries$escap==1, c("country", "escap")], by.x = "reporter", by.y = "country")
By_cGoal_ESCAP <- By_cGoal_ESCAP  %>% group_by(Goal, Goaln) %>% count() %>% arrange(Goaln)
By_cGoal_ESCAP$reporter <- "ESCAP"

By_cGoal_summary <- rbind(By_cGoal_WORLD, By_cGoal_ESCAP)
By_cGoal_summary <- rbind(By_cGoal_summary, By_cGoal)

By_cGoal_summary$total <- 0
By_cGoal_summary$share <- 0

By_cGoal_summary$total[By_cGoal_summary$reporter=="World"] <- sum(By_cGoal_summary$n[By_cGoal_summary$reporter=="World"])
By_cGoal_summary$total[By_cGoal_summary$reporter=="ESCAP"] <- sum(By_cGoal_summary$n[By_cGoal_summary$reporter=="ESCAP"])
By_cGoal_summary$total[By_cGoal_summary$reporter=="China"] <- sum(By_cGoal_summary$n[By_cGoal_summary$reporter=="China"])

By_cGoal_summary$share <- By_cGoal_summary$n/By_cGoal_summary$total*100

By_cGoal_summary$Goaln <- as.factor(By_cGoal_summary$Goaln)

ggplot(By_cGoal_summary, aes(fill=Goaln, y=share, x=reporter, label=Goal)) +
    geom_bar(position="fill", stat="identity") +
  theme_bw(base_size = 14) +
  scale_fill_discrete(name = "Goals", labels = By_cGoal_summary$Goal)

Alternatively, you can use write.table function, as below, to copy and paste the data from By_cGoal_summary data frame into Excel, where you can conveniently create a similar stacked bar graph, or a grouped bar graph similar to the one that was used in ‘Asia-Pacific Trade and Investment Report 2019: Navigating Non-tariff Measures towards Sustainable Development’ on page 2226. See the resulting graph below.

write.table(head(By_cGoal_summary, nrow(By_cGoal_summary)), "clipboard", sep="\t", row.names=FALSE)
Distribution of NTMs that directly address SDGs, by Goal.

Distribution of NTMs that directly address SDGs, by Goal.

4 Final notes

Interpretation of and other ways to summarize data

Section 3 above shows different ways, in which you can summarize and graphically present data generated by the application of SDG-HS-NTM Concordance Matrix to the UNCTAD’s database of NTMs. Of course, these are not exhaustive.

It is also possible to extract products and product groups at 2-, 4- and 6-digit level that are most frequently addressed by NTMs that are associated with all or specific SDGs or their targets. It is possible to see, what NTM types addressing all or specific SDGs are applied to a product or a product group globally, regionally or by a country. It is possible to identify those NTM types that are most often used to address SDG-related issues.

It is also possible to calculate a few indicators commonly used to quantify the intensity of non-tariff measures specifically for SDG-related NTMs: coverage ratio, frequency index and prevalence score27; or an indicator that measures the similarity between NTM policies across member States and sectors: regulatory distance28. For detailed method of using R and the available NTM and trade data sets to calculate the above indicators for all NTMs, you can refer to Training on using R for trade analysis. You can apply this method to calculate these indicators for SDG-linked NTMs as well.

Flexibility of SDG-HS-NTM concordance matrix

The SDG-HS-NTM concordance matrix is based on the methodology jointly developed by ESCAP and UNCTAD and is described in ESCAP working paper available from here.

HS codes, NTM codes and keywords were defined based on literature, international treaties and data available from TRAINS. You can edit these by adding or removing rows, keywords, NTM and HS codes, and see how that changes the output. When adding new HS codes, NTM codes, or keywords, make sure to separate them with ; and to add opening and closing ! to all-caps abbreviations in keywords.

You may even add more concordance strings, if you are able to identify additional definitive linkages between SDGs and regulations imposed on traded goods. In any case, the SDG-HS-NTM Concordance Matrix can be a living and evolving tool that can be adjusted depending on the needs or on the emerging new knowledge.

Acknowledgements

This training was prepared under the overall supervision of Mia Mikic, Director, Trade, Investment and Innovation Division (TIID) of the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP), and Yann Duval, Chief of Trade Policy and Facilitation (TPFS), TIID, ESCAP. This module was developed by Alexey Kravchenko, Economic Affairs Officer, TPFS, TIID, ESCAP and Maria Semenova, Consultant, TPFS, TIID, ESCAP. The authors are grateful for the technical support provided by Hrisyana Doitchinova, and for the testing feedback received from Andrea Dalla Rosa.

Annex 1 - Downloading and installing R and RStudio

Windows Users

To Install R

  • Open an internet browser and go to www.r-project.org.
  • Click the “download R” link in the middle of the page under “Getting Started.”
  • Select a CRAN location (a mirror site) and click the corresponding link.
  • Click on the “Download R for Windows” link at the top of the page.
  • Click on the “install R for the first time” link at the top of the page.
  • Click “Download R for Windows” and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions.

Now that R is installed, you need to download and install RStudio.

To Install RStudio

  • Go to www.rstudio.com and click on the “Download RStudio” button.
  • Click on “Download RStudio Desktop.”
  • Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions.

To Install the SDSFoundations Package

  • Download SDSFoundations to your desktop (make sure it has the “.zip” extension).
  • Open RStudio.
  • Click on the Packages tab in the bottom right window.
  • Click “Install.”
  • Select install from “Package Archive File.”
  • Select the SDSFoundations package file from your desktop.
  • Click install. You are done! You can now delete the SDSpackage file from your desktop.

Mac Users

To Install R * Open an internet browser and go to www.r-project.org. * Click the “download R” link in the middle of the page under “Getting Started.” * Select a CRAN location (a mirror site) and click the corresponding link. * Click on the “Download R for (Mac) OS X” link at the top of the page. * Click on the file containing the latest version of R under “Files.” * Save the .pkg file, double-click it to open, and follow the installation instructions. * Now that R is installed, you need to download and install RStudio.

To Install RStudio * Go to www.rstudio.com and click on the “Download RStudio” button. * Click on “Download RStudio Desktop.” * Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.

To Install the SDSFoundations Package * Download SDSFoundations to your desktop (make sure it has the “.tgz” extension). * Open RStudio. * Click on the Packages tab in the bottom right window. * Click “Install.” * Select install from “Package Archive File.” * Select the SDSFoundations package file from your desktop. * Click install. You are done! You can now delete the SDSpackage file from your desktop.


  1. UNCTAD. International classification of non-tariff measures. 2012 version. Available at https://unctad.org/en/PublicationsLibrary/ditctab20122_en.pdf?user=46

  2. More information on the Sustainable Development Goals is available at https://sustainabledevelopment.un.org/?menu=1300

  3. Kravchenko, Alexey; Semenova, Maria; Lee, Seul and Duval, Yann (2019). Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific. Trade, Investment and Innovation Working Paper No. 04/19, ESCAP: Bangkok.

  4. Guidelines to collect data on official non-tariff measures are available at https://unctad.org/en/PublicationsLibrary/ditctab2014d4_en.pdf

  5. UNCTAD. International classification of non-tariff measures. 2012 version. Available at https://unctad.org/en/PublicationsLibrary/ditctab20122_en.pdf?user=46

  6. In 2019 a new version of ICNTM was launched, which preserved the overall structure of the classification, while introducing a few very notable changes and expanding taxonomy of NTMs under a few chapters.

  7. MAST team is comprised of Food and Agriculture Organization of the United Nations, International Monetary Fund, International Trade Centre, Organization for Economic Cooperation and Development, United Nations Conference on Trade and Development, United Nations Industrial Development Organization, World Bank, and World Trade Organization

  8. Asia-Pacific Trade and Investment Report 2019: Navigating Non-tariff Measures towards Sustainable Development, Chapter 1: Why non-tariff measures matter for sustainable development. Available at https://www.unescap.org/sites/default/files/APTIR2019_Chapter1.pdf

  9. Some of this data is regularly updated, and newer versions may become available from their publishers.

  10. Kravchenko, Alexey; Semenova, Maria; Lee, Seul and Duval, Yann (2019). Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific. Trade, Investment and Innovation Working Paper No. 04/19, ESCAP: Bangkok.

  11. The following color coding is used in the excel concordance matrix: (1) concordance strings that describe direct linkage and are clear enough to enable further quantitative analysis are not color-coded (in the excel SDG-HS-NTM Concordance Matrix they are also marked as “C” for “Clear” or “CwK” for “Clear with keywords”); (2) concordance strings that are too ambiguous, or for which their impact on NTMs is considered indirect, or for which not enough information is available, are color coded in grey (in the excel SDG-HS-NTM Concordance Matrix they are marked as “A” for “Ambiguous” and “IND” for “Indirect”).

  12. Notice, when we run head(matching) the first line of the output contains # A tibble: 6 x 8, stating that matching object is a tibble. Tibble is a type of data frames that was developed specifically for use in R to allow for better management of data. You can read about the characteristicts and capabilities of tibbles here. Essentially, tibbles and data frames are very similar, but respond somewhat differently to same types of manipulation, including subsetting. For example, if an object is a data frame code dataframe[8,8] generates a vector containing one value from the cell in row 8 and column 8 (a one-dimensional object). If the same object is a tibble, code tibble[8,8] generates a matrix with 1 row and 1 column (a two-dimensional object). So, if we need to extract the data from that cell as a vector, we need to coerce it to one dimension, which is done by adding [[1]] at the end of that code, e.g. tibble[8,8][[1]].

  13. Negative keywords are the words and word combinations that have to be absent in an analyzed NTM entry in the TRAINS database.

  14. Note that some countries that are marked as having no NTMs belong to an economic integration grouping, such as the European Union, so NTM data are available for the grouping rather than for the individual countries. In some other cases countries are marked as having zero NTMs due to the lack of NTM data for those countries in TRAINS database.

  15. 3-digit codes are, in fact, 4-digit codes, in which first zeros got omitted when the codes were converted from characters to numeric values. These are the codes from HS chapters 01 through 09.

  16. Negation can be signaled using “-” or “!”.

  17. 1-digit codes are, in fact, 2-digit codes, in which first zeros got omitted when the codes were converted from characters to numeric values. These are the codes from HS chapters 01 through 09.

  18. Kravchenko, Alexey; Semenova, Maria; Lee, Seul and Duval, Yann (2019). Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific. Trade, Investment and Innovation Working Paper No. 04/19, ESCAP: Bangkok.

  19. This is relevant for such countries as Argentina, Benin, Plurinational State of Bolivia, Brazil, Burkina Faso, Chile, Colombia, Costa Rica, Côte d’Ivoire, Cuba, Ecuador, El Salvador, Guatemala, Guinea, Honduras, Mali, Mexico, Nicaragua, Niger, Panama, Paraguay, Peru, Senegal, Togo, Uruguay, Bolivarian Republic of Venezuela.

  20. The United Nations Special Programme for the Economies of Central Asia (SPECA) was launched in 1998 to strengthen subregional cooperation in Central Asia and its integration into the world economy. The countries of SPECA are Afghanistan, Azerbaijan, Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan and Uzbekistan.

  21. As was explained earlier, Keywords and neg_Keywords that do not contain “!”, are have been formatted to lower case. Those that contain “!” were kept as all caps. This is done, as keyword matching with the data in the TRAINS database is generally case insensitive, with the exception of certain abbreviations that may match other words or their parts. Example: Letter combination “ODS”, which as an abbreviation that stands for “ozone depleting substances”, may be found in words “goods”, “rods”, “foods”, etc.

  22. Exclamation marks are used to signal words that are abbreviations. They are all caps and matching for them has to be case sensitive.

  23. When working with character strings in R, . is used to represent any character, except new line. Therefore, if we want to detect a period in a character string, we need to include \\ right before it. For more details on basic regular expressions in R used for manipulating character strings refer, for example, to https://stringr.tidyverse.org/articles/regular-expressions.html

  24. When working with character strings in R, . is used to represent any character, except new line. Therefore, if we want to detect a period in a character string, we need to include \\ right before it. For more details on basic regular expressions in R used for manipulating character strings refer, for example, to https://stringr.tidyverse.org/articles/regular-expressions.html

  25. If data is continuous, rather than discrete, R will create gradient color bars chart.

  26. Available at https://www.unescap.org/publications/APTIR2019

  27. UNCTAD. Non-tariff measures: Economic assessment and policy options for development. 2018. Available on page 92 here

  28. UNCTAD. Deep regional integration and non-tariff measures: A methodology for data analysis. Policy issues in international trade and commodities research study series No. 69. 2015. Available on page 2 and onward here