This training is titled R for linking non-tariff measures to the Sustainable Development Goals and is second in the two-module E-learning series on R for trade and trade policy analysis published by the Trade, Investment and Innovation Division of the Economic and Social Commission for Asia and the Pacific (TIID ESCAP).
Non-Tariff Measures (NTMs) are policy measures - other than ordinary customs tariffs - that can potentially have an economic effect on international trade in goods, changing quantities traded, or prices or both.1 By definition they are not inherently protectionist, as while they can impose additional costs on the movement of goods, they also purport to address specific and legitimate non-trade objectives, some of which are relevant to the achievement of thе long-standing internationally agreed upon development priorities, which were solidified in the Sustainable Development Goals (SDGs) in 2015.2
In an effort to understand and describe the linkages that exist between NTMs imposed on internationally traded goods and the SDGs, ESCAP and UNCTAD jointly developed an SDG-HS-NTM concordance matrix based on their methodology outlined in the ESCAP working paper ‘Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific’.3 The methodology and the excel file containing the concordance matrix are available here.
The matrix can be used to conduct analysis of global, regional or country-level data on non-tariff measures collected and stored in the UNCTAD’s global database on NTMs (TRAINS) available at https://trains.unctad.org/. The database contains information on more than 65,000 regulations imposed by over 100 countries. The data was collected following UNCTAD Guidelines to collect data on official non-tariff measures 2016 4 and classified in accordance with the International Classification of Non-Tariff Measures (ICNTM;5 version of 2012),6 developed by the Multi-Agency Support Team (MAST) Group.7
You can download an Excel file with a complete NTM data set dated as of May 2019 from here. An example of such analysis was conducted by ESCAP and its results were presented in Asia-Pacific Trade and Investment Report 20198 and in the accompanying country briefs for Armenia, Azerbaijan and Tajikistan available from here(see bottom of the webpage).
This training module describes how to use R to transform the SDG-HS-NTM concordance matrix into a tool to conduct such an analysis of NTM data from TRAINS database of NTMs.
In particular, it describes the following:
The material contains the R code, explanatory notes, preview of interim results and links to data sources. Complete code algorithm that you can open in RStudio and run with or without further editing is available in a separate script file (see Data section).
This training module also contains quizzes and practice sessions that intend to highlight a few useful things and to help some of the new knowledge to sink in better.
Note that this training module is part of an e-learning series and is provided together with the training module titled Training on using R for trade analysis, which is available in English and in Russian languages. Training on using R for trade analysis is provided together with a rated quiz, satisfactory passing of which entitles students to a certificate of completion. On the contrary, this module R for linking non-tariff measures to the Sustainable Development Goals does not have any rated quiz. So read on and practice at your own pace and for your own pleasure!
All necessary data are described in the table below.9 You can download them into the directory on your PC that you will use as your working directory in RStudio for this training module.
What | Training file name | Original source |
---|---|---|
UNCTAD’s TRAINS database on NTMs | UNCTAD_i-tip_report_ALL_Measures_x.xlsx | here |
SDG-HS-NTM concordance matrix | _SDG-HS-NTM-Concordance[v1.1 Sep 2019]0.xlsx | here |
UN Comtrade Commodity Classification | HSCodesAll.csv | here |
Table of concordance between different versions of the Harmonized Commodity Description and Coding System | HS_concordance_full.csv | here |
International Classification of non-tariff measures | MAST_all.csv | here |
ISO 3 country codes and binary data on whether a country has NTM data recorded in TRAINS and on whether it belongs to a certain grouping | countries_UNCTAD_iso3.csv | here |
Complete code algorithm | complete_matching_code.R | here |
Files with matching results per country | countries.zip | here |
First two are the main data files that are used to analyze non-tariff measures and their linkages to the SDGs. Open these two Excel files to familiarize yourself with the content and the setup of these two data sets.
UNCTAD’s TRAINS database contains the following information for each recorded NTM: imposing country, partner affected, NTM code (as per ICNTM 2012), measure description, affected HS codes, description of affected goods, source and national legal basis. In steps below, we will combine four columns titled measure description, description of affected goods, source and national legal basis into one column titled description.
SDG-HS-NTM concordance matrix contains the following information for each concordance string: SDG, SDG Targets, product group description, description of HS code list, list of HS codes, HS version, NTM code (ICNTM), list of keywords, list of negative keywords, and additional notes. You will see that some rows are colored grey and have 0 in column CODE 0/1. These are concordance strings that describe potential relationships that exists between regulating trade in certain goods and achievement of certain SDGs. Due to the setup of the TRAINS database of NTMs, these concordance strings cannot be used in the analysis below. In steps below we will filter these rows out. Filtering will leave us with the concordance strings that describe intended direct (and positive) impact on the achievement of SDGs.
A specific HS-NTM code pair was considered to have a direct linkage to an SDG Target if (1) it has a clearly stated SDG Target-related objective (supported by relevant keywords in the descriptive information present in the TRAINS database), or (2) the examined HS-NTM code combination is not likely to have any objective other than the one that is relevant to an SDG (e.g. trade in hazardous chemicals and waste, endangered species of flora and fauna, cultural heritage items, arms and weapons, etc.).10
The code algorithm, described in detail below, eventually allows to check each NTM entry in the TRAINS database for presence of simultaneous matches with at least one HS code, one NTM code and one keyword described in a given concordance string in SDG-HS-NTM concordance matrix.
The rest of the data frames are used to filter, clean, convert, reorganize and otherwise prepare the data in the two main data frames for subsequent analysis. Further details on these data sets are provided in sections below.
complete_matching_code.R contains the complete code algorithm, and countries.zip has a set of 88 files that contain results of matching HS and NTM codes to SDGs (more on these two in sections below).
R is a free software environment for statistical computing and graphics, which runs on a variety of platforms, while RStudio, that we use for the purpose of this training, is an open source software that provides a very convenient console for inputting and running the code, previewing the generated outputs and saving them in different formats.
Find download instructions for PC and Mac in Annex 1 below, or refer to Training on using R for trade analysis for more detailed instructions.
Now we will go through the code step by step, provide necessary explanations and show the interim results. Should you need more information on various packages and functions run help(package = "package_name")
for a given package or ?function_name
for a given function. Help information will be displayed in the bottom right corner of your RStudio window in tab Help.
The full code algorithm is available in “complete_matching_code.R” file (see Data section). You can open it in your RStudio or is a .txt file. You can edit it or run it as is, having only revised the first line of code setting the working directory (we will explain how to do that below). However, do not rush to use this file. First go through the steps of this training module and familiarize yourself with what each line of code intends to do.
Going through this training module may take 1-2 weeks depending on your learning pace and on your familiarity with R code. Doing the training module titled Training on using R for trade analysis first, should make going through this learning module easier and faster. Anyhow the process will take a few separate sessions.
To make sure that no progress is lost, follow the following recommendations:
When you start this training, open, name and save a new script in RStudio. As you go through the module it is necessary to paste and run all code from the code chunks in this module, both for training, and for quizzes and training sessions, in one script. You can use #
sign to insert comments to easily differentiate between training code and quiz code. Important: Code chunks are page-wide grey boxes that contain the code strings that are described in this module. Code in small grey boxes within the paragraphs of text are to be examined, but should not be copied into your R script. Page-wide white boxes with text preceded by ##
contain the output, that is the result of running the code string just above it - this is what you should see when you run the code in your RStudio.
Execution of any chunk of the training code, quiz code and some practice code is dependent on prior execution of all earlier chunks of code (including those that set the working directory and load all function packages and data files). So, as you progress through this training module, quizzes and practice sessions, make sure to copy all code into your R script, run it and save it regularly. To save the script press Ctrl+S
at any convenient time. To save all the objects with variables and data frames (the workspace) make sure to press “Save” in the pop-up window asking whether you want to save the workspace image, which appears when you close the RStudio window. Next time you open the session you can continue from where you stopped.
If your start getting errors after restarting your session, just run all the code from all previous study sessions, and then try to run the code from the new session once again. To launch execution of multiple lines of code at once, select them all, hit Ctrl+Enter
and wait for a few moments. To run one string of code place a cursor inside that string and press Ctrl+Enter
.
If you previously ran the code that had a mistake and that overwrote an existing object or its elements (such code string would contain <-
or =
) that are used in the following lines of code, running those following lines of code may result in output mistakes or error messages. In this case, after you revised the faulty code, it is better to retrace your steps to where the objects involved were originally created, and rerun that code and all code that follows. Alternatively, in some cases it may be feasible to rerun all code from the very beginning.
Warning message that may be shown, usually do not stop execution of code. They are intended to draw your attention to some characteristics of code execution. It still may be useful to attention to them.
Error messages state that something went wrong and stop execution of code. The text of the error message indicates what type of error that may be: mistake in the code, absence of certain data objects in the workspace, error in a link to a location on your PC, wrong working directory set in earlier steps, failure to install and/or load the function packages. Error message has to be reviewed and the issues resolved, before you can move forward. In some cases updating R and all function packages, or even uninstalling and reinstalling of R and RStudio, may help, as the software is constantly updated by the developers. At the time of publishing of this training module, the current code worked well for the tasks described.
Should you need more information on various packages and functions run help(package = “package_name”) for a given package or ?function_name for a given function. If you run into a function that was introduced earlier in a module, you can search in the module web-page by pressing Ctrl+S
and entering the name of the function. This way you will find its earlier mentioning and example if use to refresh your recollection.
Notice links to footnotes and external web-resources included throughout the training module. Those may contain additional short clarifications or provide additional useful reference material on NTMs, SDGs and their linkages, as well as on R language and functions.
As you go through the code and instructions below, you will be offered a few practice sessions as well as a few quizzes that intend to help you get better grasp of the workings of R and its code. You are not rated on these quizzes. They are only there for you to learn.
Start your RStudio, and open your new R script by selecting File/New file/R script or by pressing Ctrl+Shift+N
. Then save the new working file by pressing File/Save as…, entering the preferred file name and choosing the preferred directory on your computer.
Set your working directory from which your data frames will be imported and to which the resulting files will be exported by using function setwd
. This is a very important step, as it will allow you import files from this location without having to indicate the full path to this directory. And RStudio will save all output and backup files into that directory automatically.
Note for PC users: when inputting the path to any directory in RStudio, it is necessary to replace all backslashes \
with forward slashes /
, or you will get an error message and code will not run.
setwd("C:/path/to_your/working_directory")
You can check your working directory using getwd
.
getwd()
## [1] "C:/path/to_your/working_directory"
Download the data files listed in section Data above into your working directory.
Before proceeding further, it is necessary to load the following code packages that we will use later. Details on each package can be easily found online. Alternatively, use help(package = "package_name")
, to review package information in Help tab in the lower left corner of your RStudio window.
library("dplyr")
library("readxl")
library("reshape2")
library("stringr")
library("stringi")
library("tidyr")
library("ggplot2")
If you get an error message indicating that the required code package does not exist, try installing the package by running code install.packages("package_name")
first, and only then run library("package_name")
.
Warning messages are usually not a problem.
If you open the SDG-HS-NTM-Concordance[v1.1 Sep 2019]_0.xlsx file, you will see that the concordance matrix is contained in Sheet 2 titled “Concordance Matrix”. Hence, this is the sheet that needs to be imported by using function read_excel
. Note that to be able to further manipulate the data frame it is necessary to create an R object, rather than just import the data file. So below we use matching <-
, where matching
is the name of the new object, and the arrow <-
denotes the action of assignment (alternatively, you can use a single =
instead of <-
).
Also note that apart from the name of the file, which has to include file extension, the command also contains argument sheet=2
indicating sheet 2 of the imported excel file, argument col_names=TRUE
indicating that the first row is used as column names, and argument trim_ws=TRUE
indicating that all leading and trailing white spaces should be trimmed.
matching <- read_excel("SDG-HS-NTM-Concordance[v1.1 Sep 2019]_0.xlsx", sheet=2, col_names = TRUE, trim_ws = TRUE)
You can preview the elements of the imported data frame by doing the following:
names
to preview the names of the columns or function head
to preview the first few rows (6 rows by default, which you can change by adding an alternative value for the relevant argument).Note: In the output after calling function head
under the column names you can see text within <>
. This indicates class of data stored in each column. <dbl>
stands for numeric data, and <chr>
stands for character strings. Other possible options are
names(matching)
## [1] "CODE 0/1 (0: A, IND; 1: C,CwK)"
## [2] "SDG"
## [3] "Target"
## [4] "Product description"
## [5] "List description"
## [6] "HS"
## [7] "HS_version"
## [8] "NTM"
## [9] "Keywords"
## [10] "neg_Keywords"
## [11] "A/C/CwK/IND"
## [12] "Notes"
head(matching,10)
## # A tibble: 10 x 12
## `CODE 0/1 ~ SDG Target `Product descri~ `List descripti~ HS
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 0 SDG_1 NA Potentially all~ NA NA
## 2 0 SDG_1 NA All products an~ NA NA
## 3 0 SDG_1 NA All products an~ NA NA
## 4 0 SDG_2 NA Agricultural pr~ HS chapters 01-~ HS c~
## 5 0 SDG_2 NA Technologies, m~ Relevant 6- dig~ Rele~
## 6 0 SDG_2 NA Fertilizers and~ Relevant 6- dig~ Rele~
## 7 0 SDG_2 Targe~ All intermediat~ See relevant se~ See ~
## 8 0 SDG_2 Targe~ Any product pro~ See relevant se~ See ~
## 9 0 SDG_2 Targe~ Endangered spec~ See relevant se~ See ~
## 10 1 SDG_2 Targe~ (BASIC) Agric~ [HS-WTO_Aggri H~ 0101~
## # ... with 6 more variables: HS_version <chr>, NTM <chr>, Keywords <chr>,
## # neg_Keywords <chr>, `A/C/CwK/IND` <chr>, Notes <chr>
Question 1.1: What output is generated by calling function head(data_frame)
? Note: data_frame
is just a made up name of a data frame object.
A. Preview of data_frame
in a separate tab in RStudio.
# Answer A is incorrect.
B. Preview of the first 10 rows of data_frame
.
# Answer B is incorrect.
C. Preview of the names of the columns of data_frame
.
# Answer C is incorrect.
D. Preview of the first 6 rows of data_frame
.
# Answer D is correct!
Question 1.2: Which of the lines of code below will successfully create an object containing a data frame imported from an Excel file’s sheet 1? Note: data.xlsx
is just a made up name of a data file.
A. object <- read("data.xlsx", sheet=1, col_names = TRUE)
# Answer A is incorrect. Need to use function `read_excel`.
B. object = read_excel("data.xlsx", sheet=1, col_names = TRUE)
# Answer B is correct! `<-` and `=` are equivalent.
C. object < read_excel("data.xlsx", sheet=1, col_names = TRUE,)
# Answer C is incorrect. There is one redundant `,` within the round brackets.
D. object <- read_excel(data.xlsx, sheet=1, col_names = TRUE)
# Answer D is incorrect. File name has to be specified within quotation marks.
matching
data frameFor the purpose of further analysis we need to remove those rows in matching
that contain concordance strings that are marked as Ambiguous and Indirect, so that we only preserve those that are Clear and Clear with keywords, as per the indication in A/C/CwK/IND column. We do that by rewriting object matching
with itself, excluding the rows that contain strings “A” and “IND” in column A/C/CwK/IND.11
To do that we use code matching <- matching[!(matching$
A/C/CwK/IND%in% c("A", "IND")),]
. Square brackets []
are used to reference or extract a subset from the existing matching
data frame, while operator $
in combination with the data frame object’s name is used to reference specific column by its name. Notice, that logical operator !
is used for negation and operator %in%
is used to identify, if an element belongs to a certain group of data elements, in our case values “A” and “IND” in column A/C/CwK/IND.
Before we run the code in the chunk below to subset our matching
data frame, let’s first practice different ways of data frame subsetting.
Let’s practice a bit with some more examples to get better hang of it. All the code mentioned in the following paragraphs is also included in the code chunk below. Note that this code only prints the results of subsetting in the RStudio console. It does not generate any new objects and does not change anything in the existing objects, as we do not use functions <-
or =
.
matching$NTM
or
matching$'NTM'
or
matching$"NTM"
These above all do the same thing and call the content of column NTM as a vector of values. Drop quotation marks for simplicity and use them only if the column name contains symbols that may have a function associated with them, just like we did with matching$'A/C/CwK/IND'
. Another way to call contents of the column as a vector is to use matching[[8]]
, where 8
is the number of the column we are calling.
matching[, "NTM"]
subsets matching
data frame into a smaller data frame containing only one column NTM and all 119 rows. Subsetting matching
to contain two columns is achieved with matching[, c(6,8)]
or matching[, c("HS", "NTM")]
.
matching[,8]
creates data frame with only one column 8, matching[8,]
creates data frame with only one row 8, and matching[8,8]
creates data frame with data from only one cell from 8th row and 8th column. To coerce the one-row-one-column data frame to a vector, you need to add [[1]]
, like so matching[8,8][[1]]
.12
You can extract the values from a given cell by using matching$NTM[[8]]
as a vector.
matching[1:8,2:8]
extracts from matching
rows from 1 to 8 and columns from 2 to 8. Thus, the value or the function within square brackets before ,
references the rows by number, name or by certain criteria, and the value after ,
references the columns by number, name or by certain criteria. Notice, that we preserve ,
, even if we do not set forth any criteria for rows, or columns. Technically calling matching[,]
is also correct - we extract all rows and columns, i.e. the complete data frame as is.
matching$'A/C/CwK/IND'=="A"
checks if any cells in column A/C/CwK/IND contain “A” and essentially creates a vector of logical values TRUE
or FALSE
. Note that, if the value after ==
is a character string, we should use quotation marks. If that value is numeric, we drop the quotation marks, as in matching[[1]]==1
, which checks if any cells in column 1 of matching
contains value 1.
If value in a cell is NA
, we cannot use ==NA
or =="NA"
. We should use function is.na
instead. For example, is.na(matching$Keywords)
checks whether any cell in column Keyword contain value NA
. Subsetting with matching[!is.na(matching$Keywords),]
will get you all rows of matching
data frame, excluding those that have value NA
in column Keyword. See more on !
below in this section.
matching[matching$'A/C/CwK/IND'=="A",]
extracts from matching
only those rows that contain “A” in column A/C/CwK/IND. Alternatively, matching[matching$'A/C/CwK/IND'!="A",]
or matching[!matching$'A/C/CwK/IND'=="A",]
extract from matching
only those rows that do not contain “A” in column A/C/CwK/IND. Note that in these code lines !=
stands for “not equal to” and !x
stands for “not x”.
For more details on various operators see here. We will use most of them in the sections below.
Notice, that we use ==
for the logical operator of equality, as =
is used for assigning values to an object and is equivalent to ``.
If you have more than one criterion for the rows you can go one of the following ways:
matching[matching$'A/C/CwK/IND'!="A"& matching$'A/C/CwK/IND'!="IND", ]
, where we use logical operator &
to set forth two criteria for the rows to be extracted. You can set up as many criteria as necessary.
matching[!(matching$'A/C/CwK/IND' %in% c("A", "IND")),]
does the same thing as the above, but uses operator %in%
and c()
to indicate that content in the cells in column A/C/CwK/IND should not be the one present in vector c("A", "IND")
.
All of the code described above is saved in the code chunk below. Paste it into your RStudio and run after importing the SDG-HS-NTM concordance matrix (done in code chunk above). Look at the output in the console. Running this code does not do anything to the objects saved in you environment, as we do not store this output into any of the existing or new objects by using <-
or =
. Assign the generated output to a test
object, so you can examine it in detail by using such functions as name
, head
, summary
, attributes
, length
, nrow
, ncol
.
Note that the first 5 lines of code generate vectors of data from the specified column in a data frame, while the rest of them generate a data frames with a given number of rows and columns. Also note, that each opening bracket has to be paired by a closing bracket!
# vectors
matching$NTM
matching$`NTM`
matching$"NTM"
matching$`A/C/CwK/IND`
matching[[8]]
# data frames
matching[, "NTM"]
matching[, c(6,8)]
matching[, c("HS", "NTM")]
matching[,8]
matching[8,]
matching[8,8]
matching[8,8][[1]]
matching$NTM[[8]]
matching[1:8,2:8]
matching[,]
matching$`A/C/CwK/IND`=="A"
matching[[1]]==1
matching[matching$`A/C/CwK/IND`=="A",]
matching[matching$`A/C/CwK/IND`!="A",]
matching[!matching$`A/C/CwK/IND`=="A",]
matching[matching$`A/C/CwK/IND`!="A"&
matching$`A/C/CwK/IND`!="IND", ]
matching[!(matching$`A/C/CwK/IND` %in% c("A", "IND")),]
Question 2.1: What line of code will open documentation on a function? Note: function_name
is just a made up name of a function.
A. help(function_name)
# Answer A is incorrect.
B. ?function_name
# Answer B is correct!
C. help(function = "function_name")
# Answer C is incorrect.
D. ?'function_name'
# Answer D is incorrect.
Question 2.2: What line of code creates an object with the content of column Target of matching
data frame?
A. target <- matching$Target
# Answer A is correct!
B. <- matching$Target
# Answer B is incorrect. No object is created.
C. target <- matching%in%Target
# Answer C is incorrect. Should use `matching$Target` to reference content of column __Target__.
D. target <- matching["Target"]
# Answer D is incorrect. Should use `matching$Target` to reference content of column __Target__.
Question 2.3: What line of code subsets a new data frame object from all rows of matching
object that have character string “HS4” in column HS_version?
A. HSversion <- matching[matching$HS_version="HS4",]
# Answer A is incorrect! Need to use `==` to indicate equality.
B. HSversion <- matching[matching$HS_version==HS4,]
# Answer B is incorrect! Character strings should be specified within quotation marks.
C. HSversion <- matching[matching$HS_version=="HS4"]
# Answer C is incorrect! Missing `,` within square brackets.
D. HSversion <- matching[matching$"HS_version"=="HS4",]
# Answer D is correct!
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. These objects you will need for further quizzes and practice sessions.
HSversion <- matching[matching$"HS_version"=="HS4",]
practice_matching <- matching
1.3 Clean up the matching
data frame (continued)
Now, back to our code, where we rewrite matching
with itself, excluding the rows that contain strings “A” and “IND” in column A/C/CwK/IND.
matching <- matching[!(matching$`A/C/CwK/IND` %in% c("A", "IND")),]
In further analysis we will only need columns Target, HS, HS_version, NTM, Keywords, neg_Keywords.13 Therefore, we rewrite matching
with a data frame only containing these specific columns.
matching <- matching [, c("SDG", "Target", "HS", "HS_version", "NTM", "Keywords", "neg_Keywords")]
In columns Keywords and neg_Keywords a few cells contain “NA” as a character string, which should be replaced with NA
as a value.
As you can see, we use square brackets to select for replacement only those cells in columns Keywords and neg_Keywords that contain a character string “NA”. Notice that we use ==
as a logical operator of equality rather than =
, which is used to assign a value or a set of characteristics to an object.
matching$Keywords[matching$Keywords=="NA"] <- NA
matching$neg_Keywords[matching$neg_Keywords=="NA"] <- NA
We then also can add an additional column containing row numbers for convenience. To do that, we reference column row in object matching
by using operator $
. Since this column does not yet exist, assigning values to be entered into that column will create a new one with a specified name. We use <-
to store values in that new column. We use :
to generate a sequence of numbers from 1 till number of rows in matching
data frame, which is determined with the use of function nrow
.
matching$row <- 1:nrow(matching)
Now we can preview the resulting matching
data frame and check its various attributes to see whether we are well on track.
names(matching) # names of the columns
## [1] "SDG" "Target" "HS" "HS_version"
## [5] "NTM" "Keywords" "neg_Keywords" "row"
head(matching) # first 6 rows
## # A tibble: 6 x 8
## SDG Target HS HS_version NTM Keywords neg_Keywords row
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 SDG_2 Target 2~ 010110;~ HS3 A1; A5~ pest; pe~ pesticide; M~ 1
## 2 SDG_2 Target 2~ 0101; 0~ HS3 A1; A5~ pest; pe~ pesticide; M~ 2
## 3 SDG_3 Target 3~ 263628;~ HS3 N; B31~ !TRIPS!;~ <NA> 3
## 4 SDG_3 Target 3~ 300220;~ HS3 B14; B~ medicine~ <NA> 4
## 5 SDG_3 Target 3~ 263628;~ HS5 B31; B~ generic;~ <NA> 5
## 6 SDG_3 Target 3~ 121130;~ HS5 B11; B~ dangerou~ <NA> 6
matching$row # values in column 'row'
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
nrow(matching) # number of rows
## [1] 42
ncol(matching) # number of columns
## [1] 8
row.names(matching) # names of rows
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
attributes(matching) # names of columns, names of rows and class of object
## $names
## [1] "SDG" "Target" "HS" "HS_version"
## [5] "NTM" "Keywords" "neg_Keywords" "row"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
Question 3.1: Now you want use practice_matching
data frame, which is the copy of the original imported matching
data frame, to create a new data frame healthSDG
, which contains only the rows for SDG 3. Which line of code below will achieve that?
A. healthSDG <- practice_matching[practice_matching$SDG=="SDG_3"]
# Answer A is incorrect. `,` is missing within square brackets.
B. "healthSDG" <- practice_matching[practice_matching$SDG=="SDG_3", ]
# Answer B is incorrect. Redundant quotation marks around the name of a newly created object.
C. healthSDG <- practice_matching[practice_matching$SDG=="SDG_3", ]
# Answer C is correct.
D. healthSDG <- practice_matching[practice_matching$SDG="SDG_3", ]
# Answer D is incorrect. Should use `==` to indicate equality.
Question 3.2: In healthSDG
data frame that you created above you want to add an additional column subject, into the cells of which you want to save character string “Good health and wellbeing”. Choose the correct and the most efficient (shortest) line of code from options below?
A. healthSDG$subject <- "Good health and wellbeing"
# Answer A is correct.
B. healthSDG$subject[healthSDG$SDG=="SDG_3"] <- "Good health and wellbeing"
# Answer B is incorrect. `[healthSDG$SDG=="SDG_3"]` is redundant, as the current version of object `healthSDG` only contains concordance strings for SDG 3.
C. healthSDG$subject[healthSDG$SDG==SDG_3] <- "Good health and wellbeing"
# Answer C is incorrect. `[healthSDG$SDG==SDG_3]` is both redundant and wrong. Character string `SDG_3` should be within quotation marks. Current version of object `healthSDG` only contains concordance strings for SDG 3.
D. healthSDG$subject[SDG=="SDG_3"] <- "Good health and wellbeing"
# Answer D is incorrect. `[SDG=="SDG_3"]` is both redundant and wrong. `healthSDG$SDG` should be used to reference column __SDG__ of `healthSDG`. Current version of object `healthSDG` only contains concordance strings for SDG 3.
Question 3.3: In healthSDG
data frame you want to rewrite cells in column A/C/CwK/IND that contain “A” and “IND” with character string “omit”. Which line of code below would achieve that?
A. healthSDG$HS_version[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- "omit"
# Answer A is incorrect. `healthSDG$HS_version` should be replaced with `healthSDG$"A/C/CwK/IND"`.
B. healthSDG$"A/C/CwK/IND"[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- omit
# Answer B is incorrect. Character string "omit" should be indicated within quotation marks.
C. healthSDG$"A/C/CwK/IND"[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- "omit"
# Answer C is correct.
D. healthSDG$"A/C/CwK/IND"[healthSDG$"A/C/CwK/IND" %in% c(A, IND)] <- "omit"
# Answer D is incorrect. Character strings "A" and "IND" should be indicated within quotation marks.
Question 3.4: Now you want to drop rows that are marked with string “omit” in omit column. Which line of code below would achieve that?
A. healthSDG <- healthSDG[healthSDG$"A/C/CwK/IND"=="omit", ]
# Answer A is incorrect. `!` is missing within square brackets.
B. healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"=="omit", ]
# Answer B is correct.
C. healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"!="omit", ]
# Answer C is incorrect. There are two `!` within square brackets. Only one should be used.
D. healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"=="omit" ]
# Answer D is incorrect. There is `,` missing within square brackets.
Question 3.5: In healthSDG
data frame you want to add an additional column row containing row numbers. Which line of code below would achieve that?
A. healthSDG$row <- 1:nrow(healthSDG)
# Answer A is correct.
B. healthSDG$row <- 1:ncol(healthSDG)
# Answer B is incorrect. Need to use function `nrow`, not `ncol`.
C. $row <- 1:nrow(healthSDG)
# Answer C is incorrect. Need to use `healthSDG$row` to indicate column __row__ of `healthSDG`.
D. healthSDG$row <- 1-nrow(healthSDG)
# Answer D is incorrect. Need to use `:` to indicate a range.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these updated objects you will need for further quizzes and practice sessions.
healthSDG <- practice_matching[practice_matching$SDG=="SDG_3", ]
healthSDG$subject <- "Good health and wellbeing"
healthSDG$`A/C/CwK/IND`[healthSDG$"A/C/CwK/IND" %in% c("A", "IND")] <- "omit"
healthSDG <- healthSDG[!healthSDG$"A/C/CwK/IND"=="omit", ]
healthSDG$row <- 1:nrow(healthSDG)
1.3 Clean up the matching
data frame (continued)
Some strings in cells in columns NTM, HS, Keywords, neg_Keywords, and Target contain an extra “;” at their end. For example, see the first cell in the NTM column currently containing the string “A1; A5; A8; C1; C3; P6;”. The same is true for values in row 1 of columns HS, Keywords, and neg_Keywords.
We will use a for loop
function to go through each cell in relevant columns one by one to check whether there is an extra semicolon present and to remove it. The for (i in 1:nrow(matching)
line below sets the sequence from which to derive index i
indicating a specific row in a given column.
The loop then does the following for NTM column:
temp
that contains a string in the ith cell in NTM columntrimws
last_letter
containing only the last symbol of the character string stored in temp
by using function substr
to extract a substring with given parameters and function nchar
to determine the number of the last character in a given character stringif...else
statement, which reads as if the symbol in last_letter
object is “;” then overwrite temp
with itself, excluding the last symbol. Notice, that we use nchar(temp)-1
to exclude the last character in a string stored in temp
temp
Open Help file for substr
function to see that it takes the following three arguments: substr(x, start, stop)
, where x
is text or a character vector, start
is first element to be extracted/replaced, and stop
is the last element to be extracted/replaced. start
and stop
can be specified as numbers or as functions that generate numbers based on certain criteria. In our case number for stop
is generated with function nchar(temp)
that gives the total number of characters in a string in temp
object. nchar(temp)-1
gives the total number of characters in a string in temp
object minus 1.
The for loop
then goes on to do the same for , HS, Keywords, neg_Keywords, and Target columns.
Note that for columns Keywords and neg_Keywords we introduce additional conditional if...else
statement, which reads as if temp
is not empty, then run the code that follows. This is due to the fact that some cells in these two columns are empty, and we don’t need to change them. To check, whether temp
is empty, we use function is.na
.
With the last line of code using function rm
we remove all redundant objects.
for (i in 1:nrow(matching)){
# for NTM
temp <- matching$NTM[i]
temp <- trimws(temp)
last_letter <- substr(temp, nchar(temp), nchar(temp))
if(last_letter==";"){
temp <- substr(temp, 1, nchar(temp)-1)
}
matching$NTM[i] <- temp
# for HS
temp <- matching$HS[i]
temp <- trimws(temp)
last_letter <- substr(temp, nchar(temp), nchar(temp))
if (last_letter==";"){
temp <- substr(temp, 1, nchar(temp)-1)
}
matching$HS[i] <- temp
# for KWs
temp <- matching$Keywords[i]
## do the same as for NTM and HS columns, only if "temp" contains a string
if (!is.na(temp)){
temp <- trimws(temp)
last_letter <- substr(temp, nchar(temp), nchar(temp))
if(last_letter==";"){
temp <- substr(temp, 1, nchar(temp)-1)
}
matching$Keywords[i] <- temp
}
# for neg KWs
temp <- matching$neg_Keywords[i]
if (!is.na(temp)){
temp <- trimws(temp)
last_letter <- substr(temp, nchar(temp), nchar(temp))
if (last_letter==";"){
temp <- substr(temp, 1, nchar(temp)-1)
}
matching$neg_Keywords[i] <- temp
}
# for Targets
temp <- matching$Target[i]
temp <- trimws(temp)
last_letter <- substr(temp, nchar(temp), nchar(temp))
if(last_letter==";"){
temp <- substr(temp, 1, nchar(temp)-1)
}
matching$Target[i] <- temp
}
# remove temporary objects
rm(temp, last_letter, i)
Question 4.1: You want to print the first 3 characters from a string in a cell in row 8 of column SDG in practice_matching
data frame. Which line of code below would achieve that?
A. substr(practice_matching[8, SDG], 1, 3)
# Answer A is incorrect. Name of a column should be indicated within quotation marks.
B. substr(practice_matching[8, "SDG"], 1, 3)
# Answer B is correct.
C. substr(practice_matching[8, SDG], 1:3)
# Answer C is incorrect. Name of a column should be indicated within quotation marks. Start and stop character numbers should be indicated as separate arguments and not as a range.
D. substr(practice_matching[8, "SDG"], 1:3)
# Answer D is incorrect. Start and stop character numbers should be indicated as separate arguments and not as a range.
Question 4.2: In practice_matching
data frame you now want to overwrite strings in SDG column with only numbers of these SDGs. Which line of code below would achieve that?
A. substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))
# Answer A is incorrect. Need to include `practice_matching$SDG <-` to indicate, what is being overwritten.
B. practice_matching$SDG <- substr(practice_matching$SDG, 5, nchar(SDG))
# Answer B is incorrect. Should use `nchar(practice_matching$SDG)` as an argument for stop character number.
C. practice_matching$SDG < substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))
# Answer C is incorrect. Should use `<-` and not `<`.
D. practice_matching$SDG <- substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))
# Answer D is correct.
Question 4.3: Take a look at HSversion
data frame that we created earlier. You now want to use a simple for loop
to overwrite “HS4” in column HS_version with “2012”. Which line of code below would achieve that?
A. for(i in 1-nrow(HSversion)){HSversion$HS_version[[i]] <-2012}
# Answer A is incorrect. Should use `:` to indicate a range.
B. for(i %in% 1:nrow(HSversion)){HSversion$HS_version[[i]] <-2012}
# Answer B is incorrect. Should use `in` and not `%in%` to indicate a sequence for `for loop`.
C. for(i in 1:nrow(HSversion)){HSversion$HS_version[[i]] <-2012}
# Answer C is correct.
D. for(i in 1:nrow(HSversion)){HS_version[[i]] <-2012}
# Answer D is incorrect. Should use `HSversion$HS_version` to reference column of an object.
Question 4.4: Take a look at HSversion
data frame again. You now want to use a for loop
with nested if..else
statement to overwrite character string “NA” in column Keywords with value NA
. Which line of code below would achieve that?
A. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]=="NA"){
HSversion$Keywords[[i]]<-NA
}
}
# Answer A is correct.
B. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]==NA){
HSversion$Keywords[[i]]<-NA
}
}
# Answer B is incorrect. Where `NA` is a character string, it should be indicated within quotation marks.
C. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]=="NA"){
HSversion$Keywords[[i]]<-NA
}
# Answer C is incorrect. One closing curly bracket is missing.
D. for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]="NA"){
HSversion$Keywords[[i]]=NA
}
}
# Answer D is incorrect. `==` should be used for equality. `=` can be used for assigning value instead of `<-`.
Question 4.5: How many cells in column Keywords in HSversion
object are empty? Use is.na
and sum
functions to check.
A. 4
# Answer A is incorrect.
B. 10
# Answer B is incorrect.
C. 15
# Answer C is incorrect.
D. 16
# Answer D is correct.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
substr(practice_matching[8, "SDG"], 1, 3)
practice_matching$SDG <- substr(practice_matching$SDG, 5, nchar(practice_matching$SDG))
for(i in 1:nrow(HSversion)){
HSversion$HS_version[[i]] <-2012
}
for(i in 1:nrow(HSversion)){
if(HSversion$Keywords[[i]]=="NA"){
HSversion$Keywords[[i]]<-NA
}
}
sum(is.na(HSversion$Keywords))
1.3 Clean up the matching
data frame (continued)
Recalling the string from the first cell in the NTM column now gives us “A1; A5; A8; C1; C3; P6”. The semicolon has been removed!
Now we can expand matching
data frame into a long format by column Target by using function separate_rows
, trim any white spaces that may have been introduced into the character strings in column Target and renumber the rows in column row.
If a variable contains observations with multiple delimited values, separate_rows
function from tidyr
package separates the values and places each one in its own row.
Open Help file for separate_rows
function to see that it takes the following basic arguments: separate_rows(data, ..., sep = "[^[:alnum:].]+")
, where data
is name of a data frame object, ...
is for selection of columns to be separated, and sep = "[^[:alnum:].]+"
is used to indicate which symbol in a string should be treated as a separator delimiting collapsed values, in our case ";"
.
matching <- separate_rows(matching, Target, sep = ";")
matching$Target <- trimws(matching$Target)
matching$row <- 1:nrow(matching)
Question 5.1: Use separate_rows
function to split the current version of practice_matching
data frame by column Target with ;
treated as a separator, use trimws
to remove any leading or trailing white spaces in column Target, and then use nrow
function to determine number of rows in the resulting data frame and to choose the right answer below.
A. 231
# Answer A is incorrect.
B. 211
# Answer B is correct. The correct code is below for reference.
C. 121
# Answer C is incorrect.
D. 196
# Answer D is incorrect.
Question 5.2: Since in practice_matching
we did not remove trailing ;
in column Target, separate_rows
function generated extra rows with empty cells in column Target (i.e. space after ;
is treated as an independent string, even though it is empty). Use code to remove all rows for which cell in Target column is empty, and then use nrow
to determine number of rows in the resulting data frame and to choose the right answer below. Hint: Use ==""
to describe empty string.
A. 201
# Answer A is incorrect.
B. 196
# Answer B is incorrect.
C. 211
# Answer C is incorrect.
D. 204
# Answer D is correct. The correct code is below for reference.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these updated objects you will need for further quizzes and practice sessions.
# Question 1
practice_matching <- separate_rows(practice_matching, Target, sep = ";")
practice_matching$Target <- trimws(practice_matching$Target)
nrow(practice_matching)
# Question 2
practice_matching <- practice_matching[!practice_matching$Target=="",]
nrow(practice_matching)
In the HS column of matching
some cells contain “All HS”, indicating that all HS codes may be targeted by relevant regulations - for these we will need to include all 6-digit HS codes. Additionally, some cells contain 2- or 4-digit HS codes. For these we will need to include all underlying 6-digit codes. And lastly, as per the values in HS_version column, some HS codes are from different HS versions (2007, 2012 and 2017). For these we will need to convert HS codes to 2012 version, as this version is the one that is used in the TRAINS database of NTMs.
To do this we will need to use data frames generated from the UN Comtrade Commodity Classification table and from the table of concordance between different versions of the Harmonized Commodity Description and Coding System, referenced Data section above.
First, we create object hs_codes
into which we read the .csv file containing the UN Comtrade Commodity Classification.
hs_codes <- read.csv("HSCodesAll.csv", stringsAsFactors = FALSE)
If we look at the unique values in column Classification by using command unique()
, we will see the following values BE, H0, H1, H2, H3, H4, H5, S1, S2, S3, S4, representing different types and versions of commodity classifications. We are only interested in HS versions of 2007 (H3), 2012 (H4) and 2017 (H5), as these are used in matching
data frame.
So now we create three objects that only include rows for which value in column Classification is “H4”, “H5” or“H3”and the value in column Level is “6” (6-digit). And we are only including columns Code and Description. In the last three lines of code, we rename the columns in the resulting new objects for convenience.
hs_codes12 <- hs_codes[hs_codes$Classification=="H4"&hs_codes$Level==6, c("Code", "Description")]
hs_codes17 <- hs_codes[hs_codes$Classification=="H5"&hs_codes$Level==6, c("Code", "Description")]
hs_codes07 <- hs_codes[hs_codes$Classification=="H3"&hs_codes$Level==6, c("Code", "Description")]
names(hs_codes12) <- c("hs6", "description")
names(hs_codes17) <- c("hs6", "description")
names(hs_codes07) <- c("hs6", "description")
Then, in each object we create additional two columns with 2- and 4-digit HS codes by substringing the character stings in column hs6 with the use of function substr
. Numbers 1 and 2 or 1 and 4 indicate the first and the last characters to be extracted.
hs_codes12$hs4 <- substr(hs_codes12$hs6, 1, 4)
hs_codes12$hs2 <- substr(hs_codes12$hs6, 1, 2)
hs_codes17$hs4 <- substr(hs_codes17$hs6, 1, 4)
hs_codes17$hs2 <- substr(hs_codes17$hs6, 1, 2)
hs_codes07$hs4 <- substr(hs_codes07$hs6, 1, 4)
hs_codes07$hs2 <- substr(hs_codes07$hs6, 1, 2)
Lastly, we convert values in columns hs6, hs4 and hs2 in all three objects from characters into numeric values by using function as.numeric
. We need to do this to keep it consistent with the HS concordance table that has this data recorded as numeric values (omitting the first zero of the string), as we will see below.
hs_codes12$hs2 <- as.numeric(hs_codes12$hs2)
hs_codes12$hs4 <- as.numeric(hs_codes12$hs4)
hs_codes12$hs6 <- as.numeric(hs_codes12$hs6)
hs_codes17$hs2 <- as.numeric(hs_codes17$hs2)
hs_codes17$hs4 <- as.numeric(hs_codes17$hs4)
hs_codes17$hs6 <- as.numeric(hs_codes17$hs6)
hs_codes07$hs2 <- as.numeric(hs_codes07$hs2)
hs_codes07$hs4 <- as.numeric(hs_codes07$hs4)
hs_codes07$hs6 <- as.numeric(hs_codes07$hs6)
We can now see what new objects look like. For example, object hs_codes12
below. Notice, that we use function head
in which we indicate that we want to see first 10 rows instead of the default 6 rows. We also use function class
to check the class of data in hs6 column.
head(hs_codes12,10)
## hs6 description hs4 hs2
## 25830 10121 Horses; live, pure-bred breeding animals 101 1
## 25831 10129 Horses; live, other than pure-bred breeding animals 101 1
## 25832 10130 Asses; live 101 1
## 25833 10190 Mules and hinnies; live 101 1
## 25835 10221 Cattle; live, pure-bred breeding animals 102 1
## 25836 10229 Cattle; live, other than pure-bred breeding animals 102 1
## 25837 10231 Buffalo; live, pure-bred breeding animals 102 1
## 25838 10239 Buffalo; live, other than pure-bred breeding animals 102 1
## 25839 10290 Bovine animals; live, other than cattle and buffalo 102 1
## 25841 10310 Swine; live, pure-bred breeding animals 103 1
class(hs_codes12$hs6)
## [1] "numeric"
Question 6.1: In HS_version
you want to check the number of unique character strings stored in column SDG. Use unique
function nested within length
function to achieve that. What number did you get?
A. 9
# Answer A is incorrect.
B. 11
# Answer B is incorrect.
C. 15
# Answer C is correct.
# The correct code is below for reference.
D. 17
# Answer D is incorrect.
Question 6.2: In HS_version
data frame you want to replace column names by their numbers. Which code will achieve that?
A. All
# Answer A is correct. All lines of code are correct and achieve the same result. However, it is always better to use shorter code.
B. names(HSversion) <- 1:ncol(HSversion)
# Answer B is not entirely correct.
C. names(HSversion) <- 1:ncol(HSversion[, 1:12])
# Answer C is not entirely correct.
D. names(HSversion)<- 1:12
# Answer D is not entirely correct.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
# Question 1
length(unique(HSversion$SDG))
# Question 2
names(HSversion) <- 1:ncol(HSversion)
names(HSversion)<- 1:12
names(HSversion) <- 1:ncol(HSversion[, 1:12])
1.4 Import and arrange auxiliary data (continued)
Now let’s create object hs_concordance
into which we read .csv file containing concordance table for all existing versions of the HS from 1992 till 2017 and then take a look at it.
hs_concordance <- read.csv(file = "HS_concordance_full.csv", stringsAsFactors = FALSE)
head(hs_concordance)
## HS92 HS96 HS02 HS07 HS12 HS17
## 1 10111 10111 10110 10110 10121 10121
## 2 10111 10111 10110 10110 10130 10130
## 3 10120 10120 10110 10110 10121 10121
## 4 10120 10120 10110 10110 10130 10130
## 5 10119 10119 10190 10190 10129 10129
## 6 10119 10119 10190 10190 10130 10130
Since the matching
data frame only contains 2007, 2012 and 2017 versions of the HS, we only need one concordance between 2007 and 2012, and the second concordance between 2017 and 2012. Let’s create two objects containing only two columns of relevant concordances. We can do this by subsetting the hs_concordance
data frame as below.
hs_concordance0712 <- hs_concordance[, c("HS07", "HS12")]
hs_concordance1712 <- hs_concordance[, c("HS17", "HS12")]
head(hs_concordance0712)
## HS07 HS12
## 1 10110 10121
## 2 10110 10130
## 3 10110 10121
## 4 10110 10130
## 5 10190 10129
## 6 10190 10130
The matching
data frame in its NTM column contains NTM classification codes at 1-, 2-, 3- and 4-digit levels. To account for possible variation in assignment of NTM codes to NTM entries in TRAINS database, any 1-, 2- or 3-digit codes need to be expanded to include themselves and all further disaggregated codes. To be able to do that, we need to import a data frame that contains the codes of the International Classification of NTMs referenced above in section Data. Use function head
to see what it looks like.
MAST_codes <- read.csv("MAST_all.csv", stringsAsFactors = FALSE)
head(MAST_codes)
## Code level level1 level2 level3 level4
## 1 A 1 A
## 2 A1 2 A A1
## 3 A11 3 A A1 A11
## 4 A12 3 A A1 A12
## 5 A13 3 A A1 A13
## 6 A14 3 A A1 A14
Another data set that we will need at a later stage is “countries_UNCTAD_iso3.csv”, which contains full names of the countries (or economies), ISO3 country codes, a column marking whether or not a given country has NTM data available in TRAINS14, and columns marking whether or not a given country/economy belongs to any classification or geographical grouping (e.g. high income countries, low income countries, land-locked developing countries, South-East Asia, ESCAP, SPECA, etc.). Note, that this data set was prepared by ESCAP in May 2019, and it may need updating with time.
To clean up the data frame a bit, we keep only rows that are marked as having NTM data in TRAINS database (contain value 1
in column has_ntms) and only columns country, iso3, has_ntms and escap. If for your analysis you may need any other additional columns from “countries_UNCTAD_iso3.csv”, make sure to include names of those columns.
countries <- read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)
countries <- countries[countries$has_ntms==1, c("country", "iso3", "has_ntms", "escap")]
Now we can take a look at countries
data frame.
head(countries, 3)
## country iso3 has_ntms escap
## 1 Afghanistan AFG 1 1
## 4 Algeria DZA 1 0
## 11 Antigua and Barbuda ATG 1 0
Question 7.1: You want to create a new data frame object country_inc
by importing file “countries_UNCTAD_iso3.csv”. Which line of code below will achieve that?
A. countries_inc <- read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE
# Answer A is incorrect. Closing bracket is missing.
B. countries_inc <- read.csv(file = "countries_UNCTAD_iso3", stringsAsFactors = FALSE)
# Answer B is incorrect. File extension is missing from the file name.
C. countries_inc = read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)
# Answer C is correct.
D. countries_inc <- read_excel(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)
# Answer D is incorrect. `read.csv` should be used to import .csv file.
Question 7.2: Now subset country_inc
to only include the rows that have 1
in column has_ntms and only keep columns country, iso3, lower_middle_income, upper_middle_income, low_income, high_income. How many rows will the subset data frame country_inc
have?
A. 86
# Answer A is incorrect.
B. 74
# Answer B is incorrect.
C. 88
# Answer C is correct.
# The correct code is below for reference.
D. 77
# Answer D is incorrect.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
# Question 1
countries_inc = read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)
# Question 2
countries_inc <- countries_inc[countries_inc$has_ntms==1, c("country", "iso3", "lower_middle_income", "upper_middle_income", "low_income", "high_income")]
nrow(countries_inc)
Now we can use the imported auxiliary data frames to finalize preparation of data from matching
data frame. We will create a list object using matching
data frame that will be utilized to analyze NTM entries from the TRAINS database of NTMs.
First, we use function paste
to overwrite those cells in column HS of matching
data frame that contain character string “All HS” with all of the strings contained in column hs6 in hs_codes12
data frame. We use argument collapse = ";"
to combine all the separate character strings from column hs6 into one long character string, where the 6-digit HS codes are separated from each other by a semicolon.
matching$HS[matching$HS=="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )
Now let’s create a list object matching.list
using matching
data frame. It will contain the same number of list items as the number of rows in the matching
data frame. The columns will transform into sub-items within each of the list items, containing the strings from the cells in the corresponding rows and columns of the matching
data frame.
To do that we use function split
to split a data frame into a list of items (based on the number of rows in it), which in turn contain the list of data elements that were stored in each of the rows of that data frame.
It’s first argument contains name of an object being split, and second argument defines the grouping and uses nested functions seq
and nrow
to generate a sequence of list items based on the number of rows in matching
.
nrow
by itself generates number of rows in matching
data frame, which is 67. seq
creates a sequence of 67 items from 1 to 67, i.e. 67 distinct items of the list object that is being created. Essentially, seq(nrow(matching))
does the same thing as 1:nrow(matching)
. Run these two lines of code to see what happens, and then run the code in the chunk below.
matching.list <- split(matching, seq(nrow(matching)))
Picture below shows what item 2 of the new matching.list
looks like. You can see that a list item contains a tibble with 1 row and 8 columns extracted from matching
data frame. You can use print
function to print list item 2 of matching.list
to see just that.
print(matching.list[[2]])
## # A tibble: 1 x 8
## SDG Target HS HS_version NTM Keywords neg_Keywords row
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 SDG_2 Target~ 0101; 01~ HS3 A1; ~ pest; pest~ pesticide; MR~ 2
Now below we include a for loop
that will do the following things:
transform content of the cells that have a string of characters separated by a semicolon into a list vector of separate character strings or numeric values. This is true for such sub-items as Target, HS, NTM, Keywords, neg_Keywords
For sub-item HS :
codes longer than 6 digits are truncated to the first 6-digit;
2- ad 4-digit codes are expanded to include all of the underlying 6-digit codes of the same HS version that is indicated in HS_version
HS codes of the HS versions other than 2012 - the default version of the TRAINS database - are converted to 2012 version
For sub-item NTM, all 1-, 2- or 3-digit NTM codes are expanded to include themselves and all underlying NTM codes at further levels of disaggregation
For sub-items Keywords and neg_Keywords all substrings that do not contain “!”, are formatted to lower case. Those that contain “!” are kept as all caps. This is done, as keyword matching with the data in the TRAINS database is generally case insensitive, with the exception of certain abbreviations that may match other words or their parts. Example: Letter combination “ODS”, which as an abbreviation that stands for “ozone depleting substances”, may be found in words “goods”, “rods”, “foods”, etc.
Notice again, that round brackets after for
are used to define the sequence for which the loop is run. Round brackets are followed by curly brackets, within which all recycled functions are included. If the loop is very long, the closing curly bracket may be separated from the opening one by many lines of code. It is important not to lose it, otherwise the loop will not run.
The for loop
below contains a number of nested conditional if...else
statements and another for loop
. As it is quite long and complex, we break it up below into its components and examine them step by step for sub-item 2 only. To do that we put a hash sign is front of the line of the for loop
, which sets the sequence for which to run the loop (number of rows in matching
data frame, from which matching.list
was created) and in front of the closing “}” at the very end of the loop. Instead we set i
equal to 2
to reference sub-item 2 of matching.list
only. Later, to run a complete loop, you will need to remove the hash signs or just copy the full loop from the complete code script provided in “complete_matching_code.R” file (see Data section).
Question 8.1: Look at healthSDG
data frame that we created earlier. For Target 3.9 “All HS” codes are indicated in HS column. You need to replace these with all of the 6-digit HS codes from hs_codes12
data frame. Which line of code below will achieve that?
A. healthSDG$HS[healthSDG$HS=='All HS'] <- paste(hs_codes12$hs6, collapse = ";" )
# Answer A is correct.
B. healthSDG[healthSDG$HS=="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )
# Answer B is incorrect. Need to use `healthSDG$HS` to reference column __HS__ of `healthSDG` object.
C. healthSDG$HS[healthSDG$HS="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )
# Answer C is incorrect. `==` should be used to indicate equality.
D. healthSDG$HS[healthSDG$HS=="All HS"] <- paste(hs_codes12$hs2, collapse = ";" )
# Answer D is incorrect. We are pasting 6-digit HS codes from column __hs6__, and not 2-digit HS codes from column __hs2__.
Question 8.2: Overwrite new healthSDG
object from the current iteration of matching
data frame that only contains rows for SDG 3. How many rows does this new data frame have?
A. 22
# Answer A is incorrect.
B. 27
# Answer B is correct.
# The correct code is below for reference.
C. 25
# Answer C is incorrect.
D. 29
# Answer D is incorrect.
Question 8.3: Now you want to create a list object from healthSDG
data frame. Which line of code below will achieve that?
A. health.list <- split(healthSDG, seq(nrow(healthSDG))
# Answer A is incorrect. One closing bracket is missing.
B. health.list <- split(healthSDG, seq(ncol(healthSDG)))
# Answer B is incorrect. Nested function `nrow` should be used instead of `ncol`.
C. health.list <- separate_rows(healthSDG, seq(nrow(healthSDG))
# Answer C is incorrect. Function `split` should be used here, and not `separate_rows`.
D. health.list <- split(healthSDG, seq(nrow(healthSDG)))
# Answer D is correct.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
# Question 1
healthSDG$HS[healthSDG$HS=="All HS"] <- paste(hs_codes12$hs6, collapse = ";" )
# Question 2
healthSDG <- matching[matching$SDG=="SDG_3", ]
nrow(healthSDG)
# Question 3
health.list <- split(healthSDG, seq(nrow(healthSDG)))
First, we create an object x
by copying a long string from item 2 sub-item HS
and splitting it into a vector of character strings separated by “;”. To do this we use function strsplit
and indicate ";"
as a separator.
Then, we rewrite x
by using nested functions to consecutively trim leading and trailing white spaces, substring only the first 6 symbols of the character strings in the vector and turn the character strings into numeric values. And lastly we remove any duplicate codes in object x
by using unique
function.
Note the following:
If we use strsplit(matching.list[[i]]$HS, ";")
without adding [[1]]
, this generates a list object with one component that is a vector of split character strings. To extract split character strings as an atomic vector only, we need to add [[1]]
, which coerces a list vector to a simpler data object, that is an atomic vector. Run two versions of the code and inspect the generated object to see the difference between them.
After running x <- strsplit(matching.list[[i]]$HS, ";")[[1]]
notice that some of the HS codes within quotation marks contain a leading white space. Trimming white spaces is crucial, as they are counted just like other characters. If we do not trim them, extracting the first 6 characters will result in extracting a white space and 5 first characters.
We convert HS codes to numeric values, which removes the leading “0”, as we will need to use HS concordance data frames to convert HS codes to 2012 version, and those concordance data frames store HS codes in numeric format.
Using cat("--------- ", i, " HS --------- \n")
, which will simply print --------- i ---------
, where i
is the number from the loop sequence, will make it easier to view the progress of loop execution in the Console in the bottom left section of RStudio window.
i <- 2
#for (i in 1:nrow(matching)){
cat("--------- ", i, " --------- \n")
## --------- 2 ---------
x <- strsplit(matching.list[[i]]$HS, ";")[[1]]
x <- as.numeric(substr(trimws(x), 1, 6))
x <- unique(x)
Now x
contains 1130 strings containing unique HS codes that are not longer than 6-digit. Of them 1130 are 4-digit HS codes. We need to expand them to 6-digit HS codes, while taking into account HS version marked in sub-item HS_version
, which in this case is HS 2007.
So first we need to check the version of HS in which HS codes are recorded and treat them accordingly. Here we use a series of conditional if...else
statements. The first one reads as if the value in sub-item HS_version
of item 2 of matching.list
is equal to “HS4” then check the following condition (this leads to the next if...else
statement within), otherwise skip to the next step.
Since HS codes in item 2 are recorded in version HS 2007(HS3), we are directed to the next step. If HS version was 2012, then the code chunk below would consecutively check, whether there are any codes that are 4 or 2 digits long, and, if so, it would then use earlier created object hs_codes12
to expand them to the underlying 6-digit codes. We will examine these actions in more detail in the steps below, as they recur.
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if(matching.list[[i]]$HS_version=="HS4"){
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digit using the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
}
Before we go to the next chunk of our for loop
, let’s practice subsetting a vector of numeric values. As was the case with data frames above, we can do this by using square brackets to describe criteria, which will be used to extract or reference elements in a vector. Paste the code below into your RStudio console to view the output.
First we create two atomic vector objects y
and z
by taking the long character strings from practice_matching
from cells in column HS rows 87 and 147. We use nested functions to consecutively split one strings into several (“;” is treated as a separator), trim white spaces and convert the strings to numeric. For convenience, let’s join them into one vector yz
containing all of the strings from y
and z
, and print it to view.
y <- as.numeric(trimws(strsplit(practice_matching$HS[[87]], ";")[[1]]))
z <- as.numeric(trimws(strsplit(practice_matching$HS[[147]], ";")[[1]]))
yz <- c(y,z)
print(yz)
## [1] 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410
## [11] 8411 8412 8413 8414 8415 8416 8417 8418 8419 8420
## [21] 8421 8422 8423 8424 8425 8426 8427 8428 8429 8430
## [31] 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440
## [41] 8441 8442 8443 8444 8445 8446 8447 8448 8449 8450
## [51] 8451 8452 8453 8454 8455 8456 8457 8458 8459 8460
## [61] 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470
## [71] 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480
## [81] 8481 8482 8483 8484 8486 8487 8501 8502 8503 8504
## [91] 8505 8506 8507 8508 8509 8510 8511 8512 8513 8514
## [101] 8515 8516 8517 8518 8519 8521 8522 8523 8525 8526
## [111] 8527 8528 8529 8530 8531 8532 8533 8534 8535 8536
## [121] 8537 8538 8539 8540 8541 8542 8543 8544 8545 8546
## [131] 8547 8548 392210 392220 392290 691010 691090 732410 732421 732429
## [141] 732490 381300 84 85 86 87 88 89
Now let’s use subsetting with square brackets to check for the strings that are 2-, 4-, 6- and more than 6 digits. Since our HS codes are all numeric values and since the leading zeros of HS codes got dropped, we can use the following criteria:
Number of digits in HS code | Likely numbers in our HS code vector | Criteria for selection to be used in R |
---|---|---|
2 digits | 1 to 99 | [yz<=99] |
4 digits | 100 to 9999 | [yz>99&yz<=9999] |
6 digits | 10000 to 999999 | [yz>9999&yz<=999999] |
more than 6 digits | 1000000 and above | [yz>999999] |
yz[yz<=99]
## [1] 84 85 86 87 88 89
yz[yz>99&yz<=9999]
## [1] 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 8414
## [15] 8415 8416 8417 8418 8419 8420 8421 8422 8423 8424 8425 8426 8427 8428
## [29] 8429 8430 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440 8441 8442
## [43] 8443 8444 8445 8446 8447 8448 8449 8450 8451 8452 8453 8454 8455 8456
## [57] 8457 8458 8459 8460 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470
## [71] 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480 8481 8482 8483 8484
## [85] 8486 8487 8501 8502 8503 8504 8505 8506 8507 8508 8509 8510 8511 8512
## [99] 8513 8514 8515 8516 8517 8518 8519 8521 8522 8523 8525 8526 8527 8528
## [113] 8529 8530 8531 8532 8533 8534 8535 8536 8537 8538 8539 8540 8541 8542
## [127] 8543 8544 8545 8546 8547 8548
yz[yz>9999&yz<=999999]
## [1] 392210 392220 392290 691010 691090 732410 732421 732429 732490 381300
yz[yz>999999]
## numeric(0)
To count how many 2-, 4- and 6-digit HS codes are present in vector yz
, we combine the above code with length
function.
length(yz[yz<=99])
## [1] 6
length(yz[yz>99&yz<=9999])
## [1] 132
length(yz[yz>9999&yz<=999999])
## [1] 10
length(yz[yz>999999])
## [1] 0
We can use function any
to check whether any of the elements in a vector conform to a certain criterion. We will use this function later on.
any(yz[yz<=99])
## [1] TRUE
any(yz[yz>99&yz<=9999])
## [1] TRUE
any(yz[yz>9999&yz<=999999])
## [1] TRUE
any(yz[yz>999999])
## [1] FALSE
You can use function which
to show the elements of a vector that conform to a given criterion, or to negate it to show those elements that do not conform to it.
yz[which(yz>9999&yz<=999999)]
## [1] 392210 392220 392290 691010 691090 732410 732421 732429 732490 381300
yz[-which(yz>9999&yz<=999999)]
## [1] 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 8414
## [15] 8415 8416 8417 8418 8419 8420 8421 8422 8423 8424 8425 8426 8427 8428
## [29] 8429 8430 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440 8441 8442
## [43] 8443 8444 8445 8446 8447 8448 8449 8450 8451 8452 8453 8454 8455 8456
## [57] 8457 8458 8459 8460 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470
## [71] 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480 8481 8482 8483 8484
## [85] 8486 8487 8501 8502 8503 8504 8505 8506 8507 8508 8509 8510 8511 8512
## [99] 8513 8514 8515 8516 8517 8518 8519 8521 8522 8523 8525 8526 8527 8528
## [113] 8529 8530 8531 8532 8533 8534 8535 8536 8537 8538 8539 8540 8541 8542
## [127] 8543 8544 8545 8546 8547 8548 84 85 86 87 88 89
With function which
we can set forth criteria for selecting elements from a vector. In code chunk below we first create an object a
, which only contains those elements of yz
that correspond to 2-digit HS codes. And then we create object b
, which contains all elements of yz
excluding those that correspond to 2-digit HS codes. We will use this function later on.
Use the last line to remove all the object that were created in this practice session and that are redundant now.
a <- yz[which(yz<=99)]
a
## [1] 84 85 86 87 88 89
b <- yz[-which(yz<=99)]
b
## [1] 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410
## [11] 8411 8412 8413 8414 8415 8416 8417 8418 8419 8420
## [21] 8421 8422 8423 8424 8425 8426 8427 8428 8429 8430
## [31] 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440
## [41] 8441 8442 8443 8444 8445 8446 8447 8448 8449 8450
## [51] 8451 8452 8453 8454 8455 8456 8457 8458 8459 8460
## [61] 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470
## [71] 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480
## [81] 8481 8482 8483 8484 8486 8487 8501 8502 8503 8504
## [91] 8505 8506 8507 8508 8509 8510 8511 8512 8513 8514
## [101] 8515 8516 8517 8518 8519 8521 8522 8523 8525 8526
## [111] 8527 8528 8529 8530 8531 8532 8533 8534 8535 8536
## [121] 8537 8538 8539 8540 8541 8542 8543 8544 8545 8546
## [131] 8547 8548 392210 392220 392290 691010 691090 732410 732421 732429
## [141] 732490 381300
As a final thing, let’s use function any
to check if any elements of vector b
are present among the elements of vector a
. Given the criteria that were used to create these two vector objects, the result is naturally FALSE
.
Use the last line to remove all the object that were created in this practice session and that are redundant now.
any(b %in% a)
## [1] FALSE
rm(y,z,yz,a,b)
1.6 Truncate long HS codes to 6-digit codes (continued)
Now let’s get back to our for loop
.
The chunk below uses the if...else
statement to check, whether the HS codes extracted from item 2 are marked as HS version of 2007. Since it is, the code leads us to the following if...else
statements within. Statement 1 checks, whether vector x
contains any strings that are 3 or 4 digits long.15
If so, the code then creates object x4
by implementing a series of transformations with object x
described in the code string x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
:
data.frame
and argument x=x[x>99&x<=9999]
to extract 3- and 4-digit long strings from x
and turn them into a x4
data frame object, consisting of one column x containing the extracted HS codesmerge
to merge the resulting x4
data frame with another data frame created in an earlier step - hs_codes07
, which contains 2-, 4- and 6-digit HS codes of HS 2007. These two data frames are merged by those columns of each containing 4-digit HS codes. This essentially adds to data frame x4
a new hs6 column that contains 6-digit HS codes that correspond to the 4-digit HS codes already present in x4
.x4
object with them (this turns x4
back into a vector containing new 6-digit HS codes).In the next code string, we use negation of function which
to remove 3- and 4-digit long strings from x
.16. And then we rewrite x
by joining objects x
and x4
.
Statement 2 checks, whether vector x
contains any strings that are 2 or less digits long. 17 If there are, the code takes us through a series of transformations that follow the same logic as the ones described for transformation of 4-digit codes above. Since, there are no 2-digit HS codes in the examined strings, the code takes us to the following step - conversion of HS codes to HS 2012 version.
First, we remove any duplicates from x
using unique
. Then, we turn x
into a data frame and merge it with a data frame created in an earlier step hs_concordance0712
, which contains a table of corresponding 6-digit codes from versions of 2007 and 2012. Note argument all.x=TRUE
in function merge
. It results in extra rows being added to the output, one for each row in x
that has no matching row in y
. This way if there is no value in HS07 column of hs_concordance0712
that matches a value in x column in the original object x
, then the corresponding cell in column HS12 in the resulting object x
will be left empty. We then overwrite these empty cells with the corresponding values from column x of the resulting object x
, to make sure none of the HS codes are lost.
In the next step we overwrite x
with only the unique HS codes from column HS12 of the resulting data frame.
if(matching.list[[i]]$HS_version=="HS3"){
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes07[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 3. convert HS 2007 codes to HS 2012
x <- unique(x)
x <- data.frame(x)
x <- merge(x,hs_concordance0712, by.x="x", by.y="HS07", all.x=TRUE)
x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
x <- unique(x$HS12)
}
The following code chunk tests, whether the examined string of HS codes is marked as HS 2017 version. If it is, then in takes us through a series of transformations that follow the same logic as the ones described immediately above for HS codes marked as HS 2007. But since our examined string is not HS 2017, the code will skip to the step that follows after this code chunk (see last line of code below), which converts the resulting atomic vector ‘x’ to a list vector with function list
and writes it to sub-item HS of item 2 of matching.list
.
if(matching.list[[i]]$HS_version=="HS5"){
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes17[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes17[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 3. convert HS 2017 codes to HS 2012
x <- unique(x)
x <- data.frame(x)
x <- merge(x,hs_concordance1712, by.x="x", by.y="HS17", all.x=TRUE)
x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
x <- unique(x$HS12)
}
# 4. write the resulting HS codes into the corresponding subitem in "matching.list"
matching.list[[i]]$HS <- list(x)
A quick look with length(matching.list[[i]]$HS[[1]])
at sub-item HS of item 2, tells us that it now contains 4857 strings, as opposed to 1130 it had before. And using unique(nchar(matching.list[[i]]$HS[[1]]))
tells us that sub-item HS of item 2 only contains HS codes that are 5- and 6-digit long.
You will notice that in matching
two rows are marked to likely have a mix of HS codes from versions of 2012 and 2017. For those cases at the research stage it was not possible to determine, the exact version of the HS (for details see the ESCAP working paper ‘Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific’.18) All those codes are 6 digit, and for the purpose of this analysis they are assumed as 2012 version of the HS. The code keeps them as they are.
Question 9.1: Create a new object a
and use a succession of functions strsplit
, trimws
, as.numeric
and length
(or nest these functions) to determine the number of unique HS codes stored in row 1 column HS of a current iteration of healthSDG
data frame. What is the number?
A. 305
# Answer A is correct.
# The correct code is below for reference.
B. 525
# Answer B is incorrect.
C. 258
# Answer C is incorrect.
D. 369
# Answer D is incorrect.
Question 9.2: Use code to check, whether object a
has any values that correspond to 2-digit HS codes. How many are they?
A. 57
# Answer A is incorrect.
B. 39
# Answer B is incorrect.
C. 19
# Answer C is incorrect.
D. 0
# Answer D is correct.
# The correct code is below for reference.
Question 9.3: a
now contains HS codes that are marked as HS3 (2007). Write code to convert the codes to 2012 version of HS.
A. 230
# Answer A is incorrect.
B. 310
# Answer B is correct.
# The correct code is below for reference.
C. 218
# Answer C is incorrect.
D. 256
# Answer D is incorrect.
The code stored under the button below is not necessary for further training and quizzes. The chunk includes the correct code strings from the questions above, and the last line removes the redundant objects.
# Question 1
a <- strsplit(healthSDG$HS, ";")[[1]]
a <- trimws(a)
a <- as.numeric(a)
length(a)
# Question 2
length(a[a<=99])
# Question 3
a <- data.frame(a)
a <- merge(a,hs_concordance0712, by.x="a", by.y="HS07", all.x=TRUE)
a$HS12[is.na(a$HS12)] <- a$a[is.na(a$HS12)]
a <- unique(a$HS12)
length(a)
rm(a)
For this step we will manipulate strings in sub-item NTM of item 2 of matching.list
and use object MAST_codes
that we created earlier to extract the corresponding NTM codes from.
First, we create an object temp_NTM
by copying a long string in sub-item NTM item 2, splitting it into a vector of character strings separated by “;” and trimming all leading and trailing white spaces from each substring. To do this we use strsplit
and trimws
. We then create object temp_NTM_full
, which is a copy of temp_NTM
. temp_NTM
with serve as a source from which to derive individual NTM codes, and temp_NTM_full
will be used for gradually populating the expanded NTM codes.
Here we will use a for loop
for the sequence that is a number of character strings in temp_NTM
.
To help keep better track of what is being done, the loop first prints the nth NTM code from temp_NTM
. For item 2 the first NTM code is “A1”. Then the code saves the NTM code into a new object this_mst
, and saves a number of characters in this_mst
as a numeric object n_char_numeric
, which is “2” in the examined case.
Then the code creates an object this_level
where it pastes together the word “level” and the value that is found in the MAST_codes
data frame in the cell of the column level that corresponds to the NTM code currently stored in this_mst
. This way we get “level2” stored in object this_level
, which coincides with one of the column titles in MAST_codes
data frame.
To extract the NTM subcodes that are under code “A1”, we create a temporary data frame object MAST_codes_temp
, which contains all rows of MAST_codes
, the column this_level
of which contains the code that is stored in this_mst
. Then we overwrite this_mst
with unique NTM codes contained in column Code of MAST_codes_temp
that are longer than the value stored in n_char_numeric
based on the corresponding value in column level. Them we overwrite temp_NTM_full
by combining values from this_mst
and temp_NTM_full
. The latter now contains the 7 NTM codes that fall under the first NTM code of temp_NTM
, as well as the remaining not yet expanded NTM codes of temp_NTM
. Those will be expanded in the next cycles of this loop execution. The last line of code in this for loop
removes the temporary objects n_char_numeric
, this_level
, and MAST_codes_temp
.
You may wonder why we need to create so many strangely named objects and stuff the code with them. It is because each cycle (or iteration) of the loop uses functions that take multiple arguments, specific values of which are dependent of the NTM code that is being processed. In paragraph above we described processing of NTM code A1. One of the following iterations of the loop will likely process NTM code B851. Try running the code within the loop’s curly brackets for that code only and see how the values of each variable change.
Once this for loop
runs all of its cycles for the number of NTM codes stored in temp_NTM
, the code will overwrite temp_NTM_full
with itself keeping only the unique NTM codes. And then all the resulting NTM codes are stored into sub-item NTM of item 2 of matching.list
as a list vector. Them we use rm
to remove the rest of the temporary objects created in this for loop
. Since this loop is part of the bigger for loop
the code will return in due course to this loop to expand all NTM codes contained in all items of matching.list
.
temp_NTM <- trimws(strsplit(matching.list[[i]]$NTM, ";")[[1]])
temp_NTM_full <- temp_NTM
for (n in 1:length(temp_NTM)){
print(temp_NTM[n])
this_mst <- temp_NTM[n]
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
temp_NTM_full <- c(this_mst, temp_NTM_full)
rm(n_char_numeric, this_level, MAST_codes_temp)
}
temp_NTM_full <- unique(temp_NTM_full)
matching.list[[i]]$NTM <- list(temp_NTM_full)
rm(temp_NTM, n,temp_NTM_full)
Question 10.1: Write code to expand NTM code E3. How many unique codes will be generated? As the last line of code remove all created temporary objects.
A. 10
# Answer A is incorrect.
B. 11
# Answer B is incorrect.
C. 13
# Answer C is correct.
# The correct code is below for reference.
D. 15
# Answer D is incorrect.
Question 10.2: Write nested for loops
to expand all NTM codes in all items of health.list
. How many NTM codes will be recorded in item 23 after such transformation?
A. 34
# Answer A is correct.
# The correct code is below for reference.
B. 25
# Answer B is incorrect.
C. 18
# Answer C is incorrect.
D. 41
# Answer D is incorrect.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and some of these objects you will need for further quizzes and practice sessions.
# Question 1
this_mst <- "E3"
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
this_mst <- unique (this_mst)
length(this_mst)
rm(this_mst, n_char_numeric, this_level, MAST_codes_temp)
# Question 2
for(i in 1:nrow(healthSDG)){
temp_NTM <- trimws(strsplit(health.list[[i]]$NTM, ";")[[1]])
temp_NTM_full <- temp_NTM
for (n in 1:length(temp_NTM)){
#print(temp_NTM[n])
this_mst <- temp_NTM[n]
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
temp_NTM_full <- c(this_mst, temp_NTM_full)
rm(n_char_numeric, this_level, MAST_codes_temp)
}
temp_NTM_full <- unique(temp_NTM_full)
health.list[[i]]$NTM <- list(temp_NTM_full)
rm(temp_NTM, n,temp_NTM_full, this_mst)
}
length(health.list[[23]]$NTM[[1]])
The last section of the big for loop
finalizes preparation of keywords and negative keywords in matching.list
.
We use conditional if...else
statement to check if the sub-item Keywords in item 2 contains any strings. If it does, the statement takes us through the rest of the code within the curly brackets.
First we create a temporary object kw
by copying a long string in sub-item Keywords of item 2, splitting it into a vector of character strings separated by “;” and trimming all leading and trailing white spaces from each substring.
The next line, which uses function tolower
, overwrites kw
with itself by formatting all substrings that do not contain “!” in them to lower case. To specify, which strings need to be formatted, we use code kw[!grepl("!", kw)]
, in which !
before grepl
signals negation and function grepl
searches for a specific pattern in the substrings in kw
. As a result, those keywords, which are abbreviations recorded as all caps and marked with “!” will remain as all caps and will be used for case sensitive keyword matching. The rest of keywords will be used for case insensitive matching. This is done as certain abbreviations may match other words or their parts. Example: Letter combination “ODS”, which as an abbreviation stands for “ozone depleting substances”, may be found in words “goods”, “rods”, “foods”, etc.
Then all the resulting keywords are stored into sub-item Keywords of item 2 of matching.list
as a list vector. Then we use rm
to remove the temporary object kw
.
The same is done for negative keywords.
# Keywords
if(!is.na(matching.list[[i]]$Keywords)){
kw <- trimws(strsplit(matching.list[[i]]$Keywords, ";")[[1]])
kw[!grepl("!", kw)] <- tolower(kw[!grepl("!", kw)])
matching.list[[i]]$Keywords <- list(kw)
kw
rm(kw)
}
#Negative keywords
if(!is.na(matching.list[[i]]$neg_Keywords)){
nkw <- trimws(strsplit(matching.list[[i]]$neg_Keywords, ";")[[1]])
nkw[!grepl("!", nkw)] <- tolower(nkw[!grepl("!", nkw)])
matching.list[[i]]$neg_Keywords <- list(nkw)
nkw
rm(nkw)
}
#}
Question 11.1: We want to see how many strings in row 40 of matching
data frame are abbreviations marked with !
. Which line of code will achieve that? Note that ;
can be used to put two lines of code in one line.
A. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[grepl("!")])
# Answer A is incorrect. `grepl` function is missing an argument indicating a character vector or an object that should be searched for matches.
B. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(grepl("!", KeyW))
# Answer B is incorrect. We use `grepl` function to subset character vector `KeyW`, thus we need to use square brackets for subsetting like so: `KeyW[grepl("!")]`.
C. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[grepl("!", KeyW)])
# Answer C is correct.
D. KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[!grepl("!", KeyW)])
# Answer D is incorrect. Unnecessary negation is indicated with `!` within the square brackets in front of `grepl`.
Question 11.2: Write code to check the number of non-abbreviations in row 38 of matching
data frame. Choose the correct number below.
A. 63
# Answer A is correct.
# The correct code is below for reference.
B. 67
# Answer B is incorrect.
C. 54
# Answer C is incorrect.
D. 58
# Answer D is incorrect.
Once you are done with the quiz, you can check the correct code under this button.
# Question 1
KeyW <- trimws(strsplit(matching$Keywords[[40]], ";")[[1]]); length(KeyW[grepl("!", KeyW)])
# Question 2
KeyW <- trimws(strsplit(matching$Keywords[[38]], ";")[[1]]); length(KeyW[!grepl("!", KeyW)])
rm(KeyW)
matching.list
objectSo, we examined all of the code chunks within the large for loop
that finalizes the setting up of the matching.list
. Here is what item 2 of the matching.list
looks like after all the transformations.
If you run a complete loop for the complete sequence of i (currently, inactivated by hash symbols in code chunk above), all relevant sub-items within all items of matching.list
will be transformed this way.
In fact the code for the complete loop is under this button. Make sure to paste it in RStudio and run it before you move on with the training. It will take a few minutes. Look through it. Do you understand what each line of code does?
# creating `matching.list` once again. we need to do this once again, as we need `matching.list` objects unchanged, so as to demonstrate the workings of the complete full loop
matching.list <- split(matching, seq(nrow(matching)))
for (i in 1:nrow(matching)){
#cat("--------- ", i, " --------- \n")
# HS
x <- strsplit(matching.list[[i]]$HS, ";")[[1]]
x <- as.numeric(substr(trimws(x), 1, 6))
x <- unique(x)
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if(matching.list[[i]]$HS_version=="HS4"){
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digit using the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
}
if(matching.list[[i]]$HS_version=="HS3"){
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes07[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 3. convert HS 2007 codes to HS 2012
x <- unique(x)
x <- data.frame(x)
x <- merge(x,hs_concordance0712, by.x="x", by.y="HS07")
x <- unique(x$HS12)
}
if(matching.list[[i]]$HS_version=="HS5"){
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes17[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes17[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 3. convert HS 2017 codes to HS 2012
x <- data.frame(x)
x <- merge(x,hs_concordance1712, by.x="x", by.y="HS17")
x <- unique(x$HS12)
}
# 4. write the resulting HS codes into the corresponding subitem in "matching.list"
matching.list[[i]]$HS <- list(x)
# NTMs
temp_NTM <- trimws(strsplit(matching.list[[i]]$NTM, ";")[[1]])
temp_NTM_full <- temp_NTM
for (n in 1:length(temp_NTM)){
#print(temp_NTM[n])
this_mst <- temp_NTM[n]
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
temp_NTM_full <- c(this_mst, temp_NTM_full)
rm(n_char_numeric, this_level, MAST_codes_temp)
}
temp_NTM_full <- unique(temp_NTM_full)
matching.list[[i]]$NTM <- list(temp_NTM_full)
rm(temp_NTM, n,temp_NTM_full)
# Keywords
if(!is.na(matching.list[[i]]$Keywords)){
kw <- trimws(strsplit(matching.list[[i]]$Keywords, ";")[[1]])
kw[!grepl("!", kw)] <- tolower(kw[!grepl("!", kw)])
matching.list[[i]]$Keywords <- list(kw)
kw
rm(kw)
}
#Negative keywords
if(!is.na(matching.list[[i]]$neg_Keywords)){
nkw <- trimws(strsplit(matching.list[[i]]$neg_Keywords, ";")[[1]])
nkw[!grepl("!", nkw)] <- tolower(nkw[!grepl("!", nkw)])
matching.list[[i]]$neg_Keywords <- list(nkw)
nkw
rm(nkw)
}
}
rm(x,x2,x4,i,this_mst)
for loop
’s and if...else
statementsTry writing a long for loop
to modify health.list
in the same way that we modified matching.list
in the sections above. If the task seems too intimidating to do from scratch, refer to the complete for loop
in a file with a complete code algorithm (see Data section).
Important: In one of the quizzes above we already implemented such transformation for NTM codes in health.list
. You can see that by clicking onto health.list
object in the Environment tab in the top right corner and then on any of the list items. You will see that NTM subitem is already stored as a list. And we need to achieve the same result for all other subitems of each item. Thus, the loop that you will put together should not have the code for processing NTM codes, otherwise an error message will be generated, and the code will stop running.
First, try to put together or edit the code from the complete algorithm by yourself. Once you gave it a chance click the button below to see the complete code. Look through it. Do you understand what each line of code does?
for (i in 1:nrow(healthSDG)){
cat("--------- ", i, " --------- \n")
# HS
x <- strsplit(health.list[[i]]$HS, ";")[[1]]
x <- as.numeric(substr(trimws(x), 1, 6))
x <- unique(x)
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if(health.list[[i]]$HS_version=="HS4"){
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digit using the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
}
if(health.list[[i]]$HS_version=="HS3"){
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes07[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes07[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 3. convert HS 2007 codes to HS 2012
x <- unique(x)
x <- data.frame(x)
x <- merge(x,hs_concordance0712, by.x="x", by.y="HS07", all.x=TRUE)
x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
x <- unique(x$HS12)
}
if(health.list[[i]]$HS_version=="HS5"){
# 1. test for presence of 4-digit codes and expand them to 6-digit using the right HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes17[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 2. test for presence of 2-digit codes and expand them to 6-digitusing the right HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes17[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 3. convert HS 2017 codes to HS 2012
x <- unique(x)
x <- data.frame(x)
x <- merge(x,hs_concordance1712, by.x="x", by.y="HS17", all.x=TRUE)
x$HS12[is.na(x$HS12)] <- x$x[is.na(x$HS12)]
x <- unique(x$HS12)
}
# 4. write the resulting HS codes into the corresponding subitem in "health.list"
health.list[[i]]$HS <- list(x)
# Keywords
if(!is.na(health.list[[i]]$Keywords)){
kw <- trimws(strsplit(health.list[[i]]$Keywords, ";")[[1]])
kw[!grepl("!", kw)] <- tolower(kw[!grepl("!", kw)])
health.list[[i]]$Keywords <- list(kw)
kw
rm(kw)
}
#Negative keyworkds
if(!is.na(health.list[[i]]$neg_Keywords)){
nkw <- trimws(strsplit(health.list[[i]]$neg_Keywords, ";")[[1]])
nkw[!grepl("!", nkw)] <- tolower(nkw[!grepl("!", nkw)])
health.list[[i]]$neg_Keywords <- list(nkw)
nkw
rm(nkw)
}
}
rm(x,x2,x4,i,this_mst)
Since all the transfomation has been already been implemented to all items of matching.list
, we can get the properly transformed health.list
by just subsetting the matching.list
. What code could work for that?
health.list <- matching.list[3:29]; names(health.list) <-1:27
Now we need to import the data frame that contains the data from the UNCTAD’s TRAINS database of NTMs into R, and them clean up and organize the data. Similarly to the instances above, we use read_excel
to do that. However, this time we need to include argument skip=9
to skip the first nine rows of the sheet titled as “new”, as they do not contain any data relevant to us.
Please note that the excel file with NTM data has been modified to include columns with translations of non-English descriptions into English.19
ntms_all <- read_excel("UNCTAD_i-tip_report_ALL_Measures_x.xlsx", skip=9, sheet="new")
If you look at the column names with names(ntms_all)
, you will see that some of the columns need to be removed, namely: Category, In force, Withdrawn, Also domestic, …17. We can do this by using pipe operator %>%
and function select
of dplyr
package to select specific columns that we want to keep, as is shown below. Using pipe operator %>%
instead of nested functions within multiple round brackets makes the code look more readable.
Note, that a string of code can use a few %>%
and be quite long. In this case you can break it up into a few lines. Each of such lines should end in %>%
, otherwise the code won’t run.
ntms_all <- ntms_all %>% select("Country imposing",
"Partner affected",
"NTM Code",
"HS",
"Measure description",
"National legal basis" ,
"Product description",
"Source",
"Measure description_English",
"Product description_English",
"Source_English",
"National legal basis_English")
Keyword and negative keyword matching will be implemented for all descriptive columns of ntms_all
. So, for convenience we merge all four descriptive columns: first, the original ones and then the translated ones (the latter are marked with _"English"). For that we use function unite
of tidyr
package and include the following arguments:
ntms_all
to specify the source data frame"Description"
to specify the name of the new columnc()
with the names of the columns to be unitedsep=" "
to indicate that a space should be inserted between the combined character stringsremove=TRUE
to indicate that the source columns should be removed from the data framentms_all <- unite(ntms_all, "Description", c(
`Measure description`,
`National legal basis` ,
`Product description`,
`Source`
), sep = " ", remove = TRUE)
ntms_all <- unite(ntms_all, "Description_English", c(
"Measure description_English",
"Product description_English",
"Source_English",
"National legal basis_English"
), sep = " ", remove = TRUE)
For any of the empty united columns, the string “NA” was included in the new column. And if all four columns were empty, four consecutive “NA”s were recorded into the resulting column, like so “NA NA NA NA”. Now, those cells which in column Descriptions do not contain “NA NA NA NA” and for which the corresponding cells in column Description_English do not contain “NA NA NA NA”, need to be overwritten with the content of the corresponding cells in column Description_English. That is, we need to overwrite descriptions in language other than English with the descriptions translated into English.
Having done that, we need to delete now redundant column Description_English. See the relevant code strings below.
ntms_all[ntms_all$Description_English!="NA NA NA NA","Description"] <-
ntms_all[ntms_all$Description_English!="NA NA NA NA","Description_English"]
ntms_all <- ntms_all[,-which(names(ntms_all)=="Description_English")]
And finally, we use pipe operator %>%
and function rename
of dplyr
package to rename some of the columns for more convenient reference. We can also remove some of the objects that we won’t need from workspace, to make the it less cluttered. As the last step you can take a look at ntms_all
with head
function.
ntms_all <- ntms_all %>% rename(reporter = `Country imposing`, partner = `Partner affected`, NTM = `NTM Code`, description="Description")
rm(hs_codes07, hs_codes17, hs_concordance, hs_concordance0712, hs_concordance1712)
head(ntms_all)
## # A tibble: 6 x 5
## reporter partner NTM HS description
## <chr> <chr> <chr> <chr> <chr>
## 1 Venezuela, B~ Argentina, ~ B81 300310, 300320, ~ Resolution MERCOSUR /~
## 2 Afghanistan All Members P11 0102, 2903, 2904~ Livestock is banned f~
## 3 Afghanistan All Members P11 4403 Cuts from all kind of~
## 4 Afghanistan All Members P13 06, 07, 08, 0909~ To export from Afghan~
## 5 Afghanistan All Members P13 27, 71, 72 You can only export m~
## 6 Afghanistan All Members P13 27, 71, 72 In addition to the mi~
Question 12.1: You want to rename columns in HSversion
object we created earlier and to keep only columns 2 through 10. Take a look at that object. Which line of code will achieve that?
A.
HSversion <- HSversion %>%
rename(sdg='2', target='3', product_description='4', list_description='5', hs='6', hsversion='7', ntm='8', kw='9', nkw='10') %>%
select(2,10)
# Answer A is incorrect. In our case the argument in `select` function should include a range of column numbers from 2 through to 10, like so: `select(2:10)`. `select (2, 10)` extracts only two columns: 2 and 10.
B.
HSversion <- HSversion %>%
select(2:10) %>%
rename(sdg='2', target='3', product_description='4', list_description='5', hs='6', hsversion='7', ntm='8', kw='9', nkw='10')
# Answer B is correct.
C. Both.
# Answer C is incorrect.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code string from the question above, and the object you will need for further quizzes and practice sessions.
HSversion <- HSversion %>% rename(sdg='2', target='3', product_description='4', list_description='5', hs='6', hsversion='7', ntm='8', kw='9', nkw='10') %>% select(2:10)
Below we include a new long for loop
that carries out a number of actions to implement matching of data available for China in TRAINS database of NTMs, now stored in ntms_all
, against the SDG-HS-NTM Concordance Matrix, now stored in matching.list
.
As was done with previous long loop, we will break it up into steps that will contain other smaller loops. For demonstration purposes, we will only run the loop for a limited subset of NTM data. To deactivate the loop we pasted hash sign in front of for
s and in front of the closing curly brackets of the loops. The complete code for the loop is available in “complete_matching_code.R” file (see Data section).
ntms_all
data frameThe list of countries, for which NTM data is available in the TRAINS database of NTMs is stored in countries
data frame along with the ISO3 codes for countries. The for loop
below will run the code for all countries for which NTM data is available, as is determined by the opening line for (c in 1:nrow(countries))
, But for demonstration purposes we only select China, which is in row 20 of the countries
data frame. So we assign value of 20
to object c
. Then we create two objects this_country
and this_countryiso3
to store name of the country “China” and its ISO3 code “CHN” respectively.
Then with the last line of code below we subset a new data frame ntms
that contains all NTM data for the cth reporter, which is China in our case, that are available in ntms_all
data frame.
Note: if you want to run analysis of all NTMs for only one country, you can set the sequence of for loop
only for that country like so: for (c=20)
for China. The full index of countries in available in countries
data frame.
Additionally, countries_UNCTAD_iso3.csv
, from which we generated object countries
contains columns with integer values 1 or 0 indicating that some countries are part of certain classification, geographical or economic integration groupings. Therefore, this loop below may be run for such country groupings as well. For example, if you would like to run the code for SPECA countries20 for which NTM data is available in TRAINS database, you could use the following code lines to set up the for loop
.
# import the source excel file
# countries <- read.csv(file = "countries_UNCTAD_iso3.csv", stringsAsFactors = FALSE)
# subset the data frame to only contain rows for countries that are part of SPECA and columns with country names and ISO3 codes
# speca <- countries[countries$speca==1, c(1,2) ]
# set up the `for loop`
# for (c in 1:nrow(speca)){rest of the loop}
Back to our code:
c <-20 # 20 for China
#for (c in 1:nrow(countries)){
this_countryiso3 <- countries$iso3[c]
this_country <- countries$country[c]
cat( "------ ",this_countryiso3 ," ------\n"); flush.console()
## ------ CHN ------
ntms <- ntms_all[ntms_all$reporter==this_country, ]
Now, we create a new object ntms.list
based on the ntms
data frame with number of items equal to the number of rows of ntms
and the sub-items containing the data from the individual columns of the corresponding rows.
ntms.list <- split(ntms, seq(nrow(ntms)))
The following steps of the loop will be examined for China’s NTM stored in item 5998 of ntms.list
, which has the following description:
print(ntms.list[[5998]]$description)
## [1] "The following documents should be provided to the administration of environmental protection of the State Council for the application for exporting hazardous waste: […] 7. Users of the importing country (region) validate certification or permit of hazardous waste treatment or utilization. [...] 11. Business permit of exporters. If exports specify in the collection, storage, treatment or utilization of hazardous waste, they must provide the permit for hazardous waste business. <U+5371><U+9669><U+5E9F><U+7269><U+51FA><U+53E3><U+6838><U+51C6><U+7BA1><U+7406><U+529E><U+6CD5>(<U+7B2C>47<U+53F7><U+4EE4>) Hazardous Waste listed in Annexes I, II, VIII, IX of the Basel Convention, refering to Annex 8 Ministry of environmental Protection of the People's Republic of China - <U+5371><U+9669><U+5E9F><U+7269><U+51FA><U+53E3><U+6838><U+51C6><U+7BA1><U+7406><U+529E><U+6CD5>(<U+7B2C>47<U+53F7><U+4EE4>)"
As was already done before for matching.list
, we check if the HS codes stored in item 5998 are longer than 6-digits and truncate those to 6 first digits. We also expand all codes that are 4 and 2 digit long to the underlying 6-digit HS codes using 2012 version of the HS (within the TRAINS database of NTMs HS codes are assumed to be registered in HS 2012).
For this purpose we use a nested for loop
for the sequence of rows present in object ntms
. As before, we use hash sign before the opening and the closing lines of the loop to deactivate it and, instead, set we i
as equal to 5998. To run the loop for all rows of ntms
you will need to remove the hash signs.
i<-5998
# 1. loop through each row of NTM data in the ntms.list
#for (i in 1:nrow(ntms)){
# 2. split the string stored in subitem "HS" of the ith item into a vector or separate character strings stored in object "x"
x <- list(strsplit(ntms.list[[i]]$HS, ",")[[1]])[[1]]
# 3. overwrite x with itself removing potentially empty strings
x <- x[!is.na(x)]
# 4. truncate any HS codes longer than 6-digit, and overwrite x as a numeric vector removing any duplicating strings
x <- unique(as.numeric(substr(trimws(x), 1, 6)))
# 5. test for presence of 4-digit codes and expand them to 6-digit using the 2012 HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 6. test for presence of 2-digit codes and expand them to 6-digit using the 2012 HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 7. remove all duplicates and write the resulting set of HS codes to subitem "HS" of the ith item as a list vector
x <- unique(x)
ntms.list[[i]]$HS <- list(x)
#cat( "------ ",i ," ------\n");flush.console()
#}
# 8. remove redundant objects
rm(x,x2,x4,i)
In the next step we create an empty data frame df
that will be used to populate results of matching for an individual country - China in our case.
The df
will contain 13 columns that will eventually contain the following information about each NTM linked to an SDG:
ntms.list
matching
matching.list
matching.list
matching.list
matching.list
matching.list
Note, that for each column we include an argument specifying the class of the data to be stored.
df <- data.frame(measure=integer(),
matching=integer(),
nt_length = integer(),
matchhs6share = numeric(),
hs6s = character(),
matchNTM = logical(),
NTM = character(),
matchkw = character(),
matchkwlength = integer(),
matchkwshare = numeric(),
matchnkw = character(),
matchnkwlength = integer(),
matchnkwshare = numeric(),
stringsAsFactors=FALSE)
ntms.list
against the SDG-HS-NTM concordance strings in matching.list
Now we finally come to the steps when we implement the actual matching of NTMs from TRAINS database against the SDG-HS-NTM concordance strings. Below n
stands for the item number in the ntms.list
, while m
is the item number in matching.list
. For our example, we continue with the NTM number 5998, as above, and use item number 36 from matching.list
which references use of NTMs to regulate trans-boundary movement of hazardous chemicals and waste.
The first nested loop runs for the sequence of rows in ntms
data frame and creates the following temporary objects from ntms.list
:
nt
contains HS codes for nth NTM of country c stored in ntms.list
as an atomic vectorthis_mst
contains NTM code for nth NTM of country c stored in ntms.list
expanded to include itself and all underlying NTM codes at all further disaggregated levels of ICNTM (same as was done earlier for matching.list
). This is done to eliminate cases where matching fails due to certain NTMs being registered in TRAINS database using NTM codes at a higher level of disaggregation than is envisaged in matching.list
. This processing is similar to what have been done earlier in section 1.7 above, which describes processing of NTM codes in matching.list
description
contains the text from description column formatted to lower case21description_caps
contains the text from description column without formatting to lower caseNotice, the following 2 lines of code below:
description <- gsub("[\r\n]", "", description)
description <- iconv(description, from = 'UTF-8', to = 'ASCII//TRANSLIT')
In very few cases, NTM description contains text pasted from a table and preserves some symbols that relate to the source formatting. Such symbols may result in error messages and halt execution of code. These two lines remove such symbols.
The first line uses gsub
function to find and remove all line breaks and newlines, which are referenced with regular expressions /r
and /n
. For more details on basic regular expressions in R used for manipulating character strings look, for example, here.
The second line uses function iconv
to convert character encoding from UTF-8 to ASCII//TRANSLIT, which removes accents, but keeps the letters.
n=5998; m=36
#for (n in 1:nrow(ntms)){
nt <- ntms.list[[n]]$HS[[1]]
this_mst <- ntms.list[[n]]$NTM # NTM code of the nth measure from country c
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
rm(n_char_numeric, this_level, MAST_codes_temp)
description <- ntms.list[[n]]$description
description <- gsub("[\r\n]", "", description)
description <- iconv(description, from = 'UTF-8', to = 'ASCII//TRANSLIT')
description_caps <- description
description <- tolower(description)
The further nested loop does the following for the sequence of rows in matching
data frame:
mt
from matching.list
which contains all HS codes from mth item of matching.list
as an atomic vectorntn
which contains those HS codes from nt
that are present in mt
: in our examined case, all 754 HS codes present in the 5998th NTM of China match the HS codes present in the 36th item of the matching.list
matchhs6share
, which contains the value for share of a country c’s 6-digit HS codes stored in nt
that match codes stored in mt
#for (m in 1:nrow(matching)){
mt <- matching.list[[m]]$HS[[1]]
ntn <- nt[nt %in% mt]
matchhs6share <- length(ntn)/length(nt)*100
matchNTM
which stores a logical value for whether or not NTM code present in the nth NTM of country c matches NTM code in the mth item in matching.list
, which in our case is TRUENTM
which contains the NTM code present in the nth NTM of country c matchNTM <- any(this_mst %in% matching.list[[m]]$NTM[[1]])
NTM <- ntms.list[[n]]$NTM
kw
contains keywords from mth item of matching.list
as an atomic vector with keywords marked with exclamation marks removed22kw_caps
contains only the keywords marked with exclamation marks (the exclamations marks themselves are removed)nkw
and nkw_caps
similarly to the aboveOnce again, this is done to separate keywords that are abbreviations (all-caps character strings) and that will be used for case-sensitive matching, from all other keywords that will be used for case-insensitive matching. Notice, that to create a list of all-caps keywords we use function grepl
to include criterion to find and extract only those keywords that have !
in them. And later, we use function gsub
on kw_caps
vector to find and remove all !
from the character strings. And then, we use grepl
function once again to remove from kw
vector all keywords that contain !
.
kw <- matching.list[[m]]$Keywords[[1]]
kw_caps <- kw[grepl("!", kw)]
kw_caps <- gsub("!","",kw_caps)
kw <- kw[!grepl("!", kw)]
nkw <- matching.list[[m]]$neg_Keywords[[1]]
nkw_caps <- nkw[grepl("!", nkw)]
nkw_caps <- gsub("!","",nkw_caps)
nkw <- nkw[!grepl("!", nkw)]
For some of the items in matching.list
the keywords are not necessary and are not provided. For them the matching is considered complete if there are only 1 matching HS code and 1 matching NTM code present in a given NTM. So below we include a conditional if...else
statement that prescribes two alternative treatments of a country’s NTM description, depending on availability of keywords in matching.list
.
If length of vector kw
is equal to 1 and is empty, then the following temporary objects should be created: * matchkw
with value “NO KW INCL”; * matchkwlength
with value 1, * matchkwshare
with value 100.
Otherwise, the following objects should be created:
matchkw
with strings that contain those words and words combinations from an NTM description that strictly match the words and words combinations that are present in description
and description_caps
matchkwlength
with a value equal to the length of vector matchkw
indicating a number of matching expressionsmatchkwshare
with a value for the share of keywords provided in mth item of matching.list
present in the description of nth NTM of country cNotice the use of function stri_detect_fixed
which is used to look for case-sensitive matches for a certain pattern (second argument) in a specified characters string (first argument). Case-sensitivity is a default setting for this function.
The following if...else
statements indicates, that if matchkwlength
is equal to 0, then matchkw
should be made empty.
if(length(kw)==1&(is.na(kw[[1]]))){
matchkw <- "NO KW INCL"
matchkwlength <- 1
matchkwshare <- 100
} else
{
matchkw <- kw[stri_detect_fixed(description,kw)]
matchkw_caps <- kw_caps[stri_detect_fixed(description_caps,kw_caps)]
matchkw <- c(matchkw, matchkw_caps)
matchkwshare <- length(matchkw)/length(kw)*100
matchkwlength <- length(matchkw)
if(length(matchkw)==0){matchkw <- ""}
}
Absence of negative keywords in a given concordance string of matching.list
does not affect anything. In fact most of the items in matching.list
do not include any negative keywords. So, we create objects matchnkwlength
, matchnkwshare
and matchnkw
with default values of 0
, 0
or ``.
However, to take into account those concordance strings of matching.list
that do contain negative keywords, we use the following conditional if...else
statement, that is similar to the one used for keywords above.
matchnkwlength <- 0
matchnkwshare <- 0
matchnkw <- ""
if (length(nkw)>0&!(is.na(nkw[[1]]))){
matchnkw <- nkw[stri_detect_fixed(description,nkw)]
matchnkw_caps <- nkw_caps[stri_detect_fixed(description,nkw_caps)]
matchnkw <- c(matchnkw, matchnkw_caps)
matchnkwshare <- length(matchnkw)/length(nkw)*100
matchnkwlength <- length(matchnkw)
if(length(matchnkw)==0){matchnkw <- ""}
}
Now let’s store the results in the currently empty data frame df
, by creating an additional empty row in the data frame using argument [(nrow(df)+1),]
and populating it with the values from the temporary objects generated above.
As the loops take each of the multiple entries of a country’s NTMs through each of the 67 concordance strings, printing the line __ helps to keep track of the progress made.
Function flush.console
is not necessary, but it ensures that the display of output in the R console is current, even if output buffering is on.
df[(nrow(df)+1),] <-
c(
n,
m,
length(ntn),
matchhs6share,
paste(ntn,collapse=";"),
matchNTM,
NTM,
paste(matchkw,collapse=";"),
matchkwlength,
matchkwshare,
paste(matchnkw,collapse=";"),
matchnkwlength,
matchnkwshare
)
cat( "------ c: ", this_countryiso3, ", n: ", n ,", m: ", m ," ------\n");
flush.console()
#}
#}
Below we add new columns or overwrite the existing ones with the following values:
TRUE
;Note, that we are specifically interested in those NTMs that contain a simultaneous match to one of the possible values outlined in concordance strings in matching.list
, while lacking any of the negative keywords (if they are provided in matching.list
). Such NTMs in column index would contain value 3. The rest we can remove, with last line of code in the vignette below, which indicates that in df
we only keep those rows that contain value 3 in column index.
df$hs_match <- as.integer(df$matchhs6share > 0)
df$kw_match <- as.integer(df$matchkwshare > 0)
df$matchNTM <- as.integer(as.logical(df$matchNTM))
df$nkw_match <- as.integer(df$matchnkwshare > 0)
df$index <- df$hs_match + df$kw_match + df$matchNTM - df$nkw_match
df <- df[df$index==3, ]
As the result of the actions above we know that China’s 5998th NTM is a full match to a concordance string 36, which assigns this NTM to SDG Target 12.4 of responsible management of hazardous chemicals and waste, as it contains 754 HS codes, 3 keywords and an NTM code that matched the possible values stored in the concordance string 36, while no negative keywords were listed for concordance string 36 of matching.list
.
To finalize analysis summary for our sample, we create a new data frame df2
, in which we merge df
by column matching and matching
by column row. This way the resulting df2
includes information on which SDG target each matched NTM is linked to.
Now we can export the summary df2
by using write.csv()
, in the argument of which we indicate what to export (df2
), how to name the exported file (paste0("countries/", "detailed", this_countryiso3, ".csv")
), not to include row names(row.names = FALSE
) and what to store in empty cells (na="").
Take a close look at the first argument in the nested paste0
function. It has a forward slash in it - "countries/"
. This is actually a name of the sub-folder in our working directory into which we are saving the generated country files. This sub-folder has to be created in advance, otherwise you will get an error message. The following two arguments are elements of the file name, the second part of which will depend on the string stored in this_countryiso3
object at the start of a given loop cycle. And the last mandatory element is the file extension ".csv"
.
And then we remove all now redundant objects.
df2 <- merge(df, matching[c("row", "Target")], by.x="matching", by.y="row")
write.csv(df2, paste0("countries/", "detailed", this_countryiso3, ".csv"), row.names = FALSE, na="")
rm(c, description, description_caps, kw, kw_caps,m, matchkw, matchkw_caps, matchkwlength, matchnkw, matchnkw_caps, matchnkwlength, matchNTM, mt, n, nkw, nkw_caps, nt, NTM, ntms, ntn, this_country, this_countryiso3,this_mst, matchhs6share, matchkwshare, matchnkwshare, matchHS, hs6s, ntms.list)
#}
As mentioned earlier, the complete matching loop for all countries for which data is available in ntms_all
data frame is provided in the R file with the complete code algorithm and also under the button below. Take a look at it first. Do you understand what each line of code does?
Running the code for all the countries and even just for a selected group of countries may take a long time, depending of the power of your PC. Alternatively, you can revise the first line of the loop for (c in 1:nrow(countries)){
to for (c in 1:1){
and only run the complete loop for Afghanistan. It should take just a couple of minutes, and you will be able to see the complete country file generated by this loop and saved in folder countries
in your working directory.
For the purpose of this training and for your convenience we have already run the complete loop for all 88 countries. The complete set of country files is available here in a zip file. Download the zip file and extract it into the countries
subfolder of your working directory (make sure to overwrite all files that are already present in that folder). That is where the output files would have been saved if we ran the complete loop for all countries. Let’s pretend that that is what just happened ;-)
## ----for loop setup 1 ----
for (c in 1:nrow(countries)){
this_countryiso3 <- countries$iso3[c]
this_country <- countries$country[c]
cat( "------ ",this_countryiso3 ," ------\n");flush.console()
ntms <- ntms_all[ntms_all$reporter==this_country, ]
## ----for loop setup 2 ----
ntms.list <- split(ntms, seq(nrow(ntms)))
## ----for loop setup 3 ----
# 1. loop through each row of NTM data in the ntms.list
for (i in 1:nrow(ntms)){
# 2. split the string stored in subitem "HS" of the ith item into a vector or separate character strings stored in object "x"
x <- list(strsplit(ntms.list[[i]]$HS, ",")[[1]])[[1]]
# 3. overwrite x with itself removing potentially empty strings
x <- x[!is.na(x)]
# 4. truncate any HS codes longer than 6-digit, and overwrite x as a numeric vector removing any duplicating strings
x <- unique(as.numeric(substr(trimws(x), 1, 6)))
# 5. test for presence of 4-digit codes and expand them to 6-digit using the 2012 HS version
if (length(x[x>99&x<=9999])>0){
x4 <- merge(data.frame(x = x[x>99&x<=9999]), hs_codes12[,c("hs4", "hs6")], by.x="x", by.y="hs4")[,"hs6"]
x <- x[-which(x>99&x<=9999)]
x <- c(x,x4)
}
# 6. test for presence of 2-digit codes and expand them to 6-digit using the 2012 HS version
if (length(x[x<=99])>0){
x2 <- merge(data.frame(x = x[x<=99]), hs_codes12[,c("hs2", "hs6")], by.x="x", by.y="hs2")[,"hs6"]
x <- x[-which(x<=99)]
x <- c(x,x2)
}
# 7. remove all duplicates and write the resulting set of HS codes to subitem "HS" of the ith item as a list vector
x <- unique(x)
ntms.list[[i]]$HS <- list(x)
cat( "------ ",i ," ------\n");flush.console()
}
# 8. remove redundant objects
rm(x,x2,x4,i)
## ----for loop setup 4 ----
df <- data.frame(measure=integer(),
matching=integer(),
nt_length = integer(),
matchhs6share = numeric(),
hs6s = character(),
matchNTM = logical(),
NTM = character(),
matchkw = character(),
matchkwlength = integer(),
matchkwshare = numeric(),
matchnkw = character(),
matchnkwlength = integer(),
matchnkwshare = numeric(),
stringsAsFactors=FALSE)
## ----for loop matching 1 ----
for (n in 1:nrow(ntms)){
nt <- ntms.list[[n]]$HS[[1]]
this_mst <- ntms.list[[n]]$NTM # NTM code of the nth measure from country c
n_char_numeric <- as.numeric(nchar(this_mst))
this_level <- paste0("level", MAST_codes[MAST_codes$Code==this_mst, "level"])
MAST_codes_temp <- MAST_codes[(MAST_codes[, this_level]==this_mst),]
this_mst <- unique(MAST_codes_temp[MAST_codes_temp$level>=n_char_numeric,"Code"])
rm(n_char_numeric, this_level, MAST_codes_temp)
description <- ntms.list[[n]]$description
description <- gsub("[\r\n]", "", description)
description <- iconv(description, from = 'UTF-8', to = 'ASCII//TRANSLIT')
description_caps <- description
description <- tolower(description)
## ----for loop matching 2 ----
for (m in 1:nrow(matching)){
mt <- matching.list[[m]]$HS[[1]]
ntn <- nt[nt %in% mt]
matchhs6share <- length(ntn)/length(nt)*100
## ----for loop matching 3 ----
matchNTM <- any(this_mst %in% matching.list[[m]]$NTM[[1]])
NTM <- ntms.list[[n]]$NTM
## ----for loop matching 4 ----
kw <- matching.list[[m]]$Keywords[[1]]
kw_caps <- kw[grepl("!", kw)]
kw_caps <- gsub("!","",kw_caps)
kw <- kw[!grepl("!", kw)]
nkw <- matching.list[[m]]$neg_Keywords[[1]]
nkw_caps <- nkw[grepl("!", nkw)]
nkw_caps <- gsub("!","",nkw_caps)
nkw <- nkw[!grepl("!", nkw)]
## ----for loop matching 5 ----
if(length(kw)==1&(is.na(kw[[1]]))){
matchkw <- "NO KW INCL"
matchkwlength <- 1
matchkwshare <- 100
} else
{
matchkw <- kw[stri_detect_fixed(description,kw)]
matchkw_caps <- kw_caps[stri_detect_fixed(description_caps,kw_caps)]
matchkw <- c(matchkw, matchkw_caps)
matchkwshare <- length(matchkw)/length(kw)*100
matchkwlength <- length(matchkw)
if(length(matchkw)==0){matchkw <- ""}
}
## ----for loop matching 6 ----
matchnkwlength <- 0
matchnkwshare <- 0
matchnkw <- ""
if (length(nkw)>0&!(is.na(nkw[[1]]))){
matchnkw <- nkw[stri_detect_fixed(description,nkw)]
matchnkw_caps <- nkw_caps[stri_detect_fixed(description,nkw_caps)]
matchnkw <- c(matchnkw, matchnkw_caps)
matchnkwshare <- length(matchnkw)/length(nkw)*100
matchnkwlength <- length(matchnkw)
if(length(matchnkw)==0){matchnkw <- ""}
}
## ----for loop matching 7 ----
df[(nrow(df)+1),] <-
c(
n,
m,
length(ntn),
matchhs6share,
paste(ntn,collapse=";"),
matchNTM,
NTM,
paste(matchkw,collapse=";"),
matchkwlength,
matchkwshare,
paste(matchnkw,collapse=";"),
matchnkwlength,
matchnkwshare
)
cat( "------ c: ", this_countryiso3, ", n: ", n ,", m: ", m ," ------\n");
flush.console()
}
}
## ----for loop matching 8 ----
df$hs_match <- as.integer(df$matchhs6share > 0)
df$kw_match <- as.integer(df$matchkwshare > 0)
df$matchNTM <- as.integer(as.logical(df$matchNTM))
df$nkw_match <- as.integer(df$matchnkwshare > 0)
df$index <- df$hs_match + df$kw_match + df$matchNTM - df$nkw_match
df <- df[df$index==3, ]
## ----for loop matching 9 ----
df2 <- merge(df, matching[c("row", "Target")], by.x="matching", by.y="row")
write.csv(df2, paste0("countries/", "detailed", this_countryiso3, ".csv"), row.names = FALSE, na="")
rm(c, description, description_caps, kw, kw_caps,m, matchkw, matchkw_caps, matchkwlength, matchnkw, matchnkw_caps, matchnkwlength, matchNTM, mt, n, nkw, nkw_caps, nt, NTM, ntms, ntn, this_country, this_countryiso3,this_mst, matchhs6share, matchkwshare, matchnkwshare, matchHS, hs6s, ntms.list)
}
Question 13.1: In healthSDG
we want to replace targets with their numbers. What line of code below would achieve that?
A. healthSDG$Target < gsub("Target", "", healthSDG$Target)
# Answer A is incorrect. `<-` should be used instead of `<`.
B. healthSDG$Target <- gsub("Target", "", Target)
# Answer B is incorrect. Should use `healthSDG$Target` to reference column __Target__ of object `healthSDG`.
C. healthSDG$Target <- gsub("Target", , healthSDG$Target)
# Answer C is incorrect. Replacement character string should be indicated within quotation marks, even if it is empty.
D. healthSDG$Target <- gsub("Target", "", healthSDG$Target)
# Answer D is correct.
Question 13.2: We want to see if HS code 263628 is associated with any of the SDG targets. What line of code would achieve that?
A. grepl(263628, healthSDG$HS)
# Answer A is incorrect. Character string should be indicated within quotation marks.
B. grepl("263628", healthSDG$HS)
# Answer B is correct.
C. grepl("263628", "HS")
# Answer C is incorrect. `healthSDG$HS` should be used to reference column __HS__ of object `healthSDG`.
D. grepl("263628" %in% healthSDG$HS)
# Answer D is incorrect. Arguments of function `grepl` are indicated incorrectly.
Question 13.3: Now we want to see what unique SDG targets have HS code 263628 associated with them. Please write code and choose the correct answer below.
A. " 3.1" " 3.2" " 3.3" " 3.5" " 3.7" " 3.8" " 3.b"
# Answer A is incorrect.
B. " 3.1" " 3.2" " 3.3" " 3.4" " 3.7" " 3.9" " 3.b"
# Answer B is incorrect.
C. " 3.1" " 3.2" " 3.3" " 3.4" " 3.7" " 3.8" " 3.b"
# Answer C is correct.
# The correct code is below for reference.
D. " 3.1" " 3.2" " 3.3" " 3.4" " 3.7" " 3.8" " 3.a"
# Answer D is incorrect.
Question 13.4: Use functions strsplit
, trimws
, length
and operator %in%
to check if any of the HS codes in row 20 of healthSDG
are present among HS codes recorded in row 1 of healthSDG
. You can either nest the functions to make one string of code, or write separate lines of code for each of separate steps. What is the correct answer?
A. 0
# Answer A is incorrect.
B. 36
# Answer B is incorrect.
C. 85
# Answer C is correct.
# The correct code is below for reference.
D. 51
# Answer D is incorrect.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
# Question 1
healthSDG$Target <- gsub("Target", "", healthSDG$Target)
# Question 2
grepl("263628", healthSDG$HS)
# Question 3
unique(healthSDG$Target[grepl("263628", healthSDG$HS)])
# Question 4
length(trimws(strsplit(healthSDG$HS[[20]], ";")[[1]]) %in% trimws(strsplit(healthSDG$HS[[1]], ";")[[1]]))
Now we have a set of 88 .csv files with the results of SDG-HS-NTM matching for all of the countries present in TRAINS database of NTMs. The .csv files only contain data on NTMs that were matched to various lines of matching.list
. The files have been stored in countries
folder inside your working directory (we set it in section 1.1 of this module) . We can now do a number of things that will help us review the results of matching and to create illustrative plots.
However, first let’s take a look at matching results for all of China’s NTMs. Having run the full loop for all China we can see that the total of 2662 unique NTMs are a full match to various concordance strings of matching.list
.
You will notice that the resulting data frame for China contains 6635 rows corresponding to each instance of matched NTM, while we analyzed 7357 NTMs of which only 2662 unique NTMs are in fact found to match SDG targets.
This discrepancy is due to a number of reasons:
some concordance strings in matching
data frame included more than one target. For examples refer to a screenshot of the unedited imported matching
data frame in section Import the SDG-HS-NTM concordance matrix into RStudio above. We later expanded matching
data frame into long format by column Target. Therefore, when summarizing the results and preparing graphs we will use code to take this into account by removing duplicate rows in summary data frames.
some concordance strings contained alternative sets of HS codes yielded by research. These were expanded into long format manually in the source Excel files during the research stage. For example, see concordance strings for Target 12.4 in matching
. Again, when summarizing the results and preparing plots we will use code to take this into account by removing duplicate rows in summary data frames.
some of the NTMs may match more than one target, because, as per their description, they intend to address more than one issue. This is especially true for regulation of trade in so called sensitive goods, the broad group of which may include: hazardous substances and waste, medicines, toxins, viruses, microorganisms, chemicals, wastes, narcotics, living modified organisms, alien species, radioactive substances, guns, weapons, explosives and their precursors, dual use technologies, tobacco, rough diamonds, endangered species of flora and fauna, and cultural and historical heritage items, etc. In this case simultaneous matching with a few concordance strings cannot be considered as duplication.
Let’s first review in detail what .csv files we have generated as a result of matching. We can use list.files
function to retrieve a vector of .csv file names.
files_csv <- list.files("countries", pattern = "\\.csv$")
In some cases it may be useful to indicate relative path to a file or a folder, i.e. relative to the preset working directory. In this case the first line of code would look like files_csv <- list.files("./countries", pattern = "\\.csv$")
, where ./
refers to the working directory and can be used by itself.
We then use data.frame
function to turn it into a data frame containing a file name and a ISO3 country codes. The latter we derive from the names of the files by using substr
function to extract the last three symbols before the file extension (all characters from 9th till 11th).
Notice, argument pattern = "\\.csv$"
used with list.files
function. It uses regular expression "\\.csv$"
indicating, that we are selecting files what have ‘.csv’ at the end of their names. If the folder does not contain any files with extensions other than .csv, we can actually drop that argument.
Note , that when working with character strings in R, .
is used to represent any character, except new line. Therefore, if we want to detect a period in a character string, we need to include \\
right before it. $
is used to indicate that the character combination is to be found at the end of the file names. For more details on basic regular expressions in R used for manipulating character strings refer, for example, to https://stringr.tidyverse.org/articles/regular-expressions.html
files_csv <- data.frame(files_csv, stringsAsFactors = FALSE)
files_csv$ISO3 <- substr(files_csv$files_csv, 9, 11)
names(files_csv) <-c("file_name", "ISO3")
head(files_csv)
## file_name ISO3
## 1 detailedAFG.csv AFG
## 2 detailedARE.csv ARE
## 3 detailedARG.csv ARG
## 4 detailedATG.csv ATG
## 5 detailedAUS.csv AUS
## 6 detailedBEN.csv BEN
Note that since ISO3 code for EU is a two character string, the corresponding string in column 2 now contains a ‘.’, that we need to remove.
print(files_csv[29,2])
## [1] "EU."
We can use gsub
function to find “.” in the resulting strings in ISO3 and to remove them.23
files_csv$ISO3 <- gsub("\\.", "", files_csv$ISO3)
print(files_csv[29,2])
## [1] "EU"
Question 14.1: You want to retrieve all files and folders from your working directory into a data frame that has file names in column 1, which you want to name as __names_. Which line of code will achieve that?
A.
files <- data.frame(list.files(getwd), stringsAsFactors = FALSE)
names(files) <- "names"
# Answer A is incorrect. `getwd()` should be used instead of `getwd`.
B.
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE)
names(files) <- "names"
# Answer B is correct.
C.
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE) %>% names("names")
# Answer C is incorrect. Argument of function `names` should contain the name of the data frame object it is applied to. Code overwriting the name of the column is also missing.
D.
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE)
names <- "names"
# Answer D is incorrect. Argument of function `names` should contain the name of the data frame object it is applied to.
Question 14.2: In files
data frame you want to keep only those rows that contain names of the files, and not the folders (file names have file extensions that have a dot in them), and then to check the number of the resulting rows. Which line of code will achieve that?
A.
files == files[grepl("\\.", files$names), ,drop=FALSE]
nrow(files)
# Answer A is correct.
B.
files <- files[grepl(".", files$names), ,drop=FALSE]
nrow(files)
# Answer B is incorrect. `.` in R represents any character except a newline. So if we need to indicate a literal `.`, it has to be preceeded by two escape signs like so `\\.`.
C.
files <- files[grepl("\\.", names), ,drop=FALSE];
nrow(files)
# Answer C is incorrect. `files$names` should be used to reference column __names__ of object `files`.
D.
files <- files[grepl("\\.", files$names), ,drop=FALSE]
nrow()
# Answer D is incorrect. Argument specifying the name of a data frame object has to be indicated for function `nrow()`.
Question 14.3: To reduce code you want to retrieve the number of all files (and not folders) from your working directory in one code string. Which line of code will achieve that?
A. nrow(data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE))
# Answer A is not entirely correct.
B. All
# Answer B is correct.
C. data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE) %>% nrow()
# Answer C is not entirely correct.
D. data.frame(list.files("./", pattern="\\."), stringsAsFactors = FALSE) %>% nrow()
# Answer D is not entirely correct.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above.
# Question 1
files <- data.frame(list.files(getwd()), stringsAsFactors = FALSE); names(files) <- "names"
# Question 2
files <- files[grepl("\\.", files$names), ,drop=FALSE]; nrow(files)
# Question 3
nrow(data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE))
data.frame(list.files(getwd(), pattern="\\."), stringsAsFactors = FALSE) %>% nrow()
data.frame(list.files("./", pattern="\\."), stringsAsFactors = FALSE) %>% nrow()
ntms_all
data framentms_all
data frame that we created earlier contains data on all of the NTMs present in TRAINS database of NTMs. We can now populate it with the data from the .csv files that we generated, so that the ntms_all
contains data on all of the NTMs: those that were found to be linked to SDGs and those that were found not to have such linkage.
In ntms_all
let’s now add a column numbering all the measures within each individual reporter country. Later, this will allow us merging ntms_all
and the generated .csv files by both country and NTM number.
First, we use function unique
to extract the list of all unique countries present in reporter column of ntms_all
. And then we use for loop
, which for each of the unique countries stored in unique_countries
object adds measure column to ntms_all
with measure indices from 1 up to a total number of NTMs of a given country stored in ntms_all
.
unique_countries <- unique(ntms_all$reporter)
for (j in 1:length(unique_countries)){
ntms_all[ntms_all$reporter==unique_countries[j],"measure"] <- 1:nrow(ntms_all[ntms_all$reporter==unique_countries[j],])
}
For further convenience we add an iso3 column to ntms_all
by merging it with countries
data frame. We use left_join
function, which does the same thing as merge
function that we used earlier. The first two arguments indicate the names of the objects to merge, and argument by=c()
specifies the column names of the two data frames, by which both objects should be merged.
Notice, that both in left_join
and in merge
you can indicate a specific subset of a data frame to be merged. In our case, we use countries[,c("country", "iso3")]
to indicate that we merge ntms_all
with a subset of countries
data frame that has only two columns: countries and iso3.
ntms_all <- left_join(ntms_all, countries[,c("country", "iso3")], by=c("reporter" = "country"))
Before we populate ntms_all
with the results of SDG-HS-NTM matching, let’s add five more columns with empty cells as below.
ntms_all$matching <- NA
ntms_all$NTM_temp <- NA
ntms_all$Target <- NA
ntms_all$hs6s <- NA
ntms_all$matchkw <- NA
Below we include for loop
what will populate ntms_all
with data from the generated files, the list of which is stored in object files_csv
.
For the sequence of all the generated .csv files, it does the following:
temp_file
which imports the fth .csv file from the specified directorytemp_file
with itself preserving only rows for which values in column matchnkwshare are 0, only keeping columns matching, measures, NTM, hs6s, matchkw, and Targetntms_all_temp
only containing NTMs for the fth country, and removing columns matching, NTM_temp, hs6s, matchkw and Targetntms_all
of all the rows for country fntms_all_temp
with the temp_file
by column measure of eachntms_all_temp
to the bottom of ntms_all
by creating the necessary number of rows at the bottom of ntms_all
and pasting ntms_all_temp
data into themfor (f in 1:nrow(files_csv)){
temp_file <- read.csv(paste0("countries/", files_csv$file_name[f]), stringsAsFactors = FALSE)
temp_file <- temp_file[temp_file$matchnkwshare==0, c("matching", "measure", "NTM", "Target", "hs6s", "matchkw")]
ntms_all_temp <- ntms_all[ntms_all$iso3==files_csv$ISO3[f], -c(which(names(ntms_all)%in%c("matching", "NTM_temp", "Target", "hs6s", "matchkw")))]
ntms_all <- ntms_all[ntms_all$iso3!=files_csv$ISO3[f],]
ntms_all_temp <- left_join(ntms_all_temp, temp_file, by="measure")
nextrow <- nrow(ntms_all)+1
ntms_all[nextrow:(nextrow+nrow(ntms_all_temp)-1),] <- ntms_all_temp
rm(ntms_all_temp)
# cat( "(f:)",f ,"\n");flush.console()
}
rm(f, nextrow, unique_countries,j, df, df2, MAST_codes, hs_codes, hs_codes12, matching.list)
write.csv(ntms_all, "ALL_NTMs_after_matching_191217.csv", row.names = FALSE, col.names = TRUE)
#ntms_all <- read.csv("ALL_NTMs_after_matching_191217.csv", stringsAsFactors = FALSE, header = TRUE)
Look below at the two lines of code extracted from the code chunk above, and specifically at nrow(ntms_all)+1
and nextrow:(nextrow+nrow(ntms_all_temp)-1)
. Note: Don’t paste these two lines of code into your R script, as they are already present in the code chunk above.
nextrow <- nrow(ntms_all)+1
ntms_all[nextrow:(nextrow+nrow(ntms_all_temp)-1),] <- ntms_all_temp
If you already went through the Training on using R for trade analysis you may remember that R can be used for arithmetic operations both with numbers, and objects that have numbers stored in them, and functions that generate numbers. This is exactly what is done here.
First, object nextrow
is created that stores a sum of a number for the amount of rows in ntms_all
data frame in a given cycle of for loop
execution and 1. In the first run of the loop that number would be 61652+1=61653
. We need +1
, as we need to post the rows from ntms_all_temp
into the next row created after the existing last row in ntms_all
.
To set a range of numbers we use :
, as in 1:10
, which refers to a range of numbers from 1 through to 10. Likewise, in [nextrow:(nextrow+nrow(ntms_all_temp)-1),]
we refer to all rows from a number stored in nextrow
through to a number that is the sum of a number stored in nextrow
plus the number of rows in ntms_all_temp
at a given cycle of for loop
execution minus 1. In the first run of the loop that would be 61653+285-1=61937
. So nextrow:(nextrow+nrow(ntms_all_temp)-1)
sets the range of rows from 61653 through to 61937 into which to paste the data.
Since these rows do not exist yet, referencing them and writing data into them creates them at the bottom of ntms_all
.
Now ntms_all
contains data on all of the NTMs in TRAINS database: those that were found to be matched to any of the list items of matching.list
and those that were found not to have any matches.
Target column contains information on what SDG target a given NTM was linked to. Some of the Targets have more than one dedicated concordance strings in matching.list
, so we have a very detailed presentation of data. We can see, which HS codes, NTM codes and keywords of a given NTM matched to what list item of matching.list
.
We exported this data frame as an .csv file into the working directory using write.csv
function, as per the code above. However, as we continue our data review and move on to creating graphs, we will need to remove duplicates as necessary.
Question 15.1: In practice_matching
data frame there is a total of 204 rows. The rows are assigned index of 0 or 1 in its first column. You want to create a new column index_number numbering all the measures within each of the individual categories. Which of the lines of code will achieve that?
A.
unique_index <- unique(practice_matching[[1]])
for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}
# Answer A is correct.
B.
unique_index <- unique(practice_matching[[1]])
for (j in 1:nrow(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}
# Answer B is incorrect. Function `length` should be used to set the sequence of `for loop` in this case, as object `unique_index` is a vector.
C.
unique_index <- unique(practice_matching[[1]])
for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}
# Answer C is incorrect. Need to indicate a range of numbers to be assigned to the specified rows, like so: `1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])}`.
D.
unique_index <- unique(practice_matching[[1]])
for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],index_number] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}
# Answer D is incorrect. Name of the column in square brackets has to be indicated within quotation marks.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the question above, and the object you will need for further quizzes and practice sessions.
# Question 1
unique_index <- unique(practice_matching[[1]])
for (j in 1:length(unique_index)){
practice_matching[practice_matching[[1]]==unique_index[j],"index_number"] <- 1:nrow(practice_matching[practice_matching[[1]]==unique_index[j],])
}
rm(unique_index)
To create a bar chart for distribution of NTMs addressing at least one SDG target, as the one shown below, we can use ntms_all
to create a data frame ntms_all_Target
, in which we remove columns matching, hs6s and matchkw and keep only the unique rows (i.e. remove duplicate rows).
ntms_all_Target <- ntms_all[, -which(names(ntms_all) %in% c("matching", "hs6s", "matchkw"))]
ntms_all_Target <- unique(ntms_all_Target)
We can now use functions from dplyr
package to group NTMs by Target, remove row for which value in Target is not available, and then we can use barplot
function to create a bar plot. Run ?barplot
to see details on the purpose of various arguments of barplot
function. In our case we specified such arguments as values to plot, title of the plot, color theme for the bars, source for bar labels, orientation of text in bar labels, size of text in bar labels and in label of axis y, text to label axis y. You can use more or less arguments to customize your bar chart.
Note, that when you are using a function from a particular package, it may sometimes be useful to specify that package in the code, like we did below by appending dplyr::
in front of functions count
and group_by
. It is important to do, when you have loaded more than one package that contain functions that share one name.
By_target <- ntms_all_Target %>% dplyr::group_by(Target) %>% dplyr::count() %>% filter(!is.na(Target))
barplot(By_target$n, main="NTMs distribution by SDG targets", col=rainbow(25),names.arg = By_target$Target, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Number of NTMs directly addressing SDGs", ylim=c(0,10000))
It would be more informative to put SDG targets in their order. For that we add a new column Number to By_target
into which we extract the target numbers converted to numbers. We use function nchar
to generate argument for the end symbol within function substr
.
Notice that some of the Target numbers contain letters in them, such as 3.a. These cannot be converted to numeric format, and are left empty. We can replace NA
with a number that would put the target in the correct row once we sort the data frame. In our case we use 3.91 to put this row behind Target 3.9. Then we use order
function to order the data frame by column Number, and run the barplot
function once again.
Now all SDG targets are in the right order and look nice on the bar plot. To make it more informative we can add a column with brief titles for the targets that reflect their sense, and use those as labels. We will do similar task for other graphs below.
If R functionality for making charts seems too confusing or limited, you can always use function write.table
to copy the data to clipboard and then to paste it into excel file for further manipulation. We will do that for one of the charts below.
By_target$Number <- as.numeric(substr(By_target$Target, 8, nchar(By_target$Target)))
which <- which(By_target$Target=="Target 3.a")
By_target$Number[[which]] <- 3.91
By_target <- By_target[order(By_target$Number),]
barplot(By_target$n, main="NTMs distribution by SDG targets", col=rainbow(25),names.arg = By_target$Target, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Number of NTMs directly addressing SDGs", ylim=c(0,10000))
However, one needs to bear in mind that at the target level there is a lot of duplication as the result of the methodology used to generate the concordance strings. Some of the concordance strings in the original matrix are assigned to a group of targets within the same Goal. At the same time only certain targets within only 9 SDGs have a concordance strings dedicated to them, so it is good to see what specific objectives NTMs linked to SDGs aim to address.
In this connection, it may be more useful to group NTMs linked to SDG targets by their subject areas. Based on the methodology, we suggest using the groupings as per the table below.
Goal | Subject area | Rows in matching df |
---|---|---|
Goal 2 | Food security | 1-2 |
Goal 3 | Medicines & medical technologies | 3:21 |
Goal 3 | Food safety | 25 |
Goal 3 | Good nutrition | 26 |
Goal 3 | Drugs & psychotropics | 22 |
Goal 3 | Alcohol consumption | 27 |
Goal 3 | Vehicle safety | 28 |
Goal 3 | Tobacco use | 23, 24 |
Goal 3 | Poison exposure | 29 |
Goal 6 | Water efficiency | 30, 31 |
Goal 7 | Energy efficiency | 32 |
Goal 11 | Air & noise pollution | 33, 34 |
Goal 11 | Cultural and historical heritage | 35 |
Goal 12 | Waste & hazardous chemicals | 36:51 |
Goal 12 | Sustainable processes and production methods (PPMs) | 52:54 |
Goal 14 | Illegal, unregulated and unreported fishing (IUU fishing) | 55, 56 |
Goal 15 | Illegal timber | 57 |
Goal 15 | Endangered species of flora and fauna | 58 |
Goal 15 | Living modified organisms (LMOs) | 59 |
Goal 16 | Arms and weapons | 60:64 |
Goal 16 | Conflict minerals | 65, 66 |
Goal 16 | Counterfeit goods | 67 |
Thus, we add an empty column titled area to ntms_all
and fill it with the subject area labels by using the data from matching column. Notice, that we add \n
to wrap the longer labels.
ntms_all$area <- NA
ntms_all$area[ntms_all$matching %in% c(1,2)] <- "Food\nsecurity"
ntms_all$area[ntms_all$Target %in% c(3:21)] <- "Medicines &\ntechnologies"
ntms_all$area[ntms_all$matching %in% c(25)] <- "Food\nsafety"
ntms_all$area[ntms_all$matching %in% c(26)] <- "Good\nnutrition"
ntms_all$area[ntms_all$matching %in% c(28)] <- "Vehicle\nsafety"
ntms_all$area[ntms_all$matching %in% c(27)] <- "Alcohol\nconsumption"
ntms_all$area[ntms_all$matching %in% c(22)] <- "Narcotics &\npsychotropics"
ntms_all$area[ntms_all$matching %in% c(29)] <- "Poison\nexposure"
ntms_all$area[ntms_all$matching %in% c(23, 24)] <- "Tobacco\nuse"
ntms_all$area[ntms_all$matching %in% c(30, 31)] <- "Water\nefficiency"
ntms_all$area[ntms_all$matching %in% c(32)] <- "Energy\nefficiency"
ntms_all$area[ntms_all$matching %in% c(33, 34)] <- "Air & noise\npollution"
ntms_all$area[ntms_all$matching %in% c(35)] <- "Cultural\nheritage"
ntms_all$area[ntms_all$matching %in% c(36:51)] <- "Waste &\nchemicals"
ntms_all$area[ntms_all$matching %in% c(52:54)] <- "Sustainable\nPPMs"
ntms_all$area[ntms_all$matching %in% c(55:56)] <- "IUU fishing"
ntms_all$area[ntms_all$matching %in% c(57)] <- "Illegal\ntimber"
ntms_all$area[ntms_all$matching %in% c(58)] <- "Endangered\nspecies"
ntms_all$area[ntms_all$matching %in% c(59)] <- "LMOs"
ntms_all$area[ntms_all$matching %in% c(60:64)] <- "Arms &\nweapons"
ntms_all$area[ntms_all$matching %in% c(65:66)] <- "Conflict\nminerals"
ntms_all$area[ntms_all$matching %in% c(67)] <- "Counterfeit\ngoods"
We also add a column Goal in which we generate information about the Goal number a given NTM is related to by using the data in column Target. We first use substr
to extract 8th and 9th symbol from strings stored in Target. We then use gsub
to find “.” in the resulting strings in Goal and to remove them.24 This way we get just the Goal number. Then in a separate column Goaln we paste the numbers from Goal converted to numbers. And in Goal we use paste0
to paste together “Goal” and the goal number. Data in Goaln will be uses later for sorting the rows, and data in Goal will be used for labels.
ntms_all$Goal <- substr(ntms_all$Target, 8,9)
ntms_all$Goal <- gsub("\\.", "", ntms_all$Goal)
ntms_all$Goaln <- as.numeric(ntms_all$Goal)
ntms_all$Goal <- paste0("Goal ", ntms_all$Goal)
Now we create a data frame object named ntms_all_area
based on ntms_all
in which we remove columns matching, hs6s, matchkw, and Target, and then we use unique
to only keep the unique rows.
ntms_all_area <- ntms_all[, -which(names(ntms_all) %in% c("matching", "hs6s", "matchkw", "Target"))]
ntms_all_area <- unique(ntms_all_area)
We can now use functions from dplyr
package to group NTMs by Target, remove row for which value in Target is not available. We then sort the rows by column Goaln and use barplot
function to create a bar plot.
By_area <- ntms_all_area %>% dplyr::group_by(area, Goal, Goaln) %>% dplyr::count() %>% dplyr::filter(!is.na(area))
By_area <- By_area[order(By_area$Goaln),]
barplot(By_area$n, main="Distribution of NTMs matching SDG Targets by subject areas", col="sky blue",names.arg = By_area$area, las=2, cex.names= 0.8, cex.axis = 0.8, xlim = c(0,24), width=1, ylab = "Number of NTMs directly addressing SDGs", ylim = c(0, 10000))
Note, you can save a graph as a new R object, or export a graph as a .png file. After you ran all the code in the chunk below, check your working directory for the .png files with the name “plotbyarea.png”.
# saving bar plot as an object
plot_byarea <-barplot(By_area$n, main="Distribution of NTMs matching SDG Targets by subject areas", col="sky blue",names.arg = By_area$area, las=2, cex.names= 0.8, cex.axis = 0.8, xlim = c(0,24), width=1, ylab = "Number of NTMs directly addressing SDGs", ylim = c(0, 10000))
# export bar plot as a .png file
png(filename="plotbyarea.png")
barplot(By_area$n, main="Distribution of NTMs matching SDG Targets by subject areas", col="sky blue",names.arg = By_area$area, las=2, cex.names= 0.8, cex.axis = 0.8, xlim = c(0,24), width=1, ylab = "Number of NTMs directly addressing SDGs", ylim = c(0, 10000))
dev.off()
We can also use write.table
function, as below, to copy to clipboard and then paste the data from By_area
data frame into Excel where we can create even nicer graph, which would also label SDGs to which the plotted subject areas are related. See the resulting graph below.
write.table(head(By_area, nrow(By_area)), "clipboard", sep="\t", row.names=FALSE)
Before we proceed further we remove some redundant objects.
rm(By_area, By_target, files_csv, matching, ntms_all_area, ntms_all_Target)
Now let’s summarize the NTMs linked to SDGs by Goal rather than by Target. We create a data frame object named ntms_all_Goal
based on ntms_all
in which we remove columns matching, hs6s, matchkw, area and Target, and then we use unique
to only keep the unique rows. We can now use functions from dplyr
package to group NTMs by Goal, keeping column Goaln as well, remove row for which value in Goaln is “NA”, and arrange the rows by Goaln. And then we use barplot
function to create a bar plot.
ntms_all_Goal <- ntms_all[, -which(names(ntms_all) %in% c("matching", "hs6s", "matchkw", "Target", "area"))]
ntms_all_Goal <- unique(ntms_all_Goal)
By_goal <- ntms_all_Goal %>% dplyr::group_by(Goal, Goaln) %>% dplyr::count() %>% filter((!is.na(Goaln))) %>% arrange(Goaln)
barplot(By_goal$n, main="Distribution of NTMs by Goals", col="sky blue",names.arg = By_goal$Goal, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Number of NTMs directly addressing SDGs", ylim=c(0,12000))
To make the bar chart more informative we can add a column with titles of the Goals, and use those as bar labels. Notice, that we add \n
to wrap the longer labels.
SDG_titles <- c("ZERO\nHUNGER", "GOOD HEALTH\n& WELL-BEING", "CLEAN WATER\n&SANITATION", "CLEAN &\nAFFORDABLE\nENERGY", "SUSTAINABLE\nCITIES &\nCOMMUNITIES", "RESPONSIBLE\nCONSUMPTION\n& PRODUCTION", "LIFE UNDER\nWATER", "LIFE ON\nLAND", "PIECE,\nJUSTICE &\nSTRONG\nINSTITUTIONS")
By_goal$Title <- SDG_titles
barplot(By_goal$n, main="Distribution of NTMs by Goals", ylab = "Number of NTMs directly addressing SDGs", col="sky blue",names.arg = By_goal$Title, las=2, cex.names= 0.7, cex.axis = 0.8, ylim=c(0,12000))
Question 16.1: You want to create new object ntmsall_hi
by extracting from ntms_all
only rows for economies that are marked as high income economies in countries_inc
. Which code will achieve that?
A. All
# Answer A is correct.
B.
hi_c <- countries_inc[countries_inc$high_income==1, 1:2]
ntmsall_hi <- ntms_all[ntms_all$reporter %in% hi_c$country,]
# Answer B is not entirely correct.
C. ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income!=0],]
# Answer C is not entirely correct.
D. ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income==1],]
# Answer D is not entirely correct.
Question 16.2: Now you want to create object ntmsall_hi_goal
that would copy ntmsall_hi
excluding columns “matching”, “hs6s”, “matchkw”, “Target”, “area” and then keep only the unique rows.
A.
ntmsall_hi_goal <- ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area"))] %>% ntmsall_hi_goal <- unique(ntmsall_hi_goal)
# Answer A is incorrect. Incorrect use of pipe operator `%>%`.
B.
ntmsall_hi_goal <- unique(ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area")])
# Answer B is incorrect. Closing bracket for function `which` is missing within square brackets.
C.
ntmsall_hi_goal <- ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area"))] %>% unique()
# Answer C is correct.
D. All
# Answer D is incorrect. Not all answers are correct.
Question 16.3: You now want to use dplyr
functions to overwrite ntmsall_hi_goal
by consecutively grouping data into rows with the same value in columns Goal and Goaln, counting the number of rows with the same values, filtering out the row that has NA
value in column Goaln and then arranging the rows in ascending order by column Goaln. Is the code below complete?
Code:
ntmsall_hi_goal %>% dplyr::group_by(Goal, Goaln) %>% dplyr::count() %>% filter((!is.na(Goaln))) %>% arrange(Goaln)
A. Yes, it is complete.
# Answer A is incorrect.
B. It is missing a bracket.
# Answer B is incorrect.
C. It does not generate or overwrite any object.
# Answer C is correct.
D. Argument for filter
has a mistake.
# Answer D is incorrect.
Question 16.4: You used the code below to create a bar plot of distribution of NTMs imposed by high income economies and directly addressing SDGs by Goals. What argument in barplot
function you need to adjust to shorten y axis?
Code:
barplot(hi_by_goal$n, main="Distribution of NTMs by Goals in high income economies", ylab = "Number of NTMs directly addressing SDGs", col="sky blue",names.arg = hi_by_goal$Goal, las=2, cex.names= 0.7, cex.axis = 0.8, ylim=c(0,12000))
A. cex.axis
# Answer A is incorrect. `cex.axis` regulates size of the numeric axis labels.
B. ylab
# Answer B is incorrect. `ylab` specifies label for y axis.
C. ylim
# Answer C is correct.
D. cex.names
# Answer D is incorrect. `cex.names` regulates size of bar labels.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
# Question 1
#1
hi_c <- countries_inc[countries_inc$high_income==1, 1:2]; ntmsall_hi <- ntms_all[ntms_all$reporter %in% hi_c$country,]
#2
ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income!=0],]
#3
ntmsall_hi <- ntms_all[ntms_all$reporter %in% countries_inc$country[countries_inc$high_income==1],]
# Question 2
#3
ntmsall_hi_goal <- ntmsall_hi[, -which(names(ntmsall_hi) %in% c("matching", "hs6s", "matchkw", "Target", "area"))] %>% unique()
# Question 3
hi_by_goal <- ntmsall_hi_goal %>% dplyr::group_by(Goal, Goaln) %>% dplyr::count() %>% filter((!is.na(Goaln))) %>% arrange(Goaln)
# Question 4
barplot(hi_by_goal$n, main="Distribution of NTMs by Goals in high income economies", ylab = "Number of NTMs directly addressing SDGs", col="sky blue",names.arg = hi_by_goal$Goal, las=2, cex.names= 0.7, cex.axis = 0.8, ylim=c(0,5000))
Another way to summarize NTMs that directly address SDGs is by their types as per the taxonomy laid out in the International Classification of Non-tariff Measures (version of 2012; for details see MAST_all.csv that we utilized earlier). There are the following distinct groups of NTMs:
Broad groups | NTM types | NTM letter (ICNTM Chapter) |
---|---|---|
Technical | SPS | A |
Technical | TBT | B |
Technical | Pre-shipment inspection | C |
Non-technical | Quantity control | E |
Non-technical | Contingent trade protective | D |
Non-technical | Price control | F |
Non-technical | Other | G,H,I,J,K,L,M,N,O |
Export | All export | P |
Using ntms_all_Goal
data frame we create a new one ntms_Goal
in which we remove all rows, for which data in column Goaln is not available, i.e. we remove all rows for NTMs that were not found to be linked to SDGs. We then use substr
function to create column NTM_letter which only contains the first letter from NTM_temp. This is the letter that indicates the chapter of ICNTM to which a given NTM belongs.
With the following lines of code we create two additional columns broad and NTMtype to which we add strings describing NTM types and broad NTM groups based on the chapter letter in column NTM_letter. Notice, that we add \n
to wrap the longer labels.
ntms_Goal <- ntms_all_Goal[!is.na(ntms_all_Goal$Goaln),]
ntms_Goal$NTM_letter <- substr(ntms_Goal$NTM_temp, start = 1, stop = 1)
ntms_Goal$broad <- NA
ntms_Goal$broad[ntms_Goal$NTM_letter %in% c("A", "B", "C")] <- "Technical"
ntms_Goal$broad[ntms_Goal$NTM_letter %in% c("P")] <- "Export"
ntms_Goal$broad[!ntms_Goal$NTM_letter %in% c("A", "B", "C", "P")] <- "Non-technical"
ntms_Goal$NTMtype <- NA
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("A")] <- "SPS"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("B")] <- "TBT"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("C")] <- "Pre-shipment\ninspection"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("E")] <- "Quantity\ncontrol"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("D")] <- "Contingent\ntrade\nprotective"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("F")] <- "Price\ncontrol"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("G","H","I","J","K","L","M","N","O")] <- "Other"
ntms_Goal$NTM_letter[ntms_Goal$NTMtype=="Other"] <- "G"
ntms_Goal$NTMtype[ntms_Goal$NTM_letter %in% c("P")] <- "All export"
We can now use functions from dplyr
package to group NTMs by NTMtype, keeping columns NTM_letter and broad, and arrange the rows by NTM_letter. We then add column share in which we store the results of calculating a share of a given type of NTMs within the total types of NTMs linked to SDGs. And then we use barplot
function to create a bar plot.
By_ntm <- ntms_Goal %>% group_by(NTMtype, NTM_letter, broad) %>% count() %>% arrange(NTM_letter)
By_ntm$share <- By_ntm$n/sum(By_ntm$n)*100
barplot(By_ntm$share, main="NTMs directly addressing SDGs by NTM types", col=c("red", "orange", "yellow", "green", "blue", "purple" ),names.arg = By_ntm$NTMtype, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Share of NTMs directly addressing SDGs, %", ylim=c(0,60))
We can also use write.table
function, as below, to copy and paste the data from By_ntm
data frame into Excel where we can create even nicer graph, which would also label broad NTM groups to which the plotted NTM types belong. See the resulting graph below.
write.table(head(By_ntm, nrow(By_ntm)), "clipboard", sep="\t", row.names=FALSE)
Question 17.1: Refer to the code provided in this section and write code that will transform data in object ntmsall_hi_goal
generated earlier by removing all rows that have NA
value in column Goaln, add column NTM_letter that will contain NTM letter extracted from values in column NTM_temp, add column broad that will contain broad NTM groups, and add column NTMtype that will contain NTM types, the latter two both based on the values in column NTM_letter, and save it into object ntms_hi_goal
.
Then use code to check the number of NTMs that are marked as SPS and TBT in column NTMtype. What is that number?
A. 9416
# Answer A is incorrect.
B. 10560
# Answer B is incorrect.
C. 9832
# Answer C is incorrect.
D. 10959
# Answer D is correct.
# The correct code is below for reference.
Question 17.2: Refer to the code provided in this section and write code that will transform data in object ntms_hi_goal
and generate a bar plot for distribution of NTMs imposed by high income economies and directly addressing SDGs by NTM types.
What is the share of “All export” measures?
A. Approx. 18
# Answer A is incorrect.
B. Approx. 15
# Answer B is incorrect.
C. Approx. 10
# Answer C is correct.
# The correct code is below for reference.
D. Approx. 5
# Answer D is incorrect.
Once you are done with the quiz, make sure to paste in your R script and run the code stored under the button below, and then continue with the training. The chunk includes the correct code strings from the questions above, and these objects you will need for further quizzes and practice sessions.
# Question 1
ntms_hi_goal <- ntmsall_hi_goal[!is.na(ntmsall_hi_goal$Goaln),]
ntms_hi_goal$NTM_letter <- substr(ntms_hi_goal$NTM_temp, start = 1, stop = 1)
ntms_hi_goal$broad <- NA
ntms_hi_goal$broad[ntms_hi_goal$NTM_letter %in% c("A", "B", "C")] <- "Technical"
ntms_hi_goal$broad[ntms_hi_goal$NTM_letter %in% c("P")] <- "Export"
ntms_hi_goal$broad[!ntms_hi_goal$NTM_letter %in% c("A", "B", "C", "P")] <- "Non-technical"
ntms_hi_goal$NTMtype <- NA
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("A")] <- "SPS"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("B")] <- "TBT"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("C")] <- "Pre-shipment\ninspection"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("E")] <- "Quantity\ncontrol"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("D")] <- "Contingent\ntrade\nprotective"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("F")] <- "Price\ncontrol"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("G","H","I","J","K","L","M","N","O")] <- "Other"
ntms_hi_goal$NTM_letter[ntms_hi_goal$NTMtype=="Other"] <- "G"
ntms_hi_goal$NTMtype[ntms_hi_goal$NTM_letter %in% c("P")] <- "All export"
nrow(ntms_hi_goal[ntms_hi_goal$NTMtype=="SPS"|
ntms_hi_goal$NTMtype=="TBT",])
# Question 2
hi_ntm <- ntms_hi_goal %>% group_by(NTMtype, NTM_letter, broad) %>% count() %>% arrange(NTM_letter)
hi_ntm$share <- hi_ntm$n/sum(hi_ntm$n)*100
hi_ntm_graph <- barplot(hi_ntm$share, main="NTMs directly addressing SDGs by NTM types", col=c("red", "orange", "yellow", "green", "blue", "purple" ),names.arg = hi_ntm$NTMtype, las=2, cex.names= 0.8, cex.axis = 0.8, ylab = "Share of NTMs directly addressing SDGs, %", ylim=c(0,60))
hi_ntm[hi_ntm$NTMtype=="All export", "share"][[1]]
Section 3 above shows different ways, in which you can summarize and graphically present data generated by the application of SDG-HS-NTM Concordance Matrix to the UNCTAD’s database of NTMs. Of course, these are not exhaustive.
It is also possible to extract products and product groups at 2-, 4- and 6-digit level that are most frequently addressed by NTMs that are associated with all or specific SDGs or their targets. It is possible to see, what NTM types addressing all or specific SDGs are applied to a product or a product group globally, regionally or by a country. It is possible to identify those NTM types that are most often used to address SDG-related issues.
It is also possible to calculate a few indicators commonly used to quantify the intensity of non-tariff measures specifically for SDG-related NTMs: coverage ratio, frequency index and prevalence score27; or an indicator that measures the similarity between NTM policies across member States and sectors: regulatory distance28. For detailed method of using R and the available NTM and trade data sets to calculate the above indicators for all NTMs, you can refer to Training on using R for trade analysis. You can apply this method to calculate these indicators for SDG-linked NTMs as well.
The SDG-HS-NTM concordance matrix is based on the methodology jointly developed by ESCAP and UNCTAD and is described in ESCAP working paper available from here.
HS codes, NTM codes and keywords were defined based on literature, international treaties and data available from TRAINS. You can edit these by adding or removing rows, keywords, NTM and HS codes, and see how that changes the output. When adding new HS codes, NTM codes, or keywords, make sure to separate them with ;
and to add opening and closing !
to all-caps abbreviations in keywords.
You may even add more concordance strings, if you are able to identify additional definitive linkages between SDGs and regulations imposed on traded goods. In any case, the SDG-HS-NTM Concordance Matrix can be a living and evolving tool that can be adjusted depending on the needs or on the emerging new knowledge.
This training was prepared under the overall supervision of Mia Mikic, Director, Trade, Investment and Innovation Division (TIID) of the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP), and Yann Duval, Chief of Trade Policy and Facilitation (TPFS), TIID, ESCAP. This module was developed by Alexey Kravchenko, Economic Affairs Officer, TPFS, TIID, ESCAP and Maria Semenova, Consultant, TPFS, TIID, ESCAP. The authors are grateful for the technical support provided by Hrisyana Doitchinova, and for the testing feedback received from Andrea Dalla Rosa.
To Install R
Now that R is installed, you need to download and install RStudio.
To Install RStudio
To Install the SDSFoundations Package
To Install R * Open an internet browser and go to www.r-project.org. * Click the “download R” link in the middle of the page under “Getting Started.” * Select a CRAN location (a mirror site) and click the corresponding link. * Click on the “Download R for (Mac) OS X” link at the top of the page. * Click on the file containing the latest version of R under “Files.” * Save the .pkg file, double-click it to open, and follow the installation instructions. * Now that R is installed, you need to download and install RStudio.
To Install RStudio * Go to www.rstudio.com and click on the “Download RStudio” button. * Click on “Download RStudio Desktop.” * Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.
To Install the SDSFoundations Package * Download SDSFoundations to your desktop (make sure it has the “.tgz” extension). * Open RStudio. * Click on the Packages tab in the bottom right window. * Click “Install.” * Select install from “Package Archive File.” * Select the SDSFoundations package file from your desktop. * Click install. You are done! You can now delete the SDSpackage file from your desktop.
UNCTAD. International classification of non-tariff measures. 2012 version. Available at https://unctad.org/en/PublicationsLibrary/ditctab20122_en.pdf?user=46↩
More information on the Sustainable Development Goals is available at https://sustainabledevelopment.un.org/?menu=1300↩
Kravchenko, Alexey; Semenova, Maria; Lee, Seul and Duval, Yann (2019). Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific. Trade, Investment and Innovation Working Paper No. 04/19, ESCAP: Bangkok.↩
Guidelines to collect data on official non-tariff measures are available at https://unctad.org/en/PublicationsLibrary/ditctab2014d4_en.pdf↩
UNCTAD. International classification of non-tariff measures. 2012 version. Available at https://unctad.org/en/PublicationsLibrary/ditctab20122_en.pdf?user=46↩
In 2019 a new version of ICNTM was launched, which preserved the overall structure of the classification, while introducing a few very notable changes and expanding taxonomy of NTMs under a few chapters.↩
MAST team is comprised of Food and Agriculture Organization of the United Nations, International Monetary Fund, International Trade Centre, Organization for Economic Cooperation and Development, United Nations Conference on Trade and Development, United Nations Industrial Development Organization, World Bank, and World Trade Organization↩
Asia-Pacific Trade and Investment Report 2019: Navigating Non-tariff Measures towards Sustainable Development, Chapter 1: Why non-tariff measures matter for sustainable development. Available at https://www.unescap.org/sites/default/files/APTIR2019_Chapter1.pdf↩
Some of this data is regularly updated, and newer versions may become available from their publishers.↩
Kravchenko, Alexey; Semenova, Maria; Lee, Seul and Duval, Yann (2019). Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific. Trade, Investment and Innovation Working Paper No. 04/19, ESCAP: Bangkok.↩
The following color coding is used in the excel concordance matrix: (1) concordance strings that describe direct linkage and are clear enough to enable further quantitative analysis are not color-coded (in the excel SDG-HS-NTM Concordance Matrix they are also marked as “C” for “Clear” or “CwK” for “Clear with keywords”); (2) concordance strings that are too ambiguous, or for which their impact on NTMs is considered indirect, or for which not enough information is available, are color coded in grey (in the excel SDG-HS-NTM Concordance Matrix they are marked as “A” for “Ambiguous” and “IND” for “Indirect”).↩
Notice, when we run head(matching)
the first line of the output contains # A tibble: 6 x 8
, stating that matching
object is a tibble. Tibble is a type of data frames that was developed specifically for use in R to allow for better management of data. You can read about the characteristicts and capabilities of tibbles here. Essentially, tibbles and data frames are very similar, but respond somewhat differently to same types of manipulation, including subsetting. For example, if an object is a data frame code dataframe[8,8]
generates a vector containing one value from the cell in row 8 and column 8 (a one-dimensional object). If the same object is a tibble, code tibble[8,8]
generates a matrix with 1 row and 1 column (a two-dimensional object). So, if we need to extract the data from that cell as a vector, we need to coerce it to one dimension, which is done by adding [[1]]
at the end of that code, e.g. tibble[8,8][[1]]
.↩
Negative keywords are the words and word combinations that have to be absent in an analyzed NTM entry in the TRAINS database.↩
Note that some countries that are marked as having no NTMs belong to an economic integration grouping, such as the European Union, so NTM data are available for the grouping rather than for the individual countries. In some other cases countries are marked as having zero NTMs due to the lack of NTM data for those countries in TRAINS database.↩
3-digit codes are, in fact, 4-digit codes, in which first zeros got omitted when the codes were converted from characters to numeric values. These are the codes from HS chapters 01 through 09.↩
Negation can be signaled using “-” or “!”.↩
1-digit codes are, in fact, 2-digit codes, in which first zeros got omitted when the codes were converted from characters to numeric values. These are the codes from HS chapters 01 through 09.↩
Kravchenko, Alexey; Semenova, Maria; Lee, Seul and Duval, Yann (2019). Exploring linkages between non-tariff measures and the Sustainable Development Goals: A global concordance matrix and application to Asia and the Pacific. Trade, Investment and Innovation Working Paper No. 04/19, ESCAP: Bangkok.↩
This is relevant for such countries as Argentina, Benin, Plurinational State of Bolivia, Brazil, Burkina Faso, Chile, Colombia, Costa Rica, Côte d’Ivoire, Cuba, Ecuador, El Salvador, Guatemala, Guinea, Honduras, Mali, Mexico, Nicaragua, Niger, Panama, Paraguay, Peru, Senegal, Togo, Uruguay, Bolivarian Republic of Venezuela.↩
The United Nations Special Programme for the Economies of Central Asia (SPECA) was launched in 1998 to strengthen subregional cooperation in Central Asia and its integration into the world economy. The countries of SPECA are Afghanistan, Azerbaijan, Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan and Uzbekistan.↩
As was explained earlier, Keywords and neg_Keywords that do not contain “!”, are have been formatted to lower case. Those that contain “!” were kept as all caps. This is done, as keyword matching with the data in the TRAINS database is generally case insensitive, with the exception of certain abbreviations that may match other words or their parts. Example: Letter combination “ODS”, which as an abbreviation that stands for “ozone depleting substances”, may be found in words “goods”, “rods”, “foods”, etc.↩
Exclamation marks are used to signal words that are abbreviations. They are all caps and matching for them has to be case sensitive.↩
When working with character strings in R, .
is used to represent any character, except new line. Therefore, if we want to detect a period in a character string, we need to include \\
right before it. For more details on basic regular expressions in R used for manipulating character strings refer, for example, to https://stringr.tidyverse.org/articles/regular-expressions.html↩
When working with character strings in R, .
is used to represent any character, except new line. Therefore, if we want to detect a period in a character string, we need to include \\
right before it. For more details on basic regular expressions in R used for manipulating character strings refer, for example, to https://stringr.tidyverse.org/articles/regular-expressions.html↩
If data is continuous, rather than discrete, R will create gradient color bars chart.↩
Available at https://www.unescap.org/publications/APTIR2019↩
UNCTAD. Non-tariff measures: Economic assessment and policy options for development. 2018. Available on page 92 here↩
UNCTAD. Deep regional integration and non-tariff measures: A methodology for data analysis. Policy issues in international trade and commodities research study series No. 69. 2015. Available on page 2 and onward here↩