Module 3. Introduction to Text Based Prediction and Classification

Introduction

In this module you will learn:

  • How to define a text-based classification problem
  • Basics of using artificial neural networks (ANNs) for prediction and classification
  • Basics of using text as an input
  • Demonstration in R 1: Revisiting the LPI
  • Demonstration in R 2: Classifying NTMs

Estimated time requirement

  • Video lecture: 60 minutes
  • Studying the code: 3-6 hours [1]. Note that some of the code in this module takes a long time to run.
  • Problem set: 1-2 hours [1]
  • Doing the quiz: 1-2 hours [1]

[1] This estimate depends on your level of familiarity with R.

Learning materials

Download Module 3 materials from here.

  • The Lecture 3 folder contains the presentation, dataset and R scripts featured in the video lecture, as well as the dataset to be used for the problem set.

This module makes extensive use of the keras package, which lets you access the TensorFlow machine learning platform from R. After watching the video lecture and running the demonstration code, please refer to this webpage to learn more about the package and to better understand the key elements that go into building the basic ANNs demonstrated in this module.

Especially helpful in the context of Module 3 are the following sections:

Additionally, run the following code to learn more about the R packages and functions used to set up, compile, run and evaluate ANN models:

# Load the packages so that their help topics can be found
library(keras)
library(tfruns)

# View the help index for the keras and tfruns packages
help(package = "keras")
help(package = "tfruns")

# Look up the key functions from both packages used to build ANNs in Module 3
?keras_model_sequential
?layer_dense
?layer_dropout
?compile.keras.engine.training.Model   # compile() method of the keras package
?fit.keras.engine.training.Model       # fit() method of the keras package
?evaluate.keras.engine.training.Model  # evaluate() method of the keras package

# tfruns functions for launching and listing training runs
?tuning_run
?ls_runs
?tfruns::flags
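
As a rough illustration of how the functions listed above fit together, the following is a minimal sketch and not the model from the lecture. It assumes that keras and a working TensorFlow backend are installed. The input data are random numbers, and the layer sizes, dropout rate, number of epochs and the object names (x, y, model, scores) are arbitrary choices made only for this example.

```r
library(keras)

set.seed(1)

# Made-up inputs: 100 observations, 20 numeric features, 3 classes
x <- matrix(runif(100 * 20), nrow = 100, ncol = 20)
labels <- sample(0:2, 100, replace = TRUE)
y <- to_categorical(labels, num_classes = 3)  # one-hot encode the labels

# Set up: a small feed-forward network with one hidden layer and dropout
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(20)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 3, activation = "softmax")

# Compile: choose the loss, the optimizer and the metric to report
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)

# Fit: hold out 20% of the rows for validation during training
history <- model %>% fit(
  x, y,
  epochs = 5,
  batch_size = 16,
  validation_split = 0.2,
  verbose = 0
)

# Evaluate: returns the loss and accuracy (here on the training data,
# purely to show the call; a real evaluation uses a separate test set)
scores <- model %>% evaluate(x, y, verbose = 0)
print(scores)
```

Because the inputs are random, the accuracy reported here carries no meaning; the point is only the order of the steps: set up, compile, fit, evaluate.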

Video lecture

Watch the Module 3 lecture video and follow along with the replication script featured in it.

Lecture 3

Problem set

Problem: Lecture 3

  1. Read the same UNCTAD TRAINS data used for the demonstration.
  2. Retain the descriptive text, and the HS field. That field lists the HS codes of products affected by a measure.
  3. Isolate the first two characters of each entry in the HS field, i.e. the first two characters of the first product listed.
  4. Transform those two characters to numerical values. Verify that you have a list of numbers 1-99, corresponding to the HS 2-digit code for the first listed product affected by each measure.
  5. Remove NA entries from the data.
  6. Using the demonstration code as a guide, use an artificial neural network to classify the text descriptions by HS 2-digit code.
  7. To the extent possible, experiment with different hyper-parameter sets to obtain a model that performs well.
  8. When you have a preferred model, evaluate it on the testing sample.
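
The data preparation in steps 2 to 5 can be sketched in base R as follows. This is only an illustration on a made-up data frame: the object name ntm, the column names description, hs and hs2, and the example rows are hypothetical and will differ from the actual TRAINS file. The sketch also assumes that the HS field is stored as text with leading zeros preserved; if it has been read in as a number, the first two characters will not correspond to the HS chapter.

```r
# Made-up stand-in for the TRAINS extract (hypothetical column names and rows)
ntm <- data.frame(
  description = c("Import licence required for live cattle",
                  "Maximum residue limits for pesticides in wheat",
                  "Labelling requirements for cotton fabrics",
                  "General measure with no product listed"),
  hs = c("010221, 010229", "1001", "520811; 520812", NA),
  stringsAsFactors = FALSE
)

# Step 2: retain the descriptive text and the HS field
ntm <- ntm[, c("description", "hs")]

# Steps 3 and 4: take the first two characters of each HS entry and
# convert them to integers; entries that cannot be converted become NA
ntm$hs2 <- suppressWarnings(as.integer(substr(trimws(ntm$hs), 1, 2)))

# Step 5: remove rows without a usable HS 2-digit code
ntm <- ntm[!is.na(ntm$hs2), ]

# Check that all remaining codes lie between 1 and 99
stopifnot(all(ntm$hs2 >= 1 & ntm$hs2 <= 99))

# Tabulate how many measures fall under each 2-digit code
table(ntm$hs2)
```

Tabulating the codes before modelling is a quick way to see how unbalanced the classes are, which matters when judging the accuracy of the classifier in steps 6 to 8.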

Quiz

Now you can attempt QUIZ 3.

You can take the quiz multiple times. You need a score of at least 75% to pass Quiz 3.