Chapter 5 Gravity models

Please keep in mind that from this section onward we will learn how to analyse trade data in R. Specifically, this section is designed to teach you how to estimate gravity models in R and will thus assume you are familiar with the theory behind them and their use. If you are interested in an in-depth paper about the different types of gravity models - check ESCAP’s original guide in STATA or the recent version in R. UN ESCAP also has an online gravity tool.

Let’s quickly remind ourselves the use of gravity models:

Gravity models link trade flows with economic size and inversely with trade costs. Hence, gravity models can capture patterns in international trade and production. Policy researchers can use them to estimate the trade impacts of various trade-related policies, from traditional tariffs to new “behind-the-border” measures. (From ESCAP’s guide on gravity models found here.)

In this chapter we will cover the most basic form of the gravity model - the linear model. We will show you how to obtain STATA-like robust clustered standard errors. We will finish the analysis with the Poisson Pseudo-Maximum Likelihood Estimator (PPML).

5.1 Linear gravity model

5.1.1 Linear regression

Let’s start with the most basic gravity model: linear regression using the lm command. (lm stands for linear model.)

Let’s first generate an x and y value by using the rnorm function. (Check Section 1.2 to remind yourself what this function does or run the code ?rnorm.)

y <- rnorm(100,1,5)
x <- y * 2 + rnorm(100,0,1)

Inside the brackets, on the left side of the tilde (~), is the output variable and on the right side the input variable(s). In this case we estimate the equation y = x + c, where c is a constant which we find once the program is ran. reg1 is the name we have given our regression.

reg1 <- lm(y~x)
reg1
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    0.002012     0.502247

We can access regression output with the function summary. This gives us all the basic output one needs from a regression, such as the values of the estimates, the standard errors, p-values, etc. (Remember summary can be used on both data sets and model objects in R.)

Let’s run summary for reg1. The first line of output gives us the model specification, e.g. Call: lm(formula = y ~ x). Afterwards, descriptive statistics are provided for the Residuals, such as the minimum (Min) and maximum values (Max), the median (Median) and the first (1Q) and third quartile (3Q). For instance the Median value in this case is 0.04081.

Next we get the Coefficients. We have the coefficient value under Estimate, after it the standard error (Std. Error), the t-statistic (t value) and the p-value (Pr(>|t|)). For reg1 the coefficient of the intercept is 0.017038 and the x coefficient is 0.486928. The x coefficient is significant as shown under column Pr(>|t|).

Below the Coefficients information, is a legend for the significance level Signif. codes. For example, 0.001 ‘**’ means that ** provide 0.1% of significance.

Finally, we have information on the Residual standard error, the R squared Multiple R-squared and Adjusted R-squared, and the F-statistic and its p-value. For reg1 the R squared is 0.9892.

(This summary output will vary for the type of model used. The more complicated the model, the more information will be provided generally speaking.)

summary(reg1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24012 -0.32035 -0.01573  0.37339  1.25208 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 0.002012   0.052788   0.038                0.97    
## x           0.502247   0.005181  96.935 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5184 on 98 degrees of freedom
## Multiple R-squared:  0.9897, Adjusted R-squared:  0.9896 
## F-statistic:  9396 on 1 and 98 DF,  p-value: < 0.00000000000000022

The attributes function shows you the full list of attributes you can pull out from the summary function used on reg1:

attributes(summary(reg1))
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"

For example you can just pull out the \(R^2\) alone:

summary(reg1)$r.squared
## [1] 0.9896781

Quiz

Let’s practice!

Which answer correctly:
- declares two new variables with specifications x <- rnorm(100,2,5) and y <- x + rnorm(100,1,5)
- calculates the regression equation y = x + c
- asks R for the regression’s output summary?

(Hint: y = x + c, c is a constant which we will find out once the program is ran, don’t worry about it, it should not be in your code!)

A. x <- rnorm(100,2,5)
    y <- x + rnorm(100,1,5)
    reg2 <- lm(y~x)
    summary(reg2)

#This is correct. Good job!

B. reg2 <- lm(y~x)
    summary(reg2)

#Answer B is incorrect because it fails to include the declaration of x and y, therefore the correct answer is A.

C. x <- rnorm(100,2,5)
    y <- x + rnorm(100,1,5)

#Answer C is incorrect because it only declares the variables x and y but does not specify the regression nor expression for the summary, therefore the correct answer is A.

5.1.2 Linear gravity model continued

Let’s work with some real trade data! We begin by importing the data in R with read.csv. You can download it here.

gd <- read.csv("data/short_data.csv",
               stringsAsFactors = FALSE)

We see the list of variables in the data set with the name command.

names(gd)
##  [1] "X"                 "reporter"          "partner"          
##  [4] "year"              "export_ij"         "import_ij"        
##  [7] "trade_ij"          "export_i"          "import_i"         
## [10] "trade_i"           "export_j"          "import_j"         
## [13] "trade_j"           "ctcij"             "ntcij"            
## [16] "tariffrateij_sa"   "tariffrateij_wa"   "tariff_ij_sa"     
## [19] "tariff_ij_wa"      "tariffrateji_sa"   "tariffrateji_wa"  
## [22] "tariff_ji_sa"      "tariff_ji_wa"      "ga_tariff_ijji_sa"
## [25] "ga_tariff_ijji_wa" "dist"              "contig"           
## [28] "comlang_off"       "comlang_ethno"     "colony"           
## [31] "comcol"            "curcol"            "col45"            
## [34] "smctry"            "landlocked_i"      "landlocked_j"     
## [37] "landlocked_ij"     "rta"               "gdp_i"            
## [40] "gdpgrowth_i"       "gdppc_i"           "gdppcgrowth_i"    
## [43] "rgdp_i"            "rgdppc_i"          "gdp_j"            
## [46] "gdpgrowth_j"       "gdppc_j"           "gdppcgrowth_j"    
## [49] "rgdp_j"            "rgdppc_j"          "dtf_i"            
## [52] "tab_i"             "startbiz_i"        "credit_i"         
## [55] "invest_i"          "tax_i"             "contract_i"       
## [58] "dtf_j"             "tab_j"             "startbiz_j"       
## [61] "credit_j"          "invest_j"          "tax_j"            
## [64] "contract_j"        "lsci_i"            "lsci_j"           
## [67] "lpi_i"             "lpi_j"

Gravity models link trade flows with economic size and inversely with trade costs. For our first gravity model in R, let’s check the relationship between the exports (export_ij) and distance between the exporting and importing countries (dist):

lm(gd$export_ij ~ gd$dist)
## 
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
## 
## Coefficients:
## (Intercept)      gd$dist  
##  1094335800       -73723

As we saw in the beginning of Section 5.1, we can create an object in R to store the results:

reg1 <- lm(gd$export_ij ~ gd$dist)

If you just call reg1 by itself, you won’t see much. Remember, for more detailed information we use summary:

summary(reg1)
## 
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
## 
## Residuals:
##          Min           1Q       Median           3Q          Max 
##  -1090653864   -758408424   -507452840   -200191922 409695392687 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 1094335800   16216752   67.48 <0.0000000000000002 ***
## gd$dist         -73723       1932  -38.16 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5307000000 on 374236 degrees of freedom
##   (137806 observations deleted due to missingness)
## Multiple R-squared:  0.003875,   Adjusted R-squared:  0.003872 
## F-statistic:  1456 on 1 and 374236 DF,  p-value: < 0.00000000000000022

Trade values frequently contain zeros. How do we see how many rows in gd have zero in the column export_ij?

nrow shows the total number of rows:

nrow(gd)
## [1] 512044

The code below should give us the values that are zero, but because there are also missing values in export_ij, it shows the number of rows of zeros and missing values combined:

nrow(gd[gd$export_ij==0,])
## [1] 120815

A better way is to use length which gives the length of a variable list:

length(which(gd$export_ij==0))
## [1] 6409

We use the function which to signify that we want to count only the observations where gd$export_ij==0 is TRUE.

length(which(gd$export_ij==0)) is better than nrow(gd[gd$export_ij==0,]) because it counts only the zeros and excludes the missing values (NA). This is why the number of observations is less with length.

We can find the length of gd:

length(gd)
## [1] 68

We get 68 number of variables.

The number of observation are:

length(gd$export_ij)
## [1] 512044

We can check the number of missing values:

length(gd$export_ij[is.na(gd$export_ij)])
## [1] 114406

Let’s assume the missing values of export_ij are zero. How do we change their value from missing to zero?

Remember we need to select the correct condition, which in this case is is.na(gd$export_ij) but also tell R for which variable it should change the values - export_ij:

gd[is.na(gd$export_ij),"export_ij"] <- 0

Double checking this worked:

length(gd$export_ij[is.na(gd$export_ij)])
## [1] 0

Yes it did! Now we have 0 rows in export_ij which are missing.

Let’s regress once again distance with exports to compare:

reg2 <- lm(gd$export_ij ~ gd$dist)
summary(reg2)
## 
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
## 
## Residuals:
##          Min           1Q       Median           3Q          Max 
##   -939688372   -624180945   -404594794   -132356225 409761225877 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 943289085   13155595   71.70 <0.0000000000000002 ***
## gd$dist        -65972       1497  -44.07 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4698000000 on 478830 degrees of freedom
##   (33212 observations deleted due to missingness)
## Multiple R-squared:  0.004039,   Adjusted R-squared:  0.004037 
## F-statistic:  1942 on 1 and 478830 DF,  p-value: < 0.00000000000000022

There is a difference!

We can try to regress the logs of the same variables in order to check the percentage change:

reg3 <- lm(log(gd$export_ij) ~ log(gd$dist))

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, …) : NA/NaN/Inf in ‘y’

We get an error, because we cannot take the logarithm of zero. For the purposes of the exercise, we replace all zero export_ij with 0.0001. (A quick reminder: that the default of the log function in R is natural logarithm unless changed!)

gd[gd$export_ij==0,"export_ij"] <- 0.0001
reg3 <- lm(log(gd$export_ij) ~ log(gd$dist))
summary(reg3)
## 
## Call:
## lm(formula = log(gd$export_ij) ~ log(gd$dist))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.756  -3.341   3.907   7.381  19.156 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  36.37332    0.15809   230.1 <0.0000000000000002 ***
## log(gd$dist) -3.09398    0.01816  -170.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.51 on 478830 degrees of freedom
##   (33212 observations deleted due to missingness)
## Multiple R-squared:  0.05717,    Adjusted R-squared:  0.05717 
## F-statistic: 2.904e+04 on 1 and 478830 DF,  p-value: < 0.00000000000000022

This result says that for every 1% increase in distance, export_ij is associated with 3.09% decrease. (This is also known as the elasticity of exports with respect to distance.)

We can reproduce the above, but instead of specifying gd$ in front of every variable, we can also specify the data source as one of the parameters in the lm function. Both ways work and provide the same results, it is up to personal preference which one to choose.

reg4 <- lm(log(export_ij) ~ log(dist), data = gd)
summary(reg4)
## 
## Call:
## lm(formula = log(export_ij) ~ log(dist), data = gd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.756  -3.341   3.907   7.381  19.156 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 36.37332    0.15809   230.1 <0.0000000000000002 ***
## log(dist)   -3.09398    0.01816  -170.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.51 on 478830 degrees of freedom
##   (33212 observations deleted due to missingness)
## Multiple R-squared:  0.05717,    Adjusted R-squared:  0.05717 
## F-statistic: 2.904e+04 on 1 and 478830 DF,  p-value: < 0.00000000000000022

Quiz

So far so good. What is the coefficient for log(export_ij) if we run log(trade_ij) ~ log(dist) + log(export_ij)? (Hint: Make sure the new variable that is included - trade_ij does not have any missing values or zeros as you will take the logarithm of it. If it is the case, do the same as we did with export_ij.)

A. 0.30
B. 0.43
C. 0.55
D. 0.67

#We can see that `trade_ij` has zeros. We can change that the same way we changed it for `export_ij`: 
gd[gd$trade_ij==0,"trade_ij"] <- 0.0001

#Now we can run the following gravity model: 
lm(log(trade_ij) ~ log(dist) + log(export_ij), data = gd) 

#We can see that the coefficient for log(export_ij) is 0.2976 or 0.30, hence A is the correct answer.

5.1.3 Linear Gravity model with more variables and interaction terms

Let’s create a new regression with more variables:

reg5 <- lm(log(export_ij) ~ log(dist) + 
             log(gdp_i) + 
             log(gdp_j) +
             rta +
             log(ga_tariff_ijji_sa)
             , data = gd)
summary(reg5)
## 
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + 
##     rta + log(ga_tariff_ijji_sa), data = gd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.334  -1.688   1.713   4.409  19.737 
## 
## Coefficients:
##                          Estimate Std. Error  t value            Pr(>|t|)    
## (Intercept)            -62.352427   0.273433 -228.036 <0.0000000000000002 ***
## log(dist)               -2.516327   0.018142 -138.703 <0.0000000000000002 ***
## log(gdp_i)               2.489102   0.006209  400.867 <0.0000000000000002 ***
## log(gdp_j)               1.381105   0.006204  222.598 <0.0000000000000002 ***
## rta                      1.322291   0.042079   31.424 <0.0000000000000002 ***
## log(ga_tariff_ijji_sa)  -0.224746   0.242981   -0.925               0.355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.107 on 301412 degrees of freedom
##   (210626 observations deleted due to missingness)
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4224 
## F-statistic: 4.409e+04 on 5 and 301412 DF,  p-value: < 0.00000000000000022

We can export the coefficients to a clipboard so you can “Ctrl/Cmd + V” (paste) in excel:

write.table(
  summary(reg5)$coefficients #you select just coefficients, it will also work
  , "clipboard", sep="\t", row.names=FALSE)

We can also add interaction terms in order to capture joint variable effects by specifying the interaction in I operator:

reg6 <- lm(log(export_ij) ~ log(dist) + 
             log(gdp_i) + 
             log(gdp_j) +
             rta +
             I(rta*log(gdp_j)) + #put interaction in I operator
             I(dist^2) + #first order polynomial  - non-linear transformation
             log(ga_tariff_ijji_sa)
           , data = gd)
summary(reg6)
## 
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + 
##     rta + I(rta * log(gdp_j)) + I(dist^2) + log(ga_tariff_ijji_sa), 
##     data = gd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.160  -1.726   1.697   4.412  20.456 
## 
## Coefficients:
##                                 Estimate        Std. Error  t value
## (Intercept)            -63.9832459652895   0.3271008450554 -195.607
## log(dist)               -2.9782272151947   0.0263709439899 -112.936
## log(gdp_i)               2.5488480880571   0.0062645188343  406.871
## log(gdp_j)               1.5342760391309   0.0067795539273  226.309
## rta                     23.4357915076500   0.4081369407485   57.421
## I(rta * log(gdp_j))     -0.8966090092956   0.0163094395377  -54.975
## I(dist^2)                0.0000000062601   0.0000000002558   24.469
## log(ga_tariff_ijji_sa)  -1.1094688930275   0.2419365488330   -4.586
##                                    Pr(>|t|)    
## (Intercept)            < 0.0000000000000002 ***
## log(dist)              < 0.0000000000000002 ***
## log(gdp_i)             < 0.0000000000000002 ***
## log(gdp_j)             < 0.0000000000000002 ***
## rta                    < 0.0000000000000002 ***
## I(rta * log(gdp_j))    < 0.0000000000000002 ***
## I(dist^2)              < 0.0000000000000002 ***
## log(ga_tariff_ijji_sa)           0.00000452 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.063 on 301410 degrees of freedom
##   (210626 observations deleted due to missingness)
## Multiple R-squared:  0.4295, Adjusted R-squared:  0.4295 
## F-statistic: 3.242e+04 on 7 and 301410 DF,  p-value: < 0.00000000000000022

5.1.4 STATA-like robust clustered standard errors

Estimates from Ordinary least squares (OLS) are consistent, unbiased, and efficient if these three necessary and sufficient conditions are met:
1). The standard errors are homoskedastic.
2). There is orthogonality between the errors.
3). None of the explanatory variables is a linear combination of the other explanatory variables.

In practice however, it is not uncommon to have cases where we have heteroskedastic standard errors. In order to control for this we use robust standard errors. Robust standard errors are arbitrary patterns of heteroskedasticity in the data and therefore a simple and effective way of fixing violations of the second OLS assumption.

In the gravity model context, errors are likely to be correlated by country pair. So it is important to allow for clustering by country pair. To do this, it is necessary to specify a clustering variable that separately identifies each country pair independently of the direction of trade. An example is distance, which is unique to each country pair but is identical for both directions of trade.

(Once again for the full details of this model - check ESCAP’s gravity model guide in R.)

Having this in mind, for linear gravity models, robust clustered standard errors are a way to obtain better estimates. In R we can use the function lm_robust from the estimatr package.

Let’s first install and then activate the package. Uncomment the code to install it by removing the # and running install.packages("estimatr").

#install.packages("estimatr")
library(estimatr)
## Warning: package 'estimatr' was built under R version 3.5.3

You can obtain STATA-like robust clustered standard errors by using the lm_robust. We specify the equation log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + rta + log(ga_tariff_ijji_sa) as usual. What is different from the lm function is that here we can specify which variable to be our cluster and the type of standard errors we want. In the example below, we chose our cluster to be distance: clusters = dist and to obtain STATA-like robust standard errors: se_type = "stata". (As usual for the full details of the specifications of lm_robust run ??lm_robust and select the help page titled estimatr::lm_robust)

reg_robust <- lm_robust(log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + rta + log(ga_tariff_ijji_sa), 
          data = gd, clusters = dist, se_type = "stata")
summary(reg_robust)
## 
## Call:
## lm_robust(formula = log(export_ij) ~ log(dist) + log(gdp_i) + 
##     log(gdp_j) + rta + log(ga_tariff_ijji_sa), data = gd, clusters = dist, 
##     se_type = "stata")
## 
## Standard error type:  stata 
## 
## Coefficients:
##                        Estimate Std. Error  t value
## (Intercept)            -62.3524    0.86939 -71.7201
## log(dist)               -2.5163    0.05655 -44.4980
## log(gdp_i)               2.4891    0.01948 127.7587
## log(gdp_j)               1.3811    0.01766  78.2119
## rta                      1.3223    0.10906  12.1240
## log(ga_tariff_ijji_sa)  -0.2247    0.64297  -0.3495
##                                                      Pr(>|t|) CI Lower CI Upper
## (Intercept)            0.000000000000000000000000000000000000  -64.057  -60.648
## log(dist)              0.000000000000000000000000000000000000   -2.627   -2.405
## log(gdp_i)             0.000000000000000000000000000000000000    2.451    2.527
## log(gdp_j)             0.000000000000000000000000000000000000    1.346    1.416
## rta                    0.000000000000000000000000000000001213    1.109    1.536
## log(ga_tariff_ijji_sa) 0.726687161813804882726230971456971020   -1.485    1.036
##                           DF
## (Intercept)            12607
## log(dist)              12607
## log(gdp_i)             12607
## log(gdp_j)             12607
## rta                    12607
## log(ga_tariff_ijji_sa) 12607
## 
## Multiple R-squared:  0.4224 ,    Adjusted R-squared:  0.4224 
## F-statistic:  4266 on 5 and 12607 DF,  p-value: < 0.00000000000000022

Let’s compare these results to reg5 which has the exact same specifications except the standard errors are not robust nor clustered. We see that the significance of the coefficients remains the same under both methods except for the variable rta. This is due to the fact that lm_robust produces higher standard error values than lm as they are robust.

summary(reg5)
## 
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + 
##     rta + log(ga_tariff_ijji_sa), data = gd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.334  -1.688   1.713   4.409  19.737 
## 
## Coefficients:
##                          Estimate Std. Error  t value            Pr(>|t|)    
## (Intercept)            -62.352427   0.273433 -228.036 <0.0000000000000002 ***
## log(dist)               -2.516327   0.018142 -138.703 <0.0000000000000002 ***
## log(gdp_i)               2.489102   0.006209  400.867 <0.0000000000000002 ***
## log(gdp_j)               1.381105   0.006204  222.598 <0.0000000000000002 ***
## rta                      1.322291   0.042079   31.424 <0.0000000000000002 ***
## log(ga_tariff_ijji_sa)  -0.224746   0.242981   -0.925               0.355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.107 on 301412 degrees of freedom
##   (210626 observations deleted due to missingness)
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4224 
## F-statistic: 4.409e+04 on 5 and 301412 DF,  p-value: < 0.00000000000000022

Quiz

What is the value for the standard errors of log(export_ij) if we run log(trade_ij) ~ log(dist) + log(export_ij) with STATA-like robust standard errors? (Hint: Check how we ran the command above and change the equation accordingly!)

A. 0.0015
B. 0.0197
C. 0.0201
D. 0.1756

# We run the command:
lm_robust(log(trade_ij) ~ log(dist) + log(export_ij), data = gd, clusters = dist, se_type = "stata") 

# The value for the standard errors for log(export_ij) is 0.001470578 or 0.0015, hence answer A is the correct answer.

5.2 Poisson Pseudo-Maximum Likelihood Estimator (PPML)

The PPML model is often used in trade analysis because it can include trade values equal to zero, have fixed effects and easy interpretation. We will go over the basics of the PPML model in R. Once again, see for more information ESCAP’s gravity model guide in R.

The PPML function in R can be found in the gravity package:

#install.packages("gravity")
library(gravity)
## Warning: package 'gravity' was built under R version 3.5.3
## 
## Attaching package: 'gravity'
## The following object is masked from 'package:stats':
## 
##     nls

Here we use export_ij as an output variable, we specify the distance variable in our data (dist in this case) and add any additional variables (e.g. partner GDP: gdp_j and reporter GDP: gdp_i). One important point is that when we run the distance variable in PPML in R, we transform it using the natural logarithm, i.e. log in R.

ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i' ), data=gd)
## 
## Call:  y_ppml ~ dist_log + gdp_j + gdp_i
## 
## Coefficients:
##         (Intercept)             dist_log                gdp_j  
## 28.0022547089188549  -1.0674026797574185   0.0000000000003008  
##               gdp_i  
##  0.0000000000002836  
## 
## Degrees of Freedom: 435085 Total (i.e. Null);  435082 Residual
##   (43746 observations deleted due to missingness)
## Null Deviance:       1423000000000000 
## Residual Deviance: 724400000000000   AIC: NA

As mentioned, one of the advantages of PPML is that it can handle zeros so let’s change export_ij back to zero:

gd[gd$export_ij==0.0001,"export_ij"] <- 0
ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i' ), data=gd)
## 
## Call:  y_ppml ~ dist_log + gdp_j + gdp_i
## 
## Coefficients:
##         (Intercept)             dist_log                gdp_j  
## 28.0022547089189793  -1.0674026797574607   0.0000000000003008  
##               gdp_i  
##  0.0000000000002836  
## 
## Degrees of Freedom: 435085 Total (i.e. Null);  435082 Residual
##   (43746 observations deleted due to missingness)
## Null Deviance:       1423000000000000 
## Residual Deviance: 724400000000000   AIC: NA

First of all, notice that the distance is logged.

Second, in this case there is no change between the coefficients of the PPML model with 0.0001 and 0. This is most likely due to the fact that the sample with zeros is relatively small.

Quiz

Let’s do a quick practice! We run the exact same model as above, but add rta to the list of variables. What is the coefficient for rta? Make sure you have run all the necessary code from this section!

A. 1.549107
B. 25.417498
C. -0.795194
D. 1.103045

# Run the following code if not already ran:
gd[gd$export_ij==0.0001,"export_ij"] <- 0

# Next we specify the PPMl model as follows:
ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i', 'rta' ), data=gd)

# Once you run all the codes above, you see that the coefficient of `rta` is 1.103045, so D is the correct answer.