# Chapter 5 **Gravity models**

Please keep in mind that from this section onward we will learn how to analyse trade data in R. Specifically, this section is designed to teach you how to estimate gravity models in R and will thus assume you are familiar with the theory behind them and their use. If you are interested in an in-depth paper about the different types of gravity models - check ESCAP’s original guide in STATA or the recent version in R. UN ESCAP also has an online gravity tool.

Let’s quickly remind ourselves the use of gravity models:

**Gravity models link trade flows with economic size and inversely with trade costs. Hence, gravity models can capture patterns in international trade and production. Policy researchers can use them to estimate the trade impacts of various trade-related policies, from traditional tariffs to new “behind-the-border” measures.** (From ESCAP’s guide on gravity models found here.)

In this chapter we will cover the most basic form of the gravity model - the linear model. We will show you how to obtain STATA-like robust clustered standard errors. We will finish the analysis with the Poisson Pseudo-Maximum Likelihood Estimator (PPML).

## 5.1 Linear gravity model

### 5.1.1 Linear regression

Let’s start with the most basic gravity model: linear regression using the `lm`

command. (`lm`

stands for linear model.)

Let’s first generate an `x`

and `y`

value by using the `rnorm`

function. (Check Section 1.2 to remind yourself what this function does or run the code `?rnorm`

.)

```
y <- rnorm(100,1,5)
x <- y * 2 + rnorm(100,0,1)
```

Inside the brackets, on the left side of the tilde (`~`

), is the output variable and on the right side the input variable(s). In this case we estimate the equation `y = x + c`

, where `c`

is a constant which we find once the program is ran. `reg1`

is the name we have given our regression.

```
reg1 <- lm(y~x)
reg1
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.002012 0.502247
```

We can access regression output with the function `summary`

. This gives us all the basic output one needs from a regression, such as the values of the estimates, the standard errors, p-values, etc. (Remember `summary`

can be used on both data sets and model objects in R.)

Let’s run `summary`

for `reg1`

. The first line of output gives us the model specification, e.g. `Call: lm(formula = y ~ x)`

. Afterwards, descriptive statistics are provided for the `Residuals`

, such as the minimum (`Min`

) and maximum values (`Max`

), the median (`Median`

) and the first (`1Q`

) and third quartile (`3Q`

). For instance the `Median`

value in this case is 0.04081.

Next we get the `Coefficients`

. We have the coefficient value under `Estimate`

, after it the standard error (`Std. Error`

), the t-statistic (`t value`

) and the p-value (`Pr(>|t|)`

). For `reg1`

the coefficient of the `intercept`

is 0.017038 and the `x`

coefficient is 0.486928. The `x`

coefficient is significant as shown under column `Pr(>|t|)`

.

Below the `Coefficients`

information, is a legend for the significance level `Signif. codes`

. For example, `0.001 ‘**’`

means that `**`

provide 0.1% of significance.

Finally, we have information on the `Residual standard error`

, the R squared `Multiple R-squared`

and `Adjusted R-squared`

, and the `F-statistic`

and its `p-value`

. For `reg1`

the R squared is 0.9892.

(This `summary`

output will vary for the type of model used. The more complicated the model, the more information will be provided generally speaking.)

`summary(reg1)`

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24012 -0.32035 -0.01573 0.37339 1.25208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.002012 0.052788 0.038 0.97
## x 0.502247 0.005181 96.935 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5184 on 98 degrees of freedom
## Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
## F-statistic: 9396 on 1 and 98 DF, p-value: < 0.00000000000000022
```

The `attributes`

function shows you the full list of attributes you can pull out from the `summary`

function used on `reg1`

:

`attributes(summary(reg1))`

```
## $names
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
##
## $class
## [1] "summary.lm"
```

For example you can just pull out the \(R^2\) alone:

`summary(reg1)$r.squared`

`## [1] 0.9896781`

### Quiz

Let’s practice!

Which answer correctly:

- declares two new variables with specifications `x <- rnorm(100,2,5)`

and `y <- x + rnorm(100,1,5)`

- calculates the regression equation y = x + c

- asks R for the regression’s output `summary`

?

(Hint: y = x + c, c is a constant which we will find out once the program is ran, don’t worry about it, it should not be in your code!)

A. `x <- rnorm(100,2,5)`

`y <- x + rnorm(100,1,5)`

`reg2 <- lm(y~x)`

`summary(reg2)`

`#This is correct. Good job!`

B. `reg2 <- lm(y~x)`

`summary(reg2)`

`#Answer B is incorrect because it fails to include the declaration of x and y, therefore the correct answer is A.`

C. `x <- rnorm(100,2,5)`

`y <- x + rnorm(100,1,5)`

`#Answer C is incorrect because it only declares the variables x and y but does not specify the regression nor expression for the summary, therefore the correct answer is A.`

### 5.1.2 Linear gravity model continued

Let’s work with some real trade data! We begin by importing the data in R with `read.csv`

. You can download it here.

```
gd <- read.csv("data/short_data.csv",
stringsAsFactors = FALSE)
```

We see the list of variables in the data set with the `name`

command.

`names(gd)`

```
## [1] "X" "reporter" "partner"
## [4] "year" "export_ij" "import_ij"
## [7] "trade_ij" "export_i" "import_i"
## [10] "trade_i" "export_j" "import_j"
## [13] "trade_j" "ctcij" "ntcij"
## [16] "tariffrateij_sa" "tariffrateij_wa" "tariff_ij_sa"
## [19] "tariff_ij_wa" "tariffrateji_sa" "tariffrateji_wa"
## [22] "tariff_ji_sa" "tariff_ji_wa" "ga_tariff_ijji_sa"
## [25] "ga_tariff_ijji_wa" "dist" "contig"
## [28] "comlang_off" "comlang_ethno" "colony"
## [31] "comcol" "curcol" "col45"
## [34] "smctry" "landlocked_i" "landlocked_j"
## [37] "landlocked_ij" "rta" "gdp_i"
## [40] "gdpgrowth_i" "gdppc_i" "gdppcgrowth_i"
## [43] "rgdp_i" "rgdppc_i" "gdp_j"
## [46] "gdpgrowth_j" "gdppc_j" "gdppcgrowth_j"
## [49] "rgdp_j" "rgdppc_j" "dtf_i"
## [52] "tab_i" "startbiz_i" "credit_i"
## [55] "invest_i" "tax_i" "contract_i"
## [58] "dtf_j" "tab_j" "startbiz_j"
## [61] "credit_j" "invest_j" "tax_j"
## [64] "contract_j" "lsci_i" "lsci_j"
## [67] "lpi_i" "lpi_j"
```

Gravity models link trade flows with economic size and inversely with trade costs. For our first gravity model in R, let’s check the relationship between the exports (`export_ij`

) and distance between the exporting and importing countries (`dist`

):

`lm(gd$export_ij ~ gd$dist)`

```
##
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
##
## Coefficients:
## (Intercept) gd$dist
## 1094335800 -73723
```

As we saw in the beginning of Section 5.1, we can create an object in R to store the results:

`reg1 <- lm(gd$export_ij ~ gd$dist)`

If you just call `reg1`

by itself, you won’t see much. Remember, for more detailed information we use `summary`

:

`summary(reg1)`

```
##
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1090653864 -758408424 -507452840 -200191922 409695392687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1094335800 16216752 67.48 <0.0000000000000002 ***
## gd$dist -73723 1932 -38.16 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5307000000 on 374236 degrees of freedom
## (137806 observations deleted due to missingness)
## Multiple R-squared: 0.003875, Adjusted R-squared: 0.003872
## F-statistic: 1456 on 1 and 374236 DF, p-value: < 0.00000000000000022
```

Trade values frequently contain zeros. How do we see how many rows in `gd`

have zero in the column `export_ij`

?

`nrow`

shows the total number of rows:

`nrow(gd)`

`## [1] 512044`

The code below should give us the values that are zero, but because there are also missing values in `export_ij`

, it shows the number of rows of zeros **and** missing values combined:

`nrow(gd[gd$export_ij==0,])`

`## [1] 120815`

A better way is to use `length`

which gives the length of a variable list:

`length(which(gd$export_ij==0))`

`## [1] 6409`

We use the function `which`

to signify that we want to count only the observations where `gd$export_ij==0`

is `TRUE`

.

`length(which(gd$export_ij==0))`

is better than `nrow(gd[gd$export_ij==0,])`

because it counts only the zeros and excludes the missing values (`NA`

). This is why the number of observations is less with `length`

.

We can find the length of `gd`

:

`length(gd)`

`## [1] 68`

We get 68 number of variables.

The number of observation are:

`length(gd$export_ij)`

`## [1] 512044`

We can check the number of missing values:

`length(gd$export_ij[is.na(gd$export_ij)])`

`## [1] 114406`

Let’s assume the missing values of `export_ij`

are zero. How do we change their value from missing to zero?

Remember we need to select the correct condition, which in this case is `is.na(gd$export_ij)`

but also tell R for which variable it should change the values - `export_ij`

:

`gd[is.na(gd$export_ij),"export_ij"] <- 0`

Double checking this worked:

`length(gd$export_ij[is.na(gd$export_ij)])`

`## [1] 0`

Yes it did! Now we have 0 rows in `export_ij`

which are missing.

Let’s regress once again distance with exports to compare:

```
reg2 <- lm(gd$export_ij ~ gd$dist)
summary(reg2)
```

```
##
## Call:
## lm(formula = gd$export_ij ~ gd$dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -939688372 -624180945 -404594794 -132356225 409761225877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 943289085 13155595 71.70 <0.0000000000000002 ***
## gd$dist -65972 1497 -44.07 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4698000000 on 478830 degrees of freedom
## (33212 observations deleted due to missingness)
## Multiple R-squared: 0.004039, Adjusted R-squared: 0.004037
## F-statistic: 1942 on 1 and 478830 DF, p-value: < 0.00000000000000022
```

There is a difference!

We can try to regress the logs of the same variables in order to check the percentage change:

`reg3 <- lm(log(gd$export_ij) ~ log(gd$dist))`

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, …) : NA/NaN/Inf in ‘y’

We get an error, because we cannot take the logarithm of zero. For the purposes of the exercise, we replace all zero export_ij with 0.0001. (A quick reminder: that the default of the log function in R is natural logarithm unless changed!)

```
gd[gd$export_ij==0,"export_ij"] <- 0.0001
reg3 <- lm(log(gd$export_ij) ~ log(gd$dist))
summary(reg3)
```

```
##
## Call:
## lm(formula = log(gd$export_ij) ~ log(gd$dist))
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.756 -3.341 3.907 7.381 19.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.37332 0.15809 230.1 <0.0000000000000002 ***
## log(gd$dist) -3.09398 0.01816 -170.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.51 on 478830 degrees of freedom
## (33212 observations deleted due to missingness)
## Multiple R-squared: 0.05717, Adjusted R-squared: 0.05717
## F-statistic: 2.904e+04 on 1 and 478830 DF, p-value: < 0.00000000000000022
```

This result says that for every 1% increase in `distance`

, `export_ij`

is associated with 3.09% decrease. (This is also known as the elasticity of exports with respect to distance.)

We can reproduce the above, but instead of specifying `gd$`

in front of every variable, we can also specify the data source as one of the parameters in the `lm`

function. Both ways work and provide the same results, it is up to personal preference which one to choose.

```
reg4 <- lm(log(export_ij) ~ log(dist), data = gd)
summary(reg4)
```

```
##
## Call:
## lm(formula = log(export_ij) ~ log(dist), data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.756 -3.341 3.907 7.381 19.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.37332 0.15809 230.1 <0.0000000000000002 ***
## log(dist) -3.09398 0.01816 -170.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.51 on 478830 degrees of freedom
## (33212 observations deleted due to missingness)
## Multiple R-squared: 0.05717, Adjusted R-squared: 0.05717
## F-statistic: 2.904e+04 on 1 and 478830 DF, p-value: < 0.00000000000000022
```

### Quiz

So far so good. What is the coefficient for `log(export_ij)`

if we run `log(trade_ij) ~ log(dist) + log(export_ij)`

? (Hint: Make sure the new variable that is included - `trade_ij`

does not have any missing values or zeros as you will take the logarithm of it. If it is the case, do the same as we did with `export_ij`

.)

A. 0.30

B. 0.43

C. 0.55

D. 0.67

```
#We can see that `trade_ij` has zeros. We can change that the same way we changed it for `export_ij`:
gd[gd$trade_ij==0,"trade_ij"] <- 0.0001
#Now we can run the following gravity model:
lm(log(trade_ij) ~ log(dist) + log(export_ij), data = gd)
#We can see that the coefficient for log(export_ij) is 0.2976 or 0.30, hence A is the correct answer.
```

### 5.1.3 Linear Gravity model with more variables and interaction terms

Let’s create a new regression with more variables:

```
reg5 <- lm(log(export_ij) ~ log(dist) +
log(gdp_i) +
log(gdp_j) +
rta +
log(ga_tariff_ijji_sa)
, data = gd)
summary(reg5)
```

```
##
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) +
## rta + log(ga_tariff_ijji_sa), data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.334 -1.688 1.713 4.409 19.737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.352427 0.273433 -228.036 <0.0000000000000002 ***
## log(dist) -2.516327 0.018142 -138.703 <0.0000000000000002 ***
## log(gdp_i) 2.489102 0.006209 400.867 <0.0000000000000002 ***
## log(gdp_j) 1.381105 0.006204 222.598 <0.0000000000000002 ***
## rta 1.322291 0.042079 31.424 <0.0000000000000002 ***
## log(ga_tariff_ijji_sa) -0.224746 0.242981 -0.925 0.355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.107 on 301412 degrees of freedom
## (210626 observations deleted due to missingness)
## Multiple R-squared: 0.4224, Adjusted R-squared: 0.4224
## F-statistic: 4.409e+04 on 5 and 301412 DF, p-value: < 0.00000000000000022
```

We can export the coefficients to a clipboard so you can “Ctrl/Cmd + V” (paste) in excel:

```
write.table(
summary(reg5)$coefficients #you select just coefficients, it will also work
, "clipboard", sep="\t", row.names=FALSE)
```

We can also add interaction terms in order to capture joint variable effects by specifying the interaction in `I`

operator:

```
reg6 <- lm(log(export_ij) ~ log(dist) +
log(gdp_i) +
log(gdp_j) +
rta +
I(rta*log(gdp_j)) + #put interaction in I operator
I(dist^2) + #first order polynomial - non-linear transformation
log(ga_tariff_ijji_sa)
, data = gd)
summary(reg6)
```

```
##
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) +
## rta + I(rta * log(gdp_j)) + I(dist^2) + log(ga_tariff_ijji_sa),
## data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.160 -1.726 1.697 4.412 20.456
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -63.9832459652895 0.3271008450554 -195.607
## log(dist) -2.9782272151947 0.0263709439899 -112.936
## log(gdp_i) 2.5488480880571 0.0062645188343 406.871
## log(gdp_j) 1.5342760391309 0.0067795539273 226.309
## rta 23.4357915076500 0.4081369407485 57.421
## I(rta * log(gdp_j)) -0.8966090092956 0.0163094395377 -54.975
## I(dist^2) 0.0000000062601 0.0000000002558 24.469
## log(ga_tariff_ijji_sa) -1.1094688930275 0.2419365488330 -4.586
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## log(dist) < 0.0000000000000002 ***
## log(gdp_i) < 0.0000000000000002 ***
## log(gdp_j) < 0.0000000000000002 ***
## rta < 0.0000000000000002 ***
## I(rta * log(gdp_j)) < 0.0000000000000002 ***
## I(dist^2) < 0.0000000000000002 ***
## log(ga_tariff_ijji_sa) 0.00000452 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.063 on 301410 degrees of freedom
## (210626 observations deleted due to missingness)
## Multiple R-squared: 0.4295, Adjusted R-squared: 0.4295
## F-statistic: 3.242e+04 on 7 and 301410 DF, p-value: < 0.00000000000000022
```

### 5.1.4 STATA-like robust clustered standard errors

Estimates from Ordinary least squares (OLS) are consistent, unbiased, and efficient if these three necessary and sufficient conditions are met:

1). The standard errors are homoskedastic.

2). There is orthogonality between the errors.

3). None of the explanatory variables is a linear combination of the other explanatory variables.

In practice however, it is *not* uncommon to have cases where we have heteroskedastic standard errors. In order to control for this we use robust standard errors. Robust standard errors are arbitrary patterns of heteroskedasticity in the data and therefore a simple and effective way of fixing violations of the second OLS assumption.

In the gravity model context, errors are likely to be correlated by country pair. So it is important to allow for clustering by country pair. To do this, it is necessary to specify a clustering variable that separately identifies each country pair independently of the direction of trade. An example is distance, which is unique to each country pair but is identical for both directions of trade.

(Once again for the full details of this model - check ESCAP’s gravity model guide in R.)

Having this in mind, for linear gravity models, robust clustered standard errors are a way to obtain better estimates. In R we can use the function `lm_robust`

from the `estimatr`

package.

Let’s first install and then activate the package. Uncomment the code to install it by removing the `#`

and running `install.packages("estimatr")`

.

```
#install.packages("estimatr")
library(estimatr)
```

`## Warning: package 'estimatr' was built under R version 3.5.3`

You can obtain STATA-like robust clustered standard errors by using the `lm_robust`

. We specify the equation `log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + rta + log(ga_tariff_ijji_sa)`

as usual. What is different from the `lm`

function is that here we can specify which variable to be our cluster and the type of standard errors we want. In the example below, we chose our cluster to be distance: `clusters = dist`

and to obtain STATA-like robust standard errors: `se_type = "stata"`

. (As usual for the full details of the specifications of `lm_robust`

run `??lm_robust`

and select the help page titled `estimatr::lm_robust`

)

```
reg_robust <- lm_robust(log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) + rta + log(ga_tariff_ijji_sa),
data = gd, clusters = dist, se_type = "stata")
summary(reg_robust)
```

```
##
## Call:
## lm_robust(formula = log(export_ij) ~ log(dist) + log(gdp_i) +
## log(gdp_j) + rta + log(ga_tariff_ijji_sa), data = gd, clusters = dist,
## se_type = "stata")
##
## Standard error type: stata
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -62.3524 0.86939 -71.7201
## log(dist) -2.5163 0.05655 -44.4980
## log(gdp_i) 2.4891 0.01948 127.7587
## log(gdp_j) 1.3811 0.01766 78.2119
## rta 1.3223 0.10906 12.1240
## log(ga_tariff_ijji_sa) -0.2247 0.64297 -0.3495
## Pr(>|t|) CI Lower CI Upper
## (Intercept) 0.000000000000000000000000000000000000 -64.057 -60.648
## log(dist) 0.000000000000000000000000000000000000 -2.627 -2.405
## log(gdp_i) 0.000000000000000000000000000000000000 2.451 2.527
## log(gdp_j) 0.000000000000000000000000000000000000 1.346 1.416
## rta 0.000000000000000000000000000000001213 1.109 1.536
## log(ga_tariff_ijji_sa) 0.726687161813804882726230971456971020 -1.485 1.036
## DF
## (Intercept) 12607
## log(dist) 12607
## log(gdp_i) 12607
## log(gdp_j) 12607
## rta 12607
## log(ga_tariff_ijji_sa) 12607
##
## Multiple R-squared: 0.4224 , Adjusted R-squared: 0.4224
## F-statistic: 4266 on 5 and 12607 DF, p-value: < 0.00000000000000022
```

Let’s compare these results to `reg5`

which has the exact same specifications except the standard errors are not robust nor clustered. We see that the significance of the coefficients remains the same under both methods except for the variable `rta`

. This is due to the fact that `lm_robust`

produces higher standard error values than `lm`

as they are robust.

`summary(reg5)`

```
##
## Call:
## lm(formula = log(export_ij) ~ log(dist) + log(gdp_i) + log(gdp_j) +
## rta + log(ga_tariff_ijji_sa), data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.334 -1.688 1.713 4.409 19.737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.352427 0.273433 -228.036 <0.0000000000000002 ***
## log(dist) -2.516327 0.018142 -138.703 <0.0000000000000002 ***
## log(gdp_i) 2.489102 0.006209 400.867 <0.0000000000000002 ***
## log(gdp_j) 1.381105 0.006204 222.598 <0.0000000000000002 ***
## rta 1.322291 0.042079 31.424 <0.0000000000000002 ***
## log(ga_tariff_ijji_sa) -0.224746 0.242981 -0.925 0.355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.107 on 301412 degrees of freedom
## (210626 observations deleted due to missingness)
## Multiple R-squared: 0.4224, Adjusted R-squared: 0.4224
## F-statistic: 4.409e+04 on 5 and 301412 DF, p-value: < 0.00000000000000022
```

### Quiz

What is the value for the standard errors of `log(export_ij)`

if we run `log(trade_ij) ~ log(dist) + log(export_ij)`

with STATA-like robust standard errors? (Hint: Check how we ran the command above and change the equation accordingly!)

A. 0.0015

B. 0.0197

C. 0.0201

D. 0.1756

```
# We run the command:
lm_robust(log(trade_ij) ~ log(dist) + log(export_ij), data = gd, clusters = dist, se_type = "stata")
# The value for the standard errors for log(export_ij) is 0.001470578 or 0.0015, hence answer A is the correct answer.
```

## 5.2 Poisson Pseudo-Maximum Likelihood Estimator (PPML)

The PPML model is often used in trade analysis because it can include trade values equal to zero, have fixed effects and easy interpretation. We will go over the basics of the PPML model in R. Once again, see for more information ESCAP’s gravity model guide in R.

The PPML function in R can be found in the `gravity`

package:

```
#install.packages("gravity")
library(gravity)
```

`## Warning: package 'gravity' was built under R version 3.5.3`

```
##
## Attaching package: 'gravity'
```

```
## The following object is masked from 'package:stats':
##
## nls
```

Here we use `export_ij`

as an output variable, we specify the distance variable in our data (`dist`

in this case) and add any additional variables (e.g. partner GDP: `gdp_j`

and reporter GDP: `gdp_i`

). One important point is that when we run the distance variable in PPML in R, we transform it using the natural logarithm, i.e. `log`

in R.

`ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i' ), data=gd)`

```
##
## Call: y_ppml ~ dist_log + gdp_j + gdp_i
##
## Coefficients:
## (Intercept) dist_log gdp_j
## 28.0022547089188549 -1.0674026797574185 0.0000000000003008
## gdp_i
## 0.0000000000002836
##
## Degrees of Freedom: 435085 Total (i.e. Null); 435082 Residual
## (43746 observations deleted due to missingness)
## Null Deviance: 1423000000000000
## Residual Deviance: 724400000000000 AIC: NA
```

As mentioned, one of the advantages of PPML is that it can handle zeros so let’s change `export_ij`

back to zero:

```
gd[gd$export_ij==0.0001,"export_ij"] <- 0
ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i' ), data=gd)
```

```
##
## Call: y_ppml ~ dist_log + gdp_j + gdp_i
##
## Coefficients:
## (Intercept) dist_log gdp_j
## 28.0022547089189793 -1.0674026797574607 0.0000000000003008
## gdp_i
## 0.0000000000002836
##
## Degrees of Freedom: 435085 Total (i.e. Null); 435082 Residual
## (43746 observations deleted due to missingness)
## Null Deviance: 1423000000000000
## Residual Deviance: 724400000000000 AIC: NA
```

First of all, notice that the distance is logged.

Second, in this case there is no change between the coefficients of the PPML model with 0.0001 and 0. This is most likely due to the fact that the sample with zeros is relatively small.

### Quiz

Let’s do a quick practice! We run the exact same model as above, but add `rta`

to the list of variables. What is the coefficient for `rta`

? Make sure you have run all the necessary code from this section!

A. 1.549107

B. 25.417498

C. -0.795194

D. 1.103045

```
# Run the following code if not already ran:
gd[gd$export_ij==0.0001,"export_ij"] <- 0
# Next we specify the PPMl model as follows:
ppml("export_ij", dist="dist", c('gdp_j' ,'gdp_i', 'rta' ), data=gd)
# Once you run all the codes above, you see that the coefficient of `rta` is 1.103045, so D is the correct answer.
```