Chapter 2 Loading data and subsetting

In this section we will learn how to load and operate data. Later, Section 3 will expand on subsetting data sets and show you how to make loops.

2.1 Basic data loading

For this section you can download the data here.

We can load the CSV data from the website link with:

trade_data <- read.csv("https://r.tiid.org/R_tute_data2007-2017.csv")

It is advised to work in R by setting a working directory. Set your working directory from which your data frames will be imported and to which the resulting files will be exported by using the function setwd. This is a very important step, as it will allow you to import files from this location without having to indicate the full path to this directory. And RStudio will save all output and backup files into that directory automatically.

You can also copy the path from your “File explorer”. Note for PC users: when inputting the path to any directory in RStudio, it is necessary to replace all backslashes \ with forward slashes /, or you will get an error message.

setwd("C:/path/to_your/working_directory")

You can check your working directory using getwd.

getwd()

## [1] "C:/path/to_your/working_directory"

Once you have set your working directory, we can read the CSV file in R using the command read.csv by just referencing the name of the CSV file’s name. (Remember you can run ?read.csv to see all the options of this R function.)

trade_data <- read.csv("data/R_tute_data2007-2017.csv")

It is possible to run into the following error, if so don’t forget to put “.csv”, or make sure the location of your file is correct.

Error in file(file, “rt”) : cannot open the connection
In addition: Warning message:
In file(file, “rt”) :
cannot open file ‘C:/path/to_your/working_directory/r_tute_data2007-2017.csv’: No such file or directory

The trade_data file we uploaded on R is a spreadsheet, in R called a “data.frame”.

class(trade_data)
## [1] "data.frame"

We check the variable names of the data set with the names command:

names(trade_data)
## [1] "reporter" "flow"     "partner"  "year"     "value"

We can also open the data set by clicking on it. Remember from the Introduction, your variables and data sets are located in the upper right corner.

From the names function we know that our trade_data has a reporter column which shows the country reporting the trade values and partner is the destination country. flow specifies whether it is an export or an import. year represents the year the trade value was recorded.

Let’s now get into some data subsetting!

2.2 Basic data subsetting

We check the first few rows of a data set with the head command:

head(trade_data)
##   reporter flow partner year      value
## 1      213    I     186 2008  156130202
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 5      213    I     181 2008    2407260
## 6      213    I     182 2008   80414681

The head function is very useful when we just want to get a small glimpse of the data - for instance what are the variables included. If you need to see more than just the first rows, you can also open it by clicking on it’s name.

If we want to check a specific number of rows, we can do so by specifying this number as in below:

head(trade_data,10)
##    reporter flow partner year       value
## 1       213    I     186 2008   156130202
## 2       213    E     174 2008   117661208
## 3       213    E     172 2008    31986252
## 4       213    E     134 2008  1507966544
## 5       213    I     181 2008     2407260
## 6       213    I     182 2008    80414681
## 7       213    I     676 2008      991884
## 8       213    E     258 2008   107246580
## 9       614    I     110 2008 15302709034
## 10      213    I     311 2008

This way we can create a smaller subset of the data with the first 10 rows to perform initial tests:

first10 <- head(trade_data,10)

There is also another way to access row data. We can select the entire first row also like this:

first10[1,]
##   reporter flow partner year     value
## 1      213    I     186 2008 156130202

Notice that we use square brackets because this is a data set. If you run the above code but with () you will get an error:

first10(1,)

Error in first10(1, ) : could not find function “first10”.

This is because we use the round brackets () to specify the options for functions and square brackets [] to specify the options for data sets.

If we want to select the first five rows, both ways give the same result. Notice in the second way 1:5 means from row 1 to row 5 inclusive.

head(first10, 5)
##   reporter flow partner year      value
## 1      213    I     186 2008  156130202
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 5      213    I     181 2008    2407260
first10[1:5,]
##   reporter flow partner year      value
## 1      213    I     186 2008  156130202
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 5      213    I     181 2008    2407260

We can also get specific rows, by specifying them inside c(). For instance the first and the fifth row only:

first10[c(1,5),]
##   reporter flow partner year     value
## 1      213    I     186 2008 156130202
## 5      213    I     181 2008   2407260

In fact, when referring to the data set like this, before the comma is the specification for the row(s) and after - for the column(s).

For example this gives you the first column:

first10[,1]
##  [1] 213 213 213 213 213 213 213 213 614 213

We can get the first and third column (reporter and partner columns respectively) by index:

first10[,c(1,3)]
##    reporter partner
## 1       213     186
## 2       213     174
## 3       213     172
## 4       213     134
## 5       213     181
## 6       213     182
## 7       213     676
## 8       213     258
## 9       614     110
## 10      213     311

Since the columns have names, we can replace the indices with the column/variable names:

first10[,c("reporter","partner")]
##    reporter partner
## 1       213     186
## 2       213     174
## 3       213     172
## 4       213     134
## 5       213     181
## 6       213     182
## 7       213     676
## 8       213     258
## 9       614     110
## 10      213     311

There is a third way to select a column from a data set: by specifying the name of the data set and including a $ before the name of the column. For example, in this case our data set is called first10 and we want the column named “flow”:

first10$flow
##  [1] I E E E I I I E I I
## Levels: E I

To summarize these three ways are all equivalent in getting the reporter column:

first10[,1]
##  [1] 213 213 213 213 213 213 213 213 614 213
first10[,"reporter"]
##  [1] 213 213 213 213 213 213 213 213 614 213
first10$reporter
##  [1] 213 213 213 213 213 213 213 213 614 213

The usage of either of these depends on personal preference, but they are all equivalent. In cases where variable names are helpful to be shown in the code, the second and third way are preferred.

If we had a column/variable name from a CSV file with a space in it, we have to include the quotation marks:

first10$'reporter code' 

These two lines are equivalent:

first10$'flow'
##  [1] I E E E I I I E I I
## Levels: E I
first10$"flow"
##  [1] I E E E I I I E I I
## Levels: E I

We can also check the class for a column:

class(first10$flow)
## [1] "factor"

In this case “flow” is a factor. A factor is a variable in R with limited amount of different values, referred to as “levels” in R. A factor is also known as a categorical variable.

Let’s combine the knowledge about selecting rows and columns. We can select just the first five reporters:

first10[1:5,"reporter"]
## [1] 213 213 213 213 213

Or the first 5 reporters and partners, the two codes are equivalent:

first10[1:5,c(1,3)]
##   reporter partner
## 1      213     186
## 2      213     174
## 3      213     172
## 4      213     134
## 5      213     181
first10[1:5,c("reporter", "partner")]
##   reporter partner
## 1      213     186
## 2      213     174
## 3      213     172
## 4      213     134
## 5      213     181

Quiz

Let’s do a quick practice! Remember to keep R open so you can run the possible answers and see which one is correct.

Which answer:
- creates a data set called mydata with the first 20 rows selected from the trade_data
- and then creates an object called element which chooses the 5th row element from the 2nd column?

(Hint: Check how we got the first10 data and how to select elements. Hint 2: Remember, when selecting elements from a data frame - row comes before column, hence if we want row a and column b, this would be: data[a,b])

A. mydata <- head(trade_data,20)
    element <- mydata[2,5]

#Answer A is incorrect because we choose our element from the second row `mydata[5,2]`, fifth column, instead of fifth row, second column `mydata[2,5]`. Hence the correct is Answer B. 

B. mydata <- head(trade_data,20)
    element <- mydata[5,2]

#Answer B is correct! 

C. mydata <- head(trade_data)
    element <- mydata[7,2]

#Answer C is incorrect because we don't specify the first 20 rows to choose in the head() command. Just running the head function without specifying 20 rows, will only give us six rows.  Hence the correct is Answer B. 

2.3 Logical operators

Logical operators are used in programming to check whether statements are TRUE or FALSE. They will be very helpful when we expand on data subsetting in the next section.

If we want to check if a value is equal to another value, we use ==. Let’s check if the value of “a” (which we just declared above) is 34:

a <- 34 #Just in case you cleared your workspace 
a == 34
## [1] TRUE

If we check for 35, it gives FALSE.

a == 35 
## [1] FALSE

We can also check if a value is not equal with !=:

a != 34
## [1] FALSE

Notice how for 34, we get FALSE as the value of a is 34. For 35, we get TRUE because our value is different from 35.

a != 35
## [1] TRUE

Other logical operators are >: greater than, <: smaller than, >=: greater or equal and <=: smaller or equal.

a < 34
## [1] FALSE
a <= 34
## [1] TRUE

We can use logical operators also with data:

one2five <- 1:5
three2seven <- 3:7

We can now check if the number one is in the sequence:

one2five == 1
## [1]  TRUE FALSE FALSE FALSE FALSE
three2seven == 1
## [1] FALSE FALSE FALSE FALSE FALSE

If we use the equal logical operator to compare the two sequences we created, it gives us FALSE. This is because it compares row by row from the first to the last number.

one2five==three2seven
## [1] FALSE FALSE FALSE FALSE FALSE

Fortunately, there is another way to check if any of the numbers in one2five are in three2seven. We use the %in% operator.

one2five %in% three2seven
## [1] FALSE FALSE  TRUE  TRUE  TRUE

Logical operators are very helpful when subsetting data. For example, let’s select only the row where the reporter value is 614 from our previous data first10. We do so by specifying the data set from which we want to subset and include square brackets, in this case first10[]. Then we specify the condition and put a comma. This means that it applies to all columns. If we wanted the value of a particular column, we could specify it after the comma.

first10[first10$reporter==614,]
##   reporter flow partner year       value
## 9      614    I     110 2008 15302709034

Just by itself the chunk of the code evaluates logically whether the statement is true for values only for the column where the condition is specified. In this case for the reporter column.

first10$reporter==614
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

We can also select only the rows with exports (flow==“E”).

first10[first10$flow=="E",]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 3      213    E     172 2008   31986252
## 4      213    E     134 2008 1507966544
## 8      213    E     258 2008  107246580

We can have the criteria above and also select rows with value > 100,000,000 by using &:

first10[first10$flow=="E"&
          first10$value>100000000,]
## Warning in Ops.factor(first10$value, 1e+08): '>' not meaningful for factors
##      reporter flow partner year value
## NA         NA <NA>      NA   NA  <NA>
## NA.1       NA <NA>      NA   NA  <NA>
## NA.2       NA <NA>      NA   NA  <NA>
## NA.3       NA <NA>      NA   NA  <NA>

The & operator stands for “and”.

Make sure to include the & operator after the first line of code, otherwise R does not know that the next line is part of the same expression.

This give us an error because column “value” is factor.

class(first10$value)
## [1] "factor"

We can change the value column from being factor to numeric like this:

first10$value <- as.numeric(as.character(first10$value))

When converting from class factor to numeric, we first need to convert to the character class first then to numeric, otherwise it will take the index in the factor dictionary rather than the actual value. We convert to the character class by using the as.character command. Then to convert the values to numeric - we use as.numeric.

Now the class is numeric:

class(first10$value)
## [1] "numeric"

So we can subset by value with the > operator:

first10[first10$flow=="E"&
          first10$value>100000000,]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544
## 8      213    E     258 2008  107246580

We can also subset the exports value for more than 100 million for partner 174:

first10[first10$flow=="E"&
          first10$value>100000000&
          first10$partner==174,]
##   reporter flow partner year     value
## 2      213    E     174 2008 117661208

Same as above, but partner is 174 or 134:

first10[first10$flow=="E"&
          first10$value>100000000&
          first10$partner %in% c(174,134),]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544

Select any flows of any value of partner 174 or 134, where “|” is the “or” operator:

first10[first10$partner==174|first10$partner==134,]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544

The “or” (|) operator means we select both values if they exist.

In fact, these two lines of code are equivalent in this case:

first10[first10$partner %in% c(174,134),]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544
first10[first10$partner==174|first10$partner==134,]
##   reporter flow partner year      value
## 2      213    E     174 2008  117661208
## 4      213    E     134 2008 1507966544

Quiz

Okay, that was a lot. This is a very important part, so let’s try to practice it!

Which answer correctly selects all the conditions listed below correctly:
- import “I” flow
- trade values larger than (>) a 1000
- select both partners with codes 186 and 181 ?

(Hint: We just did something similar, look at the code!)

A. first10[first10$flow=="I"&
                    first10$value>1000&
                    first10$partner %in% c(186,181),]

#Answer A is correct. Good job! :). 

B. first10[first10$partner==186|first10$partner==181,]

#Answer B is incorrect because it only selects the partners correctly (186 and 181), but does not select the import ("I") flow and the trade value to be higher than 1000. Hence, answer A is the correct answer. 

This concludes Section 2! From now on we will work with different data sets and in order to have more space on R, you can always run the rm command in order to delete the data sets from your environment in R that are not in use. In this case we delete the first10 data set:

rm(first10)