Expected time (2 - 3 hours)

Topics Expected Time
1. Lesson Overview and Helpful hint of the week (10min)
2a. Operational Ordering Video (18min)
2b. Operational Ordering (25min)
3a. Tidy Data Video (19min)
3b. Tidy Data, Tidying up Unitidy Data (35min)
4. Weekly Homework (30min)

Overview

Concepts covered this week

This lesson will introduce the Tidyverse by explaining its purpose, its makeup and a brief overview of its data philosophy. We will then cover two broad topics relating to the Tidyverse, namely operational ordering and tidy data.

Operational ordering

The first topic will explain how the tidyverse breaks up complex operations into simple, ordered steps that allow for easier collaboration and easier revision of your own work. Here we introduce a couple of key verbs used to manipulate our data and end off by showing some useful tools for exploring your data. Below I provide an English sentence that is the equivalent of normal R code and then show what the Tidyverse equivalent is:

“I would like to know what the average student enrolled in Finance Honours, that has also attempted other programming courses, thinks of their R course (excluding students repeating the course).”

Notice how the commas and brackets act as placeholders. Our ability to order complex sentences like this into something meaningful is often taken for granted, but I assume it is more difficult for people learning English. Now let's look at the Tidyverse equivalent:

“Find students enrolled in Finance honours and then exclude repeat students and then only include students that have also attempted other programming courses and then ask them what they think of the R course and then calculate the average.”

The Tidyverse version is a terrible sentence. It becomes boring with its repetition and simplicity, but it is far easier to understand and follow. It is also far easier to edit or manipulate, and it clarifies the order of operations being executed by R.

Tidy data and tidying up

The second Tidyverse topic this week is the concept of Tidy data. This idea is fundamental to creating collaborative and systematic workflows and is based on the idea that if there is one common data format as an input, we can reduce the number of necessary functions and simplify the language we use to manipulate data down to a small set of commonly used verbs.

In this section we begin by working through the examples given in Hadley Wickham's paper "Tidy Data" and then progress to some basic functions used in tidying up untidy datasets.

Helpful Hint of The Week: Cheatsheets! What they are and where to find them

Cheatsheets are like the crib notes a student would want to smuggle into an R exam. They are filled with helpful hints, guides and commands for specific aspects of R and RStudio, and often include infographics and easy-to-follow steps.

There are hundreds of extremely useful cheatsheets available to help you remember all of the technical details of a function or package that you're using. They are usually provided as a jam-packed two-page PDF file and are free to download.

Here is a link to a free-to-download cheatsheet repository provided by RStudio. I recommend you download the "RStudio IDE Cheatsheet" for learning more about the RStudio interface and the "Data Transformation Cheatsheet" for learning "dplyr", which we'll cover in this lesson.

Other cheatsheets which will definitely be useful in the future include the "Data Visualization Cheatsheet" for ggplot2 (the greatest and simplest plotting package ever! maybe the single best thing about R), which will get its own lesson, the "Dates and Times Cheatsheet", "Apply Functions Cheatsheet", "Data Import Cheatsheet" and the "R Markdown Cheatsheet", which will help you with your project.

I still refer to these cheatsheets all the time, because the point isn't to memorise functions, it's to know where to find the info you need when you need it.


=============== Operational Ordering ===============


Time stamped video on Loom (18min)

Note that the first example used in this script is duplicated in the Operational Ordering video. If you find the script confusing, please watch the video; if you found the video confusing, see whether the step-by-step approach shown here helps.

The tidyverse is a collection of packages for data science that all conform to the same data philosophy and grammar. That philosophy can be summarised as a consistent data layout (tidy data), manipulated and operated on in an intuitive, readable and generalised manner.

Currently the Tidyverse consists of the following packages:

  • ggplot2: A graphing tool based on the Grammar of Graphics.
  • dplyr: A data manipulation package that relies on a small set of verbs.
  • tidyr: Provides functions to convert datasets to the tidy data format.
  • readr: A fast solution for reading in (importing) rectangular data.
  • purrr: An expansion of the functional programming toolkit.
  • tibble: A reimagined dataframe.
  • stringr: Allows simple handling of string (character) data.
  • forcats: A simplified framework for handling categorical data.

Installing the tidyverse and loading it with library() is the equivalent of installing all of the above packages and loading each of them from your library. (Remember to uncomment the install.packages("tidyverse") line the first time you run this on a machine.)

# install.packages("tidyverse")
library("tidyverse")

One of the great additions that the tidyverse provides is the pipe operator “%>%”. This operator originally comes from the “magrittr” package and can be thought of as an “and then” conjunction.

The keyboard shortcut for “%>%” is “ctrl” + “shift” + “M”. Try it out.

Its purpose is to rearrange the procedural order from inside out and right to left to the more intuitive left to right and top to bottom.
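As a tiny illustration (using a made-up vector, nothing to do with the data below), compare the nested base R call with the piped version. Both take the square root of the sum of the vector:

sqrt(sum(c(1, 4, 9)))         # read inside out: sum first, and then square root
[1] 3.741657

c(1, 4, 9) %>%                # take the vector and then
  sum() %>%                   # add it up and then
  sqrt()                      # take the square root
[1] 3.741657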

Let’s bring up some data from the built in databases (Note the capital “O” in Orange):

orange_df <- datasets::Orange
orange_df
   Tree  age circumference
1     1  118            30
2     1  484            58
3     1  664            87
4     1 1004           115
5     1 1231           120
6     1 1372           142
7     1 1582           145
8     2  118            33
9     2  484            69
10    2  664           111
11    2 1004           156
12    2 1231           172
13    2 1372           203
14    2 1582           203
15    3  118            30
16    3  484            51
17    3  664            75
18    3 1004           108
19    3 1231           115
20    3 1372           139
21    3 1582           140
22    4  118            32
23    4  484            62
24    4  664           112
25    4 1004           167
26    4 1231           179
27    4 1372           209
28    4 1582           214
29    5  118            30
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177


To find out more about the dataset, type “?datasets::Orange”.

If we want to calculate the average circumference of all trees aged 118 days we would normally type the following command:

answer <- mean(orange_df$circumference[orange_df$age == 118])

This would read as follows: “Calculate the average of the orange_df circumference variable, for observations corresponding to the age variable equalling 118 days.”

This is hard to read and, as we will see, the operations run inside out and right to left, which makes collaborating on or revising your work tricky. To understand the order in which R executes this instruction we can break it into 3 steps.

STEP 1

age_118 <- orange_df$age == 118
age_118
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[16] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[31] FALSE FALSE FALSE FALSE FALSE

This produces a TRUE/FALSE vector. We can see this next to the original dataframe by using “cbind” to combine columns.

cbind(age_118, orange_df)
   age_118 Tree  age circumference
1     TRUE    1  118            30
2    FALSE    1  484            58
3    FALSE    1  664            87
4    FALSE    1 1004           115
5    FALSE    1 1231           120
6    FALSE    1 1372           142
7    FALSE    1 1582           145
8     TRUE    2  118            33
9    FALSE    2  484            69
10   FALSE    2  664           111
11   FALSE    2 1004           156
12   FALSE    2 1231           172
13   FALSE    2 1372           203
14   FALSE    2 1582           203
15    TRUE    3  118            30
16   FALSE    3  484            51
17   FALSE    3  664            75
18   FALSE    3 1004           108
19   FALSE    3 1231           115
20   FALSE    3 1372           139
21   FALSE    3 1582           140
22    TRUE    4  118            32
23   FALSE    4  484            62
24   FALSE    4  664           112
25   FALSE    4 1004           167
26   FALSE    4 1231           179
27   FALSE    4 1372           209
28   FALSE    4 1582           214
29    TRUE    5  118            30
30   FALSE    5  484            49
31   FALSE    5  664            81
32   FALSE    5 1004           125
33   FALSE    5 1231           142
34   FALSE    5 1372           174
35   FALSE    5 1582           177


STEP 2

Next we retrieve the circumference of the trees for all observations that correspond to age_118 being TRUE.

circumference_118 <- orange_df$circumference[age_118]
circumference_118
[1] 30 33 30 32 30

You can double check that this is correct.
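One quick way to do that check (a small base R sketch) is to pull out the age-118 rows of the original dataframe and compare their circumference column to circumference_118:

orange_df[orange_df$age == 118, ]
   Tree age circumference
1     1 118            30
8     2 118            33
15    3 118            30
22    4 118            32
29    5 118            30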

STEP 3

Calculate the average using the function “mean()”.

mean(circumference_118)
[1] 31

The tidyverse is designed to allow us to write the procedures in the order of these steps. Where:

mean(orange_df$circumference[orange_df$age == 118])
[1] 31

Is read as Step 3, Step 2, Step 1, the tidyverse allows us to write the steps in order.

orange_df %>%     # (1) Look at the object orange_df and then  (%>%)
  filter(age == 118) %>%      # (2) Filter the rows where the age column is equal to 118 and then
  select(circumference) %>%   # (3) Select the column called circumference and then
  unlist() %>%                # (4) unlist the data (this is a simple structure transformation) and then
  mean()                      # (5) calculate the mean.
[1] 31


The tidy method gives us the same answer we got before but the code is much more readable and intuitive.

But what if we wanted to calculate the average of all the trees' circumferences at each age?

Using base R we could use the function "aggregate()". "aggregate" uses a formula to separate what we want to summarise (in this case circumference) from what we want to group by (in this case age).

The input structure is as follows: aggregate(formula, dataframe, function)

We want to know the mean circumference after grouping by age, so the formula would be "circumference ~ age", the dataframe is "orange_df" and the function we want to use is "mean".

aggregate(circumference ~ age, orange_df, mean)
   age circumference
1  118          31.0
2  484          57.8
3  664          93.2
4 1004         134.2
5 1231         145.6
6 1372         173.4
7 1582         175.8

The above function has calculated the average circumference for each age.

Using the tidyverse we would employ two functions called "group_by" and "summarise", and the operation would read as follows:

  1. Use the dataframe orange_df and then
  2. group the data by age and then
  3. calculate the mean for each group

The great thing about the tidyverse is that the function names almost always describe exactly what they do.

orange_df %>%                      # (1) Use the dataframe orange_df and then
  group_by(age) %>%                # (2) group the data by age and then
  summarise(mean(circumference))   # (3) calculate the mean for each group
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 7 x 2
    age `mean(circumference)`
  <dbl>                 <dbl>
1   118                  31  
2   484                  57.8
3   664                  93.2
4  1004                 134. 
5  1231                 146. 
6  1372                 173. 
7  1582                 176. 

As your operations become more complex, readability becomes more and more important.

Note that both methods are efficient and produce the same results. Tidyverse solutions tend to follow more consistent logic than the hundreds of methods we could employ to solve similar problems.

Additionally, the Tidyverse packages are all designed to read data of the same basic structure and then output data in a consistent manner. The data we input into our Tidyverse functions are called tidy data.

The final section of this week’s lesson will introduce Tidy data, how to get your data tidy and why this will make your life easier.


========== Intro to Tidy Data and how to Tidy Up ===========


Time stamped video on Loom (19min)


The “Tidy Data” video will repeat the first example provided here while explaining the concept of tidy data.

Let's start by clearing our environment so we can clearly see each object we create. If you would also like to clear your console, remember you can press "ctrl" plus "l".

rm(list = ls())

The Tidy data philosophy comes from Hadley Wickham's paper linked here.

I highly recommend you read the paper and then attempt the exercises shown here or go through it in conjunction with this section. The paper is a very easy read and I will use the examples, headings and structure in that paper in this section of the lesson.

Tidy Data has 3 rules:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Let’s elaborate on these terms:

Identifiers such as tree numbers or share names all count as variables. Additionally, a time period identifier such as a time stamp, year or day of the week is a variable. This means identifiers should be in their own column and not be column names.

Observations are descriptions of a measurement of interest.

For example, consider looking at two shares' prices over two days. Each measurement of each share's price forms an observation, giving us four observations in total. Each observation is described by a date column, a share code column and a price column. The date, the share code and the price collectively make up an observation.
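To make this concrete, here is a small sketch of what that tidy share data could look like (the share codes and prices are made up purely for illustration):

share_prices <- data.frame(date  = c("Day 1", "Day 1", "Day 2", "Day 2"),
                            code  = c("ABC", "XYZ", "ABC", "XYZ"),
                            price = c(100.5, 50.2, 101.0, 49.8))
share_prices
   date code price
1 Day 1  ABC 100.5
2 Day 1  XYZ  50.2
3 Day 2  ABC 101.0
4 Day 2  XYZ  49.8

Each of the four rows is one observation, and each of the three columns is one variable.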

We will go through some common examples of untidy data and show why each is untidy using the three rules mentioned above. Note that the examples will by no means be exhaustive, because while there may be only one way for data to be tidy, there are infinitely many ways for it to be untidy.

One easy way to spot untidy data is to check whether observations are coordinates rather than rows

Example 1 (Table 1 in Wickham’s paper)

treatment_a <- c(NA, 16, 3)
treatment_b <- c(2, 11, 1)
df_1 <- data.frame(treatment_a, treatment_b)
rownames(df_1) <- c("John Smith", "Jane Doe", "Mary Johnson") # Names the rows of an object
df_1
             treatment_a treatment_b
John Smith            NA           2
Jane Doe              16          11
Mary Johnson           3           1

df_1 is a very common data format, particularly in spreadsheet software like Excel. The data is efficiently organised without repetition and easy to read. But let's assess whether it satisfies the three tidy rules.

According to the first rule each variable should form a column.

This is not satisfied, for two reasons. The first rule is violated because the identifiers (the person names) are not their own column but rather row names. Names or identifiers should always be their own column. Also, the treatment used should be a variable and not a heading, as it is part of the information describing each observation.

According to the second rule each observation should form a row.

The second rule is violated because each observation is a coordinate and not a row. By coordinate, I mean that the observation is contained in a combination of a row number and a column number instead of just a single row.

This will become clearer when we look at the tidy version of this data.

Example 2 (Table 2 in Wickham’s paper)

Table 2 displays the same data with the person names as column names and the treatments as row names.

To recreate this we can simply transpose df_1. "t()" is R's transpose function.

df_2 <- t(df_1)

Notice that because transpose is a matrix operation, R coerced the dataframe into a matrix.

is.data.frame(df_2)
[1] FALSE
is.matrix(df_2)
[1] TRUE

This is also a good time to observe how matrices and dataframes are displayed in our global environment. Dataframes are described by their number of observations and their number of variables. Matrices are displayed by their dimensions as a simple data structure: the first dimension is of length 2, the second dimension is of length 3. Or more simply, 2 rows and 3 columns.
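If you would rather confirm this from the console than from the environment pane, a quick check (just a sketch) is:

dim(df_2)      # dimensions: 2 rows and 3 columns
[1] 2 3
class(df_2)    # recent versions of R report a matrix as both "matrix" and "array"
[1] "matrix" "array"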

Let's coerce df_2 back into a dataframe and double check that none of the data was lost.

df_2 <- as.data.frame(df_2)
df_2
            John Smith Jane Doe Mary Johnson
treatment_a         NA       16            3
treatment_b          2       11            1

One way of thinking about why tidy data is important is trying to decide which of df_1 and df_2 is better. It's hard to justify one over the other and the choice seems arbitrary.

Both are untidy because they have one variable stored in row names and another variable stored in column names. Each observation is determined by a row/column coordinate, but we know each observation should be in its own row.

Additionally, without an overall heading, we don't actually know what the data inside the table means. What are 1, 2, 3, 11 and 16 measurements of?

Let's look at the tidy version of df_1 and df_2 by inputting it manually (Table 3). Note that we can name the columns of our dataframes while we input them. Also note that we can put each variable on a separate line to make the code easier to read.

df_tidy <- data.frame(name = c("John Smith", "Jane Doe", "Mary Johnson", "John Smith", "Jane Doe", "Mary Johnson"),
                      trt = c("a", "a", "a", "b", "b", "b"),
                      result = c(NA, 16, 3 , 2 , 11, 1))
df_tidy
          name trt result
1   John Smith   a     NA
2     Jane Doe   a     16
3 Mary Johnson   a      3
4   John Smith   b      2
5     Jane Doe   b     11
6 Mary Johnson   b      1

Here we would have the opportunity to describe the data better if we knew what the “result” was. With better information we could, as an example, call it “recovery time (in days)”.
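As a sketch of how that renaming might be done (using dplyr's rename(), a function we haven't formally covered yet, with the pattern new_name = old_name):

df_tidy %>% 
  rename(recovery_time_days = result)

This simply prints the renamed dataframe; assign the result if you want to keep it.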

Notice that the tidy version of the dataframe looks less efficient. Where the untidy versions occupied 3 rows and 2 columns (or 2 rows and 3 columns), the tidy version occupies 6 rows and 3 columns. Importantly, however, we have labels for each variable and there is only one version of this tidy data. Alternatively, we could have one person organise their data like df_1 and another organise it like df_2, without any good reason why one is better than the other. That would lead to two different people coming up with individual solutions for manipulating and analysing their data, whereas with tidy data there would be only one.

For a discussion on the identification of variables and observations, please read page 4, paragraph 2 of the "Tidy Data" paper linked to above.

Five common problems with messy data and their solutions

Extracted from page 5, Section 3 ("Tidying messy datasets").

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

Let’s go through a couple of these and tidy them up!

(1) Column headers are values, not variable names.

We will start by recreating the table given in the paper.

Note that I provide the names separately from the data here, whereas I did it all at once when creating df_tidy. This is because these names are irregular: they contain symbols and numbers that data.frame() would otherwise convert into syntactically valid (and much uglier) names. The names() function is more lenient in this regard.
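As an aside, and purely as an alternative sketch (not the approach used below), data.frame() can be told to keep irregular names as they are by setting its check.names argument to FALSE and wrapping the names in backticks:

data.frame(religion = c("Agnostic", "Buddhist"),
           `<$10k`  = c(27, 27),
           check.names = FALSE)    # stop R from converting "<$10k" into a syntactic name
  religion <$10k
1 Agnostic    27
2 Buddhist    27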

df_4 <- data.frame(
  c("Agnostic", "Athiest", "Buddhist", "Catholic", "Don't know/refused", "Evangelical Prot", "Hindu", "Historically Black Prot", "Jehovah's Witness", "Jewish"), 
  c(27, 12, 27, 418, 15, 575, 1, 228, 20, 19), 
  c(34, 27, 21, 617, 14, 869, 9, 244, 27, 19), 
  c(60, 37, 30, 732, 15, 1064, 7, 236, 24, 25), 
  c(81, 52, 34, 670, 11, 982, 9, 238, 24, 25), 
  c(76, 35, 33, 638, 10, 881, 11, 197, 21, 30), 
  c(137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95)
)

names(df_4) <- c("religion", "<$10k", "$10-20k", "$20-30k", "$30-40k", "$40-50k", "$50-75k")
df_4
                  religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
1                 Agnostic    27      34      60      81      76     137
2                  Athiest    12      27      37      52      35      70
3                 Buddhist    27      21      30      34      33      58
4                 Catholic   418     617     732     670     638    1116
5       Don't know/refused    15      14      15      11      10      35
6         Evangelical Prot   575     869    1064     982     881    1486
7                    Hindu     1       9       7       9      11      34
8  Historically Black Prot   228     244     236     238     197     223
9        Jehovah's Witness    20      27      24      24      21      30
10                  Jewish    19      19      25      25      30      95

The data used here comes from a Pew research poll on religion and income. The final three columns have been omitted as per the Wickham paper. The original research link provided in the paper seems to be down. This is what I think the data links to although the brackets seem to be further aggregated.

Notice that although it may be obvious that the numbers represent a count of the people that fit into a religious/income combination, this is not explicitly clear (it could have been average age…).

Let's look at the data, try to identify our variables and then identify our observations as units combining one measurement of each variable.

The first column seems to be a variable and it is already in its own column, so this doesn't need fixing. The next variable is the income bracket, which is stored in the column names of columns 2-7. This should be its own column. The last variable is the count of the sample population that fits into each religion/income bracket coordinate.

Let's see how the Tidyverse can help us tidy this data up.

We are going to be using a function called “gather()” to gather up the variable stored in column names and make it a variable column.

We can use the help function first to see what inputs gather requires.

?gather

It seems that this function requires three main inputs: data, key, and value as well as information regarding which columns it is gathering (the default will gather all).

We can specify that we are looking at "df_4" by starting with "df_4 %>% gather()", so we won't need to repeat the dataframe name. Key asks what to call the variable made from the old column names. Value asks what to call the variable made from the old coordinates (the cell values). Finally, we specify 2:7 because we only want to gather columns 2 to 7; column 1 (religion) is already a tidy variable.


df_4_tidy <- df_4 %>% 
  gather(key = "Income Bracket",
         value = "Number of people",
         2:7)
df_4_tidy
                  religion Income Bracket Number of people
1                 Agnostic          <$10k               27
2                  Athiest          <$10k               12
3                 Buddhist          <$10k               27
4                 Catholic          <$10k              418
5       Don't know/refused          <$10k               15
6         Evangelical Prot          <$10k              575
7                    Hindu          <$10k                1
8  Historically Black Prot          <$10k              228
9        Jehovah's Witness          <$10k               20
10                  Jewish          <$10k               19
11                Agnostic        $10-20k               34
12                 Athiest        $10-20k               27
13                Buddhist        $10-20k               21
14                Catholic        $10-20k              617
15      Don't know/refused        $10-20k               14
16        Evangelical Prot        $10-20k              869
17                   Hindu        $10-20k                9
18 Historically Black Prot        $10-20k              244
19       Jehovah's Witness        $10-20k               27
20                  Jewish        $10-20k               19
21                Agnostic        $20-30k               60
22                 Athiest        $20-30k               37
23                Buddhist        $20-30k               30
24                Catholic        $20-30k              732
25      Don't know/refused        $20-30k               15
26        Evangelical Prot        $20-30k             1064
27                   Hindu        $20-30k                7
28 Historically Black Prot        $20-30k              236
29       Jehovah's Witness        $20-30k               24
30                  Jewish        $20-30k               25
31                Agnostic        $30-40k               81
32                 Athiest        $30-40k               52
33                Buddhist        $30-40k               34
34                Catholic        $30-40k              670
35      Don't know/refused        $30-40k               11
36        Evangelical Prot        $30-40k              982
37                   Hindu        $30-40k                9
38 Historically Black Prot        $30-40k              238
39       Jehovah's Witness        $30-40k               24
40                  Jewish        $30-40k               25
41                Agnostic        $40-50k               76
42                 Athiest        $40-50k               35
43                Buddhist        $40-50k               33
44                Catholic        $40-50k              638
45      Don't know/refused        $40-50k               10
46        Evangelical Prot        $40-50k              881
47                   Hindu        $40-50k               11
48 Historically Black Prot        $40-50k              197
49       Jehovah's Witness        $40-50k               21
50                  Jewish        $40-50k               30
51                Agnostic        $50-75k              137
52                 Athiest        $50-75k               70
53                Buddhist        $50-75k               58
54                Catholic        $50-75k             1116
55      Don't know/refused        $50-75k               35
56        Evangelical Prot        $50-75k             1486
57                   Hindu        $50-75k               34
58 Historically Black Prot        $50-75k              223
59       Jehovah's Witness        $50-75k               30
60                  Jewish        $50-75k               95


Note that placing each argument of the function on a separate line is completely optional, but I find it makes the code easier to read. Also, we could have used "-1" (all columns except the first) instead of "2:7". Try it out!
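Here is what that alternative looks like (a sketch; the result should be identical to df_4_tidy):

df_4_tidy_alt <- df_4 %>% 
  gather(key = "Income Bracket",
         value = "Number of people",
         -1)                         # every column except the first
identical(df_4_tidy, df_4_tidy_alt)
[1] TRUE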

The opposite of "gather" is "spread". We can use this to reverse the process we used to make df_4_tidy. Here the key asks for the column that you would like to turn back into column names, and the value asks for the column to spread out into the coordinates.

df_4_untidy <- df_4_tidy %>% 
  spread(key = "Income Bracket", value = "Number of people")

This is quite a tricky concept to grasp straight away so play around with it and keep in mind you will have to revisit this concept often to remind yourself how it works (I did anyway).

df_4_untidy should now be back in its original format and should be identical to df_4. This can be useful if you would like to perform analysis or transformations on data and then return it to its more readable form later.

identical(df_4, df_4_untidy)
[1] FALSE

Note that identical returns FALSE only because the gather and spread operations have reordered the variables, placing "<$10k" as the 7th variable instead of the 2nd. We can fix this manually by simply instructing R to place "<$10k" in the 2nd column using the "select()" function.

df_4_untidy <- df_4_untidy %>% 
  select(1, 7, everything())

Select is usually used to select certain columns from your dataframe; however, it also reorders your dataframe according to the order in which you select the variables. There are much neater solutions than the one provided above, but we are not yet ready to introduce them.

For now, let's just note why the above solution is bad.

  1. It is an adjustment to the original rearrangement of columns instead of stopping the rearrangement from happening.
  2. It is a non-general solution, meaning that if the columns weren't in this specific order the solution wouldn't work. We will get into generalising solutions when we introduce custom functions.
  3. It is self-referencing, which means it relies on a previous version of itself and changes itself in the process. This is dangerous because if you accidentally run the operation twice, you will mess up your ordering. Run it again and see for yourself!

(1) Column headers are values, (2) multiple variables in one column & (3) variables stored in both rows and columns.

Last, we will look at a particularly tough dataset.

This data will have multiple variables stored as column names and require a few steps to clean up. We are replicating Table 9 from the paper which has data from the World Health Organisation on Tuberculosis cases sorted by country, year and demographic group.

df_messiest <- data.frame(
  country = c("AD", "AE", "AF", "AG", "AL", "AM", "AN", "AO", "AR", "AS"),
  year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000),
  m014 = c(0, 2, 52, 0, 2, 2, 0, 186, 97, NA),
  m1524 = c(0, 4, 228, 0, 19, 152, 0, 999, 278, NA),
  m2534 = c(1, 4, 183, 0, 21, 130, 1, 1003, 594, NA),
  m3544 = c(0, 6, 149, 0, 14, 131, 2, 912, 402, NA),
  m4554 = c(0, 5, 129, 0, 24, 63, 0, 482, 419, 1),
  m5564 = c(0, 12, 94, 0, 19, 26, 0, 312, 368, 1),
  m65 = c(0, 10, 80, 1, 16, 21, 0, 194, 330, NA),
  mu = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  f014 = c(NA, 3, 93, 1, 3, 1, 0, 247, 121, NA)
  )
df_messiest
   country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
1       AD 2000    0     0     1     0     0     0   0 NA   NA
2       AE 2000    2     4     4     6     5    12  10 NA    3
3       AF 2000   52   228   183   149   129    94  80 NA   93
4       AG 2000    0     0     0     0     0     0   1 NA    1
5       AL 2000    2    19    21    14    24    19  16 NA    3
6       AM 2000    2   152   130   131    63    26  21 NA    1
7       AN 2000    0     0     1     2     0     0   0 NA    0
8       AO 2000  186   999  1003   912   482   312 194 NA  247
9       AR 2000   97   278   594   402   419   368 330 NA  121
10      AS 2000   NA    NA    NA    NA     1     1  NA NA   NA

This is clearly a very messy and hard to read table. Here the column names (excluding "country" and "year") hold multiple pieces of information: m014 means males between 0 and 14 years old. Likewise, f stands for female and u stands for unknown age. For simplicity, the dataset is cut off and does not show the rest of the data for females.

The untidiness of this dataset is complex.

  • First we should gather the demographic column names (m014, m1524, …) into a single variable.
  • Second we should separate that variable into sex and age group.


df_less_messy <- df_messiest %>% 
  gather(key = "sex_age", value = "cases", 3:11)
df_less_messy
   country year sex_age cases
1       AD 2000    m014     0
2       AE 2000    m014     2
3       AF 2000    m014    52
4       AG 2000    m014     0
5       AL 2000    m014     2
6       AM 2000    m014     2
7       AN 2000    m014     0
8       AO 2000    m014   186
9       AR 2000    m014    97
10      AS 2000    m014    NA
11      AD 2000   m1524     0
12      AE 2000   m1524     4
13      AF 2000   m1524   228
14      AG 2000   m1524     0
15      AL 2000   m1524    19
16      AM 2000   m1524   152
17      AN 2000   m1524     0
18      AO 2000   m1524   999
19      AR 2000   m1524   278
20      AS 2000   m1524    NA
21      AD 2000   m2534     1
22      AE 2000   m2534     4
23      AF 2000   m2534   183
24      AG 2000   m2534     0
25      AL 2000   m2534    21
26      AM 2000   m2534   130
27      AN 2000   m2534     1
28      AO 2000   m2534  1003
29      AR 2000   m2534   594
30      AS 2000   m2534    NA
31      AD 2000   m3544     0
32      AE 2000   m3544     6
33      AF 2000   m3544   149
34      AG 2000   m3544     0
35      AL 2000   m3544    14
36      AM 2000   m3544   131
37      AN 2000   m3544     2
38      AO 2000   m3544   912
39      AR 2000   m3544   402
40      AS 2000   m3544    NA
41      AD 2000   m4554     0
42      AE 2000   m4554     5
43      AF 2000   m4554   129
44      AG 2000   m4554     0
45      AL 2000   m4554    24
46      AM 2000   m4554    63
47      AN 2000   m4554     0
48      AO 2000   m4554   482
49      AR 2000   m4554   419
50      AS 2000   m4554     1
51      AD 2000   m5564     0
52      AE 2000   m5564    12
53      AF 2000   m5564    94
54      AG 2000   m5564     0
55      AL 2000   m5564    19
56      AM 2000   m5564    26
57      AN 2000   m5564     0
58      AO 2000   m5564   312
59      AR 2000   m5564   368
60      AS 2000   m5564     1
61      AD 2000     m65     0
62      AE 2000     m65    10
63      AF 2000     m65    80
64      AG 2000     m65     1
65      AL 2000     m65    16
66      AM 2000     m65    21
67      AN 2000     m65     0
68      AO 2000     m65   194
69      AR 2000     m65   330
70      AS 2000     m65    NA
71      AD 2000      mu    NA
72      AE 2000      mu    NA
73      AF 2000      mu    NA
74      AG 2000      mu    NA
75      AL 2000      mu    NA
76      AM 2000      mu    NA
77      AN 2000      mu    NA
78      AO 2000      mu    NA
79      AR 2000      mu    NA
80      AS 2000      mu    NA
81      AD 2000    f014    NA
82      AE 2000    f014     3
83      AF 2000    f014    93
84      AG 2000    f014     1
85      AL 2000    f014     3
86      AM 2000    f014     1
87      AN 2000    f014     0
88      AO 2000    f014   247
89      AR 2000    f014   121
90      AS 2000    f014    NA


Now we need to separate sex and age brackets into their own columns. To do this we use a function called "separate()". Separate asks us which column to separate (sex_age), what the names of the new columns should be (sex and age), and finally where it should create the split.


df_tidied_up <- df_less_messy %>% 
  separate(sex_age, into = c("sex", "age"), sep = 1)
df_tidied_up
   country year sex  age cases
1       AD 2000   m  014     0
2       AE 2000   m  014     2
3       AF 2000   m  014    52
4       AG 2000   m  014     0
5       AL 2000   m  014     2
6       AM 2000   m  014     2
7       AN 2000   m  014     0
8       AO 2000   m  014   186
9       AR 2000   m  014    97
10      AS 2000   m  014    NA
11      AD 2000   m 1524     0
12      AE 2000   m 1524     4
13      AF 2000   m 1524   228
14      AG 2000   m 1524     0
15      AL 2000   m 1524    19
16      AM 2000   m 1524   152
17      AN 2000   m 1524     0
18      AO 2000   m 1524   999
19      AR 2000   m 1524   278
20      AS 2000   m 1524    NA
21      AD 2000   m 2534     1
22      AE 2000   m 2534     4
23      AF 2000   m 2534   183
24      AG 2000   m 2534     0
25      AL 2000   m 2534    21
26      AM 2000   m 2534   130
27      AN 2000   m 2534     1
28      AO 2000   m 2534  1003
29      AR 2000   m 2534   594
30      AS 2000   m 2534    NA
31      AD 2000   m 3544     0
32      AE 2000   m 3544     6
33      AF 2000   m 3544   149
34      AG 2000   m 3544     0
35      AL 2000   m 3544    14
36      AM 2000   m 3544   131
37      AN 2000   m 3544     2
38      AO 2000   m 3544   912
39      AR 2000   m 3544   402
40      AS 2000   m 3544    NA
41      AD 2000   m 4554     0
42      AE 2000   m 4554     5
43      AF 2000   m 4554   129
44      AG 2000   m 4554     0
45      AL 2000   m 4554    24
46      AM 2000   m 4554    63
47      AN 2000   m 4554     0
48      AO 2000   m 4554   482
49      AR 2000   m 4554   419
50      AS 2000   m 4554     1
51      AD 2000   m 5564     0
52      AE 2000   m 5564    12
53      AF 2000   m 5564    94
54      AG 2000   m 5564     0
55      AL 2000   m 5564    19
56      AM 2000   m 5564    26
57      AN 2000   m 5564     0
58      AO 2000   m 5564   312
59      AR 2000   m 5564   368
60      AS 2000   m 5564     1
61      AD 2000   m   65     0
62      AE 2000   m   65    10
63      AF 2000   m   65    80
64      AG 2000   m   65     1
65      AL 2000   m   65    16
66      AM 2000   m   65    21
67      AN 2000   m   65     0
68      AO 2000   m   65   194
69      AR 2000   m   65   330
70      AS 2000   m   65    NA
71      AD 2000   m    u    NA
72      AE 2000   m    u    NA
73      AF 2000   m    u    NA
74      AG 2000   m    u    NA
75      AL 2000   m    u    NA
76      AM 2000   m    u    NA
77      AN 2000   m    u    NA
78      AO 2000   m    u    NA
79      AR 2000   m    u    NA
80      AS 2000   m    u    NA
81      AD 2000   f  014    NA
82      AE 2000   f  014     3
83      AF 2000   f  014    93
84      AG 2000   f  014     1
85      AL 2000   f  014     3
86      AM 2000   f  014     1
87      AN 2000   f  014     0
88      AO 2000   f  014   247
89      AR 2000   f  014   121
90      AS 2000   f  014    NA


Notice that the "sep" argument is very versatile: if your values contain a separator such as "-" or "/", you can pass it (in quotes) and the split will happen at the separator. Alternatively, if your data has no separator, as in our example, you can pass a numerical value to tell R at which position in the string it should split. Here we told it to split after the 1st character.
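For example (a quick sketch with made-up values), if the column had contained strings like "m-014" we could have split on the dash instead of on a position:

data.frame(sex_age = c("m-014", "f-014")) %>% 
  separate(sex_age, into = c("sex", "age"), sep = "-")
  sex age
1   m 014
2   f 014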

Finally our data is tidy!

Although there are additional operations that may be useful here such as expanding “m” and “f” into “male” and “female” and “2534” into “25-34”, we will save these for you to do at home.

Homework

  1. Take the tidy dataset “Orange” and transform it into a messy version. Make Tree numbers the rownames and age the column names.

  2. Using the tidy version of “Orange”, use ggplot to plot (use a scatter plot) the Tree circumference (Y) against the age (X) for all Trees while using Tree numbers to change the colour.

Hint: use "?ggplot" and "?geom_point" as well as "example(geom_point)" for help.

  3. Using the Hadley Wickham paper linked to earlier, create a dataframe "untidy_df" containing the information from Table 7. Using methods described in this lesson as well as some googling, convert your dataframe into a tidy version that looks like Table 8.

swirl homework

No Swirl homework this time

Operators and functions covered

Operators Covered Brief explanation
"%>%" (Tidyverse) Pipe operator. Can be thought of as "and then" and is used to write operations in the order they are executed. The shortcut key is "ctrl" + "shift" + "m"
"~" Used as a formula operator, e.g. y ~ mx + c. Commonly used in lm() and aggregate().

Functions Covered Brief explanation
"browseURL" Function that allows URL addresses to be accessed from R. Note, you need to put the URL in quotes.
"cbind and rbind" Functions that combine objects by columns (cbind) or by rows (rbind). Require that the pieces have matching dimensions
“filter” (Tidyverse) Used to subset rows (often conditionally)
“select” (Tidyverse) Used to select columns (variables)
“unlist” Simplifies an object from a list into a vector. This may coerce datatypes
"aggregate" Calculates a function on a dataframe's variable y after grouping according to the same dataframe's x variables. Form: aggregate(y ~ x1 + x2, dataframe, function to apply)
"group_by" (Tidyverse) Groups a dataframe according to a variable or variables. Often used with summarise()
"summarise" (Tidyverse) Performs a function on each grouping of a specified variable after grouping with group_by()
"rownames", "colnames" and "names" Access the name vector for rows, columns or lists/vectors respectively. Can be used to view the names or to replace them
"t" Matrix transpose. If applied to rectangular non-matrix objects it will coerce them into a matrix, imposing any datatype coercions necessary
"gather" (Tidyverse) Used when a variable is stored as column names. Gather creates a new variable from the column names (key) and a variable for the data previously stored under these column names (value)
“spread” (Tidyverse) The opposite of “gather”. Spread takes a variable (key) and spreads it out as column names while populating the columns with values from another column (value)
"separate" (Tidyverse) Takes a selected column and splits it into two columns. Can split according to position (e.g. after the 5th character) or according to a separator such as "-" or "/".
"sep" An argument that specifies a separator. Used when joining or separating

Packages Covered Brief explanation
"tidyverse" A group of packages including: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats
"ggplot2" A graphing tool based on the Grammar of Graphics
"dplyr" A data manipulation package that relies on a small set of verbs
“tidyr” Provides functions to convert datasets to tidy data format
“readr” Fast solution for reading in (importing) rectangular data
“purrr” Expansion of the functional programming toolkit
“tibble” A reimagined dataframe
“stringr” Allows simple handling of string (character) data
“forcats” Simplified framework for handling of categorical data