Topics | Expected Time |
---|---|
1. Lesson Overview and Helpful hint of the week | (10min) |
2a. Operational Ordering Video | (18min) |
2b. Operational Ordering | (25min) |
3a. Tidy Data Video | (19min) |
3b. Tidy Data, Tidying up Unitidy Data | (35min) |
4. Weekly Homework | (30min) |
This lesson will introduce the Tidyverse by explaining its purpose, its makeup and a brief overview of its data philosophy. We will then cover two broad topics relating to the Tidyverse namely, operational ordering and tidy data.
The first topic will explain how the tidyverse breaks up complex structures into simple, ordered steps that allow for easier collaboration and revision of your own work. Here we introduce a couple of key verbs that are used to manipulate our data and end off by showing some useful tools for exploring your data. Below I will provide an English sentence that is the equivalent of normal R code and then show what the Tidyverse equivalent is:
“I would like to know what the average student enrolled in Finance Honours, that has also attempted other programming courses, thinks of their R course (excluding students repeating the course).”
Notice how the commas and brackets act as placeholders. Our ability to order complex senetences like this into something meaningful is often taken for granted but I assume this is more difficult for people learning English. Now lets look at the Tidyverse equivalent:
“Find students enrolled in Finance honours and then exclude repeat students and then only include students that have also attempted other programming courses and then ask them what they think of the R course and then calculate the average.”
The Tidyverse version is a terrible sentence. It becomes boring with its repition and simplicity but it is far easier to understand and follow. It is also far easier to edit or manipulate and clarifies the order of operations being executed by R.
The second Tidyverse topic covered this week will cover the concept of Tidy data. This idea is fundamental to creating collaborative and systematic work flows and is based on the idea that if there is one data format as an input, we can reduce the number of necessary functions and simplify the language we use to manipulate data to a small number of common use verbs.
In this section we begin by working through the examples given in Hadley Wickhams’s paper “Tidy Data” and then progress onto some basic functions used in tidying up untidy datasets.
Cheatsheets are like crib notes that a student would want to smuggle into an R exam. They are filled with helpful hints, guides and commands to use for specific aspect of R and R studio and often include info graphics and easy to follow steps.
There are hundreds of extremely useful cheatsheets available that can be used to help you remember all of the technical details of some function or package that you’re using. They are usually provided as a jam packed two page pdf file and are free to download.
Here is a link to a free to download cheatsheet repository provided by R Studio. I recommend you download “RStudio IDE Cheatsheet” for learning more about the R Studio interface and the “Data Transformation Cheatsheet” for learning “dplyr” which we’ll cover in this lesson.
Other cheatsheet which will definitely be useful in the future include the “Data Visualization Cheatsheet” for ggplot2 (the greatest and simplist plotting ever! maybe the single best thing about R) which will get its own lesson, “Dates and Times Cheatsheet”, “Apply Functions Cheatsheet”, “Data Import Cheatsheet” and the “R Markdown Cheatsheet” which will help you with your project.
I still refer to these cheatsheets all the time because the point isn’t to memorise functions, its to know where to find the info you need when you need it.
Time stamped video on Loom (18min)
Note that the first example used in the script will be duplicated in the Operational Ordering video. If you find this confusing please watch the video and if you found the video confusing see if the step by step approach shown here helps.
The tidyverse is a collection of packages for data science that all conform to the same data philosophy and grammar. That can be summarised as a consistent data layout (tidy data), manipulated and operated on in an intuitive, readable and generalised manner.
Currently the Tidyverse consists of the following packages:
By installing the tidyverse and calling it from library, it is the equivalent of installing all of the above packages and calling them from your library. (Remember to unhash the “install.packages(”tidyverse“)” the first time you run this on a machine)
# install.packages("tidyverse")
library("tidyverse")
One of the great additions that the tidyverse provides is the pipe operator “%>%”. This operator originally comes from the “magrittr” package and can be thought of as an “and then” conjunction.
The keyboard shortcut for “%>%” is “ctrl” + “shift” + “M”. Try it out.
Its purpose is to rearrange the procedural order from inside out and right to left to the more intuitive left to right and top to bottom.
Let’s bring up some data from the built in databases (Note the capital “O” in Orange):
orange_df <- datasets::Orange
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
7 1 1582 145
8 2 118 33
9 2 484 69
10 2 664 111
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
15 3 118 30
16 3 484 51
17 3 664 75
18 3 1004 108
19 3 1231 115
20 3 1372 139
21 3 1582 140
22 4 118 32
23 4 484 62
24 4 664 112
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
29 5 118 30
30 5 484 49
31 5 664 81
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
To find out more about the dataset, type “?datasets::Orange”.
If we want to calculate the average circumference of all trees aged 118 days we would normally type the following command:
answer <- mean(orange_df$circumference[orange_df$age == 118])
This would read as follows: “Calculate the average of the orange_df circumference variable, for observations corresponding to the age variable equalling 118 days.”
This is hard to read and as we will see the operations go inside out and right to left making collaborating or revising your work tricky. To understand the ordering by which R is executing this instruction we can break this into 3 steps.
age_118 <- orange_df$age == 118
age_118
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[16] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[31] FALSE FALSE FALSE FALSE FALSE
This produces a TRUE/FALSE vector. We can see this next to the original dataframe by using “cbind” to combine columns.
cbind(age_118, orange_df)
age_118 Tree age circumference
1 TRUE 1 118 30
2 FALSE 1 484 58
3 FALSE 1 664 87
4 FALSE 1 1004 115
5 FALSE 1 1231 120
6 FALSE 1 1372 142
7 FALSE 1 1582 145
8 TRUE 2 118 33
9 FALSE 2 484 69
10 FALSE 2 664 111
11 FALSE 2 1004 156
12 FALSE 2 1231 172
13 FALSE 2 1372 203
14 FALSE 2 1582 203
15 TRUE 3 118 30
16 FALSE 3 484 51
17 FALSE 3 664 75
18 FALSE 3 1004 108
19 FALSE 3 1231 115
20 FALSE 3 1372 139
21 FALSE 3 1582 140
22 TRUE 4 118 32
23 FALSE 4 484 62
24 FALSE 4 664 112
25 FALSE 4 1004 167
26 FALSE 4 1231 179
27 FALSE 4 1372 209
28 FALSE 4 1582 214
29 TRUE 5 118 30
30 FALSE 5 484 49
31 FALSE 5 664 81
32 FALSE 5 1004 125
33 FALSE 5 1231 142
34 FALSE 5 1372 174
35 FALSE 5 1582 177
Next we retrieve the circumference of the trees for all observations that correspond to age_118 being TRUE.
circumference_118 <- orange_df$circumference[age_118]
circumference_118
[1] 30 33 30 32 30
You can double check that this is correct.
Calculate the average using the function “mean()”.
mean(circumference_118)
[1] 31
The tidyverse is designed to allow us to write the procedures in the order of these steps. Where:
mean(orange_df$circumference[orange_df$age == 118])
[1] 31
Is read as Step 3, Step 2, Step 1, the tidyverse allows us to write the steps in order.
orange_df %>% # (1) Look at the object orange_df and then (%>%)
filter(age == 118) %>% # (2) Filter the rows where the age column is equal to 118 and then
select(circumference) %>% # (3) Select the column called circumference and then
unlist() %>% # (4) unlist the data (this is a simple structure transformation) and then
mean() # (5) calculate the mean.
[1] 31
The tidy method gives us the same answer we got before but the code is much more readable and intuitive.
But what if we wanted to calculate the average of all the trees circumferences at each age?
Using base R we could use the function “aggregate()”. “aggregate” uses an equation format to seperate what we want to find out (in this case circumference) from what we want to define the categories under (in this case age).
The input structure is as follows: aggregate(equation, dataframe, function)
We want to know the mean circumference after sorting by age so the equation would be “circumference ~ age”, the dataframe is “orange_df” and the function we want to use is “mean”.
aggregate(circumference ~ age, orange_df, mean)
age circumference
1 118 31.0
2 484 57.8
3 664 93.2
4 1004 134.2
5 1231 145.6
6 1372 173.4
7 1582 175.8
The above function has calculated the average circumference for each age.
Using the tidyverse we would employ a functions called “group_by” and “summarise” and it would read as follows:
The great thing about the tidyverse is that the function names almost always describe exactly what they do.
orange_df %>% # (1) Use the dataframe orange_df and then
group_by(age) %>% # (2) group the data by age and then
summarise(mean(circumference)) # (3) calculate the mean for each group
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 7 x 2
age `mean(circumference)`
<dbl> <dbl>
1 118 31
2 484 57.8
3 664 93.2
4 1004 134.
5 1231 146.
6 1372 173.
7 1582 176.
As your operations become more complex, readability becomes more and more important.
Note that both methods are efficient and produce the same results. Tidyverse solutions tend to follow more consistent logic than the hundreds of methods we could employ to solve similar problems.
Additionally, the Tidyverse packages are all designed to read data of the same basic structure and then output data in a consistent manner. The data we input into our Tidyverse functions are called tidy data.
The final section of this week’s lesson will introduce Tidy data, how to get your data tidy and why this will make your life easier.
Time stamped video on Loom (19min)
The “Tidy Data” video will repeat the first example provided here while explaining the concept of tidy data.
Lets start by clearing our environment so we can clearly see each object we create. If you would also like to clear your console remember you can press “ctrl” plus “l”
rm(list = ls())
The Tidy data philosophy comes from Hadley Wikhams paper linked here
I highly recommend you read the paper and then attempt the exercises shown here or go through it in conjunction with this section. The paper is a very easy read and I will use the examples, headings and structure in that paper in this section of the lesson.
Tidy Data has 3 rules:
Let’s elaborate on these terms:
Identifiers such as Tree numbers, or share names all count as a variable. Additionally, a time period identifier such as time stamps, or years or days of the week is a variable. This means identifiers should be in their own column and not be column names.
Observations are descriptions of a measurement of interest.
For example consider a looking at two share’s prices over two days. Each measurement of each share’s price would form an observation giving us four observations in total. Each observation would be described by a date column, a share code column and a price column. The date, the share code and the price collectively make up an observation.
We will go through some common examples of untidy data and show why it is untidy using the three rules mentioned above. Note that the examples will by no means be exhaustive because there may be one way for data to be tidy but there are an infinite many ways for it to be untidy.
One easy way to spot untidy data is to check whether observations are coordinates rather than rows
Example 1 (Table 1 in Wickham’s paper)
treatment_a <- c(NA, 16, 3)
treatment_b <- c(2, 11, 1)
df_1 <- data.frame(treatment_a, treatment_b)
rownames(df_1) <- c("John Smith", "Jane Doe", "Mary Johnson") # Names the rows of an object
df_1
treatment_a treatment_b
John Smith NA 2
Jane Doe 16 11
Mary Johnson 3 1
df_1 is a very common data format, particularly in spreadsheet software like Excel. The data is efficently organised without repetition and easy to read. But let’s assess whether it satisfies the three tidy rules.
According to the first rule each variable should form a column.
This is not satisfied for two reasons. The first rule is violatted because the identifiers or person names are not their own column but rather row names. Names or identifiers should always be their own column. Also, the treatment used should be a variable and not a heading as it is part of the information describing each observation.
According to the second rule each observation should form a row.
The second rule is violated because each observation is a coordinate and not a row. By coordinate, i mean that the observation is contained in a combination of a row number and a column number instead of just a single row.
This will become clearer with when we look at the tidy version of this data.
Example 2 (Table 2 in Wickham’s paper)
Table two is the same data displayed with the person names as column names and the treatments as rownames.
To recreate this we can simply transpose df_1. “t()” is R’s transpose function
df_2 <- t(df_1)
Notice that because transpose is a matrix operation, R coerced the dataframe into a matrix.
is.data.frame(df_2)
[1] FALSE
is.matrix(df_2)
[1] TRUE
This is also a good time to observe how matrices and dataframe are displayed in our global environment. Dataframes are described based on their number of observations and their number of variables. Matrices are displayed based on their dimensions as a simple data structure: first dimension is of length 2, second dimension is of length 3. Or more simply, 2 rows and 3 columns.
Lets coerce df_2 back into a dataframe and double check that none of the data was lost.
df_2 <- as.data.frame(df_2)
df_2
John Smith Jane Doe Mary Johnson
treatment_a NA 16 3
treatment_b 2 11 1
One way of thinking about why tidy data is important is trying to decide which object is better between df_1 and df_2? It’s hard to justify one over the other and the choice seems arbitrary.
Both are untidy because they have one variable stored in rownames and another variable stored in column names. Each observation is determined by a row/column coordinate but we know each observation should be in it’s own row.
Additionally, without an overall heading, we don’t actually know what the data inside the table means? What are 1, 2, 3, 11 and 16 measurements of?
Lets look at the tidy version of df_1 and df_2 by inputting it manually (table 3). Note that we can name the columns of our dataframes while we input them. Also note that we can put each variable on a separate line to make it easier to read.
df_tidy <- data.frame(name = c("John Smith", "Jane Doe", "Mary Johnson", "John Smith", "Jane Doe", "Mary Johnson"),
trt = c("a", "a", "a", "b", "b", "b"),
result = c(NA, 16, 3 , 2 , 11, 1))
df_tidy
name trt result
1 John Smith a NA
2 Jane Doe a 16
3 Mary Johnson a 3
4 John Smith b 2
5 Jane Doe b 11
6 Mary Johnson b 1
Here we would have the opportunity to describe the data better if we knew what the “result” was. With better information we could, as an example, call it “recovery time (in days)”.
Notice that the tidy version of the dataframe look less efficient. Where the untidy versions occupied 3 rows (columns) and 2 columns (rows), the tidy version occupies 6 rows and 3 columns. Importantly however, we have labels for each variable and there is only one version of this tidy data. Alternatively, we could have one person organise their data like df_1 and another organise it like df_2 without any good reason why one is better than the other. This would lead to two different people coming up with individual solutions to manipulating and analysing their data but with tidy data there would be only one.
For a discussion on the identifcation of variables and observations, please read page 4, paragraph 2 of the “Tidy Data” paper linked to above.
Extracted from (pg 5 Section 3. Tidying messy datasets)
Let’s go through a couple of these and tidy them up!
We will start by recreating the table given in the paper.
Note that I provide the names separately to the data here whereas I did it all at once in creating df_tidy. This is because the names are irregular in that they contain symbols and numbers that create errors when input into the data.frame() function. The names() function is more lenient in this regard.
df_4 <- data.frame(
c("Agnostic", "Athiest", "Buddhist", "Catholic", "Don't know/refused", "Evangelical Prot", "Hindu", "Historically Black Prot", "Jehovah's Witness", "Jewish"),
c(27, 12, 27, 418, 15, 575, 1, 228, 20, 19),
c(34, 27, 21, 617, 14, 869, 9, 244, 27, 19),
c(60, 37, 30, 732, 15, 1064, 7, 236, 24, 25),
c(81, 52, 34, 670, 11, 982, 9, 238, 24, 25),
c(76, 35, 33, 638, 10, 881, 11, 197, 21, 30),
c(137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95)
)
names(df_4) <- c("religion", "<$10k", "$10-20k", "$20-30k", "$30-40k", "$40-50k", "$50-75k")
df_4
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
1 Agnostic 27 34 60 81 76 137
2 Athiest 12 27 37 52 35 70
3 Buddhist 27 21 30 34 33 58
4 Catholic 418 617 732 670 638 1116
5 Don't know/refused 15 14 15 11 10 35
6 Evangelical Prot 575 869 1064 982 881 1486
7 Hindu 1 9 7 9 11 34
8 Historically Black Prot 228 244 236 238 197 223
9 Jehovah's Witness 20 27 24 24 21 30
10 Jewish 19 19 25 25 30 95
The data used here comes from a Pew research poll on religion and income. The final three columns have been omitted as per the Wickham paper. The original research link provided in the paper seems to be down. This is what I think the data links to although the brackets seem to be further aggregated.
Notice that although it may be obvious that the numbers represent a count of the people that fit into a religious/income combination, this is not explicitly clear (it could have been average age…).
Lets look at the data and try to identify our variables and then identify our observations as units combining one measurement of each variable.
The first column seems to be a variable and it is already in its own column so this doesn’t need fixing. The next variable is the income brackets which is stored in the column names of column 2-7. This should be its own column. The last variable is the count of the sample population that fits into each religion/ income bracket coordinate.
Lets see how Tidyverse can help us tidy this data up.
We are going to be using a function called “gather()” to gather up the variable stored in column names and make it a variable column.
We can use the help function first to see what inputs gather requires.
?gather
It seems that this function requires three main inputs: data, key, and value as well as information regarding which columns it is gathering (the default will gather all).
We can specify that we are looking at “df_4” by using “df_4 %>% gather()” so we won’t need to repeat this. Key is asking us what to call the values that used to be column names. Value is asking us what to call the values that used to be coordinates. Finally, we specify 2:7 because we only want to gather columns 2:7 because column 1 was already a variable.
df_4_tidy <- df_4 %>%
gather(key = "Income Bracket",
value = "Number of people",
2:7)
religion Income Bracket Number of people
1 Agnostic <$10k 27
2 Athiest <$10k 12
3 Buddhist <$10k 27
4 Catholic <$10k 418
5 Don't know/refused <$10k 15
6 Evangelical Prot <$10k 575
7 Hindu <$10k 1
8 Historically Black Prot <$10k 228
9 Jehovah's Witness <$10k 20
10 Jewish <$10k 19
11 Agnostic $10-20k 34
12 Athiest $10-20k 27
13 Buddhist $10-20k 21
14 Catholic $10-20k 617
15 Don't know/refused $10-20k 14
16 Evangelical Prot $10-20k 869
17 Hindu $10-20k 9
18 Historically Black Prot $10-20k 244
19 Jehovah's Witness $10-20k 27
20 Jewish $10-20k 19
21 Agnostic $20-30k 60
22 Athiest $20-30k 37
23 Buddhist $20-30k 30
24 Catholic $20-30k 732
25 Don't know/refused $20-30k 15
26 Evangelical Prot $20-30k 1064
27 Hindu $20-30k 7
28 Historically Black Prot $20-30k 236
29 Jehovah's Witness $20-30k 24
30 Jewish $20-30k 25
31 Agnostic $30-40k 81
32 Athiest $30-40k 52
33 Buddhist $30-40k 34
34 Catholic $30-40k 670
35 Don't know/refused $30-40k 11
36 Evangelical Prot $30-40k 982
37 Hindu $30-40k 9
38 Historically Black Prot $30-40k 238
39 Jehovah's Witness $30-40k 24
40 Jewish $30-40k 25
41 Agnostic $40-50k 76
42 Athiest $40-50k 35
43 Buddhist $40-50k 33
44 Catholic $40-50k 638
45 Don't know/refused $40-50k 10
46 Evangelical Prot $40-50k 881
47 Hindu $40-50k 11
48 Historically Black Prot $40-50k 197
49 Jehovah's Witness $40-50k 21
50 Jewish $40-50k 30
51 Agnostic $50-75k 137
52 Athiest $50-75k 70
53 Buddhist $50-75k 58
54 Catholic $50-75k 1116
55 Don't know/refused $50-75k 35
56 Evangelical Prot $50-75k 1486
57 Hindu $50-75k 34
58 Historically Black Prot $50-75k 223
59 Jehovah's Witness $50-75k 30
60 Jewish $50-75k 95
Note that placing each argument from the function on a separate line is completely optional but I find it makes it easier to read. Also, we could have also used all columns except 1 “-1” instead of “2:7”. Try it out!
The opposite of “gather” is “spread”. We can use this to reverse the process done to make df_4_tidy. The inputs are the same as gather but the key is asking for the column that you would like to turn into column names, and the value is asking for the column to spread out into coordinates.
df_4_untidy <- df_4_tidy %>%
spread(key = "Income Bracket", value = "Number of people", -1)
This is quite a tricky concept to grasp straight away so play around with it and keep in mind you will have to revisit this concept often to remind yourself how it works (I did anyway).
df_4_untidy should now back to its original format and should be identical to df_4. This can be useful if you would like to perform analysis or transformations on data but then return to its more readable form later.
identical(df_4, df_4_untidy)
[1] FALSE
Note that identical returns FALSE only because the gather and spread operations have reordered the variables and placed the “<$10k” as the 7th variable instead of the 2nd. We can fix this manually by simply instructing R to place the “<$10k” in 2nd column using the “select()” function.
df_4_untidy <- df_4_untidy %>%
select(1, 7, everything())
Select is usually used to select certain columns from your dataframe, however it also reorders your dataframe according to the order that you select the variables. There are much neater solutions than the one provided above but we are not yet ready to introduce them.
For now, lets just note why the above solution is bad.
Last, we will look at a particularly tough dataset.
This data will have multiple variables stored as column names and require a few steps to clean up. We are replicating Table 9 from the paper which has data from the World Health Organisation on Tuberculosis cases sorted by country, year and demographic group.
df_messiest <- data.frame(
country = c("AD", "AE", "AF", "AG", "AL", "AM", "AN", "AO", "AR", "AS"),
year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000),
m014 = c(0, 2, 52, 0, 2, 2, 0, 186, 97, NA),
m1524 = c(0, 4, 228, 0, 19, 152, 0, 999, 278, NA),
m2534 = c(1, 4, 183, 0, 21, 130, 1, 1003, 594, NA),
m3544 = c(0, 6, 149, 0, 14, 131, 2, 912, 402, NA),
m4554 = c(0, 5, 129, 0, 24, 63, 0, 482, 419, 1),
m5564 = c(0, 12, 94, 0, 19, 26, 0, 312, 368, 1),
m65 = c(0, 10, 80, 1, 16, 21, 0, 194, 330, NA),
mu = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
f014 = c(NA, 3, 93, 1, 3, 1, 0, 247, 121, NA)
)
df_messiest
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
1 AD 2000 0 0 1 0 0 0 0 NA NA
2 AE 2000 2 4 4 6 5 12 10 NA 3
3 AF 2000 52 228 183 149 129 94 80 NA 93
4 AG 2000 0 0 0 0 0 0 1 NA 1
5 AL 2000 2 19 21 14 24 19 16 NA 3
6 AM 2000 2 152 130 131 63 26 21 NA 1
7 AN 2000 0 0 1 2 0 0 0 NA 0
8 AO 2000 186 999 1003 912 482 312 194 NA 247
9 AR 2000 97 278 594 402 419 368 330 NA 121
10 AS 2000 NA NA NA NA 1 1 NA NA NA
This is clearly a very messy and hard to read table. Here the column names (excluding “country” and “year”) hold multiple pieces of information. m014 is male that is between 0 and 14 years old. Likewise f is for female and u stands for unknown age. The dataset is cut off and not showing the rest of the data for females for simplicity.
The untidiness of this dataset is complex.
df_less_messy <- df_messiest %>%
gather(key = "sex_age", value = "cases", 3:11)
country year sex_age cases
1 AD 2000 m014 0
2 AE 2000 m014 2
3 AF 2000 m014 52
4 AG 2000 m014 0
5 AL 2000 m014 2
6 AM 2000 m014 2
7 AN 2000 m014 0
8 AO 2000 m014 186
9 AR 2000 m014 97
10 AS 2000 m014 NA
11 AD 2000 m1524 0
12 AE 2000 m1524 4
13 AF 2000 m1524 228
14 AG 2000 m1524 0
15 AL 2000 m1524 19
16 AM 2000 m1524 152
17 AN 2000 m1524 0
18 AO 2000 m1524 999
19 AR 2000 m1524 278
20 AS 2000 m1524 NA
21 AD 2000 m2534 1
22 AE 2000 m2534 4
23 AF 2000 m2534 183
24 AG 2000 m2534 0
25 AL 2000 m2534 21
26 AM 2000 m2534 130
27 AN 2000 m2534 1
28 AO 2000 m2534 1003
29 AR 2000 m2534 594
30 AS 2000 m2534 NA
31 AD 2000 m3544 0
32 AE 2000 m3544 6
33 AF 2000 m3544 149
34 AG 2000 m3544 0
35 AL 2000 m3544 14
36 AM 2000 m3544 131
37 AN 2000 m3544 2
38 AO 2000 m3544 912
39 AR 2000 m3544 402
40 AS 2000 m3544 NA
41 AD 2000 m4554 0
42 AE 2000 m4554 5
43 AF 2000 m4554 129
44 AG 2000 m4554 0
45 AL 2000 m4554 24
46 AM 2000 m4554 63
47 AN 2000 m4554 0
48 AO 2000 m4554 482
49 AR 2000 m4554 419
50 AS 2000 m4554 1
51 AD 2000 m5564 0
52 AE 2000 m5564 12
53 AF 2000 m5564 94
54 AG 2000 m5564 0
55 AL 2000 m5564 19
56 AM 2000 m5564 26
57 AN 2000 m5564 0
58 AO 2000 m5564 312
59 AR 2000 m5564 368
60 AS 2000 m5564 1
61 AD 2000 m65 0
62 AE 2000 m65 10
63 AF 2000 m65 80
64 AG 2000 m65 1
65 AL 2000 m65 16
66 AM 2000 m65 21
67 AN 2000 m65 0
68 AO 2000 m65 194
69 AR 2000 m65 330
70 AS 2000 m65 NA
71 AD 2000 mu NA
72 AE 2000 mu NA
73 AF 2000 mu NA
74 AG 2000 mu NA
75 AL 2000 mu NA
76 AM 2000 mu NA
77 AN 2000 mu NA
78 AO 2000 mu NA
79 AR 2000 mu NA
80 AS 2000 mu NA
81 AD 2000 f014 NA
82 AE 2000 f014 3
83 AF 2000 f014 93
84 AG 2000 f014 1
85 AL 2000 f014 3
86 AM 2000 f014 1
87 AN 2000 f014 0
88 AO 2000 f014 247
89 AR 2000 f014 121
90 AS 2000 f014 NA
Now we need to separate sex and age brackets into their own columns. To do this we use a function called“separate()”. Separate asks us which column to separate (sex_age) and then asks what the names of the new columns should be (sex and age), and finally where it should create the split.
df_tidied_up <- df_less_messy %>%
separate(sex_age, into = c("sex", "age"), sep = 1)
country year sex age cases
1 AD 2000 m 014 0
2 AE 2000 m 014 2
3 AF 2000 m 014 52
4 AG 2000 m 014 0
5 AL 2000 m 014 2
6 AM 2000 m 014 2
7 AN 2000 m 014 0
8 AO 2000 m 014 186
9 AR 2000 m 014 97
10 AS 2000 m 014 NA
11 AD 2000 m 1524 0
12 AE 2000 m 1524 4
13 AF 2000 m 1524 228
14 AG 2000 m 1524 0
15 AL 2000 m 1524 19
16 AM 2000 m 1524 152
17 AN 2000 m 1524 0
18 AO 2000 m 1524 999
19 AR 2000 m 1524 278
20 AS 2000 m 1524 NA
21 AD 2000 m 2534 1
22 AE 2000 m 2534 4
23 AF 2000 m 2534 183
24 AG 2000 m 2534 0
25 AL 2000 m 2534 21
26 AM 2000 m 2534 130
27 AN 2000 m 2534 1
28 AO 2000 m 2534 1003
29 AR 2000 m 2534 594
30 AS 2000 m 2534 NA
31 AD 2000 m 3544 0
32 AE 2000 m 3544 6
33 AF 2000 m 3544 149
34 AG 2000 m 3544 0
35 AL 2000 m 3544 14
36 AM 2000 m 3544 131
37 AN 2000 m 3544 2
38 AO 2000 m 3544 912
39 AR 2000 m 3544 402
40 AS 2000 m 3544 NA
41 AD 2000 m 4554 0
42 AE 2000 m 4554 5
43 AF 2000 m 4554 129
44 AG 2000 m 4554 0
45 AL 2000 m 4554 24
46 AM 2000 m 4554 63
47 AN 2000 m 4554 0
48 AO 2000 m 4554 482
49 AR 2000 m 4554 419
50 AS 2000 m 4554 1
51 AD 2000 m 5564 0
52 AE 2000 m 5564 12
53 AF 2000 m 5564 94
54 AG 2000 m 5564 0
55 AL 2000 m 5564 19
56 AM 2000 m 5564 26
57 AN 2000 m 5564 0
58 AO 2000 m 5564 312
59 AR 2000 m 5564 368
60 AS 2000 m 5564 1
61 AD 2000 m 65 0
62 AE 2000 m 65 10
63 AF 2000 m 65 80
64 AG 2000 m 65 1
65 AL 2000 m 65 16
66 AM 2000 m 65 21
67 AN 2000 m 65 0
68 AO 2000 m 65 194
69 AR 2000 m 65 330
70 AS 2000 m 65 NA
71 AD 2000 m u NA
72 AE 2000 m u NA
73 AF 2000 m u NA
74 AG 2000 m u NA
75 AL 2000 m u NA
76 AM 2000 m u NA
77 AN 2000 m u NA
78 AO 2000 m u NA
79 AR 2000 m u NA
80 AS 2000 m u NA
81 AD 2000 f 014 NA
82 AE 2000 f 014 3
83 AF 2000 f 014 93
84 AG 2000 f 014 1
85 AL 2000 f 014 3
86 AM 2000 f 014 1
87 AN 2000 f 014 0
88 AO 2000 f 014 247
89 AR 2000 f 014 121
90 AS 2000 f 014 NA
Notice that “sep” argument is very versatile, if your values have a separator such as “-” or “/” then you can input these (in "") to find the separator and use this. Alternatively, if your data has no separator like in our example, you can input a numerical value to tell R where in the string it should split. Here we told it to split after the 1st character.
Finally our data is tidy!
Although there are additional operations that may be useful here such as expanding “m” and “f” into “male” and “female” and “2534” into “25-34”, we will save these for you to do at home.
Take the tidy dataset “Orange” and transform it into a messy version. Make Tree numbers the rownames and age the column names.
Using the tidy version of “Orange”, use ggplot to plot (use a scatter plot) the Tree circumference (Y) against the age (X) for all Trees while using Tree numbers to change the colour.
Hint use “?ggplot” and “?geom_point” as well as “example(geom_point)” for help
No Swirl homework this time
Operators Covered | Brief explanation |
---|---|
“%>%” | (Tidyverse) Pipe operator. Can be thought of as “and then” and is used to reorder the operations into R. Shortcut key is “ctrl” + “shift” + “m” |
“~” | Used as an equation “=” example: y ~ mx + c. Commonly used in lm() and aggregate(). |
Functions Covered | Brief explanation |
---|---|
“browseURL” | Function that allows URL addresses to be accessed from R. Note, you need to put the URL in "". |
“cbind and rbind” | Function that combines all columns (cbind) or all rows (rbind). Requires that they are equal length |
“filter” | (Tidyverse) Used to subset rows (often conditionally) |
“select” | (Tidyverse) Used to select columns (variables) |
“unlist” | Simplifies an object from a list into a vector. This may coerce datatypes |
“aggregate” | Calculates a function on a dataframe’s variable y based on sorting according to the same dataframe’s x variables. Form: aggregate(y ~ x1 + x2, function to apply) |
“group_by” | (Tidyverse) Groups a dataframe according to a variable or variables. Oftyen used with summarise() |
“summarise” | (Tidyverse) performs a function on all groupings of a specified variable after being grouped according to group_by(). |
“rownames”, “colnames”" and “names”" | Accesses the name vector for rows, columns or lists/vectors respectively. Can be used to view them, or to replace them |
“t” | matrix transpose. If applied to rectangular non-matrix objects it will coerce them into a matrix imposing any datatype coercions necessary |
“gather” | (Tidyverse) Used when a variable is stored as column names. Gather creates a new variable from the column names (key) and a variable for the data previosuly stored under these columns names (values) |
“spread” | (Tidyverse) The opposite of “gather”. Spread takes a variable (key) and spreads it out as column names while populating the columns with values from another column (value) |
“separate” | (Tidyverse) Takes a selected column and splits it into two columns. Can split according to position (example 5th character) or according to a separator “-” or “/”. |
“sep” | specifies a separator. Used to join or separate |
Packages Covered | Brief explanation |
---|---|
“tidyverse” | A group of packages including: ggplot2, dplyr, tidyr, readr, purrr tibble, stringr and forcats |
“ggplot2” | A graphing tool based on the Grammer of Graphics |
“dplyr” | Data manipulation packages that relies on a small set of verbs |
“tidyr” | Provides functions to convert datasets to tidy data format |
“readr” | Fast solution for reading in (importing) rectangular data |
“purrr” | Expansion of the functional programming toolkit |
“tibble” | A reimagined dataframe |
“stringr” | Allows simple handling of string (character) data |
“forcats” | Simplified framework for handling of categorical data |