R Workshop

Ryan Yeung

2023-06-12

Overview

About Me
Why R?
Learning R
Doing R

About Me

Postdoctoral Fellow in the Levine lab
PhD at the University of Waterloo (Fernandes lab)
Almost entirely self-taught!
Only really abandoned SPSS in year 2 of my PhD!

Why R?

R Saves Time and Energy

“Psychology major Aesthetic 🙃”

“I’m so proud of myself for not crying 6 straight hours in…”

370 points (98% upvoted) on /r/psychologystudents

Source: https://archive.is/QDXdA

R Enables Research

neuroimaging

computational modeling

machine learning

R Gets Jobs

Source: https://archive.is/NuTN7

Learning R

You Have to Choose R

But it Doesn’t Have to Stink

Learn with/from others
Scour the internet ruthlessly
Everyone gets errors and needs to debug
Be kind to yourself!

(While this picture’s just a joke, code plagiarism is a real thing!
Always cite your sources!)

Doing R

R and RStudio

R: A programming language.
- Free and open source since 1995
RStudio: An integrated development environment (IDE) developed by Posit (formerly RStudio).

RStudio’s User Interface

Documentation: https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html

File Types

.R: R script.
- Only R code, always runs top to bottom.
.Rmd: R Markdown.
- Combines R code with Markdown. Runs code in chunks.
.qmd: Quarto Markdown.
- Recent update to .Rmd, mostly very similar but offers new features (e.g., more output types).
.Rproj: R Project.
- Sets the working directory, remembers what files you last opened.

Starting a Project

RStudio >
File >
New Project >
New/Existing Directory >
Quarto Project

Starting a Script

RStudio >
File >
New File >
Quarto Document

YAML

Non-R code at the start of a script file (e.g., .Rmd, .qmd)
Sets options & metadata for the script file globally
Available options depend on the output format

---
title: "r_workshop_2023-06-12"
author: "Ryan Yeung"
date: '2023-06-12'
format:
  html:
    toc: true
    toc-location: left
    code-fold: true
    code-tools: true
    code-link: true
    self-contained: true
    theme: sandstone
    df-print: paged
---

Documentation: https://quarto.org/docs/output-formats/all-formats.html

Code Chunks

```{r code chunks example}
#| code-line-numbers: "|1,14"
#| output: false
read_csv("data/datacareer_28kjobs.csv", show_col_types = FALSE) %>% 
  filter(!(language == "Python")) %>% 
  ggplot(aes(x = popularity, y = role)) +
  geom_col() +
  facet_wrap(~language, ncol = 1) +
  labs(x = "% of Indeed Job Ads in 2017 (N = 28,732)",
       y = "Role") +
  theme_classic() + 
  theme(text = element_text(size = 18),
        axis.title = element_text(face = "bold", size = 24))
```

Run Current Line: Ctrl/Cmd + Enter
Run Current Chunk: Ctrl/Cmd + Shift + Enter
Insert/Break Chunk: Ctrl/Cmd + Alt/Option + I

Chunk Options

```{r chunk options example}
#| code-line-numbers: "2-3"
#| output: false
read_csv("data/datacareer_28kjobs.csv", show_col_types = FALSE) %>% 
  filter(!(language == "Python")) %>% 
  ggplot(aes(x = popularity, y = role)) +
  geom_col() +
  facet_wrap(~language, ncol = 1) +
  labs(x = "% of Indeed Job Ads in 2017 (N = 28,732)",
       y = "Role") +
  theme_classic() + 
  theme(text = element_text(size = 18),
        axis.title = element_text(face = "bold", size = 24))
```

Sets options for single chunks

Documentation: https://quarto.org/docs/reference/cells/cells-knitr.html
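As a rough sketch of what chunk options can do, the chunk below hides its own code (echo) and suppresses warnings (warning) in the rendered output. The chunk label and the object name converted are made up for illustration; echo, warning, and label are options listed in the Quarto documentation above.

```{r}
#| label: chunk-options-sketch
#| echo: false
#| warning: false

# options must sit at the very top of the chunk, each starting with #|
# converting text to a number normally prints a warning;
# with warning: false, that warning is left out of the rendered document
converted <- as.numeric("apple")
converted
```

Options set this way apply only to this chunk; options in the YAML apply to the whole document.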

Setup Chunk

```{r setup example}
#| eval: false

library(tidyverse)

### if you haven't installed the tidyverse (e.g., you get
### Error in library(tidyverse): there is no package called ‘tidyverse’),
### run install.packages("tidyverse") in the console
```

R Code

```{r add}
1 + 2
```

[1] 3

Objects and Assignment

```{r}
my_object <- 1 + 2  
```

Run code, then assign the output to my_object.

Objects and Assignment

```{r}
my_object <- 1 + 2    
my_object            
```

[1] 3

Run code, then assign the output to my_object.
Print my_object.

Insert Assignment Operator (“<-”): Alt/Opt + -

Printing Objects

```{r assign to object and print using parentheses}
(my_object <- 1 + 2)
```

[1] 3

Operating on Objects

```{r double my object}
my_object <- 1 + 2

(my_object_doubled <- my_object*2)
```

[1] 6

Functions

```{r use object as argument in function}
my_object <- 1 + 2

sqrt(my_object)
```

[1] 1.732051

function_name(arg1 = val1, arg2 = val2, …)
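A small sketch of that pattern using two base R functions, round() and seq(). The object names are made up for illustration. Arguments can be matched by position or by name; when named, their order doesn't matter.

```{r}
# arguments matched by position first, then by name
rounded_by_position <- round(3.14159, digits = 2)

# fully named arguments can go in any order
rounded_by_name <- round(digits = 2, x = 3.14159)

# a function with several named arguments
my_sequence <- seq(from = 1, to = 10, by = 3)

rounded_by_position
rounded_by_name
my_sequence
```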

Object Types

```{r numeric/double}
my_object <- 1 + 2

class(my_object)
```

[1] "numeric"

```{r character/strings}
my_object <- "apple"

class(my_object)
```

[1] "character"

Error!

```{r}
#| eval: false
my_object <- "apple"

sqrt(my_object)
```

Console:
- Error in sqrt(my_object) : non-numeric argument to mathematical function

What Do You Want From Me??

In the Console, type ?sqrt and press enter.
- Usage: sqrt(x)
- Arguments: x: a numeric or complex vector or array.

Also accessible through “Help” in the Output pane!

Comments & Markdown

```{r title}
### you can comment an entire line
#### with any number of # symbols
my_object <- 1 + 2 # or, you can comment at the end of a line

sqrt(my_object)
```

[1] 1.732051

This is Markdown!

Good Habits for Comments/Markdown

Comment liberally!
Try to describe why, rather than what
Use comments to “hide” code rather than deleting it

```{r comments example}
#| output: false
my_object <- 1 + 2
# my_object <- "apple"

sqrt(as.numeric(my_object)) # what: convert my_object to numeric, then sqrt

sqrt(as.numeric(my_object)) # why: sqrt only accepts numeric objects
```

Dataframes (DFs)

Tabular data format
- Rows and columns (variables) like a spreadsheet
- Today, we’ll be working with the msleep dataframe (from ggplot2, which is part of the tidyverse)

                        name      genus  vore sleep_total brainwt
1                    Cheetah   Acinonyx carni        12.1      NA
2                 Owl monkey      Aotus  omni        17.0 0.01550
3            Mountain beaver Aplodontia herbi        14.4      NA
4 Greater short-tailed shrew    Blarina  omni        14.9 0.00029
5                        Cow        Bos herbi         4.0 0.42300
6           Three-toed sloth   Bradypus herbi        14.4      NA

For more info, enter ?msleep into the console!

Subsetting

Base R:

Square brackets and/or $

Tidyverse:

filter():
- filteR() = Rows
- Keep only the rows that meet this condition

select():
- seleCt() = Columns
- Keep only the columns that meet this condition
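For comparison with filter() and select(), here is a minimal sketch of the base R route with square brackets and $. The object names are made up for illustration, and which() is one way (not the only one) to skip rows where vore is missing.

```{r}
library(ggplot2) # msleep ships with ggplot2

# $ pulls out a single column as a vector
sleep_total_values <- msleep$sleep_total

# [rows, columns]: keep rows where vore is "herbi", keep all columns
# which() drops rows where vore is NA instead of carrying them along
herbivores <- msleep[which(msleep$vore == "herbi"), ]

# [rows, columns]: keep all rows, keep only two columns by name
names_and_vores <- msleep[, c("name", "vore")]

nrow(herbivores)
names(names_and_vores)
```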

Subsetting: Filter

# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Let’s say you only want herbivores:

Keep only rows where vore is equal to "herbi"

Subsetting: Filter

```{r filtering}
filter(msleep, vore == "herbi")
```

# A tibble: 32 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 2 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 3 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 4 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
 5 Goat   Capri herbi Arti… lc                   5.3       0.6      NA      18.7
 6 Guine… Cavis herbi Rode… domesticated         9.4       0.8       0.217  14.6
 7 Chinc… Chin… herbi Rode… domesticated        12.5       1.5       0.117  11.5
 8 Tree … Dend… herbi Hyra… lc                   5.3       0.5      NA      18.7
 9 Asian… Elep… herbi Prob… en                   3.9      NA        NA      20.1
10 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
# ℹ 22 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Subsetting: Select

# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Let’s say you only want sleep-related variables:

Keep only columns that start with "sleep"

Subsetting: Select

```{r selecting}
select(msleep, name, vore, starts_with("sleep"))
```

# A tibble: 83 × 5
   name                       vore  sleep_total sleep_rem sleep_cycle
   <chr>                      <chr>       <dbl>     <dbl>       <dbl>
 1 Cheetah                    carni        12.1      NA        NA    
 2 Owl monkey                 omni         17         1.8      NA    
 3 Mountain beaver            herbi        14.4       2.4      NA    
 4 Greater short-tailed shrew omni         14.9       2.3       0.133
 5 Cow                        herbi         4         0.7       0.667
 6 Three-toed sloth           herbi        14.4       2.2       0.767
 7 Northern fur seal          carni         8.7       1.4       0.383
 8 Vesper mouse               <NA>          7        NA        NA    
 9 Dog                        carni        10.1       2.9       0.333
10 Roe deer                   herbi         3        NA        NA    
# ℹ 73 more rows

Naming Conventions

Source: https://twitter.com/allison_horst/status/1205702878544875521

Good Habits for Naming

Names should be (1) machine readable, (2) human readable, and (3) sorted usefully by default
- https://speakerdeck.com/jennybc/how-to-name-files

sleep_total, sleep_rem, sleep_cycle
- starts_with("sleep")
total_sleep, rem_sleep, sleep_cycle
- contains("sleep")
total, rem, cycle
- ???
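To see how naming affects selecting, here is a sketch with three made-up one-row dataframes, one per naming scheme above (the id column and all values are hypothetical).

```{r}
library(dplyr)

# hypothetical dataframes, one per naming scheme
df_prefix <- tibble(id = 1, sleep_total = 12.1, sleep_rem = 2.0, sleep_cycle = 0.5)
df_mixed  <- tibble(id = 1, total_sleep = 12.1, rem_sleep = 2.0, sleep_cycle = 0.5)
df_bare   <- tibble(id = 1, total = 12.1, rem = 2.0, cycle = 0.5)

# a consistent prefix means one helper grabs every sleep variable
selected_prefix <- df_prefix %>% select(starts_with("sleep"))

# inconsistent position still works, but needs the looser contains()
selected_mixed <- df_mixed %>% select(contains("sleep"))

# no shared pattern, so every column has to be typed out by hand
selected_bare <- df_bare %>% select(total, rem, cycle)

names(selected_prefix)
names(selected_mixed)
names(selected_bare)
```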

Subsetting: Filter AND Select?

Which option works (i.e., doesn’t throw an error)?

```{r filtering and selecting}
#| eval: false
# filter(msleep, vore == "herbi")
# select(msleep, name, vore, starts_with("sleep"))

### OPTION A:
select(filter(msleep, vore == "herbi"), name, vore, starts_with("sleep"))

### OPTION B:
filter(select(msleep, name, vore, starts_with("sleep")), vore == "herbi")

### OPTION C:
df_filtered <- filter(msleep, vore == "herbi")
select(df_filtered, name, vore, starts_with("sleep"))

### OPTION D:
# all of them work!
```

There Must Be A Better Way…

```{r without pipes}
#| eval: false
select(filter(msleep, vore == "herbi"), name, vore, starts_with("sleep"))
```

```{r without pipes but prettier}
#| eval: false
select(filter(msleep, vore == "herbi"),
       name, vore, starts_with("sleep"))
```

Pipe Notation

```{r with pipes}
#| eval: false
msleep %>% 
  filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))
```

Pipes (%>%) take the result of the code before them (on the left) and send that result into the code after them (on the right) as the first argument
To supply the result as a different argument (not the first), mark where it should go with . (period)
Insert Pipe: Ctrl/Cmd + Shift + M
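A minimal sketch of both behaviours, using base R functions so the results are easy to check. The object names are made up for illustration; %>% comes from the magrittr package, which library(tidyverse) also makes available.

```{r}
library(magrittr) # provides %>%

# default: the left side becomes the FIRST argument
piped_sqrt <- 16 %>% sqrt()   # same as sqrt(16)

# with . the left side goes wherever the period is
piped_rep <- 3 %>% rep("a", times = .)   # same as rep("a", times = 3)

piped_sqrt
piped_rep
```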

Why Pipes?

```{r}
msleep %>% 
  filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))
```

# A tibble: 32 × 5
   name             vore  sleep_total sleep_rem sleep_cycle
   <chr>            <chr>       <dbl>     <dbl>       <dbl>
 1 Mountain beaver  herbi        14.4       2.4      NA    
 2 Cow              herbi         4         0.7       0.667
 3 Three-toed sloth herbi        14.4       2.2       0.767
 4 Roe deer         herbi         3        NA        NA    
 5 Goat             herbi         5.3       0.6      NA    
 6 Guinea pig       herbi         9.4       0.8       0.217
 7 Chinchilla       herbi        12.5       1.5       0.117
 8 Tree hyrax       herbi         5.3       0.5      NA    
 9 Asian elephant   herbi         3.9      NA        NA    
10 Horse            herbi         2.9       0.6       1    
# ℹ 22 more rows

Why Pipes?

```{r}
msleep %>% 
  # filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))
```

# A tibble: 83 × 5
   name                       vore  sleep_total sleep_rem sleep_cycle
   <chr>                      <chr>       <dbl>     <dbl>       <dbl>
 1 Cheetah                    carni        12.1      NA        NA    
 2 Owl monkey                 omni         17         1.8      NA    
 3 Mountain beaver            herbi        14.4       2.4      NA    
 4 Greater short-tailed shrew omni         14.9       2.3       0.133
 5 Cow                        herbi         4         0.7       0.667
 6 Three-toed sloth           herbi        14.4       2.2       0.767
 7 Northern fur seal          carni         8.7       1.4       0.383
 8 Vesper mouse               <NA>          7        NA        NA    
 9 Dog                        carni        10.1       2.9       0.333
10 Roe deer                   herbi         3        NA        NA    
# ℹ 73 more rows

Pipe Quiz

Which option works (i.e., doesn’t throw an error)?

```{r pipe quiz}
#| eval: false

### OPTION A:
msleep %>% 
  select(filter(msleep, vore == "herbi"), vore, starts_with("sleep"))

### OPTION B:
filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))

### OPTION C:
msleep %>% 
  filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep")) %>% 

### OPTION D:
# none of them work!
```

Mutating

# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Let’s say you want animals’ amount of non-REM sleep:

Make a new variable, subtracting sleep_rem from sleep_total

Mutating

```{r mutate}
msleep %>% 
  mutate(sleep_nrem = sleep_total - sleep_rem) %>% 
  select(name, starts_with("sleep"))
```

# A tibble: 83 × 5
   name                       sleep_total sleep_rem sleep_cycle sleep_nrem
   <chr>                            <dbl>     <dbl>       <dbl>      <dbl>
 1 Cheetah                           12.1      NA        NA           NA  
 2 Owl monkey                        17         1.8      NA           15.2
 3 Mountain beaver                   14.4       2.4      NA           12  
 4 Greater short-tailed shrew        14.9       2.3       0.133       12.6
 5 Cow                                4         0.7       0.667        3.3
 6 Three-toed sloth                  14.4       2.2       0.767       12.2
 7 Northern fur seal                  8.7       1.4       0.383        7.3
 8 Vesper mouse                       7        NA        NA           NA  
 9 Dog                               10.1       2.9       0.333        7.2
10 Roe deer                           3        NA        NA           NA  
# ℹ 73 more rows

Plotting: Scatterplot

```{r scatterplot}
msleep %>% 
  ggplot(mapping = aes(x = sleep_total, y = sleep_rem),
         data = .) +  # '+' is specific to ggplot2, similar (but not equivalent) to the pipe!
  geom_point() 
```

Plotting: Boxplot

```{r boxplot}
msleep %>% 
  ggplot(aes(x = vore, y = sleep_total)) +
  geom_boxplot() 
```

Output

Render! (Knit in .Rmd)

Render/Knit at the top of the Source pane

RData and the Environment

RStudio asks to automatically store/reload environments
Highly recommended to NOT store/reload environments!
- The script should recreate the environment on its own!
Instead, objects can be manually stored and reloaded
- save() and load() for multiple objects, saveRDS() and readRDS() for a single object
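A sketch of storing and reloading objects manually. The object names are made up, and tempfile() is used only so the example doesn't leave files in your project; in practice you would likely choose a path inside your project folder.

```{r}
# hypothetical objects to store
my_object <- 1 + 2
my_other_object <- "apple"

# single object: saveRDS() to store, readRDS() to reload under any name
rds_path <- tempfile(fileext = ".rds")
saveRDS(my_object, rds_path)
my_object_reloaded <- readRDS(rds_path)

# multiple objects: save() to store, load() to reload under their original names
rdata_path <- tempfile(fileext = ".RData")
save(my_object, my_other_object, file = rdata_path)
rm(my_object, my_other_object)   # remove them from the environment
load(rdata_path)                 # both objects are back

my_object_reloaded
my_other_object
```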

Packages I Can’t Live Without

Resources

Feel free to email me out of the blue! ryeung@research.baycrest.org