R Workshop

Ryan Yeung

2023-06-12

Overview

  • About Me
  • Why R?
  • Learning R
  • Doing R

About Me

  • Postdoctoral Fellow in the Levine lab
  • PhD at the University of Waterloo (Fernandes lab)
  • Almost entirely self-taught!
  • Only really abandoned SPSS during PhD year 2!

Why R?

R Saves Time and Energy

“Psychology major Aesthetic 🙃”

“I’m so proud of myself for not crying 6 straight hours in…”

370 points (98% upvoted) on /r/psychologystudents

R Enables Research




  • Neuroimaging
  • Computational modeling
  • Machine learning

R Gets Jobs

Learning R

You Have to Choose R

But it Doesn’t Have to Stink

  • Learn with/from others
  • Scour the internet ruthlessly
  • Everyone gets errors and needs to debug
  • Be kind to yourself!

(While this picture’s just a joke, code plagiarism is a real thing!
Always cite your sources!)

Doing R

R and RStudio

  • R: A programming language.
    • Free and open source since 1995
  • RStudio: An integrated development environment (IDE) developed by Posit (formerly RStudio).

RStudio’s User Interface

Pane Layout

File Types

  • .R: R script.
    • Only R code, always runs top to bottom.
  • .Rmd: R Markdown.
    • Combines R code with Markdown. Runs code in chunks.
  • .qmd: Quarto Markdown.
    • Recent update to .Rmd, mostly very similar but offers new features (e.g., more output types).
  • .Rproj: R Project.
    • Sets the working directory, remembers what files you last opened.

Starting a Project

RStudio >
File >
New Project >
New/Existing Directory >
Quarto Project

Starting a Script

RStudio >
File >
New File >
Quarto Document

YAML

  • Non-R code at the start of a script file (e.g., .Rmd, .qmd)
  • Sets options & metadata for the script file globally
  • Available options depend on the output format
---
title: "r_workshop_2023-06-12"
author: "Ryan Yeung"
date: '2023-06-12'
format:
  html:
    toc: true
    toc-location: left
    code-fold: true
    code-tools: true
    code-link: true
    self-contained: true
    theme: sandstone
    df-print: paged
---

Code Chunks

```{r code chunks example}
#| code-line-numbers: "|1,14"
#| output: false
read_csv("data/datacareer_28kjobs.csv", show_col_types = FALSE) %>% 
  filter(!(language == "Python")) %>% 
  ggplot(aes(x = popularity, y = role)) +
  geom_col() +
  facet_wrap(~language, ncol = 1) +
  labs(x = "% of Indeed Job Ads in 2017 (N = 28,732)",
       y = "Role") +
  theme_classic() + 
  theme(text = element_text(size = 18),
        axis.title = element_text(face = "bold", size = 24))
```
  • Run Current Line: Ctrl/Cmd + Enter
  • Run Current Chunk: Ctrl/Cmd + Shift + Enter
  • Insert/Break Chunk: Ctrl/Cmd + Alt/Option + I

Chunk Options

```{r chunk options example}
#| code-line-numbers: "2-3"
#| output: false
read_csv("data/datacareer_28kjobs.csv", show_col_types = FALSE) %>% 
  filter(!(language == "Python")) %>% 
  ggplot(aes(x = popularity, y = role)) +
  geom_col() +
  facet_wrap(~language, ncol = 1) +
  labs(x = "% of Indeed Job Ads in 2017 (N = 28,732)",
       y = "Role") +
  theme_classic() + 
  theme(text = element_text(size = 18),
        axis.title = element_text(face = "bold", size = 24))
```
  • Sets options for single chunks

Setup Chunk

```{r setup example}
#| eval: false

library(tidyverse)

### if you haven't installed the tidyverse yet, you'll see an error like:
### Error in library(tidyverse): there is no package called ‘tidyverse’
### to fix it, run install.packages("tidyverse") in the console
```

R Code

```{r add}
1 + 2
```
[1] 3

Objects and Assignment

```{r}
my_object <- 1 + 2    
my_object            
```
[1] 3
  1. Run code, then assign the output to my_object.
  2. Print my_object.
  • Insert Assignment Operator (“<-”): Alt/Opt + -

Printing Objects

```{r assign to object and print using parentheses}
(my_object <- 1 + 2)
```
[1] 3

Operating on Objects

```{r double my object}
my_object <- 1 + 2

(my_object_doubled <- my_object*2)
```
[1] 6

Functions

```{r use object as argument in function}
my_object <- 1 + 2

sqrt(my_object)
```
[1] 1.732051
  • function_name(arg1 = val1, arg2 = val2, …)
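
A minimal sketch of that pattern, using base R's round() as the example function (round() is not part of the original slides). Arguments can be matched by name or by position:

```{r named vs positional arguments}
# round() has two arguments: x (the number) and digits (decimal places to keep)
rounded_named <- round(x = 3.14159, digits = 2)   # arguments matched by name
rounded_positional <- round(3.14159, 2)           # same call, matched by position

rounded_named                                     # 3.14
identical(rounded_named, rounded_positional)      # TRUE
```

Naming arguments takes more typing, but it makes the code easier to read later.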

Object Types

```{r numeric/double}
my_object <- 1 + 2

class(my_object)
```
[1] "numeric"


```{r character/strings}
my_object <- "apple"

class(my_object)
```
[1] "character"

Error!

```{r}
#| eval: false
my_object <- "apple"

sqrt(my_object)
```
  • Console:
    • Error in sqrt(my_object) : non-numeric argument to mathematical function
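
One way to avoid this error, sketched with base R only: check the object's type with is.numeric() first, and convert with as.numeric() when the text really holds a number. The object my_text is made up for this illustration.

```{r check type before sqrt}
my_text <- "9"                      # a number stored as text (character)

is.numeric(my_text)                 # FALSE: sqrt(my_text) would throw the same error

my_number <- as.numeric(my_text)    # convert the text "9" to the number 9
sqrt(my_number)                     # 3
```

Text that isn't a number (like "apple") can't be converted: as.numeric("apple") returns NA with a warning.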

What Do You Want From Me??

  • In the Console, type ?sqrt and press enter.
    • Usage: sqrt(x)
    • Arguments: x a numeric or complex vector or array.

Also accessible through “Help” in the Output pane!

Comments & Markdown

```{r title}
### you can comment an entire line
#### with any number of # symbols
my_object <- 1 + 2 # or, you can comment at the end of a line

sqrt(my_object)
```
[1] 1.732051

This is Markdown!

Good Habits for Comments/Markdown

  • Comment liberally!
  • Try to describe why, rather than what
  • Use comments to “hide” code rather than deleting it
```{r comments example}
#| output: false
my_object <- 1 + 2
# my_object <- "apple"

sqrt(as.numeric(my_object)) # convert my_object to numeric, then sqrt

sqrt(as.numeric(my_object)) # sqrt only accepts numeric objects
```

Dataframes (DFs)

  • Tabular data format
    • Rows and columns (variables) like a spreadsheet
    • Today, we’ll be working with the msleep dataframe (included in ggplot2, which is part of the tidyverse)
                        name      genus  vore sleep_total brainwt
1                    Cheetah   Acinonyx carni        12.1      NA
2                 Owl monkey      Aotus  omni        17.0 0.01550
3            Mountain beaver Aplodontia herbi        14.4      NA
4 Greater short-tailed shrew    Blarina  omni        14.9 0.00029
5                        Cow        Bos herbi         4.0 0.42300
6           Three-toed sloth   Bradypus herbi        14.4      NA


For more info, enter ?msleep into the console!

Subsetting

Base R:

  • Square brackets and/or $
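
A small sketch of both, on a hand-typed dataframe holding three of the msleep rows shown earlier (sleep_small is a made-up name for this illustration):

```{r base r subsetting}
sleep_small <- data.frame(
  name = c("Cheetah", "Owl monkey", "Cow"),
  vore = c("carni", "omni", "herbi"),
  sleep_total = c(12.1, 17.0, 4.0)
)

# $ pulls out a single column as a vector
sleep_small$sleep_total

# square brackets are [rows, columns]; leaving one side blank keeps everything
herbivores <- sleep_small[sleep_small$vore == "herbi", ]     # rows that meet a condition
names_and_sleep <- sleep_small[, c("name", "sleep_total")]   # columns by name
```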

Tidyverse:

  • filter():
    • filteR() = Rows
    • Keep only the rows that meet this condition
  • select():
    • seleCt() = Columns
    • Keep only the columns that meet this condition

Subsetting: Filter

# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Let’s say you only want herbivores:

  • Keep only rows where vore is equal to "herbi"

Subsetting: Filter

```{r filtering}
filter(msleep, vore == "herbi")
```
# A tibble: 32 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 2 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 3 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 4 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
 5 Goat   Capri herbi Arti… lc                   5.3       0.6      NA      18.7
 6 Guine… Cavis herbi Rode… domesticated         9.4       0.8       0.217  14.6
 7 Chinc… Chin… herbi Rode… domesticated        12.5       1.5       0.117  11.5
 8 Tree … Dend… herbi Hyra… lc                   5.3       0.5      NA      18.7
 9 Asian… Elep… herbi Prob… en                   3.9      NA        NA      20.1
10 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
# ℹ 22 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
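
A rough sketch of what filter() is doing here: the condition is checked row by row, and only rows where it is TRUE are kept. The tiny dataframe vore_small is made up for illustration (its values are borrowed from msleep), and the code assumes dplyr (part of the tidyverse) is installed.

```{r how filter decides}
library(dplyr)

vore_small <- tibble(
  name = c("Cheetah", "Cow", "Vesper mouse"),
  vore = c("carni", "herbi", NA)
)

# == compares every row to "herbi" and answers TRUE, FALSE, or NA (missing)
vore_small$vore == "herbi"

# filter() keeps only the TRUE rows; FALSE and NA rows are dropped
herbivores <- filter(vore_small, vore == "herbi")
herbivores
```

Note the double equals sign: == asks "is it equal?", while a single = is used to name arguments.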

Subsetting: Select

Let’s say you only want sleep-related variables:

  • Keep only columns that start with "sleep"

Subsetting: Select

```{r selecting}
select(msleep, name, vore, starts_with("sleep"))
```
# A tibble: 83 × 5
   name                       vore  sleep_total sleep_rem sleep_cycle
   <chr>                      <chr>       <dbl>     <dbl>       <dbl>
 1 Cheetah                    carni        12.1      NA        NA    
 2 Owl monkey                 omni         17         1.8      NA    
 3 Mountain beaver            herbi        14.4       2.4      NA    
 4 Greater short-tailed shrew omni         14.9       2.3       0.133
 5 Cow                        herbi         4         0.7       0.667
 6 Three-toed sloth           herbi        14.4       2.2       0.767
 7 Northern fur seal          carni         8.7       1.4       0.383
 8 Vesper mouse               <NA>          7        NA        NA    
 9 Dog                        carni        10.1       2.9       0.333
10 Roe deer                   herbi         3        NA        NA    
# ℹ 73 more rows

Naming Conventions

Pick one and stick with it!

Good Habits for Naming

  • sleep_total, sleep_rem, sleep_cycle
    • starts_with("sleep")
  • total_sleep, rem_sleep, sleep_cycle
    • contains("sleep")
  • total, rem, cycle
    • ???
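
To see why the naming scheme matters, here is a sketch with three made-up one-row dataframes, one per scheme in the list above (the names prefixed, mixed, and bare are hypothetical; assumes dplyr is installed):

```{r naming and select helpers}
library(dplyr)

prefixed <- tibble(name = "Cow", sleep_total = 4, sleep_rem = 0.7, sleep_cycle = 0.667)
mixed    <- tibble(name = "Cow", total_sleep = 4, rem_sleep = 0.7, sleep_cycle = 0.667)
bare     <- tibble(name = "Cow", total = 4, rem = 0.7, cycle = 0.667)

names(select(prefixed, starts_with("sleep")))  # all three sleep columns
names(select(mixed, starts_with("sleep")))     # only sleep_cycle
names(select(mixed, contains("sleep")))        # all three sleep columns
names(select(bare, contains("sleep")))         # nothing to match: zero columns
```

With a shared prefix, one helper grabs the whole family of variables. Without any shared pattern, each column has to be listed by hand.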

Subsetting: Filter AND Select?

Which option works (i.e., doesn’t throw an error)?

```{r filtering and selecting}
#| eval: false
# filter(msleep, vore == "herbi")
# select(msleep, name, vore, starts_with("sleep"))

### OPTION A:
select(filter(msleep, vore == "herbi"), name, vore, starts_with("sleep"))

### OPTION B:
filter(select(msleep, name, vore, starts_with("sleep")), vore == "herbi")

### OPTION C:
df_filtered <- filter(msleep, vore == "herbi")
select(df_filtered, name, vore, starts_with("sleep"))

### OPTION D:
# all of them work!
```

There Must Be A Better Way…

```{r without pipes}
#| eval: false
select(filter(msleep, vore == "herbi"), name, vore, starts_with("sleep"))
```


```{r without pipes but prettier}
#| eval: false
select(filter(msleep, vore == "herbi"),
       name, vore, starts_with("sleep"))
```

Pipe Notation

```{r with pipes}
#| eval: false
msleep %>% 
  filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))
```
  • The pipe (%>%) takes the result of the code before it (on the left) and sends that result into the code after it (on the right) as the first argument
  • To supply the result as a different argument (not the first), mark where it should go with . (period)
  • Insert Pipe: Ctrl/Cmd + Shift + M
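
A short sketch of both behaviours, using base R's round() as a stand-in function (not from the original slides). It assumes dplyr is installed, since loading it makes %>% available:

```{r pipe placeholder}
library(dplyr)  # loading dplyr (or the whole tidyverse) provides %>%

# by default, the left-hand side becomes the FIRST argument:
# this is the same as round(3.14159, digits = 2)
first_arg <- 3.14159 %>% round(digits = 2)

# with the . placeholder, the left-hand side goes wherever the . is:
# this is also round(3.14159, digits = 2), but now the 2 is what gets piped
other_arg <- 2 %>% round(3.14159, digits = .)
```

Both lines give 3.14; only the piece being piped differs.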

Why Pipes?

```{r}
msleep %>% 
  filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))
```
# A tibble: 32 × 5
   name             vore  sleep_total sleep_rem sleep_cycle
   <chr>            <chr>       <dbl>     <dbl>       <dbl>
 1 Mountain beaver  herbi        14.4       2.4      NA    
 2 Cow              herbi         4         0.7       0.667
 3 Three-toed sloth herbi        14.4       2.2       0.767
 4 Roe deer         herbi         3        NA        NA    
 5 Goat             herbi         5.3       0.6      NA    
 6 Guinea pig       herbi         9.4       0.8       0.217
 7 Chinchilla       herbi        12.5       1.5       0.117
 8 Tree hyrax       herbi         5.3       0.5      NA    
 9 Asian elephant   herbi         3.9      NA        NA    
10 Horse            herbi         2.9       0.6       1    
# ℹ 22 more rows

Why Pipes?

```{r}
msleep %>% 
  # filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))
```
# A tibble: 83 × 5
   name                       vore  sleep_total sleep_rem sleep_cycle
   <chr>                      <chr>       <dbl>     <dbl>       <dbl>
 1 Cheetah                    carni        12.1      NA        NA    
 2 Owl monkey                 omni         17         1.8      NA    
 3 Mountain beaver            herbi        14.4       2.4      NA    
 4 Greater short-tailed shrew omni         14.9       2.3       0.133
 5 Cow                        herbi         4         0.7       0.667
 6 Three-toed sloth           herbi        14.4       2.2       0.767
 7 Northern fur seal          carni         8.7       1.4       0.383
 8 Vesper mouse               <NA>          7        NA        NA    
 9 Dog                        carni        10.1       2.9       0.333
10 Roe deer                   herbi         3        NA        NA    
# ℹ 73 more rows

Pipe Quiz

Which option works (i.e., doesn’t throw an error)?

```{r pipe quiz}
#| eval: false

### OPTION A:
msleep %>% 
  select(filter(msleep, vore == "herbi"), vore, starts_with("sleep"))

### OPTION B:
filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep"))

### OPTION C:
msleep %>% 
  filter(vore == "herbi") %>% 
  select(name, vore, starts_with("sleep")) %>% 

### OPTION D:
# none of them work!
```

Mutating

Let’s say you want animals’ amount of non-REM sleep:

  • Make a new variable, subtracting sleep_rem from sleep_total

Mutating

```{r mutate}
msleep %>% 
  mutate(sleep_nrem = sleep_total - sleep_rem) %>% 
  select(name, starts_with("sleep"))
```
# A tibble: 83 × 5
   name                       sleep_total sleep_rem sleep_cycle sleep_nrem
   <chr>                            <dbl>     <dbl>       <dbl>      <dbl>
 1 Cheetah                           12.1      NA        NA           NA  
 2 Owl monkey                        17         1.8      NA           15.2
 3 Mountain beaver                   14.4       2.4      NA           12  
 4 Greater short-tailed shrew        14.9       2.3       0.133       12.6
 5 Cow                                4         0.7       0.667        3.3
 6 Three-toed sloth                  14.4       2.2       0.767       12.2
 7 Northern fur seal                  8.7       1.4       0.383        7.3
 8 Vesper mouse                       7        NA        NA           NA  
 9 Dog                               10.1       2.9       0.333        7.2
10 Roe deer                           3        NA        NA           NA  
# ℹ 73 more rows
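
The NA values in sleep_nrem come from missing inputs: arithmetic with NA returns NA. A small sketch with three hand-typed rows from the output above (sleep_small is a made-up name; assumes dplyr is installed):

```{r mutate with missing values}
library(dplyr)

sleep_small <- tibble(
  name = c("Cheetah", "Owl monkey", "Cow"),
  sleep_total = c(12.1, 17, 4),
  sleep_rem = c(NA, 1.8, 0.7)
)

with_nrem <- sleep_small %>% 
  mutate(sleep_nrem = sleep_total - sleep_rem)  # new column, computed row by row

with_nrem$sleep_nrem  # NA for Cheetah, because its sleep_rem is missing
```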

Plotting: Scatterplot

```{r scatterplot}
msleep %>% 
  ggplot(mapping = aes(x = sleep_total, y = sleep_rem),
         data = .) +  # '+' is specific to ggplot2, similar (but not equivalent) to the pipe!
  geom_point() 
```
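
One way to see how + differs from the pipe: + adds layers onto a plot object that already exists, so a plot can be saved partway and finished later. A sketch with a hand-typed three-row dataframe (values borrowed from msleep; sleep_small, base_plot, and scatter are made-up names; assumes ggplot2 is installed):

```{r building a plot in layers}
library(ggplot2)

sleep_small <- data.frame(
  sleep_total = c(17, 14.4, 4),
  sleep_rem = c(1.8, 2.4, 0.7)
)

# step 1: set up the data and axes; no points are drawn yet
base_plot <- ggplot(data = sleep_small, mapping = aes(x = sleep_total, y = sleep_rem))

# step 2: + adds a layer of points on top of the existing plot object
scatter <- base_plot + geom_point()
scatter
```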

Plotting: Boxplot

```{r boxplot}
msleep %>% 
  ggplot(aes(x = vore, y = sleep_total)) +
  geom_boxplot() 
```

Output

Render! (Knit in .Rmd)

Render/Knit at the top of the Source pane

RData and the Environment

  • RStudio offers to automatically save and reload your environment
  • Highly recommended: do NOT save/reload environments!
    • The script should recreate the environment on its own!
  • Instead, specific objects can be saved and reloaded manually
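
One common way to store and reload a single object by hand is base R's saveRDS() and readRDS(), sketched below. The slides don't say which functions were used in the workshop, so treat this as one option among several. The file goes to a temporary location here so nothing is left behind.

```{r save and reload an object}
my_object <- 1 + 2

# write the object to a file (a temporary one here; normally a path in your project)
rds_path <- tempfile(fileext = ".rds")
saveRDS(my_object, file = rds_path)

# later (or in another session), read it back in under any name you like
my_object_reloaded <- readRDS(rds_path)
identical(my_object, my_object_reloaded)  # TRUE
```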

Packages I Can’t Live Without

Resources

  • Feel free to email me out of the blue! ryeung@research.baycrest.org