Surviving #TidyTuesday

By Jesse Mostipak

May 31, 2021

The hardest part about Survivor

This week’s #TidyTuesday dataset focused on the reality show Survivor. I didn’t know much about the show when the data dropped, so I spent a couple of days watching the first half of Season 7 to get acclimated. At this point I’m convinced that the hardest part of Survivor isn’t being in a remote location, or doing physical challenges, or even having to hunt your own food – it’s quickly and efficiently figuring out the ever-shifting political landscape of the strangers you find yourself with.

If you’re interested in the full two-hour stream, it’s up on Twitch!

Digging into the data

Because I’m focused on creating beginner-level content, I had to make some choices in how to approach this data. I define beginner-level content as something that someone new-ish to R could look at, recognize from a {ggplot2} reference (such as the ggplot2 cheat sheet or the R for Data Science text), and feel confident that they could recreate themselves.

The GitHub repo includes several datasets, which presents the opportunity to go over joins while also creating a richer dataset. However, the initial summary dataset needs some relatively straightforward wrangling steps that can still be tricky for beginners, and in the end this is where I chose to focus, since we could look at:

  • converting all character data into factors
  • using pivot_longer()
  • creating time intervals using {lubridate}

So without further ado, I’ve got two videos and the associated code for you below. The first video goes through the wrangling steps, and the second video picks up with the scatterplot we created.

Wrangling the Survivor summary dataset

Video walkthrough:

And the code:
Set up our environment:

library(tidyverse)
library(lubridate)

Import our data:

summary <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/summary.csv')

Explore wrangling options:

# converting characters to factors
summary %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  glimpse()
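If the across(where(...)) pattern feels a bit magical, here is a minimal sketch on a tiny, made-up tibble (the toy and toy_factors names and every value in them are my own invention for illustration, not part of the Survivor data). It only touches the character columns and leaves the numeric one alone:

```r
library(dplyr)
library(tibble)

# Made-up toy data for illustration only (not the real summary dataset)
toy <- tibble(
  season   = c(1, 2, 3),
  country  = c("Fiji", "Samoa", "Fiji"),
  timeslot = c("Wednesday", "Thursday", "Thursday")
)

toy_factors <- toy %>%
  # where(is.character) selects only the character columns;
  # across() then applies as.factor() to each of them
  mutate(across(where(is.character), as.factor))

# season is still a double; country and timeslot are now factors
glimpse(toy_factors)

# each factor keeps one level per distinct value
levels(toy_factors$timeslot)
```

Swapping where(is.character) for a different predicate, such as where(is.numeric), is an easy way to experiment with which columns get picked up.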

# calculating intervals: days filmed, days aired
summary %>% 
  mutate(days_filming = time_length(interval(filming_started, filming_ended), unit = "day")) %>% 
  mutate(days_aired = time_length(interval(premiered, ended), unit = "day")) %>% 
  select(-premiered:-filming_ended) %>% 
  glimpse()
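To see what interval() and time_length() are doing without the rest of the pipeline, here is a small sketch using two hypothetical dates that I picked purely for illustration (they are not taken from the dataset). It also shows plain Date subtraction as a cross-check, which is an alternative I am adding here rather than something from the stream:

```r
library(lubridate)

# Hypothetical dates, chosen only to demonstrate the calculation
start_date <- ymd("2021-03-01")
end_date   <- ymd("2021-04-09")

# interval() stores the span between two dates;
# time_length() converts that span into the unit we ask for
span <- interval(start_date, end_date)
days_between <- time_length(span, unit = "day")
days_between

# Cross-check: subtracting two Dates gives a difftime,
# which we can convert to a plain number of days
days_check <- as.numeric(end_date - start_date, units = "days")
days_check
```

Both approaches should agree for simple date columns like ours; the interval version is handy because changing unit to "week" or "month" needs no other changes.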

# pivoting data - combining metrics for show type
summary %>% 
  pivot_longer(
    cols = viewers_premier:viewers_reunion,
    names_to = "show_type",
    names_prefix = "viewers_",
    values_to = "total_views",
    values_drop_na = TRUE
  ) %>% 
  glimpse()
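Pivoting is easier to follow on something small enough to read in one glance. This sketch uses a made-up two-row tibble (toy, toy_long, and all of the numbers are invented for illustration) with the same column naming pattern as the summary data, including one missing value so we can see values_drop_na at work:

```r
library(tidyr)
library(tibble)

# Made-up toy data for illustration only
toy <- tibble(
  season          = c(1, 2),
  viewers_premier = c(10, 12),
  viewers_finale  = c(15, 14),
  viewers_reunion = c(8, NA)
)

toy_long <- toy %>%
  pivot_longer(
    cols = viewers_premier:viewers_reunion, # columns to stack
    names_to = "show_type",                 # old column names go here
    names_prefix = "viewers_",              # strip this from those names
    values_to = "total_views",              # old cell values go here
    values_drop_na = TRUE                   # drop the row for the missing reunion
  )

# two seasons x three show types, minus one dropped NA row
toy_long
```

Setting values_drop_na = FALSE and re-running is a quick way to see the row that would otherwise be kept with an NA in total_views.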

Our wrangling works, so let’s go ahead and remove the glimpse() calls, put all of the steps together, and assign everything to a new variable, summary_tidy:

summary_tidy <- summary %>% 
  # convert character cols to factors
  mutate(across(where(is.character), as.factor)) %>% 
  # calculate days filmed
  mutate(days_filming = time_length(interval(filming_started, filming_ended), unit = "day")) %>% 
  # calculate days aired
  mutate(days_aired = time_length(interval(premiered, ended), unit = "day")) %>% 
  # remove our four original date columns
  select(-premiered:-filming_ended) %>% 
  # pivot dataset
  pivot_longer(
    cols = viewers_premier:viewers_reunion,
    names_to = "show_type",
    names_prefix = "viewers_",
    values_to = "total_views",
    values_drop_na = TRUE
  ) 

# let's not forget to check our work!
glimpse(summary_tidy)

Creating a Survivor scatterplot

Video walkthrough:

And the code:
This code builds directly off of our wrangling code, so be sure to run the above code before running this chunk, which creates our scatterplot:

summary_tidy %>% 
  filter(show_type %in% c("premier", "finale")) %>% 
  ggplot(aes(x = viewers_mean, y = total_views)) +
  geom_point(aes(color = show_type))

Scatterplot showing the average viewers by total views for the show Survivor, broken out by show type (finale or premiere). We see a positive linear relationship for both show types, although there are outliers present for the finale and premiere starting at around 25 average viewers (scale of 25 is unknown - could be millions?)

Next steps

There are a multitude of directions you can go with our resulting scatterplot, depending on what you’d like to explore in R. My suggestions are to try:

  • adding in a title and updating the axis labels
  • changing the theme of the plot
  • adding a regression line for each show type
  • doing a deep dive into the outliers we see at the far right of the graph – what makes those points unique? (This would also be a great starting point for additional visualizations as well as a blog post!)
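As a starting point for the first three suggestions, here is one possible sketch. It runs on a small made-up tibble (toy_tidy and its values are invented so the chunk works on its own), and the title, axis labels, and theme are just my guesses at reasonable choices. I have left units off the labels on purpose, since the scale of the viewer numbers is not confirmed here:

```r
library(ggplot2)
library(tibble)

# Made-up stand-in for summary_tidy, for illustration only
toy_tidy <- tibble(
  viewers_mean = c(10, 12, 15, 18, 10, 12, 15, 18),
  total_views  = c(11, 14, 16, 21, 13, 15, 19, 24),
  show_type    = rep(c("premier", "finale"), each = 4)
)

p <- ggplot(toy_tidy, aes(x = viewers_mean, y = total_views, color = show_type)) +
  geom_point() +
  # one straight-line fit per show type, without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Survivor viewership: premier vs. finale",
    x = "Average viewers across the season",
    y = "Viewers for the episode",
    color = "Show type"
  ) +
  theme_minimal()

p
```

To try this on the real data, replace toy_tidy with the filtered summary_tidy from the scatterplot chunk. Moving color into the top-level aes() is what lets geom_smooth() draw a separate line for each show type.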

Photo by Karim MANJRA on Unsplash
