Cheese Data for Teaching Data Entry and Transformation

R
teaching
data
Using the cheese.com dataset from TidyTuesday to create practice exercises for SPSS and JAMOVI.
Published

June 15, 2024

I wanted to create a practice assignment for students learning data entry and transformation in SPSS and JAMOVI. The tidytuesday package featured a fun dataset in June 2024 with characteristics of different cheeses from cheese.com, which seemed like a good candidate—it has a mix of numeric and categorical variables, some messy formatting to clean up, and it’s about cheese.

Loading the Data

library(tidyverse)
library(here)
library(tinytable)

cheeses <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-06-04/cheeses.csv')

glimpse(cheeses)
Rows: 1,187
Columns: 19
$ cheese          <chr> "Aarewasser", "Abbaye de Belloc", "Abbaye de Belval", …
$ url             <chr> "https://www.cheese.com/aarewasser/", "https://www.che…
$ milk            <chr> "cow", "sheep", "cow", "cow", "cow", "cow", "cow", "co…
$ country         <chr> "Switzerland", "France", "France", "France", "France",…
$ region          <chr> NA, "Pays Basque", NA, "Burgundy", "Savoie", "province…
$ family          <chr> NA, NA, NA, NA, NA, NA, NA, "Cheddar", NA, NA, NA, NA,…
$ type            <chr> "semi-soft", "semi-hard, artisan", "semi-hard", "semi-…
$ fat_content     <chr> NA, NA, "40-46%", NA, NA, NA, "50%", NA, "45%", NA, NA…
$ calcium_content <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ texture         <chr> "buttery", "creamy, dense, firm", "elastic", "creamy, …
$ rind            <chr> "washed", "natural", "washed", "washed", "washed", "wa…
$ color           <chr> "yellow", "yellow", "ivory", "white", "white", "pale y…
$ flavor          <chr> "sweet", "burnt caramel", NA, "acidic, milky, smooth",…
$ aroma           <chr> "buttery", "lanoline", "aromatic", "barnyardy, earthy"…
$ vegetarian      <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ vegan           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ synonyms        <chr> NA, "Abbaye Notre-Dame de Belloc", NA, NA, NA, NA, NA,…
$ alt_spellings   <chr> NA, NA, NA, NA, "Tamié, Trappiste de Tamie, Abbey of T…
$ producers       <chr> "Jumi", NA, NA, NA, NA, "Abbaye Cistercienne NOTRE-DAM…

Filtering for Complete Cases

For the assignment, I only want cheeses that have values for at least one of fat_content or calcium_content. First, a quick count of how many that gives us:

cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  summarise(count = n())
# A tibble: 1 × 1
  count
  <int>
1   249

Then I filter down to just those rows and select the columns I care about:

filtered_cheeses <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  select(cheese, url, milk, country, family, type, vegetarian, color, fat_content, calcium_content)

glimpse(filtered_cheeses)
Rows: 249
Columns: 10
$ cheese          <chr> "Abbaye de Belval", "Abbaye du Mont des Cats", "Aberta…
$ url             <chr> "https://www.cheese.com/abbaye-de-belval/", "https://w…
$ milk            <chr> "cow", "cow", "sheep", "sheep", "cow", "cow", "cow", "…
$ country         <chr> "France", "France", "Czech Republic", "United Kingdom"…
$ family          <chr> NA, NA, NA, NA, "Blue", "Cheddar", "Cheddar", NA, NA, …
$ type            <chr> "semi-hard", "semi-soft, artisan, brined", "hard, arti…
$ vegetarian      <lgl> FALSE, FALSE, FALSE, TRUE, NA, NA, NA, FALSE, FALSE, F…
$ color           <chr> "ivory", "pale yellow", "pale yellow", NA, "blue", "iv…
$ fat_content     <chr> "40-46%", "50%", "45%", "52%", "50%", "12%", "14%", "5…
$ calcium_content <chr> NA, NA, NA, NA, NA, NA, NA, "26 mg/100g", NA, NA, NA, …

Cleaning the Numeric Columns

The fat_content column is stored as character because some values have percentage signs or ranges. This converts everything to numeric, taking the first value when there’s a range:

cheese_data <- filtered_cheeses |>
  mutate(
    fat_content = case_when(
      is.na(fat_content) ~ NA_real_,
      str_detect(fat_content, "%") ~ as.numeric(str_remove(fat_content, "%")),
      str_detect(fat_content, "-") ~ as.numeric(str_extract(fat_content, "^[0-9.]+")),
      TRUE ~ as.numeric(fat_content)
    )
  )

Same issue with calcium_content—it includes units like “mg/100g” that need to be stripped out. I’m also filtering to only rows where calcium content exists, since that’s what I need for the assignment:

cheese_data <- cheese_data |>
  filter(!is.na(calcium_content)) |>
  mutate(
    calcium_content = case_when(
      is.na(calcium_content) ~ NA_real_,
      str_detect(calcium_content, "mg/100g") ~ as.numeric(str_remove(calcium_content, " mg/100g")),
      TRUE ~ as.numeric(calcium_content)
    )
  )

Creating a Milk Source Variable

The milk column sometimes lists multiple animals separated by commas. I want a variable that indicates whether the cheese uses milk from a single source or multiple sources:

cheese_data <- cheese_data |>
  mutate(
    id = row_number(),
    milk_source = case_when(
      str_detect(milk, ",") ~ "multiple",
      TRUE ~ str_trim(milk)
    )
  )

Quick check of the distribution:

cheese_data |>
  count(milk_source) |>
  mutate(proportion = n / sum(n))
# A tibble: 5 × 3
  milk_source       n proportion
  <chr>         <int>      <dbl>
1 cow              17       0.68
2 multiple          5       0.2 
3 sheep             1       0.04
4 water buffalo     1       0.04
5 <NA>              1       0.04

Final Dataset

Here’s what the cleaned data looks like. The url column links to each cheese’s page on cheese.com:

cheese_data |>
  mutate(
    url = sprintf('<a href="%s" target="_blank">%s</a>', url, cheese)
  ) |>
  select(id, cheese, url, milk_source, country, fat_content, calcium_content) |>
  tt() |>
  format_tt(escape = FALSE)
id cheese url milk_source country fat_content calcium_content
1 Affidelice au Chablis Affidelice au Chablis cow France 55.0 26
2 Amul Emmental Amul Emmental cow India 46.0 488
3 Amul Gouda Amul Gouda cow India 46.0 492
4 Amul Pizza Mozzarella Cheese Amul Pizza Mozzarella Cheese cow India NA 492
5 Amul Processed Cheese Amul Processed Cheese multiple India 26.0 343
6 Anthotyro Anthotyro multiple Greece 30.0 318
7 Basils Original Rauchkäse Basils Original Rauchkäse cow Germany 25.5 700
8 Bavaria blu Bavaria blu cow Germany NA 450
9 Bianco Bianco cow Germany NA 725
10 Bonifaz Bonifaz cow Germany NA 430
11 Breakfast Cheese Breakfast Cheese cow United States NA 90
12 Brebis du Lavort Brebis du Lavort sheep France NA 1050
13 Brunost Brunost multiple Denmark, Finland, Germany, Iceland, Norway, Sweden NA 360
14 Castelmagno Castelmagno multiple Italy NA 4768
15 Limburger Limburger cow Belgium, Germany, Netherlands 42.0 497
16 Paneer Paneer multiple Bangladesh, India NA 208
17 Petida Petida cow Germany 55.0 190
18 President Fat Free Feta President Fat Free Feta cow France, United States NA 30
19 Prima Donna fino Prima Donna fino cow Netherlands NA 921
20 Prima Donna forte Prima Donna forte NA Netherlands NA 990
21 Prima Donna leggero Prima Donna leggero cow Netherlands NA 1071
22 Prima Donna maturo Prima Donna maturo cow Netherlands NA 749
23 Provoleta Provoleta water buffalo Argentina 45.0 316
24 Provolone del Monaco Provolone del Monaco cow Italy 40.5 157
25 Seriously Strong Cheddar Seriously Strong Cheddar cow England, Scotland, United Kingdom 34.4 740

Exporting for Making Student Materials

library(haven)

write_sav(cheese_data, here("posts", "cheese", "data", "cheese_data.sav"))
write_csv(cheese_data, here("posts", "cheese", "data", "cheese_data.csv"))
Back to top