Cheese Data for Teaching Data Entry and Transformation – Emma W. Marshall’s personal site

I wanted to create a practice assignment for students learning data entry and transformation in SPSS and JAMOVI. The tidytuesday package featured a fun dataset in June 2024 with characteristics of different cheeses from cheese.com, which seemed like a good candidate—it has a mix of numeric and categorical variables, some messy formatting to clean up, and it’s about cheese.

Loading the Data

library(tidyverse)
library(here)
library(tinytable)

cheeses <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-06-04/cheeses.csv')

glimpse(cheeses)

Rows: 1,187
Columns: 19
$ cheese          <chr> "Aarewasser", "Abbaye de Belloc", "Abbaye de Belval", …
$ url             <chr> "https://www.cheese.com/aarewasser/", "https://www.che…
$ milk            <chr> "cow", "sheep", "cow", "cow", "cow", "cow", "cow", "co…
$ country         <chr> "Switzerland", "France", "France", "France", "France",…
$ region          <chr> NA, "Pays Basque", NA, "Burgundy", "Savoie", "province…
$ family          <chr> NA, NA, NA, NA, NA, NA, NA, "Cheddar", NA, NA, NA, NA,…
$ type            <chr> "semi-soft", "semi-hard, artisan", "semi-hard", "semi-…
$ fat_content     <chr> NA, NA, "40-46%", NA, NA, NA, "50%", NA, "45%", NA, NA…
$ calcium_content <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ texture         <chr> "buttery", "creamy, dense, firm", "elastic", "creamy, …
$ rind            <chr> "washed", "natural", "washed", "washed", "washed", "wa…
$ color           <chr> "yellow", "yellow", "ivory", "white", "white", "pale y…
$ flavor          <chr> "sweet", "burnt caramel", NA, "acidic, milky, smooth",…
$ aroma           <chr> "buttery", "lanoline", "aromatic", "barnyardy, earthy"…
$ vegetarian      <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ vegan           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ synonyms        <chr> NA, "Abbaye Notre-Dame de Belloc", NA, NA, NA, NA, NA,…
$ alt_spellings   <chr> NA, NA, NA, NA, "Tamié, Trappiste de Tamie, Abbey of T…
$ producers       <chr> "Jumi", NA, NA, NA, NA, "Abbaye Cistercienne NOTRE-DAM…

Filtering for Complete Cases

For the assignment, I only want cheeses that have values for at least one of fat_content or calcium_content. First, a quick count of how many that gives us:

cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  summarise(count = n())

# A tibble: 1 × 1
  count
  <int>
1   249

Then I filter down to just those rows and select the columns I care about:

filtered_cheeses <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  select(cheese, url, milk, country, family, type, vegetarian, color, fat_content, calcium_content)

glimpse(filtered_cheeses)

Rows: 249
Columns: 10
$ cheese          <chr> "Abbaye de Belval", "Abbaye du Mont des Cats", "Aberta…
$ url             <chr> "https://www.cheese.com/abbaye-de-belval/", "https://w…
$ milk            <chr> "cow", "cow", "sheep", "sheep", "cow", "cow", "cow", "…
$ country         <chr> "France", "France", "Czech Republic", "United Kingdom"…
$ family          <chr> NA, NA, NA, NA, "Blue", "Cheddar", "Cheddar", NA, NA, …
$ type            <chr> "semi-hard", "semi-soft, artisan, brined", "hard, arti…
$ vegetarian      <lgl> FALSE, FALSE, FALSE, TRUE, NA, NA, NA, FALSE, FALSE, F…
$ color           <chr> "ivory", "pale yellow", "pale yellow", NA, "blue", "iv…
$ fat_content     <chr> "40-46%", "50%", "45%", "52%", "50%", "12%", "14%", "5…
$ calcium_content <chr> NA, NA, NA, NA, NA, NA, NA, "26 mg/100g", NA, NA, NA, …

Cleaning the Numeric Columns

The fat_content column is stored as character because some values have percentage signs or ranges. This converts everything to numeric, taking the first value when there’s a range:

cheese_data <- filtered_cheeses |>
  mutate(
    fat_content = case_when(
      is.na(fat_content) ~ NA_real_,
      str_detect(fat_content, "%") ~ as.numeric(str_remove(fat_content, "%")),
      str_detect(fat_content, "-") ~ as.numeric(str_extract(fat_content, "^[0-9.]+")),
      TRUE ~ as.numeric(fat_content)
    )
  )

Same issue with calcium_content—it includes units like “mg/100g” that need to be stripped out. I’m also filtering to only rows where calcium content exists, since that’s what I need for the assignment:

cheese_data <- cheese_data |>
  filter(!is.na(calcium_content)) |>
  mutate(
    calcium_content = case_when(
      is.na(calcium_content) ~ NA_real_,
      str_detect(calcium_content, "mg/100g") ~ as.numeric(str_remove(calcium_content, " mg/100g")),
      TRUE ~ as.numeric(calcium_content)
    )
  )

Creating a Milk Source Variable

The milk column sometimes lists multiple animals separated by commas. I want a variable that indicates whether the cheese uses milk from a single source or multiple sources:

cheese_data <- cheese_data |>
  mutate(
    id = row_number(),
    milk_source = case_when(
      str_detect(milk, ",") ~ "multiple",
      TRUE ~ str_trim(milk)
    )
  )

Quick check of the distribution:

cheese_data |>
  count(milk_source) |>
  mutate(proportion = n / sum(n))

# A tibble: 5 × 3
  milk_source       n proportion
  <chr>         <int>      <dbl>
1 cow              17       0.68
2 multiple          5       0.2 
3 sheep             1       0.04
4 water buffalo     1       0.04
5 <NA>              1       0.04

Final Dataset

Here’s what the cleaned data looks like. The url column links to each cheese’s page on cheese.com:

cheese_data |>
  mutate(
    url = sprintf('<a href="%s" target="_blank">%s</a>', url, cheese)
  ) |>
  select(id, cheese, url, milk_source, country, fat_content, calcium_content) |>
  tt() |>
  format_tt(escape = FALSE)

id	cheese	url	milk_source	country	fat_content	calcium_content
1	Affidelice au Chablis	Affidelice au Chablis	cow	France	55.0	26
2	Amul Emmental	Amul Emmental	cow	India	46.0	488
3	Amul Gouda	Amul Gouda	cow	India	46.0	492
4	Amul Pizza Mozzarella Cheese	Amul Pizza Mozzarella Cheese	cow	India	NA	492
5	Amul Processed Cheese	Amul Processed Cheese	multiple	India	26.0	343
6	Anthotyro	Anthotyro	multiple	Greece	30.0	318
7	Basils Original Rauchkäse	Basils Original Rauchkäse	cow	Germany	25.5	700
8	Bavaria blu	Bavaria blu	cow	Germany	NA	450
9	Bianco	Bianco	cow	Germany	NA	725
10	Bonifaz	Bonifaz	cow	Germany	NA	430
11	Breakfast Cheese	Breakfast Cheese	cow	United States	NA	90
12	Brebis du Lavort	Brebis du Lavort	sheep	France	NA	1050
13	Brunost	Brunost	multiple	Denmark, Finland, Germany, Iceland, Norway, Sweden	NA	360
14	Castelmagno	Castelmagno	multiple	Italy	NA	4768
15	Limburger	Limburger	cow	Belgium, Germany, Netherlands	42.0	497
16	Paneer	Paneer	multiple	Bangladesh, India	NA	208
17	Petida	Petida	cow	Germany	55.0	190
18	President Fat Free Feta	President Fat Free Feta	cow	France, United States	NA	30
19	Prima Donna fino	Prima Donna fino	cow	Netherlands	NA	921
20	Prima Donna forte	Prima Donna forte	NA	Netherlands	NA	990
21	Prima Donna leggero	Prima Donna leggero	cow	Netherlands	NA	1071
22	Prima Donna maturo	Prima Donna maturo	cow	Netherlands	NA	749
23	Provoleta	Provoleta	water buffalo	Argentina	45.0	316
24	Provolone del Monaco	Provolone del Monaco	cow	Italy	40.5	157
25	Seriously Strong Cheddar	Seriously Strong Cheddar	cow	England, Scotland, United Kingdom	34.4	740

Exporting for Making Student Materials

library(haven)

write_sav(cheese_data, here("posts", "cheese", "data", "cheese_data.sav"))
write_csv(cheese_data, here("posts", "cheese", "data", "cheese_data.csv"))

Reuse

CC BY 4.0