Cheese

The goal is to create an assignment for practicing data entry and transformation tasks in SPSS and Jamovi, using data from cheese.com.

The TidyTuesday project featured a fun data set in June 2024 that pulled characteristics of the different types of cheese listed on cheese.com.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cheeses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-06-04/cheeses.csv')
Rows: 1187 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): cheese, url, milk, country, region, family, type, fat_content, cal...
lgl  (2): vegetarian, vegan

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(cheeses)
Rows: 1,187
Columns: 19
$ cheese          <chr> "Aarewasser", "Abbaye de Belloc", "Abbaye de Belval", …
$ url             <chr> "https://www.cheese.com/aarewasser/", "https://www.che…
$ milk            <chr> "cow", "sheep", "cow", "cow", "cow", "cow", "cow", "co…
$ country         <chr> "Switzerland", "France", "France", "France", "France",…
$ region          <chr> NA, "Pays Basque", NA, "Burgundy", "Savoie", "province…
$ family          <chr> NA, NA, NA, NA, NA, NA, NA, "Cheddar", NA, NA, NA, NA,…
$ type            <chr> "semi-soft", "semi-hard, artisan", "semi-hard", "semi-…
$ fat_content     <chr> NA, NA, "40-46%", NA, NA, NA, "50%", NA, "45%", NA, NA…
$ calcium_content <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ texture         <chr> "buttery", "creamy, dense, firm", "elastic", "creamy, …
$ rind            <chr> "washed", "natural", "washed", "washed", "washed", "wa…
$ color           <chr> "yellow", "yellow", "ivory", "white", "white", "pale y…
$ flavor          <chr> "sweet", "burnt caramel", NA, "acidic, milky, smooth",…
$ aroma           <chr> "buttery", "lanoline", "aromatic", "barnyardy, earthy"…
$ vegetarian      <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ vegan           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ synonyms        <chr> NA, "Abbaye Notre-Dame de Belloc", NA, NA, NA, NA, NA,…
$ alt_spellings   <chr> NA, NA, NA, NA, "Tamié, Trappiste de Tamie, Abbey of T…
$ producers       <chr> "Jumi", NA, NA, NA, NA, "Abbaye Cistercienne NOTRE-DAM…

To count the cheeses that have a value for fat_content or calcium_content, and then create a new data frame that keeps only the rows where at least one of those columns is not missing, you can use the dplyr package in R. The steps are below.

Step 1: Load the necessary library

Ensure you have the dplyr package loaded:

library(dplyr)

Step 2: Count the number of items with a non-missing value in fat_content or calcium_content

You can use the filter() function to keep the rows with non-missing values and then count them using summarize():

# Assuming your data frame is named 'cheeses'
count_non_missing <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content) ) |>
  summarize(count = n())

# Print the count
print(count_non_missing)
# A tibble: 1 × 1
  count
  <int>
1   249
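
The count above pools the two columns. If separate counts per column are also of interest, one possible sketch is shown below. The `toy` tibble and the `n_*` column names are made up for illustration and stand in for `cheeses`, so the numbers say nothing about the real data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `cheeses` (hypothetical rows, for illustration only)
toy <- tibble(
  fat_content     = c("45%", NA, "50%", NA),
  calcium_content = c(NA, NA, "26 mg/100g", "488 mg/100g")
)

# sum() over a logical vector counts the TRUE values
missing_summary <- toy |>
  summarize(
    n_fat     = sum(!is.na(fat_content)),
    n_calcium = sum(!is.na(calcium_content)),
    n_either  = sum(!is.na(fat_content) | !is.na(calcium_content)),
    n_both    = sum(!is.na(fat_content) & !is.na(calcium_content))
  )

print(missing_summary)
```

Here `n_either` corresponds to the filter condition used in this step, while `n_both` would be the count if the condition used `&` instead of `|`.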

Step 3: Create a new data frame with items that are not missing for at least one of those columns

You can use the same filter() function to create a new data frame:

# Create a new data frame with non-missing values for at least one of the specified columns
filtered_cheeses <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content))

# Print the new data frame
print(filtered_cheeses)
# A tibble: 249 × 19
   cheese    url   milk  country region family type  fat_content calcium_content
   <chr>     <chr> <chr> <chr>   <chr>  <chr>  <chr> <chr>       <chr>          
 1 Abbaye d… http… cow   France  <NA>   <NA>   semi… 40-46%      <NA>           
 2 Abbaye d… http… cow   France  Nord-… <NA>   semi… 50%         <NA>           
 3 Abertam   http… sheep Czech … Karlo… <NA>   hard… 45%         <NA>           
 4 Acorn     http… sheep United… Betha… <NA>   hard… 52%         <NA>           
 5 Adelost   http… cow   Sweden  <NA>   Blue   semi… 50%         <NA>           
 6 ADL Bric… http… cow   Canada  Princ… Chedd… semi… 12%         <NA>           
 7 ADL Mild… http… cow   Canada  Princ… Chedd… semi… 14%         <NA>           
 8 Affideli… http… cow   France  Burgu… <NA>   soft  55%         26 mg/100g     
 9 Aisy Cen… http… cow   France  Burgu… <NA>   semi… 50%         <NA>           
10 Allgauer… http… cow   Germany Swabia <NA>   hard  45%         <NA>           
# ℹ 239 more rows
# ℹ 10 more variables: texture <chr>, rind <chr>, color <chr>, flavor <chr>,
#   aroma <chr>, vegetarian <lgl>, vegan <lgl>, synonyms <chr>,
#   alt_spellings <chr>, producers <chr>

Explanation of the Code

Filtering: The filter(!is.na(fat_content) | !is.na(calcium_content)) checks for rows where at least one of the two columns (fat_content or calcium_content) is not NA.
Counting: summarize(count = n()) counts the number of rows that meet the filter condition.
New Data Frame: The result of the filter is stored in filtered_cheeses, which only includes the rows where at least one of the specified columns has a value.

Complete Example

library(dplyr)
library(tidyverse)

# Count the number of items with non-missing values for fat_content and calcium_content
count_non_missing <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  summarize(count = n())

print(count_non_missing)
# A tibble: 1 × 1
  count
  <int>
1   249
# Create a new data frame with items that are not missing for at least one of those columns
filtered_cheeses <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |> 
  # limit to columns we are interested in
  select(c(cheese, url, milk, country, family, type, vegetarian, color, fat_content, calcium_content))


print(filtered_cheeses)
# A tibble: 249 × 10
   cheese          url   milk  country family type  vegetarian color fat_content
   <chr>           <chr> <chr> <chr>   <chr>  <chr> <lgl>      <chr> <chr>      
 1 Abbaye de Belv… http… cow   France  <NA>   semi… FALSE      ivory 40-46%     
 2 Abbaye du Mont… http… cow   France  <NA>   semi… FALSE      pale… 50%        
 3 Abertam         http… sheep Czech … <NA>   hard… FALSE      pale… 45%        
 4 Acorn           http… sheep United… <NA>   hard… TRUE       <NA>  52%        
 5 Adelost         http… cow   Sweden  Blue   semi… NA         blue  50%        
 6 ADL Brick Chee… http… cow   Canada  Chedd… semi… NA         ivory 12%        
 7 ADL Mild Chedd… http… cow   Canada  Chedd… semi… NA         yell… 14%        
 8 Affidelice au … http… cow   France  <NA>   soft  FALSE      oran… 55%        
 9 Aisy Cendre     http… cow   France  <NA>   semi… FALSE      white 50%        
10 Allgauer Emmen… http… cow   Germany <NA>   hard  FALSE      yell… 45%        
# ℹ 239 more rows
# ℹ 1 more variable: calcium_content <chr>

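To convert the fat_content column from character to numeric, you can use the as.numeric() function along with some additional handling for the different formats in the data. Note that in the code below a range such as "40-46%" is caught by the "%" branch first, so it is coerced to NA rather than parsed; this is one source of the coercion warnings in the output.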

# Convert fat_content to numeric
cheese_data <- filtered_cheeses |> 
  mutate(
    fat_content = case_when(
    is.na(fat_content) ~ as.numeric(NA),
    grepl("%", fat_content) ~ as.numeric(gsub("%", "", fat_content)),
    grepl("-", fat_content) ~ as.numeric(strsplit(fat_content, "-")[[1]][1]),
    TRUE ~ as.numeric(fat_content)
  ))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `fat_content = case_when(...)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# Print the updated data frame
print(cheese_data)
# A tibble: 249 × 10
   cheese          url   milk  country family type  vegetarian color fat_content
   <chr>           <chr> <chr> <chr>   <chr>  <chr> <lgl>      <chr>       <dbl>
 1 Abbaye de Belv… http… cow   France  <NA>   semi… FALSE      ivory          NA
 2 Abbaye du Mont… http… cow   France  <NA>   semi… FALSE      pale…          50
 3 Abertam         http… sheep Czech … <NA>   hard… FALSE      pale…          45
 4 Acorn           http… sheep United… <NA>   hard… TRUE       <NA>           52
 5 Adelost         http… cow   Sweden  Blue   semi… NA         blue           50
 6 ADL Brick Chee… http… cow   Canada  Chedd… semi… NA         ivory          12
 7 ADL Mild Chedd… http… cow   Canada  Chedd… semi… NA         yell…          14
 8 Affidelice au … http… cow   France  <NA>   soft  FALSE      oran…          55
 9 Aisy Cendre     http… cow   France  <NA>   semi… FALSE      white          50
10 Allgauer Emmen… http… cow   Germany <NA>   hard  FALSE      yell…          45
# ℹ 239 more rows
# ℹ 1 more variable: calcium_content <chr>
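
In the output above, the first row ("40-46%") ends up as NA. If ranges should be kept, one option is to extract every number from the string and average them. The sketch below is only an illustration: `parse_fat` is a hypothetical helper, using the midpoint of a range is an assumption about what the assignment needs, and the helper ignores units, so values recorded in anything other than percent would need separate handling.

```r
library(stringr)

# Hypothetical helper: turn strings such as "45%" or "40-46%" into numbers.
# Ranges are replaced by their midpoint (an assumption, not the original method).
parse_fat <- function(x) {
  # Extract every number (integer or decimal) found in each string
  nums <- str_extract_all(x, "[0-9]+\\.?[0-9]*")
  vapply(nums, function(n) {
    # No number found, or the input was NA: return a missing value
    if (length(n) == 0 || all(is.na(n))) NA_real_ else mean(as.numeric(n))
  }, numeric(1))
}

fat_example <- parse_fat(c("45%", "40-46%", NA, "25.5%"))
print(fat_example)
```

Because the helper is vectorized over the whole column, it could be called inside mutate() directly, without case_when().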

Next, deal with calcium_content. Start by keeping only the cheeses that report a value for it.

cheese_calc <- cheese_data |>
  filter(!is.na(calcium_content))

print(cheese_calc)
# A tibble: 25 × 10
   cheese          url   milk  country family type  vegetarian color fat_content
   <chr>           <chr> <chr> <chr>   <chr>  <chr> <lgl>      <chr>       <dbl>
 1 Affidelice au … http… cow   France  <NA>   soft  FALSE      oran…        55  
 2 Amul Emmental   http… cow   India   Swiss… semi… TRUE       yell…        46  
 3 Amul Gouda      http… cow   India   Gouda  semi… TRUE       yell…        46  
 4 Amul Pizza Moz… http… cow   India   Mozza… semi… TRUE       yell…        NA  
 5 Amul Processed… http… cow,… India   Chedd… hard… TRUE       yell…        26  
 6 Anthotyro       http… goat… Greece  <NA>   hard… NA         white        30  
 7 Basils Origina… http… cow   Germany <NA>   semi… FALSE      pale…        25.5
 8 Bavaria blu     http… cow   Germany Blue   soft… FALSE      cream        NA  
 9 Bianco          http… cow   Germany <NA>   semi… FALSE      pale…        NA  
10 Bonifaz         http… cow   Germany <NA>   soft  FALSE      cream        NA  
# ℹ 15 more rows
# ℹ 1 more variable: calcium_content <chr>

To convert the calcium_content column from character format (which may include units like “mg/100g”) to numeric, you can use a similar approach as we did for fat_content. You’ll want to extract the numeric part of the string and convert it to a numeric type.

# Convert calcium_content to numeric
cheese_data <- cheese_calc |>
  mutate(
    calcium_content = case_when(
      is.na(calcium_content) ~ as.numeric(NA),
      grepl("mg/100g", calcium_content) ~ as.numeric(gsub(" mg/100g", "", calcium_content)),
      TRUE ~ as.numeric(calcium_content)
    )
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `calcium_content = case_when(...)`.
Caused by warning:
! NAs introduced by coercion
# Print the updated data frame
#print(cheese_data)

knitr::kable(cheese_data)
| cheese | url | milk | country | family | type | vegetarian | color | fat_content | calcium_content |
|---|---|---|---|---|---|---|---|---|---|
| Affidelice au Chablis | https://www.cheese.com/affidelice-au-chablis/ | cow | France | NA | soft | FALSE | orange | 55.0 | 26 |
| Amul Emmental | https://www.cheese.com/amul-emmental/ | cow | India | Swiss Cheese | semi-hard | TRUE | yellow | 46.0 | 488 |
| Amul Gouda | https://www.cheese.com/amul-gouda/ | cow | India | Gouda | semi-hard | TRUE | yellow | 46.0 | 492 |
| Amul Pizza Mozzarella Cheese | https://www.cheese.com/amul-pizza-mozzarella-cheese/ | cow | India | Mozzarella | semi-soft, processed | TRUE | yellow | NA | 492 |
| Amul Processed Cheese | https://www.cheese.com/amul-processed-cheese/ | cow, water buffalo | India | Cheddar | hard, processed | TRUE | yellow | 26.0 | 343 |
| Anthotyro | https://www.cheese.com/anthotyro/ | goat, sheep | Greece | NA | hard, whey | NA | white | 30.0 | 318 |
| Basils Original Rauchkäse | https://www.cheese.com/basils-original-rauchkase/ | cow | Germany | NA | semi-soft | FALSE | pale yellow | 25.5 | 700 |
| Bavaria blu | https://www.cheese.com/bavaria-blu/ | cow | Germany | Blue | soft, blue-veined | FALSE | cream | NA | 450 |
| Bianco | https://www.cheese.com/bianco/ | cow | Germany | NA | semi-hard | FALSE | pale yellow | NA | 725 |
| Bonifaz | https://www.cheese.com/bonifaz/ | cow | Germany | NA | soft | FALSE | cream | NA | 430 |
| Breakfast Cheese | https://www.cheese.com/breakfast-cheese/ | cow | United States | NA | fresh firm, soft-ripened | TRUE | white | NA | 90 |
| Brebis du Lavort | https://www.cheese.com/brebis-du-lavort/ | sheep | France | NA | semi-hard, artisan | FALSE | ivory | NA | 1050 |
| Brunost | https://www.cheese.com/brunost/ | cow, goat | Denmark, Finland, Germany, Iceland, Norway, Sweden | NA | semi-soft, whey | NA | brown | NA | 360 |
| Castelmagno | https://www.cheese.com/castelmagno/ | cow, goat, sheep | Italy | Blue | semi-hard | FALSE | ivory | NA | 4768 |
| Limburger | https://www.cheese.com/limburger/ | cow | Belgium, Germany, Netherlands | NA | semi-soft, smear-ripened | FALSE | straw | 42.0 | 497 |
| Paneer | https://www.cheese.com/paneer/ | cow, water buffalo | Bangladesh, India | Cottage | fresh firm | TRUE | white | NA | 208 |
| Petida | https://www.cheese.com/petida/ | cow | Germany | NA | soft, brined | TRUE | white | 55.0 | 190 |
| President Fat Free Feta | https://www.cheese.com/president-fat-free-feta/ | cow | France, United States | Feta | firm, artisan, brined | NA | white | NA | 30 |
| Prima Donna fino | https://www.cheese.com/prima-donna-fino/ | cow | Netherlands | Parmesan | hard | NA | pale yellow | NA | 921 |
| Prima Donna forte | https://www.cheese.com/prima-donna-forte/ | NA | Netherlands | Parmesan | hard | NA | yellow | NA | 990 |
| Prima Donna leggero | https://www.cheese.com/prima-donna-leggero/ | cow | Netherlands | Parmesan | hard | NA | yellow | NA | 1071 |
| Prima Donna maturo | https://www.cheese.com/prima-donna-maturo/ | cow | Netherlands | Parmesan | hard | NA | yellow | NA | 749 |
| Provoleta | https://www.cheese.com/provoleta/ | water buffalo | Argentina | Pasta filata | semi-hard, artisan | FALSE | pale yellow | 45.0 | 316 |
| Provolone del Monaco | https://www.cheese.com/provolone-del-monaco/ | cow | Italy | Pasta filata | semi-hard, artisan | FALSE | pale yellow | 40.5 | 157 |
| Seriously Strong Cheddar | https://www.cheese.com/seriously-strong-cheddar/ | cow | England, Scotland, United Kingdom | Cheddar | hard | FALSE | yellow | 34.4 | 740 |
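
As a possible alternative to the case_when() approach for calcium_content, readr's parse_number() extracts the first number from a string and drops the surrounding text. The sketch below uses a few illustrative strings in the "mg/100g" format seen earlier; it assumes every value uses the same unit, which should be checked before relying on it.

```r
library(readr)

# Illustrative strings in the format seen above (not the full column)
x <- c("26 mg/100g", "488 mg/100g", NA)

# parse_number() keeps the first number it finds and ignores the rest
calcium_mg <- parse_number(x)
print(calcium_mg)
```

Because only the first number is kept, this shortcut suits single values with units; for a range it would silently discard the upper bound.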

Next, we want to convert these variables to a more usable format for data analysis. We can use functions from the stringr package to inspect the milk variable and create a new variable, milk_source, that holds the animal name when the milk comes from a single animal and "multiple" when it comes from several.

library(dplyr)
library(stringr)
# Extract animal types and create a new variable for milk source
cheese_data <- cheese_data |>
  mutate(    # add id number column 
    id = row_number(),
    milk_source = case_when(
      str_detect(milk, ",") ~ "multiple",
      TRUE ~ str_trim(milk)
    )
  )

knitr::kable(cheese_data)
| cheese | url | milk | country | family | type | vegetarian | color | fat_content | calcium_content | id | milk_source |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Affidelice au Chablis | https://www.cheese.com/affidelice-au-chablis/ | cow | France | NA | soft | FALSE | orange | 55.0 | 26 | 1 | cow |
| Amul Emmental | https://www.cheese.com/amul-emmental/ | cow | India | Swiss Cheese | semi-hard | TRUE | yellow | 46.0 | 488 | 2 | cow |
| Amul Gouda | https://www.cheese.com/amul-gouda/ | cow | India | Gouda | semi-hard | TRUE | yellow | 46.0 | 492 | 3 | cow |
| Amul Pizza Mozzarella Cheese | https://www.cheese.com/amul-pizza-mozzarella-cheese/ | cow | India | Mozzarella | semi-soft, processed | TRUE | yellow | NA | 492 | 4 | cow |
| Amul Processed Cheese | https://www.cheese.com/amul-processed-cheese/ | cow, water buffalo | India | Cheddar | hard, processed | TRUE | yellow | 26.0 | 343 | 5 | multiple |
| Anthotyro | https://www.cheese.com/anthotyro/ | goat, sheep | Greece | NA | hard, whey | NA | white | 30.0 | 318 | 6 | multiple |
| Basils Original Rauchkäse | https://www.cheese.com/basils-original-rauchkase/ | cow | Germany | NA | semi-soft | FALSE | pale yellow | 25.5 | 700 | 7 | cow |
| Bavaria blu | https://www.cheese.com/bavaria-blu/ | cow | Germany | Blue | soft, blue-veined | FALSE | cream | NA | 450 | 8 | cow |
| Bianco | https://www.cheese.com/bianco/ | cow | Germany | NA | semi-hard | FALSE | pale yellow | NA | 725 | 9 | cow |
| Bonifaz | https://www.cheese.com/bonifaz/ | cow | Germany | NA | soft | FALSE | cream | NA | 430 | 10 | cow |
| Breakfast Cheese | https://www.cheese.com/breakfast-cheese/ | cow | United States | NA | fresh firm, soft-ripened | TRUE | white | NA | 90 | 11 | cow |
| Brebis du Lavort | https://www.cheese.com/brebis-du-lavort/ | sheep | France | NA | semi-hard, artisan | FALSE | ivory | NA | 1050 | 12 | sheep |
| Brunost | https://www.cheese.com/brunost/ | cow, goat | Denmark, Finland, Germany, Iceland, Norway, Sweden | NA | semi-soft, whey | NA | brown | NA | 360 | 13 | multiple |
| Castelmagno | https://www.cheese.com/castelmagno/ | cow, goat, sheep | Italy | Blue | semi-hard | FALSE | ivory | NA | 4768 | 14 | multiple |
| Limburger | https://www.cheese.com/limburger/ | cow | Belgium, Germany, Netherlands | NA | semi-soft, smear-ripened | FALSE | straw | 42.0 | 497 | 15 | cow |
| Paneer | https://www.cheese.com/paneer/ | cow, water buffalo | Bangladesh, India | Cottage | fresh firm | TRUE | white | NA | 208 | 16 | multiple |
| Petida | https://www.cheese.com/petida/ | cow | Germany | NA | soft, brined | TRUE | white | 55.0 | 190 | 17 | cow |
| President Fat Free Feta | https://www.cheese.com/president-fat-free-feta/ | cow | France, United States | Feta | firm, artisan, brined | NA | white | NA | 30 | 18 | cow |
| Prima Donna fino | https://www.cheese.com/prima-donna-fino/ | cow | Netherlands | Parmesan | hard | NA | pale yellow | NA | 921 | 19 | cow |
| Prima Donna forte | https://www.cheese.com/prima-donna-forte/ | NA | Netherlands | Parmesan | hard | NA | yellow | NA | 990 | 20 | NA |
| Prima Donna leggero | https://www.cheese.com/prima-donna-leggero/ | cow | Netherlands | Parmesan | hard | NA | yellow | NA | 1071 | 21 | cow |
| Prima Donna maturo | https://www.cheese.com/prima-donna-maturo/ | cow | Netherlands | Parmesan | hard | NA | yellow | NA | 749 | 22 | cow |
| Provoleta | https://www.cheese.com/provoleta/ | water buffalo | Argentina | Pasta filata | semi-hard, artisan | FALSE | pale yellow | 45.0 | 316 | 23 | water buffalo |
| Provolone del Monaco | https://www.cheese.com/provolone-del-monaco/ | cow | Italy | Pasta filata | semi-hard, artisan | FALSE | pale yellow | 40.5 | 157 | 24 | cow |
| Seriously Strong Cheddar | https://www.cheese.com/seriously-strong-cheddar/ | cow | England, Scotland, United Kingdom | Cheddar | hard | FALSE | yellow | 34.4 | 740 | 25 | cow |
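
The milk_source variable collapses every multi-animal cheese into "multiple". If the individual animals are needed, two possible approaches are sketched below on three rows taken from the table above: counting the animals per cheese, and reshaping to one row per cheese and animal. The names `toy`, `n_animals`, `toy_counts`, and `toy_long` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Three rows copied from the table above
toy <- tibble(
  cheese = c("Paneer", "Provoleta", "Castelmagno"),
  milk   = c("cow, water buffalo", "water buffalo", "cow, goat, sheep")
)

# Number of animals = number of commas + 1
toy_counts <- toy |>
  mutate(n_animals = str_count(milk, ",") + 1)

# One row per cheese-animal pair, splitting on ", "
toy_long <- toy |>
  separate_longer_delim(milk, delim = ", ")

print(toy_counts)
print(toy_long)
```

The long format makes it straightforward to count how often each animal appears, at the cost of repeating the cheese name across rows.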

The next chunk counts the cheeses for each milk source and computes each group's share of the total. In that code:

group_by(milk_source): This function groups the data by the milk_source column.
summarise(count = n()): This function summarizes the grouped data by counting the number of rows in each group with n().
.groups = 'drop': This argument drops the grouping structure after summarization, returning a regular (ungrouped) data frame.
mutate(proportion = count / sum(count)): This adds each group's share of all cheeses.
print(milk_source_counts): This prints the resulting summary data frame.

library(dplyr)

# Count the number of cheeses for each milk source and also the proportion 
milk_source_counts <- cheese_data  |> 
  group_by(milk_source) |> 
  summarise(count = n(), .groups = 'drop') |> 
  mutate(proportion = count / sum(count))


# Print the results
print(milk_source_counts)
# A tibble: 5 × 3
  milk_source   count proportion
  <chr>         <int>      <dbl>
1 cow              17       0.68
2 multiple          5       0.2 
3 sheep             1       0.04
4 water buffalo     1       0.04
5 <NA>              1       0.04
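
For reference, dplyr's count() wraps the group_by() and summarise() steps into one call and returns an ungrouped result. The sketch below shows the equivalent pattern on a made-up `toy` tibble, so its counts are illustrative only.

```r
library(dplyr)
library(tibble)

# Made-up milk sources, including a missing value
toy <- tibble(milk_source = c("cow", "cow", "multiple", "sheep", NA))

# count() groups, tallies, and ungroups in one step
milk_counts <- toy |>
  count(milk_source, name = "count") |>
  mutate(proportion = count / sum(count))

print(milk_counts)
```

As in the output above, missing values form their own group, so the proportions are shares of all rows, including those with no recorded milk source.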

We want to do the same thing for other

Now that we have created that new variable, we want to select only the columns we need for our final product.
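
A minimal sketch of that last step is shown below. The choice of columns is an assumption made for illustration, as are the names `toy`, `final`, and `out_file`; the two rows are copied from the table above. Writing a CSV file is one common way to hand data to SPSS or Jamovi, since both can import that format.

```r
library(dplyr)
library(readr)
library(tibble)

# Two rows copied from the table above, standing in for cheese_data
toy <- tibble(
  id              = 1:2,
  cheese          = c("Paneer", "Provoleta"),
  url             = c("https://www.cheese.com/paneer/",
                      "https://www.cheese.com/provoleta/"),
  milk_source     = c("multiple", "water buffalo"),
  fat_content     = c(NA, 45),
  calcium_content = c(208, 316)
)

# Keep only the columns wanted in the final file (hypothetical selection)
final <- toy |>
  select(id, cheese, milk_source, fat_content, calcium_content)

# Write to a temporary file; missing values become empty cells
out_file <- tempfile(fileext = ".csv")
write_csv(final, out_file, na = "")
```

Writing missing values as empty cells rather than the text "NA" is a design choice meant to keep numeric columns from being read as text on import.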
