Cheese

This document creates an assignment for practicing data entry and transformation tasks in SPSS and Jamovi, using data from cheese.com.

The TidyTuesday project for R featured a fun data set in June 2024 that pulled together characteristics of the different types of cheese listed on cheese.com.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

cheeses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-06-04/cheeses.csv')

Rows: 1187 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): cheese, url, milk, country, region, family, type, fat_content, cal...
lgl  (2): vegetarian, vegan

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(cheeses)

Rows: 1,187
Columns: 19
$ cheese          <chr> "Aarewasser", "Abbaye de Belloc", "Abbaye de Belval", …
$ url             <chr> "https://www.cheese.com/aarewasser/", "https://www.che…
$ milk            <chr> "cow", "sheep", "cow", "cow", "cow", "cow", "cow", "co…
$ country         <chr> "Switzerland", "France", "France", "France", "France",…
$ region          <chr> NA, "Pays Basque", NA, "Burgundy", "Savoie", "province…
$ family          <chr> NA, NA, NA, NA, NA, NA, NA, "Cheddar", NA, NA, NA, NA,…
$ type            <chr> "semi-soft", "semi-hard, artisan", "semi-hard", "semi-…
$ fat_content     <chr> NA, NA, "40-46%", NA, NA, NA, "50%", NA, "45%", NA, NA…
$ calcium_content <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ texture         <chr> "buttery", "creamy, dense, firm", "elastic", "creamy, …
$ rind            <chr> "washed", "natural", "washed", "washed", "washed", "wa…
$ color           <chr> "yellow", "yellow", "ivory", "white", "white", "pale y…
$ flavor          <chr> "sweet", "burnt caramel", NA, "acidic, milky, smooth",…
$ aroma           <chr> "buttery", "lanoline", "aromatic", "barnyardy, earthy"…
$ vegetarian      <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ vegan           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ synonyms        <chr> NA, "Abbaye Notre-Dame de Belloc", NA, NA, NA, NA, NA,…
$ alt_spellings   <chr> NA, NA, NA, NA, "Tamié, Trappiste de Tamie, Abbey of T…
$ producers       <chr> "Jumi", NA, NA, NA, NA, "Abbaye Cistercienne NOTRE-DAM…

To count the cheeses that have a value for fat_content or calcium_content, and then create a new data frame containing only those cheeses (rows where at least one of the two columns is not missing), you can use the dplyr package in R. The steps are below:

Step 1: Load the necessary library

Ensure you have the dplyr package loaded:

library(dplyr)

Step 2: Count the number of items with a non-missing value for fat_content or calcium_content

You can use the filter() function to keep rows where at least one of the two columns is not missing, and then count them using summarize():

# Assuming your data frame is named 'cheeses'
count_non_missing <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content) ) |>
  summarize(count = n())

# Print the count
print(count_non_missing)

# A tibble: 1 × 1
  count
  <int>
1   249
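The difference between "at least one" (`|`) and "both" (`&`) is easy to check on a small made-up tibble. The sketch below uses a hypothetical `toy` data frame, not the cheese data, and is only meant to illustrate the two filter conditions:

```r
library(dplyr)

# Hypothetical toy data (not from cheese.com): NA marks a missing value
toy <- tibble(
  cheese          = c("A", "B", "C", "D"),
  fat_content     = c("50%", NA, "45%", NA),
  calcium_content = c(NA, "26 mg/100g", "30 mg/100g", NA)
)

# Rows with a value in at least one of the two columns
n_either <- toy |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  nrow()

# Rows with a value in both columns
n_both <- toy |>
  filter(!is.na(fat_content) & !is.na(calcium_content)) |>
  nrow()

c(either = n_either, both = n_both)
```

Here three toy rows have at least one value and only one row has both, which is why the choice of operator matters for the count.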

Step 3: Create a new data frame with items that are not missing for at least one of those columns

You can use the same filter() function to create a new data frame:

# Create a new data frame with non-missing values for at least one of the specified columns
filtered_cheeses <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content))

# Print the new data frame
print(filtered_cheeses)

# A tibble: 249 × 19
   cheese    url   milk  country region family type  fat_content calcium_content
   <chr>     <chr> <chr> <chr>   <chr>  <chr>  <chr> <chr>       <chr>          
 1 Abbaye d… http… cow   France  <NA>   <NA>   semi… 40-46%      <NA>           
 2 Abbaye d… http… cow   France  Nord-… <NA>   semi… 50%         <NA>           
 3 Abertam   http… sheep Czech … Karlo… <NA>   hard… 45%         <NA>           
 4 Acorn     http… sheep United… Betha… <NA>   hard… 52%         <NA>           
 5 Adelost   http… cow   Sweden  <NA>   Blue   semi… 50%         <NA>           
 6 ADL Bric… http… cow   Canada  Princ… Chedd… semi… 12%         <NA>           
 7 ADL Mild… http… cow   Canada  Princ… Chedd… semi… 14%         <NA>           
 8 Affideli… http… cow   France  Burgu… <NA>   soft  55%         26 mg/100g     
 9 Aisy Cen… http… cow   France  Burgu… <NA>   semi… 50%         <NA>           
10 Allgauer… http… cow   Germany Swabia <NA>   hard  45%         <NA>           
# ℹ 239 more rows
# ℹ 10 more variables: texture <chr>, rind <chr>, color <chr>, flavor <chr>,
#   aroma <chr>, vegetarian <lgl>, vegan <lgl>, synonyms <chr>,
#   alt_spellings <chr>, producers <chr>

Explanation of the Code

Filtering: The filter(!is.na(fat_content) | !is.na(calcium_content)) checks for rows where at least one of the two columns (fat_content or calcium_content) is not NA.

Counting: summarize(count = n()) counts the number of rows that meet the filter condition.

New Data Frame: The result of the filter is stored in filtered_cheeses, which only includes the rows where at least one of the specified columns has a value.

Complete Example

library(dplyr)
library(tidyverse)

# Count the number of items with a non-missing value for fat_content or calcium_content
count_non_missing <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |>
  summarize(count = n())

print(count_non_missing)

# A tibble: 1 × 1
  count
  <int>
1   249

# Create a new data frame with items that are not missing for at least one of those columns
filtered_cheeses <- cheeses |>
  filter(!is.na(fat_content) | !is.na(calcium_content)) |> 
  # limit to columns we are interested in
  select(c(cheese, url, milk, country, family, type, vegetarian, color, fat_content, calcium_content))


print(filtered_cheeses)

# A tibble: 249 × 10
   cheese          url   milk  country family type  vegetarian color fat_content
   <chr>           <chr> <chr> <chr>   <chr>  <chr> <lgl>      <chr> <chr>      
 1 Abbaye de Belv… http… cow   France  <NA>   semi… FALSE      ivory 40-46%     
 2 Abbaye du Mont… http… cow   France  <NA>   semi… FALSE      pale… 50%        
 3 Abertam         http… sheep Czech … <NA>   hard… FALSE      pale… 45%        
 4 Acorn           http… sheep United… <NA>   hard… TRUE       <NA>  52%        
 5 Adelost         http… cow   Sweden  Blue   semi… NA         blue  50%        
 6 ADL Brick Chee… http… cow   Canada  Chedd… semi… NA         ivory 12%        
 7 ADL Mild Chedd… http… cow   Canada  Chedd… semi… NA         yell… 14%        
 8 Affidelice au … http… cow   France  <NA>   soft  FALSE      oran… 55%        
 9 Aisy Cendre     http… cow   France  <NA>   semi… FALSE      white 50%        
10 Allgauer Emmen… http… cow   Germany <NA>   hard  FALSE      yell… 45%        
# ℹ 239 more rows
# ℹ 1 more variable: calcium_content <chr>

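To convert the fat_content column from character to numeric, you can use the as.numeric() function along with some additional handling for the different formats in the data. Note that case_when() uses the first condition that matches, so a range such as "40-46%" is caught by the "%" branch, where as.numeric("40-46") returns NA. That is why the first row in the output below is NA. The coercion warnings arise because case_when() evaluates every right-hand side on the whole column, so as.numeric() also sees strings it cannot convert.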

# Convert fat_content to numeric
cheese_data <- filtered_cheeses |> 
  mutate(
    fat_content = case_when(
    is.na(fat_content) ~ as.numeric(NA),
    grepl("%", fat_content) ~ as.numeric(gsub("%", "", fat_content)),
    grepl("-", fat_content) ~ as.numeric(strsplit(fat_content, "-")[[1]][1]),
    TRUE ~ as.numeric(fat_content)
  ))

Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `fat_content = case_when(...)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

# Print the updated data frame
print(cheese_data)

# A tibble: 249 × 10
   cheese          url   milk  country family type  vegetarian color fat_content
   <chr>           <chr> <chr> <chr>   <chr>  <chr> <lgl>      <chr>       <dbl>
 1 Abbaye de Belv… http… cow   France  <NA>   semi… FALSE      ivory          NA
 2 Abbaye du Mont… http… cow   France  <NA>   semi… FALSE      pale…          50
 3 Abertam         http… sheep Czech … <NA>   hard… FALSE      pale…          45
 4 Acorn           http… sheep United… <NA>   hard… TRUE       <NA>           52
 5 Adelost         http… cow   Sweden  Blue   semi… NA         blue           50
 6 ADL Brick Chee… http… cow   Canada  Chedd… semi… NA         ivory          12
 7 ADL Mild Chedd… http… cow   Canada  Chedd… semi… NA         yell…          14
 8 Affidelice au … http… cow   France  <NA>   soft  FALSE      oran…          55
 9 Aisy Cendre     http… cow   France  <NA>   semi… FALSE      white          50
10 Allgauer Emmen… http… cow   Germany <NA>   hard  FALSE      yell…          45
# ℹ 239 more rows
# ℹ 1 more variable: calcium_content <chr>
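If ranges such as "40-46%" should be kept rather than turned into NA, one option is to extract every number in the string and average them. This is a minimal sketch, not the approach used above: `parse_fat()` is a hypothetical helper name, and it assumes every value is a percentage or a percentage range (a string with other numbers in it, such as a "per 100g" unit, would be averaged incorrectly).

```r
library(stringr)

# Illustrative helper (hypothetical name): average every number found in a string
parse_fat <- function(x) {
  # Extract all numbers, with an optional decimal part, from each element
  nums <- str_extract_all(x, "[0-9]+\\.?[0-9]*")
  vapply(
    nums,
    function(n) {
      # No digits found, or the input itself was NA
      if (length(n) == 0 || all(is.na(n))) return(NA_real_)
      mean(as.numeric(n))
    },
    numeric(1)
  )
}

# Made-up example strings in the same shapes seen in fat_content
fat_num <- parse_fat(c("40-46%", "50%", "25.5%", NA))
fat_num
```

With this helper a range becomes its midpoint (43 for "40-46%"), single values are unchanged, and NA stays NA without a coercion warning. Taking the lower bound instead of the midpoint would be an equally defensible choice.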

Next, deal with calcium_content by keeping only the cheeses that have a value for it.

cheese_calc <- cheese_data |>
  filter(!is.na(calcium_content))

print(cheese_calc)

# A tibble: 25 × 10
   cheese          url   milk  country family type  vegetarian color fat_content
   <chr>           <chr> <chr> <chr>   <chr>  <chr> <lgl>      <chr>       <dbl>
 1 Affidelice au … http… cow   France  <NA>   soft  FALSE      oran…        55  
 2 Amul Emmental   http… cow   India   Swiss… semi… TRUE       yell…        46  
 3 Amul Gouda      http… cow   India   Gouda  semi… TRUE       yell…        46  
 4 Amul Pizza Moz… http… cow   India   Mozza… semi… TRUE       yell…        NA  
 5 Amul Processed… http… cow,… India   Chedd… hard… TRUE       yell…        26  
 6 Anthotyro       http… goat… Greece  <NA>   hard… NA         white        30  
 7 Basils Origina… http… cow   Germany <NA>   semi… FALSE      pale…        25.5
 8 Bavaria blu     http… cow   Germany Blue   soft… FALSE      cream        NA  
 9 Bianco          http… cow   Germany <NA>   semi… FALSE      pale…        NA  
10 Bonifaz         http… cow   Germany <NA>   soft  FALSE      cream        NA  
# ℹ 15 more rows
# ℹ 1 more variable: calcium_content <chr>

To convert the calcium_content column from character format (which may include units like “mg/100g”) to numeric, you can use a similar approach as we did for fat_content. You’ll want to extract the numeric part of the string and convert it to a numeric type.

# Convert calcium_content to numeric
cheese_data <- cheese_calc |>
  mutate(
    calcium_content = case_when(
      is.na(calcium_content) ~ as.numeric(NA),
      grepl("mg/100g", calcium_content) ~ as.numeric(gsub(" mg/100g", "", calcium_content)),
      TRUE ~ as.numeric(calcium_content)
    )
  )

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `calcium_content = case_when(...)`.
Caused by warning:
! NAs introduced by coercion
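The warning comes from the fallback branch, where as.numeric() is applied to strings that still contain the unit. As an alternative sketch, readr::parse_number() extracts the first number from each string and drops the surrounding text. The input vector below is made up to match the "value unit" shape of calcium_content:

```r
library(readr)

# Hypothetical example strings in the same "value unit" shape as calcium_content
calcium_raw <- c("26 mg/100g", "4768 mg/100g", NA)

# parse_number() keeps the first number it finds and drops surrounding text
calcium_num <- parse_number(calcium_raw)
calcium_num
```

Inside the pipeline this would amount to something like mutate(calcium_content = parse_number(calcium_content)), assuming every non-missing value starts with the number of interest.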

# Print the updated data frame
#print(cheese_data)

knitr::kable(cheese_data)

cheese	url	milk	country	family	type	vegetarian	color	fat_content	calcium_content
Affidelice au Chablis	https://www.cheese.com/affidelice-au-chablis/	cow	France	NA	soft	FALSE	orange	55.0	26
Amul Emmental	https://www.cheese.com/amul-emmental/	cow	India	Swiss Cheese	semi-hard	TRUE	yellow	46.0	488
Amul Gouda	https://www.cheese.com/amul-gouda/	cow	India	Gouda	semi-hard	TRUE	yellow	46.0	492
Amul Pizza Mozzarella Cheese	https://www.cheese.com/amul-pizza-mozzarella-cheese/	cow	India	Mozzarella	semi-soft, processed	TRUE	yellow	NA	492
Amul Processed Cheese	https://www.cheese.com/amul-processed-cheese/	cow, water buffalo	India	Cheddar	hard, processed	TRUE	yellow	26.0	343
Anthotyro	https://www.cheese.com/anthotyro/	goat, sheep	Greece	NA	hard, whey	NA	white	30.0	318
Basils Original Rauchkäse	https://www.cheese.com/basils-original-rauchkase/	cow	Germany	NA	semi-soft	FALSE	pale yellow	25.5	700
Bavaria blu	https://www.cheese.com/bavaria-blu/	cow	Germany	Blue	soft, blue-veined	FALSE	cream	NA	450
Bianco	https://www.cheese.com/bianco/	cow	Germany	NA	semi-hard	FALSE	pale yellow	NA	725
Bonifaz	https://www.cheese.com/bonifaz/	cow	Germany	NA	soft	FALSE	cream	NA	430
Breakfast Cheese	https://www.cheese.com/breakfast-cheese/	cow	United States	NA	fresh firm, soft-ripened	TRUE	white	NA	90
Brebis du Lavort	https://www.cheese.com/brebis-du-lavort/	sheep	France	NA	semi-hard, artisan	FALSE	ivory	NA	1050
Brunost	https://www.cheese.com/brunost/	cow, goat	Denmark, Finland, Germany, Iceland, Norway, Sweden	NA	semi-soft, whey	NA	brown	NA	360
Castelmagno	https://www.cheese.com/castelmagno/	cow, goat, sheep	Italy	Blue	semi-hard	FALSE	ivory	NA	4768
Limburger	https://www.cheese.com/limburger/	cow	Belgium, Germany, Netherlands	NA	semi-soft, smear-ripened	FALSE	straw	42.0	497
Paneer	https://www.cheese.com/paneer/	cow, water buffalo	Bangladesh, India	Cottage	fresh firm	TRUE	white	NA	208
Petida	https://www.cheese.com/petida/	cow	Germany	NA	soft, brined	TRUE	white	55.0	190
President Fat Free Feta	https://www.cheese.com/president-fat-free-feta/	cow	France, United States	Feta	firm, artisan, brined	NA	white	NA	30
Prima Donna fino	https://www.cheese.com/prima-donna-fino/	cow	Netherlands	Parmesan	hard	NA	pale yellow	NA	921
Prima Donna forte	https://www.cheese.com/prima-donna-forte/	NA	Netherlands	Parmesan	hard	NA	yellow	NA	990
Prima Donna leggero	https://www.cheese.com/prima-donna-leggero/	cow	Netherlands	Parmesan	hard	NA	yellow	NA	1071
Prima Donna maturo	https://www.cheese.com/prima-donna-maturo/	cow	Netherlands	Parmesan	hard	NA	yellow	NA	749
Provoleta	https://www.cheese.com/provoleta/	water buffalo	Argentina	Pasta filata	semi-hard, artisan	FALSE	pale yellow	45.0	316
Provolone del Monaco	https://www.cheese.com/provolone-del-monaco/	cow	Italy	Pasta filata	semi-hard, artisan	FALSE	pale yellow	40.5	157
Seriously Strong Cheddar	https://www.cheese.com/seriously-strong-cheddar/	cow	England, Scotland, United Kingdom	Cheddar	hard	FALSE	yellow	34.4	740

Next we want to convert these variables to a more usable format for data analysis. We can use functions from stringr to detect whether the milk variable lists more than one animal, and create a new variable that holds the animal name when the milk comes from a single animal and "multiple" when it comes from several.

library(dplyr)
library(stringr)
# Extract animal types and create a new variable for milk source
cheese_data <- cheese_data |>
  mutate(    # add id number column 
    id = row_number(),
    milk_source = case_when(
      str_detect(milk, ",") ~ "multiple",
      TRUE ~ str_trim(milk)
    )
  )

knitr::kable(cheese_data)

cheese	url	milk	country	family	type	vegetarian	color	fat_content	calcium_content	id	milk_source
Affidelice au Chablis	https://www.cheese.com/affidelice-au-chablis/	cow	France	NA	soft	FALSE	orange	55.0	26	1	cow
Amul Emmental	https://www.cheese.com/amul-emmental/	cow	India	Swiss Cheese	semi-hard	TRUE	yellow	46.0	488	2	cow
Amul Gouda	https://www.cheese.com/amul-gouda/	cow	India	Gouda	semi-hard	TRUE	yellow	46.0	492	3	cow
Amul Pizza Mozzarella Cheese	https://www.cheese.com/amul-pizza-mozzarella-cheese/	cow	India	Mozzarella	semi-soft, processed	TRUE	yellow	NA	492	4	cow
Amul Processed Cheese	https://www.cheese.com/amul-processed-cheese/	cow, water buffalo	India	Cheddar	hard, processed	TRUE	yellow	26.0	343	5	multiple
Anthotyro	https://www.cheese.com/anthotyro/	goat, sheep	Greece	NA	hard, whey	NA	white	30.0	318	6	multiple
Basils Original Rauchkäse	https://www.cheese.com/basils-original-rauchkase/	cow	Germany	NA	semi-soft	FALSE	pale yellow	25.5	700	7	cow
Bavaria blu	https://www.cheese.com/bavaria-blu/	cow	Germany	Blue	soft, blue-veined	FALSE	cream	NA	450	8	cow
Bianco	https://www.cheese.com/bianco/	cow	Germany	NA	semi-hard	FALSE	pale yellow	NA	725	9	cow
Bonifaz	https://www.cheese.com/bonifaz/	cow	Germany	NA	soft	FALSE	cream	NA	430	10	cow
Breakfast Cheese	https://www.cheese.com/breakfast-cheese/	cow	United States	NA	fresh firm, soft-ripened	TRUE	white	NA	90	11	cow
Brebis du Lavort	https://www.cheese.com/brebis-du-lavort/	sheep	France	NA	semi-hard, artisan	FALSE	ivory	NA	1050	12	sheep
Brunost	https://www.cheese.com/brunost/	cow, goat	Denmark, Finland, Germany, Iceland, Norway, Sweden	NA	semi-soft, whey	NA	brown	NA	360	13	multiple
Castelmagno	https://www.cheese.com/castelmagno/	cow, goat, sheep	Italy	Blue	semi-hard	FALSE	ivory	NA	4768	14	multiple
Limburger	https://www.cheese.com/limburger/	cow	Belgium, Germany, Netherlands	NA	semi-soft, smear-ripened	FALSE	straw	42.0	497	15	cow
Paneer	https://www.cheese.com/paneer/	cow, water buffalo	Bangladesh, India	Cottage	fresh firm	TRUE	white	NA	208	16	multiple
Petida	https://www.cheese.com/petida/	cow	Germany	NA	soft, brined	TRUE	white	55.0	190	17	cow
President Fat Free Feta	https://www.cheese.com/president-fat-free-feta/	cow	France, United States	Feta	firm, artisan, brined	NA	white	NA	30	18	cow
Prima Donna fino	https://www.cheese.com/prima-donna-fino/	cow	Netherlands	Parmesan	hard	NA	pale yellow	NA	921	19	cow
Prima Donna forte	https://www.cheese.com/prima-donna-forte/	NA	Netherlands	Parmesan	hard	NA	yellow	NA	990	20	NA
Prima Donna leggero	https://www.cheese.com/prima-donna-leggero/	cow	Netherlands	Parmesan	hard	NA	yellow	NA	1071	21	cow
Prima Donna maturo	https://www.cheese.com/prima-donna-maturo/	cow	Netherlands	Parmesan	hard	NA	yellow	NA	749	22	cow
Provoleta	https://www.cheese.com/provoleta/	water buffalo	Argentina	Pasta filata	semi-hard, artisan	FALSE	pale yellow	45.0	316	23	water buffalo
Provolone del Monaco	https://www.cheese.com/provolone-del-monaco/	cow	Italy	Pasta filata	semi-hard, artisan	FALSE	pale yellow	40.5	157	24	cow
Seriously Strong Cheddar	https://www.cheese.com/seriously-strong-cheddar/	cow	England, Scotland, United Kingdom	Cheddar	hard	FALSE	yellow	34.4	740	25	cow
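If the number of animals is also of interest, it can be derived from the same comma-separated text by counting separators. The following is a small sketch on made-up values; `toy_milk` and `n_animals` are hypothetical names, not part of the data set:

```r
library(dplyr)
library(stringr)

# Hypothetical toy values in the same comma-separated shape as the milk column
toy_milk <- tibble(milk = c("cow", "cow, goat", "cow, goat, sheep", NA))

toy_milk <- toy_milk |>
  mutate(
    # One animal per comma-separated entry; a missing milk value stays NA
    n_animals = str_count(milk, ",") + 1L
  )

toy_milk
```

A numeric count like this may be easier to recode in SPSS or Jamovi than the free-text milk column.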

group_by(milk_source): This function groups the data by the milk_source column.

summarise(count = n()): This function summarizes the grouped data by counting the number of occurrences in each group with n().

.groups = 'drop': This argument is used to drop the grouping structure after summarization, returning a regular data frame.

print(milk_source_counts): This prints the resulting summary data frame.

library(dplyr)

# Count the number of cheeses for each milk source and also the proportion 
milk_source_counts <- cheese_data  |> 
  group_by(milk_source) |> 
  summarise(count = n(), .groups = 'drop') |> 
  mutate(proportion = count / sum(count))


# Print the results
print(milk_source_counts)

# A tibble: 5 × 3
  milk_source   count proportion
  <chr>         <int>      <dbl>
1 cow              17       0.68
2 multiple          5       0.2 
3 sheep             1       0.04
4 water buffalo     1       0.04
5 <NA>              1       0.04
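The same table of counts and proportions can be written more compactly with dplyr::count(), which groups, counts, and ungroups in one step. This sketch uses a made-up `toy_sources` tibble rather than cheese_data:

```r
library(dplyr)

# Hypothetical toy data standing in for the milk_source column
toy_sources <- tibble(milk_source = c("cow", "cow", "multiple", "sheep"))

toy_counts <- toy_sources |>
  # count() is shorthand for group_by() + summarise(n()) + ungroup()
  count(milk_source, name = "count") |>
  mutate(proportion = count / sum(count))

toy_counts
```

Either form gives one row per milk source; the explicit group_by() version may be easier to follow for readers new to dplyr.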

We want to do the same thing for other

Now that we have created that new variable, we want to select only the columns we need for our final product.
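One way this last step could look is sketched below. The set of columns kept here is an assumption made for illustration, not the author's final choice, and `toy_cheese` is a small stand-in for cheese_data built from two rows of the table above. The result is written to a temporary CSV file, a format that both SPSS and Jamovi can import.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for cheese_data with a few of its columns
toy_cheese <- tibble(
  id = 1:2,
  cheese = c("Amul Gouda", "Paneer"),
  url = c("https://www.cheese.com/amul-gouda/", "https://www.cheese.com/paneer/"),
  milk_source = c("cow", "multiple"),
  calcium_content = c(492, 208)
)

# Keep an assumed set of analysis columns (the real selection may differ)
final_cheese <- toy_cheese |>
  select(id, cheese, milk_source, calcium_content)

# Write the result to a temporary CSV file
out_file <- tempfile(fileext = ".csv")
write_csv(final_cheese, out_file)
```

For the real assignment, tempfile() would be replaced with a path of your choosing.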