Logistic Regression Viz

This is an attempt to create logistic regression plots that don’t suck to look at. Using data from APLS 2024.

Acknowledgements

This is based on this very helpful and informative blogpost by Solomon Kurz. I am just trying to replicate it and make it work with my data.

Load Data

library(broom)
library(dplyr)
library(ggdist)
library(ggtext)
library(ggplot2)
library(gt)
library(showtext)

data <- read.csv("data/lay-vig2.csv")

hind <- data |> 
  hind_scoring() |> 
  mutate(sample = "juror") |> 
  janitor::clean_names() |> 
  select(
    sample, condition, probcause, 
    prob_drugs, prob_gun, prob_finance, prob_nothing, prob_guilt, 
    knowitallalong, difficultpredict, clearvivid, foreseeable, 
    worse_error, foreseeability, decisioneval, age, gender, race, relevance, 
    starts_with("believable_"), starts_with("relevant_")
  )
  
glimpse(hind)
Rows: 164
Columns: 29
$ sample           <chr> "juror", "juror", "juror", "juror", "juror", "juror",…
$ condition        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ probcause        <int> 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1,…
$ prob_drugs       <int> 10, 10, 20, 80, 60, 70, 70, 50, 25, 20, 40, 40, 20, 8…
$ prob_gun         <int> 0, 30, 5, 9, 15, 10, 10, 10, 0, 5, 20, 20, 10, 0, 20,…
$ prob_finance     <int> 0, 0, 5, 1, 5, 10, 0, 10, 0, 2, 10, 20, 20, 0, 0, 2, …
$ prob_nothing     <int> 90, 60, 70, 10, 20, 10, 20, 30, 75, 73, 30, 20, 50, 2…
$ prob_guilt       <dbl> 10, 40, 30, 90, 80, 90, 80, 70, 25, 27, 70, 80, 50, 8…
$ knowitallalong   <int> 1, 2, 2, 4, 3, 1, 4, 2, 2, 3, 3, 4, 4, 4, 4, 1, 1, 2,…
$ difficultpredict <int> 5, 2, 4, 2, 2, 4, 2, 4, 5, 5, 5, 2, 2, 4, 2, 3, 5, 2,…
$ clearvivid       <int> 1, 2, 2, 4, 5, 3, 4, 1, 2, 4, 2, 4, 3, 4, 3, 3, 1, 4,…
$ foreseeable      <int> 1, 2, 2, 4, 5, 2, 4, 2, 2, 4, 3, 4, 3, 4, 3, 2, 3, 3,…
$ worse_error      <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,…
$ foreseeability   <dbl> 1.00, 1.75, 1.75, 3.25, 3.50, 1.75, 3.25, 1.50, 1.75,…
$ decisioneval     <dbl> 1.2, 5.0, 2.8, 5.6, 6.4, 6.8, 5.4, 6.0, 1.8, 4.0, 4.8…
$ age              <int> 37, 31, 31, 34, 27, 27, 33, 40, 44, 58, 29, 24, 25, 2…
$ gender           <int> 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1,…
$ race             <int> 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1,…
$ relevance        <int> 9, 30, 21, 38, 40, 37, 32, 34, 8, 16, 42, 36, 25, 37,…
$ believable_1     <int> 9, 7, 6, 9, 8, 9, 6, 8, 8, 8, 4, 5, 7, 2, 9, 7, 2, 6,…
$ believable_2     <int> 5, 7, 4, 9, 7, 8, 7, 7, 9, 9, 7, 9, 4, 4, 8, 4, 2, 8,…
$ believable_3     <int> 5, 7, 3, 9, 7, 9, 8, 7, 7, 8, 6, 9, 7, 8, 8, 5, 2, 9,…
$ believable_4     <int> 2, 7, 1, 5, 4, 9, 3, 7, 1, 3, 5, 9, 3, 6, 1, 3, 1, 5,…
$ believable_5     <int> 4, 7, 6, 9, 5, 5, 5, 7, 9, 9, 5, 9, 6, 7, 6, 7, 2, 5,…
$ relevant_1       <int> 1, 6, 3, 5, 7, 8, 3, 7, 1, 2, 8, 7, 7, 5, 5, 4, 1, 3,…
$ relevant_2       <int> 1, 6, 4, 6, 9, 9, 8, 6, 1, 3, 7, 9, 3, 7, 6, 3, 1, 8,…
$ relevant_3       <int> 5, 6, 6, 9, 9, 9, 9, 8, 4, 7, 9, 9, 5, 9, 9, 8, 2, 9,…
$ relevant_4       <int> 1, 6, 6, 9, 9, 9, 6, 7, 1, 2, 9, 8, 7, 8, 7, 5, 1, 3,…
$ relevant_5       <int> 1, 6, 2, 9, 6, 2, 6, 6, 1, 2, 9, 3, 3, 8, 6, 3, 1, 3,…

Fitting and Visualizing Logistic Regression Models

We’ll use the ggdist package to create more informative visualizations of our logistic regression results.

Check the Outcome Distribution

hind |> 
  count(probcause) |>  
  mutate(percent = 100 * n / sum(n))
  probcause  n  percent
1         1 94 57.31707
2         2 70 42.68293

Recode Outcome Variable to 0/1

For logistic regression, we need the outcome coded as 0 and 1:

hind <- hind |> 
  mutate(pc = ifelse(probcause == 2, 0, 1)) 

# Verify the recoding
hind |> 
  select(probcause, pc) |> 
  head()
  probcause pc
1         2  0
2         2  0
3         2  0
4         1  1
5         1  1
6         1  1

Explore the Predictor

Let’s look at how foreseeability is distributed across our two outcome groups:

hind |>  
  ggplot(aes(x = foreseeability)) +
  geom_histogram(fill = "#ff6361", color = "white", bins = 20) +
  facet_wrap(~pc, ncol = 1, labeller = labeller(pc = c("0" = "No Probable Cause", "1" = "Probable Cause"))) +
  theme_emw_dark() +
  labs(
    x = "Foreseeability",
    y = "Count"
  )
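A numeric companion to the histograms can help pin down what the panels show. This is a sketch using the `hind` data from above (`pc` and `foreseeability` as defined earlier):

```r
# summarize foreseeability within each outcome group
hind |> 
  group_by(pc) |> 
  summarise(
    mean = mean(foreseeability),
    sd = sd(foreseeability),
    n = n()
  )
```

If foreseeability is a useful predictor, the group means should separate noticeably.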

The Model

We fit a simple logistic regression with foreseeability as the sole predictor:

\[ \begin{align*} \text{probcause}_i & \sim \operatorname{Binomial}(n = 1, p_i) \\ \operatorname{logit}(p_i) & = \beta_0 + \beta_1 \text{foreseeability}_i \end{align*} \]

We use the conventional logit link to ensure the binomial probabilities are restricted within the bounds of zero and one.
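As a quick sanity check on the link function, base R's `qlogis()` and `plogis()` are the logit and inverse logit:

```r
# logit maps probabilities in (0, 1) onto the whole real line
qlogis(0.5)   # 0
qlogis(0.88)  # ~2

# the inverse logit (logistic function) maps log-odds back into (0, 1)
plogis(0)     # 0.5

# plugging in the coefficients reported below, the predicted probability
# at foreseeability = 3 is roughly 0.75
plogis(-3.147 + 1.424 * 3)
```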

fit <- glm(
  data = hind,
  family = binomial,
  pc ~ 1 + foreseeability
)

tidy(fit) |> 
  mutate(
    estimate = round(estimate, 3),
    std.error = round(std.error, 3),
    statistic = round(statistic, 3),
    p.value = round(p.value, 4)
  ) |>
  gt() |>
  tab_header(title = "Logistic Regression Results")
Logistic Regression Results
term estimate std.error statistic p.value
(Intercept) -3.147 0.615 -5.120 0
foreseeability 1.424 0.246 5.792 0
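The slope is on the log-odds scale, so exponentiating gives an odds ratio. A quick sketch using the estimates from the table above:

```r
# estimates from the table above
b0 <- -3.147
b1 <- 1.424

# each 1-unit increase in foreseeability multiplies the odds of
# finding probable cause by roughly 4
exp(b1)  # ~4.15

# the same quantities can come straight from the model object,
# with profile-likelihood CIs:
# exp(coef(fit)); exp(confint(fit))
```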

Visualizing the Model

Step 1: Prepare Prediction Data

We use predict() along with post-processing strategies from Gavin Simpson’s Confidence intervals for GLMs to prepare the data for plotting.

# Define the new data
nd <- tibble(foreseeability = seq(from = 0, to = 5, length.out = 100))

p <- predict(fit, newdata = nd, type = "link", se.fit = TRUE) |> 
  data.frame() |>  
  mutate(
    # 95% Wald interval on the link (log-odds) scale
    ll = fit - 1.96 * se.fit,
    ul = fit + 1.96 * se.fit
  ) |>  
  select(-residual.scale, -se.fit) |>  
  # back-transform fit, ll, and ul to the probability scale
  mutate(across(everything(), plogis)) |> 
  bind_cols(nd)

glimpse(p)
Rows: 100
Columns: 4
$ fit            <dbl> 0.04122411, 0.04416174, 0.04729837, 0.05064599, 0.05421…
$ ll             <dbl> 0.01272662, 0.01397883, 0.01535163, 0.01685614, 0.01850…
$ ul             <dbl> 0.1254264, 0.1308657, 0.1365095, 0.1423627, 0.1484303, …
$ foreseeability <dbl> 0.00000000, 0.05050505, 0.10101010, 0.15151515, 0.20202…
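The point of working on the link scale is that back-transforming the interval endpoints with `plogis()` guarantees they stay inside (0, 1); a Wald interval built directly on the response scale can escape those bounds. A sketch of the comparison, reusing `fit`, `nd`, and `p` from above:

```r
# interval built naively on the probability scale
p_resp <- predict(fit, newdata = nd, type = "response", se.fit = TRUE)
range(p_resp$fit - 1.96 * p_resp$se.fit)  # lower bound can dip below 0

# link-scale interval, back-transformed
range(p$ll)  # always strictly positive
```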

Step 2: Basic Fitted Line Plot

Here’s a conventional line plot for our logistic regression model:

p1 <- p |>  
  ggplot(aes(x = foreseeability, y = fit)) +
  geom_ribbon(
    aes(ymin = ll, ymax = ul),
    fill = "#ff6361",
    alpha = 0.5
  ) +
  geom_line(color = "white", linewidth = 1) +
  scale_y_continuous(
    "Probability of finding probable cause", 
    expand = c(0, 0), 
    limits = c(0, 1)
  ) +
  scale_x_continuous("Foreseeability (1-5)") +
  theme_emw_dark() +
  labs(
    title = "Logistic Regression: Foreseeability → Probable Cause",
    subtitle = "Shaded region shows 95% confidence interval"
  )

p1

The fitted line is in white, and the semitransparent ribbon marks the 95% confidence interval. The plot shows that higher perceived foreseeability is associated with a greater probability of finding probable cause.

save_plot(p1, "plots/logistic_basic.png", bg = "#282a36")

Step 3: Add Data with Dot Plots

To add the raw data to our plot, we can use stat_dots() from the ggdist package. Rather than the vertical jittering recommended by Gelman and colleagues (2020), stat_dots() stacks the individual observations along the top and bottom edges of the plot, which avoids overplotting without perturbing the data.

p2 <- p |> 
  ggplot(aes(x = foreseeability)) +
  geom_ribbon(
    aes(ymin = ll, ymax = ul),
    alpha = 0.5, 
    fill = "#ff8281"
  ) +
  geom_line(aes(y = fit), color = "white", linewidth = 1.5) +
  stat_dots(
    data = hind |> mutate(probcause = factor(probcause, levels = c("1", "2"))),
    aes(
      y = pc, 
      side = ifelse(pc == 0, "top", "bottom"),
      color = probcause
    ),
    scale = 0.4, 
    shape = 19
  ) +
  scale_color_manual(
    "Probable Cause", 
    values = c("#50FA7B", "#ff6361"),
    labels = c("Yes", "No")
  ) +
  scale_x_continuous("Foreseeability (1-5)") +
  scale_y_continuous(
    "Probability of probable cause",
    expand = c(0, 0)
  ) +
  theme_emw_dark(base_size = 18) +
  theme(
    panel.grid = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 28, face = "bold"),
    plot.subtitle = element_text(size = 18),
    axis.title = element_text(size = 20),
    axis.text = element_text(size = 16)
  ) +
  labs(
    title = "Logistic Regression with Raw Data",
    subtitle = "Dots show individual observations; ribbon shows 95% CI"
  )

p2

This visualization is much more informative because:

  1. The ribbon shows model uncertainty (95% CI)
  2. The dots show the actual data distribution
  3. The dot positions (top vs bottom) indicate the binary outcome
  4. The color coding reinforces the outcome grouping
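To report specific numbers off the curve, e.g. the predicted probabilities at the endpoints of the foreseeability scale, a short sketch reusing `fit` from above:

```r
# predicted probability of finding probable cause at the
# lowest and highest foreseeability ratings
predict(
  fit,
  newdata = tibble(foreseeability = c(1, 5)),
  type = "response"
)
```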

To save the plot as a PNG with proper sizing, use:

# This is the key fix - set showtext DPI to match ggsave DPI
showtext_opts(dpi = 300)

ggsave(
  filename = "plots/logistic_dotplot.png",
  plot = p2,
  device = ragg::agg_png,
  dpi = 300,
  width = 10,
  height = 7,
  units = "in",
  bg = "#282a36"
)

Summary

Key takeaways for visualizing logistic regression:

  1. Always show uncertainty — use ribbons or error bars to display confidence intervals
  2. Include the raw data — dot plots help viewers understand the underlying distribution
  3. Use color strategically — differentiate outcome groups clearly
  4. Consider the logit scale — remember that the model operates on log-odds, but we display probabilities for interpretability