Logistic Regression Viz

This is an attempt to create logistic regression plots that don’t suck to look at. Using data from APLS 2024.

Acknowledgements

This is based on this very helpful and informative blogpost by Solomon Kurz. I am just trying to replicate it and make it work with my data.

Load Data

library(broom)
library(dplyr)
library(ggdist)
library(ggtext)
library(ggplot2)
library(gt)
library(showtext)

data <- read.csv("data/lay-vig2.csv")

hind <- data |> 
  hind_scoring() |> 
  mutate(sample = "juror") |> 
  janitor::clean_names() |> 
  select(
    sample, condition, probcause, 
    prob_drugs, prob_gun, prob_finance, prob_nothing, prob_guilt, 
    knowitallalong, difficultpredict, clearvivid, foreseeable, 
    worse_error, foreseeability, decisioneval, age, gender, race, relevance, 
    starts_with("believable_"), starts_with("relevant_")
  )
  
glimpse(hind)
Rows: 164
Columns: 29
$ sample           <chr> "juror", "juror", "juror", "juror", "juror", "juror",…
$ condition        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ probcause        <int> 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1,…
$ prob_drugs       <int> 10, 10, 20, 80, 60, 70, 70, 50, 25, 20, 40, 40, 20, 8…
$ prob_gun         <int> 0, 30, 5, 9, 15, 10, 10, 10, 0, 5, 20, 20, 10, 0, 20,…
$ prob_finance     <int> 0, 0, 5, 1, 5, 10, 0, 10, 0, 2, 10, 20, 20, 0, 0, 2, …
$ prob_nothing     <int> 90, 60, 70, 10, 20, 10, 20, 30, 75, 73, 30, 20, 50, 2…
$ prob_guilt       <dbl> 10, 40, 30, 90, 80, 90, 80, 70, 25, 27, 70, 80, 50, 8…
$ knowitallalong   <int> 1, 2, 2, 4, 3, 1, 4, 2, 2, 3, 3, 4, 4, 4, 4, 1, 1, 2,…
$ difficultpredict <int> 5, 2, 4, 2, 2, 4, 2, 4, 5, 5, 5, 2, 2, 4, 2, 3, 5, 2,…
$ clearvivid       <int> 1, 2, 2, 4, 5, 3, 4, 1, 2, 4, 2, 4, 3, 4, 3, 3, 1, 4,…
$ foreseeable      <int> 1, 2, 2, 4, 5, 2, 4, 2, 2, 4, 3, 4, 3, 4, 3, 2, 3, 3,…
$ worse_error      <int> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,…
$ foreseeability   <dbl> 1.00, 1.75, 1.75, 3.25, 3.50, 1.75, 3.25, 1.50, 1.75,…
$ decisioneval     <dbl> 1.2, 5.0, 2.8, 5.6, 6.4, 6.8, 5.4, 6.0, 1.8, 4.0, 4.8…
$ age              <int> 37, 31, 31, 34, 27, 27, 33, 40, 44, 58, 29, 24, 25, 2…
$ gender           <int> 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1,…
$ race             <int> 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1,…
$ relevance        <int> 9, 30, 21, 38, 40, 37, 32, 34, 8, 16, 42, 36, 25, 37,…
$ believable_1     <int> 9, 7, 6, 9, 8, 9, 6, 8, 8, 8, 4, 5, 7, 2, 9, 7, 2, 6,…
$ believable_2     <int> 5, 7, 4, 9, 7, 8, 7, 7, 9, 9, 7, 9, 4, 4, 8, 4, 2, 8,…
$ believable_3     <int> 5, 7, 3, 9, 7, 9, 8, 7, 7, 8, 6, 9, 7, 8, 8, 5, 2, 9,…
$ believable_4     <int> 2, 7, 1, 5, 4, 9, 3, 7, 1, 3, 5, 9, 3, 6, 1, 3, 1, 5,…
$ believable_5     <int> 4, 7, 6, 9, 5, 5, 5, 7, 9, 9, 5, 9, 6, 7, 6, 7, 2, 5,…
$ relevant_1       <int> 1, 6, 3, 5, 7, 8, 3, 7, 1, 2, 8, 7, 7, 5, 5, 4, 1, 3,…
$ relevant_2       <int> 1, 6, 4, 6, 9, 9, 8, 6, 1, 3, 7, 9, 3, 7, 6, 3, 1, 8,…
$ relevant_3       <int> 5, 6, 6, 9, 9, 9, 9, 8, 4, 7, 9, 9, 5, 9, 9, 8, 2, 9,…
$ relevant_4       <int> 1, 6, 6, 9, 9, 9, 6, 7, 1, 2, 9, 8, 7, 8, 7, 5, 1, 3,…
$ relevant_5       <int> 1, 6, 2, 9, 6, 2, 6, 6, 1, 2, 9, 3, 3, 8, 6, 3, 1, 3,…

Fitting and Visualizing Logistic Regression Models

We’ll use the ggdist package to create more informative visualizations of our logistic regression results.

Check the Outcome Distribution

hind |> 
  count(probcause) |>  
  mutate(percent = 100 * n / sum(n))
  probcause  n  percent
1         1 94 57.31707
2         2 70 42.68293

Recode Outcome Variable to 0/1

For logistic regression, we need the outcome coded as 0 and 1:

hind <- hind |> 
  mutate(pc = ifelse(probcause == 2, 0, 1)) 

# Verify the recoding
hind |> 
  select(probcause, pc) |> 
  head()
  probcause pc
1         2  0
2         2  0
3         2  0
4         1  1
5         1  1
6         1  1

Explore the Predictor

Let’s look at how foreseeability is distributed across our two outcome groups:

hind |>  
  ggplot(aes(x = foreseeability)) +
  geom_histogram(fill = "#ff6361", color = "white", bins = 20) +
  facet_wrap(~pc, ncol = 1, labeller = labeller(pc = c("0" = "No Probable Cause", "1" = "Probable Cause"))) +
  theme_emw_dark() +
  labs(
    x = "Foreseeability",
    y = "Count"
  )
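A numeric companion to the histograms can help pin down what the panels show. This is a sketch using the `hind` data from above (`pc` and `foreseeability` as defined earlier):

```r
# summarize foreseeability within each outcome group
hind |> 
  group_by(pc) |> 
  summarise(
    mean = mean(foreseeability),
    sd = sd(foreseeability),
    n = n()
  )
```

If foreseeability is a useful predictor, the group means should separate noticeably.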

The Model

We fit a simple logistic regression with foreseeability as the sole predictor:

\[ \begin{align*} \text{probcause}_i & \sim \operatorname{Binomial}(n = 1, p_i) \\ \operatorname{logit}(p_i) & = \beta_0 + \beta_1 \text{foreseeability}_i \end{align*} \]

We use the conventional logit link to ensure the binomial probabilities are restricted within the bounds of zero and one.
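As a quick sanity check on the link function, base R's `qlogis()` and `plogis()` are the logit and inverse logit:

```r
# logit maps probabilities in (0, 1) onto the whole real line
qlogis(0.5)   # 0
qlogis(0.88)  # ~2

# the inverse logit (logistic function) maps log-odds back into (0, 1)
plogis(0)     # 0.5

# plugging in the coefficients reported below, the predicted probability
# at foreseeability = 3 is roughly 0.75
plogis(-3.147 + 1.424 * 3)
```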

fit <- glm(
  data = hind,
  family = binomial,
  pc ~ 1 + foreseeability
)

tidy(fit) |> 
  mutate(
    estimate = round(estimate, 3),
    std.error = round(std.error, 3),
    statistic = round(statistic, 3),
    p.value = round(p.value, 4)
  ) |>
  gt() |>
  tab_header(title = "Logistic Regression Results")
Logistic Regression Results
term estimate std.error statistic p.value
(Intercept) -3.147 0.615 -5.120 0
foreseeability 1.424 0.246 5.792 0
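The slope is on the log-odds scale, so exponentiating gives an odds ratio. A quick sketch using the estimates from the table above:

```r
# estimates from the table above
b0 <- -3.147
b1 <- 1.424

# each 1-unit increase in foreseeability multiplies the odds of
# finding probable cause by roughly 4
exp(b1)  # ~4.15

# the same quantities can come straight from the model object,
# with profile-likelihood CIs:
# exp(coef(fit)); exp(confint(fit))
```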

Visualizing the Model

Step 1: Prepare Prediction Data

We use predict() along with post-processing strategies from Gavin Simpson’s Confidence intervals for GLMs to prepare the data for plotting.

# Define the new data
nd <- tibble(foreseeability = seq(from = 0, to = 5, length.out = 100))

p <- predict(fit, newdata = nd, type = "link", se.fit = TRUE) |> 
  data.frame() |>  
  mutate(
    # 95% Wald interval on the link (log-odds) scale
    ll = fit - 1.96 * se.fit,
    ul = fit + 1.96 * se.fit
  ) |>  
  select(-residual.scale, -se.fit) |>  
  # back-transform fit, ll, and ul to the probability scale
  mutate(across(everything(), plogis)) |> 
  bind_cols(nd)

glimpse(p)
Rows: 100
Columns: 4
$ fit            <dbl> 0.04122411, 0.04416174, 0.04729837, 0.05064599, 0.05421…
$ ll             <dbl> 0.01272662, 0.01397883, 0.01535163, 0.01685614, 0.01850…
$ ul             <dbl> 0.1254264, 0.1308657, 0.1365095, 0.1423627, 0.1484303, …
$ foreseeability <dbl> 0.00000000, 0.05050505, 0.10101010, 0.15151515, 0.20202…
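The point of working on the link scale is that back-transforming the interval endpoints with `plogis()` guarantees they stay inside (0, 1); a Wald interval built directly on the response scale can escape those bounds. A sketch of the comparison, reusing `fit`, `nd`, and `p` from above:

```r
# interval built naively on the probability scale
p_resp <- predict(fit, newdata = nd, type = "response", se.fit = TRUE)
range(p_resp$fit - 1.96 * p_resp$se.fit)  # lower bound can dip below 0

# link-scale interval, back-transformed
range(p$ll)  # always strictly positive
```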

Step 2: Basic Fitted Line Plot

Here’s a conventional line plot for our logistic regression model:

p1 <- p |>  
  ggplot(aes(x = foreseeability, y = fit)) +
  geom_ribbon(
    aes(ymin = ll, ymax = ul),
    fill = "#ff6361",
    alpha = 0.5
  ) +
  geom_line(color = "white", linewidth = 1) +
  scale_y_continuous(
    "Probability of finding probable cause", 
    expand = c(0, 0), 
    limits = c(0, 1)
  ) +
  scale_x_continuous("Foreseeability (1-5)") +
  theme_emw_dark() +
  labs(
    title = "Logistic Regression: Foreseeability → Probable Cause",
    subtitle = "Shaded region shows 95% confidence interval"
  )

p1

The fitted line is in white, and the semitransparent ribbon marks the 95% confidence interval. The plot shows that higher perceived foreseeability is associated with a greater probability of finding probable cause.

save_plot(p1, "plots/logistic_basic.png", bg = "#282a36")

Step 3: Add Data with Dot Plots

To add the raw data to our plot, we can use stat_dots() from the ggdist package. Rather than the vertical jittering recommended by Gelman and colleagues (2020), stat_dots() stacks the individual observations along the top and bottom edges of the plot, which avoids overplotting without perturbing the data.

p2 <- p |> 
  ggplot(aes(x = foreseeability)) +
  geom_ribbon(
    aes(ymin = ll, ymax = ul),
    alpha = 0.5, 
    fill = "#ff8281"
  ) +
  geom_line(aes(y = fit), color = "white", linewidth = 1.5) +
  stat_dots(
    data = hind |> mutate(probcause = factor(probcause, levels = c("1", "2"))),
    aes(
      y = pc, 
      side = ifelse(pc == 0, "top", "bottom"),
      color = probcause
    ),
    scale = 0.4, 
    shape = 19
  ) +
  scale_color_manual(
    "Probable Cause", 
    values = c("#50FA7B", "#ff6361"),
    labels = c("Yes", "No")
  ) +
  scale_x_continuous("Foreseeability (1-5)") +
  scale_y_continuous(
    "Probability of probable cause",
    expand = c(0, 0)
  ) +
  theme_emw_dark(base_size = 18) +
  theme(
    panel.grid = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 28, face = "bold"),
    plot.subtitle = element_text(size = 18),
    axis.title = element_text(size = 20),
    axis.text = element_text(size = 16)
  ) +
  labs(
    title = "Logistic Regression with Raw Data",
    subtitle = "Dots show individual observations; ribbon shows 95% CI"
  )

p2

This visualization is much more informative because:

  1. The ribbon shows model uncertainty (95% CI)
  2. The dots show the actual data distribution
  3. The dot positions (top vs bottom) indicate the binary outcome
  4. The color coding reinforces the outcome grouping
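To report specific numbers off the curve, e.g. the predicted probabilities at the endpoints of the foreseeability scale, a short sketch reusing `fit` from above:

```r
# predicted probability of finding probable cause at the
# lowest and highest foreseeability ratings
predict(
  fit,
  newdata = tibble(foreseeability = c(1, 5)),
  type = "response"
)
```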

To save the plot as a PNG with proper sizing, use:

# This is the key fix - set showtext DPI to match ggsave DPI
showtext_opts(dpi = 300)

ggsave(
  filename = "plots/logistic_dotplot.png",
  plot = p2,
  device = ragg::agg_png,
  dpi = 300,
  width = 10,
  height = 7,
  units = "in",
  bg = "#282a36"
)

Summary

Key takeaways for visualizing logistic regression:

  1. Always show uncertainty — use ribbons or error bars to display confidence intervals
  2. Include the raw data — dot plots help viewers understand the underlying distribution
  3. Use color strategically — differentiate outcome groups clearly
  4. Consider the logit scale — remember that the model operates on log-odds, but we display probabilities for interpretability