No Data, No Worries: Fabricatr

Neil Shephard

Research Software Engineer, Department of Computer Science, University of Sheffield

ns-rse.github.io/sheffieldr-fabricatr-20240621

Motivation

  • Rarely work on my own data.
  • Literate programming, reproducible workflow
  • Scripts to clean, summarise, plot and analyse data.

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

Problem

Rarely is the final data set available!

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” - R.A. Fisher

Solution

  • Database schema/dictionary?
  • Simulate data!

Fabricatr

  • fabricatr - Imagine Your Data Before You Collect It
install.packages(c("fabricatr", "tidyverse"))
# correlate() is exported by fabricatr, so no separate package is loaded for it
library(dplyr)
library(fabricatr)
library(ggplot2)
library(tidyr)

# Fix the seed so the simulated data are reproducible
set.seed(55138)

  • Binary variables
  • Categorical/factors
  • Continuous
  • Time-Series
  • Correlated
  • Intra-Class Correlation (ICC)

Basic Example

dummy_df <- fabricatr::fabricate(
    N = 100,
    binary = fabricatr::draw_binary(
        N,
        prob = runif(N)
        ),
    binomial = fabricatr::draw_binomial(
        prob = runif(N),
        trials = 2
        ),
    uniform = runif(N),
    random_normal = rnorm(N, 38, 16)
)
dummy_df |> head()
   ID binary binomial     uniform random_normal
1 001      0        2 0.241211455      14.98915
2 002      0        2 0.002083675      28.97463
3 003      0        2 0.102645113      59.94338
4 004      1        2 0.124053932      39.85355
5 005      1        0 0.863406008      45.04457
6 006      1        0 0.655067994      59.41866

Basic Example - Summary

Characteristic   N = 100¹
binary           41 (41%)
binomial
    0            38 (38%)
    1            30 (30%)
    2            32 (32%)
uniform          0.38 (0.16, 0.70)
random_normal    38 (25, 49)
¹ n (%); Median (IQR)
Figure 1
Figure 2
Figure 3
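
The counts and quantiles in a summary table like the one above can be cross-checked with base R. The sketch below rebuilds a data frame with the same structure as dummy_df; because it runs on its own, its draws are not guaranteed to match the figures reported above, and the object names for the summaries are illustrative.

```r
library(fabricatr)

set.seed(55138)

# Same structure as the basic example
dummy_df <- fabricate(
    N = 100,
    binary = draw_binary(prob = runif(N), N = N),
    binomial = draw_binomial(prob = runif(N), trials = 2),
    uniform = runif(N),
    random_normal = rnorm(N, 38, 16)
)

# n (%) for the discrete variables
binary_n <- sum(dummy_df$binary)
binomial_counts <- table(dummy_df$binomial)
binomial_pct <- round(100 * prop.table(binomial_counts))

# Median (IQR) for the continuous variables
uniform_q <- quantile(dummy_df$uniform, probs = c(0.25, 0.5, 0.75))
normal_q <- quantile(dummy_df$random_normal, probs = c(0.25, 0.5, 0.75))

binary_n
binomial_counts
binomial_pct
uniform_q
normal_q
```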

Categorical Variables

  • 100 observations
  • Random Normal (mean = 0, sd = 1) : rnorm()
  • Categorise using draw_ordered() and setting breaks.
  • Label variables with break_labels
cat_df <- fabricatr::fabricate(
    N = 100,
    x = rnorm(N),
    ordered = fabricatr::draw_ordered(
        x,
        breaks = c(-Inf,
                   -1,
                   -0.5,
                   0,
                   0.5,
                   1,
                   Inf),
        break_labels = c("Group 1",
                         "Group 2",
                         "Group 3",
                         "Group 4",
                         "Group 5",
                         "Group 6")
    )
)
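
As a sanity check on the breaks, the same grouping can be compared with base R's cut(). This is a sketch only: group_breaks, group_labels and base_groups are illustrative names, and the choice of right = FALSE (left-closed intervals) is an assumption about how boundary values are handled, which does not matter for continuous draws where exact ties with a break are not expected.

```r
library(fabricatr)

set.seed(55138)

group_breaks <- c(-Inf, -1, -0.5, 0, 0.5, 1, Inf)
group_labels <- paste("Group", 1:6)

cat_df <- fabricate(
    N = 100,
    x = rnorm(N),
    ordered = draw_ordered(x,
                           breaks = group_breaks,
                           break_labels = group_labels)
)

# Base R comparison using the same breaks and labels
base_groups <- cut(cat_df$x,
                   breaks = group_breaks,
                   labels = group_labels,
                   right = FALSE)

# Cross-tabulate the two groupings; off-diagonal cells indicate disagreement
table(fabricatr = cat_df$ordered, base = base_groups)
```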

Categorical Variables - Summary

Characteristic   N = 100¹
ordered
    Group 1      14 (14%)
    Group 2      13 (13%)
    Group 3      19 (19%)
    Group 4      22 (22%)
    Group 5      13 (13%)
    Group 6      19 (19%)
¹ n (%)

Categorical variable with six levels.

Figure 4

Correlated Continuous Variables

EXPERIMENTAL

  • 100 observations
  • y (mean = 64, sd = 11) : using rnorm()
  • a (mean = 78, sd = 14) : rnorm(), rank correlation with y of 0.8 using correlate()
  • b (mean = 46, sd = 18) : rnorm(), rank correlation with y of -0.4 using correlate()
corr_df <- fabricatr::fabricate(
    N = 100,
    y = rnorm(N, mean = 64, sd = 11),
    a = fabricatr::correlate(given = y,
                             rho = 0.8,
                             rnorm,
                             mean = 78,
                             sd = 14),
    b = fabricatr::correlate(given = y,
                             rho = -0.4,
                             rnorm,
                             mean = 46,
                             sd = 18)
)
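
Since correlate() is marked experimental, it is worth checking the rank correlation actually achieved in a given sample. A minimal sketch, assuming Spearman's coefficient from base R's cor() is an appropriate measure of the rank correlation requested via rho; rho_a and rho_b are illustrative names, and the sample values will differ somewhat from the targets at N = 100.

```r
library(fabricatr)

set.seed(55138)

corr_df <- fabricate(
    N = 100,
    y = rnorm(N, mean = 64, sd = 11),
    a = correlate(given = y, rho = 0.8, rnorm, mean = 78, sd = 14),
    b = correlate(given = y, rho = -0.4, rnorm, mean = 46, sd = 18)
)

# Spearman (rank) correlations achieved in this particular sample
rho_a <- cor(corr_df$y, corr_df$a, method = "spearman")
rho_b <- cor(corr_df$y, corr_df$b, method = "spearman")

round(c(a = rho_a, b = rho_b), 2)
```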

Correlated Continuous Variables - Summary

Characteristic   N = 100¹
y                64 (57, 70)
a                78 (66, 86)
b                47 (38, 59)
¹ Median (IQR)

Correlated Discrete Variables

EXPERIMENTAL

corr_discrete_df <- fabricatr::fabricate(
    N = 100,
    q1 = fabricatr::draw_binomial(prob = 0.38,
                                  trials = 10,
                                  N = N),
    q2 = fabricatr::correlate(given = q1,
                             rho = 0.74,
                             fabricatr::draw_binomial,
                             prob = 0.63,
                             trials = 10),
    q3 = fabricatr::correlate(given = q2,
                             rho = -0.89,
                             fabricatr::draw_binomial,
                             prob = 0.79,
                             trials = 10)
)

Correlated Discrete Variables - Summary

Characteristic   N = 100¹
q1               4 (3, 5)
q2
    2            1 (1.0%)
    3            2 (2.0%)
    4            7 (7.0%)
    5            21 (21%)
    6            23 (23%)
    7            33 (33%)
    8            10 (10%)
    9            1 (1.0%)
    10           2 (2.0%)
q3
    5            3 (3.0%)
    6            9 (9.0%)
    7            25 (25%)
    8            27 (27%)
    9            27 (27%)
    10           9 (9.0%)
¹ Median (IQR); n (%)

Tabulation of q1 v q2 (rows: q1, columns: q2)
q1   2   3   4   5   6   7   8   9  10
 0   0   0   2   0   0   0   0   0   0
 1   0   0   3   1   1   0   0   0   0
 2   1   1   1   3   5   1   0   0   0
 3   0   1   1  10   8   5   0   0   0
 4   0   0   0   5   5  10   2   0   0
 5   0   0   0   2   1  11   4   0   0
 6   0   0   0   0   2   3   1   1   0
 7   0   0   0   0   1   1   3   0   0
 8   0   0   0   0   0   2   0   0   1
 9   0   0   0   0   0   0   0   0   1

Tabulation of q2 v q3 (rows: q2, columns: q3)
q2   5   6   7   8   9  10
 2   0   0   0   0   0   1
 3   0   0   0   0   1   1
 4   0   0   0   0   3   4
 5   0   0   0   5  13   3
 6   0   0   4  10   9   0
 7   0   5  15  12   1   0
 8   2   2   6   0   0   0
 9   0   1   0   0   0   0
10   1   1   0   0   0   0
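
Cross-tabulations of this kind can be produced with base R's table(). The sketch below regenerates the discrete example on its own, so the counts will not necessarily match those shown above; q1_q2 and q2_q3 are illustrative names.

```r
library(fabricatr)

set.seed(55138)

corr_discrete_df <- fabricate(
    N = 100,
    q1 = draw_binomial(prob = 0.38, trials = 10, N = N),
    q2 = correlate(given = q1, rho = 0.74,
                   draw_binomial, prob = 0.63, trials = 10),
    q3 = correlate(given = q2, rho = -0.89,
                   draw_binomial, prob = 0.79, trials = 10)
)

# Rows are the first variable, columns the second
q1_q2 <- with(corr_discrete_df, table(q1, q2))
q2_q3 <- with(corr_discrete_df, table(q2, q3))

q1_q2
q2_q3
```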

Intra Class Correlation (ICC)

schools_data <- fabricatr::fabricate(
  primary_schools = fabricatr::add_level(N = 20,
                      ps_quality = runif(N, 1, 10)),
  secondary_schools = fabricatr::add_level(N = 15,
                        ss_quality = runif(N, 1, 10),
                        nest = FALSE),
  students = fabricatr::link_levels(N = 1500,
               by = fabricatr::join_using(primary_schools,
                       secondary_schools),
               SAT_score = 800 + 13 * ps_quality + 26 * ss_quality +
                           rnorm(N, 0, 50)
             )
)

Intra Class Correlation (ICC) - Summary

lm(SAT_score ~ ps_quality + ss_quality, data = schools_data)

Call:
lm(formula = SAT_score ~ ps_quality + ss_quality, data = schools_data)

Coefficients:
(Intercept)   ps_quality   ss_quality  
     800.03        12.89        25.82  

Intra Class Correlation (ICC)

schools_data <- fabricatr::fabricate(
  primary_schools = fabricatr::add_level(N = 20,
                      ps_quality = runif(N, 1, 10)),
  secondary_schools = fabricatr::add_level(N = 15,
                        ss_quality = runif(N, 1, 10),
                        nest = FALSE),
  students = fabricatr::link_levels(N = 1500,
               by = fabricatr::join_using(primary_schools,
                               secondary_schools,
                               rho = 0.5),
               SAT_score = 800 + 13 * ps_quality + 26 * ss_quality +
                           rnorm(N, 0, 50)
             )
)
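
One way to see what rho does is to compute the correlation between the two school-quality variables across the linked students. Note the substitution in this sketch: it passes the quality variables (ps_quality, ss_quality) to join_using() rather than the level names, which is how I understand the fabricatr documentation for cross-classified data; treat that as an assumption to verify. quality_cor is an illustrative name.

```r
library(fabricatr)

set.seed(55138)

schools_data <- fabricate(
  primary_schools = add_level(N = 20,
                              ps_quality = runif(N, 1, 10)),
  secondary_schools = add_level(N = 15,
                                ss_quality = runif(N, 1, 10),
                                nest = FALSE),
  students = link_levels(
    N = 1500,
    # Assumption: correlate on the quality variables, not the level IDs
    by = join_using(ps_quality, ss_quality, rho = 0.5),
    SAT_score = 800 + 13 * ps_quality + 26 * ss_quality +
                rnorm(N, 0, 50)
  )
)

# Correlation between primary and secondary school quality across students
quality_cor <- cor(schools_data$ps_quality, schools_data$ss_quality)
quality_cor
```

With only 20 primary and 15 secondary schools, the sample correlation can sit some way from the requested 0.5.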

Time Series
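
The slides do not include code for this section. One possible approach, offered purely as a sketch, is a linear trend plus a random walk built from base R functions inside fabricate(); the variable names (month, trend, value) and all parameter values are illustrative and not from the talk.

```r
library(fabricatr)

set.seed(55138)

ts_df <- fabricate(
    N = 24,
    month = 1:N,                       # time index
    trend = 0.5 * month,               # deterministic linear trend
    value = 100 + trend +
            cumsum(rnorm(N, mean = 0, sd = 2))  # random-walk noise
)

head(ts_df)
```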

Summary

  • fabricatr is powerful but easy to use.
  • Simulate data and set up the analytical workflow (tables/graphs/models).
  • Not useful for cleaning/tidying real data (~80% of the work!).
  • Needs a good database schema/data dictionary.
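
To illustrate the last point, a data dictionary can be translated fairly directly into a fabricate() call. The dictionary below is entirely hypothetical (variable names, types and ranges are invented for illustration), and the sketch shows only one way of doing the mapping by hand.

```r
library(fabricatr)

set.seed(55138)

# Hypothetical data dictionary: variable name, type and permitted values
data_dictionary <- data.frame(
    variable = c("age", "sex", "treated"),
    type     = c("integer", "categorical", "binary"),
    values   = c("18-90", "Female/Male", "0/1")
)

# Simulated data following the dictionary
trial_df <- fabricate(
    N = 50,
    age = round(runif(N, min = 18, max = 90)),
    sex = sample(c("Female", "Male"), size = N, replace = TRUE),
    treated = draw_binary(prob = 0.5, N = N)
)

# Confirm every dictionary variable is present in the simulated data
all(data_dictionary$variable %in% names(trial_df))
str(trial_df)
```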