PMx-01: Foundations – The Leap from Sample to Population

Summary
Establishing the fundamental distinctions: Population versus Sample, and Probability versus Statistical Inference. Why modeling is reverse engineering, and why code is not a model.

If you are a pharmacometrician, you are likely comfortable with differential equations. You know how to describe the rate of change of a drug’s concentration in a compartment ($dA/dt = -k A$). But describing the biology is only half the battle.

The moment you move from a single theoretical curve to real-world data, you hit a wall: Variability.

Data we collect are unpredictable due to random variation. We cannot predict exactly what will happen to a specific patient, but we can predict the probability of different outcomes. This variation comes from two main sources:

  1. Biological Variation: If you give the same dose to 100 patients, you get 100 different profiles.
  2. Measurement Error: Even if you measure the same sample twice, the assay gives different numbers.
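The two sources above can be sketched in a small simulation. Python is used here purely for illustration (the series itself targets NONMEM), and every number (dose, volume, SDs) is a hypothetical value chosen for the sketch, not taken from the text:

```python
import math
import random

random.seed(1)

def simulate_concentration(dose=100.0, v=10.0, t=2.0):
    """One observed concentration at time t (illustrative units).

    Biological variation: each patient gets their own clearance,
    drawn around a typical value of 5 L/h.
    Measurement error: the assay adds noise on top of the true value.
    """
    cl = random.gauss(5.0, 1.0)                    # biological variation (L/h)
    k = cl / v                                     # elimination rate constant (1/h)
    true_conc = (dose / v) * math.exp(-k * t)
    return true_conc + random.gauss(0.0, 0.3)      # measurement error

# Same dose to 100 patients -> 100 different observed concentrations
profiles = [simulate_concentration() for _ in range(100)]

# Measuring the "same sample" twice -> two different assay readings
random.seed(2)
true_conc = (100.0 / 10.0) * math.exp(-(5.0 / 10.0) * 2.0)
assay_1 = true_conc + random.gauss(0.0, 0.3)
assay_2 = true_conc + random.gauss(0.0, 0.3)
```

Even with every structural parameter fixed, no two simulated observations agree, which is exactly the point.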

Crucially, a further source of randomness comes from the way we collect the data. We cannot study the entire population, because it is too large and our time and resources are constrained. Instead, we select a representative random sample and use it to infer properties of the population.

This is why we need statistics. Statistics is the science that deals with random variation and the uncertainty associated with it. It does NOT remove the uncertainty, but enables us to quantify it (e.g., using confidence intervals). As Adrian Dunne — a leading educator in pharmacometric statistics whose lecture notes heavily inspire this series — emphasizes, “Modeling is about writing a model for the probability distribution of the population from which you selected your data.”

In this first post of our series, we are not going to write any NONMEM code. Instead, we are going to fix the most important bug in modeling: the one in our heads. We need to clearly define the relationship between the data we have and the truth we want.

The Core Concept: Population vs. Sample

The fundamental problem of statistics involves two entities:

The Population is vast and unknowable; the Sample is the small subset we actually observe. We use the Sample to infer the Population's parameters.
  1. The Population (The Truth): This is the entire collection of subjects or items we are interested in. It is often infinite or conceptually vast (e.g., “all current and future patients with Disease X”). The population is characterized by Parameters (like the true mean clearance $\theta_{CL}$ or the true variance $\omega^2$). We never observe the population directly, and we never know the true parameters.
  2. The Sample (The Evidence): This is the subset of the population that we actually measure in our clinical trial. We collect data ($y$) from these subjects. From this data, we calculate Estimates (like $\hat{\theta}_{CL}$). The sample is all we have.

The Two Roads: Probability vs. Inference

Confusion often arises because we travel between these two entities in opposite directions. Adrian defines this distinction clearly:

The two roads of statistics: Probability deduces from Population to Sample; Inference induces from Sample to Population. Our job as modelers is the reverse direction.

Probability (Deduction)

  • Direction: Population $\rightarrow$ Sample
  • The Scenario: Imagine we know the “Truth.” We know exactly that the population mean clearance is 5 L/h and the standard deviation is 1 L/h.
  • The Question: “If I take a random patient from this population, what is the probability that their clearance is greater than 7 L/h?”
  • Nature: This is predicting data from a known model. It is mathematically precise.
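This forward calculation can be done in a few lines. One assumption is added here that the text does not state: that clearance is normally distributed (the text gives only the mean and SD). Python is used for illustration:

```python
import math

# Known population: mean clearance 5 L/h, SD 1 L/h.
# Assuming clearance is normally distributed (an added assumption),
# the probability a random patient has CL > 7 L/h is the upper tail
# beyond z = (7 - 5) / 1 = 2.
mu, sigma, threshold = 5.0, 1.0, 7.0
z = (threshold - mu) / sigma
p = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z) for a standard normal
print(f"P(CL > 7 L/h) = {p:.4f}")       # about 0.0228
```

Because the population is fully specified, the answer is exact: this is deduction, with no uncertainty about the model itself.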

Statistical Inference (Induction)

  • Direction: Sample $\rightarrow$ Population
  • The Scenario: We don’t know the truth. We only have data from 50 patients, and their average clearance is 5.2 L/h.
  • The Question: “Based on these 50 patients, what can I say about the true population mean? Is it 5 L/h? Is it 6 L/h?”
  • Nature: This is Reverse Engineering. We are trying to reconstruct the invisible “Truth” based on the visible “Evidence.”

Key Takeaway:
Our job as modelers is Statistical Inference. We are trying to guess the blueprints of the machine (the Population Parameters $\theta$) by looking at the products it churns out (the Sample Data $y$). Because we only see a small sample, our inference is always uncertain. We express this uncertainty using confidence intervals and standard errors.
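A minimal sketch of quantifying that uncertainty, using the 50-patient example above. The text gives only n = 50 and a mean of 5.2 L/h; the sample SD of 1.0 L/h and the normal approximation are assumptions added here for illustration:

```python
import math

# Hypothetical trial from the text: n = 50 patients, mean CL = 5.2 L/h.
# The sample SD is NOT given in the text; 1.0 L/h is assumed.
n, mean, sd = 50, 5.2, 1.0

se = sd / math.sqrt(n)                 # standard error of the mean
z = 1.96                               # ~95% normal quantile
ci = (mean - z * se, mean + z * se)
print(f"95% CI for population mean CL: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Under these assumptions the interval runs from roughly 4.92 to 5.48 L/h: it contains 5 L/h, so the data are consistent with a true mean of 5, but the uncertainty is made explicit rather than removed.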

A Critical Distinction: $\theta$ vs. $\hat{\theta}$

Throughout this blog series, you will see a rigorous distinction in notation that many tutorials gloss over. You must separate the “God’s eye view” from the “Human view”:

  • $\theta$ (Theta): The True Value. It is a fixed, unknown constant of nature. It exists, but we will never know it.
  • $\hat{\theta}$ (Theta Hat): The Estimate. This is a random variable. If you ran your clinical trial again with different patients, you would get a different $\hat{\theta}$.

We use $\hat{\theta}$ to estimate $\theta$. We want our $\hat{\theta}$ to be close to $\theta$, but they are not the same thing. One is a target; the other is a dart thrown at that target. A good estimator is like a skilled dart player: the darts cluster tightly (low variance) around the bullseye (low bias). We will formalize exactly what “good” means in PMx-03.

The animation below shows this in action — each trial draws a new sample and computes a new $\hat{\theta}$. Over many trials, these estimates form the Sampling Distribution — the probability distribution of the estimator itself. It tells us how much our estimate would vary if we repeated the experiment many times:

Repeated sampling from a population builds up the Sampling Distribution of θ̂. Each estimate is different — the estimate is itself a random variable.
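The same repeated-sampling idea can be reproduced in a short simulation. Here we take the "God's eye view" (never available in practice) by fixing the true population, then rerun the whole trial many times; all values are illustrative, and Python stands in for the animation:

```python
import math
import random
import statistics

random.seed(42)

TRUE_MEAN, TRUE_SD = 5.0, 1.0   # the "truth", known only in simulation

def run_trial(n=50):
    """One clinical trial: sample n patients, return the estimate theta-hat."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    return statistics.mean(sample)

# Repeat the entire trial 2000 times; the estimates form the
# sampling distribution of the estimator.
estimates = [run_trial() for _ in range(2000)]

spread = statistics.stdev(estimates)    # empirical SD of theta-hat
theory = TRUE_SD / math.sqrt(50)        # theoretical SE = sigma / sqrt(n)
print(f"empirical SD of estimates: {spread:.3f}, theory: {theory:.3f}")
```

Every trial yields a different $\hat{\theta}$, and their spread matches the theoretical standard error $\sigma/\sqrt{n}$, which is precisely what a standard error reported by an estimation run is trying to quantify.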

Philosophy: Code is NOT a Model

Before we move to probability distributions in the next post, there is one final philosophical pillar to establish.

Adrian Dunne is famously strict about this: “That is a piece of computer code, not a model.”

Many people point to a NONMEM control stream or a Monolix file and say, “This is my model.” Adrian argues this is a dangerous mindset:

```mermaid
flowchart LR
    classDef math fill:#eaf2f8,stroke:#3498db,color:#2c3e50,stroke-width:2px;
    classDef code fill:#fdedec,stroke:#e74c3c,color:#2c3e50,stroke-width:2px;
    classDef warn fill:#fef9e7,stroke:#f1c40f,color:#2c3e50,stroke-width:2px;

    A["Step 1: Mathematical Model<br/>(Equations on Paper)"]:::math
    B["Step 2: Translation<br/>(Implement in Software)"]:::code
    C["Step 3: Verification<br/>(Does Code Match Math?)"]:::warn

    A -->|"Define biology<br/>& statistics"| B
    B -->|"Check against<br/>original equations"| C
    C -->|"Iterate"| A
```
  • Step 1: Modeling. You define the mathematical equations that describe the biology and the statistical distributions (e.g., $C(t) = \frac{D}{V}e^{-kt} + \epsilon$, where $\epsilon$ represents the random measurement error). You do this on paper, using mathematics (the universal language), independent of any software.
  • Step 2: Coding. You translate that mathematical model into a language a computer can understand (like NM-TRAN or MLXTRAN).
  • Step 3: Verification. You check that the code faithfully reproduces the equations from Step 1, and iterate until it does.

Why does this matter?
If you skip Step 1 and go straight to coding, you often fail to understand what you are actually asking the computer to do. You become a “coder,” not a “modeler.” By writing the model down mathematically first, you clarify your assumptions about the population and the error structure before you ever type $PK or $ERROR.
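The loop can be made concrete with the error model from Step 1 above. Python stands in here for NM-TRAN/MLXTRAN, and all parameter values (dose, volume, rate constant, error SD) are illustrative choices, not values from the text:

```python
import math
import random

random.seed(0)

# Step 1 (on paper): C(t) = (D / V) * exp(-k * t) + eps,  eps ~ N(0, sigma^2)
# Step 2 (translation): the same model in code.
def model(t, dose=100.0, v=10.0, k=0.5, sigma=0.2):
    pred = (dose / v) * math.exp(-k * t)   # structural part
    eps = random.gauss(0.0, sigma)         # residual (assay) error
    return pred + eps

# Step 3 (verification): with the error switched off, the code must
# reproduce the paper equation exactly.
t = 2.0
assert math.isclose(model(t, sigma=0.0), (100.0 / 10.0) * math.exp(-0.5 * t))

obs = model(t)   # one simulated observation, now with residual error
```

The point of the verification step is that the check is against the equations you wrote first, not against whatever the code happens to output: the mathematics is the model, the code is its translation.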

Summary

  • Statistics handles the variability and uncertainty inherent in biological data.
  • Probability goes from Population to Sample (Predicting data).
  • Inference goes from Sample to Population (Estimating parameters).
  • The Model is the mathematical description of the population; the Code is just the tool to fit it.

What’s Next

To perform inference, we first need to master the “Forward” direction. In the next post, we will dive into the building blocks of these models: Probability Distributions, from the simple Binomial to the critical Multivariate Normal.