Observation Models

Two observation models are provided, both of which related disease incidence in the simulation model to count data, but differ in how they treat the denominator.

Observations from the entire population

The PopnCounts class provides a generic observation model for relating disease incidence to count data where the denominator is assumed to be the population size \(N\) (i.e., the denominator is assumed constant and is either not known or is known to be the population size \(N\)).

This observation model assumes that the relationship between observed case counts \(y_t\) and disease incidence in particle \(x_t\) follow a negative binomial distribution with mean \(\mathbb{E}[y_t]\) and dispersion parameter \(k\).

\[ \begin{align}\begin{aligned}\mathcal{L}(y_t \mid x_t) &\sim NB(\mathbb{E}[y_t], k)\\\mathbb{E}[y_t] &= (1 - p_\mathrm{inf}) \cdot bg_\mathrm{obs} + p_\mathrm{inf} \cdot p_\mathrm{obs} \cdot N\\\operatorname{Var}[y_t] &= \mathbb{E}[y_t] + \frac{\left(\mathbb{E}[y_t]\right)^2}{k} \ge bg_\mathrm{var}\end{aligned}\end{align} \]

The observation model parameters comprise:

The background observation rate \(bg_\mathrm{obs}\);
The variance in the background signal \(bg_\mathrm{var}\) (the minimum variance);
The probability of observing an infected individual \(p_\mathrm{obs}\); and
The dispersion parameter \(k\), which controls the relationship between the mean \((\mathbb{E}[y_t])\) and the variance; as \(k \to \infty\) the distribution approaches the Poisson, as \(k \to 0\) the distribution becomes more and more over-dispersed with respect to the Poisson.

Probability mass functions for the PopnCounts observation model. — Probability mass functions for the `PopnCounts` observation model with different values of the dispersion parameter \(k\), for the expected value \(\mathbb{E}[y_t] = 200\) (vertical dashed line).

Incomplete observations

An observation may also define a detection probability ('pr_detect'), which defines how “complete” the observation is expected to be. This can be used to account for observed counts that are incomplete due to, e.g., delays in ascertainment and/or reporting. For example, to identify an observation as likely representing only 25% of the true count, set obs['pr_detect'] = 0.25. The detection probability \(p_\mathrm{det}\) has the following effect on the observation model:

\[ \begin{align}\begin{aligned}\mathcal{L}(y_t \mid x_t) &\sim NB(p_\mathrm{det} \cdot \mathbb{E}[y_t], k)\\\operatorname{Var}[y_t] &= p_\mathrm{det} \cdot \mathbb{E}[y_t] + \frac{\left(p_\mathrm{det} \cdot \mathbb{E}[y_t]\right)^2}{k}\end{aligned}\end{align} \]

Observations from population samples

The SampleFraction class provides a generic observation model for relating disease incidence to count data where the denominator is reported and may vary (e.g., weekly counts of all patients and the number that presented with influenza-like illness), and where the background signal is not a fixed value but rather a fixed proportion.

For this observation model, each observed value \((y_t)\) is the fraction of all patients \((N_t)\) that were identified as cases \((c_t)\). This observation model assumes that the relationship between the observed fraction \(y_t\), the denominator \(N_t\), and disease incidence in particle \(x_t\) follows a Beta-binomial distribution with probability of success \(\mathbb{E}[y_t]\) and dispersion parameter \(k\).

\[ \begin{align}\begin{aligned}y_t &= \frac{c_t}{N_t}\\\mathcal{L}(y_t \mid x_t) &= \mathcal{L}(c_t \mid N_t, x_t)\\\mathbb{E}[y_t] &= bg_\mathrm{obs} + p_\mathrm{inf} \cdot \kappa_\mathrm{obs}\\\operatorname{Var}[y_t] &= \frac{p \cdot (1 - p)}{N_t} \cdot \left[ 1 + \frac{N_t - 1}{k + 1} \right] \ge bg_\mathrm{var}\\\mathcal{L}(c_t \mid N_t, x_t) &\sim \operatorname{BetaBin}(N_t, \mathbb{E}[y_t], k)\\\mathbb{E}[c_t] &= N_t \cdot \mathbb{E}[y_t]\\\operatorname{Var}[c_t] &= N_t \cdot p \cdot (1 - p) \cdot \left[ 1 + \frac{N_t - 1}{k + 1} \right]\end{aligned}\end{align} \]

The shape parameters \(\alpha\) and \(\beta\) satisfy:

\[ \begin{align}\begin{aligned}p &= \mathbb{E}[y_t] = \frac{\alpha}{\alpha + \beta}\\k &= \alpha + \beta\\\implies \alpha &= p \cdot k\\\implies \beta &= (1 - p) \cdot k\end{aligned}\end{align} \]

For details, see “Estimation of parameters in the beta binomial model”, Tripathi et al., J Ann Inst Statist Math 46(2): 317–331, 1994 (DOI: 10.1007/BF01720588).

The observation model parameters comprise:

The background case fraction \(bg_\mathrm{obs}\);
The variance in the background case fraction \(bg_\mathrm{var}\) (the minimum variance);
The slope \(\kappa_\mathrm{obs}\) of the relationship between disease incidence and the proportion of cases; and
The dispersion parameter \(k\), which controls the relationship between the mean \((\mathbb{E}[y_t])\) and the variance; as \(k \to \infty\) the distribution approaches the binomial, as \(k \to 0\) the distribution becomes more and more over-dispersed with respect to the binomial.

Probability density functions for the SampleFraction observation model. — Probability mass functions for the `SampleFraction` observation model with different values of the dispersion parameter \(k\), for the expected value \(\mathbb{E}[c_t] = 200\) (vertical dashed line) where \(\mathbb{E}[y_t] = 0.02\) and \(N_t = 10,\!000\).

The PopnCounts class

class epifx.obs.PopnCounts(*args: Any, **kwargs: Any)

Generic observation model for relating disease incidence to count data where the denominator is assumed or known to be the population size.

Parameters:

obs_unit – A descriptive name for the data.
settings – The observation model settings dictionary.

The settings dictionary should contain the following keys:

observation_period: The observation period (in days).
upper_bound_as_obs: Treat upper bounds as point estimates (default: False).
pr_obs_lookup: The name of a lookup table for the observation probability \(p_\mathrm{obs}\) (default: None).
pr_obs_field: The name of the state vector field that contains the observation probability \(p_\mathrm{obs}\) (default: None).

Note

Do not define both pr_obs_lookup and pr_obs_field.

Note

If the pr_obs_lookup table contains more than one value column, each particle should be associated with a column by setting sample_values to True:

[scenario.test.lookup_tables]
pr_obs = { file = "pr-obs.ssv", sample_values = true }

[scenario.test.observations.cases]
model = "epifx.obs.PopnCounts"
pr_obs_lookup = "pr_obs"

If the lookup table contains only one value column, this can be omitted:

[scenario.test.lookup_tables]
pr_obs = "pr-obs.ssv"

[scenario.test.observations.cases]
model = "epifx.obs.PopnCounts"
pr_obs_lookup = "pr_obs"

log_llhd(ctx, snapshot, obs)

Calculate the log-likelihood \(\mathcal{l}(y_t \mid x_t)\) for the observation \(y_t\) (obs) and every particle \(x_t\).

If it is known (or suspected) that the observed value will increase in the future — when obs['incomplete'] == True — then the log-likehood \(\mathcal{l}(y > y_t \mid x_t)\) is calculated instead (i.e., the log of the survival function).

If an upper bound to this increase is also known (or estimated) — when obs['upper_bound'] is defined — then the log-likelihood \(\mathcal{l}(y_u \ge y > y_t \mid x_t)\) is calculated instead.

The upper bound can also be treated as a point estimate by setting upper_bound_as_obs = True — then the log-likelihood \(\mathcal{l}(y_u \mid x_t)\) is calculated.

from_file(filename, time_scale, year=None, time_col='to', value_col='count', ub_col=None, pr_detect_col=None)

Load count data from a space-delimited text file with column headers defined in the first line.

Parameters:

filename – The file to read.
year – Only returns observations for a specific year. The default behaviour is to return all recorded observations.
time_col – The name of the observation time column.
value_col – The name of the observation value column.
ub_col – The name of the estimated upper-bound column, optional.
pr_detect_col – The name of the column that defines a detection probability that accounts “incomplete” observations due to, e.g., delays in reporting.

Returns:

The observations data table.

The SampleFraction class

class epifx.obs.SampleFraction(*args: Any, **kwargs: Any)

Generic observation model for relating disease incidence to count data where the sample denominator is reported.

Parameters:

obs_unit – A descriptive name for the data.
settings – The observation model settings dictionary.

The settings dictionary should contain the following keys:

observation_period: The observation period (in days).
denominator: The denominator to use when generating simulated observations.
k_obs_lookup: The name of a lookup table for the disease-related increase in observation rate \(\kappa_\mathrm{obs}\) (default: None).
k_obs_field: The name of the state vector field that contains the observation rate \(\kappa_\mathrm{obs}\) (default: None).

Note

Do not define both k_obs_lookup and k_obs_field.

Note

If the k_obs_lookup table contains more than one value column, each particle should be associated with a column by setting sample_values to True:

[scenario.test.lookup_tables]
k_obs = { file = "k-obs.ssv", sample_values = true }

[scenario.test.observations.cases]
model = "epifx.obs.SampleFraction"
k_obs_lookup = "k_obs"

If the lookup table contains only one value column, this can be omitted:

[scenario.test.lookup_tables]
k_obs = "k-obs.ssv"

[scenario.test.observations.cases]
model = "epifx.obs.SampleFraction"
k_obs_lookup = "k_obs"

log_llhd(ctx, snapshot, obs): Calculate the log-likelihood \(\mathcal{l}(y_t \mid x_t)\) for the observation \(y_t\) (obs) and every particle \(x_t\).

simulate(ctx, snapshot, rng): Simulate the case fraction with respect to the default denominator.

from_file(filename, time_scale, year=None, time_col='to', value_col='cases', denom_col='patients')

Load count data from a space-delimited text file with column headers defined in the first line.

Note that returned observation values represent the fraction of patients that were counted as cases, not the absolute number of cases. The number of cases and the number of patients are recorded under the 'numerator' and 'denominator' keys, respectively.

Parameters:

filename – The file to read.
year – Only returns observations for a specific year. The default behaviour is to return all recorded observations.
time_col – The name of the observation time column.
value_col – The name of the observation value column (reported as absolute values, not fractions).
denom_col – The name of the observation denominator column.

Returns:

The observations data table.

Raises:

ValueError – If a denominator or value is negative, or if the value exceeds the denominator.