Observation Models

Two observation models are provided, both of which related disease incidence in the simulation model to count data, but differ in how they treat the denominator.

Observations from the entire population

The PopnCounts class provides a generic observation model for relating disease incidence to count data where the denominator is assumed to be the population size \(N\) (i.e., the denominator is assumed constant and is either not known or is known to be the population size \(N\)).

This observation model assumes that the relationship between observed case counts \(y_t\) and disease incidence in particle \(x_t\) follow a negative binomial distribution with mean \(\mathbb{E}[y_t]\) and dispersion parameter \(k\).

\[ \begin{align}\begin{aligned}\mathcal{L}(y_t \mid x_t) &\sim NB(\mathbb{E}[y_t], k)\\\mathbb{E}[y_t] &= (1 - p_\mathrm{inf}) \cdot bg_\mathrm{obs} + p_\mathrm{inf} \cdot p_\mathrm{obs} \cdot N\\\operatorname{Var}[y_t] &= \mathbb{E}[y_t] + \frac{\left(\mathbb{E}[y_t]\right)^2}{k} \ge bg_\mathrm{var}\end{aligned}\end{align} \]

The observation model parameters comprise:

  • The background observation rate \(bg_\mathrm{obs}\);
  • The variance in the background signal \(bg_\mathrm{var}\) (the minimum variance);
  • The probability of observing an infected individual \(p_\mathrm{obs}\); and
  • The dispersion parameter \(k\), which controls the relationship between the mean \((\mathbb{E}[y_t])\) and the variance; as \(k \to \infty\) the distribution approaches the Poisson, as \(k \to 0\) the distribution becomes more and more over-dispersed with respect to the Poisson.
Probability mass functions for the PopnCounts observation model.

Probability mass functions for the PopnCounts observation model with different values of the dispersion parameter \(k\), for the expected value \(\mathbb{E}[y_t] = 200\) (vertical dashed line).

Observations from population samples

The SampleCounts class provides a generic observation model for relating disease incidence to count data where the denominator is reported and may vary (e.g., weekly counts of all patients and the number that presented with influenza-like illness), and where the background signal is not a fixed value but rather a fixed proportion.

For this observation model, each observed value \((y_t)\) is the fraction of all patients \((N_t)\) that were identified as cases \((c_t)\). This observation model assumes that the relationship between the observed fraction \(y_t\), the denominator \(N_t\), and disease incidence in particle \(x_t\) follows a Beta-binomial distribution with probability of success \(\mathbb{E}[y_t]\) and dispersion parameter \(k\).

\[ \begin{align}\begin{aligned}y_t &= \frac{c_t}{N_t}\\\mathcal{L}(y_t \mid x_t) &= \mathcal{L}(c_t \mid N_t, x_t)\\\mathbb{E}[y_t] &= bg_\mathrm{obs} + p_\mathrm{inf} \cdot \kappa_\mathrm{obs}\\\operatorname{Var}[y_t] &= \frac{p \cdot (1 - p)}{N_t} \cdot \left[ 1 + \frac{N_t - 1}{k + 1} \right] \ge bg_\mathrm{var}\\\mathcal{L}(c_t \mid N_t, x_t) &\sim \operatorname{BetaBin}(N_t, \mathbb{E}[y_t], k)\\\mathbb{E}[c_t] &= N_t \cdot \mathbb{E}[y_t]\\\operatorname{Var}[c_t] &= N_t \cdot p \cdot (1 - p) \cdot \left[ 1 + \frac{N_t - 1}{k + 1} \right]\end{aligned}\end{align} \]

The shape parameters \(\alpha\) and \(\beta\) satisfy:

\[ \begin{align}\begin{aligned}p &= \mathbb{E}[y_t] = \frac{\alpha}{\alpha + \beta}\\k &= \alpha + \beta\\\implies \alpha &= p \cdot k\\\implies \beta &= (1 - p) \cdot k\end{aligned}\end{align} \]

For details, see “Estimation of parameters in the beta binomial model”, Tripathi et al., J Ann Inst Statist Math 46(2): 317–331, 1994 (DOI: 10.1007/BF01720588).

The observation model parameters comprise:

  • The background case fraction \(bg_\mathrm{obs}\);
  • The variance in the background case fraction \(bg_\mathrm{var}\) (the minimum variance);
  • The slope \(\kappa_\mathrm{obs}\) of the relationship between disease incidence and the proportion of cases; and
  • The dispersion parameter \(k\), which controls the relationship between the mean \((\mathbb{E}[y_t])\) and the variance; as \(k \to \infty\) the distribution approaches the binomial, as \(k \to 0\) the distribution becomes more and more over-dispersed with respect to the binomial.
Probability density functions for the SampleCounts observation model.

Probability mass functions for the SampleCounts observation model with different values of the dispersion parameter \(k\), for the expected value \(\mathbb{E}[c_t] = 200\) (vertical dashed line) where \(\mathbb{E}[y_t] = 0.02\) and \(N_t = 10,\!000\).

The PopnCounts class

class epifx.obs.PopnCounts(obs_unit, obs_period, pr_obs_ix=None)

Generic observation model for relating disease incidence to count data where the denominator is assumed or known to be the population size.

Parameters:
  • obs_unit – A descriptive name for the data.
  • obs_period – The observation period (in days).
  • pr_obs_ix – The index of the model parameter that defines the observation probability (\(p_\mathrm{obs}\)). By default, the value in the parameters dictionary is used.
expect(params, op, time, period, pr_inf, prev, curr)

Calculate the expected observation value \(\mathbb{E}[y_t]\) for every particle \(x_t\).

log_llhd(params, op, time, obs, pr_indiv, curr, hist)

Calculate the log-likelihood \(\mathcal{l}(y_t \mid x_t)\) for the observation \(y_t\) (obs) and every particle \(x_t\).

If it is known (or suspected) that the observed value will increase in the future — when obs['incomplete'] == True — then the log-likehood \(\mathcal{l}(y > y_t \mid x_t)\) is calculated instead (i.e., the log of the survival function).

If an upper bound to this increase is also known (or estimated) — when obs['upper_bound'] is defined — then the log-likelihood \(\mathcal{l}(y_u \ge y > y_t \mid x_t)\) is calculated instead.

The upper bound can also be treated as a point estimate by setting params['epifx']['upper_bound_as_obs'] = True — then the log-likelihood \(\mathcal{l}(y_u \mid x_t)\) is calculated.

llhd_in(op, time, mu, wt, y0, y1)

Return the probability mass in \([y_0, y1)\).

Parameters:
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • mu – The expected case fraction for each particle.
  • wt – The weight associated with each particle.
  • y0 – The (inclusive) minimum fraction of cases (\(y_0\)).
  • y1 – The (exclusive) maximum fraction of cases (\(y_0\)).
quantiles(op, time, mu, wt, probs)

Return the observations \(y_i\) that satisfy:

\[y_i = \inf\left\{ y \in \mathbb{N} : p_i \le \sum_i w_i \cdot \mathcal{L}(y_t \le y \mid x_t^i)\right\}\]
Parameters:
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • mu – The expected case fraction for each particle, \(\mathbb{E}(y_t)\).
  • wt – The weight associated with each particle, \(w_i\).
  • probs – The probabilities \(p_i\), which must be sorted in ascending order.
define_params(params, bg_obs, pr_obs, disp, bg_var=None)

Add observation model parameters to the simulation parameters.

Parameters:
  • bg_obs – The background signal in the data (\(bg_\mathrm{obs}\)).
  • pr_obs – The probability of observing an infected individual (\(p_\mathrm{obs}\)).
  • disp – The dispersion parameter (\(k\)).
  • bg_var – The variance of the background signal (\(bg_\mathrm{var}\)).
from_file(filename, year=None, date_col='to', value_col='count')

Load count data from a space-delimited text file with column headers defined in the first line.

Parameters:
  • filename – The file to read.
  • year – Only returns observations for a specific year. The default behaviour is to return all recorded observations.
  • date_col – The name of the observation date column.
  • value_col – The name of the observation value column.
Returns:

A list of observations, ordered as per the original file.

The SampleCounts class

class epifx.obs.SampleCounts(obs_unit, obs_period, denom, k_obs_ix=None)

Generic observation model for relating disease incidence to count data where the denominator is reported.

Parameters:
  • obs_unit – A descriptive name for the data.
  • obs_period – The observation period (in days).
  • denom – The denominator to use when calculating likelihoods and quantiles in the absence of an actual observation.
  • k_obs_ix – The index of the model parameter that defines the disease-related increase in observation rate (\(\kappa_\mathrm{obs}\)). By default, the value in the parameters dictionary is used.
expect(params, op, time, period, pr_inf, prev, curr)

Calculate the expected observation value \(\mathbb{E}[y_t]\) for every particle \(x_t\).

log_llhd(params, op, time, obs, pr_indiv, curr, hist)

Calculate the log-likelihood \(\mathcal{l}(y_t \mid x_t)\) for the observation \(y_t\) (obs) and every particle \(x_t\).

If it is known (or suspected) that the observed value will increase in the future — when obs['incomplete'] == True — then the log-likehood \(\mathcal{l}(y > y_t \mid x_t)\) is calculated instead (i.e., the log of the survival function).

If an upper bound to this increase is also known (or estimated) — when obs['upper_bound'] is defined — then the log-likelihood \(\mathcal{l}(y_u \ge y > y_t \mid x_t)\) is calculated instead.

The upper bound can also be treated as a point estimate by setting params['epifx']['upper_bound_as_obs'] = True — then the log-likelihood \(\mathcal{l}(y_u \mid x_t)\) is calculated.

llhd_in(op, time, mu, wt, y0, y1)

Return the probability mass over the interval \([y_0, y1)\).

Parameters:
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • mu – The expected fraction of all patients that are cases.
  • wt – The weight associated with each value of mu.
  • y0 – The (inclusive) minimum fraction of cases (\(y_0\)).
  • y1 – The (exclusive) maximum fraction of cases (\(y_0\)).
quantiles(op, time, mu, wt, probs)

Return the observations \(y_i\) that satisfy:

\[y_i = \inf\left\{ y \in \mathbb{N} : p_i \le \sum_i w_i \cdot \mathcal{L}(y_t \le y \mid x_t^i)\right\}\]
Parameters:
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • mu – The expected case fraction for each particle, \(\mathbb{E}(y_t)\).
  • wt – The weight associated with each particle, \(w_i\).
  • probs – The probabilities \(p_i\), which must be sorted in ascending order.
define_params(params, bg_obs, k_obs, disp, bg_var=None)

Add observation model parameters to the simulation parameters.

Parameters:
  • bg_obs – The background signal in the data (\(bg_\mathrm{obs}\)).
  • k_obs – The increase in observation rate due to infected individuals (\(\kappa_\mathrm{obs}\)).
  • disp – The dispersion parameter (\(k\)).
  • bg_var – The variance of the background signal (\(bg_\mathrm{var}\)).
from_file(filename, year=None, date_col='to', value_col='cases', denom_col='patients')

Load count data from a space-delimited text file with column headers defined in the first line.

Note that returned observation values represent the fraction of patients that were counted as cases, not the absolute number of cases. The number of cases and the number of patients are recorded under the 'numerator' and 'denominator' keys, respectively.

Parameters:
  • filename – The file to read.
  • year – Only returns observations for a specific year. The default behaviour is to return all recorded observations.
  • date_col – The name of the observation date column.
  • value_col – The name of the observation value column (reported as absolute values, not fractions).
  • denom_col – The name of the observation denominator column.
Returns:

A list of observations, ordered as per the original file.

Raises:

ValueError – If a denominator or value is negative, or if the value exceeds the denominator.

Custom observation models

All observation models should derive from the following base class:

class epifx.Obs

The base class of observation models, which defines the minimal set of methods that are required.

expect(params, op, time, period, pr_inf, prev, curr)

Return the expected observation value \(\mathbb{E}[y_t]\) for every particle \(x_t\), at one or more times \(t\).

Parameters:
  • params – The observation model parameters.
  • op – The observation model parameters dictionary.
  • time – The simulation time(s), \(t\).
  • period – The duration of the observation period (in days).
  • pr_inf – The probability of an individual becoming infected during the observation period(s), \(p_\mathrm{inf}\).
  • prev – The state vectors at the start of the observation period(s), \(x_t\).
  • curr – The state vectors at the end of the observation period(s).
log_llhd(params, op, time, obs, pr_indiv, curr, hist)

Return the log-likelihood \(\mathcal{l}(y_t \mid x_t)\) for the observation \(y_t\) and every particle \(x_t\).

Parameters:
  • params – The observation model parameters.
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • obs – An observation for the current time-step, \(y_t\).
  • pr_indiv – The probabilities of an individual not being infected and being infected during the observation period, \([1 - p_\mathrm{inf}, p_\mathrm{inf}]\).
  • curr – The particle state vectors, \(x_t\).
  • hist – The particle state histories, indexed by observation period.
llhd_in(op, time, mu, wt, y0, y1)

Return the weighted likelihood that \(y_t \in [y_0, y_1)\):

\[\sum_i w_i \cdot \mathcal{L}(y_0 \le y_t < y_1 \mid x_t^i)\]
Parameters:
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • mu – The expected case fraction for each particle, \(\mathbb{E}(y_t)\).
  • wt – The weight associated with each particle, \(w_i\).
  • y0 – The (inclusive) minimum fraction of cases, \(y_0\).
  • y1 – The (exclusive) maximum fraction of cases, \(y_1\).
quantiles(op, time, mu, wt, probs)

Return the observations \(y_i\) that satisfy:

\[y_i = \inf\left\{ y \in \mathbb{N} : p_i \le \sum_i w_i \cdot \mathcal{L}(y_t \le y \mid x_t^i)\right\}\]
Parameters:
  • op – The observation model parameters dictionary.
  • time – The current simulation time, \(t\).
  • mu – The expected case fraction for each particle, \(\mathbb{E}(y_t)\).
  • wt – The weight associated with each particle, \(w_i\).
  • probs – The probabilities \(p_i\), which must be sorted in ascending order.

Reading observations from disk

The from_file() methods provided by the observation models listed above use read_table() to read specific columns from data files. This function is intended to be a flexible wrapper around numpy.loadtxt, and should be sufficient for reading almost any whitespace-delimited data file with named columns.

epifx.obs.read_table(filename, columns, date_fmt=None, comment='#')

Load data from a space-delimited ASCII text file with column headers defined in the first non-comment line.

Parameters:
  • filename – The file to read.
  • columns – The columns to read, represented as a list of (name, type) tuples; type should either be a built-in NumPy scalar type, or datetime.date or datetime.datetime (in which case string values will be automatically converted to datetime.datetime objects by datetime_conv()).
  • date_fmt – A format string for parsing date columns; see datetime_conv() for details and the default format string.
  • comment – Prefix string(s) that identify comment lines; either a single Unicode string or a tuple of Unicode strings.
Returns:

A NumPy structured array.

Raises:

ValueError – If the data file contains non-ASCII text (i.e., bytes greater than 127), because numpy.loadtxt cannot process non-ASCII data (e.g., see NumPy issues #3184, #4543, #4600, #4939).

Example:

The code below demonstrates how to read observations from a file that includes columns for the year, the observation date, the observed value, and free-text metadata (up to 20 characters in length).

import datetime
import numpy as np
import epifx.obs
columns = [('year', np.int32), ('date', datetime.datetime),
           ('count', np.int32), ('metadata', 'S20')]
df = epifx.obs.read_table('my-data-file.ssv', columns,
                          date_fmt='%Y-%m-%d')
epifx.obs.datetime_conv(text, fmt='%Y-%m-%d', encoding='utf-8')

Convert date strings to datetime.datetime instances. This is a convenience function intended for use with, e.g., numpy.loadtxt.

Parameters:
  • text – A string (bytes or Unicode) containing a date or date-time.
  • fmt – A format string that defines the textual representation. See the Python strptime documentation for details.
  • encoding – The byte string encoding (default is UTF-8).