
Introduction

The negative binomial (NB) distribution is widely used to model count data whose variance exceeds its mean, a situation that simpler discrete distributions such as the Poisson cannot accommodate. Its ability to account for this overdispersion in observed data has led to the NB distribution's application across a wide range of scientific disciplines as an alternative to Poisson models. Recently, I was interested in evaluating the sum of independent but not identically distributed negative binomial random variables (r.v.s). I discovered relatively quickly, however, that a straightforward solution to this problem doesn't really exist. The exact solutions published by Furman and Vellaisamy both carry significant computational drawbacks. Approximate methods have also been described for such sums; these largely alleviate the computational burden of the exact methods, but at the potential cost of numeric accuracy. What really befuddled me, however, was that I was unable to find any widely accessible package/tool/etc. that would let me easily apply any of these methods to actual data. So I wrote the R package that I felt was missing: nbconv.

The current version of nbconv can be found on CRAN and the development version can be found on GitHub.

Package description

nbconv was written with the same general syntax as other distribution functions in R. The package has 5 principal functions: dnbconv(), pnbconv(), qnbconv(), rnbconv(), and nbconv_params(). The first four return the probability mass function (PMF), cumulative distribution function (CDF), quantile function, and random deviates, respectively, for the convolution of NB r.v.s. The function nbconv_params() returns summary statistics of a given NB convolution based on its moments. The signatures are:

dnbconv(counts, mus, ps, phis, method = c("exact", "moments", "saddlepoint"),
        n.terms = 1000, n.cores = 1, tolerance = 1e-3, normalize = TRUE)

pnbconv(quants, mus, ps, phis, method = c("exact", "moments", "saddlepoint"),
        n.terms = 1000, n.cores = 1, tolerance = 1e-3, normalize = TRUE)

qnbconv(probs, counts, mus, ps, phis, method = c("exact", "moments", "saddlepoint"),
        n.terms = 1000, n.cores = 1, tolerance = 1e-3, normalize = TRUE)

rnbconv(mus, phis, ps, n.samp, n.cores = 1)

nbconv_params(mus, phis, ps)

The parameterization of the NB distribution used in nbconv is the same as that used by stats::d/p/q/rnbinom(). All of the nbconv functions take as input vectors of either the constituent distributions' means (mus) or probabilities of success (ps), along with the constituent distributions' dispersion parameters (phis, referred to as size in stats).
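As a quick reminder of that parameterization (this check uses only stats, not nbconv), the mean and probability forms are related by p = phi / (phi + mu):

mu  <- 5
phi <- 2

# In the stats parameterization, prob = size / (size + mu),
# so the two forms of dnbinom() agree exactly.
all.equal( dnbinom( 0:10, mu = mu, size = phi ),
           dnbinom( 0:10, prob = phi / (phi + mu), size = phi ) )
#> [1] TRUE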

The PMF, CDF, and quantile functions all require specification of the evaluation method. In nbconv, these are: Furman's exact equation (method = "exact"), a method of moments approximation (method = "moments"), and the saddlepoint approximation (method = "saddlepoint"). I'll avoid the gory mathematical details of the evaluation methods in this post, but a detailed description can be found here. To give credit where it is due, Martin Modrák's blog post was my inspiration to include the saddlepoint approximation in nbconv.

Other method-specific arguments can also be set by the user. The arguments n.terms and tolerance pertain only to evaluation via Furman's exact function. They define, respectively, the number of terms included in the series and how close the sum of the PMF of the mixture r.v. $K$ (see the method descriptions) must be to 1 to be accepted. The tolerance threshold thus serves as a check that the number of terms included in the series sufficiently describes the possible values of $K$. The argument normalize pertains to evaluation via the saddlepoint approximation and defines whether the saddlepoint density should be normalized to sum to 1, since the raw saddlepoint PMF is not guaranteed to do so.

Evaluation of the mass, distribution, and quantile functions via the exact or saddlepoint methods, as well as generation of random deviates via rnbconv(), can be parallelized by setting n.cores to a value greater than 1. Note, however, that for the exact method only evaluation of the PMF is parallelized; the recursive parameters must be computed sequentially. Because of this, CPU time for the exact method grows linearly with the number of terms included in the series. rnbconv() and nbconv_params() work independently of the evaluation methods described above. The argument n.samp in rnbconv() defines the number of random deviates to sample from the target convolution.
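To make the normalize argument concrete, here is a minimal sketch using two made-up constituent distributions (the parameter values are arbitrary and not drawn from the example below):

raw <- dnbconv( counts = 0:200, mus = c(5, 10), phis = c(2, 3),
                method = "saddlepoint", normalize = FALSE )
nrm <- dnbconv( counts = 0:200, mus = c(5, 10), phis = c(2, 3),
                method = "saddlepoint", normalize = TRUE )

# The raw saddlepoint masses need not sum exactly to 1; the normalized
# masses should, provided 0:200 covers essentially all of the mass.
sum( raw )
sum( nrm )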

Examples

To demonstrate general use of nbconv functions, I'll generate some sample data. I'll use the gamma distribution to ensure that $\mu \geq 0$ and $\phi > 0$.

library(nbconv)

set.seed(1234)
mus <- rgamma(n = 25, shape = 3.5, scale = 3.5)

set.seed(1234)
phis <- rgamma(n = 25, shape = 2, scale = 4)
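Because the gamma distribution has strictly positive support, a quick check confirms that the simulated parameters satisfy these constraints:

# Both vectors should be strictly positive.
range( mus )
range( phis )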

Summary statistics of the convolution can be informative as to which methods might or might not work well with our data. I won't go into too much detail here, but in general, the exact method provides the most accurate results, though at what can be a steep computational cost. For convolutions of wildly different NB distributions and/or highly overdispersed distributions, the influence of the mixture distribution on the shape of the convolution grows. When this happens, the number of terms included in the series generally has to increase as well, and because the exact method depends on recursive parameters, computation time increases linearly with it. In instances where the convolution is largely symmetric and/or doesn't exhibit a large degree of kurtosis, however, the method of moments and saddlepoint approximations work pretty well. Anecdotally, the saddlepoint approximation is a little more robust to skewness and kurtosis than the method of moments approximation, but I won't explore that point in any more detail here.

nbconv_params( mus = mus, phis = phis )
#>         mean       sigma2     skewness  ex.kurtosis       K.mean 
#> 259.98524927 722.99508196   0.17586635   0.04909827  26.26988006

The output of nbconv_params() tells us that the convolution of the sample NB r.v.s is approximately symmetric and doesn’t exhibit much tailing. Because of this, we could probably get away with using any of the three evaluation methods. For the purposes of demonstration, however, I’ll go ahead and use them all. I’ll additionally calculate empirical probability masses from random deviates sampled using rnbconv() to serve as reference data.

samps <- rnbconv( mus = mus, phis = phis, n.samp = 1e6, n.cores = 1 )

# Estimate the empirical probability mass at each integer in 0:500.
empirical <- stats::density(x = samps, from = 0, to = 500, n = 500 + 1)$y
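The kernel density estimate serves as a smoothed stand-in for the empirical PMF here, since the grid points land exactly on the integers 0:500. An alternative (my own aside, not part of the package) is to tabulate the deviates directly:

# Proportion of deviates equal to each count in 0:500. factor() maps the
# samples onto the levels 0:500; tabulate() counts occurrences of each level.
empirical_tab <- tabulate( factor( samps, levels = 0:500 ) ) / length( samps )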

exact <- dnbconv( mus = mus, phis = phis, counts = 0:500, 
                  method = "exact", n.terms = 1000, n.cores = 1 )

moments <- dnbconv( mus = mus, phis = phis, counts = 0:500, method = "moments" )

saddlepoint <- dnbconv( mus = mus, phis = phis, counts = 0:500, 
                        method = "saddlepoint", n.cores = 1, normalize = TRUE )

For easier visualization, I’ll combine the four calculated probability mass vectors into a single long data frame.

df <- data.frame( empirical, exact, moments, saddlepoint )

df$count <- 0:500

df <- df |>
  tidyr::pivot_longer( cols = !count, names_to = "method", values_to = "probability") |>
  dplyr::arrange( method, count )

df$method <- factor( df$method, levels = c("empirical", "exact", "moments", "saddlepoint") )

library( ggplot2 )

ggplot(data = df,
       aes(x = count, y = probability, fill = method ) ) +
  geom_area(data = df[df$method == "empirical",],
            aes(x = count, y = probability ),
            color = "gray50", fill = "gray50",
            inherit.aes = FALSE, alpha = 0.5 ) +
  geom_point(shape = 21, size = 2, color = "#00000000") +
  scale_fill_manual(values = c("gray50","darkblue","darkred","darkgreen"),
                     labels = c("Empirical", "Exact", "Moments", "Saddlepoint" ),
                     guide = guide_legend(title.position = "top",
                                          title.hjust = 0.5,
                                          override.aes = list( size = 2.5 ) ) ) +
  theme_bw() +
  theme(axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        legend.position = "top",
        legend.title = element_blank(),
        legend.key.size = unit(3, "point"),
        panel.spacing = unit(1, "lines") ) +
  labs(x = "Counts", y = "Probability")

Visualizing the data, we can see that all three methods do indeed appear to describe the empirical distribution well. This will not always be the case! I strongly encourage users of nbconv to pay attention to the summary statistics of the target convolution and to compare the evaluated distribution to random deviates!
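Although this example focused on the PMF, the distribution and quantile functions follow the same pattern. As a brief sketch using the same simulated parameters (output omitted, since it depends on the data):

# CDF: probability of a total count of at most 300.
pnbconv( quants = 300, mus = mus, phis = phis, method = "moments" )

# Quantile function: 95th percentile, evaluated over the support 0:500.
qnbconv( probs = 0.95, counts = 0:500, mus = mus, phis = phis, method = "moments" )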

Hopefully this brief example effectively demonstrates the general workflow of nbconv. If you have any comments or questions, please reach out!
