Introduction#
Preface#
This package has been implemented by Massimo Pierini as a Bachelor’s thesis (Pierini, 2024).
It is the first implementation of CUB class models in Python and is mainly based upon
the work of Domenico Piccolo and the CUB
package in R (Iannario et al., 2022),
maintained by Rosaria Simone.
Background#
The class of CUB (Combination of Uniform and Binomial) models, proposed by Professor Domenico Piccolo in 2003 (Piccolo, 2003) within the context of rating and preference data analysis, hypothesizes that the ordinal responses provided by raters are not the simple result of a reasoned choice, but rather the complex combination of a multitude of factors, both internal and external.
Simplifying, two main components can be distinguished: feeling and uncertainty.
The primary component, feeling, is due to sufficient awareness and understanding of the topic, grounded in knowledge and experience. The secondary component, uncertainty, is instead generated by an intrinsic fuzziness arising from various circumstances: limited knowledge, lack of interest, the timing of the survey, the method of administration, boredom, and so on.
The simplest way to account for these two aspects is a mixture distribution combining a shifted Binomial component for the former and a discrete Uniform component for the latter; this yields the CUB family of models, subsequently extended to account for further factors such as overdispersion of the Binomial component, the effect of a shelter choice, and so on.
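As a concrete illustration, the following minimal sketch computes this mixture for a CUB model without covariates, assuming the standard parameterization (Piccolo, 2003) in which \(\pi\) weights the shifted Binomial component and \(\xi\) is the feeling parameter; the function name is illustrative and not part of the package API.

```python
import numpy as np
from scipy.stats import binom

def cub_pmf(m, pi, xi):
    """Probability mass of a CUB model over the ratings r = 1, ..., m.

    Mixture of a shifted Binomial (feeling) and a discrete
    Uniform (uncertainty) component; illustrative sketch only.
    """
    r = np.arange(1, m + 1)
    # Shifted Binomial component: R - 1 ~ Binomial(m - 1, 1 - xi)
    feeling = binom.pmf(r - 1, m - 1, 1 - xi)
    # Discrete Uniform component over the m categories
    uncertainty = np.full(m, 1 / m)
    return pi * feeling + (1 - pi) * uncertainty

p = cub_pmf(m=7, pi=0.7, xi=0.3)
print(p.round(4), p.sum())  # a proper distribution: sums to 1
```

Setting \(\pi = 1\) recovers a pure (shifted) Binomial response, while \(\pi = 0\) yields complete uncertainty, i.e. the discrete Uniform.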
The most recent overview paper (Piccolo and Simone, 2019) will be used as the reference for terminology, theory, and inferential issues.
Motivation#
Currently, the class of CUB models has been implemented in statistical and econometric programming languages such as R (Iannario et al., 2022), Stata (Cerulli et al., 2021), Gretl (Simone et al., 2019) and GAUSS (Piccolo, 2006). However, given the recent growth of the Python programming language in the statistical field as well (Pittard and Li, 2020), an implementation in this environment could be useful to the scientific community.
Notes#
To simplify the notation, the complete matrix of covariates will occasionally be denoted by \(\pmb T\) and the column vector of the model's parameters by \(\pmb\theta\).
Generally speaking, for models with covariates, three different probability functions are available:
.pmfi() (probability distribution matrix)

\[\begin{split}\Pr(R_i=r|\pmb\theta; \pmb T_i), \left\{ \begin{array}{l} i=1,\ldots,n \\ r=1,\ldots,m \end{array} \right.\end{split}\]

which is an \(n \times m\) matrix of the probability distribution of each \(i\)-th subject, given the estimated parameters and the covariates. It is an auxiliary function for .draw(). Notice that each row sums to 1 (verified in the sketch below), i.e.

\[\sum_{r=1}^m \Pr(R_i=r|\pmb\theta; \pmb T_i) = 1,\; \forall i\]
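For models with covariates, a hedged sketch of how such a matrix could be built, assuming logit links for both parameters (a standard choice in the CUB literature); cub_pmfi and the simulated data are illustrative, not the package's actual implementation:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import binom

def cub_pmfi(m, Y, W, beta, gamma):
    """n x m matrix of CUB probabilities, one row per subject.

    Covariates enter through logit links (an assumption here):
    pi_i = expit(Y_i @ beta), xi_i = expit(W_i @ gamma).
    """
    beta, gamma = np.asarray(beta), np.asarray(gamma)
    pi = expit(Y @ beta)    # (n,) uncertainty weights
    xi = expit(W @ gamma)   # (n,) feeling parameters
    r = np.arange(1, m + 1)
    # Broadcast: subjects over rows, rating categories over columns
    feeling = binom.pmf(r - 1, m - 1, (1 - xi)[:, None])
    return pi[:, None] * feeling + (1 - pi)[:, None] / m

rng = np.random.default_rng(1)
Y = np.column_stack([np.ones(5), rng.normal(size=5)])  # intercept + 1 covariate
P = cub_pmfi(m=7, Y=Y, W=Y, beta=[0.2, 0.5], gamma=[-0.1, 0.3])
print(np.allclose(P.sum(axis=1), 1.0))  # True: each row sums to 1
```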
.pmf() (average probability distribution)

\[\frac{1}{n} \sum_{i=1}^n \Pr(R_i=r|\pmb\theta; \pmb T_i),\; r=1,\ldots,m\]

which is a \(1 \times m\) row vector of the average probability distribution, given the estimated parameters and the covariates. It is an auxiliary function of .plot_ordinal() and is used to compute the Dissimilarity index for models with covariates. Notice that it always sums to 1 (see the sketch below) because

\[\begin{split}\begin{align*} \sum_{r=1}^m \frac{1}{n} \sum_{i=1}^n \Pr(R_i=r|\pmb\theta; \pmb T_i) &= \frac{1}{n} \sum_{i=1}^n \sum_{r=1}^m \Pr(R_i=r|\pmb\theta; \pmb T_i) \\ &= \frac{1}{n} \sum_{i=1}^n 1 = \frac{1}{n}\, n = 1 \end{align*}\end{split}\]
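The averaging step and the identity above can be verified with a short sketch; the Dirichlet draws below merely stand in for any row-stochastic matrix such as the .pmfi() output:

```python
import numpy as np

rng = np.random.default_rng(2)
# A stand-in for the n x m output of .pmfi(): 5 subjects, m = 7 categories
P = rng.dirichlet(alpha=np.ones(7), size=5)

avg = P.mean(axis=0)               # the 1 x m average distribution
print(np.isclose(avg.sum(), 1.0))  # True: averaging preserves total mass 1
```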
.prob() (observed sample probability)

\[\Pr(R_i=r_i|\pmb\theta;\pmb T_i),\; i=1,\ldots,n\]

which is an \(n \times 1\) column vector of the probability of the observed response \(r_i\) for each \(i\)-th subject, given the estimated parameters and the covariates. It has not been implemented for all models and can serve as an auxiliary function for .loglik() (see the sketch below). Notice that, in general, it does not sum to 1.
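A sketch of how such a vector can be extracted from the probability matrix and combined into a log-likelihood, again using a stand-in matrix and assuming ratings coded \(1,\ldots,m\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 7
P = rng.dirichlet(alpha=np.ones(m), size=n)  # stand-in for the .pmfi() matrix
r_obs = rng.integers(1, m + 1, size=n)       # observed ratings, coded 1..m

prob = P[np.arange(n), r_obs - 1]  # Pr(R_i = r_i | theta, T_i) for each subject
loglik = np.log(prob).sum()        # how .loglik() can be assembled from .prob()
print(prob.round(4), round(loglik, 4))
```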