stats - Statistics methods without the overhead

Purpose:

Provide access to various statistical calculations, namely linear regression, CUSUM and Gaussian kernel density estimation (KDE).

Platform:

Linux/Windows | Python 3.7+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Example:

Create a sample dataset for the stats methods:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd

>>> np.random.seed(73)
>>> data = np.random.normal(size=100)*100
>>> x = np.arange(data.size)
>>> y = pd.Series(data).rolling(window=25, min_periods=25).mean().cumsum()

>>> # Preview the trend.
>>> plt.plot(x, y)

class stats.LinearRegression(x: numpy.array, y: numpy.array)

Calculate the linear regression of a dataset.

Parameters:
  • x (np.array) – Array of X-values.

  • y (np.array) – Array of Y-values.

Slope Calculation:

The calculation for the slope itself is borrowed from the scipy.stats.linregress() function, whose source code was obtained from GitHub.
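
As an illustration of the closed-form least-squares calculation that scipy.stats.linregress() is based on, the slope and intercept can be derived from the sample means and the covariance of X and Y. This is a minimal sketch, not this library's source: the name linreg_sketch is hypothetical, and the sketch assumes the arrays contain no NaN values.

```python
import numpy as np


def linreg_sketch(x, y):
    """Return (slope, intercept, regression_line) for paired 1-D data.

    Hypothetical helper for illustration only; assumes no NaN values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Biased covariance matrix: [[var(x), cov(x, y)], [cov(x, y), var(y)]].
    ssxm, ssxym, _, _ = np.cov(x, y, bias=True).flat
    slope = ssxym / ssxm
    intercept = y.mean() - slope * x.mean()
    # The regression line is the fitted Y value at each X value.
    return slope, intercept, intercept + slope * x


# A perfectly linear dataset: y = 2x + 1.
x = np.arange(10)
y = 2.0 * x + 1.0
slope, intercept, line = linreg_sketch(x, y)
```

For the perfectly linear data used here, the fitted line coincides with the input Y values.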

Example Use:

Tip

For a sample dataset and imports to go along with this example, refer to the docstring for this module.

Calculate a linear regression line on an X/Y dataset:

>>> from utils4.stats import LinearRegression

>>> linreg = LinearRegression(x, y)
>>> linreg.calculate()

>>> # Obtain the regression line array.
>>> y_ = linreg.regression_line

>>> # View the intercept value.
>>> linreg.intercept
-31.26630

>>> # View the slope value.
>>> linreg.slope
1.95332

>>> # Plot the trend and regression line.
>>> plt.plot(x, y, 'grey')
>>> plt.plot(x, y_, 'red')
>>> plt.show()

__init__(x: numpy.array, y: numpy.array)

LinearRegression class initialiser.

property slope

Accessor to the slope value.

property intercept

Accessor to the regression line’s y-intercept.

property regression_line

Accessor to the calculated regression line, as y-values.

calculate()

Calculate the linear regression for the X/Y data arrays.

The result of the calculation is accessible via the regression_line property.

class stats.Stats

Wrapper class for various statistical calculations.

static cusum(df: pandas.DataFrame, cols: list | str, *, window: int = None, min_periods: int = 1, inplace=False, show_plot: bool = False) → pandas.DataFrame | None

Calculate a CUSUM on a set of data.

A CUSUM is a generalised method for smoothing a noisy trend, or for detecting a change in the trend.

Note

A CUSUM is not a cumulative sum (cumsum), although a cumulative sum is used. A CUSUM is a cumulative sum of derived values, where each derived value is calculated as the delta of a single value relative to the rolling mean of all previous values.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the column(s) on which a CUSUM is to be calculated.

  • cols (Union[list, str]) – The column (or list of columns) on which the CUSUM is to be calculated.

  • window (int, optional) –

    Size of the window on which the rolling mean is to be calculated. This corresponds to the pandas.DataFrame.rolling(window) parameter. Defaults to None.

    • If None is received, a 5% window is calculated based on the length of the DataFrame. This method helps smooth the trend, while keeping a representation of the original trend.

    • For a true CUSUM, the running average should be calculated over the full length of the DataFrame, excluding the current value. To use this method, pass window=len(df).

  • min_periods (int, optional) – Number of periods to wait before calculating the rolling average. Defaults to 1.

  • inplace (bool, optional) – Update the passed DataFrame (in-place), rather than returning a copy of the passed DataFrame. Defaults to False.

  • show_plot (bool, optional) – Display a graph of the raw values and the calculated CUSUM results. Defaults to False.

Calculation:

The CUSUM is calculated by taking a rolling mean \(RA\) (optionally locked at the first value), and calculating the delta of the current value relative to the rolling mean of all previous values. A cumulative sum is then applied to the deltas. The cumulative sum at each data point is returned as the CUSUM value.

Equation:

\(c_n = \sum_{i=1}^{n}(x_i - RA_i)\)

where \(RA\) (Rolling Mean), taken over all previous values, is defined as:

\(RA_{i+1} = \frac{1}{i}\sum_{j=1}^{i}x_j\)
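
The calculation described above can be sketched with pandas as follows. This is a minimal sketch, not this library's implementation: the name cusum_sketch is hypothetical, and both the use of shift(1) to exclude the current value from the rolling mean and the rounding of the 5% default window are assumptions.

```python
from typing import Optional

import numpy as np
import pandas as pd


def cusum_sketch(s: pd.Series,
                 window: Optional[int] = None,
                 min_periods: int = 1) -> pd.Series:
    """Cumulative sum of deltas relative to a rolling mean (illustration only)."""
    if window is None:
        # Assumed default: a window of 5% of the data length, at least 1.
        window = max(1, int(len(s) * 0.05))
    # Rolling mean of the previous values; shifting by one row excludes
    # the current value from its own mean (an assumption of this sketch).
    ra = s.rolling(window=window, min_periods=min_periods).mean().shift(1)
    # Delta of each value relative to the rolling mean, then accumulated.
    return (s - ra).cumsum()


s = pd.Series([1.0, 2.0, 3.0, 4.0])
c = cusum_sketch(s, window=len(s))
```

With window=len(s), the rolling mean spans all previous values. For the values 1, 2, 3, 4 the deltas are 1, 1.5 and 2 (the first value has no prior mean, so its result is NaN), giving CUSUM values of 1, 2.5 and 4.5, whereas a plain cumulative sum would give 1, 3, 6, 10.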

Example Use:

Generate a sample trend dataset:

>>> import numpy as np
>>> import pandas as pd

>>> np.random.seed(13)
>>> s1 = pd.Series(np.random.randn(1000)).rolling(window=100).mean()
>>> np.random.seed(73)
>>> s2 = pd.Series(np.random.randn(1000)).rolling(window=100).mean()
>>> df = pd.DataFrame({'sample1': s1, 'sample2': s2})

Example for calculating a CUSUM on two columns:

>>> from utils4.stats import stats

>>> df_c = stats.cusum(df=df,
...                    cols=['sample1', 'sample2'],
...                    window=len(df),
...                    inplace=False,
...                    show_plot=True)
>>> df_c.tail()
      sample1   sample2  sample1_cusum  sample2_cusum
995  0.057574  0.065887      23.465337      29.279936
996  0.062781  0.072213      23.556592      29.369397
997  0.028513  0.072658      23.613478      29.459204
998  0.024518  0.070769      23.666305      29.547022
999  0.000346  0.074849      23.694901      29.638822

Returns:

If the inplace argument is False, a copy of the original DataFrame with the new CUSUM columns appended is returned. Otherwise, the passed DataFrame is updated, and None is returned.

Return type:

Union[pd.DataFrame, None]

kde(data: list | numpy.array | pandas.Series, n: int = 500) → tuple

Calculate the kernel density estimate (KDE) for an array X.

This function returns the probability density function (PDF) using a Gaussian KDE.

Parameters:
  • data (Union[list, np.array, pd.Series]) – An array-like object containing the data against which the Gaussian KDE is calculated. This can be a list, numpy array, or pandas Series.

  • n (int, optional) – Number of values returned in the X, Y arrays. Defaults to 500.

Example Use:

Tip

For a sample dataset and imports to go along with this example, refer to the docstring for this module.

Calculate a Gaussian KDE on the sample data:

>>> from utils4.stats import stats

>>> # Preview the histogram.
>>> _ = plt.hist(data)

>>> X, Y, max_x = stats.kde(data=data, n=500)
>>> plt.plot(X, Y)

>>> # Show X value at peak of curve.
>>> max_x
-9.718684033029376

Max X:

This function also returns the X value of the curve’s peak, where max_x is the X value corresponding to the maximum Y value on the curve. The result (max_x) is returned as the third tuple element.

Further Detail:

This method uses scipy.stats.gaussian_kde for the KDE calculation. For further detail on the calculation itself, refer to that class’s docstring.
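
As a rough illustration of how the returned tuple could be derived with scipy.stats.gaussian_kde, consider the sketch below. It is not this library's source: the name kde_sketch is hypothetical, and evaluating the curve on n evenly spaced points between the minimum and maximum of the data is an assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_sketch(data, n=500):
    """Return (curve_x, curve_y, max_x) for a Gaussian KDE (illustration only)."""
    data = np.asarray(data, dtype=float)
    # Assumed evaluation grid: n evenly spaced points across the data range.
    curve_x = np.linspace(data.min(), data.max(), n)
    # Fit the Gaussian KDE and evaluate the estimated density on the grid.
    curve_y = gaussian_kde(data)(curve_x)
    # X value at which the estimated density is highest.
    max_x = curve_x[np.argmax(curve_y)]
    return curve_x, curve_y, max_x


rng = np.random.default_rng(73)
sample = rng.normal(size=200) * 100
X, Y, max_x = kde_sketch(sample, n=500)
```

Because max_x is read from the evaluation grid, its precision depends on n: a larger n gives a finer grid and a closer estimate of the peak location.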

Background:

Originally, plotly.figure_factory.create_distplot() was used to calculate the KDE. However, to remove the plotly dependency from this library, the relevant code was copied and refactored (simplified) into this function. Both create_distplot() and the pandas.DataFrame.plot.kde() method call scipy.stats.gaussian_kde() for the calculation, which this function also calls.

Returns:

A tuple containing the X-array and Y-array (both of size n), as well as the X value at max Y, as:

(curve_x, curve_y, max_x)

Return type:

tuple