stats - Statistics methods without the overhead
- Purpose:
Provide access to various statistical calculations, namely:
CUSUM:
cusum()
Gaussian KDE:
kde()
Linear Regression:
LinearRegression
- Platform:
Linux/Windows | Python 3.7+
- Developer:
J Berendt
- Email:
- Comments:
n/a
- Example:
Create a sample dataset for the stats methods:
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(73)
>>> data = np.random.normal(size=100)*100
>>> x = np.arange(data.size)
>>> y = pd.Series(data).rolling(window=25, min_periods=25).mean().cumsum()
>>> # Preview the trend.
>>> plt.plot(x, y)
- class stats.LinearRegression(x: numpy.array, y: numpy.array)[source]
Calculate the linear regression of a dataset.
- Parameters:
x (np.array) – Array of X-values.
y (np.array) – Array of Y-values.
- Slope Calculation:
The slope calculation is borrowed from the scipy.stats.linregress() function, whose source code was obtained from GitHub.
- Example Use:
Tip
For a sample dataset and imports to go along with this example, refer to the docstring for this module.
Calculate a linear regression line on an X/Y dataset:
>>> from lib.stats import LinearRegression
>>> linreg = LinearRegression(x, y)
>>> linreg.calculate()
>>> # Obtain the regression line array.
>>> y_ = linreg.regression_line
>>> # View the intercept value.
>>> linreg.intercept
-31.26630
>>> # View the slope value.
>>> linreg.slope
1.95332
>>> # Plot the trend and regression line.
>>> plt.plot(x, y, 'grey')
>>> plt.plot(x, y_, 'red')
>>> plt.show()
- property slope
Accessor to the slope value.
- property intercept
Accessor to the regression line’s y-intercept.
- property regression_line
Accessor to the calculated regression line, as y-values.
- calculate()[source]
Calculate the linear regression for the X/Y data arrays.
The result of the calculation is accessible via the regression_line property.
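As an illustration of what a least-squares fit of this kind computes, the sketch below derives a slope, intercept and regression line with NumPy alone. It is not the library's implementation: the function name fit_line is hypothetical, the formula is the standard covariance-over-variance form, and the sketch does not filter NaN values.

```python
import numpy as np


def fit_line(x, y):
    """Return (slope, intercept, fitted y-values) for a least-squares line.

    Hypothetical helper for illustration only; not part of the stats module.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by the sum of squared X deviations.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept, slope * x + intercept


# A noise-free line, so the fit should recover its coefficients.
x = np.arange(10)
y = 2.0 * x + 1.0
slope, intercept, line = fit_line(x, y)
```

The three returned values play the same roles as the slope, intercept and regression_line properties described above.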
- class stats.Stats[source]
Wrapper class for various statistical calculations.
- static cusum(df: pandas.DataFrame, cols: list | str, *, window: int = None, min_periods: int = 1, inplace=False, show_plot: bool = False) pandas.DataFrame | None [source]
Calculate a CUSUM on a set of data.
A CUSUM is a generalised method for smoothing a noisy trend, or for detecting a change in the trend.
Note
A CUSUM is not a cumulative sum (cumsum), although a cumulative sum is used. A CUSUM is a cumulative sum of derived values, where each derived value is calculated as the delta of a single value relative to the rolling mean of all previous values.
- Parameters:
df (pd.DataFrame) – The DataFrame containing the column(s) on which a CUSUM is to be calculated.
cols (Union[list, str]) – The column (or list of columns) on which the CUSUM is to be calculated.
window (int, optional) –
Size of the window on which the rolling mean is to be calculated. This corresponds to the window parameter of pandas.DataFrame.rolling(). Defaults to None.
If None is received, a 5% window is calculated based on the length of the DataFrame. This method helps smooth the trend, while keeping a representation of the original trend.
For a true CUSUM, a running average should be calculated on the length of the DataFrame, except for the current value. For this method, pass window=len(df).
min_periods (int, optional) – Number of periods to wait before calculating the rolling average. Defaults to 1.
inplace (bool, optional) – Update the passed DataFrame (in-place), rather than returning a copy of the passed DataFrame. Defaults to False.
show_plot (bool, optional) – Display a graph of the raw values and the calculated CUSUM results. Defaults to False.
- Calculation:
The CUSUM is calculated by taking a rolling mean \(RA\) (optionally locked at the first value) and calculating the delta of the current value relative to the rolling mean of all previous values. A cumulative sum is applied to the deltas. The cumulative sum for each data point is returned as the CUSUM value.
- Equation:
\(c_k = \sum_{i=1}^{k}(x_i - RA_i)\)
where \(RA\) (Rolling Mean), for a window spanning all previous values, is defined as:
\(RA_{i+1} = \frac{1}{i}\sum_{j=1}^{i}x_j\)
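The calculation described above can be sketched with pandas as follows. This is a minimal illustration rather than the library's implementation: the function name cusum_sketch is hypothetical, and the rounding used for the 5% default window and the handling of the first value (left as NaN here, since it has no previous values) are assumptions.

```python
import numpy as np
import pandas as pd


def cusum_sketch(values, window=None, min_periods=1):
    """Cumulative sum of each value's delta from the rolling mean of prior values.

    Hypothetical helper for illustration only; not part of the stats module.
    """
    s = pd.Series(values, dtype=float)
    if window is None:
        # Assumption: a 5% window, rounded down, with a floor of one period.
        window = max(1, int(len(s) * 0.05))
    # Shift by one so each rolling mean covers previous values only,
    # excluding the current value.
    rolling_mean = s.rolling(window=window, min_periods=min_periods).mean().shift(1)
    deltas = s - rolling_mean
    # pandas skips the leading NaN when accumulating.
    return deltas.cumsum()


# With window=len(values), the rolling mean spans all previous values.
values = [1, 2, 3, 4]
result = cusum_sketch(values, window=len(values))
```

For the values 1, 2, 3, 4 the previous-value means are 1, 1.5 and 2, so the deltas are 1, 1.5 and 2 and the accumulated result after the first (NaN) entry is 1, 2.5 and 4.5.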
- Example Use:
Generate a sample trend dataset:
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(13)
>>> s1 = pd.Series(np.random.randn(1000)).rolling(window=100).mean()
>>> np.random.seed(73)
>>> s2 = pd.Series(np.random.randn(1000)).rolling(window=100).mean()
>>> df = pd.DataFrame({'sample1': s1, 'sample2': s2})
Example for calculating a CUSUM on two columns:
>>> from EHM.stats import stats
>>> df_c = stats.cusum(df=df, cols=['sample1', 'sample2'], window=len(df), inplace=False, show_plot=True)
>>> df_c.tail()
      sample1   sample2  sample1_cusum  sample2_cusum
995  0.057574  0.065887      23.465337      29.279936
996  0.062781  0.072213      23.556592      29.369397
997  0.028513  0.072658      23.613478      29.459204
998  0.024518  0.070769      23.666305      29.547022
999  0.000346  0.074849      23.694901      29.638822
- Returns:
If the inplace argument is False, a copy of the original DataFrame with the new CUSUM columns appended is returned. Otherwise, the passed DataFrame is updated, and None is returned.
- Return type:
Union[pd.DataFrame, None]
- kde(data: list | numpy.array | pandas.Series, n: int = 500) tuple [source]
Calculate the kernel density estimate (KDE) for an array X.
This function returns the probability density function (PDF) using Gaussian KDE.
- Parameters:
data (Union[list, np.array, pd.Series]) – An array-like object containing the data against which the Gaussian KDE is calculated. This can be a list, numpy array, or pandas Series.
n (int, optional) – Number of values returned in the X, Y arrays. Defaults to 500.
- Example Use:
Tip
For a sample dataset and imports to go along with this example, refer to the docstring for this module.
Calculate a Gaussian KDE on Y:
>>> from utils4.stats import stats
>>> # Preview the histogram.
>>> _ = plt.hist(data)
>>> X, Y, max_x = stats.kde(data=data, n=500)
>>> plt.plot(X, Y)
>>> # Show X value at peak of curve.
>>> max_x
-9.718684033029376
- Max X:
This function also returns the X value of the curve’s peak, where max_x is the X value corresponding to the max Y value on the curve. The result (max_x) is returned as the third tuple element.
- Further Detail:
This method uses the scipy.stats.gaussian_kde() method for the KDE calculation. For further detail on the calculation itself, refer to that function’s docstring.
- Background:
Originally, plotly.figure_factory.dist_plot() was used to calculate the KDE. However, to remove the plotly dependency from this library, their code was copied and refactored (simplified) into this function. Both the dist_plot() and pandas.DataFrame.plot.kde() methods call scipy.stats.gaussian_kde() for the calculation, which this function also calls.
- Returns:
A tuple containing the X-array and Y-array (both of size n), as well as the X value at max Y, as: (curve_x, curve_y, max_x)
- Return type:
tuple
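A rough sketch of how a (curve_x, curve_y, max_x) tuple of this shape could be produced with scipy.stats.gaussian_kde() is given below. It is not the library's code: the function name kde_sketch is hypothetical, and evaluating the curve on n evenly spaced points between the data's minimum and maximum is an assumption about the X range.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_sketch(data, n=500):
    """Return (curve_x, curve_y, max_x) for a Gaussian KDE of the data.

    Hypothetical helper for illustration only; not part of the stats module.
    """
    values = np.asarray(data, dtype=float)
    # Assumption: the curve spans the observed data range in n even steps.
    curve_x = np.linspace(values.min(), values.max(), n)
    # gaussian_kde instances are callable and return the estimated density.
    curve_y = gaussian_kde(values)(curve_x)
    # X value at which the estimated density is highest.
    max_x = curve_x[np.argmax(curve_y)]
    return curve_x, curve_y, max_x


# Normally distributed sample centred on 5, so the peak should sit near 5.
rng = np.random.default_rng(73)
sample = rng.normal(loc=5.0, scale=1.0, size=1000)
X, Y, max_x = kde_sketch(sample, n=500)
```

Because max_x is read off the evaluation grid, its resolution depends on n: a larger n gives a finer estimate of the peak location.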