get_outliers

giant.utilities.outlier_identifier:

giant.utilities.outlier_identifier.get_outliers(samples, sigma_cutoff=4)

This function identifies outliers in a one-dimensional set of data.

It is based on the median absolute deviation algorithm:

\[\begin{split}\widetilde{\mathbf{x}}=\text{median}(\mathbf{x}) \\ mad = \text{median}(\left|\mathbf{x}-\widetilde{\mathbf{x}}\right|)\end{split}\]

where \(\widetilde{\mathbf{x}}\) is the median of the data set \(\mathbf{x}\) and \(mad\) is the median absolute deviation. Outliers are then identified by taking each sample's absolute deviation from the median, dividing it by the median absolute deviation, and multiplying by 1.4826 (the scale factor that relates the median absolute deviation to the standard deviation of a normal distribution) to compute the median absolute deviation “sigma”. This is then compared against a user-specified sigma threshold, and anything greater than or equal to this value is labeled as an outlier:

\[\sigma_{mad} = 1.4826\frac{\left|\mathbf{x}-\widetilde{\mathbf{x}}\right|}{mad}\]
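
The formula above can be sketched directly in NumPy. This is only an illustration of the computation as written on this page, not the library's source: the helper name mad_sigma and the handling of a zero median absolute deviation are assumptions made for this sketch.

```python
import numpy as np


def mad_sigma(samples):
    """Hypothetical helper: MAD-based sigma score for each sample."""
    x = np.asarray(samples, dtype=float)
    # absolute deviation of every sample from the median of the data set
    deviations = np.abs(x - np.median(x))
    # median absolute deviation (mad)
    mad = np.median(deviations)
    if mad == 0:
        # Assumption for this sketch only: refuse to divide by zero.
        raise ValueError("median absolute deviation is zero")
    # scale as in the formula above
    return 1.4826 * deviations / mad


data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
sigmas = mad_sigma(data)
# label anything at or above the threshold as an outlier
outliers = sigmas >= 4
```

For this data the median is 3 and the median absolute deviation is 1, so only the last sample (with a score of roughly 143.8) meets the threshold of 4.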

To use this function, pass in a one-dimensional data set and, optionally, the desired sigma threshold. The result is a numpy boolean array which is True where outliers were identified.

>>> from giant.utilities.outlier_identifier import get_outliers
>>> import numpy as np
>>> data = np.random.randn(5)
>>> data[2] = data.max()*10000
>>> get_outliers(data, sigma_cutoff=10)
array([False, False,  True, False, False])

To subsequently get inliers, just use the NOT operator ~

>>> inliers = ~get_outliers(data, sigma_cutoff=10)
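
Either mask can then be used with ordinary NumPy boolean indexing. The sketch below shows one way to drop the flagged samples before computing a statistic. The outliers array is written out by hand as a stand-in for the return value of get_outliers so that the snippet runs without GIANT installed, and the variable names are illustrative only.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Stand-in for get_outliers(data): a boolean array, True where outliers are.
outliers = np.array([False, False, False, False, True])

# invert the mask to select the inliers
inliers = ~outliers

# boolean indexing keeps only the samples where the mask is True
clean = data[inliers]
clean_mean = clean.mean()

# number of samples that were flagged
n_flagged = int(np.count_nonzero(outliers))
```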

Parameters:
  • samples (Sequence | ndarray) – The one-dimensional data set to identify outliers in

  • sigma_cutoff (Real) – The sigma threshold to use when labelling outliers

Returns:

A numpy boolean array with True where outliers are present in the data and False otherwise

Return type:

ndarray