anlearn.stats

anlearn.stats.IQR

class anlearn.stats.IQR(k: float = 1.5, lower_quantile: float = 0.25, upper_quantile: float = 0.75, ensure_2d: bool = True)

Interquartile range

Outlier detection method using Tukey’s fences. If the lower quantile is 0.25 (\(Q_1\), the lower quartile) and the upper quantile is 0.75 (\(Q_3\), the upper quartile), then an outlier is any observation outside the range:

\[[Q_1 - k(Q_3 - Q_1); Q_3 + k(Q_3 - Q_1)]\]

John Tukey proposed \(k=1.5\) to mark an observation as an outlier and \(k=3\) to mark it as "far out".
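
As a minimal sketch of the fences above, the computation can be written in plain Python without anlearn. The helper name tukey_fences is hypothetical, and the linear interpolation between order statistics is an assumption, since the quantile estimator is not stated here.

```python
from statistics import quantiles


def tukey_fences(data, k=1.5):
    """Return (lower_fence, upper_fence) built from the quartiles of data."""
    # method="inclusive" interpolates linearly between order statistics;
    # this is an assumption about how the quartiles are estimated.
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    spread = q3 - q1  # interquartile range
    return q1 - k * spread, q3 + k * spread


data = [-7, -4, 0, 1, 2, 3, 4, 10, 15]

low, high = tukey_fences(data, k=1.5)
outliers = [x for x in data if x < low or x > high]

far_low, far_high = tukey_fences(data, k=3)
far_out = [x for x in data if x < far_low or x > far_high]

print(low, high, outliers)         # -6.0 10.0 [-7, 15]
print(far_low, far_high, far_out)  # -12.0 16.0 []
```

A value that lies exactly on a fence (10 in this sample) is inside the range, so it is not flagged.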

Parameters
  • k (float, optional) – Outlier threshold, by default 1.5

  • lower_quantile (float, optional) – Lower quantile, in the open interval (0; 1), by default 0.25

  • upper_quantile (float, optional) – Upper quantile, in the open interval (0; 1), by default 0.75

  • ensure_2d (bool, optional) – Forbid 1D input arrays, by default True

lqv_

Lower quantile value estimated from the input data

Type

float

uqv_

Upper quantile value estimated from the input data

Type

float

iqr_

Interquartile range, uqv_ - lqv_

Type

float

Example

>>> import numpy as np
>>> from anlearn.stats import IQR
>>> X = np.hstack([[-7,-4], np.arange(5), [10, 15]])
>>> iqr = IQR(ensure_2d=False)
>>> iqr.fit(X)
IQR(ensure_2d=False)
>>> iqr.predict(X)
array([-1,  1,  1,  1,  1,  1,  1,  1, -1])
>>> iqr.score_samples(X)
array([-1.75, -1.  , -0.  , -0.  , -0.  , -0.  , -0.  , -1.5 , -2.75])

Raises

ValueError – Lower quantile must be lower than upper quantile.
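
The check behind this error can be sketched as follows. The function name check_quantiles is hypothetical and not part of anlearn, and treating equal quantiles as invalid is an assumption based on the wording of the message.

```python
def check_quantiles(lower_quantile, upper_quantile):
    """Raise ValueError unless lower_quantile < upper_quantile."""
    if lower_quantile >= upper_quantile:
        raise ValueError("Lower quantile must be lower than upper quantile.")


check_quantiles(0.25, 0.75)  # valid pair: returns None without raising

try:
    check_quantiles(0.75, 0.25)  # swapped quantiles
except ValueError as err:
    message = str(err)
```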

fit(X: anlearn._typing.ArrayLike, y: Optional[anlearn._typing.ArrayLike] = None) → anlearn.stats.IQR

Fit estimator

Parameters
  • X (ArrayLike) – Input data of shape (n_samples, 1), or (n_samples,) if ensure_2d is False

  • y (Optional[ArrayLike], optional) – Ignored; present for API consistency by convention, by default None

Returns

Fitted estimator

Return type

IQR

fit_predict(X, y=None)

Perform fit on X and return labels for X.

Returns -1 for outliers and 1 for inliers.

Parameters
  • X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

y – 1 for inliers, -1 for outliers.

Return type

ndarray of shape (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X: anlearn._typing.ArrayLike) → numpy.ndarray

Predict whether samples are outliers

Samples with a score lower than -k are considered to be outliers.

Parameters

X (ArrayLike) – Input data, shape (n_samples, n_features)

Returns

Array of shape (n_samples,): 1 for inliers, -1 for outliers

Return type

numpy.ndarray
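
The labels in the class Example are consistent with a threshold at -k on the negated scores. A minimal sketch of that rule in plain Python follows. The function name labels_from_scores is hypothetical, and the strict comparison is inferred from the Example, where a score of -1.5 with k=1.5 is labelled as an inlier.

```python
def labels_from_scores(scores, k=1.5):
    """Map negated IQR scores to 1 (inlier) or -1 (outlier)."""
    # Strict comparison: a score exactly equal to -k stays an inlier.
    return [-1 if s < -k else 1 for s in scores]


# Scores copied from the score_samples output in the class Example.
example_scores = [-1.75, -1.0, -0.0, -0.0, -0.0, -0.0, -0.0, -1.5, -2.75]
labels = labels_from_scores(example_scores)
print(labels)  # [-1, 1, 1, 1, 1, 1, 1, 1, -1]
```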

score_samples(X: anlearn._typing.ArrayLike) → numpy.ndarray

Score samples

The score is computed as the distance from the interval \([Q_{lower}; Q_{upper}]\) divided by the interquartile range: \(score = distance(data, (lqv, uqv)) / iqr\). The score is then negated for scikit-learn compatibility, so lower scores are more abnormal.

Parameters

X (ArrayLike) – Input data of shape (n_samples, 1), or (n_samples,) if ensure_2d is False

Returns

Shape (n_samples,). The outlier score of the input samples. The lower, the more abnormal.

Return type

numpy.ndarray
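
The scoring rule described for score_samples can be sketched for a single value in plain Python. The function name iqr_score is hypothetical. The bounds 0.0 and 4.0 are assumed to be the fitted lqv_ and uqv_ for the data in the class Example, which is consistent with the scores shown there.

```python
def iqr_score(x, lqv, uqv):
    """Negated distance from [lqv, uqv], in units of the interquartile range."""
    iqr = uqv - lqv
    if x < lqv:
        distance = lqv - x   # below the interval
    elif x > uqv:
        distance = x - uqv   # above the interval
    else:
        distance = 0.0       # inside the interval
    return -distance / iqr


scores = [iqr_score(x, lqv=0.0, uqv=4.0) for x in [-7, -4, 2, 10, 15]]
print(scores)  # [-1.75, -1.0, -0.0, -1.5, -2.75]
```

Values inside the interval score -0.0, which compares equal to 0.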

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance