anlearn.loda¶

anlearn.loda.Histogram¶

class anlearn.loda.Histogram(bins: Union[int, str] = 'auto', return_min: bool = True)[source]¶

Histogram model

Histogram model based on scipy.stats.rv_histogram.

Parameters

bins (Union[int, str], optional) –
- int - number of equal-width bins in the given range.
- str - method used to calculate bin width (numpy.histogram_bin_edges).
See numpy.histogram_bin_edges bins for more details, by default “auto”
return_min (bool, optional) – Return minimal float value instead of 0, by default True

hist¶

Value of histogram

Type: numpy.ndarray

bin_edges¶

Edges of histogram

Type: numpy.ndarray

pdf¶

Probability density function

Type: numpy.ndarray

fit(X: numpy.ndarray) → anlearn.loda.Histogram [source]¶

Fit estimator

Parameters: X (numpy.ndarray) – Input data, shape (n_samples,)
Returns: Fitted estimator
Return type: Histogram

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict_proba(X: numpy.ndarray) → numpy.ndarray [source]¶

Predict probability

Predict probability of input data X.

Parameters: X (numpy.ndarray) – Input data, shape (n_samples,)
Returns: Probability estimated from histogram, shape (n_samples,)
Return type: numpy.ndarray

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

anlearn.loda.LODA¶

class anlearn.loda.LODA(n_estimators: int = 1000, bins: Union[int, str] = 'auto', q: float = 0.05, random_state: Optional[int] = None, n_jobs: Optional[int] = None, verbose: int = 0)[source]¶

LODA: Lightweight on-line detector of anomalies 1

LODA is an ensemble of histograms on random projections. See Pevný, T. Loda 1 for more details.

Parameters

n_estimators (int, optional) – number of histograms, by default 1000
bins (Union[int, str], optional) –
- int - number of equal-width bins in the given range.
- str - method used to calculate bin width (numpy.histogram_bin_edges).
See numpy.histogram_bin_edges bins for more details, by default “auto”
q (float, optional) – Quantile for compution threshold from training data scores. This threshold is used for predict method, by default 0.05
random_state (Optional[int], optional) – Random seed used for stochastic parts., by default None
n_jobs (Optional[int], optional) – Not implemented yet, by default None
verbose (int, optional) – Verbosity of logging, by default 0

projections_¶

Random projections, shape (n_estimators, n_features)

Type: numpy.ndarray

hists_¶

Histograms on random projections, shape (n_estimators,)

Type: List[Histogram]

anomaly_threshold_¶

Treshold for predict() function

Type: float

Examples

>>> import numpy as np
>>> from anlearn.loda import LODA
>>> X = np.array([[0, 0], [0.1, -0.2], [0.3, 0.2], [0.2, 0.2], [-5, -5], [0.6, 0.7]])
>>> loda = LODA(n_estimators=10, bins=10, random_state=42)
>>> loda.fit(X)
LODA(bins=10, n_estimators=10, random_state=42)
>>> loda.predict(X)
array([ 1,  1,  1,  1, -1,  1])

References

1(1,2,3,4): Pevný, T. Loda: Lightweight on-line detector of anomalies. Mach Learn 102, 275–304 (2016). <https://doi.org/10.1007/s10994-015-5521-0>

fit(X: anlearn._typing.ArrayLike, y: Optional[anlearn._typing.ArrayLike] = None) → anlearn.loda.LODA [source]¶

Fit estimator

Parameters

X (ArrayLike) – Input data, shape (n_samples, n_features)
y (Optional[ArrayLike], optional) – Present for API consistency by convention, by default None

Returns

Fitted estimator

Return type

LODA

fit_predict(X, y=None)¶

Perform fit on X and returns labels for X.

Returns -1 for outliers and 1 for inliers.

Parameters

X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –
y (Ignored) – Not used, present for API consistency by convention.

Returns

y – 1 for inliers, -1 for outliers.

Return type

ndarray of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict(X: anlearn._typing.ArrayLike) → numpy.ndarray [source]¶

Predict if samples are outliers or not

Samples with a score lower than anomaly_threshold_ are considered to be outliers.

Parameters: X (ArrayLike) – Input data, shape (n_samples, n_features)
Returns: 1 for inlineres, -1 for outliers, shape (n_samples,)
Return type: numpy.ndarray

score_features(X: anlearn._typing.ArrayLike) → numpy.ndarray [source]¶

Feature importance

Feature importance is computed as a one-tailed two-sample t-test between \(-log(\hat{p}_i)\) from histograms on projections with and without a specific feature. The higher the value is, the more important feature is.

See full description in 3.3 Explaining the cause of an anomaly 1 for more details.

Parameters: X (ArrayLike) – input data, shape (n_samples, n_features)
Returns: Feature importance in anomaly detection.
Return type: numpy.ndarray

Notes

\[t_j = \frac{\mu_j - \bar{\mu}_j}{ \sqrt{\frac{s^2_j}{|I_j|} + \frac{\bar{s}^2_j}{|\bar{I_j}|}}}\]

score_samples(X: anlearn._typing.ArrayLike) → numpy.ndarray [source]¶

Anomaly scores for samples

Average of the logarithm probabilities estimated of individual projections. Output is proportional to the negative log-likelihood of the sample, that means the less likely a sample is, the higher the anomaly value it receives 1. This score is reversed for scikit-learn compatibility.

Parameters: X (ArrayLike) – Input data, shape (n_samples, n_features)
Returns: The anomaly score of the input samples. The lower, the more abnormal. Shape (n_samples,)
Return type: numpy.ndarray

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

LODA: projections & histograms¶

Comparison of scikit-learn anomaly detection methods and LODA¶

LODA: large data - Credit Card Fraud Detection dataset¶

LODA: Explaining the cause of an anomaly on Zoo dataset¶