{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# LODA: Explaining the cause of an anomaly on the Zoo dataset\n\n\nKnowing that an example is anomalous is just the first part of the whole anomaly detection pipeline.\nWithout investigating further, I would consider this information almost useless. Luckily for us, LODA has a built-in way to\nget a little more information about why a particular example is viewed as an anomaly. Thanks to its sparse projections,\nwe can compute a one-tailed two-sample t-test between the probabilities from histograms on projections with and without a specific feature.\nCasually speaking, if histograms that use a particular feature give statistically higher anomaly scores than those without it, we should take a closer look at that feature. Note that this has a higher time complexity than scoring samples, because every feature has to be evaluated separately.\n\n\nOf course, we should not consider this to be the ground truth for explaining the cause of an anomaly.\nThat is a complicated process requiring more analysis and in-depth knowledge of the data.\nLODA only gives us a good starting point for our investigation.\nIf you want to see a full mathematical explanation, read section **3.3 Explaining the cause of an anomaly** [1]_ in the original article.\n\nTo show this feature of LODA, we created a simple example using the Zoo dataset from the UCI Machine Learning Repository [3]_.\nIt contains different animal species and a summary of their characteristics (hair, feathers, eggs, milk, airborne, aquatic, etc.).\nWe have chosen it because it's small, simple, and its features are easily understandable (a cat has four legs :) ...).\nFirst of all, we transform this dataset using UMAP (:obj:`umap.UMAP`) [2]_ to show it in two dimensions.\n"
      ]
    },
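    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The t-test described above can be sketched in a few lines of NumPy.\n",
        "This is only a minimal illustration, not the code used inside LODA: the helper `feature_t_statistics`, its arguments, and the toy data are all hypothetical.\n",
        "It assumes that, for one sample, we already have the negative log probability from each histogram together with the sparse projection vectors,\n",
        "and it computes a Welch-style t-statistic per feature. A large positive value suggests that histograms using the feature see the sample as more anomalous.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "\n",
        "def feature_t_statistics(neg_log_probs, projections):\n",
        "    # neg_log_probs: shape (k,), -log p_i(x) of ONE sample under each of k histograms\n",
        "    # projections: shape (k, d), sparse projection vectors (0 means feature unused)\n",
        "    used = projections != 0\n",
        "    t_stats = np.full(used.shape[1], np.nan)\n",
        "    for j in range(used.shape[1]):\n",
        "        with_j = neg_log_probs[used[:, j]]\n",
        "        without_j = neg_log_probs[~used[:, j]]\n",
        "        if len(with_j) < 2 or len(without_j) < 2:\n",
        "            continue  # not enough histograms on one side to estimate a variance\n",
        "        std_err = np.sqrt(\n",
        "            with_j.var(ddof=1) / len(with_j) + without_j.var(ddof=1) / len(without_j)\n",
        "        )\n",
        "        if std_err > 0:\n",
        "            # One-tailed: only a positive difference points at feature j\n",
        "            t_stats[j] = (with_j.mean() - without_j.mean()) / std_err\n",
        "    return t_stats\n",
        "\n",
        "\n",
        "# Toy data (made up for illustration): 200 sparse projections over 5 features\n",
        "toy_rng = np.random.default_rng(0)\n",
        "toy_projections = toy_rng.normal(size=(200, 5)) * (toy_rng.random((200, 5)) < 0.5)\n",
        "# Histograms that use feature 3 report a higher anomaly score for this sample\n",
        "toy_neg_log_probs = toy_rng.normal(1.0, 0.1, size=200) + 2.0 * (\n",
        "    toy_projections[:, 3] != 0\n",
        ")\n",
        "toy_t = feature_t_statistics(toy_neg_log_probs, toy_projections)\n",
        "print(np.round(toy_t, 2))\n",
        "print(\"Most suspicious feature:\", int(np.nanargmax(toy_t)))"
      ]
    },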
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Ondrej Kur\u00e1k kurak@gaussalgo.com\n# License: LGPLv3+\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom umap import UMAP\n\nfrom anlearn.loda import LODA\n\nframe = pd.read_csv(\n    \"https://raw.githubusercontent.com/sharmaroshan/Zoo-Dataset/master/zoo.csv\",\n)\n\nframe.set_index(\"animal_name\", inplace=True)\n\nprint(frame)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# !cat ../datasets/zoo.names\n\n# 1. Title: Zoo database\n\n# 2. Source Information\n#    -- Creator: Richard Forsyth\n#    -- Donor: Richard S. Forsyth\n#              8 Grosvenor Avenue\n#              Mapperley Park\n#              Nottingham NG3 5DX\n#              0602-621676\n#    -- Date: 5/15/1990\n\n# 3. Past Usage:\n#    -- None known other than what is shown in Forsyth's PC/BEAGLE User's Guide.\n\n# 4. Relevant Information:\n#    -- A simple database containing 17 Boolean-valued attributes.  The \"type\"\n#       attribute appears to be the class attribute.  Here is a breakdown of\n#       which animals are in which type: (I find it unusual that there are\n#       2 instances of \"frog\" and one of \"girl\"!)\n\n#       Class# Set of animals:\n#       ====== ===============================================================\n#            1 (41) aardvark, antelope, bear, boar, buffalo, calf,\n#                   cavy, cheetah, deer, dolphin, elephant,\n#                   fruitbat, giraffe, girl, goat, gorilla, hamster,\n#                   hare, leopard, lion, lynx, mink, mole, mongoose,\n#                   opossum, oryx, platypus, polecat, pony,\n#                   porpoise, puma, pussycat, raccoon, reindeer,\n#                   seal, sealion, squirrel, vampire, vole, wallaby,wolf\n#            2 (20) chicken, crow, dove, duck, flamingo, gull, hawk,\n#                   kiwi, lark, ostrich, parakeet, penguin, pheasant,\n#                   rhea, skimmer, skua, sparrow, swan, vulture, wren\n#            3 (5)  pitviper, seasnake, slowworm, tortoise, tuatara\n#            4 (13) bass, carp, catfish, chub, dogfish, haddock,\n#                   herring, pike, piranha, seahorse, sole, stingray, tuna\n#            5 (4)  frog, frog, newt, toad\n#            6 (8)  flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp\n#            7 (10) clam, crab, crayfish, lobster, octopus,\n#                   scorpion, seawasp, slug, starfish, worm\n\n# 
5. Number of Instances: 101\n\n# 6. Number of Attributes: 18 (animal_name, 15 Boolean attributes, 2 numerics)\n\n# 7. Attribute Information: (name of attribute and type of value domain)\n#    1. animal_name:      Unique for each instance\n#    2. hair\t\tBoolean\n#    3. feathers\t\tBoolean\n#    4. eggs\t\tBoolean\n#    5. milk\t\tBoolean\n#    6. airborne\t\tBoolean\n#    7. aquatic\t\tBoolean\n#    8. predator\t\tBoolean\n#    9. toothed\t\tBoolean\n#   10. backbone\t\tBoolean\n#   11. breathes\t\tBoolean\n#   12. venomous\t\tBoolean\n#   13. fins\t\tBoolean\n#   14. legs\t\tNumeric (set of values: {0,2,4,5,6,8})\n#   15. tail\t\tBoolean\n#   16. domestic\t\tBoolean\n#   17. catsize\t\tBoolean\n#   18. class_type\t\tNumeric (integer values in range [1,7])\n\n# 8. Missing Attribute Values: None\n\n# 9. Class Distribution: Given above"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data visualization\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Use every column except the last one (class_type) as features\nX = frame.values[:, :-1]\n\n# Prepare data for visualization using UMAP\numap = UMAP(n_neighbors=15, min_dist=0.9, random_state=42)\ntransformed = umap.fit_transform(X)\n\nplt.figure(figsize=(10, 10))\nplt.subplot(111, aspect=\"auto\")\nplt.subplots_adjust(\n    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01\n)\n\n\n# Plot each animal class in its own color (the loop variable no longer shadows the built-in `type`)\nfor class_type in np.unique(frame[\"class_type\"]):\n    selected = transformed[frame[\"class_type\"] == class_type]\n    plt.scatter(selected[:, 0], selected[:, 1], label=class_type)\n\n# Label every point with the animal name\nfor name, x, y in zip(frame.index, transformed[:, 0], transformed[:, 1]):\n    plt.annotate(name, (x, y), alpha=0.8, fontsize=10)\n\nplt.title(\"Zoo dataset - animal types\", fontsize=18)\nplt.xticks(())\nplt.yticks(())\nplt.legend(title=\"Animal type\", title_fontsize=15, fontsize=13)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explaining the cause of an anomaly\n\nOnce we have the anomaly scores and the importance of each feature, we can investigate further.\nWe'll look at the five most anomalous animals and, as an example, take a closer look at the honeybee.\nIt has quite a high score, and its most significant features are venomous (1.91), hair (1.55), breathes (1.28), and domestic (0.97).\nIf we consider the composition of our dataset, there are no other venomous animals that are domestic, so this does seem right.\nWe can find explanations like this for every other animal in the top five: the octopus has eight legs, the sea wasp has almost\nnone of the features in the dataset, etc. So can we say that these are the real reasons why these animals are unusual? Yes and no.\nYes, this is why LODA sees them as anomalies given our data, but without a review from a domain expert,\nwe must be careful about such a statement.\nAlso, keep in mind that this dataset is small and oversimplified, with only a limited number of features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "loda = LODA(n_estimators=100, bins=100, random_state=42)\nloda.fit(X)\n\n# Anomaly score of every animal and the inlier (1) / outlier (-1) prediction\nscores = loda.score_samples(X)\npredicted = loda.predict(X)\n\n\nplt.figure(figsize=(10, 10))\nplt.subplot(111, aspect=\"auto\")\nplt.subplots_adjust(\n    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01\n)\n\n# Split the UMAP embedding into predicted inliers and outliers\nX_n = transformed[predicted == 1]\nX_a = transformed[predicted == -1]\n\nplt.scatter(X_n[:, 0], X_n[:, 1], color=\"tab:orange\", label=\"Inliers\")\nplt.scatter(X_a[:, 0], X_a[:, 1], color=\"tab:blue\", label=\"Outliers\")\n\nfor name, x, y in zip(frame.index[predicted == 1], X_n[:, 0], X_n[:, 1]):\n    plt.annotate(name, (x, y), alpha=0.5, fontsize=12)\n\nfor name, x, y in zip(frame.index[predicted == -1], X_a[:, 0], X_a[:, 1]):\n    plt.annotate(name, (x, y), fontsize=15, ha=\"right\")\n\nplt.title(\"Zoo dataset - anomalous examples\", fontsize=18)\nplt.legend(title=\"Predicted\", title_fontsize=15, fontsize=13)\nplt.xticks(())\nplt.yticks(())\nplt.show()\n\n# Per-sample importance of each feature\nfeature_scores = loda.score_features(X)\n\nfor animal, score, feature_score in zip(\n    frame[predicted == -1].itertuples(),\n    scores[predicted == -1],\n    feature_scores[predicted == -1],\n):\n    name = animal[0]\n    # Feature indices sorted from the most to the least important\n    srt = np.argsort(feature_score)[::-1]\n\n    print(f\"{name} score: {score:.3f}\")\n\n    # Print the four most important features together with their values\n    for feature, value, importance in zip(\n        frame.columns[srt][:4], np.array(animal[1:])[srt], feature_score[srt]\n    ):\n        print(f\"\\t{feature} {value} ({importance:.2f})\")"
      ]
    },
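    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The text above talks about the five most anomalous animals, while the code prints every animal that `predict` flags as an outlier.\n",
        "A top-k selection by score could be sketched as follows. The helper `top_k_anomalies` and the toy scores are hypothetical,\n",
        "and the sketch assumes the scikit-learn convention that a lower `score_samples` value means a more anomalous sample;\n",
        "set `lower_is_more_anomalous=False` if your detector uses the opposite sign.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "\n",
        "def top_k_anomalies(names, scores, k=5, lower_is_more_anomalous=True):\n",
        "    # Return the k (name, score) pairs with the most anomalous scores\n",
        "    scores = np.asarray(scores, dtype=float)\n",
        "    # argsort is ascending, so negate the scores when higher means more anomalous\n",
        "    order = np.argsort(scores if lower_is_more_anomalous else -scores)\n",
        "    return [(names[i], float(scores[i])) for i in order[:k]]\n",
        "\n",
        "\n",
        "# Toy values (made up for illustration, not real LODA output)\n",
        "toy_names = [\"cat\", \"dog\", \"honeybee\", \"octopus\"]\n",
        "toy_scores = [-1.0, -1.2, -4.5, -3.9]\n",
        "toy_top = top_k_anomalies(toy_names, toy_scores, k=2)\n",
        "print(toy_top)"
      ]
    },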
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Summary\nTo sum up, LODA offers a powerful built-in tool for explaining the cause of an anomaly.\nIt is more resource-consuming than scoring samples, because every feature is evaluated separately, and its output is only a starting point:\nto establish the real reason behind an anomaly, we still need to take a closer look at the data.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## References\n.. [1] Pevn\u00fd, T. Loda: Lightweight on-line detector of anomalies. Mach Learn 102, 275\u2013304 (2016).\n       <https://doi.org/10.1007/s10994-015-5521-0>\n.. [2] McInnes, L., Healy, J., Saul, N., & Grossberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection.\n       The Journal of Open Source Software, 3(29), 861. <https://github.com/lmcinnes/umap/>\n.. [3] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository\n       [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n       <https://archive.ics.uci.edu/ml/datasets/Zoo>\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.6"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}