{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# LODA: large data - Credit Card Fraud Detection dataset\n\nIn previous sections, we have seen that LODA [1]_ can achieve results comparable\nto those of more complex anomaly detection methods. Now we can take full advantage of LODA's\nlow time and space complexity and apply it to a much larger dataset.\n\nWe'll use the Credit Card Fraud Detection dataset from the Machine Learning Group of\nUniversit\u00e9 Libre de Bruxelles [4]_ (it's available on Kaggle [5]_).\nThis dataset consists of 284,807 credit card transactions, of which 492 are frauds.\nThe features are the output of a PCA transformation; no additional information about them\nis provided due to confidentiality issues.\n\nFirst of all, we'll visualize the entire dataset in a low-dimensional space to get an overview. We'll transform the data using UMAP [2]_ and then plot the results.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Ondrej Kur\u00e1k kurak@gaussalgo.com\n# License: LGPLv3+\nimport time\n\nimport datashader as ds\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nfrom colorcet import fire\nfrom datashader import transfer_functions as tf\nfrom sklearn.ensemble import IsolationForest\nfrom sklearn.metrics import auc, roc_curve\nfrom umap import UMAP\n\nfrom anlearn.loda import LODA\n\nframe = pd.read_csv(\"../datasets/creditcard.csv\")\n\n# Use every column except the first one and the last one (\"Class\", the label).\n# arcsinh compresses large magnitudes while keeping the sign of each value.\nX = np.arcsinh(frame.values[:, 1:-1])\ny = frame[\"Class\"].values  # 0 = normal transaction, 1 = fraud\n\n\numap = UMAP(random_state=42)\n\n# Fitting UMAP could take ~30 min on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz.\n# Uncomment the lines below to compute the embedding and cache it on disk;\n# here we load the cached result instead.\n# transformed = umap.fit_transform(X)\n# with open(\"../datasets/transformed.npy\", \"wb\") as out:\n#     np.save(out, transformed)\ntransformed = np.load(\"../datasets/transformed.npy\")\n\n\nplt.figure(figsize=(12, 8))\nplt.subplot(111, aspect=\"auto\")\nplt.subplots_adjust(\n    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01\n)\n\n# Plot each class separately so that it gets its own color and legend entry.\nfor index, label in enumerate((\"Normal transaction\", \"Fraud transaction\")):\n    plt.scatter(\n        transformed[y == index, 0],\n        transformed[y == index, 1],\n        s=5,\n        label=label,\n        alpha=0.5,\n    )\n\nplt.legend(fontsize=13)\nplt.xticks(())\nplt.yticks(())\n\nplt.title(\"Transformation by UMAP\", fontsize=15)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. figure:: /img/loda/loda_fraud_umap.png\n   :alt: LODA: large dataset transformed by UMAP\n\nAt first glance, this visualization shows some apparent clusters,\nsome of which even include a lot of fraud transactions. But this could be misleading\ndue to significant overplotting. We'll try to solve this issue by\nrendering the same embedding with Datashader [6]_, which aggregates the points\nfalling into each pixel instead of drawing them on top of each other.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Datashader works on a DataFrame with named coordinate columns.\nshader_data = pd.DataFrame(\n    transformed,\n    columns=[\"x\", \"y\"],\n)\n\n# Aggregate the points on a 1000x800 pixel grid (point count per pixel by default).\nagg = ds.Canvas(plot_width=1000, plot_height=800).points(shader_data, \"x\", \"y\")\n\n# Map the per-pixel aggregates to colors.\nimg = tf.shade(agg, name=\"Transformation by UMAP + Datashader\")\n\nplt.figure(figsize=(15, 15))\nplt.subplot(111, aspect=\"auto\")\nplt.subplots_adjust(top=0.96, wspace=0.05, hspace=0.01)\nplt.imshow(img.to_pil())\nplt.title(\"Transformation by UMAP + Datashader\", fontsize=15)\nplt.xticks(())\nplt.yticks(())\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. figure:: /img/loda/loda_fraud_datashader.png\n    :alt: LODA: large dataset transformed by UMAP and Datashader\n\nNow that we have an idea of what the dataset looks like, let's try to detect fraudulent\ntransactions. Because of the dataset's size, we'll use only LODA and isolation forest as anomaly\ndetection methods. To compare them, we'll use the area under the ROC curve.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "times = {}  # wall-clock duration of each step, in seconds\n\nloda = LODA(n_estimators=100, random_state=42, bins=100)\n\nstart_time = time.monotonic()\nloda.fit(X)\ntimes[\"loda.fit\"] = time.monotonic() - start_time\n\nstart_time = time.monotonic()\nloda_scores = loda.score_samples(X)\ntimes[\"loda.score_samples\"] = time.monotonic() - start_time\n\n\n# Per-feature scores are computed here only to measure how long they take.\nstart_time = time.monotonic()\nfeature_scores = loda.score_features(X)\ntimes[\"loda.score_features\"] = time.monotonic() - start_time\n\nisoforest = IsolationForest(n_estimators=100, random_state=42)\n\nstart_time = time.monotonic()\nisoforest.fit(X)\ntimes[\"isoforest.fit\"] = time.monotonic() - start_time\n\nstart_time = time.monotonic()\niso_scores = isoforest.score_samples(X)\ntimes[\"isoforest.score_samples\"] = time.monotonic() - start_time\n\n# Both detectors' scores are negated before calling roc_curve, which expects\n# higher values for the positive class (fraud, y == 1).\nloda_fpr, loda_tpr, _ = roc_curve(y, -loda_scores)\nloda_auc = auc(loda_fpr, loda_tpr)\n\nisof_fpr, isof_tpr, _ = roc_curve(y, -iso_scores)\nisof_auc = auc(isof_fpr, isof_tpr)\n\n\nplt.figure(figsize=(12, 8))\nplt.subplot(111, aspect=\"auto\")\nplt.subplots_adjust(\n    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01\n)\n\nplt.plot(\n    loda_fpr,\n    loda_tpr,\n    label=f\"\"\"LODA\n auc: {loda_auc:.3f}\n fit time: {times[\"loda.fit\"]:.2f}s\n score time: {times[\"loda.score_samples\"]:.2f}s\n fscore time: {times[\"loda.score_features\"]:.2f}s\"\"\",\n)\nplt.plot(\n    isof_fpr,\n    isof_tpr,\n    label=f\"\"\"Isolation Forest\n auc: {isof_auc:.3f}\n fit time: {times[\"isoforest.fit\"]:.2f}s\n score time: {times[\"isoforest.score_samples\"]:.2f}s\"\"\",\n)\n\n# Diagonal reference line: the ROC curve of a random classifier.\nplt.plot([0, 1], [0, 1], color=\"navy\", linestyle=\"--\")\n\nplt.title(\"Credit cards ROC curve\", fontsize=15)\nplt.legend(\n    title=\"Algorithm results\", title_fontsize=15, fontsize=13, loc=\"center right\"\n)\nplt.xlabel(\"False positive rate\", fontsize=13)\nplt.ylabel(\"True positive rate\", fontsize=13)\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. figure:: /img/loda/loda_fraud_roc.png\n    :alt: LODA: large dataset ROC curve\n\nAs we can see, both methods performed very well (with LODA slightly ahead).\nLODA's low time complexity becomes apparent once we compare the training and scoring times of both\ndetectors. It took LODA only 1/4 of the isolation forest's time to score 284,807 samples.\nThis may not seem like a big difference, but once we scale up to millions of\ntransactions, it could be a game-changer.\n\nTo conclude this section, let's make another plot using Datashader, this time colored by the anomaly scores from LODA.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Attach LODA's anomaly score to each embedded point.\nshader_data = pd.DataFrame(\n    np.hstack([transformed, loda_scores[:, np.newaxis]]),\n    columns=[\"x\", \"y\", \"anomaly_score\"],\n)\n\n# Aggregate by the mean anomaly score of the points falling into each pixel.\nagg = ds.Canvas(plot_width=1000, plot_height=800).points(\n    shader_data, \"x\", \"y\", ds.mean(\"anomaly_score\")\n)\n\nimg = tf.shade(\n    agg, cmap=fire, name=\"Transformation by UMAP + Datashader (average anomaly score)\"\n)\n\nimg = tf.set_background(img, \"black\")\n\nplt.figure(figsize=(15, 15))\nplt.subplot(111, aspect=\"auto\")\nplt.subplots_adjust(\n    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01\n)\nplt.imshow(img.to_pil())\nplt.title(\"Transformation by UMAP + Datashader (average anomaly score)\", fontsize=15)\nplt.xticks(())\nplt.yticks(())\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. figure:: /img/loda/loda_fraud_datashader_anomaly.png\n    :alt: LODA: large dataset transformed by UMAP and Datashader (average anomaly score)\n\n## References\n.. [1] Pevn\u00fd, T. (2016). Loda: Lightweight on-line detector of anomalies.\n       Machine Learning, 102, 275\u2013304. <https://doi.org/10.1007/s10994-015-5521-0>\n.. [2] McInnes, L., Healy, J., Saul, N., & Grossberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection.\n       The Journal of Open Source Software, 3(29), 861. <https://github.com/lmcinnes/umap/>\n.. [4] Machine Learning Group of Universit\u00e9 Libre de Bruxelles. <http://mlg.ulb.ac.be>\n.. [5] Kaggle: Credit Card Fraud Detection. <https://www.kaggle.com/mlg-ulb/creditcardfraud>\n.. [6] HoloViz Datashader. <https://datashader.org/>\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.6"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}