Kddcup 99 dataset

The KDD Cup “99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by MIT Lincoln Lab [2]. The artificial data (described on the dataset’s homepage) was generated using a closed network and hand-injected attacks to produce a large number of different types of attack with normal activity in the background. As the initial goal was to produce a large training set for supervised learning algorithms, there is a large proportion (80.1%) of abnormal data which is unrealistic in real world, and inappropriate for unsupervised anomaly detection which aims at detecting “abnormal” data, i.e.:

  • qualitatively different from normal data

  • in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

  • SA is obtained by simply selecting all the normal data, and a small proportion of abnormal data to gives an anomaly proportion of 1%.

  • SF is obtained as in [3] by simply picking up the data whose attribute logged_in is positive, thus focusing on the intrusion attack, which gives a proportion of 0.3% of attack.

  • http and smtp are two subsets of SF corresponding with third feature equal to “http” (resp. to “smtp”).

General KDD structure:

Samples total

4898431

Dimensionality

41

Features

discrete (int) or continuous (float)

Targets

str, “normal.” or name of the anomaly type

SA structure:

Samples total

976158

Dimensionality

41

Features

discrete (int) or continuous (float)

Targets

str, “normal.” or name of the anomaly type

SF structure:

Samples total

699691

Dimensionality

4

Features

discrete (int) or continuous (float)

Targets

str, “normal.” or name of the anomaly type

http structure:

Samples total

619052

Dimensionality

3

Features

discrete (int) or continuous (float)

Targets

str, “normal.” or name of the anomaly type

smtp structure:

Samples total

95373

Dimensionality

3

Features

discrete (int) or continuous (float)

Targets

str, “normal.” or name of the anomaly type

sklearn.datasets.fetch_kddcup99() will load the kddcup99 dataset; it returns a dictionary-like object with the feature matrix in the data member and the target values in target. The « as_frame » optional argument converts data into a pandas DataFrame and target into a pandas Series. The dataset will be downloaded from the web if necessary.