import pandas as pd
df = pd.read_csv("../data/dataset/Spark/Flooding_dataset_Spark.txt")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()
February 27, 2026
This cell imports the pandas library and loads the dataset from a file located at ../data/dataset/Spark/Flooding_dataset_Spark.txt. It also displays the shape, column names, and the first few rows of the dataset.
This cell provides an overview of the dataset, including the data types of each column and the number of non-null values.
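The overview described above can be produced with `df.info()`; the tiny frame below is a stand-in for the dataset loaded earlier, just to show the call.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (the real df comes from read_csv above)
df = pd.DataFrame({"Timestamp": [0.01, 0.02], "DLC": [8, 8], "flag": ["R", "T"]})

# info() reports each column's dtype and non-null count
df.info()
```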
This cell calculates and displays summary statistics for the numerical columns in the dataset, such as mean, standard deviation, minimum, and maximum values.
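The summary statistics come from `df.describe()`; a one-column stand-in frame is used here in place of the real dataset.

```python
import pandas as pd

df = pd.DataFrame({"DLC": [8, 8, 2, 8]})  # stand-in for the loaded dataset

# describe() gives count, mean, std, min, quartiles, and max for numeric columns
stats = df.describe()
print(stats)
```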
This cell calculates the number of missing values in each column of the dataset.
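Counting missing values per column is a one-liner; again a small stand-in frame replaces the real dataset for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DLC": [8, np.nan, 2], "flag": ["T", "R", None]})  # stand-in

# isnull().sum() counts NaN/None entries in each column
missing = df.isnull().sum()
print(missing)
```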
This cell consolidates the scattered flag encodings into a single flag column, so each packet's flag sits in its own row. Null data is also replaced with a more representative string.
# The flag ("T" or "R") is the last non-null value in each row; pull it into one column
df["flag"] = df.apply(lambda row: row[row.last_valid_index()] if row.last_valid_index() is not None else "No data", axis=1)
# Binary encode: 1 = flooding ("T"), 0 = normal
df["flag"] = df["flag"].apply(lambda x: 1 if x == "T" else 0)
# Blank out the cell the flag was pulled from so it is not reused as a feature
cols = [c for c in df.columns if c != "flag"]
df[cols] = df[cols].apply(
    lambda row: row.mask(row.index == row.last_valid_index(), "No data"),
    axis=1
)
# Drop the leftover columns created by the scattered flag encodings
df.drop(columns=["R"], inplace=True, errors="ignore")
df.drop(columns=["04C1"], inplace=True, errors="ignore")
# Replace any remaining nulls with a representative placeholder
df.fillna("No data", inplace=True)
df.head()
This cell makes the data column names more representative.
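The renaming step can be sketched with `DataFrame.rename`; the raw headers and target names below are hypothetical, since the notebook does not show the actual mapping.

```python
import pandas as pd

df = pd.DataFrame({"0.000000": [0.01], "0316": [8]})  # hypothetical raw headers

# Hypothetical mapping from raw headers to descriptive names
df = df.rename(columns={"0.000000": "Timestamp", "0316": "CAN_ID"})
print(list(df.columns))
```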
This cell uses matplotlib and seaborn to create a histogram with a kernel density estimate (KDE) for the values in column DLC. The plot is saved as distribution.png.
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df["DLC"], bins=50, kde=True, ax=ax)
ax.set_title("Distribution of DLC")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
fig.savefig("../figures/distribution.png", dpi=150, bbox_inches="tight")
plt.show()
This cell calculates the correlation matrix for numeric columns in the dataset and visualizes it using a heatmap. The plot is saved as correlation_heatmap.png.
fig, ax = plt.subplots(figsize=(10, 8))
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
fig.savefig("../figures/correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()
This cell visualizes the distribution of the new category using a count plot, which is saved as class_distribution.png.
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x="flag", ax=ax)
ax.set_xticks([0, 1])  # fix tick positions before relabeling them
ax.set_xticklabels(["Normal (R)", "Flooding (T)"])
ax.set_title("Message Count: Normal vs Flooding")
ax.set_xlabel("CAN ID Category")
ax.set_ylabel("Number of Messages")
fig.savefig("../figures/class_distribution.png", dpi=150, bbox_inches="tight")
plt.show()
This cell creates a violin plot to visualize the distribution of data sizes (DLC) for the two categories in the flag column. The plot is saved as violin_plot.png.
fig, ax = plt.subplots(figsize=(8, 6))
sns.violinplot(data=df, x="flag", y="DLC", ax=ax)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Normal", "Flooding (0x000)"])
ax.set_title("Data Size Distribution: Flooding vs Normal Messages")
ax.set_xlabel("CAN ID Category")
ax.set_ylabel("Data Size")
fig.savefig("../figures/violin_plot.png", dpi=150, bbox_inches="tight")
plt.show()
This cell buckets timestamps into 100 ms windows and displays the message count in each window, showing the pattern in message flow over time.
# Extract message over time graph
df["time_bucket"] = df["Timestamp"].apply(lambda x: int(x * 10) / 10)
message_counts = df.groupby("time_bucket").size().reset_index(name="message_count")
fig, ax = plt.subplots(figsize=(10, 8))
sns.lineplot(data=message_counts, x="time_bucket", y="message_count", ax=ax)
ax.set_title("Message Count Over Time (100ms Windows)")
ax.set_xlabel("Time Window (100ms)")
ax.set_ylabel("Number of Messages")
fig.savefig("../figures/message_over_time.png", dpi=150, bbox_inches="tight")
plt.show()
This cell encodes each frame as belonging to a high-traffic time window or not. This helps reveal any pattern between high message counts and injected messages; specifically, whether messages in high-traffic windows are more likely to be flooding messages.
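The high-traffic encoding can be sketched as follows; the stand-in timestamps and the mean-based threshold are assumptions, since the notebook's exact cutoff is not shown.

```python
import pandas as pd

# Stand-in timestamps in seconds; the real df comes from the dataset above
df = pd.DataFrame({"Timestamp": [0.01, 0.02, 0.03, 0.15, 0.95]})
df["time_bucket"] = (df["Timestamp"] * 10).astype(int) / 10

# Messages per 100 ms window, broadcast back to each row
counts = df.groupby("time_bucket")["Timestamp"].transform("size")

# Flag windows with an above-average message count as high traffic
# (the mean threshold is an assumption, not the notebook's verified cutoff)
df["high_traffic"] = (counts > counts.mean()).astype(int)
print(df["high_traffic"].tolist())
```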
This cell measures the uniqueness of the byte data in the data columns, to look for a difference in payload variance between flooding and non-flooding messages.
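One way to compute such a feature is `nunique` across the payload columns per row; the column names below are stand-ins, since the dataset's actual byte-column names are not shown.

```python
import pandas as pd

# Stand-in per-byte payload columns (real names come from the dataset)
df = pd.DataFrame({
    "DATA0": ["00", "ff", "00"],
    "DATA1": ["00", "aa", "01"],
    "DATA2": ["00", "bb", "02"],
})

# Distinct byte values within each message's payload;
# low variance (e.g. an all-zero payload) can hint at injected flooding frames
data_cols = ["DATA0", "DATA1", "DATA2"]
df["unique_count"] = df[data_cols].nunique(axis=1)
print(df["unique_count"].tolist())
```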
This cell creates a parallel coordinates plot to visualize the relationships between the features DLC, high_traffic, unique_count, and flag. The plot is saved as parallel_coordinates.png.
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
plot_df = df[["unique_count", "DLC", "high_traffic", "flag"]].copy()
plot_df = plot_df.sample(1000, random_state=42)
plot_df["flag"] = plot_df["flag"].map({0: "Normal", 1: "Flooding"})
fig, ax = plt.subplots(figsize=(10, 6))
parallel_coordinates(plot_df, "flag",
                     color=["steelblue", "crimson"],
                     alpha=0.2,
                     ax=ax)
ax.set_title("Parallel Coordinates: Flooding vs Normal Messages")
fig.savefig("../figures/parallel_coordinates.png", dpi=150, bbox_inches="tight")
plt.show()
This cell rebuilds the correlation heatmap with the newly extracted features to gauge their relevance for training.
#Relooking at correlation
corr_df = df[["DLC", "unique_count","high_traffic", "flag"]].copy()
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
fig.savefig("../figures/correlation_heatmap_features.png", dpi=150, bbox_inches="tight")  # distinct name so the earlier heatmap is not overwritten
plt.show()
This cell splits the dataset into training and testing sets using train_test_split from sklearn. The target variable is flag.
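A minimal sketch of the split, using a toy stand-in frame; the `test_size`, `random_state`, and `stratify` arguments are assumptions about the notebook's call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in feature frame; the notebook uses DLC, unique_count, and high_traffic
df = pd.DataFrame({
    "DLC": [8, 8, 2, 8, 2, 8, 2, 8],
    "unique_count": [4, 3, 1, 4, 1, 3, 1, 4],
    "high_traffic": [0, 0, 1, 0, 1, 0, 1, 0],
    "flag": [0, 0, 1, 0, 1, 0, 1, 0],
})
X = df[["DLC", "unique_count", "high_traffic"]]
y = df["flag"]

# stratify keeps the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```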
This cell scales the features down to a common range, which logistic regression in particular needs in order to perform effectively.
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
# Fill NaNs
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
# Scale
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
This cell trains two machine learning models, Logistic Regression and Random Forest Classifier, using the training data.
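The training step can be sketched as below; synthetic data stands in for the scaled `X_train`/`y_train`, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions rather than the notebook's exact settings. The names `lr` and `rf` match those used in the evaluation cell that follows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled training data produced above
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=0)

# Fit both models on the same training set
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(lr.score(X_train, y_train), rf.score(X_train, y_train))
```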
This cell evaluates the performance of the trained models (Logistic Regression and Random Forest) on the test dataset. It prints a classification report and a confusion matrix for each model, and saves the confusion matrix plots as images.
from sklearn.metrics import classification_report, confusion_matrix
for name, model in [("Logistic Regression", lr), ("Random Forest", rf)]:
    y_pred = model.predict(X_test)
    print(f"\n{'='*40}")
    print(f"{name}")
    print(f"{'='*40}")
    print(classification_report(y_test, y_pred))
    fig, ax = plt.subplots(figsize=(6, 5))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_title(f"Confusion Matrix - {name}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    fig.savefig(f"../figures/confusion_matrix_{name.lower().replace(' ', '_')}.png",
                dpi=150, bbox_inches="tight")
    plt.show()
This verifies that the models are learning meaningful patterns from the data.