import pandas as pd
df = pd.read_csv("../data/dataset/Spark/Flooding_dataset_Spark.txt")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()
February 27, 2026
This cell imports the pandas library and loads the dataset from a file located at ../data/dataset/Spark/Flooding_dataset_Spark.txt. It also displays the shape, column names, and the first few rows of the dataset.
This cell provides an overview of the dataset, including the data types of each column and the number of non-null values.
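The overview described above can be produced with `df.info()`; the tiny frame below is a stand-in for the dataset loaded earlier, just to show the call.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (the real df comes from read_csv above)
df = pd.DataFrame({"Timestamp": [0.01, 0.02], "DLC": [8, 8], "flag": ["R", "T"]})

# info() reports each column's dtype and non-null count
df.info()
```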
This cell calculates and displays summary statistics for the numerical columns in the dataset, such as mean, standard deviation, minimum, and maximum values.
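The summary statistics come from `df.describe()`; a one-column stand-in frame is used here in place of the real dataset.

```python
import pandas as pd

df = pd.DataFrame({"DLC": [8, 8, 2, 8]})  # stand-in for the loaded dataset

# describe() gives count, mean, std, min, quartiles, and max for numeric columns
stats = df.describe()
print(stats)
```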
This cell calculates the number of missing values in each column of the dataset.
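Counting missing values per column is a one-liner; again a small stand-in frame replaces the real dataset for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DLC": [8, np.nan, 2], "flag": ["T", "R", None]})  # stand-in

# isnull().sum() counts NaN/None entries in each column
missing = df.isnull().sum()
print(missing)
```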
This cell consolidates the scattered flag encodings into a single flag column, so each packet's flag sits in its own row. Null data is also replaced with a more representative string.
# The flag ("T" or "R") is the last non-null value in each row; pull it into one column
df["flag"] = df.apply(lambda row: row[row.last_valid_index()] if row.last_valid_index() is not None else "No data", axis=1)
# Binary encode: 1 = flooding ("T"), 0 = normal
df["flag"] = df["flag"].apply(lambda x: 1 if x == "T" else 0)
# Blank out the cell the flag was pulled from so it is not reused as a feature
cols = [c for c in df.columns if c != "flag"]
df[cols] = df[cols].apply(
    lambda row: row.mask(row.index == row.last_valid_index(), "No data"),
    axis=1
)
# Drop the leftover columns created by the scattered flag encodings
df.drop(columns=["R"], inplace=True, errors="ignore")
df.drop(columns=["04C1"], inplace=True, errors="ignore")
# Replace any remaining nulls with a representative placeholder
df.fillna("No data", inplace=True)
df.head()
This cell makes the data column names more representative.
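The renaming step can be sketched with `DataFrame.rename`; the raw headers and target names below are hypothetical, since the notebook does not show the actual mapping.

```python
import pandas as pd

df = pd.DataFrame({"0.000000": [0.01], "0316": [8]})  # hypothetical raw headers

# Hypothetical mapping from raw headers to descriptive names
df = df.rename(columns={"0.000000": "Timestamp", "0316": "CAN_ID"})
print(list(df.columns))
```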
This cell uses matplotlib and seaborn to create a histogram with a kernel density estimate (KDE) for the values in column DLC. The plot is saved as distribution.png.
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df["DLC"], bins=50, kde=True, ax=ax)
ax.set_title("Distribution of DLC")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
fig.savefig("../figures/distribution.png", dpi=150, bbox_inches="tight")
plt.show()
This cell calculates the correlation matrix for numeric columns in the dataset and visualizes it using a heatmap. The plot is saved as correlation_heatmap.png.
fig, ax = plt.subplots(figsize=(10, 8))
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
fig.savefig("../figures/correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()
This cell visualizes the distribution of the new category using a count plot, which is saved as class_distribution.png.
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x="flag", ax=ax)
ax.set_xticks([0, 1])  # fix tick positions before relabeling them
ax.set_xticklabels(["Normal (R)", "Flooding (T)"])
ax.set_title("Message Count: Normal vs Flooding")
ax.set_xlabel("CAN ID Category")
ax.set_ylabel("Number of Messages")
fig.savefig("../figures/class_distribution.png", dpi=150, bbox_inches="tight")
plt.show()
This cell creates a violin plot to visualize the distribution of data sizes (DLC) for the two categories in the flag column. The plot is saved as violin_plot.png.
fig, ax = plt.subplots(figsize=(8, 6))
sns.violinplot(data=df, x="flag", y="DLC", ax=ax)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Normal", "Flooding (0x000)"])
ax.set_title("Data Size Distribution: Flooding vs Normal Messages")
ax.set_xlabel("CAN ID Category")
ax.set_ylabel("Data Size")
fig.savefig("../figures/violin_plot.png", dpi=150, bbox_inches="tight")
plt.show()
This cell buckets timestamps into 100 ms windows and displays the message count in each window, showing the pattern in message flow over time.
# Extract message over time graph
df["time_bucket"] = df["Timestamp"].apply(lambda x: int(x * 10) / 10)
message_counts = df.groupby("time_bucket").size().reset_index(name="message_count")
fig, ax = plt.subplots(figsize=(10, 8))
sns.lineplot(data=message_counts, x="time_bucket", y="message_count", ax=ax)
ax.set_title("Message Count Over Time (100ms Windows)")
ax.set_xlabel("Time Window (100ms)")
ax.set_ylabel("Number of Messages")
fig.savefig("../figures/message_over_time.png", dpi=150, bbox_inches="tight")
plt.show()
This cell encodes each frame as belonging to a high-traffic time window or not. This helps reveal any pattern between high message counts and injected messages; specifically, whether messages in high-traffic windows are more likely to be flooding messages.
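The high-traffic encoding can be sketched as follows; the stand-in timestamps and the mean-based threshold are assumptions, since the notebook's exact cutoff is not shown.

```python
import pandas as pd

# Stand-in timestamps in seconds; the real df comes from the dataset above
df = pd.DataFrame({"Timestamp": [0.01, 0.02, 0.03, 0.15, 0.95]})
df["time_bucket"] = (df["Timestamp"] * 10).astype(int) / 10

# Messages per 100 ms window, broadcast back to each row
counts = df.groupby("time_bucket")["Timestamp"].transform("size")

# Flag windows with an above-average message count as high traffic
# (the mean threshold is an assumption, not the notebook's verified cutoff)
df["high_traffic"] = (counts > counts.mean()).astype(int)
print(df["high_traffic"].tolist())
```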
This cell measures the uniqueness of the byte data in the data columns, to look for a difference in payload variance between flooding and non-flooding messages.
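One way to compute such a feature is `nunique` across the payload columns per row; the column names below are stand-ins, since the dataset's actual byte-column names are not shown.

```python
import pandas as pd

# Stand-in per-byte payload columns (real names come from the dataset)
df = pd.DataFrame({
    "DATA0": ["00", "ff", "00"],
    "DATA1": ["00", "aa", "01"],
    "DATA2": ["00", "bb", "02"],
})

# Distinct byte values within each message's payload;
# low variance (e.g. an all-zero payload) can hint at injected flooding frames
data_cols = ["DATA0", "DATA1", "DATA2"]
df["unique_count"] = df[data_cols].nunique(axis=1)
print(df["unique_count"].tolist())
```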
This cell creates a parallel coordinates plot to visualize the relationships between the features DLC, high_traffic, unique_count, and flag. The plot is saved as parallel_coordinates.png.
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
plot_df = df[["unique_count", "DLC", "high_traffic", "flag"]].copy()
plot_df = plot_df.sample(1000, random_state=42)
plot_df["flag"] = plot_df["flag"].map({0: "Normal", 1: "Flooding"})
fig, ax = plt.subplots(figsize=(10, 6))
parallel_coordinates(plot_df, "flag",
                     color=["steelblue", "crimson"],
                     alpha=0.2,
                     ax=ax)
ax.set_title("Parallel Coordinates: Flooding vs Normal Messages")
fig.savefig("../figures/parallel_coordinates.png", dpi=150, bbox_inches="tight")
plt.show()
This cell rebuilds the correlation heatmap with the newly extracted features to gauge their relevance for training.
#Relooking at correlation
corr_df = df[["DLC", "unique_count","high_traffic", "flag"]].copy()
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
fig.savefig("../figures/correlation_heatmap_features.png", dpi=150, bbox_inches="tight")  # distinct name so the earlier heatmap is not overwritten
plt.show()
This cell splits the dataset into training and testing sets using train_test_split from sklearn. The target variable is flag.
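A minimal sketch of the split, using a toy stand-in frame; the `test_size`, `random_state`, and `stratify` arguments are assumptions about the notebook's call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in feature frame; the notebook uses DLC, unique_count, and high_traffic
df = pd.DataFrame({
    "DLC": [8, 8, 2, 8, 2, 8, 2, 8],
    "unique_count": [4, 3, 1, 4, 1, 3, 1, 4],
    "high_traffic": [0, 0, 1, 0, 1, 0, 1, 0],
    "flag": [0, 0, 1, 0, 1, 0, 1, 0],
})
X = df[["DLC", "unique_count", "high_traffic"]]
y = df["flag"]

# stratify keeps the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```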
This cell scales the features down to a common range, which logistic regression in particular needs in order to perform effectively.
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
# Fill NaNs
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
# Scale
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
This cell trains two machine learning models, Logistic Regression and Random Forest Classifier, using the training data.
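The training step can be sketched as below; synthetic data stands in for the scaled `X_train`/`y_train`, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions rather than the notebook's exact settings. The names `lr` and `rf` match those used in the evaluation cell that follows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled training data produced above
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=0)

# Fit both models on the same training set
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(lr.score(X_train, y_train), rf.score(X_train, y_train))
```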
This cell evaluates the performance of the trained models (Logistic Regression and Random Forest) on the test dataset. It prints a classification report and a confusion matrix for each model, and saves the confusion matrix plots as images.
from sklearn.metrics import classification_report, confusion_matrix
for name, model in [("Logistic Regression", lr), ("Random Forest", rf)]:
    y_pred = model.predict(X_test)
    print(f"\n{'='*40}")
    print(f"{name}")
    print(f"{'='*40}")
    print(classification_report(y_test, y_pred))
    fig, ax = plt.subplots(figsize=(6, 5))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_title(f"Confusion Matrix - {name}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    fig.savefig(f"../figures/confusion_matrix_{name.lower().replace(' ', '_')}.png",
                dpi=150, bbox_inches="tight")
    plt.show()
This verifies that the models are learning meaningful patterns from the data.