IoT Network Traffic Analysis with Machine Learning - Current Progress

machine learning
data-processing
data-pipeline
feature-engineering
An overview of my progress in my IoT-ML project: building a data pipeline from raw Zeek network logs to an engineered feature matrix for malicious traffic detection.
Published

March 31, 2026

Goal

Detect malicious network traffic from IoT devices using machine learning, based on Zeek network intrusion detection logs captured from 38 devices in November 2020.


Experimental Workflow

1. Data Ingestion

Raw Zeek JSON logs are extracted from 38 zip files and loaded into a SQLite database (~43M records across 14 tables), with traffic bucketed into time windows.
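The ingestion step can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: the function name `ingest_zip`, the four-column table layout, and the one-hour `WINDOW_SECONDS` are all assumptions, and it presumes each log is newline-delimited JSON with `ts` as epoch seconds (Zeek's default JSON timestamp format).

```python
import io
import json
import sqlite3
import zipfile

WINDOW_SECONDS = 3600  # assumed window size; the post does not state one


def ingest_zip(zip_source, conn, device_id):
    """Load every Zeek JSON log in one device's zip into SQLite.

    Each member such as conn.log goes to a table named after its log
    type (conn, dns, ...). Returns the number of rows inserted.
    """
    inserted = 0
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if not name.endswith(".log"):
                continue
            # "logs/conn.log" -> "conn"
            table = name.rsplit("/", 1)[-1].split(".", 1)[0]
            if not table.isidentifier():
                continue  # skip names that are unsafe as a table name
            conn.execute(
                f'CREATE TABLE IF NOT EXISTS "{table}" '
                "(device_id TEXT, ts REAL, time_window INTEGER, record TEXT)"
            )
            rows = []
            with zf.open(name) as fh:
                for raw in io.TextIOWrapper(fh, encoding="utf-8"):
                    raw = raw.strip()
                    if not raw:
                        continue
                    ts = float(json.loads(raw)["ts"])
                    # Bucket the record into a fixed-width time window.
                    rows.append((device_id, ts, int(ts // WINDOW_SECONDS), raw))
            conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?, ?)', rows)
            inserted += len(rows)
    conn.commit()
    return inserted
```

Storing the raw JSON string alongside the parsed timestamp keeps the sketch schema-agnostic across log types; a real pipeline would more likely map each log's fields to typed columns.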

2. Feature Exploration

To determine which signals are worth encoding into the graph built in Step 3, log records are first aggregated into 34 candidate daily features per device (connection counts, protocol ratios, DNS anomalies, file transfer stats). Redundant features are dropped by removing features from highly correlated pairs (|r| > 0.75) and features with high multicollinearity (VIF > 10), leaving 23 features that each carry distinct information. Four derived security metrics (traffic_asymmetry, src_concentration, weird_rate, dns_to_conn_ratio) are identified here and carried forward as core signals in the graph. This step establishes what to measure; Step 3 builds those measurements directly into the graph at higher resolution.
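The two redundancy filters could look something like the sketch below, using NumPy. The function names, the greedy "keep the earlier feature" rule, and the "drop the worst VIF, then recompute" loop are assumptions for illustration; the post only gives the thresholds. VIF is computed here as the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each feature regressed on the others. The sketch assumes no constant or exactly collinear columns, since either would make the correlation matrix undefined or singular.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.75):
    """Keep a feature only if |r| <= threshold against every feature kept so far."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


def vif(X):
    """Variance inflation factor per column (diagonal of the inverse correlation matrix)."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold=10.0):
    """Repeatedly remove the highest-VIF feature until every VIF is <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names
```

Here `X` is a (device-days × features) array and `names` the matching feature names. Recomputing VIF after each removal matters because dropping one feature usually lowers the VIF of the features it was entangled with.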

3. Graph Construction

Going back to the raw logs, graphs are built at hourly granularity. Each snapshot has devices as nodes and observed connections as edges, with two distinct types of features:

  • Edge features capture which devices connect to which, along with aggregated communication attributes covering volume, reliability, protocol mix, and application-layer behavior.
  • Node features describe how a device behaves overall: aggregated across all its connections from both sender and receiver perspectives, enriched with DNS activity, and extended with temporal delta and rolling z-score features to flag sudden deviations from each device’s recent 24-hour baseline.

The resulting snapshots are the planned input to a VGAE (variational graph autoencoder) model.
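As a rough sketch of the hourly snapshots and the temporal node features, the code below groups connection records by hour into per-edge aggregates and computes a delta and rolling z-score for one device's hourly series. Several things here are assumptions rather than the project's actual design: the function names, the two edge attributes shown, the use of IP addresses directly as node keys (a real pipeline would map them to devices), and the choice to exclude the current hour from its own 24-hour baseline. The field names `id.orig_h`, `id.resp_h`, `orig_bytes`, and `resp_bytes` are Zeek's standard conn.log fields.

```python
from collections import defaultdict
from statistics import mean, pstdev

HOUR = 3600


def build_hourly_snapshots(conn_records):
    """Return {hour: {(src, dst): edge_stats}} from Zeek conn.log records."""
    snapshots = defaultdict(
        lambda: defaultdict(lambda: {"conn_count": 0, "total_bytes": 0})
    )
    for rec in conn_records:
        hour = int(rec["ts"] // HOUR)
        edge = snapshots[hour][(rec["id.orig_h"], rec["id.resp_h"])]
        edge["conn_count"] += 1
        # Byte counts can be missing in conn.log, so treat absent as 0.
        edge["total_bytes"] += (rec.get("orig_bytes") or 0) + (rec.get("resp_bytes") or 0)
    return {hour: dict(edges) for hour, edges in snapshots.items()}


def delta_and_zscore(values, window=24):
    """Per hour, return (delta vs. previous hour, z-score vs. previous `window` hours).

    z is 0.0 when the baseline has fewer than two points or zero variance.
    """
    out = []
    for i, value in enumerate(values):
        baseline = values[max(0, i - window):i]
        delta = value - values[i - 1] if i > 0 else 0.0
        if len(baseline) < 2:
            out.append((delta, 0.0))
            continue
        sd = pstdev(baseline)
        z = (value - mean(baseline)) / sd if sd > 0 else 0.0
        out.append((delta, z))
    return out
```

Excluding the current hour from the baseline keeps a sudden spike from inflating the mean and standard deviation it is being compared against.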

Current Feature Set

Category                Features
Connection              unique_src_ips, unique_dst_ips, unique_dst_ports, avg_duration, failed_conns, failed_conn_ratio
Protocol                tcp_count, tcp_ratio, dns_count, dns_to_conn_ratio
DNS                     unique_domains, nxdomain_ratio
Files / SNMP / Network  file_count, total_file_bytes, snmp_count, snmp_write_ratio, tunnel_count, unique_tunnel_types, reporter_count
Derived Security        traffic_asymmetry, total_bytes, src_concentration, weird_rate
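The post does not define the derived metrics, so the sketch below shows one plausible set of definitions, purely as an illustration. Every formula here is an assumption: traffic_asymmetry as the normalized difference between sent and received bytes, src_concentration as the share of connections from the single busiest source IP, and the two rates as simple per-connection ratios. The function name and its parameters are hypothetical.

```python
def derived_security_metrics(orig_bytes, resp_bytes, conns_per_src,
                             weird_count, dns_count, conn_count):
    """Compute illustrative derived metrics for one device in one time window.

    conns_per_src maps each source IP to its connection count.
    All definitions below are assumed, not taken from the project.
    """
    total_bytes = orig_bytes + resp_bytes
    total_src = sum(conns_per_src.values())
    return {
        "total_bytes": total_bytes,
        # +1 means all bytes were sent, -1 all received, 0 balanced.
        "traffic_asymmetry": (orig_bytes - resp_bytes) / total_bytes if total_bytes else 0.0,
        # Share of connections coming from the busiest source IP.
        "src_concentration": max(conns_per_src.values()) / total_src if total_src else 0.0,
        # Zeek weird.log events per connection.
        "weird_rate": weird_count / conn_count if conn_count else 0.0,
        # DNS queries per connection.
        "dns_to_conn_ratio": dns_count / conn_count if conn_count else 0.0,
    }
```

Guarding each ratio against a zero denominator keeps quiet windows from producing errors; whether 0.0 or a missing value is the better fill depends on how the downstream model treats them.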

Current Status

Phase                          Status
Data ingestion pipeline        Done
Feature engineering            Done
EDA                            Done
Graph prep                     In progress
Model architecture / training  Not started
Evaluation / results           Not started