# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Setting up the style for plots
sns.set(style="whitegrid")
Network Intrusion Detection System using Machine Learning¶
Overview¶
This project demonstrates the development of a machine learning-based Network Intrusion Detection System (IDS) to identify potential threats within network traffic. We use a labeled dataset of network traffic (e.g., the NSL-KDD dataset) and train a classification model to detect different types of intrusions.
Problem Statement¶
As cyber-attacks become more complex, traditional rule-based intrusion detection systems struggle to detect newer, more sophisticated attack types such as zero-day attacks. Our goal is to leverage machine learning to detect such intrusions in real time while minimising false positives.
# Loading the dataset (assuming it's in CSV format)
df = pd.read_csv("NSL_KDD_Train.csv")
# Show first 5 rows of the data
df.head()
Rename the columns of the dataset¶
Rename some of the dataset's columns to make them more understandable.
# The current column names ('0', 'tcp', 'ftp_data', ...) are the values of the
# file's first record, which suggests the CSV has no header row and that
# pandas used that record as the header.
df.rename(columns={
    '0': 'duration',
    'tcp': 'protocol_type',
    'ftp_data': 'service',
    'SF': 'flag',
    '491': 'src_bytes',
    '0.1': 'dst_bytes',
    'normal': 'label',
}, inplace=True)
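Column names such as '0', 'tcp' and '491' look like data values rather than headers. The following is a minimal sketch of what happens in that case, using two made-up records and a hypothetical seven-column layout (the real file has more columns, and `toy_columns` is an assumption for illustration only). It contrasts the default `pd.read_csv` behaviour with `header=None` plus `names=`.

```python
import io
import pandas as pd

# Two made-up records in a 7-column toy layout (illustration only).
raw = "0,tcp,ftp_data,SF,491,0,normal\n0,udp,other,SF,146,0,normal\n"
toy_columns = ["duration", "protocol_type", "service", "flag",
               "src_bytes", "dst_bytes", "label"]

# Default behaviour: the first record becomes the header and is lost as data.
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= supplies the column names directly.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=toy_columns)

print(list(df_default.columns))
print(len(df_default), len(df_named))
```

If the real file behaves the same way, passing the full list of column names at load time would keep the first record and make the renaming step unnecessary.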
Data Overview¶
The dataset contains the following columns:
- duration: Length of the connection
- protocol_type: Type of protocol (TCP, UDP, etc.)
- service: Network service on the destination (e.g., HTTP, FTP)
- flag: Status of the connection
- src_bytes: Number of data bytes sent from source to destination
- dst_bytes: Number of data bytes sent from destination to source
- ... (List the most relevant features).
The target variable is label, which identifies whether the connection was normal or an attack.
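If the label column stores the string "normal" for benign connections and an attack name otherwise (an assumption about the file's contents, not something confirmed here), a binary normal-vs-attack target could be derived as in this sketch. The label values below are made up for illustration.

```python
import pandas as pd

# Made-up label values; "normal" marks benign traffic, anything else an attack.
labels = pd.Series(["normal", "neptune", "normal", "smurf", "normal"], name="label")

# 0 = normal, 1 = attack
is_attack = (labels != "normal").astype(int)
print(is_attack.value_counts().to_dict())
```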
Exploratory Data Analysis (EDA)¶
# Plot the distribution of attacks vs normal traffic
plt.figure(figsize=(8,5))
sns.countplot(x='label', data=df)
plt.title("Distribution of Attacks vs Normal Traffic")
plt.show()
# Show correlation between features
df_numeric = df.select_dtypes(include=[np.number])
plt.figure(figsize=(12,8))
sns.heatmap(df_numeric.corr(), cmap='coolwarm', annot=False)
plt.title("Correlation between Features")
plt.show()
Data Preprocessing¶
# Label encoding: convert the categorical columns to integer codes
df['protocol_type'] = df['protocol_type'].astype('category').cat.codes
df['service'] = df['service'].astype('category').cat.codes
df['flag'] = df['flag'].astype('category').cat.codes
# Alternative: one-hot encoding of the same columns.
# Note: df_encoded is not used below; the model is trained on the label-encoded df.
df_encoded = pd.get_dummies(df, columns=['protocol_type', 'service', 'flag'])
# Splitting data into features (X) and labels (y)
X = df.drop('label', axis=1)
y = df['label']
# Train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
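Integer codes impose an arbitrary order on categories such as the protocol type. One possible alternative, sketched here on a small made-up frame, is to one-hot encode inside a scikit-learn `Pipeline` so that the encoder is fitted on the training split only. The toy data, column values and variable names are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small made-up frame with the same kinds of columns (illustration only).
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"] * 10,
    "service": ["http", "domain_u", "ecr_i", "ftp_data"] * 10,
    "flag": ["SF", "SF", "SF", "S0"] * 10,
    "src_bytes": list(range(40)),
    "label": ["normal", "normal", "attack", "attack"] * 10,
})
categorical = ["protocol_type", "service", "flag"]

X_toy = toy.drop(columns="label")
y_toy = toy["label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# One-hot encode the categorical columns; pass numeric columns through unchanged.
# handle_unknown="ignore" avoids errors on categories unseen during fitting.
encoder = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("encode", encoder),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipeline.fit(X_tr, y_tr)
toy_pred = pipeline.predict(X_te)
print(len(toy_pred))
```

Tree-based models often cope reasonably well with integer codes, so this is a design option rather than a required change.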
Checking the data types of the training sets¶
print(X_train.dtypes)
print(y_train.dtypes)
Model Training¶
# Training a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predictions on the test set
y_pred = model.predict(X_test)
Model Evaluation¶
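# Confusion Matrix
# Derive the tick labels from the data so they match the matrix's row/column order
class_labels = np.unique(np.concatenate([y_test.to_numpy(), y_pred]))
conf_matrix = confusion_matrix(y_test, y_pred, labels=class_labels)
plt.figure(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=class_labels, yticklabels=class_labels)
plt.title("Confusion Matrix")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()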
# Classification Report
print(classification_report(y_test, y_pred, zero_division=0))
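Since minimising false positives is one of the stated goals, it can help to read the false positive rate directly off a binary confusion matrix. The sketch below uses made-up labels (0 = normal, 1 = attack) and assumes the target has been reduced to two classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 0 = normal, 1 = attack.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 1]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

false_positive_rate = fp / (fp + tn)  # normal traffic flagged as attack
detection_rate = tp / (tp + fn)       # attacks correctly flagged (recall)
print(false_positive_rate, detection_rate)
```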
Insights¶
- The model shows an accuracy of X% in identifying network intrusions.
- The confusion matrix indicates that most attacks were correctly identified, but there are some false positives (normal traffic misclassified as attacks).
- Feature importance analysis reveals that features like src_bytes, dst_bytes, and protocol_type play key roles in the detection.
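A feature importance ranking of the kind mentioned above could be obtained from a fitted random forest roughly as follows. This is a sketch on synthetic data with hypothetical feature names, so the ranking it prints says nothing about the actual dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are hypothetical.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_syn, y_syn)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

With the notebook's own model, the same pattern would presumably use `model.feature_importances_` and `X_train.columns`.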
Conclusion¶
This machine learning-based IDS provides a promising approach to detecting network intrusions. However, to improve the accuracy and reduce false positives, further tuning of the model or incorporating additional data (e.g., behavioral patterns) could be beneficial.
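One way the suggested model tuning might look is a small grid search scored on precision, which penalises false positives. The sketch below runs on synthetic binary data; the parameter grid and the choice of scoring metric are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data as a stand-in for the real features.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# A deliberately tiny grid so the search runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="precision",  # higher precision means fewer false positives
    cv=3,
)
search.fit(X_syn, y_syn)
print(search.best_params_)
```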