Supervised ML Algorithms Case Study

Comparative Analysis of Machine Learning Classification Methods on Binary-Class Datasets

Project Overview

This study compares three popular supervised machine learning classification algorithms—Random Forests, Support Vector Machines, and Logistic Regression—by applying them to three binary-class datasets from the UCI Machine Learning Repository. The goal was to evaluate their performance across different contexts and identify their strengths and weaknesses.

Algorithms were evaluated using different train-test splits (20/80, 50/50, 80/20)
Parameter tuning was done with 10-fold cross-validation
Performance was measured by training/testing accuracy and log loss
Three datasets were used: Credit Approval, Tic-Tac-Toe, and Breast Cancer Wisconsin
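The evaluation protocol above (three train/test splits, 10-fold cross-validated tuning, accuracy and log loss) can be sketched roughly as follows. This is a minimal illustration using scikit-learn with a synthetic binary dataset standing in for the UCI data, and Logistic Regression with a hypothetical parameter grid as the example model; it is not the study's actual code.

```python
# Sketch of the evaluation protocol: three splits, 10-fold CV tuning,
# then test accuracy and log loss. Synthetic data stands in for UCI sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The three train/test splits used in the study: 20/80, 50/50, 80/20.
for train_frac in (0.2, 0.5, 0.8):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, random_state=0)

    # Parameter tuning with 10-fold cross-validation on the training set
    # (the grid of C values here is illustrative, not from the paper).
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.01, 0.1, 1, 10]}, cv=10)
    grid.fit(X_tr, y_tr)

    acc = accuracy_score(y_te, grid.predict(X_te))
    ll = log_loss(y_te, grid.predict_proba(X_te))
    print(f"train={train_frac:.0%}  test acc={acc:.3f}  log loss={ll:.3f}")
```

The same loop applies unchanged to the other two classifiers by swapping the estimator and parameter grid passed to GridSearchCV.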

The Algorithms

Random Forest

Tree-based algorithm that builds multiple decision trees on random data subsets and combines their predictions.

Handles non-linear relationships well
Shows feature importance
Reduces overfitting
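A minimal sketch of the points above, using scikit-learn's RandomForestClassifier on synthetic stand-in data (not the study's datasets or settings):

```python
# Random Forest: an ensemble of decision trees, each fit on a bootstrap
# sample of rows with a random feature subset considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Per-feature importance scores (impurity-based); they sum to 1.
print(rf.feature_importances_.round(3))
```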

Support Vector Machine

Finds the optimal boundary that separates classes with maximum margin.

Works well in high dimensions
Handles complex relationships
Effective with limited samples
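As a rough sketch of the idea, here is an SVM with an RBF kernel fit via scikit-learn on synthetic data (the kernel and C value are illustrative assumptions, not the study's configuration):

```python
# SVM: finds the maximum-margin boundary; the RBF kernel lets it
# separate classes that are not linearly separable.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=2)

# C trades off margin width against training misclassifications;
# probability=True enables probability estimates needed for log loss.
svm = SVC(kernel="rbf", C=1.0, probability=True, random_state=2).fit(X, y)

print(f"training accuracy: {svm.score(X, y):.3f}")
```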

Logistic Regression

Models probability of class membership using a logistic function applied to linear feature combinations.

Simpler implementation
Faster training time
Easily interpretable results
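The description above can be made concrete with a short scikit-learn sketch on synthetic data: the model applies the logistic function to a linear combination of the features to produce a class probability.

```python
# Logistic Regression: probability of class membership via the logistic
# (sigmoid) function applied to a linear score w·x + b.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=3)

lr = LogisticRegression(max_iter=1000).fit(X, y)

# Each row gives P(class 0), P(class 1); the two entries sum to 1.
probs = lr.predict_proba(X[:3])
print(probs.round(3))
```

The learned coefficients (`lr.coef_`) are directly interpretable as per-feature effects on the log-odds, which is the interpretability advantage noted above.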

Results by Dataset

All three algorithms performed well across datasets, with high accuracy and relatively low log loss values. Below are the best test accuracies achieved by each algorithm:

Credit Approval

Random Forest: 97.8%
SVM: 96.4%
Logistic Regression: 95.1%

Tic-Tac-Toe

Random Forest: 96.2%
SVM: 98.5%
Logistic Regression: 98.5%

Breast Cancer

Random Forest: 97.8%
SVM: 96.4%
Logistic Regression: 97.4%

In addition to accuracy, log loss was measured as an indicator of model confidence. SVM consistently achieved the lowest log loss values, indicating that its predicted probabilities were both confident and well calibrated.
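To illustrate what a lower log loss means here, the toy numbers below (not from the study) compare a confident, correct model against a hesitant one using scikit-learn's metric:

```python
# Log loss rewards probabilities that are both correct and confident;
# a model that is right but unsure scores worse than one that is right
# and confident. Toy probabilities, not study data.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.95, 0.85]   # correct and confident
hesitant = [0.6, 0.4, 0.55, 0.6]     # correct but unsure

print(log_loss(y_true, confident))   # lower (better)
print(log_loss(y_true, hesitant))    # higher (worse)
```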

Key Findings

By a narrow margin, SVM demonstrated the most consistent performance across all datasets, showing the best balance between training/testing accuracies and log loss values. However, all three algorithms proved effective, with their relative strengths becoming apparent in different contexts.

Logistic Regression offers strong performance with the benefit of simplicity and fast runtime, making it an excellent choice for binary classification tasks, especially those with linear relationships. Random Forest excels at capturing non-linear feature interactions and reducing overfitting, while SVM shows particular strength in maintaining balanced performance across varied datasets.

Conclusion

This study demonstrates that while SVM might have a slight edge in overall performance, the optimal algorithm choice depends heavily on the specific dataset characteristics and practical constraints like interpretability and computational resources. For binary classification tasks, all three algorithms can deliver strong results when properly tuned.

The most important takeaway is that thorough parameter tuning and proper cross-validation are often more important than the specific algorithm choice, as all three methods achieved accuracies above 95% when optimally configured.

Full Research Paper

Download Full Paper (PDF)

"A Comparative Analysis of Three Supervised Machine Learning Classification Methods employed on Three Binary-Class Datasets"

Explore the Code

View on GitHub