Supervised ML Algorithms Case Study

Comparative Analysis of Machine Learning Classification Methods on Binary-Class Datasets

Project Overview

This study compares three popular supervised machine learning classification algorithms—Random Forests, Support Vector Machines, and Logistic Regression—by applying them to three binary-class datasets from the UCI Machine Learning Repository. The goal was to evaluate their performance across different contexts and identify their strengths and weaknesses.

Algorithms were evaluated using different train-test splits (20/80, 50/50, 80/20)
Parameter tuning was done with 10-fold cross-validation
Performance was measured by training/testing accuracy and log loss
Three datasets were used: Credit Approval, Tic-Tac-Toe, and Breast Cancer Wisconsin
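The evaluation protocol above (three train/test splits, 10-fold cross-validated tuning, accuracy and log loss) can be sketched roughly as follows. This is a minimal illustration using scikit-learn with a synthetic binary dataset standing in for the UCI data, and Logistic Regression with a hypothetical parameter grid as the example model; it is not the study's actual code.

```python
# Sketch of the evaluation protocol: three splits, 10-fold CV tuning,
# then test accuracy and log loss. Synthetic data stands in for UCI sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The three train/test splits used in the study: 20/80, 50/50, 80/20.
for train_frac in (0.2, 0.5, 0.8):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, random_state=0)

    # Parameter tuning with 10-fold cross-validation on the training set
    # (the grid of C values here is illustrative, not from the paper).
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.01, 0.1, 1, 10]}, cv=10)
    grid.fit(X_tr, y_tr)

    acc = accuracy_score(y_te, grid.predict(X_te))
    ll = log_loss(y_te, grid.predict_proba(X_te))
    print(f"train={train_frac:.0%}  test acc={acc:.3f}  log loss={ll:.3f}")
```

The same loop applies unchanged to the other two classifiers by swapping the estimator and parameter grid passed to GridSearchCV.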

The Algorithms

Random Forest

Tree-based algorithm that builds multiple decision trees on random data subsets and combines their predictions.

Handles non-linear relationships well
Shows feature importance
Reduces overfitting
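A minimal sketch of the points above, using scikit-learn's RandomForestClassifier on synthetic stand-in data (not the study's datasets or settings):

```python
# Random Forest: an ensemble of decision trees, each fit on a bootstrap
# sample of rows with a random feature subset considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Per-feature importance scores (impurity-based); they sum to 1.
print(rf.feature_importances_.round(3))
```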

Support Vector Machine

Finds the optimal boundary that separates classes with maximum margin.

Works well in high dimensions
Handles complex relationships
Effective with limited samples
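As a rough sketch of the idea, here is an SVM with an RBF kernel fit via scikit-learn on synthetic data (the kernel and C value are illustrative assumptions, not the study's configuration):

```python
# SVM: finds the maximum-margin boundary; the RBF kernel lets it
# separate classes that are not linearly separable.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=2)

# C trades off margin width against training misclassifications;
# probability=True enables probability estimates needed for log loss.
svm = SVC(kernel="rbf", C=1.0, probability=True, random_state=2).fit(X, y)

print(f"training accuracy: {svm.score(X, y):.3f}")
```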

Logistic Regression

Models probability of class membership using a logistic function applied to linear feature combinations.

Simpler implementation
Faster training time
Easily interpretable results
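The description above can be made concrete with a short scikit-learn sketch on synthetic data: the model applies the logistic function to a linear combination of the features to produce a class probability.

```python
# Logistic Regression: probability of class membership via the logistic
# (sigmoid) function applied to a linear score w·x + b.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=3)

lr = LogisticRegression(max_iter=1000).fit(X, y)

# Each row gives P(class 0), P(class 1); the two entries sum to 1.
probs = lr.predict_proba(X[:3])
print(probs.round(3))
```

The learned coefficients (`lr.coef_`) are directly interpretable as per-feature effects on the log-odds, which is the interpretability advantage noted above.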

Results by Dataset

All three algorithms performed well across datasets, with high accuracy and relatively low log loss values. Below are the best test accuracies achieved by each algorithm:

Credit Approval

Random Forest: 97.8%
SVM: 96.4%
Logistic Regression: 95.1%

Tic-Tac-Toe

Random Forest: 96.2%
SVM: 98.5%
Logistic Regression: 98.5%

Breast Cancer

Random Forest: 97.8%
SVM: 96.4%
Logistic Regression: 97.4%

In addition to accuracy, log loss was measured as an indicator of model confidence. SVM consistently achieved the lowest log loss values, indicating that its predicted probabilities were both confident and well calibrated.
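To illustrate what a lower log loss means here, the toy numbers below (not from the study) compare a confident, correct model against a hesitant one using scikit-learn's metric:

```python
# Log loss rewards probabilities that are both correct and confident;
# a model that is right but unsure scores worse than one that is right
# and confident. Toy probabilities, not study data.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.95, 0.85]   # correct and confident
hesitant = [0.6, 0.4, 0.55, 0.6]     # correct but unsure

print(log_loss(y_true, confident))   # lower (better)
print(log_loss(y_true, hesitant))    # higher (worse)
```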

Key Findings

By a narrow margin, SVM demonstrated the most consistent performance across all datasets, showing the best balance between training/testing accuracies and log loss values. However, all three algorithms proved effective, with their relative strengths becoming apparent in different contexts.

Logistic Regression offers strong performance with the benefit of simplicity and fast runtime, making it an excellent choice for binary classification tasks, especially those with linear relationships. Random Forest excels at capturing non-linear feature interactions and reducing overfitting, while SVM shows particular strength in maintaining balanced performance across varied datasets.

Conclusion

This study demonstrates that while SVM might have a slight edge in overall performance, the optimal algorithm choice depends heavily on the specific dataset characteristics and practical constraints like interpretability and computational resources. For binary classification tasks, all three algorithms can deliver strong results when properly tuned.

The most important takeaway is that thorough parameter tuning and proper cross-validation are often more important than the specific algorithm choice, as all three methods achieved accuracies above 95% when optimally configured.

Full Research Paper

Download Full Paper (PDF)

"A Comparative Analysis of Three Supervised Machine Learning Classification Methods employed on Three Binary-Class Datasets"

Explore the Code

View on GitHub