Project Overview
This study compares three popular supervised classification algorithms (Random Forests, Support Vector Machines, and Logistic Regression) by applying them to three binary classification datasets from the UCI Machine Learning Repository. The goal was to evaluate their performance across different contexts and to identify their relative strengths and weaknesses.
The Algorithms
Random Forest
An ensemble method that builds many decision trees on bootstrapped subsets of the data, with random feature selection at each split, and aggregates their votes into a single prediction.
Support Vector Machine
Finds the separating hyperplane that maximizes the margin between classes; kernel functions extend it to non-linear decision boundaries.
Logistic Regression
Models the probability of class membership by applying a logistic (sigmoid) function to a linear combination of the features.
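The study does not name its implementation, but all three classifiers are available in scikit-learn. A minimal sketch of training and scoring them, assuming scikit-learn and using the Breast Cancer Wisconsin data bundled with it as a stand-in for the UCI download:

```python
# Minimal sketch, assuming scikit-learn; estimators and parameters are
# illustrative, not the study's actual configuration.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for the UCI Breast Cancer dataset used in the study.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    # Ensemble of decision trees trained on bootstrapped samples
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Maximum-margin classifier; probability=True enables log-loss scoring later
    "SVM": SVC(probability=True, random_state=42),
    # Logistic function applied to a linear combination of the features
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```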
Results by Dataset
All three algorithms performed well, with high test accuracy and relatively low log loss on each of the three evaluation datasets:

Credit Approval
Tic-Tac-Toe
Breast Cancer
In addition to accuracy, log loss was measured as an indicator of model confidence. SVM consistently achieved the lowest log loss, indicating that its predicted probabilities were the most confident and best calibrated.
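Continuing the sketch above, log loss can be computed from each model's predicted class probabilities with scikit-learn's log_loss (again an assumption about tooling, not the study's stated setup):

```python
from sklearn.metrics import log_loss

# Log loss heavily penalizes confident wrong predictions, so a low value
# indicates probabilities that are both confident and well calibrated.
for name, model in models.items():
    proba = model.predict_proba(X_test)  # requires probability estimates
    print(f"{name}: log loss = {log_loss(y_test, proba):.3f}")
```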
Key Findings
By a narrow margin, SVM showed the most consistent performance across all datasets, with the best balance of training accuracy, test accuracy, and log loss. All three algorithms proved effective, however, with their relative strengths emerging in different contexts.
Logistic Regression delivers strong performance with the benefits of simplicity and fast runtime, making it an excellent choice for binary classification tasks, especially those with largely linear relationships. Random Forest excels at capturing non-linear feature interactions and resisting overfitting, while SVM is particularly strong at maintaining balanced performance across varied datasets.
Conclusion
This study demonstrates that while SVM might have a slight edge in overall performance, the optimal algorithm choice depends heavily on the specific dataset characteristics and practical constraints like interpretability and computational resources. For binary classification tasks, all three algorithms can deliver strong results when properly tuned.
The most important takeaway is that thorough hyperparameter tuning and proper cross-validation often matter more than the specific algorithm choice: all three methods achieved accuracies above 95% when optimally configured.
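To illustrate that tuning step, here is a hedged sketch using scikit-learn's GridSearchCV, continuing the earlier example; the parameter grids are illustrative, not the study's actual search spaces:

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grids only; the study does not publish its search spaces.
param_grids = {
    "Random Forest": {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    "SVM": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    "Logistic Regression": {"C": [0.01, 0.1, 1, 10]},
}

for name, model in models.items():
    # 5-fold cross-validated grid search over each model's hyperparameters
    search = GridSearchCV(model, param_grids[name], cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    print(f"{name}: best CV accuracy = {search.best_score_:.3f} "
          f"with {search.best_params_}")
```

Cross-validated scores like these, rather than a single train/test split, are what make a tuned comparison between algorithms trustworthy.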