Clustering Algorithms Visualization

An interactive web application for visualizing and exploring clustering algorithms

K-means Clustering Demo

Project Overview

Clustering Algorithms Visualization is an interactive web application that lets users explore unsupervised machine learning clustering techniques on different datasets. The project began as a class assignment focused on K-means; I later expanded it into an interactive interface demonstrating the other clustering algorithms covered in the course.

This application provides a hands-on approach to understanding how clustering algorithms work by visualizing their step-by-step execution. Users can select different datasets, adjust the number of clusters, and observe how each algorithm partitions the data points differently, making it an excellent educational tool for machine learning concepts.

Technical Implementation

Key Features

Real-time visualization of clustering algorithms (K-means, EM, Hierarchical)
Selection between multiple synthetic datasets with different distributions
Interactive controls to adjust the number of clusters and algorithm parameters
Flask-based backend with optimized NumPy implementations
Dynamic frontend rendering of clustering results
Principal Component Analysis (PCA) for dimensionality reduction

Architecture

The application follows a client-server architecture with a Flask backend and a JavaScript frontend (a sketch of a typical endpoint follows the list below):

Backend: Python Flask server that handles dataset generation, algorithm execution, and API endpoints
Algorithms: Custom implementations of clustering algorithms using NumPy and SciPy
Frontend: Interactive visualization using HTML5, CSS3, and JavaScript
Data Processing: Dimensionality reduction with PCA for visualization
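
As a rough illustration of this flow, a clustering endpoint might look like the sketch below. The route name, payload fields, and module path are assumptions for illustration, not the repo's actual API:

import numpy as np
from flask import Flask, jsonify, request

from kmeans import KmeansModel  # hypothetical module path; the class is shown under Code Highlights

app = Flask(__name__)

@app.route("/api/cluster", methods=["POST"])
def cluster():
    # Receive data points and a cluster count, return labels and centroids
    payload = request.get_json()
    X = np.asarray(payload["points"], dtype=float)
    k = int(payload.get("k", 3))
    model = KmeansModel(X, k)
    labels = model.run()
    return jsonify({
        "labels": labels.tolist(),
        "centroids": model.centroids.tolist(),
    })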

Algorithm Implementations

K-means

The K-means implementation follows these steps (the full model appears under Code Highlights below):

Randomly initialize K centroids from the data points
Assign each point to the nearest centroid
Update centroids based on the mean of assigned points
Repeat until convergence or max iterations reached

Expectation-Maximization

The EM implementation for Gaussian Mixture Models follows these steps (a code sketch appears after the list):

Initialize means, covariances, and weights
E-step: Compute responsibilities for each point
M-step: Update parameters based on responsibilities
Regularize covariances to maintain numerical stability
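
A minimal sketch of these steps for a GMM is shown below. The class name and initialization scheme are assumptions, and a production version would compute responsibilities in log space to avoid underflow; this only illustrates the E/M structure and the regularization mentioned above:

import numpy as np
from scipy.stats import multivariate_normal

class GMMModel:
    """Illustrative EM for a Gaussian Mixture Model (sketch, not the repo's code)."""
    def __init__(self, X, k, max_iters=100, reg=1e-6):
        self.X, self.k, self.max_iters, self.reg = X, k, max_iters, reg
        N, d = X.shape
        rng = np.random.default_rng(0)
        # Means from random data points, identity covariances, uniform weights
        self.means = X[rng.choice(N, k, replace=False)]
        self.covs = np.array([np.eye(d) for _ in range(k)])
        self.weights = np.full(k, 1.0 / k)

    def e_step(self):
        # Responsibilities: posterior probability of each component for each point
        resp = np.zeros((self.X.shape[0], self.k))
        for j in range(self.k):
            resp[:, j] = self.weights[j] * multivariate_normal.pdf(
                self.X, mean=self.means[j], cov=self.covs[j])
        return resp / resp.sum(axis=1, keepdims=True)

    def m_step(self, resp):
        # Update weights, means, and covariances from the responsibilities
        Nk = resp.sum(axis=0)                        # effective count per component
        self.weights = Nk / self.X.shape[0]
        self.means = (resp.T @ self.X) / Nk[:, None]
        d = self.X.shape[1]
        for j in range(self.k):
            diff = self.X - self.means[j]
            self.covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j]
            self.covs[j] += self.reg * np.eye(d)     # regularization for stability

    def run(self):
        for _ in range(self.max_iters):
            resp = self.e_step()
            self.m_step(resp)
        return np.argmax(self.e_step(), axis=1)      # hard labels for visualization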

Code Highlights

One of the core algorithm implementations is the K-means model:

import numpy as np


class KmeansModel:
    def __init__(self, X, k, max_iters=100):
        self.X = X
        self.k = k
        self.max_iters = max_iters
        self.dim = X.shape[1]
        self.N = X.shape[0]

        # Initialize centroids by sampling k distinct data points
        indices = np.random.choice(self.N, self.k, replace=False)
        self.centroids = self.X[indices]

    def get_labels(self, X, centroids):
        # Euclidean distance from every point to every centroid via broadcasting
        distances = np.sqrt(((X[:, np.newaxis] - centroids) ** 2).sum(axis=2))
        return np.argmin(distances, axis=1)

    def run(self):
        prev_centroids = None
        iters = 0

        while iters < self.max_iters:
            # Assignment step: label each point with its nearest centroid
            labels = self.get_labels(self.X, self.centroids)

            # Update step: move each centroid to the mean of its points;
            # keep the old centroid if a cluster ends up empty
            new_centroids = np.array([
                self.X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
                else self.centroids[k]
                for k in range(self.k)
            ])

            # Stop early once the centroids no longer move
            if prev_centroids is not None and np.allclose(prev_centroids, new_centroids):
                break

            prev_centroids = new_centroids.copy()
            self.centroids = new_centroids
            iters += 1

        return self.get_labels(self.X, self.centroids)
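
A minimal usage sketch (the toy data here is invented for illustration; in the application the model is driven by the Flask backend):

import numpy as np

X = np.random.default_rng(42).normal(size=(300, 2))  # toy 2-D data
model = KmeansModel(X, k=3)
labels = model.run()           # cluster index for each of the 300 points
print(model.centroids)         # final centroid positions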

Performance Optimizations

The implementations include several optimizations:

Vectorized operations using NumPy instead of loops for better performance
Mathematical optimizations in distance calculations (sketched after this list)
Numerical stability improvements with regularization
Efficient data structures for storing intermediate results
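
For instance, the distance computation in the assignment step can be reduced to three vectorized operations using the expansion ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2. This is a sketch of the idea, not necessarily the repo's exact formulation:

import numpy as np

def pairwise_sq_dists(X, C):
    # Squared distances between N points and k centroids, computed with
    # broadcasting instead of an explicit N x k Python loop
    x_sq = (X ** 2).sum(axis=1)[:, None]   # shape (N, 1)
    c_sq = (C ** 2).sum(axis=1)[None, :]   # shape (1, k)
    d = x_sq - 2.0 * (X @ C.T) + c_sq      # shape (N, k)
    return np.maximum(d, 0.0)              # clamp tiny negatives from rounding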

Datasets

The application generates three different datasets to demonstrate how clustering algorithms perform under different data distributions:

Dataset X1: Three blobs with moderate overlap, randomly transformed
Dataset X2: Three distinct clusters with different centers
Dataset X3: Complex dataset with elongated clusters of different sizes and orientations

These datasets are generated with scikit-learn's make_blobs function and then linearly transformed to create more interesting cluster shapes:

import numpy as np
from sklearn.datasets import make_blobs


def get_X3():
    """Get X3 dataset from notebook implementation"""
    # Two elongated clusters: the same blob stretched along different axes
    centers = [[5, 5]]
    X31, _ = make_blobs(cluster_std=1.5, random_state=20, n_samples=200, centers=centers)
    X31 = np.dot(X31, np.array([[1.0, 0], [0, 5.0]]))  # stretch vertically

    X32, _ = make_blobs(cluster_std=1.5, random_state=20, n_samples=200, centers=centers)
    X32 = np.dot(X32, np.array([[5.0, 0], [0, 1.0]]))  # stretch horizontally

    # A smaller blob skewed by a random linear transformation
    centers = [[7, 7]]
    X33, _ = make_blobs(cluster_std=1.5, random_state=20, n_samples=100, centers=centers)
    X33 = np.dot(X33, np.random.RandomState(0).randn(2, 2))

    X3 = np.vstack((X31, X32, X33))
    return X3
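
The synthetic datasets above are already two-dimensional, but the application also applies PCA for visualization when reducing dimensionality (see Architecture). A minimal sketch of that projection step, with a function name assumed for illustration:

from sklearn.decomposition import PCA

def project_2d(X):
    # Project data onto its first two principal components for plotting;
    # effectively just a rotation when the input is already 2-D
    return PCA(n_components=2).fit_transform(X)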

Project Evolution

The project has evolved significantly over time:

Phase 1: Initial K-means implementation and visualization (A1)
Phase 2: Addition of EM and Hierarchical clustering algorithms (A2)
Phase 3: Enhanced UI/UX with interactive elements
Phase 4: Code optimization and refactoring for better performance

Future enhancements planned for this project:

Enable uploading custom datasets for analysis
Extend dimensionality reduction options beyond PCA (t-SNE, UMAP)

Technologies Used

Python, Flask, NumPy, SciPy, scikit-learn, JavaScript, HTML5, CSS3, Pandas, Machine Learning

Learning Outcomes

Through this project, I gained valuable experience in:

Implementing machine learning algorithms from scratch
Optimizing numerical computations with NumPy
Building interactive web applications with Flask
Working with data visualization and interactive plots
Understanding the mathematical foundations of clustering algorithms
Handling edge cases and numerical stability issues in ML algorithms

Explore the Code

View on GitHub