Clustering Algorithms Visualization

An interactive web application for visualizing and exploring clustering algorithms

K-means Clustering Demo

Project Overview

Clustering Algorithms Visualization is an interactive web application that lets users explore unsupervised machine learning clustering techniques on different datasets. The project began as a class assignment focused on K-means; I later expanded it into an interactive interface demonstrating the other clustering algorithms covered in the course.

This application provides a hands-on approach to understanding how clustering algorithms work by visualizing their step-by-step execution. Users can select different datasets, adjust the number of clusters, and observe how each algorithm partitions the data points differently, making it an excellent educational tool for machine learning concepts.

Technical Implementation

Key Features

Real-time visualization of clustering algorithms (K-means, EM, Hierarchical)
Selection between multiple synthetic datasets with different distributions
Interactive controls to adjust the number of clusters and algorithm parameters
Flask-based backend with optimized NumPy implementations
Dynamic frontend rendering of clustering results
Principal Component Analysis (PCA) for dimensionality reduction

Architecture

The application follows a client-server architecture with a Flask backend and a JavaScript frontend (a sketch of a typical endpoint follows the list below):

Backend: Python Flask server that handles dataset generation, algorithm execution, and API endpoints
Algorithms: Custom implementations of clustering algorithms using NumPy and SciPy
Frontend: Interactive visualization using HTML5, CSS3, and JavaScript
Data Processing: Dimensionality reduction with PCA for visualization
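
As a rough illustration of this flow, a clustering endpoint might look like the sketch below. The route name, payload fields, and module path are assumptions for illustration, not the repo's actual API:

import numpy as np
from flask import Flask, jsonify, request

from kmeans import KmeansModel  # hypothetical module path; the class is shown under Code Highlights

app = Flask(__name__)

@app.route("/api/cluster", methods=["POST"])
def cluster():
    # Receive data points and a cluster count, return labels and centroids
    payload = request.get_json()
    X = np.asarray(payload["points"], dtype=float)
    k = int(payload.get("k", 3))
    model = KmeansModel(X, k)
    labels = model.run()
    return jsonify({
        "labels": labels.tolist(),
        "centroids": model.centroids.tolist(),
    })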

Algorithm Implementations

K-means

The K-means implementation follows these steps (the full model appears under Code Highlights below):

Randomly initialize K centroids from the data points
Assign each point to the nearest centroid
Update centroids based on the mean of assigned points
Repeat until convergence or max iterations reached

Expectation-Maximization

The EM implementation for Gaussian Mixture Models follows these steps (a code sketch appears after the list):

Initialize means, covariances, and weights
E-step: Compute responsibilities for each point
M-step: Update parameters based on responsibilities
Regularize covariances to maintain numerical stability
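
A minimal sketch of these steps for a GMM is shown below. The class name and initialization scheme are assumptions, and a production version would compute responsibilities in log space to avoid underflow; this only illustrates the E/M structure and the regularization mentioned above:

import numpy as np
from scipy.stats import multivariate_normal

class GMMModel:
    """Illustrative EM for a Gaussian Mixture Model (sketch, not the repo's code)."""
    def __init__(self, X, k, max_iters=100, reg=1e-6):
        self.X, self.k, self.max_iters, self.reg = X, k, max_iters, reg
        N, d = X.shape
        rng = np.random.default_rng(0)
        # Means from random data points, identity covariances, uniform weights
        self.means = X[rng.choice(N, k, replace=False)]
        self.covs = np.array([np.eye(d) for _ in range(k)])
        self.weights = np.full(k, 1.0 / k)

    def e_step(self):
        # Responsibilities: posterior probability of each component for each point
        resp = np.zeros((self.X.shape[0], self.k))
        for j in range(self.k):
            resp[:, j] = self.weights[j] * multivariate_normal.pdf(
                self.X, mean=self.means[j], cov=self.covs[j])
        return resp / resp.sum(axis=1, keepdims=True)

    def m_step(self, resp):
        # Update weights, means, and covariances from the responsibilities
        Nk = resp.sum(axis=0)                        # effective count per component
        self.weights = Nk / self.X.shape[0]
        self.means = (resp.T @ self.X) / Nk[:, None]
        d = self.X.shape[1]
        for j in range(self.k):
            diff = self.X - self.means[j]
            self.covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j]
            self.covs[j] += self.reg * np.eye(d)     # regularization for stability

    def run(self):
        for _ in range(self.max_iters):
            resp = self.e_step()
            self.m_step(resp)
        return np.argmax(self.e_step(), axis=1)      # hard labels for visualization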

Code Highlights

One of the core algorithm implementations is the K-means model:

import numpy as np


class KmeansModel:
    def __init__(self, X, k, max_iters=100):
        self.X = X
        self.k = k
        self.max_iters = max_iters
        self.dim = X.shape[1]
        self.N = X.shape[0]

        # Initialize centroids by sampling k distinct data points
        indices = np.random.choice(self.N, self.k, replace=False)
        self.centroids = self.X[indices]

    def get_labels(self, X, centroids):
        # Euclidean distance from every point to every centroid via broadcasting
        distances = np.sqrt(((X[:, np.newaxis] - centroids) ** 2).sum(axis=2))
        return np.argmin(distances, axis=1)

    def run(self):
        prev_centroids = None
        iters = 0

        while iters < self.max_iters:
            # Assignment step: label each point with its nearest centroid
            labels = self.get_labels(self.X, self.centroids)

            # Update step: move each centroid to the mean of its points;
            # keep the old centroid if a cluster ends up empty
            new_centroids = np.array([
                self.X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
                else self.centroids[k]
                for k in range(self.k)
            ])

            # Stop early once the centroids no longer move
            if prev_centroids is not None and np.allclose(prev_centroids, new_centroids):
                break

            prev_centroids = new_centroids.copy()
            self.centroids = new_centroids
            iters += 1

        return self.get_labels(self.X, self.centroids)
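
A minimal usage sketch (the toy data here is invented for illustration; in the application the model is driven by the Flask backend):

import numpy as np

X = np.random.default_rng(42).normal(size=(300, 2))  # toy 2-D data
model = KmeansModel(X, k=3)
labels = model.run()           # cluster index for each of the 300 points
print(model.centroids)         # final centroid positions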

Performance Optimizations

The implementations include several optimizations:

Vectorized operations using NumPy instead of loops for better performance
Mathematical optimizations in distance calculations (sketched after this list)
Numerical stability improvements with regularization
Efficient data structures for storing intermediate results
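
For instance, the distance computation in the assignment step can be reduced to three vectorized operations using the expansion ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2. This is a sketch of the idea, not necessarily the repo's exact formulation:

import numpy as np

def pairwise_sq_dists(X, C):
    # Squared distances between N points and k centroids, computed with
    # broadcasting instead of an explicit N x k Python loop
    x_sq = (X ** 2).sum(axis=1)[:, None]   # shape (N, 1)
    c_sq = (C ** 2).sum(axis=1)[None, :]   # shape (1, k)
    d = x_sq - 2.0 * (X @ C.T) + c_sq      # shape (N, k)
    return np.maximum(d, 0.0)              # clamp tiny negatives from rounding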

Datasets

The application generates three different datasets to demonstrate how clustering algorithms perform under different data distributions:

Dataset X1: Three blobs with moderate overlap, randomly transformed
Dataset X2: Three distinct clusters with different centers
Dataset X3: Complex dataset with elongated clusters of different sizes and orientations

These datasets are generated with scikit-learn's make_blobs function and then linearly transformed to create more interesting cluster shapes:

import numpy as np
from sklearn.datasets import make_blobs


def get_X3():
    """Get X3 dataset from notebook implementation"""
    # Two elongated clusters: the same blob stretched along different axes
    centers = [[5, 5]]
    X31, _ = make_blobs(cluster_std=1.5, random_state=20, n_samples=200, centers=centers)
    X31 = np.dot(X31, np.array([[1.0, 0], [0, 5.0]]))  # stretch vertically

    X32, _ = make_blobs(cluster_std=1.5, random_state=20, n_samples=200, centers=centers)
    X32 = np.dot(X32, np.array([[5.0, 0], [0, 1.0]]))  # stretch horizontally

    # A smaller blob skewed by a random linear transformation
    centers = [[7, 7]]
    X33, _ = make_blobs(cluster_std=1.5, random_state=20, n_samples=100, centers=centers)
    X33 = np.dot(X33, np.random.RandomState(0).randn(2, 2))

    X3 = np.vstack((X31, X32, X33))
    return X3
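
The synthetic datasets above are already two-dimensional, but the application also applies PCA for visualization when reducing dimensionality (see Architecture). A minimal sketch of that projection step, with a function name assumed for illustration:

from sklearn.decomposition import PCA

def project_2d(X):
    # Project data onto its first two principal components for plotting;
    # effectively just a rotation when the input is already 2-D
    return PCA(n_components=2).fit_transform(X)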

Project Evolution

The project has evolved significantly over time:

Phase 1: Initial K-means implementation and visualization (A1)
Phase 2: Addition of EM and Hierarchical clustering algorithms (A2)
Phase 3: Enhanced UI/UX with interactive elements
Phase 4: Code optimization and refactoring for better performance

Future enhancements planned for this project:

Enable uploading custom datasets for analysis
Extend dimensionality reduction options beyond PCA (t-SNE, UMAP)

Technologies Used

Python, Flask, NumPy, SciPy, scikit-learn, JavaScript, HTML5, CSS3, Pandas, Machine Learning

Learning Outcomes

Through this project, I gained valuable experience in:

Implementing machine learning algorithms from scratch
Optimizing numerical computations with NumPy
Building interactive web applications with Flask
Working with data visualization and interactive plots
Understanding the mathematical foundations of clustering algorithms
Handling edge cases and numerical stability issues in ML algorithms

Explore the Code

View on GitHub