
Real-Time ASL Detection


A real-time American Sign Language detection system using MediaPipe hand tracking and a Random Forest classifier to recognize ASL letters and digits via webcam.

Started August 2024 · Completed May 2025 · 0 stars · Last updated Feb 20, 2026

Tech Stack

Python · OpenCV · MediaPipe · scikit-learn · NumPy

About this project

This project implements a complete machine learning pipeline for real-time American Sign Language (ASL) detection. It captures hand gesture images via webcam, extracts 21 hand landmarks using MediaPipe, applies data augmentation (rotations, flips, color jittering) to expand the dataset, and trains a Random Forest classifier on the landmark features. The inference module processes live webcam frames, overlays detected hand landmarks, and displays predicted ASL characters in real time with confidence thresholding. The system supports 36 classes covering the full ASL alphabet and digits 0-9.
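The landmark-to-feature step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes each detection yields 21 (x, y) pairs (as MediaPipe Hands provides) and offsets them by the per-hand minimum, a common translation-invariance normalization whose use here is an assumption.

```python
def landmarks_to_features(landmarks):
    """Flatten 21 (x, y) hand landmarks into a 42-element feature vector.

    `landmarks` is a list of 21 (x, y) pairs, e.g. MediaPipe's normalized
    image coordinates. Coordinates are offset by the per-hand minimum so
    the features are translation-invariant (a common choice; the project's
    exact normalization scheme is an assumption here).
    """
    if len(landmarks) != 21:          # incomplete detection: discard sample
        raise ValueError("expected exactly 21 landmarks")
    min_x = min(x for x, _ in landmarks)
    min_y = min(y for _, y in landmarks)
    features = []
    for x, y in landmarks:
        features.append(x - min_x)
        features.append(y - min_y)
    return features                   # len == 42, matching the classifier input
```

Discarding incomplete detections up front is what keeps every training sample at exactly 42 features.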

Case Study

Real-Time ASL Detection

A computer vision system that recognizes American Sign Language hand signs in real time using a webcam, MediaPipe hand tracking, and a trained Random Forest classifier. The system supports 36 classes covering the full ASL alphabet (A-Z) and digits (0-9), delivering predictions with confidence scoring at interactive frame rates.


Overview

Communicating through American Sign Language requires fluency that most people lack. This project explores whether a lightweight machine learning pipeline can bridge that gap by translating static ASL hand signs into text in real time using only a standard webcam — no specialized hardware required.


Key Features

  • Real-time hand landmark detection and visualization via MediaPipe
  • 36-class recognition covering A-Z letters and 0-9 digits
  • Confidence thresholding to suppress low-certainty predictions
  • End-to-end pipeline: data collection, augmentation, training, and inference
  • Runs on consumer hardware with a standard webcam

Technical Highlights

Data Collection Pipeline

  • Webcam-based image capture with per-class labeling
  • 100 base images per class across 36 classes
  • Organized directory structure for reproducible dataset creation

Data Augmentation

  • Rotation (90° CW, 90° CCW)
  • Horizontal and vertical flips
  • HSV color jittering with randomized brightness and saturation
  • Expands the dataset by 10x per original image

Feature Extraction

  • MediaPipe Hands for 21-landmark detection per frame
  • Each landmark produces (x, y) coordinates, yielding 42 features per sample
  • Static image mode for consistent landmark extraction during dataset creation

Model Training

  • Random Forest classifier via scikit-learn
  • 80/20 stratified train-test split
  • Serialized model output via pickle for inference reuse

Real-Time Inference

  • OpenCV video capture with continuous frame processing
  • MediaPipe hand landmark overlay on live video
  • Prediction confidence scoring with configurable threshold
  • Visual feedback: predicted character rendered on the video frame
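A NumPy-only sketch of the augmentation stage above, under stated assumptions: the project's HSV jittering would use OpenCV's color-space conversion, so plain brightness scaling stands in for it here, and the split of 4 geometric variants plus 5 jittered copies (9 new images per original, i.e. roughly the stated 10x including the original) is illustrative, not the project's exact recipe.

```python
import numpy as np

def augment(image, rng=None, n_jitter=5):
    """Return a list of augmented copies of `image` (H x W x 3, uint8).

    Mirrors the transforms listed above: 90-degree rotations, flips, and
    randomized brightness jitter (a stand-in for the project's HSV
    jittering, which would use OpenCV).
    """
    rng = rng or np.random.default_rng(0)
    out = [
        np.rot90(image, k=3),   # 90 degrees clockwise
        np.rot90(image, k=1),   # 90 degrees counter-clockwise
        image[:, ::-1],         # horizontal flip
        image[::-1, :],         # vertical flip
    ]
    for _ in range(n_jitter):   # randomized brightness in [0.7, 1.3]
        factor = rng.uniform(0.7, 1.3)
        jittered = np.clip(image.astype(np.float32) * factor, 0, 255)
        out.append(jittered.astype(np.uint8))
    return out                  # 4 + n_jitter variants per original image
```

Because features are extracted from landmarks rather than raw pixels, the augmented images still pass through MediaPipe before training, so each variant only helps if a hand is still detected in it.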

System Architecture

Component | Responsibility
collect_imgs.py | Captures labeled hand gesture images from webcam into per-class directories
data_augmentation.py | Applies geometric and color transformations to expand the training dataset
create_dataset.py | Extracts MediaPipe hand landmarks from images and serializes feature vectors
train_classifier.py | Trains a Random Forest classifier and evaluates accuracy on a held-out test set
inference_classifier.py | Runs real-time webcam inference with landmark visualization and character prediction
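The 80/20 stratified split used in the training stage can be made concrete with a small pure-Python sketch. scikit-learn's `train_test_split(..., stratify=labels)` does this directly; this version just spells out what "stratified" means for a 36-class dataset, and is an illustration rather than the project's code.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split sample indices into train/test while preserving per-class
    proportions, so every ASL class appears in both sets.

    Equivalent in spirit to scikit-learn's
    train_test_split(..., stratify=labels).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_frac))
        test_idx.extend(idxs[:n_test])      # 20% of each class held out
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx
```

Stratification matters here because with only ~100 images per class, a plain random split could leave a rare class badly underrepresented in the test set.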

Engineering Challenges

Landmark Consistency

  • MediaPipe does not always detect hands in every frame
  • Feature vectors must be exactly 42 elements; incomplete detections are discarded
  • Single-hand constraint to maintain consistent input dimensionality

Data Quality

  • Webcam lighting and background variation affect landmark stability
  • Augmentation strategy needed to balance diversity without introducing noise
  • Some ASL signs are visually similar, requiring sufficient training samples for disambiguation

Real-Time Performance

  • Frame-by-frame processing must stay within interactive latency
  • Confidence thresholding prevents flickering predictions on ambiguous frames
  • Balancing detection confidence settings against recall
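The anti-flicker confidence filtering described above reduces to a small helper. This is a sketch: the probability map would come from something like `RandomForestClassifier.predict_proba`, and the 0.6 threshold is an illustrative value, not the project's actual setting.

```python
def filter_prediction(class_probs, threshold=0.6):
    """Return the predicted label only when the top class probability
    clears `threshold`; otherwise return None so the overlay shows
    nothing instead of a flickering low-confidence guess.

    `class_probs` maps label -> probability (e.g. derived from
    RandomForestClassifier.predict_proba on a 42-feature vector).
    """
    label = max(class_probs, key=class_probs.get)
    return label if class_probs[label] >= threshold else None
```

Raising the threshold trades recall for stability: ambiguous frames (e.g. visually similar signs) produce no output rather than a rapidly changing wrong one.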

What I Learned

  • Building an end-to-end ML pipeline from data collection through deployment
  • Working with MediaPipe's hand tracking API and understanding landmark-based feature engineering
  • The importance of data augmentation in small-dataset scenarios
  • Practical tradeoffs between model complexity and real-time inference speed
  • Designing confidence-based filtering to improve user experience in live prediction systems