
Real-Time ASL Detection


A real-time American Sign Language detection system using MediaPipe hand tracking and a Random Forest classifier to recognize ASL letters and digits via webcam.

Started August 2024 · Completed May 2025 · 0 stars · Last updated Feb 20, 2026

Tech Stack

Python · OpenCV · MediaPipe · scikit-learn · NumPy

About this project

This project implements a complete machine learning pipeline for real-time American Sign Language (ASL) detection. It captures hand gesture images via webcam, extracts 21 hand landmarks using MediaPipe, applies data augmentation (rotations, flips, color jittering) to expand the dataset, and trains a Random Forest classifier on the landmark features. The inference module processes live webcam frames, overlays detected hand landmarks, and displays predicted ASL characters in real time with confidence thresholding. The system supports 36 classes covering the full ASL alphabet and digits 0-9.
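The landmark-to-feature step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes each detection yields 21 (x, y) pairs (as MediaPipe Hands provides) and offsets them by the per-hand minimum, a common translation-invariance normalization whose use here is an assumption.

```python
def landmarks_to_features(landmarks):
    """Flatten 21 (x, y) hand landmarks into a 42-element feature vector.

    `landmarks` is a list of 21 (x, y) pairs, e.g. MediaPipe's normalized
    image coordinates. Coordinates are offset by the per-hand minimum so
    the features are translation-invariant (a common choice; the project's
    exact normalization scheme is an assumption here).
    """
    if len(landmarks) != 21:          # incomplete detection: discard sample
        raise ValueError("expected exactly 21 landmarks")
    min_x = min(x for x, _ in landmarks)
    min_y = min(y for _, y in landmarks)
    features = []
    for x, y in landmarks:
        features.append(x - min_x)
        features.append(y - min_y)
    return features                   # len == 42, matching the classifier input
```

Discarding incomplete detections up front is what keeps every training sample at exactly 42 features.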

Case Study

Real-Time ASL Detection

A computer vision system that recognizes American Sign Language hand signs in real time using a webcam, MediaPipe hand tracking, and a trained Random Forest classifier. The system supports 36 classes covering the full ASL alphabet (A-Z) and digits (0-9), delivering predictions with confidence scoring at interactive frame rates.


Overview

Communicating through American Sign Language requires fluency that most people lack. This project explores whether a lightweight machine learning pipeline can bridge that gap by translating static ASL hand signs into text in real time using only a standard webcam — no specialized hardware required.


Key Features

  • Real-time hand landmark detection and visualization via MediaPipe
  • 36-class recognition covering A-Z letters and 0-9 digits
  • Confidence thresholding to suppress low-certainty predictions
  • End-to-end pipeline: data collection, augmentation, training, and inference
  • Runs on consumer hardware with a standard webcam

Technical Highlights

Data Collection Pipeline

  • Webcam-based image capture with per-class labeling
  • 100 base images per class across 36 classes
  • Organized directory structure for reproducible dataset creation

Data Augmentation

  • Rotation (90° CW, 90° CCW)
  • Horizontal and vertical flips
  • HSV color jittering with randomized brightness and saturation
  • Expands the dataset by 10x per original image

Feature Extraction

  • MediaPipe Hands for 21-landmark detection per frame
  • Each landmark produces (x, y) coordinates, yielding 42 features per sample
  • Static image mode for consistent landmark extraction during dataset creation

Model Training

  • Random Forest classifier via scikit-learn
  • 80/20 stratified train-test split
  • Serialized model output via pickle for inference reuse

Real-Time Inference

  • OpenCV video capture with continuous frame processing
  • MediaPipe hand landmark overlay on live video
  • Prediction confidence scoring with configurable threshold
  • Visual feedback: predicted character rendered on the video frame
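A NumPy-only sketch of the augmentation stage above, under stated assumptions: the project's HSV jittering would use OpenCV's color-space conversion, so plain brightness scaling stands in for it here, and the split of 4 geometric variants plus 5 jittered copies (9 new images per original, i.e. roughly the stated 10x including the original) is illustrative, not the project's exact recipe.

```python
import numpy as np

def augment(image, rng=None, n_jitter=5):
    """Return a list of augmented copies of `image` (H x W x 3, uint8).

    Mirrors the transforms listed above: 90-degree rotations, flips, and
    randomized brightness jitter (a stand-in for the project's HSV
    jittering, which would use OpenCV).
    """
    rng = rng or np.random.default_rng(0)
    out = [
        np.rot90(image, k=3),   # 90 degrees clockwise
        np.rot90(image, k=1),   # 90 degrees counter-clockwise
        image[:, ::-1],         # horizontal flip
        image[::-1, :],         # vertical flip
    ]
    for _ in range(n_jitter):   # randomized brightness in [0.7, 1.3]
        factor = rng.uniform(0.7, 1.3)
        jittered = np.clip(image.astype(np.float32) * factor, 0, 255)
        out.append(jittered.astype(np.uint8))
    return out                  # 4 + n_jitter variants per original image
```

Because features are extracted from landmarks rather than raw pixels, the augmented images still pass through MediaPipe before training, so each variant only helps if a hand is still detected in it.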

System Architecture

Component | Responsibility
collect_imgs.py | Captures labeled hand gesture images from webcam into per-class directories
data_augmentation.py | Applies geometric and color transformations to expand the training dataset
create_dataset.py | Extracts MediaPipe hand landmarks from images and serializes feature vectors
train_classifier.py | Trains a Random Forest classifier and evaluates accuracy on a held-out test set
inference_classifier.py | Runs real-time webcam inference with landmark visualization and character prediction
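The 80/20 stratified split used in the training stage can be made concrete with a small pure-Python sketch. scikit-learn's `train_test_split(..., stratify=labels)` does this directly; this version just spells out what "stratified" means for a 36-class dataset, and is an illustration rather than the project's code.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split sample indices into train/test while preserving per-class
    proportions, so every ASL class appears in both sets.

    Equivalent in spirit to scikit-learn's
    train_test_split(..., stratify=labels).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_frac))
        test_idx.extend(idxs[:n_test])      # 20% of each class held out
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx
```

Stratification matters here because with only ~100 images per class, a plain random split could leave a rare class badly underrepresented in the test set.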

Engineering Challenges

Landmark Consistency

  • MediaPipe does not always detect hands in every frame
  • Feature vectors must be exactly 42 elements; incomplete detections are discarded
  • Single-hand constraint to maintain consistent input dimensionality

Data Quality

  • Webcam lighting and background variation affect landmark stability
  • Augmentation strategy needed to balance diversity without introducing noise
  • Some ASL signs are visually similar, requiring sufficient training samples for disambiguation

Real-Time Performance

  • Frame-by-frame processing must stay within interactive latency
  • Confidence thresholding prevents flickering predictions on ambiguous frames
  • Balancing detection confidence settings against recall
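The anti-flicker confidence filtering described above reduces to a small helper. This is a sketch: the probability map would come from something like `RandomForestClassifier.predict_proba`, and the 0.6 threshold is an illustrative value, not the project's actual setting.

```python
def filter_prediction(class_probs, threshold=0.6):
    """Return the predicted label only when the top class probability
    clears `threshold`; otherwise return None so the overlay shows
    nothing instead of a flickering low-confidence guess.

    `class_probs` maps label -> probability (e.g. derived from
    RandomForestClassifier.predict_proba on a 42-feature vector).
    """
    label = max(class_probs, key=class_probs.get)
    return label if class_probs[label] >= threshold else None
```

Raising the threshold trades recall for stability: ambiguous frames (e.g. visually similar signs) produce no output rather than a rapidly changing wrong one.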

What I Learned

  • Building an end-to-end ML pipeline from data collection through deployment
  • Working with MediaPipe's hand tracking API and understanding landmark-based feature engineering
  • The importance of data augmentation in small-dataset scenarios
  • Practical tradeoffs between model complexity and real-time inference speed
  • Designing confidence-based filtering to improve user experience in live prediction systems