Stop Wrestling with Motion Capture! three-mediapipe-rig Is Insane
What if I told you that building a Hollywood-grade motion capture system no longer requires a $50,000 studio, a green screen, or a team of VFX artists hunched over keyframes at 3 AM? Here's the painful truth most Three.js developers discover the hard way: skeletal animation is brutal. You've got your beautiful character model loaded in glTF format, bones perfectly weighted in Blender, and then... nothing. The rig just stands there, lifeless, while you drown in quaternion math and inverse kinematics nightmares.
Sound familiar?
Every indie developer, creative technologist, and WebXR pioneer has hit this wall. You want your character to wave when the user waves. To smile when they smile. To throw a punch that actually looks like a punch—not a glitchy T-pose seizure. The traditional path? Export mocap data from expensive software, retarget it to your skeleton, fight with bone orientations, and pray your timeline doesn't explode.
Enter three-mediapipe-rig—a weapon so elegantly brutal it feels like cheating. This open-source Three.js module by bandinopla fuses Google's MediaPipe computer vision with your 3D skeletons, transforming any webcam into a real-time motion capture device. Full body. Both hands. Facial expressions. All running in the browser. With minimal code.
The secret sauce? It's angle-based, meaning your skeleton can be any size, any proportion, and it just works. No retargeting hell. No scaling nightmares. Just raw, tracked motion flowing directly into your rig.
Ready to never write another manual keyframe? Let's dissect how this changes everything.
What Is three-mediapipe-rig?
three-mediapipe-rig is a specialized Three.js module created by developer bandinopla that bridges two powerhouse technologies: Google MediaPipe (the industry-standard on-device machine learning framework for perception tasks) and Three.js (the ubiquitous WebGL library powering immersive 3D experiences).
At its core, this module solves one deceptively complex problem: how do you map 2D/3D landmark data from a webcam feed onto a hierarchical bone structure in real-time?
The answer, traditionally, involves solving inverse kinematics equations, handling coordinate system conversions, managing bone roll conventions, and interpolating between noisy detection frames. Bandinopla's module abstracts all of this into a clean, promise-based API that handles the heavy lifting while giving you surgical control when you need it.
Why it's trending now:
The convergence of three forces makes this module explosively relevant in 2024-2025:
- MediaPipe's maturation — Google's vision models now run at usable framerates in browsers via WebAssembly, with robust hand, pose, and face mesh detection
- Three.js ecosystem growth — With WebGPU support emerging and the Node Material system maturing, browser-based 3D is becoming production-viable for experiences previously locked to native apps
- The "VTuber" and virtual production boom — Independent creators desperately need affordable mocap pipelines, and this delivers exactly that
The module runs three concurrent ML models (face, body, hands), so yes, expect some FPS trade-offs. But the results? Nothing short of magical for the zero-dollar price tag.
Key Features That Destroy the Competition
Let's break down what makes this module a technical knockout:
Full-Body Pose Tracking
Captures shoulders, arms, hips, legs, and head with landmark-based skeletal inference. The angle-based approach means your 8-foot ogre rig and your chibi anime character both receive proportionally correct motion without manual retargeting.
Individual Finger Tracking
Both hands, all fingers, three joints per finger. This isn't crude "hand open/closed" detection—this is per-bone rotation data flowing to index1L, middle2R, thumb3L, and every other phalange. Perfect for sign language apps, musical instrument simulations, or intricate gesture interfaces.
Facial Blendshape Animation
MediaPipe's Face Landmarker outputs 52 blendshape coefficients (brow raises, jaw opens, eye blinks, mouth shapes), each a score between 0 and 1. The module maps these directly to morph targets on a mesh named face. Your character actually emotes in sync with the user.
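To make the blendshape-to-morph-target idea concrete, here is a minimal sketch of how such a mapping can work in general. It is not the module's actual implementation. The two interfaces are structural stand-ins for MediaPipe's category objects (`categoryName`, `score`) and a Three.js mesh (`morphTargetDictionary`, `morphTargetInfluences`), and `applyBlendshapes` is a hypothetical helper name.

```typescript
// Structural stand-ins so the sketch runs without importing three or MediaPipe.
interface BlendshapeCategory {
  categoryName: string; // e.g. "jawOpen", "eyeBlinkLeft"
  score: number;        // expected range 0..1
}

interface MorphTargetMesh {
  morphTargetDictionary?: Record<string, number>; // morph name -> influence index
  morphTargetInfluences?: number[];
}

// Hypothetical helper: copies each score onto the morph target with the same name.
// Returns how many blendshapes found a matching morph target.
function applyBlendshapes(
  mesh: MorphTargetMesh,
  categories: BlendshapeCategory[],
): number {
  const dict = mesh.morphTargetDictionary;
  const influences = mesh.morphTargetInfluences;
  if (!dict || !influences) return 0;

  let applied = 0;
  for (const { categoryName, score } of categories) {
    const index = dict[categoryName];
    if (index === undefined) continue; // mesh has no morph target with this name
    influences[index] = Math.min(1, Math.max(0, score)); // clamp to 0..1
    applied++;
  }
  return applied;
}
```

The practical takeaway is that morph target names on your face mesh need to match the blendshape names being fed in, otherwise those expressions are simply skipped.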
Automatic Bone Binding with Custom Mapping
Drop in the provided rig.blend skeleton, or bring your own and supply a BoneMap. The module intelligently connects MediaPipe landmarks to your hierarchy. The default naming convention covers standard rigs, but complete customization is one object literal away.
Dual Input Modes
Live webcam for real-time interaction, or pre-recorded video/image files for development, debugging, and batch processing. The debugVideo and debugFrame options are lifesavers when you don't want to perform for your own camera 200 times.
Built-in Recording Pipeline
Capture motion directly to .glb format containing animation clips. While currently bundling mesh data (workaroundable with gltf-transform), this enables rapid prototyping of reusable animation libraries.
Debug Visualization
Overlay landmark detection on your input feed with configurable displayScale. Peek under the hood with direct access to tracker .root objects for scene visualization.
Real-World Use Cases Where This Dominates
1. Virtual YouTubers & Stream Avatars
The VTuber economy is exploding, but entry costs are prohibitive. With three-mediapipe-rig, a webcam + browser + Three.js scene = professional puppeteering. Face tracking drives expressions; hand tracking enables gestures; body tracking handles posture. All without OBS plugins, iPhone TrueDepth sensors, or expensive tracking suits.
2. Interactive Web Experiences & Campaigns
Marketing agencies pay fortunes for "gesture-controlled" experiences that are usually just hacked Kinect demos. Now? A single developer deploys a full-body interactive character to a CDN. Users wave to navigate, grab to select, smile to confirm. The engagement metrics speak for themselves.
3. Accessibility & Assistive Technology
Motion-based interfaces for users with limited mobility. Track subtle finger movements for switch alternatives. Use facial expressions as control inputs for users who can't operate traditional peripherals. The browser-based deployment means zero installation friction.
4. Rapid Animation Prototyping
Game developers and animators can perform rough passes directly into Three.js, record the .glb, then refine in Blender or Maya. No mocap suit rental, no studio booking. Iterate on animation timing in hours, not days.
5. Educational & Training Simulations
Medical training for hand positioning. Language learning with lip-reading feedback. Physical therapy with form correction. The multi-modal tracking (body + hands + face) enables richer pedagogical interactions than single-modality solutions.
Step-by-Step Installation & Setup Guide
Let's get you from zero to tracked skeleton in under 10 minutes.
Prerequisites
- Node.js 18+ with npm/yarn/pnpm
- A Three.js project initialized (Vite, Next.js, or vanilla)
- A webcam (or video file for testing)
Installation Commands
# Install the core module
npm install three-mediapipe-rig
# Ensure peer dependencies are present
npm install three@^0.182.0 @mediapipe/tasks-vision@^0.10.32
⚠️ Critical: The peer dependency versions are strict. Three.js ^0.182.0 and MediaPipe ^0.10.32 must be satisfied or you'll encounter WASM loading failures and matrix math incompatibilities.
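One detail worth knowing: for 0.x versions, a caret range only allows changes to the right of the first non-zero component, so ^0.182.0 accepts 0.182.x but not 0.183.0. The sketch below illustrates that rule with a hypothetical `satisfiesCaret` helper. It handles plain major.minor.patch strings only and is not a replacement for the semver package.

```typescript
type Version = [number, number, number];

// Parse "major.minor.patch"; pre-release tags are not handled in this sketch.
function parseVersion(text: string): Version {
  const parts = text.split(".").map((p) => Number.parseInt(p, 10));
  if (parts.length !== 3 || parts.some((n) => Number.isNaN(n))) {
    throw new Error(`Unsupported version string: ${text}`);
  }
  return [parts[0], parts[1], parts[2]];
}

function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Caret ranges allow changes that do not modify the left-most non-zero component.
function satisfiesCaret(installed: string, range: string): boolean {
  const base = parseVersion(range.replace(/^\^/, ""));
  const got = parseVersion(installed);

  let upper: Version;
  if (base[0] > 0) upper = [base[0] + 1, 0, 0];
  else if (base[1] > 0) upper = [0, base[1] + 1, 0];
  else upper = [0, 0, base[2] + 1];

  return compareVersions(got, base) >= 0 && compareVersions(got, upper) < 0;
}
```

For example, `satisfiesCaret("0.183.0", "^0.182.0")` is false, which is why a routine Three.js upgrade can silently fall outside the supported range.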
Project Structure Setup
my-mocap-project/
├── public/
│   └── models/
│       └── your-character.glb   # Your rigged character
├── src/
│   ├── main.ts                  # Entry point
│   └── rig-config.ts            # Bone mapping (if custom)
└── package.json
Basic Scene Initialization
import * as THREE from 'three';
import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';
import { setupTracker } from 'three-mediapipe-rig';
// Standard Three.js boilerplate
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 0.1, 1000);
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);
const clock = new THREE.Clock();
Loading Your Character
const loader = new GLTFLoader();
const gltf = await loader.loadAsync('/models/your-character.glb');
const rig = gltf.scene.getObjectByName('rig')!; // Must match your armature root name
scene.add(gltf.scene);
🔑 Pro tip: The provided rig.blend in the repository shows exact bone roll conventions. Mismatched roll causes elbows to bend backwards and knees to invert. Either use this skeleton as reference or match its orientations precisely.
Environment Configuration for Self-Hosting (Optional)
For production deployments, self-host MediaPipe models to avoid CDN dependency:
const tracker = await setupTracker({
  modelPaths: {
    vision: "/models/wasm",
    pose: "/models/pose_landmarker_lite.task",
    hand: "/models/hand_landmarker.task",
    face: "/models/face_landmarker.task",
  },
});
Download model files from Google's MediaPipe model repository and serve via your CDN or static host.
REAL Code Examples from the Repository
These examples are adapted directly from the official README with enhanced commentary showing production patterns.
Example 1: Minimal Viable Tracking Setup
This is the canonical quick-start—the fewest lines to get from zero to motion:
import { setupTracker } from 'three-mediapipe-rig';
// STEP 1: Initialize MediaPipe vision models
// This is ASYNC because it downloads and compiles WASM + ML models
const tracker = await setupTracker({
  ignoreLegs: false, // Set true for seated/upper-body-only characters
  displayScale: 0.2, // Debug overlay at 20% size in corner
});
// STEP 2: Find your rig in the Three.js scene hierarchy
// The '!' is TypeScript non-null assertion—ensure this exists!
const rig = scene.getObjectByName('rig')!;
// STEP 3: Bind the skeleton to the tracker
// This creates the landmark-to-bone mapping and returns an update handler
const binding = tracker.bind(rig);
// STEP 4: User gesture REQUIRED for webcam privacy reasons
const startButton = document.getElementById('start-mocap')!;
startButton.addEventListener('click', async () => {
  // Prompts browser for camera permission, begins real-time inference
  await tracker.start();
});
// STEP 5: Animation loop—this is where the magic happens every frame
renderer.setAnimationLoop((time) => {
  const delta = clock.getDelta(); // Seconds since last frame
  // Apply tracked motion to bones with smooth interpolation
  binding?.update(delta);
  renderer.render(scene, camera);
});
What's happening under the hood? setupTracker() instantiates three MediaPipe task runners: PoseLandmarker, HandLandmarker, and FaceLandmarker. Each processes the video stream independently. The bind() method creates a BindingHandler that maintains internal state for bone rotations, applying exponential smoothing based on delta to prevent jitter. The angle-based approach computes relative joint angles from landmark positions, then maps these to your skeleton's local bone spaces—completely decoupled from absolute scale.
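The scale-independence claim is easy to verify with a small, self-contained sketch. It illustrates the general idea of deriving a joint angle from three landmark positions and is not taken from the module's source. `jointAngle` and the sample coordinates are hypothetical.

```typescript
type Vec3 = { x: number; y: number; z: number };

const subVec = (a: Vec3, b: Vec3): Vec3 => ({ x: a.x - b.x, y: a.y - b.y, z: a.z - b.z });
const dotVec = (a: Vec3, b: Vec3): number => a.x * b.x + a.y * b.y + a.z * b.z;
const normVec = (a: Vec3): number => Math.sqrt(dotVec(a, a));
const scaleVec = (a: Vec3, s: number): Vec3 => ({ x: a.x * s, y: a.y * s, z: a.z * s });

// Angle at `joint` between the segments joint->a and joint->b, in radians.
function jointAngle(a: Vec3, joint: Vec3, b: Vec3): number {
  const u = subVec(a, joint);
  const v = subVec(b, joint);
  const denom = normVec(u) * normVec(v);
  if (denom === 0) return 0; // degenerate: two landmarks coincide
  const cosine = Math.min(1, Math.max(-1, dotVec(u, v) / denom)); // guard against rounding
  return Math.acos(cosine);
}

// Hypothetical landmarks: an arm bent at a right angle.
const shoulderPoint: Vec3 = { x: 0, y: 1, z: 0 };
const elbowPoint: Vec3 = { x: 0, y: 0, z: 0 };
const wristPoint: Vec3 = { x: 1, y: 0, z: 0 };

const angleSmallBody = jointAngle(shoulderPoint, elbowPoint, wristPoint);
const angleLargeBody = jointAngle(
  scaleVec(shoulderPoint, 3),
  scaleVec(elbowPoint, 3),
  scaleVec(wristPoint, 3),
);
```

Both angles come out the same even though the second "body" is three times larger, which is why an angle-driven rig does not care about limb length.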
Example 2: Custom Bone Mapping for Existing Rigs
Got a character from Mixamo, Character Creator, or manual rigging? No problem. The BoneMap interface lets you remap without renaming bones in Blender:
import type { BoneMap } from 'three-mediapipe-rig';
// Map the module's expected bone names to YOUR rig's actual names
const mixamoBoneMap: BoneMap = {
  faceMesh: "Face", // Facial blendshape mesh (must start with "face" by default)
  head: "mixamorigHead", // Cranium bone
  hips: "mixamorigHips", // Root of skeleton hierarchy
  neck: "mixamorigNeck", // Neck/spine connector
  torso: "mixamorigSpine", // Upper body
  armL: "mixamorigLeftArm", // Upper arm
  forearmL: "mixamorigLeftForeArm",
  armR: "mixamorigRightArm",
  forearmR: "mixamorigRightForeArm",
  thighL: "mixamorigLeftUpLeg",
  shinL: "mixamorigLeftLeg",
  footL: "mixamorigLeftFoot",
  thighR: "mixamorigRightUpLeg",
  shinR: "mixamorigRightLeg",
  footR: "mixamorigRightFoot",
  handL: "mixamorigLeftHand",
  handR: "mixamorigRightHand",
  // Finger chains follow same pattern...
  index1L: "mixamorigLeftHandIndex1",
  index2L: "mixamorigLeftHandIndex2",
  index3L: "mixamorigLeftHandIndex3",
  // ... continue for middle, ring, pinky, thumb (L/R)
};
// Apply custom mapping during bind
const binding = tracker.bind(rig, mixamoBoneMap);
Critical insight: The finger naming convention index1L through index3L represents proximal, intermediate, and distal phalanges respectively. MediaPipe outputs 21 hand landmarks; the module internally computes angles between these and drives your bone chain. Missing fingers in your map? They're silently skipped—no crashes.
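Because unmapped fingers are skipped silently, a typo in the map can go unnoticed. The sketch below shows one way to generate the 30 finger entries and to check a map against the bone names actually present in a rig. Both helpers are hypothetical and not part of the module. The ring and pinky keys are assumed by extending the index1L pattern, and the plain `Record<string, string>` type stands in for the module's BoneMap.

```typescript
// Stand-in for the module's BoneMap type: module key -> bone name in your rig.
type BoneMapLike = Record<string, string>;

const FINGER_NAMES = ["thumb", "index", "middle", "ring", "pinky"] as const;
const SIDE_NAMES = { L: "Left", R: "Right" } as const;

// Hypothetical helper: builds e.g. index1L -> "mixamorigLeftHandIndex1".
function mixamoFingerEntries(prefix = "mixamorig"): BoneMapLike {
  const entries: BoneMapLike = {};
  for (const finger of FINGER_NAMES) {
    const pascal = finger[0].toUpperCase() + finger.slice(1);
    for (const [suffix, sideName] of Object.entries(SIDE_NAMES)) {
      for (let joint = 1; joint <= 3; joint++) {
        entries[`${finger}${joint}${suffix}`] = `${prefix}${sideName}Hand${pascal}${joint}`;
      }
    }
  }
  return entries;
}

// Hypothetical helper: returns the map keys whose target bone is absent from the rig.
function findMissingBones(map: BoneMapLike, rigBoneNames: Iterable<string>): string[] {
  const known = new Set(rigBoneNames);
  return Object.entries(map)
    .filter(([, target]) => !known.has(target))
    .map(([key]) => key);
}
```

In a real project you could collect `rigBoneNames` by traversing the loaded rig and reading each object's `name`, then log the result of `findMissingBones` before binding.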
Example 3: Facial Geometry Binding (The Secret Weapon)
This is the advanced feature that separates hobby projects from polished productions. MediaPipe's face mesh doesn't just give blendshapes—it provides a canonical 468-vertex facial mesh that deforms to match the user's face:
// Load the canonical face model (from MediaPipe's official distribution)
// This EXACT vertex count and UV layout is required
// NOTE: loadFaceMeshModel is not defined in this snippet; treat it as a placeholder for your own loader
const faceGeometry = await loadFaceMeshModel('/models/canonical_face_model.obj');
// Bind to tracker's face mesh output
const face = tracker.faceTracker.bindGeometry(faceGeometry);
// In your render loop:
renderer.setAnimationLoop((time) => {
  const delta = clock.getDelta();
  // Updates vertex positions to match detected face shape
  // AND applies webcam video as texture projection
  face.update(delta);
  // The result: your 3D face mesh literally becomes the user's face,
  // deforming with expressions and textured with live video
  binding?.update(delta);
  renderer.render(scene, camera);
});
The technical marvel here: bindGeometry() creates a Three.js NodeMaterial with a custom positionNode that displaces vertices based on MediaPipe's facial mesh solution. Simultaneously, it samples the video feed as a texture. The result is a geometry-accurate, texture-mapped facial puppet that looks like the user because it is the user—mathematically speaking.
Use cases: virtual makeup try-on, anonymized avatar systems, real-time face replacement in WebRTC calls.
Example 4: Multi-Character & Recording Pipeline
// Bind MULTIPLE rigs to same tracker—perfect for mirrored duets or crowd scenes
const heroBinding = tracker.bind(heroRig);
const villainBinding = tracker.bind(villainRig, villainBoneMap);
// Start recording hero's motion for later playback
heroBinding.startRecording();
renderer.setAnimationLoop((time) => {
  const delta = clock.getDelta();
  heroBinding.update(delta);
  villainBinding.update(delta); // Same motion, different proportions!
  renderer.render(scene, camera);
});
// After performance...
const recording = heroBinding.stopRecording();
// Save complete GLB (rig + animation + textures)
recording.saveToFile('hero-performance.glb');
// Or extract just the AnimationClip for programmatic use
const clip = recording.clip;
const mixer = new THREE.AnimationMixer(heroRig);
mixer.clipAction(clip).play();
// Playback only advances if you call mixer.update(delta) each frame in the animation loop
Production note: The current .glb export includes mesh data. For animation-only files, pipe through gltf-transform:
npx @gltf-transform/cli@latest optimize hero-performance.glb animation-only.glb --prune --texture-compress webp
Advanced Usage & Best Practices
Performance Optimization
- Use ignoreLegs: true for seated experiences (cuts model load by 30%)
- Set ignoreFace: true if only body control needed
- Lower displayScale to 0.1 or disable debug overlay entirely in production
- Consider pose_landmarker_lite.task vs pose_landmarker_full.task for speed/accuracy trade-offs
Smoothing & Latency
The delta parameter in binding.update(delta) controls interpolation. For snappier response (gaming), pass raw delta. For broadcast-quality smoothness (VTubing), consider capping or scaling delta:
const SMOOTHING_FACTOR = 0.5;
binding.update(Math.min(delta, 0.05) * SMOOTHING_FACTOR);
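For intuition about why delta matters, here is a sketch of frame-rate independent exponential smoothing in general. How the module smooths internally is not shown in this article, so treat this as an illustration of the technique rather than its implementation. `smoothingAlpha`, `smoothToward`, and the rate value are hypothetical.

```typescript
// Fraction of the remaining distance to cover this frame.
// Using 1 - exp(-rate * delta) makes the result independent of frame rate.
function smoothingAlpha(ratePerSecond: number, deltaSeconds: number): number {
  return 1 - Math.exp(-ratePerSecond * deltaSeconds);
}

// Move `current` toward `target` by the frame's smoothing fraction.
function smoothToward(
  current: number,
  target: number,
  ratePerSecond: number,
  deltaSeconds: number,
): number {
  return current + (target - current) * smoothingAlpha(ratePerSecond, deltaSeconds);
}

// Simulate easing from 0 toward 1 at a fixed frame rate for a given duration.
function simulateSmoothing(fps: number, seconds: number, ratePerSecond: number): number {
  let value = 0;
  const steps = Math.round(fps * seconds);
  for (let i = 0; i < steps; i++) {
    value = smoothToward(value, 1, ratePerSecond, 1 / fps);
  }
  return value;
}
```

Running `simulateSmoothing(30, 1, 10)` and `simulateSmoothing(60, 1, 10)` lands on the same value, so a viewer at 30 FPS and one at 60 FPS see the same easing. With this kind of smoothing, scaling delta down has the same effect as lowering the rate: motion gets smoother but lags further behind the performer.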
Handling Permission Denials
Always wrap tracker.start() in try-catch with user-facing fallback:
try {
  await tracker.start();
} catch (err) {
  // Fallback to debugVideo or instruct user
  console.warn('Webcam blocked, falling back to demo video');
  const fallbackTracker = await setupTracker({ debugVideo: '/demo.mp4' });
  // Bindings belong to a tracker, so bind your rig to fallbackTracker before continuing
}
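A fallback is more useful when the user is told why the camera failed. The sketch below classifies the standard getUserMedia error names. It assumes tracker.start() rejects with the underlying browser error, which this article does not confirm, and `classifyCameraError` is a hypothetical helper.

```typescript
type CameraFailure = "denied" | "no-camera" | "in-use" | "unknown";

// Maps standard getUserMedia DOMException names to a user-facing category.
// ASSUMPTION: the error thrown by tracker.start() keeps the browser's `name`.
function classifyCameraError(err: unknown): CameraFailure {
  const errorName =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";

  switch (errorName) {
    case "NotAllowedError":
      return "denied"; // user or policy blocked camera access
    case "NotFoundError":
      return "no-camera"; // no matching video input device
    case "NotReadableError":
      return "in-use"; // device exists but could not be read, often busy
    default:
      return "unknown";
  }
}
```

Inside the catch block you could switch on the result to show "allow camera access in your browser settings" for denied, or go straight to the demo video for no-camera.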
Bone Roll Sanity Check
If elbows/knees bend backwards, your bone rolls are inverted. In Blender, select the bone, go to Edit Mode → Bone → Roll, and match the reference rig.blend exactly. This is the #1 support issue.
Comparison with Alternatives
| Feature | three-mediapipe-rig | Manual MediaPipe + Three.js | Kalidokit | Unity + MediaPipe Plugin |
|---|---|---|---|---|
| Setup Complexity | Minimal (npm install + bind) | High (custom IK solvers) | Medium (VRM-specific) | High (C# interop, builds) |
| Skeleton Flexibility | Any proportion, angle-based | Requires retargeting | VRM-only | Humanoid rig required |
| Face Tracking Depth | Blendshapes + geometry | Manual implementation | Basic expressions | Plugin-dependent |
| Finger Tracking | Full per-bone | Often omitted | Limited | Available |
| Deployment Target | Browser (any device) | Browser | Browser | Native apps only |
| Recording Built-in | Yes (.glb) | No | No | Editor-dependent |
| Debug Tools | Video overlay, frame test | Manual | Limited | Full editor |
| License | MIT | N/A | MIT | Various |
Verdict: For web-native Three.js projects requiring rapid deployment without engine lock-in, three-mediapipe-rig is unmatched. Kalidokit serves VRM/VTuber pipelines well but lacks flexibility. Unity solutions require abandoning the web ecosystem entirely.
FAQ: Your Burning Questions Answered
Does three-mediapipe-rig work on mobile browsers?
Yes, with caveats. MediaPipe's WASM models run on mobile GPUs, but expect 15-25 FPS versus 30+ on desktop. Test on target devices; consider ignoreLegs: true for performance.
Can I use this with React Three Fiber?
Absolutely. Import in a useEffect, store tracker in ref, call binding.update() in useFrame. The imperative API plays nicely with R3F's declarative model.
How accurate is the finger tracking?
Surprisingly robust for webcam-based ML. Occlusions (self-covering fingers) cause dropouts, but MediaPipe's temporal smoothing recovers quickly. For piano/guitar apps, it's production-ready.
Is there a way to reduce the FPS drop from three models?
Yes—disable unused trackers via config. The face mesh is heaviest; if you only need body, set ignoreFace: true. Future MediaPipe versions promise unified models that may eliminate this overhead.
Can I stream this data to a server or other clients?
The module doesn't include networking, but you can serialize bone rotations from binding and transmit via WebRTC DataChannels or WebSockets. For lower bandwidth, send MediaPipe landmarks and compute skeleton server-side.
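As a rough sketch of the serialization step, bone rotations can be packed into a flat Float32Array, which both WebSockets and WebRTC DataChannels can send as binary. This assumes you read each bone's quaternion (x, y, z, w) from your rig yourself, and that sender and receiver agree on a fixed bone order. `packRotations` and `unpackRotations` are hypothetical helpers, not part of the module.

```typescript
type Quat = [number, number, number, number]; // x, y, z, w

// Pack rotations in a fixed bone order: 4 floats (16 bytes) per bone.
// Bones missing from `rotations` are sent as the identity rotation.
function packRotations(
  boneOrder: readonly string[],
  rotations: Record<string, Quat>,
): Float32Array {
  const out = new Float32Array(boneOrder.length * 4);
  boneOrder.forEach((bone, i) => {
    const q = rotations[bone] ?? [0, 0, 0, 1];
    out.set(q, i * 4);
  });
  return out;
}

// Inverse of packRotations; both sides must use the same boneOrder.
function unpackRotations(
  boneOrder: readonly string[],
  data: Float32Array,
): Record<string, Quat> {
  if (data.length !== boneOrder.length * 4) {
    throw new Error("Unexpected payload length");
  }
  const result: Record<string, Quat> = {};
  boneOrder.forEach((bone, i) => {
    result[bone] = [data[i * 4], data[i * 4 + 1], data[i * 4 + 2], data[i * 4 + 3]];
  });
  return result;
}
```

At 16 bytes per bone, a 50-bone rig at 30 updates per second is about 24 KB/s before any compression, which is comfortable for either transport.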
What about multiple simultaneous users?
Each user needs their own setupTracker() instance with dedicated video element. Performance scales linearly—two users ≈ half framerate. For crowds, consider server-side inference with WebRTC ingestion.
Is the recorded .glb compatible with Blender?
Yes! Import directly. The animation clip uses standard quaternion curves. Use gltf-transform to strip mesh data if you want animation-only files for mixing.
Conclusion: The Future of Web Motion Capture Is Here
Let's be brutally honest: three-mediapipe-rig shouldn't exist at this quality level for free. It collapses a pipeline that previously demanded specialized hardware, proprietary software, and weeks of integration into a single npm install and four lines of binding code.
The angle-based architecture is genuinely clever—solving the scale-independence problem that plagues every other mocap retargeting workflow. The facial geometry binding pushes into territory previously reserved for research demos. And the recording pipeline, while imperfect in export size, enables rapid iteration that transforms creative workflows.
Is it perfect? No. The FPS hit from three concurrent ML models is real. The .glb export bloat needs addressing. But for prototyping, indie production, and web-native experiences, this module punches so far above its weight it feels illegal.
My recommendation? Stop reading. Start building. Fork the GitHub repository, run the demos, point your webcam at your rigged character, and feel that first moment of magic when your digital puppet mirrors your wave. That's the moment you'll understand why this changes everything.
⭐ Star the repo. Build something weird. Share what you make.
The motion capture democratization is here. Don't get left behind.
Ready to dive deeper? Explore the PoseCap Editor and MeshCap Editor for advanced recording workflows, or check the live character demos to see what's possible.