A DIY motion capture suit with facial tracking for livestreaming a 3D character (to zero viewers)
UPDATE: I have taken this project much further.
Deep in COVID lockdown (December 2020), I stumbled upon a trending clip of a fascinating livestreamer named “CodeMiko”:
I could not believe what I was seeing. Real-time motion capture with face-tracking and live audience interactivity? I had no idea this was possible. I needed to know how she was doing it. And more importantly, I wanted my own.
I began watching every CodeMiko stream I could. This quickly became an obsession - CodeMiko was creating the most innovative and hilarious content I had seen in years. After the chaotic streams were over, CodeMiko’s creator “Technician” would talk about her struggles with the CodeMiko development process. This provided key details on how she was creating this content:
That is a lot of work and equipment, but I can see how it might be achievable on a budget. I’ve released a mediocre VR game built with UE4, so that engine is a possibility for the 3D character and environment - but I’m also familiar with Blender and 3dsmax as alternatives. An iPhone with FaceID is easy enough to get. Crypto is booming so 3090s are hard to get, but I’ve overclocked my GTX 1080 and I’m confident I can squeeze enough performance out of that to make it work. But there is one very big problem:
I’m not willing to spend $20,000 USD on a motion capture suit for a fun side-project.
This leaves me no choice: I’ll have to make one.
A quick search reveals that many vTubers achieve full-body motion-tracking by strapping Vive trackers (at least 3, up to 7) onto their bodies and using Vive Lighthouses for outside-in positional tracking. This certainly works (and I already have a Vive system with a tracker) but the limited number of tracking points is very apparent in the jitteriness of the character animation:
I want something better.
Another option is using a live-feed from a webcam (sometimes also a Kinect) to estimate pose and position. This suffers from even lower fidelity, and has problems with occlusion:
I want something better.
The most complex option is an inertial motion capture suit. Second-hand ones are quite rare (and expensive!) when they do pop up. An exhaustive search of Hackaday, GitHub, Chinese tech forums, and obscure corners of the internet reveals a few dead-end DIY projects. But one project shows hope: Chordata Motion - a 15-point tracking system which utilizes small off-the-shelf IMU sensors, all run from a Raspberry Pi:
The team has shipped assembled units to early backers, the project has a small but active community forum full of technical discussion and support, and everything about the project is open-source. With the gerber files, bill of materials, 3d models (for enclosures), and firmware/software openly published, we have everything needed to build our own inertial motion capture suit.
The Chordata Motion system is built up of two components: a central hub to receive and process positional data, and as many “kceptor” motion trackers as you would like to feed into the hub (15 for full-body tracking).
This means there are two unique PCB layouts we will need to get made. The project team have shared gerber files, but one of the benefits of an open-source project is that the community can make further improvements. In this case, a member of the community (@valor) has made a more compact version of the hub and kceptor:
In addition to 3d-printable enclosures:
Before sending the gerber files off for PCB fabrication (EasyEDA is my preferred vendor), I checked the forums for any outstanding issues. A couple of members have confirmed the designs work, so as a final check I used KiCad to ensure the PCBs would pass the EasyEDA DRC (Design Rule Check - each PCB vendor has specific tolerances and limits which their machinery requires). I modified the PCB layouts slightly to pass DRC, and the finalized gerber files were sent to EasyEDA.
A few weeks later, we have PCBs:
As well as a stencil to aid with applying solder paste (highly recommended):
The COVID electronics component shortage was in full effect, meaning many of the electronic components specified in the Bill of Materials would need to be replaced with equivalents. Luckily, the Chordata system uses mostly standard parts with easy substitutions (resistors, capacitors, LEDs, voltage regulators, multiplexers), but there is one extremely important component which presents a significant problem:
This tiny IMU (inertial measurement unit) chip, the LSM9DS1, contains a magnetometer, accelerometer, and gyroscope and is used for orientation and motion detection. This chip is present on every single kceptor tracker, and is the key to the entire operation of the motion capture suit. And it was sold out across the planet, with no known restocking date.
There are no equivalent chips which can be substituted. Somehow we need to get our hands on at least 15 of these chips (ideally more than 20), so we have some flexibility during assembly, testing, and future use.
There is a way: salvage.
There are pre-assembled 9DOF sensor modules for use with Arduino that use the LSM9DS1TR as their IMU. These assembled boards are only slightly more expensive than buying the IMU chip separately, but removing the IMU chip presents two challenges:
To deal with problem 1, we can use an infrared preheater to heat the boards from below. To deal with problem 2, we should buy as many spare boards as possible and work as quickly as possible when desoldering.
The workstation for desoldering the surface-mount chips:
After a few failed attempts, the method used by Louis Rossmann gives reliable results (the secret being obscene amounts of flux):
We can now harvest enough IMU chips to proceed.
The enclosures for the Chordata hub and kceptors are easily printed on an Ender-3:
The Raspberry Pi 3B+ is readily available. An enclosure for it can also be printed:
We will need to make cabling of various lengths, some for pin headers, others terminated with RJ45 for communication from the kceptors to the hub. This is easy to achieve with cheap-ish crimping tools:
Everything will need to be powered in a self-contained fashion, using a 5V 2.1A 26,800mAh power bank in a waist belt:
The Chordata kceptors and hub need to be attached to my body, with the kceptors kept in the same position along a bone (to ensure the tracking of each bone remains stable). This can be achieved by supergluing 3d-printed clips to polyester webbing with anti-slip backing. Velcro sewn to the straps allows easy attachment/removal:
We can now move on to assembly of the electronics.
The PCBs are designed for the use of SMD (surface mount) components, which are typically soldered using a reflow technique in an oven. Luckily I have a reflow oven, but even without one it would be possible to solder everything using tweezers and a hot air gun.
The reflow temperature profile was manually entered to match the solder paste being used. Solder paste was applied to the PCB through the matching metal stencil, and the electrical components were placed by hand with tweezers. To improve airflow during reflow, the PCB being reflowed was raised on spare PCBs.
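For reference, a reflow profile steps through preheat, soak, reflow, and cooling stages with rough temperature targets. The sketch below is illustrative only (and assumes a leaded paste); the real numbers should always come from the solder paste's datasheet.

```python
# Illustrative reflow profile for a leaded solder paste -- these values
# are assumptions for the sketch, not the ones used here; copy the real
# numbers from your paste's datasheet.
REFLOW_PROFILE = [
    # (stage,    target temp in C, approx. duration in s)
    ("preheat",  150,              90),
    ("soak",     175,              90),
    ("reflow",   225,              40),  # peak, above the alloy's melting point
    ("cooling",   50,              60),
]

for stage, temp_c, seconds in REFLOW_PROFILE:
    print(f"{stage:>8}: ramp to {temp_c} C, hold ~{seconds} s")
```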
The end result is assembled hubs and kceptors:
For redundancy, a total of 2 hubs and 18 kceptors were assembled.
The intended location of each kceptor is given by this diagram:
The full suit is assembled using RJ45 cables to connect all kceptors to the hub:
The hub and Raspberry Pi are velcro’d to the central webbing which is placed around the waist:
The full suit:
The suit is time-consuming to put on, but once everything is strapped in the kceptors remain rigidly in place:
The Chordata team has developed the notochord software which runs on the Raspberry Pi, receiving all of the raw data from the kceptors and performing the sensor fusion required to determine the relative position and rotation of each kceptor. It then transmits this information over WiFi to a receiving computer. Getting this software onto the Raspberry Pi is a simple process.
After many painful hours of troubleshooting communication errors, intermittent connections, failed calibrations, undocumented software bugs, and even more intermittent connections, we finally have the motion capture suit sending tracking data over WiFi using the OSC protocol:
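Before involving Blender at all, it's worth confirming on the receiving PC that OSC packets are actually arriving. Below is a minimal sketch using the python-osc package; the port number is an assumption on my part - use whatever the notochord is configured to send to.

```python
# Minimal OSC listener to confirm the suit's packets are arriving.
# Assumes the python-osc package; the port is an assumption -- match it
# to the notochord's configuration.
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

def print_packet(address, *args):
    # Each message carries one kceptor's fused orientation data.
    print(address, args)

dispatcher = Dispatcher()
dispatcher.set_default_handler(print_packet)  # print every incoming message

server = BlockingOSCUDPServer(("0.0.0.0", 6565), dispatcher)
print("Waiting for OSC packets from the suit...")
server.serve_forever()
```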
With body tracking solved, we can move on to facial tracking.
Facial tracking is a much simpler problem to solve, thanks to Apple’s inclusion of a front-facing laser dot projector on their newer iPhones to enable the FaceID authentication feature. By projecting a grid of dots onto a face using infrared light, it is possible to track facial expressions.
The app Face Cap has a simple interface and makes it easy to transmit facial blendshape data using OSC. With an iPhone pointed at our face at all times, we can have continuous (and expressive) facial tracking.
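The same python-osc trick works for peeking at what Face Cap sends. In the sketch below, the "/W" address (a blendshape index plus a weight) is based on Face Cap's published OSC format, but treat it and the port as assumptions to verify against the app's documentation.

```python
# Quick look at the blendshape stream from Face Cap.
# The "/W" address and the port are assumptions -- verify against the app.
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

def on_blendshape(address, index, weight):
    # index identifies the blendshape (e.g. jawOpen); weight is 0.0-1.0
    print(f"blendshape {index}: {weight:.2f}")

dispatcher = Dispatcher()
dispatcher.map("/W", on_blendshape)                    # blendshape weights
dispatcher.set_default_handler(lambda addr, *a: None)  # ignore everything else

BlockingOSCUDPServer(("0.0.0.0", 9000), dispatcher).serve_forever()
```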
The next problem is figuring out how to keep an iPhone pointed at our face. One solution is quite simple: a ski helmet with various GoPro-mount accessories cobbled together.
It’s heavy, but it works (for an hour or two, before neck pain kicks in):
If we don’t expect much movement during a motion capture session, we can get by with a desk-mounted iPhone holder (as long as we face forwards):
With the hardware and software on the transmitting side (motion capture suit, iPhone) figured out, we can move onto the software on the receiving side (a desktop PC).
Our ultimate goal is to have a motion-captured 3D character in a 3D environment streamed live onto the internet in real-time.
Unreal Engine is the obvious first choice to achieve this - it is free, extremely flexible, and is what CodeMiko herself uses. A significant problem, however, is that receiving tracking data from the Chordata suit is not currently supported by any existing plugin, so we would need to write our own. This is likely more work than it sounds, which could explain why the community has not yet developed a solution.
There is an alternative option: the Chordata team maintains a Blender plugin for receiving tracking data from the suit. Blender is fully featured and free; however, it presents one major problem: it is not a real-time rendering engine like Unreal. Instead, it is intended for a traditional pre-rendered 3D still/animation pipeline.
What if we could come up with a way to use Blender as a real-time engine?
It is relatively straightforward to get the Chordata plugin up-and-running in Blender. OSC messages are sent over WiFi from the Raspberry Pi on the suit, intercepted by Blender, and translated to bone positions and rotations:
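Conceptually, the plugin's job boils down to writing a fused orientation onto the matching pose bone every time a packet arrives. A stripped-down sketch of that step, runnable from Blender's Python console (the armature and bone names are placeholders for this sketch, not the ones the plugin actually uses):

```python
# Apply one kceptor's fused orientation to a pose bone.
# "Armature" and "forearm.L" are placeholder names for this sketch.
import bpy
from mathutils import Quaternion

armature = bpy.data.objects["Armature"]
bone = armature.pose.bones["forearm.L"]

bone.rotation_mode = 'QUATERNION'
# (w, x, y, z) as produced by the sensor fusion for this kceptor
bone.rotation_quaternion = Quaternion((1.0, 0.0, 0.0, 0.0))
```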
After calibrating the suit (a procedure that establishes the baseline position, rotation, and environmental factors for each kceptor), the result is a 3D character directly controlled by the motion capture suit:
The intended workflow is to record animations within Blender using the motion capture suit and then play them back later during a render. But this isn’t strictly required - even without recording an animation, the Blender viewport is updated in real-time with the motion capture suit’s positions and rotations. If we could make the Blender viewport look as good as possible within its limitations (simplified lighting, no motion blur, limited anti-aliasing, lower resolution), we should be able to use the viewport itself as a real-time renderer:
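Most of that tuning is just Eevee and viewport settings, which can be set by hand or scripted. The sketch below shows the kind of trade-offs involved; the property names are from the Blender 2.9x-era Python API, and the specific values are my guesses at a reasonable quality/framerate balance rather than a recipe.

```python
# Tune Eevee + the viewport for use as an interactive "renderer".
# Values are guesses at a quality/framerate balance, not a recipe.
import bpy

scene = bpy.context.scene
scene.render.engine = 'BLENDER_EEVEE'  # Blender's real-time rasterizer

scene.eevee.taa_samples = 8   # keep viewport sampling cheap
scene.eevee.use_gtao = True   # ambient occlusion
scene.eevee.use_bloom = True  # cheap glow
scene.eevee.use_ssr = False   # skip screen-space reflections for speed

# Switch every 3D viewport to fully rendered shading
for area in bpy.context.screen.areas:
    if area.type == 'VIEW_3D':
        area.spaces.active.shading.type = 'RENDERED'
```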
In theory this solution would also allow us to build a full 3D environment for the character, but achieving decent results with an environment requires a different viewport configuration than the character does:
We can take a hybrid approach to solve this problem: pre-render our environment, then composite the real-time character on top of it (an approach familiar to anyone who remembers the PlayStation 1 era).
We render our environment in a looping animation:
We then modify the viewport to work as a green screen to enable easy masking of the character model:
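Making the viewport behave like a green screen mostly comes down to a flat green world background plus switching off every overlay that would otherwise be drawn on top of the character. A rough sketch of those settings, again via the Blender Python API (the default world name is assumed):

```python
# Turn the viewport into a keyable "green screen".
import bpy

# Flat, fully saturated green world background (default world name assumed)
world = bpy.data.worlds["World"]
world.use_nodes = True
background = world.node_tree.nodes["Background"]
background.inputs["Color"].default_value = (0.0, 1.0, 0.0, 1.0)
background.inputs["Strength"].default_value = 1.0

# Hide everything that is not the character itself
for area in bpy.context.screen.areas:
    if area.type == 'VIEW_3D':
        space = area.spaces.active
        space.overlay.show_overlays = False  # grid, bone overlays, etc.
        space.show_gizmo = False             # navigation gizmos
```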
With these two video layers figured out, we can work on compositing them in real-time.
We have two video streams (a pre-rendered video and a screen-captured portion of the Blender viewport), and we should plan on having multiple audio streams (microphone audio, background music) in the future. We need to composite all of these streams together, potentially apply effects to particular elements (audio or visual), and stream the results on the internet. Every livestreamer solves this type of problem the same way: OBS.
OBS is the Swiss Army knife of video and audio production. It lets you compose scenes with multiple sources and record or stream the results live to online platforms. Anything it can’t do out-of-the-box can be achieved with one of the countless plugins developed by the community. It is also free.
The animated background is a simple looping video as the bottom layer:
For the top layer, we capture the Blender viewport and mask out the character model:
And when put together, we get:
We have accomplished our goal: a real-time motion-captured 3D character in a 3D environment, ready to be streamed or recorded.
While this is a satisfying starting point, there are a number of enhancements required to make this into a system capable of producing compelling content.
If you’re familiar with motion capture systems, you might be surprised to hear that an open-source DIY inertial motion capture suit could determine the translation of the root bone in space. Your surprise would be warranted: Chordata is not able to do this (yet). As a result, the root bone (the hip) stays anchored in place for the most part (with limited vertical translation made possible by anchoring the foot bones to the floor and using IK). The practical implication is that this motion capture system will not work for tracking a character walking around an environment - it can only capture the “inner pose” (all rotations relative to the root bone at the hip). This limitation makes sense in the context of the Chordata system, which was originally intended for capturing dance poses.
One idea I am exploring is attaching a Vive tracker to the waist of the suit, and using the Vive outside-in tracking system for global positioning of the root bone of the character model.
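The Blender side of that idea is straightforward: once the tracker's position is obtained (e.g. via SteamVR/pyopenvr, which I'll leave out here), it just needs to be remapped into Blender's coordinate system and written onto the character. A hypothetical sketch, assuming the position arrives in OpenVR's Y-up, metre-based coordinates and that the object name and axis mapping below are placeholders:

```python
# Apply an externally obtained Vive tracker position to the character.
# How the (x, y, z) values arrive (pyopenvr, OSC, ...) is left out here;
# the object name and axis mapping are assumptions for this sketch.
import bpy
from mathutils import Vector

armature = bpy.data.objects["Armature"]

def apply_root_translation(x, y, z):
    """Remap an OpenVR (Y-up) position into Blender's Z-up world space."""
    armature.location = Vector((x, -z, y))
```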
With only one pre-rendered background video, we are stuck with a front-view of the character. We could render views of the environment from multiple angles, and adjust the viewport angle of the character model to match.
Once we introduce a controllable camera, we need a way for the performer to control it in real-time. This could be achieved with a wireless numpad in the hand of the performer, firing off hotkeys in OBS. This would also allow the performer to trigger effects and other interactive elements.
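OBS's built-in global hotkeys would cover the basics, but obs-websocket opens up more options. As an illustration (not what I'm running today), here is a hypothetical sketch using the obsws-python and pynput packages to map numpad digits to scene switches; the scene names, port, and password are placeholders, and numpad key reporting varies by OS.

```python
# Hypothetical numpad-to-OBS bridge via obs-websocket.
# Scene names, port, and password below are placeholders.
import obsws_python as obs
from pynput import keyboard

client = obs.ReqClient(host="localhost", port=4455, password="changeme")

SCENES = {
    "1": "Front View",
    "2": "Side View",
    "3": "Dance Floor",
}

def on_press(key):
    char = getattr(key, "char", None)  # numpad digit reporting varies by OS
    if char in SCENES:
        client.set_current_program_scene(SCENES[char])

with keyboard.Listener(on_press=on_press) as listener:
    listener.join()
```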
We could add “filters” and other effects to the video layers in OBS (background and character) to spice things up.
We could use our armature to control multiple characters within Blender, or could make duplicates of our existing characters in OBS, treating the duplicates differently than the main character.
These clips incorporate the improvements mentioned above, as well as others described in detail in the follow-up to this post.
To show the flexibility of our approach: by swapping out the background video, changing the camera framing, and swapping out the character model we can achieve an entirely different type of scene: