Interactive audiovisual insanity, directed in real-time by a motion-captured performer
I have always loved the chaos of public-access television. Amateur camera work, rough greenscreen abuse, unscreened phone calls, a general sense of overwhelm and confusion.
Let’s Paint TV is as close to perfect as the medium gets:
Platforms like Twitch and YouTube are clearly the spiritual successors to public access television, but their content doesn’t scratch the same itch for me. It’s all a little too polished. Public access television was distinctly wack.
What if I used modern techniques to recreate the chaotic energy of live public access television?
Motion captured performance. Wacky visual effects. Synchronized to music. Controlled in real-time by the performer. Streamed live on the internet.
Initial version:
Upgraded version:
I already had a basic livestreaming motion capture system up and running, which I could use as a starting point. My homebuilt motion capture suit animates a 3D character in real-time, the character is composited over a pre-rendered background, and the output can be recorded or streamed onto the internet.
I needed to add the following functionality:
My existing setup offered only a single camera with a stationary front view of the character. The looping background scene is pre-rendered using the same camera as the character to keep everything spatially consistent.
If I wanted to create dynamic content I would need multiple camera positions to choose from. The following camera positions should be sufficient:
I anticipated wanting sweeping/panning camera moves between these positions, in which case only a single camera would be needed in Blender and I could update its position and rotation to transition between these “standard” viewing angles. For now I decided to keep things simple and switch between different static cameras:
I pre-rendered a looping background scene from each of these new camera positions:
By matching the Blender viewport camera to the appropriate background render I can dynamically change the scene in a convincing way:
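Under the hood, a view change is just activating a different Blender camera and telling the compositing side which pre-rendered background loop to use. A minimal bpy sketch (camera and clip names here are illustrative, not the ones from my scene):

```python
import bpy

# Illustrative: each camera has a matching pre-rendered background loop
BACKGROUNDS = {
    "Cam.Front": "backgrounds/front_loop.mp4",
    "Cam.Left":  "backgrounds/left_loop.mp4",
    "Cam.Right": "backgrounds/right_loop.mp4",
}

def switch_view(camera_name: str) -> str:
    """Make the named camera active and return the matching background clip."""
    bpy.context.scene.camera = bpy.data.objects[camera_name]
    return BACKGROUNDS[camera_name]
```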
This was where things got fun. I’m a huge fan of datamoshing, analog video artifacts, video feedback, digital compression artifacts, and psychedelic art. I wanted to build a system which would allow me to play with these types of effects.
I had three video streams: a 3D character, a background, and a composite of the two. Each of these layers is suited to a different type of effect: the character could experience glitch effects (à la Max Headroom), the background could cycle through colors in a psychedelic fashion, and the final composite might have effects applied to it to simulate various media (for example, a green monochrome CRT display, or scratchy 16mm film).
What I needed was a system which could work with these various video streams independently, offering the flexibility to toggle and combine various effects on-demand in a performant way. This is the problem that all VJs face, so I started by surveying the software they use.
Although each of these showed potential, none of these options were general enough to meet my needs. I knew I wanted the performer to control the effects (and the camera) in real-time using a controller of some kind. Presumably this would require a system to handle these commands, routing them where appropriate, all while keeping track of the current state of the scene (e.g. which effects are toggled, the order of the layers, which camera is currently active, etc.)
There are software solutions which operate at this higher level, allowing the scripting/programming of various control systems (e.g. Chataigne, Node-RED). These programs would be able to receive real-time inputs from the performer and then route the appropriate actions to one of the above effects programs, but I preferred an all-in-one (routing + effects) solution.
Enter one of the greatest pieces of software ever created: TouchDesigner.
TouchDesigner is difficult to describe. The wiki calls it a “node-based visual programming language for real-time interactive multimedia content”. The simplest way to describe it is as a visual scripting environment that lets you take almost anything as an input (e.g. USB device, network message, video stream, audio stream, etc.) and drive almost anything as an output (e.g. a microcontroller, USB device, video stream, audio stream, laser controller, etc.). What you do in the middle is completely up to you (e.g. generate a texture, modify a sound, perform logical or mathematical operations, etc.).
My first exposure to TouchDesigner was a Deadmau5 livestream showing how he uses TouchDesigner to run his cube. Enough said, TouchDesigner wins.
As a visual scripting environment, TouchDesigner made it easy to see the logical flow of the various video streams. I set up a pipeline to apply effects on each video layer independently:
And then ran the composited image through its own effects pipeline:
As for how I implemented the effects themselves, the most powerful technique available was shaders.
Shaders can be thought of as small programs which modify pixels before they are displayed. The shader is aware of the relative position of each pixel and can use this information while making the modifications, enabling a wide variety of effects and transformations.
Shaders in TouchDesigner are written using GLSL. The language is easy to understand and there is an enormous amount of reference material to learn from. After much experimentation, I implemented a wide variety of effects.
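Conceptually, each effect is just a function that maps a pixel's position (and the color already there) to a new color. The real effects are GLSL TOPs, but the idea translates directly; here is a rough NumPy sketch of a scanline-style effect that darkens every other row and tints the image based on the vertical coordinate (purely illustrative, not one of my actual shaders):

```python
import numpy as np

def scanline_tint(frame: np.ndarray) -> np.ndarray:
    """frame: (H, W, 3) float RGB in [0, 1]. Returns a modified copy."""
    h, w, _ = frame.shape
    out = frame.copy()
    # Darken every other row to fake CRT scanlines
    out[::2] *= 0.6
    # Tint each row based on its normalized vertical position
    v = np.linspace(0.0, 1.0, h)[:, None]    # (H, 1)
    out[..., 0] *= (0.8 + 0.2 * v)           # more red toward the bottom
    out[..., 2] *= (1.0 - 0.2 * v)           # less blue toward the bottom
    return np.clip(out, 0.0, 1.0)
```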
Some effects are applied only to the character:
Other effects are applied to the background:
But the most fun are the ones applied to the composited image. By stacking numerous shaders to recreate various flaws in old display technologies, I can recreate the look of a variety of mediums:
With shader effects implemented, I moved on to a different kind of effect - one that considers the element of time.
We all remember the first time we pointed a recording device at itself.
TouchDesigner has a built-in feedback effect, but building my own version offered significantly more flexibility. By maintaining a cache of previous frames (for example the last 100 frames) of a video stream, I could create interesting effects by combining the previous frames with the current frame in various ways.
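A rough sketch of the idea (standalone Python for clarity; in the project this lives inside TouchDesigner):

```python
from collections import deque
import numpy as np

CACHE_SIZE = 100
frame_cache: deque = deque(maxlen=CACHE_SIZE)   # most recent frame at index -1

def push_frame(frame: np.ndarray) -> None:
    frame_cache.append(frame)

def trail(decay: float = 0.85) -> np.ndarray:
    """Blend the cached frames oldest-first, so the most recent frames dominate."""
    out = np.zeros_like(frame_cache[0], dtype=np.float32)
    for f in frame_cache:
        out = out * decay + f.astype(np.float32) * (1.0 - decay)
    return out
```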
For example, I can create psychedelic effects by retaining previous frames and applying different colors to them:
Or I can progressively blur previous frames for a ghostly trail:
But most importantly, with this cache I can implement one of the most iconic video effects of all time - the Max Headroom glitch:
Jumping back-and-forth between a recent character frame and a random previous frame (at varying speeds) recreates the classic effect (using an alternate scene previously created):
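In terms of the frame cache above, the selection logic is roughly the following (a sketch, not the exact implementation; the hold length and randomness are tunable):

```python
import random

def glitch_frame_index(tick: int, cache_len: int, hold: int = 4) -> int:
    """Alternate between the most recent frame and a random older frame,
    holding each choice for `hold` ticks so the jumps read as stutters."""
    phase = tick // hold
    if phase % 2 == 0:
        return cache_len - 1                            # near-live frame
    return random.Random(phase).randrange(cache_len)    # stable random pick per phase
```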
Now that I had a collection of effects, I could experiment with combinations that led to the most visually interesting results.
Although most of the effects I created work fine on their own, I found the results much more interesting when they were combined in various ways. Some combinations were obvious (e.g. silhouette and distortion), but others weren’t. To facilitate this discovery process, I implemented a way of dumping the current list of active effects, and used it to compile a long list of pleasing combinations. After some time, patterns emerged - for example, many of the pleasing combinations relied on cycling through colors. I organized these combinations into higher-level “presets” (for example “RGB” for cycling colors). This simplified the selection of pleasing effects in real-time - instead of manually combining effects I could pick from existing presets (randomly, or deliberately).
For flexibility, I retained the functionality for manual experimentation by combining individual effects on-the-fly. This mode can be toggled on when desired.
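Under the hood, a preset is just a named set of effect toggles per context; a simplified sketch (effect names other than “RGB” are illustrative):

```python
# Illustrative preset definitions: preset name -> effects to enable, per context
PRESETS = {
    "RGB":      {"character": ["edge_glow"],      "background": ["color_cycle"], "composite": ["crt"]},
    "GHOST":    {"character": ["feedback_trail"], "background": [],              "composite": ["vhs"]},
    "HEADROOM": {"character": ["glitch_jump"],    "background": ["stripes"],     "composite": ["scanlines"]},
}
```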
I ended up with tables of presets for each context:
I then turned my attention to controlling everything I had developed so far.
My motion capture suit is “wireless” (i.e. not tethered to a PC), so the controller system needs to be wireless as well. I wanted the performer to have quick access to a number of hotkeys without having to look down at their controller (as this glance would be obvious in the motion-captured performance). Voice commands were my first thought, but in the long run I wanted the performer to be able to speak with the audience, so voice commands were out. What I needed was a physical controller with multiple buttons that could fit in a single hand.
A convenient solution was a wireless numpad:
The labels on the keys were irrelevant for this use case, so to increase one-handed grip I covered the buttons and the rear with grip tape:
The numpad provides the “director” controls for the performer. It controls camera position, scene selection, microphone on/off, etc.
The numpad has a limited number of keys, but its layout lends itself well to camera control (as there are 8 camera positions to choose from, and their relative placement can be logically mapped to the numeric keys).
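Sketched as a lookup table, the idea is to treat the keypad as a top-down view of the scene with the character in the middle (the key assignments and camera names here are illustrative):

```python
# Illustrative mapping: numpad key -> camera position, laid out like a top-down view
NUMPAD_CAMERAS = {
    "Numpad7": "BackLeft",   "Numpad8": "Back",   "Numpad9": "BackRight",
    "Numpad4": "Left",                            "Numpad6": "Right",
    "Numpad1": "FrontLeft",  "Numpad2": "Front",  "Numpad3": "FrontRight",
}
```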
The following hotkey layout served as a good starting point for camera control:
I planned to use some of the remaining keys for scene control, microphone control, and so on - but it was already clear that there were not enough keys available on the numpad to control the effects system. I needed a way to easily trigger numerous effects across 3 different contexts (character, background, composited image). Ideally the performer would have a full keyboard in a small form factor, operated blindly in one hand. Is this possible?
The answer is yes, thanks to chording.
By treating unique combinations of multiple keys (known as chords) as individual hotkeys, the number of key-mappings expands considerably. With a small number of buttons you can achieve the functionality of a full keyboard.
The Twiddler does exactly that:
The Twiddler is ergonomic, lightweight, and extremely flexible. It takes some time to develop the muscle memory for each button’s location, but once that is established it is possible to enter complex combinations without glancing. By combining the buttons underneath the user’s thumb with the buttons available to their other fingers, it is possible (in theory) to easily toggle effects in 3 different contexts.
The Twiddler uses a nomenclature that makes the chords easier to remember: each of the 4 rows is written as L, M, or R (left, middle, right), or O when no key in that row is pressed.
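A quick back-of-the-envelope count shows how far this goes (the thumb-modifier count below is an assumption for illustration):

```python
# Each of the 4 rows is one of O (open), L, M, R -> 4^4 combinations, minus the all-open case
finger_chords = 4 ** 4 - 1                     # 255 chords with the fingers alone
thumb_modifiers = 3                            # e.g. ALT plus two others, one at a time (illustrative)
print(finger_chords * (thumb_modifiers + 1))   # 1020 mappings - far more than a full keyboard
```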
The final layout looks like this:
With two input devices and layouts for triggering actions, I moved on to receiving these input commands on the PC side.
Both the numpad and the Twiddler are recognized by a PC as USB input devices, and they transmit typical keyboard (and mouse, in the case of the Twiddler) button presses. The simplest approach would have been to use the built-in hotkey features in Blender, TouchDesigner, and OBS - I could distribute the hotkeys so there are no collisions, and have each program deal with them individually. But this would be cumbersome to set up and manage, especially as the number of hotkeys increases (which is expected to happen as more effects are added in the future). There is also a risk of overlapping with an existing Windows hotkey (e.g. Ctrl-S or Ctrl-X). A more centralized and easily-configurable option was needed.
AutoHotKey is the go-to for solving this problem: it is a free, open-source scripting language for macros, hotkeys, and other automations. My controller system posed a problem: I had multiple input devices acting as keyboard and mouse, and I didn’t want keypresses to be accepted by the operating system. To solve this I used AutoHotInterception to “intercept” the commands from specific USB devices (i.e. only the wireless numpad and the Twiddler), preventing them from continuing to the operating system and other programs. But how to route the intercepted commands to the appropriate software?
I was already using the OSC protocol for transmitting facial blendshapes from an iPhone to Blender, so it was the natural choice. Although it was originally intended for networking synthesizers with computers, the format is so open-ended that it can be used by any device to send a message to any other device or software.
An OSC message is very simple, consisting of an address (which works like a path, e.g. /Blender) and zero or more arguments (e.g. a string).
So a message being sent to Blender to activate a particular camera could look like:
/Blender s("Front")
What I did in this case was intercept all inputs from the two controllers, convert them into OSC messages, and have TouchDesigner receive the OSC messages and take the appropriate action(s). The following OSC message format makes it clear which button combination on which controller has been pressed:
/twid s("(ALT)LLOO")
/nump s("NumpadHome")
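The AutoHotKey side does the actual conversion, but the idea is easy to show with python-osc: take the identifier of whichever intercepted key (or chord) fired and forward it as a single string argument (host, port, and names here are placeholders):

```python
from pythonosc.udp_client import SimpleUDPClient

# TouchDesigner listens for OSC on this port (address/port are placeholders)
client = SimpleUDPClient("127.0.0.1", 10000)

def forward_keypress(device: str, key: str) -> None:
    """device: 'nump' or 'twid'; key: e.g. 'NumpadHome' or '(ALT)LLOO'."""
    client.send_message(f"/{device}", key)

forward_keypress("nump", "NumpadHome")   # -> /nump s("NumpadHome")
forward_keypress("twid", "(ALT)LLOO")    # -> /twid s("(ALT)LLOO")
```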
In TouchDesigner, I parse the OSC messages to split off the address and the argument, and have logic set up to handle each request. TouchDesigner primarily uses node-based visual scripting to implement logic, but for more complex situations it offers the ability to program in Python. In this case I call Python functions based upon the OSC message that is received.
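One way to receive OSC in TouchDesigner is an OSC In DAT; its callback then dispatches to Python, roughly like this (the handler names and routing are illustrative):

```python
# Callbacks DAT attached to an OSC In DAT in TouchDesigner
def onReceiveOSC(dat, rowIndex, message, bytes, timeStamp, address, args, peer):
    command = args[0]                       # e.g. "NumpadHome" or "(ALT)LLOO"
    if address == "/nump":
        handle_director_command(command)    # cameras, scenes, microphone, ...
    elif address == "/twid":
        handle_effect_chord(command)        # toggle effects / presets
    return
```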
For maximum flexibility, I managed which commands call which functions in a table:
At this point I had real-time motion-captured performance, triggerable visual effects, and a flexible controller system in place. The final piece was streaming the result live to the internet.
In the earlier version of this system I used OBS as the recording/streaming solution (as every other livestreamer does). This remained the best choice, but some changes were required now that I was using TouchDesigner as the video source - I needed a way to capture TouchDesigner’s output in a performant way. Previously I used screen capturing in OBS to achieve a similar goal (capturing the Blender viewport), but this was only a quick-and-dirty solution with many limitations (low resolution, locking the position of the application window, etc.). There was a much better solution available: Spout.
This VJ system runs on a single computer. The visuals created in TouchDesigner have already been rendered by the GPU, so shouldn’t it be possible for other applications to use that existing render for their own purposes? This is what Spout achieves - by sharing textures on the GPU across multiple programs, each program gets zero-latency access to the same visual imagery. It was exactly what I was looking for, and it was easily implemented in both TouchDesigner and OBS.
Sidenote: if the visual imagery were being generated on one computer and the streaming handled by a second, I could use NDI to achieve a similar result by sending the video over the network (with minimal latency of ~50ms).
I had built a powerful system for VJing, but had not yet addressed the most important part: the music. OBS has built-in support for playlists, and an OSC plugin lets my existing controller system control playback (next, previous, pause, etc.).
It would be ideal if the currently-playing song was displayed onscreen. This was achieved with another plugin.
Although effects are intended to be manually triggered, I added functionality which allows effects to react to the music directly (e.g. beat detection). This was achieved by routing the audio output from OBS into TouchDesigner. Unfortunately this introduced some latency - without compensating for it, the reactive effects would be responding slightly after the music had already played. I addressed this by adding a small delay to the audio stream sent to the final output (but not the audio stream sent to TouchDesigner), ensuring the final output from OBS has the audio in sync with the visuals.
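Conceptually, a beat trigger can be as simple as comparing the current audio block’s energy to its recent average; a rough standalone sketch of the idea (TouchDesigner’s audio-analysis operators do the real work, so this is only illustrative):

```python
import numpy as np

def is_beat(block: np.ndarray, history: list, sensitivity: float = 1.4) -> bool:
    """block: one short chunk of mono audio samples. Returns True when the chunk's
    energy clearly exceeds the recent average energy (a crude onset detector)."""
    energy = float(np.mean(block.astype(np.float64) ** 2))
    average = sum(history) / len(history) if history else energy
    history.append(energy)
    if len(history) > 43:          # keep roughly one second of history
        history.pop(0)
    return energy > sensitivity * average
```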
Until this point I had been using a generic low-poly character, but I wanted something more distinct. My ultimate goal was to create a modern version of public-access television with its characteristic aesthetic. I had referenced Max Headroom as an influence, and many of the effects I implemented are refined versions of those found in 80s and 90s television - so I figured why not continue this trend and put Max Headroom in a tuxedo?
As for his name: I set out from the beginning to create a wack aesthetic, so why beat around the bush? “wackbar”.
There was plenty of room on both sides of the main character for additional animated characters. Thinking of the additional characters as backup dancers offered an excuse for various formations alongside and behind the main character. For the choice of backup dancer model, I continued the trend of ripping off iconic characters from the 80s and used one of Hajime Sorayama’s female robots:
The VJ system operates under the assumption of a single performer who “does-it-all”, so how could multiple characters be animated at once?
One option was to use pre-recorded animations for the backup dancers, but I would need to synchronize these pre-recorded movements with the music that is currently playing (a challenging task). An easier approach came to mind - why not re-use the motion capture I was using for the main character? I could mirror the skeleton in the backup dancers to provide some visual distinction, and could introduce a slight delay (perhaps the length of a beat) so the duplication is less obvious.
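The delay-and-mirror idea is simple to express; a conceptual sketch, assuming each mocap frame arrives as a dict of bone rotations (the left/right mirroring is heavily simplified here):

```python
from collections import deque

DELAY_FRAMES = 30                       # e.g. one beat at 120 BPM, captured at 60 fps
pose_buffer: deque = deque(maxlen=DELAY_FRAMES)

def mirror(pose: dict) -> dict:
    """Swap left/right bone names (a real rig would also need axis flips)."""
    def flip(name: str) -> str:
        if "Left" in name:
            return name.replace("Left", "Right")
        return name.replace("Right", "Left")
    return {flip(bone): rotation for bone, rotation in pose.items()}

def backup_dancer_pose(main_pose: dict) -> dict | None:
    """Feed in the main character's pose each frame; get back a delayed,
    mirrored pose for the backup dancers (None until the buffer fills)."""
    pose_buffer.append(dict(main_pose))
    if len(pose_buffer) < DELAY_FRAMES:
        return None
    return mirror(pose_buffer[0])
```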
This approach worked quite well:
After experimenting with the system in its current state, a few limitations stood out:
I addressed each of these with a few upgrades.
As much fun as it was to build my motion capture suit, the inability to walk around an environment limited my system to a stationary dancing character. One option to remedy this was to enhance my motion capture suit by attaching a Vive tracker to the waist, using the Vive’s outside-in tracking system to track the performer’s position in the room. There were no technical reasons why this should not work, but I decided to go with an easier solution: I bought a professional inertial motion capture suit.
The explosion in popularity of VTubing and VRChat over the last few years has resulted in a healthy secondary market for motion capture suits. I picked up an older Perception Neuron suit on eBay for a reasonable price:
One reason this model of suit was desirable is that no subscription is required for its interfacing software (subscription licensing is unfortunately common for high-end motion capture suits) - an older version of the Neuron software is still available and will keep working indefinitely:
My approach of using Blender for animation of the characters was initially motivated by the choice of motion capture suit (Chordata only provides a Blender plugin for capturing motion data). But with the switch to the Perception Neuron suit I could return to my initial preferred environment: Unreal Engine. This switch brought with it a number of improvements:
I started by creating a new environment:
Then I redesigned the character model using MetaHuman:
I programmed all the functionality needed using the Blueprints visual scripting system:
I replaced the previous facial tracking app with Live Link Face (since I’d already be using Live Link to receive motion capture data in UE5 from the Perception Neuron suit):
And used Spout to share the rendered output with TouchDesigner in 3 layers (character, background, both):
This change in video source had an impact on the effects pipeline (for example, I now had an alpha channel and no longer needed to do chroma key masking). This resulted in a number of other changes and improvements in the rapidly-expanding TouchDesigner project. The end result:
The visual improvements from the migration to UE5 were substantial:
As a final upgrade, I added some variety to the scenes.
A Matrix-inspired grid of screens:
A Korean BJ-style three-pane view:
And kaleidoscopic patterns:
A compilation of the original Blender version and upgraded UE5 version: