Custom Input Callouts Around The Mouse Position

In the last post, I talked about how to receive global mouse and keyboard events in the Windows OS. In this post, I’m going to go over why I was playing around with it.

I’ve been playing around with the idea of making some Inkscape tutorials and working on custom overlays and callout tools.

Before I continue, I want to clarify that all this is prototyping work. It’s a work in progress and isn’t publicly accessible. I use mostly Camtasia to do video recording, which already comes with integrated input callout tools.
This is going to be one of those things where a lot of work goes into doing something specific that could be done easily if I were willing to live within the constraints of an existing premade solution.

If you don’t want to read and just want to see the summary and results video, jump down to here.

Concept

When doing tutorial videos, oftentimes, overlays are used to present additional images and readable information. And for software, it’s nice to show your keyboard and mouse input when you’re demonstrating it. There are some tools to show keyboard and mouse presses, and some applications even have built-in tools or plugins to do this.

For example, Blender3D has an add-on script called Screencast Keys that places a mouse and keyboard callout at the corner of the window for people video-capturing their interaction.

In Blender, the bottom left of the 3D View shows the user’s input, with the help of the Screencast Keys plugin.

But there’s a specific look and a certain amount of control I want from this:

  • Caricatured where needed
    • Exaggerate things to make them more visible and understandable when needed.
    • For example, take the Blender Screencast Keys design (image above). What kind of mouse is that? It’s like an old-timey Mac mouse, but with 3 buttons like an old-timey PC mouse. The business area with the mouse buttons only uses up about 1/4th of the graphic; the rest is just the mouse body, which isn’t that relevant. For my implementation, I’m going to scale up the business area and scale down the blank space of the mouse body.
  • Skeuomorphed
    • I want these callouts to be obvious for what they represent. Not only as input but as the physical objects I’m pressing.
  • Mouse wheel operations
    • I want the mouse scrolling to be shown, and I want it to be clear what’s going on if I have to show the mouse wheel being scrolled and pressed simultaneously.
  • Obvious but ignorable
    • I want it to be a useful reference that I can have people focus on when necessary and something I can casually leave on for everything, even during unimportant moments.
    • I might be asking for a lot, but I want to live the dream of having it both ways: having my cake and eating it too – having it obvious and in your face, but also ignorable and out of the way.
  • Animated
    • C’mon, who doesn’t enjoy a good old-fashioned animation!? Some movement, fading, and tweening.
  • Generic
    • I want this for calling out my inputs in a few Inkscape demonstrations, but I also want it usable for any type of desktop application in the future.
  • Configurable
    • I want a system I can tweak – one where I can add more features or modify behavior for certain situations and edge cases.
  • Foveated Focus
    • Most importantly, I want the user not to stretch their focus across the entire real estate of the video recording.
    • There’s a term called inattentional blindness – it’s where you can’t find something or notice something going on, even when it’s right in front of you. Something so obvious that when you find it, you wonder how it was so difficult before. It happens to the best of us, especially when the important things that should be noticed are far from each other. Make a person spread their attention across a large enough space, and they can miss a gorilla on a basketball court.
    • I want to put all the content next to the mouse. That’s what’s doing the clicking, that’s where the user’s going to look, that’s where I intend for the user to look, that’s where I want the callouts to be. That’s where all the action is! Make it obvious and easy to see and placed where they should be looking anyway.

So with my very few and humble (sarcasm) design guidelines, let’s start!

It should be reiterated that this is a prototype and a work in progress. So it’s arguable how successful I was with this, but it’s a starting point.

Hacking Together A Solution

While I’m using Camtasia to video record, which has its own NLE, I’m actually using Sony Vegas as my NLE. I bought Sony Vegas 14 for $50 as part of a Humble Bundle, and I like it better: there’s less handholding, and it’s more versatile for my workflow. And instead of having to write a renderer (e.g., raw OpenGL, a PNG API, or Windows GDI) from scratch, I’m just going to bolt together a Unity project.
I was also considering Processing, but I know Unity better, and its Inspector and Editor IDE give me a lot of options for creating interfaces and controls.

Recording Input

First, the Input Recorder is opened.

It was created by taking a wxWidgets Hello World sample, gutting it out, and throwing some controls in.
Could I have just done this in C#, directly integrated into the Unity project? Perhaps…

The basics of how it records input was covered in my last post.

The input logging software

Setting the Render Origin

I’m capturing on a 3-monitor setup, two of which are 21:9 ultrawide. How am I going to know how to render the mouse position? I can’t just use the raw mouse position – on its own, it’s meaningless with respect to the region I’m actually video capturing.

I need to be able to set up an origin for how the mouse coordinates translate to the video capture region. Maybe I could somehow talk to Camtasia and get that information? Who knows? It sounds like that would be either overly difficult or overly fragile, though.

Since I intend to record a fullscreen app, I leverage that and have a setting to output the app’s position when I’m first recording. When I’m rendering the overlay callouts, everything is translated to use the top left of the app as the origin of where to draw stuff. To do that, I have an app selection button that allows me to grab a reference to the Window under the mouse using the function WindowFromPoint().
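
To make that translation concrete, here’s a minimal sketch of the idea using the same Win32 calls, written as C# P/Invoke for consistency with the Unity side (the actual recorder is the C++ wxWidgets app described above). WindowFromPoint() is the call mentioned above; GetCursorPos() and GetWindowRect() are its standard user32 companions, and the helper structure is my own assumption.

```csharp
using System;
using System.Runtime.InteropServices;

static class OriginPicker
{
    [StructLayout(LayoutKind.Sequential)]
    public struct POINT { public int X; public int Y; }

    [StructLayout(LayoutKind.Sequential)]
    public struct RECT { public int Left; public int Top; public int Right; public int Bottom; }

    [DllImport("user32.dll")] static extern bool GetCursorPos(out POINT pt);
    [DllImport("user32.dll")] static extern IntPtr WindowFromPoint(POINT pt);
    [DllImport("user32.dll")] static extern bool GetWindowRect(IntPtr hWnd, out RECT rect);

    // When the app selection button fires (via the space-bar trick described below),
    // remember the top-left of the window under the mouse as the render origin.
    public static POINT CaptureOrigin()
    {
        GetCursorPos(out POINT cursor);
        IntPtr hWnd = WindowFromPoint(cursor);   // window under the mouse
        GetWindowRect(hWnd, out RECT bounds);    // its screen-space rectangle
        return new POINT { X = bounds.Left, Y = bounds.Top };
    }

    // Every logged mouse position is shifted so the captured window's
    // top-left corner becomes (0, 0) for the overlay renderer.
    public static POINT ToOverlaySpace(POINT raw, POINT origin)
        => new POINT { X = raw.X - origin.X, Y = raw.Y - origin.Y };
}
```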

Because I need to put the mouse over the button to click it, what I do is click the button to give it focus. Then I can move the mouse over where I need it to be and press space to press the button again. It’s hacky, but I can afford rag-tag solutions.
There are other ways to select an app, such as the drag-and-drop picker Spy++ uses. We could also allow selecting an arbitrary point on the screen as the origin, instead of the selected app’s top-left corner.

Constructing The Mouse

The mouse is simply a few Unity GUI sprites layered on top of each other. It’s a few simple vector shapes made in Inkscape and exported to PNG. The sprites are pretty high resolution, so I have the option of scaling them up.

As mentioned before, it was an explicit decision to make the mouse buttons take up as much space as possible and the rest of the mouse body as little as possible – at least as much as I can manage without it looking completely out of proportion.

Screenshot of the mouse.

Note how the wheel is also a capsule shape to emphasize that it’s the wheel. I also thought a drop shadow would be nice to create some depth from the actual video recording.

Misc Aesthetics

Some other aesthetic choices:

  • Keyboard key skeuomorph. At the bottom, there’s an edge to make it feel like a key viewed from the top at a slight off-angle.
  • Active items are highlighted green and fade with time.

Many of these properties can be adjusted in the inspector, like the actual highlight color, the fading times of actions, offset from the cursor, etc.

And of course, the sprites for the keys and mouse can be changed directly – same with the scale of the sprites.
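
As an illustration of what those knobs look like in code, here’s a rough sketch of an inspector-configurable callout element. The field and method names are hypothetical – this isn’t the actual script – but it shows the kind of thing the Inspector exposes: highlight color, fade time, cursor offset, scale, and a swappable sprite.

```csharp
using UnityEngine;
using UnityEngine.UI;

// Sketch of an inspector-configurable callout element (names are hypothetical).
public class CalloutElement : MonoBehaviour
{
    [Header("Look")]
    public Sprite idleSprite;                   // swappable art for the key or mouse part
    public Color highlightColor = Color.green;  // color used while the input is active
    public float spriteScale = 1f;

    [Header("Behavior")]
    public float fadeTime = 0.5f;               // seconds for the highlight to fade out
    public Vector2 offsetFromCursor = new Vector2(40f, 40f);

    Image image;
    float fadeRemaining;

    void Awake()
    {
        image = GetComponent<Image>();
        image.sprite = idleSprite;
        transform.localScale = Vector3.one * spriteScale;
    }

    // Called by the log player when this input becomes active.
    public void Trigger() => fadeRemaining = fadeTime;

    // Called by the log player as it steps through the recorded timestamps.
    public void Advance(float dt)
    {
        fadeRemaining = Mathf.Max(0f, fadeRemaining - dt);
        float t = fadeTime > 0f ? fadeRemaining / fadeTime : 0f;
        image.color = Color.Lerp(Color.white, highlightColor, t);
    }
}
```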

Rendering The Frames

The process of rendering the frames is pretty straightforward.

  1. Parse the log file. The entries are separated by newlines, so a simple String.Split will do.
  2. Process the log lines in order. Do whatever each line tells us to do, e.g., recolor and move items around.
  3. Flush the frames every 1/30th of a second. Our rendered frame stack is intended to be played back at 30FPS. The timestamp of the last rendered frame is remembered; if an event’s timestamp is more than 1/30th of a second past it, we render the current state as a new frame before processing the next line. We need to remember to put the frame number as part of the filename. (A minimal sketch of this loop follows below.)
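
Here’s that loop sketched out, assuming a simple line format of a timestamp followed by an event description (the log format and helper method names are placeholders, not the recorder’s actual output):

```csharp
using System.Globalization;
using UnityEngine;

public class LogPlayer : MonoBehaviour
{
    const float FrameStep = 1f / 30f;   // the frame stack is meant to play back at 30FPS

    public TextAsset logFile;           // the recorder's output, one event per line

    public void RenderAll()
    {
        float lastFlushed = 0f;
        int frameIndex = 0;

        // 1. Parse: the log entries are newline-separated.
        foreach (string line in logFile.text.Split('\n'))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            // Assumed line format: "<seconds> <event...>", e.g. "12.034 LMB_DOWN".
            string[] parts = line.Split(' ');
            float timestamp = float.Parse(parts[0], CultureInfo.InvariantCulture);

            // 3. Flush: render frames until we've caught up to this event's timestamp,
            //    putting the frame number in the filename for the NLE import.
            while (timestamp - lastFlushed >= FrameStep)
            {
                SaveFrame($"overlay_{frameIndex:D5}.png");    // screen grab, covered below
                AdvanceCallouts(FrameStep);                   // fade highlights, tween positions
                lastFlushed += FrameStep;
                frameIndex++;
            }

            // 2. Process: apply the event (move the mouse sprite, highlight a key, etc.).
            ApplyEvent(parts);
        }
    }

    void ApplyEvent(string[] parts) { /* placeholder: update sprites and positions */ }
    void AdvanceCallouts(float dt)  { /* placeholder: advance fades and tweens */ }
    void SaveFrame(string path)     { /* placeholder: see the transparency sketch below */ }
}
```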

To have transparency, make sure the Clear Flags of the camera is set to Solid Color, and make sure the Background color has 0 alpha. Also, make sure the Texture2D you’re reading the screengrab into has an alpha channel. I use TextureFormat.ARGB32.
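
A minimal sketch of the screen grab itself under those settings – ReadPixels is only valid once the frame has finished rendering, so it’s wrapped in a coroutine that waits for the end of the frame:

```csharp
using System.Collections;
using System.IO;
using UnityEngine;

public class FrameGrabber : MonoBehaviour
{
    // Grabs the current game view, alpha channel included, and writes it as a PNG.
    // Assumes the camera's Clear Flags are Solid Color with a zero-alpha background.
    public IEnumerator SaveFrame(string path)
    {
        // ReadPixels must run after rendering for this frame has completed.
        yield return new WaitForEndOfFrame();

        int w = Screen.width;
        int h = Screen.height;

        // ARGB32 keeps the alpha channel so the overlay stays transparent in the NLE.
        var tex = new Texture2D(w, h, TextureFormat.ARGB32, false);
        tex.ReadPixels(new Rect(0, 0, w, h), 0, 0);
        tex.Apply();

        File.WriteAllBytes(path, tex.EncodeToPNG());
        Destroy(tex);
    }
}
```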

I also need to make sure the Unity preview is running at the image resolution I want the frame stack to be.

When saving a screen grab of the game, the output image resolution is the size of the game screen.

Integrating The Overlay

This is also pretty straightforward: import the frame stack into the NLE and add it to a video layer on top of the recording.

When Camtasia starts recording, a large visible countdown from 3 to 1 occurs before the actual recording starts. I often use this to time the exact moment I start recording input so I know how to align the overlay and captured video – they share the same start. You won’t see this in the demo video (below) because I’m already recording in Camtasia when I start recording the input.

Camtasia’s timer that counts down to when recording is going to start. I hit the record button when the timer hits zero.

Because the input callout overlay is just a video on another NLE video layer, it gives me extra controls. I can turn it on and off by deleting parts of it. I can control its transparency and do crossfades. I can decide if other overlays should go above it or if the callouts should be above everything.

The multiple layers in Sony Vegas for a demonstration video. The layer for the input callout video is highlighted in teal.

The overlay is grouped to the recording so that if the recording is modified or moved, the overlay will automatically move with it.

Video Explanation

So how well does it work? See for yourself (the final composite is at timestamp 3:49). Note how the overlay is semi-transparent.

Summary of the entire process, as well as a rendered result.

There’s a lot of tweaking that can be applied, but the base functionality is in place. It’s also more of a technical demo than an efficacy demo because I’m just scrambling to keep the demo short.

Why Not Do It The Other Way?

What “other” way exactly?

I think the most likely alternative someone might think to ask is, “why not make an app that is the overlay and follows the mouse”? That way, the overlay is real, and the recording, rendering, and compositing steps can be skipped.

Not a bad question, as that rendering-and-compositing overhead is pretty much all the overhead involved in this process. But a lot of flexibility is lost. Not only do we lose the ability to control the video compositing, we lose any other kind of post-processing control as well:

  • What if I wanted to add the ability to annotate the data afterward for richer overlays?
  • What if I could recover small mistakes by massaging the log data? I’d lose that ability.
  • What if I wanted to tweak the rendering options and re-render the overlays? If the callouts were directly baked into the video capture, that would require reenacting the whole video capture.

Additional Work

Some ideas for additional work, as well as issues that may need to be addressed.

I’m wondering if Unity’s entire rendering process could just be directly integrated as a Sony Vegas plugin, so instead of pre-rendering frames, they would be rendered on the fly as an overlay effect. Rendering the frames to 1080p PNG frame stacks before previewing them on top of the video in Vegas is currently the bulk of the overhead. And don’t get me started on what happens if I want to tweak and iterate – that’s an entire frame stack render each time. The most immediate consequence of going the plugin route would be that I’d need to create a random access timeline to allow arbitrary scrubbing in the NLE.
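
As a rough idea of what a random access timeline could mean here – purely a sketch, nothing implemented – the parsed log could be preprocessed into periodic state snapshots, so scrubbing to an arbitrary time only replays the events since the nearest snapshot instead of the whole log. The OverlayState and InputEvent types below are hypothetical stand-ins:

```csharp
using System.Collections.Generic;

public class InputEvent { public float Time; /* plus whatever the log line carried */ }

public class OverlayState
{
    // Placeholder for mouse position, pressed keys, fade timers, etc.
    public OverlayState Clone() => (OverlayState)MemberwiseClone();
    public void Apply(InputEvent e) { /* update state from the event */ }
}

public class InputTimeline
{
    const float SnapshotInterval = 1f;   // seconds of log time between cached states

    readonly List<(float time, OverlayState state)> snapshots = new List<(float, OverlayState)>();
    readonly List<InputEvent> events;    // parsed log, sorted by timestamp

    public InputTimeline(List<InputEvent> sortedEvents)
    {
        events = sortedEvents;

        // Pre-pass: replay the whole log once, caching the overlay state at every interval.
        var state = new OverlayState();
        float nextSnapshot = 0f;
        foreach (var e in events)
        {
            while (e.Time >= nextSnapshot)
            {
                snapshots.Add((nextSnapshot, state.Clone()));
                nextSnapshot += SnapshotInterval;
            }
            state.Apply(e);
        }
    }

    // Scrub: start from the last snapshot at or before t, then replay only the events in between.
    public OverlayState StateAt(float t)
    {
        int i = snapshots.FindLastIndex(s => s.time <= t);
        var state = (i >= 0 ? snapshots[i].state : new OverlayState()).Clone();
        foreach (var e in events)
        {
            if (e.Time > t) break;
            if (i >= 0 && e.Time < snapshots[i].time) continue;   // already baked into the snapshot
            state.Apply(e);
        }
        return state;
    }
}
```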

Many of the keys don’t look like they do on a real keyboard. This is because they all use the same system of just matching a keycode to a string literal. So no size changes, “rich layouts”, or images can currently be used as key faces.

A real keyboard has different visuals on some keys. Some are tiny words that can span multiple lines. Some are multiple symbols stacked on top of each other. Some are graphical icons.

Since only raw keypresses are shown, the significance of key chords (pressing a key combined with system keys – i.e., Ctrl/Alt/Shift) is harder to notice. I want to find a way to show all the keys while also emphasizing chords. The same goes for showing mouse actions combined with system keys.

The text for pressed keys inserts itself at the very left of where keys are shown, keeping the most recent event closest to the mouse cursor. This gets weird for typing normal text, since the text ends up reading right-to-left instead of how we naturally read, left-to-right.
Yes, I know that not all languages read left to right. ‘Murica!

It would be nice if the renderer had state knowledge of the application: both application-specific information and context. For example, if I press a key chord, the renderer could recognize it and display what that chord does. If this were turned into a generic system, some system mechanics and application-profile authoring tools would need to be made.
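
For example, an application profile might be little more than a lookup from recognized chords to human-readable descriptions that the renderer consults. This is a hypothetical sketch with a couple of illustrative bindings, not something that exists yet:

```csharp
using System.Collections.Generic;

// Sketch of an application profile: a chord-to-description lookup the renderer
// could consult when it recognizes a key combination. Names are hypothetical.
public class AppProfile
{
    public string AppName;
    public Dictionary<string, string> ChordDescriptions = new Dictionary<string, string>();

    public static AppProfile InkscapeExample() => new AppProfile
    {
        AppName = "Inkscape",
        ChordDescriptions =
        {
            { "Ctrl+Z", "Undo" },
            { "Ctrl+D", "Duplicate" },
        }
    };

    // The renderer would call this when a chord is detected and, if a description
    // exists, show it as an extra callout (or as a card in the corner).
    public bool TryDescribe(string chord, out string description)
        => ChordDescriptions.TryGetValue(chord, out description);
}
```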

If the input rendering system had information on what actions did, additional overlays could be automated in.

The example above isn’t an elegant one, but hopefully it gets the basic idea across. Although, for these kinds of content, it might be better relegated to a card in the corner of the screen instead of adding additional clutter near the mouse.

A concept for placing automated cards out of the way (bottom right corner).

– Stay strong, code on. William Leu