Body Tracking With Azure Kinect DK
Creating fun body tracking applications with the Azure Kinect Developer Kit
The kit provides the hardware and software to capture color and depth video and to extract body tracking information from it. In this article (and the companion GitHub repo, mpdroid/bones), we explore using this kit together with Azure Cognitive Services to enhance how a person can interact with objects around them in 3-dimensional space.
What is inside the box?
The hardware device is supported by two SDKs:
- Sensor SDK — a C API to connect to and start the device, to extract depth and color images, and to transform points between the depth and color coordinate systems.
- Body Tracking SDK — also a C API, which extracts information about the human bodies present in the field of view (FOV). Each tracked body is composed of 32 joints (eyes, nose, head, hands, feet etc.), each characterized by a position and an orientation (see the sketch below). The API also provides a "body index map": a data structure that tells us which depth pixels belong to which body in the video frame.
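To make the joint data concrete, here is a minimal sketch of reading joints from a body frame with the Body Tracking C API. Error handling is omitted, and body_frame is assumed to be a k4abt_frame_t already obtained from the tracker.
#include <k4abt.h>

uint32_t num_bodies = k4abt_frame_get_num_bodies(body_frame);
for (uint32_t i = 0; i < num_bodies; i++) {
    k4abt_skeleton_t skeleton;
    k4abt_frame_get_body_skeleton(body_frame, i, &skeleton);
    for (int j = 0; j < K4ABT_JOINT_COUNT; j++) {
        k4abt_joint_t joint = skeleton.joints[j];
        // joint.position is a k4a_float3_t (millimeters);
        // joint.orientation is a k4a_quaternion_t
    }
}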
The body tracking demos
The mpdroid/bones project includes a few applications that demonstrate the dev kit's capabilities. The README in the GitHub repo provides detailed instructions to download, build and run the project. While the project was developed on Ubuntu 20.04, it should also work on earlier Linux versions and on Windows. These are the prerequisites:
- GNU C Compiler (gcc 9.3.0+)
- cmake
- ninja-build
- Azure Kinect Sensor and Body Tracking libraries
- Eigen3 (For vector and matrix operations)
- An Azure Vision subscription, with the endpoint and key stored in the AZURE_VISION_ENDPOINT and AZURE_VISION_KEY environment variables respectively.
The project also makes use of a few external libraries that are embedded with the source code, and one that gets downloaded as a sub-module when you git clone the repo with the --recursive option:
- kzampog/cilantro — Point Cloud manipulation including clustering.
- ocornut/imgui — Rendering depth and camera images with drawing overlays.
- deercoder/cpprestsdk-example — REST client to invoke Azure vision services.
The diagram below shows how the components interact.
- Kinector wraps the Sensor API methods.
- Euclid wraps the Body Tracking API and implements the geometry.
- Renderor handles presenting camera frames with annotations on the application window.
- *Scene classes implement scene comprehension and annotation.
The project applies this basic framework to implement a few body tracking applications:
- Displaying Joint Information: Visualizes the body tracking information by displaying joint depth position and orientation on the color camera video feed.
- Light Sabers: Uses elbow/hand/thumb position and orientation to attach a light saber to each body in the video. Demonstrates usage in augmented reality games without the need for expensive controllers physically attached to the human body.
- Air Writing: Lets the subject create letters or other artwork in the space around them simply by moving their hands. Demonstrates how to recognize gestures and use them to direct virtual or real-world action.
- Thing-finder: Recognizes objects being pointed at by the subject. This demonstrates how to combine body tracking with point cloud geometry and with Azure cognitive services to create powerful 3-D vision AI applications.
- and others…
Thing-finder: Details
As Thing-finder touches upon all the important aspects of body tracking, the rest of this article is devoted to an in-depth treatment of it.
Kinect DK API basics
Configuration
- Check if the device is connected:
const uint32_t deviceCount = k4a::device::get_installed_count();
- Choose a configuration using the k4a_device_configuration_t structure. Below is a sample:
k4a_device_configuration_t kinect_config = K4A_DEVICE_CONFIG_INIT_DISABLE_ALL;
kinect_config.camera_fps = K4A_FRAMES_PER_SECOND_30;
kinect_config.depth_mode = K4A_DEPTH_MODE_NFOV_UNBINNED;
kinect_config.color_format = K4A_IMAGE_FORMAT_COLOR_BGRA32;
kinect_config.color_resolution = K4A_COLOR_RESOLUTION_720P;
kinect_config.synchronized_images_only = true;
Initialization
Using the above configuration, open the device, set up image buffers to receive depth/color images and size them based on the selected configuration. The API methods used during initialization are listed below, followed by a brief sketch:
- k4a::device::open and start_cameras to do the obvious
- k4a_device_get_calibration to get calibration information based on the selected configuration
- k4a_transformation_create to set up transformations between coordinate systems
- k4a_image_create to initialize the depth and color image buffers based on the configured resolutions
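A minimal sketch of this initialization, using the C++ wrapper where it is available (variable names here are illustrative, not the repo's):
#include <k4a/k4a.hpp>

k4a::device device = k4a::device::open(K4A_DEVICE_DEFAULT);
device.start_cameras(&kinect_config);

// Calibration drives how buffers are sized and how depth maps to color
k4a::calibration calibration =
    device.get_calibration(kinect_config.depth_mode, kinect_config.color_resolution);
k4a::transformation transformation(calibration);

// Pre-allocate a depth image buffer sized to the depth camera resolution
int depth_width = calibration.depth_camera_calibration.resolution_width;
int depth_height = calibration.depth_camera_calibration.resolution_height;
k4a::image depth_image = k4a::image::create(
    K4A_IMAGE_FORMAT_DEPTH16, depth_width, depth_height,
    depth_width * (int)sizeof(uint16_t));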
Initialize the windowing application for rendering video frames on the computer monitor. This project uses ImGui for rendering.
Frame loop
Once camera and window initialization are complete, start the application loop. Each iteration executes the steps below for each frame captured in real time from the camera feed (a sketch follows the list):
- Capture camera frames using k4a::device::get_capture.
- Process these frames to store them in color and depth image buffers. Transform the depth image buffer into a special 2D data structure (XY table) for efficient processing.
- Pass the image buffers and other transformed data to the Scene class so it can comprehend the scene and extract meaningful information relevant to the application.
- Let the Scene class annotate the image buffers (e.g. with a pointer arrow, bounding cube and object name).
- Render the image buffers with annotations on the application window.
- Release handles and de-allocate unused memory.
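A condensed sketch of one loop iteration, assuming a Scene-like object with comprehend and annotate methods as described above; window_should_close, render_frame, scene and device are placeholders, not the repo's actual API:
#include <chrono>

while (!window_should_close()) {                   // placeholder windowing check
    k4a::capture capture;
    if (!device.get_capture(&capture, std::chrono::milliseconds(1000)))
        continue;                                  // no frame arrived in time

    k4a::image color = capture.get_color_image();
    k4a::image depth = capture.get_depth_image();

    scene.comprehend(color, depth);                // extract application-relevant info
    scene.annotate(color);                         // draw arrows, cubes and labels
    render_frame(color);                           // placeholder for the ImGui rendering

    // k4a::capture and k4a::image release their handles when they go out of scope
}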
The Thing-finding pipeline
See appendix for the requisite geometry.
After the preliminary tasks described above are complete, the ThingFinder::comprehend method executes the following:
1. Find where the subject is pointing
- To avoid evaluating every object in the entire depth camera FOV, construct a narrower FOV: a 60° cone emanating from the tip of the right hand.
- Extract the joint positions for the right elbow and right hand. Use them to define a ray that starts at the elbow and points towards the hand.
- Displace this ray so it originates at the tip of the right hand. This ray now represents the origin and direction of the hand FOV (a sketch of this construction follows).
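A small sketch of the ray construction using Eigen (the repo depends on Eigen3; the Ray struct and function names here are illustrative):
#include <Eigen/Dense>

struct Ray {
    Eigen::Vector3f origin;
    Eigen::Vector3f direction;    // unit vector
};

// elbow and hand are joint positions in global (camera) coordinates
Ray pointing_ray(const Eigen::Vector3f& elbow, const Eigen::Vector3f& hand) {
    Eigen::Vector3f direction = (hand - elbow).normalized();
    // Displace the origin to the hand so the FOV cone starts at the hand tip
    return Ray{hand, direction};
}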
2. Find which depth points lie inside the hand FOV
- Take the depth point cloud and apply the actual colors from the corresponding points on the color image (done within the Kinector class).
- Then determine which of these points lie inside the hand FOV. See the appendix for the math; a short sketch of the test follows this list.
- The points inside the hand FOV become a slice of the original point cloud.
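Continuing with the Ray from the earlier sketch, an illustration of the FOV test applied to each point; cloud_points and cos_half_angle (cos(30°) for the 60° cone) are placeholder names:
#include <vector>

// Keep only the points whose direction from the hand lies within the cone
std::vector<Eigen::Vector3f> slice;
for (const Eigen::Vector3f& point : cloud_points) {
    Eigen::Vector3f v = (point - ray.origin).normalized();
    if (v.dot(ray.direction) >= cos_half_angle)
        slice.push_back(point);
}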
3. Find which objects lie inside the hand FOV
- Hand this sliced and colorized point cloud off to a separate thread (to avoid blocking the main application loop and causing lag in the video stream).
- Segment the colorized point cloud using the cilantro library. The library uses the DBSCAN clustering algorithm for segmentation and offers several methods and hyperparameters to tune it. We use a combination of color similarity and linear distance to segment points in the cloud (the idea is sketched after this list). Each segment/cluster represents a discrete object in the hand FOV.
- cilantro returns a clustered point cloud, from which we can extract discrete objects and separate them from noise points.
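To illustrate the idea only (the actual segmentation is delegated to cilantro, so this is a conceptual sketch rather than its API): points are grouped together when they are close both spatially and in color, which can be expressed as a combined distance.
// Conceptual combined distance between two colorized points
float combined_distance(const Eigen::Vector3f& p1, const Eigen::Vector3f& rgb1,
                        const Eigen::Vector3f& p2, const Eigen::Vector3f& rgb2,
                        float color_weight) {
    float spatial = (p1 - p2).norm();    // Euclidean distance in space
    float color = (rgb1 - rgb2).norm();  // difference in color
    return spatial + color_weight * color;
}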
4. Find which specific object is being pointed at exactly
- Create a bounding cube for each object point cluster using the minimum and maximum x, y, z values over all points in that cluster (the construction is sketched after this list).
- Use ray-plane intersection geometry (see appendix) to identify the nearest cube that lies exactly on the path of the pointing ray.
- Project the intersecting cube on to the color image. Slice the color image using this projection.
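A minimal sketch of the axis-aligned bounding cube construction with Eigen (type and function names are illustrative):
struct BoundingCube {
    Eigen::Vector3f min_corner;
    Eigen::Vector3f max_corner;
};

// Axis-aligned bounds of a cluster: component-wise min and max over all points
BoundingCube bounding_cube(const std::vector<Eigen::Vector3f>& cluster) {
    BoundingCube cube{cluster.front(), cluster.front()};
    for (const Eigen::Vector3f& point : cluster) {
        cube.min_corner = cube.min_corner.cwiseMin(point);
        cube.max_corner = cube.max_corner.cwiseMax(point);
    }
    return cube;
}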
5. Recognize the pointee object
- Create a vision analysis request with the color image slice and send it to the Azure Vision API endpoint. We use code borrowed from deercoder/cpprestsdk-example to construct and post the vision request and to parse the detected object name from the response (a rough sketch of such a request follows this list).
- Enqueue the bounding cube and name of object detected by vision API (if any) for annotation.
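For illustration, here is a sketch of posting the image slice with cpprestsdk; the API version path, query options and response parsing shown here are assumptions and may differ from what the borrowed client actually uses:
#include <cpprest/http_client.h>
#include <cstdlib>
#include <vector>

pplx::task<web::json::value> analyze_image(const std::vector<unsigned char>& image_bytes) {
    using namespace web::http;
    using namespace web::http::client;

    // Assumes the environment variables from the prerequisites are set
    http_client client(web::uri(
        utility::conversions::to_string_t(std::getenv("AZURE_VISION_ENDPOINT"))));
    uri_builder builder(U("/vision/v3.2/analyze"));
    builder.append_query(U("visualFeatures"), U("Objects"));

    http_request request(methods::POST);
    request.set_request_uri(builder.to_uri());
    request.headers().add(U("Ocp-Apim-Subscription-Key"),
                          utility::conversions::to_string_t(std::getenv("AZURE_VISION_KEY")));
    request.set_body(image_bytes);   // sent as application/octet-stream

    // The response JSON carries an "objects" array with detected object names
    return client.request(request).then([](http_response response) {
        return response.extract_json();
    });
}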
6. Annotate the video frame
- The ThingFinder::annotate method draws the pointing rays, bounding cubes and detected object names on to the color image frame.
- Note that, due to the asynchronous implementation of point cloud clustering and object recognition, the detected object bounding cubes may appear on screen with a lag.
What’s next?
The potential for body tracking applications is huge — robotics, assistive technologies and augmented reality gaming, to name a few.
However, there are a few areas within the Kinect DK that require more work:
- Accuracy of finer joints such as eyes and fingers.
- Support for higher order gesture detection (as was available in the original Kinect).
- Joint accuracy when the body is presenting a sideways or lateral pose.
In the meanwhile, if you have a Kinect DK, take bones out for a spin, see if you can leverage the framework to create your own applications, and report your issues and observations either here or on the GitHub repo.
Happy coding.
Appendix
References
- Microsoft, microsoft/Azure-Kinect-Sensor-SDK (2019), Github
- Microsoft, microsoft/Azure-Kinect-Samples (2019), Github
- O. Cornut, ocornut/imgui (2014), Github
- K. Zampogiannis, C. Fermuller and Y. Aloimonos, cilantro: A Lean and Efficient Library for Point Cloud Data Processing, kzampog/cilantro (2018), Github
- S. Symons, Note on Ray-Plane intersection (2017), samsymons.com
Body tracking geometry
Note that, as Medium does not support LaTeX, this section uses the following custom conventions:
- Uppercase bold italicized letter e.g. V : Vector
- Lowercase bold italicized letter e.g. v: Unit vector
- Lowercase normal italicized e.g. t: Scalar
- For the rest, interpret based on context and hope for the best
Rays
- Remember that a ray is specified using an origin O and a direction u i.e. ray R: [O, u].
- The origin O is a 3D vector that specifies the start coordinates. The direction u is a 3D unit vector (magnitude of 1). You can find the coordinates of any point P along the ray, say at distance t from the origin, by scalar-multiplying the direction vector by t and vector-adding the result to the origin. i.e. P = O + t * u, where |u| = 1 .
Constructing a ray with two points
- Given two points, the origin O and another point H that lives on the ray, then direction u is computed as the vector-difference between H and O divided by the magnitude of that difference i.e. u = (H-O)/|H-O|.
- Given a vector V: [x,y,z], its magnitude |V| = sqrt(x² + y² + z²).
Constructing a ray with body joint coordinates and orientation
- The body tracking API provides two sets of numbers for each of the 32 joints in each body present in the video frame.
- The first is a 3D vector, say O: [x, y, z], where x, y and z are global coordinates.
- The second is a quaternion Q: [w, a, b, c]. A detailed explanation of quaternions is beyond our scope; all we need to know is that Q can be turned into a rotational transformation matrix (see the code below).
- O and Q can be used to create a combined transformation matrix, which in turn can be used to transform any point (vector) in the joint coordinate system to a point (vector) in the global coordinate system.
- The inverse transformation matrix transforms vectors from the global back to the joint coordinate system.
We use the linmath library to perform these transformations. Here is a code snippet:
k4a_float3_t p = joint.position;
k4a_quaternion_t q = joint.orientation;

// Transformation involves a linear translation followed by a rotation
mat4x4 transformation_matrix;
mat4x4 translation, rotation;
mat4x4_translate(translation, p.v[0], p.v[1], p.v[2]);
quaternion_to_mat4x4(rotation, q.v[0], q.v[1], q.v[2], q.v[3]);

// translation is matrix-multiplied by rotation to form the transformation matrix
mat4x4_mul(transformation_matrix, translation, rotation);
- Now that we have a transformation matrix, we can convert any point from joint coordinates to global coordinates.
- So if we want to draw a ray from the joint pointing along its x-axis, we define a point T, say 100 units in front of the joint, i.e. at [100, 0, 0].
- T is in joint coordinates and we want to transform it to T′ in global coordinates:
// As our transformation matrix is 4 x 4, we add 1 as
// another element to allow matrix multiplication
vec4 t = { 100.f, 0.f, 0.f, 1.f };
vec4 t_prime;
mat4x4_mul_vec4(t_prime, transformation_matrix, t);
- Using the joint position O and T′ above, we can compute the direction vector as u = (T′ - O)/|T′ - O|.
Does a point lie within a field of view?
- Let us start with a ray [O, u] representing the position of our hand and the direction in which it is pointing.
- Define the field of view of the hand to be a cone with angle 2θ, starting at the hand position and coaxial with the direction u.
- A target point V lies within the field of view if the absolute angle between the direction to V from O and the FOV axis direction u is less than or equal to θ.
- If v (computed as (V-O)/|V-O|) represents the direction to V from O, this inequality can be written as u · v ≥ cos(θ), assuming both direction vectors are unit vectors and · denotes their dot product.
- So the point V lies within the hand FOV if u · v ≥ cos(θ), and lies outside it if u · v < cos(θ).
What object is the person pointing at?
While there are a few accurate methods to do ray tracing with point clouds, here we take a simple approach. We construct a minimal bounding cube that encloses all the points that identify an object. A further simplification is to keep the sides of this cube aligned with the global coordinate axes, i.e. this is not the absolute minimal cube, but the smallest one that aligns with the x, y and z axes.
The math to check whether a ray intersects a plane is a bit involved; below is how it has been applied to our problem. See the S. Symons note on ray-plane intersection in the references above for a fuller explanation of the procedure.
In the above diagram,
- O is the position of the tip of the hand
- u is a unit direction vector calculated from the vector difference of the hand and elbow positions (hand minus elbow, normalized).
- The intercept I (if it exists) is given by I = O + t*u where t is a scalar that is to be computed … (1).
- v is any unit vector on the plane stated as the vector difference between any other known point on the plane and I. If we choose the center C as the other point, then v= (C-I) / |C-I|.
- Normal n can be computed by taking a cross product of any two vectors on the plane (take any two sets of corners — vector subtract them to create 2 plane vectors — take the cross-product ). Divide the resulting vector by its norm to obtain a unit normal vector.
- By definition, the dot product of any vector on the plane and its normal must equal zero, i.e. v . n = 0 ... (2).
- Decompose the vector equations (1) and (2) above into linear equations and solve for t. If a solution does not exist, the ray points away from the plane. If infinitely many solutions exist (divide by zero), the ray is either on the plane or parallel to it.
- Plug t back into (1) to get the exact location of the intercept.
- Then check whether the intersection point lies inside the plane segment (the cube face): simply check that each coordinate is bounded by the minimum and maximum coordinate values of that face, i.e. xmin ≤ x ≤ xmax, and similarly for y and z.
- If the pointing ray intersects any side, then that cube (and the object it contains) is on the path of the ray. Take the distance from the origin to the intercept; if the ray intersects multiple sides, take the shortest intercept distance.
- Find all such cubes in the region of interest and declare the nearest cube (based on the distance to its intercept) as the pointee. A sketch of the plane intersection step follows.
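A compact sketch of the ray-plane intersection with Eigen (function name is illustrative); the face bounds check and nearest-cube selection described above would be layered on top of this:
#include <Eigen/Dense>
#include <cmath>

// Ray [O, u] against the plane through point C with unit normal n.
// Solving (O + t*u - C) . n = 0 for t gives t = (C - O) . n / (u . n).
bool ray_plane_intersect(const Eigen::Vector3f& O, const Eigen::Vector3f& u,
                         const Eigen::Vector3f& C, const Eigen::Vector3f& n,
                         float& t) {
    float denom = u.dot(n);
    if (std::abs(denom) < 1e-6f)
        return false;              // ray is parallel to, or lies in, the plane
    t = (C - O).dot(n) / denom;
    return t >= 0.f;               // only accept intercepts in front of the hand
}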