Original Articles

Interactive and immersive devices with perceptual computing technologies

ABSTRACT

In recent years, we have increasingly been endowing devices and machines with the abilities to “sense”, “understand”, and “interact” with us and the physical world, aided by rapid advances in natural sensing and perceptual computing technologies. In this article, we describe a new class of intelligent systems and immersive applications based on real-time 3D sensing, including interactive computing devices, autonomous robots and drones, as well as augmented and virtual reality systems, blurring the border between the real and the virtual worlds.

1. Introduction

Science fiction novels and movies have long depicted intelligent and interactive machines that could sense and understand the world, mimicking human sensation and perceptual processes, and use these capabilities to autonomously navigate in the 3D world, interact with us in natural and intuitive ways, and assist us in our daily lives at home and at work. This dream is now approaching reality in a number of applications, thanks to rapid advances in sensing and transduction technologies, powerful and energy-efficient computation and processing hardware, and machine vision and speech recognition utilizing artificial intelligence algorithms, as well as significant progress in related fields such as communications, portable energy sources, locomotion, and manipulation. The introduction and proliferation of touch-sensing displays in mobile devices have extended human interfaces beyond the traditional keyboard and mouse in recent years, and have enabled direct and natural interactions. Similarly, recent breakthroughs in speech recognition technologies have enabled voice-based interfaces and interactions. With the advances in new 3D sensing and processing technologies, systems and applications with the abilities to “see”, “understand”, and “interact” in the 3D world through natural human-interface capabilities are increasingly being introduced in the marketplace Citation[1].

We map and understand the 3D world around us and navigate in it with a robust spatial tracking capability primarily utilizing our visual and vestibular sensing systems, also aided by the other sensation and perception mechanisms. Vision is among the most important of our senses. The human eye is a sophisticated visual imaging and transduction device. Light from the physical world enters through the pupil, and is focused on the retina at the back of the eye by the cornea and the lens combination, forming a projected 2D image of the 3D world. Our visual sensing system comprises a binocular imaging scheme, where the two eyes form two distinct images of the same 3D scene in the physical world, as they image and capture the scene from two different locations. Due to this geometric construction, a point in the 3D space in front of us projects to two distinct locations on the retina of the left and the right eyes. The physical separation of the corresponding points on the retina is called the binocular disparity, which is inversely proportional to the distance of the physical point from the viewer. The human visual information processing system compares the two projected images formed by the left and the right eyes to discern the binocular disparity map corresponding to the 3D scene. This mechanism, termed stereopsis, along with a number of additional important monocular 3D cues and other sensory information, helps create a 3D spatial reconstruction of the visual world. In addition, the human vestibular sensory system provides information on our movement, spatial orientation and balance, which are crucial to our abilities of real-time positional tracking and navigation in the 3D world. The sensors of the vestibular system include a set of three semi-circular canals in each of the inner ears, which indicate rotational movement information detected by the motion of the fluidic material inside the canals caused by our movements in the 3D space. 
Furthermore, the otolith organs, small oval-shaped structures in each of the inner ears, measure the linear accelerations resulting from our movements. The visual and vestibular systems work in synchrony, providing consistent spatial and motion cues, thereby enabling us to create and update the 3D map of the environment around us in real time Citation[2].

In this article, we review the recent advances in real-time 3D sensing technologies, and applications in intelligent and interactive devices and systems. As an example of a commercially deployed platform, we present the 3D sensing and processing hardware and intelligent software capabilities of Intel® RealSense technology, and highlight its recent deployment in a number of applications. Examples include interactive computing systems such as laptops, all-in-one computers, mobile devices, autonomous robots, unmanned aerial vehicles, smart mirrors, as well as a new class of immersive virtual and augmented reality devices.

2. 3D-sensing technologies

Cameras are now a ubiquitous part of many devices and systems, as numerous applications based on capturing and consuming visual images have become part of our daily lives. However, the traditional cameras embedded in typical electronic devices are designed to capture 2D images of the 3D scene, projected onto a single image sensor by the optics of the imaging system. This process can be represented by a matrix formalism in which the 3D points in the world are mapped onto a corresponding array of 2D points on the imaging device by a combination of transformation matrices: a rotation and translation between coordinate systems, followed by a perspective projection.
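
A standard form of this mapping, the pinhole-camera projection, can be written as follows. The notation (s, K, R, t, f_x, f_y, c_x, c_y) is assumed here for illustration and may differ from the article's own.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K}
\begin{bmatrix} R & \mathbf{t} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```

Here (X, Y, Z) is a point in world coordinates, R and t are the rotation and translation into the camera coordinate system, K is the perspective projection matrix, and (u, v) is the resulting pixel position. The scale factor s equals the depth of the point along the optical axis; it is divided out to obtain (u, v), so it is not recorded in the image.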

As a result of this projection, the original 3D information cannot generally be recovered from the captured 2D images, since they preserve only partial information about the original 3D space. Reconstruction of 3D surfaces from single intensity images is a widely researched subject, and continues to make significant progress Citation[3]. However, implementation of real-time 3D interaction applications based on single 2D image sensing devices remains limited in scope and computationally intensive.

In contrast, 3D imaging techniques, typically consisting of the acquisition of a pair of color and depth images corresponding to the 3D scene, are designed to capture the 3D visual information. There has been significant progress in 3D visual sensing technologies in recent years, resulting in small form-factor imaging systems that are able to capture both color and 3D spatial information in real time with low power consumption. While there are many approaches to real-time 3D visual sensing, the prevalent methods are stereo-3D imaging, structured- or coded-light projection systems, and time-of-flight range imaging techniques. We give an overview of these techniques below. Chapters 5 – 7 in Reference Citation[1] provide in-depth details of the working principles of various 3D sensing technologies.

Stereo-imaging based 3D computer vision techniques attempt to mimic the human visual system: two calibrated imaging devices, laterally displaced from each other, capture synchronized images of the scene. The depth of the points in the 3D space mapped to the corresponding image pixels is extracted from the binocular disparities. The basic principles behind this technique are illustrated in Figure 1, where C1 and C2 are the two camera centers with focal length f, forming images of a point in the 3D world, P, at positions A and B in their respective image planes. In this simple case, where the cameras are parallel and calibrated, it can be shown that the distance of the object, perpendicular to the baseline connecting the two camera centers, is inversely proportional to the binocular disparity: depth = f × L / Δ, where L is the baseline distance between the camera centers and Δ is the disparity. Algorithms for determining binocular disparity and depth information from stereo images have been widely researched, and further advances continue to be made.

Figure 1. Basics of stereo-3D imaging method, illustrated with the simple case of parallel and calibrated camera pair with optical centers at C1 and C2, respectively, separated by the baseline distance of L. The point, P, in the 3D world is imaged at points A and B on the left and the right cameras, respectively. A′ on the right image plane corresponds to the point A on the left image plane. The distance between B and A′ on the epipolar line is called the binocular disparity, Δ, which can be shown to be inversely proportional to the distance of the point P from the baseline Citation[1].
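
The relation depth = f × L / Δ can be tried out with a few lines of Python. This is a minimal sketch for the parallel, calibrated case only; the function name and the numerical values of the focal length and baseline are assumptions chosen for illustration, not parameters of any particular camera.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth (metres) for a parallel, calibrated stereo pair: depth = f * L / disparity."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative values are invalid here.
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    f_px = 600.0      # assumed focal length, in pixels
    baseline = 0.07   # assumed baseline L, in metres
    for disparity in (60.0, 30.0, 15.0):
        depth = depth_from_disparity(f_px, baseline, disparity)
        print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

Halving the disparity doubles the computed depth, which is why the depth resolution of a stereo system degrades for distant points, where a one-pixel disparity error corresponds to a large change in distance.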

In structured-light based 3D sensing methods, a patterned or “structured” beam of light, typically infrared, is projected onto the object or scene of interest. The image of the light pattern, deformed by the shape of the object or scene, is then captured using an image sensor. Finally, the depth map and 3D geometric shape of the object or scene are determined from this distortion of the projected optical pattern. This is conceptually illustrated in Figure 2 Citation[4]. Further advances have been made on these techniques, such as using time-multiplexed binary code patterns to assign a unique digital code to each point indicative of its location in the 3D space, as described in Chapter 5 of Reference Citation[1].

Figure 2. Principles of a projected structured-light 3D image capture method Citation[4]. (a) An illumination pattern is projected onto the scene and the reflected image is captured by a camera. The depth of a point is determined from its relative displacement between the projected pattern and the captured image. (b) An illustrative example of a projected stripe pattern. In practical applications, infrared light is typically used with more complex patterns. (c) The captured image of the stripe pattern reflected from the 3D object.
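
The idea of time-multiplexed binary code patterns can be sketched in Python as follows. The sketch assumes plain binary coding of projector columns and ideal, noise-free thresholded observations. Practical systems often use other codes, and all function names here are hypothetical.

```python
def make_stripe_patterns(num_columns, num_bits):
    """Return num_bits stripe patterns; pattern k holds bit k (most significant first)
    of every projector column index, as a list of 0/1 values per column."""
    return [[(col >> (num_bits - 1 - k)) & 1 for col in range(num_columns)]
            for k in range(num_bits)]


def decode_column(observed_bits):
    """Recover the projector column index from the bit sequence seen at one camera pixel."""
    code = 0
    for bit in observed_bits:
        code = (code << 1) | bit
    return code


if __name__ == "__main__":
    patterns = make_stripe_patterns(num_columns=8, num_bits=3)
    # A camera pixel that is lit by projector column 5 observes bit 5 of each pattern in turn.
    observed = [pattern[5] for pattern in patterns]
    print(observed, "->", decode_column(observed))
```

Once each camera pixel has been matched to a projector column in this way, the displacement between the two plays the same role as the binocular disparity in the stereo case, and depth follows by triangulation.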

The time-of-flight 3D imaging method measures the distance of the object points, and hence the depth map, by illuminating an object or scene with a beam of modulated infrared light and determining the time it takes for the light to travel back to an imaging device after being reflected from the object or scene, typically using a phase-shift measurement technique Citation[1]. The system typically comprises a full-field range imaging capability, including an amplitude-modulated illumination source and an image sensor array. Figure 3 illustrates the method for converting the phase shift of the reflected optical signal to the distance of the point. The reflected signal, shown as the dashed curve, is phase-shifted by φ relative to the original emitted signal. It is also attenuated in strength, and the detector picks up some background signal as well, which is assumed to be constant. With this configuration, it can be shown that the distance of the object that reflected the signal is d = (λm/2) × (φ/2π), where λm is the modulation wavelength of the optical signal.

Figure 3. Principles of 3D imaging using the time-of-flight based range measurement technique Citation[1]. The solid sinusoidal curve is the amplitude-modulated infrared light that is emitted onto the scene by a source, and the dashed curve is the reflected signal that is detected by an imaging device. Note that the reflected signal is attenuated and phase-shifted by an angle φ relative to the emitted signal, and includes a background signal that is assumed to be constant. The distance or the depth map is determined using the phase shift and the modulation wavelength.
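
The conversion from phase shift to distance, d = (λm/2) × (φ/2π), is sketched below in Python. The modulation wavelength is taken as λm = c / f_mod. The four-sample phase estimate is a commonly used scheme that is assumed here for illustration and is not described in the article; the sampling model is stated in the docstring, and the function names and the 30 MHz modulation frequency are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def phase_from_samples(c0, c1, c2, c3):
    """Estimate the phase shift phi, assuming four samples of the form
    c_k = offset + amp * cos(phi + k * pi / 2), for k = 0..3.
    The constant background (offset) cancels in both differences."""
    return math.atan2(c3 - c1, c0 - c2) % (2.0 * math.pi)


def distance_from_phase(phase_rad, modulation_hz):
    """d = (lambda_m / 2) * (phi / (2 * pi)), with lambda_m = c / f_mod."""
    wavelength_m = SPEED_OF_LIGHT / modulation_hz
    return (wavelength_m / 2.0) * (phase_rad / (2.0 * math.pi))


if __name__ == "__main__":
    f_mod = 30e6                      # assumed modulation frequency, in Hz
    true_phi, offset, amp = 1.0, 5.0, 2.0
    samples = [offset + amp * math.cos(true_phi + k * math.pi / 2) for k in range(4)]
    phi = phase_from_samples(*samples)
    print(f"phase {phi:.3f} rad -> distance {distance_from_phase(phi, f_mod):.3f} m")
```

Because the phase wraps around at 2π, distances are only unambiguous up to λm/2. A lower modulation frequency extends this range at the cost of distance resolution.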

Besides the advances in 3D visual sensing technologies described above, developments in inertial measurement units (IMUs) are allowing motion sensing capabilities to be incorporated into devices and systems. With accelerometers, gyroscopes, and often magnetometers integrated into small form-factor devices, systems are able to use positional tracking data from the IMUs for navigation purposes. A new and exciting method for real-time 3D tracking is visual-inertial odometry, which combines the features and motions measured by the visual sensors with those measured by the IMUs for accurate localization, tracking, and mapping Citation[5].

3. Intel® RealSense technology

In this section, we describe the RealSense technologies and the series of products based on them that have been introduced to the market, incorporating real-time 3D-sensing and interaction capabilities into various classes of devices and systems. The RealSense technology includes hardware sensor modules for capturing the 3D environment via real-time color and depth (RGB-D) imaging and 3D motion sensing, and a set of middleware libraries included in software development kits to enable interactive applications and usages that utilize the 3D information Citation[6].

As examples, Figure 4 shows two of the RealSense modules that have been integrated in various interactive and autonomous devices and systems. The module shown in the top figure is the RealSense F200 device, which is based on the coded-light 3D-sensing technology. As illustrated in the figure, this module consists of an infrared (IR) laser and a microelectromechanical systems (MEMS) projector to illuminate the environment in front of it with specific binary IR patterns. An IR image sensor on the module rapidly captures the images of these patterns that are reflected from the 3D scene. At the same time, a color camera, which is also part of the module, captures high-resolution RGB images. A custom-built special-purpose processor on the module runs algorithms designed to compute the depth maps in real time from the captured binary codes, which are synchronized and calibrated with the corresponding color images. The pairs of color and depth images are made available to the middleware layers and applications running on the computing systems via a single USB 3.0 interface, which also provides the power to the module.

Figure 4. Intel® RealSense camera modules. The top figure shows the F200 version based on coded-light 3D-imaging technique, whereas the bottom figure shows the R200 version based on stereo-3D imaging technique. The imaging processors consist of power-efficient hardware for 3D computation and processing.

The bottom figure shows the RealSense R200 module, which is based on IR-assisted and hardware-accelerated stereo-3D imaging technology. The illumination subsystem on the module projects a texture of IR light onto the 3D objects and scene in front of the camera system. The two IR image sensors, separated by a baseline distance, capture real-time images of the environment. An onboard imaging processor with power-efficient hardcoded logic runs rectification and stereo-correlation algorithms to compute binocular disparities and the corresponding depth images in real time. At the same time, the color image sensor captures high-definition RGB images, which are synchronized with the corresponding depth maps. As with the F200 module, a single USB 3.0 interface both provides power to the sensor module and transmits the data.

Both devices are less than 4 mm in thickness, enabling integration into a wide array of computing devices and systems. As described above, the sensor modules capture pairs of color and depth images, also referred to as RGB-D images, in real time. Figure 5 shows a pair of RGB-D images of a scene, recorded with a RealSense camera. Every pixel in the image on the left indicates the color value associated with the corresponding point in the 3D space, and every pixel in the image on the right indicates the depth of that point from the sensor module. With these pairs of images, the 3D world is captured by the RealSense devices in real time, complete with both color and 3D coordinate information for the points, enabling development of interactive and intelligent systems. An additional module that has been recently introduced but is not shown in Figure 4 is the RealSense ZR300, which incorporates a wide field-of-view fish-eye camera and an inertial measurement unit (IMU) in addition to an R200 RGB-D sensor, all integrated onto a common stiffener, and synchronized and calibrated for accurate visual-inertial 3D-tracking applications, as will be described later.

Figure 5. A pair of RGB-D images captured with an Intel® RealSense camera. The left figure shows a color image, while the right figure shows the corresponding pseudo-colored depth image, in which nearer points are shown in bluer colors and farther points in redder colors. Background objects that lie beyond the range of the depth sensor are shown in dark blue.
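
How a pair of aligned color and depth images yields colored 3D points can be sketched in Python as below. The sketch assumes an ideal pinhole camera without lens distortion, treats a depth value of 0 as "no measurement", and uses hypothetical function names and intrinsic parameters (fx, fy, cx, cy). It is not the RealSense software development kit's interface.

```python
def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with known depth into a 3D point (X, Y, Z)
    in the camera coordinate system, assuming an undistorted pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def rgbd_to_point_cloud(color, depth, fx, fy, cx, cy):
    """Combine aligned color and depth images (nested lists of equal size)
    into a list of (X, Y, Z, (r, g, b)) points, skipping pixels without depth."""
    points = []
    for v, (color_row, depth_row) in enumerate(zip(color, depth)):
        for u, (rgb, z) in enumerate(zip(color_row, depth_row)):
            if z > 0:  # assumption: 0 marks a pixel with no valid depth
                x, y, _ = deproject_pixel(u, v, z, fx, fy, cx, cy)
                points.append((x, y, z, rgb))
    return points


if __name__ == "__main__":
    color = [[(255, 0, 0), (0, 255, 0)]]   # a tiny 1 x 2 color image
    depth = [[0.0, 1.5]]                   # matching depth image, in metres
    print(rgbd_to_point_cloud(color, depth, fx=600.0, fy=600.0, cx=0.5, cy=0.0))
```

A pixel at the principal point (cx, cy) maps to a point on the optical axis, and pixels farther from it map to points proportionally farther off-axis as depth increases.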

Besides the 3D-sensing hardware modules, the RealSense software development kit includes a number of middleware libraries and application programming interfaces to enable the development of a new class of interactive applications. Figure 6 illustrates a few of the 3D computer vision middleware technologies, including a hand skeleton tracking library, a face detection and tracking library, a background segmentation library, and a 3D reconstruction library, among many other capabilities.

Figure 6. Examples of 3D computer vision middleware libraries included in the RealSense software development kit. Top left: 3D hand skeleton tracking; top right: face detection and tracking; bottom left: 3D background segmentation; bottom right: 3D scanning and reconstruction.

4. Systems and applications

RealSense devices have been integrated in a wide array of computing devices available from a number of system makers, including interactive computers, mobile devices, autonomous robots, unmanned aerial vehicles, virtual dressing mirrors, and augmented reality helmets, among many other emerging applications. In this section, we highlight a number of different types of devices and systems that are already available in the market and demonstrate a wide range of applications that are enabled by real-time 3D-sensing technologies.

4.1. Interactive computing devices

A number of computing devices have incorporated RealSense cameras and are commercially available. Figure 7 shows a couple of examples, including a state-of-the-art all-in-one desktop computer with a curved display and a front-facing RealSense F200 camera, and a 2-in-1 tablet with a rear-facing RealSense R200 camera. Numerous applications have been developed based on the real-time RGB-D imaging and 3D computer vision middleware libraries, including 3D interactive games, login applications using facial recognition, video conferencing using virtual green-screen effects, and 3D scanning of humans and objects, to highlight just a few Citation[7]. We have also demonstrated and reported on a smartphone device with integrated 3D-sensing technology based on RealSense Citation[5]. Figure 8 shows an example of dense 3D reconstruction of large spaces with this device.

Figure 7. Examples of commercially available computers with embedded RealSense technologies. Left: an interactive all-in-one desktop computer, right: a 2-in-1 laptop/tablet device.

Figure 8. Dense 3D reconstruction of an office environment captured with a mobile device incorporating RealSense technology. The real-time depth imaging with high-density point cloud allows rapid reconstruction of 3D spaces, objects, and humans.

4.2. Autonomous robots and unmanned aerial vehicles

Among the most exciting areas of application for real-time 3D-sensing and spatial tracking technologies are robots and drones that can sense and understand the environment around them, navigate autonomously, and interact naturally with humans. Robots have already had a major positive impact on the world economy by automating industrial manufacturing and assembly lines. This has significantly increased production throughput across numerous sectors, from semiconductor chips and consumer electronics devices to automotive assembly and food production and processing, to name just a few. In this section, we highlight the capabilities of the family of RealSense technologies that are enabling a new generation of autonomous and intelligent machines.

As described in the introduction section, we use our visual-vestibular spatial sensing and 3D-tracking capabilities for mapping and navigating in the 3D world. The RealSense technology attempts to endow devices and machines with sensing and perception capabilities inspired by such biological systems. For example, the RealSense ZR300 module comprises a depth-sensing module, a fish-eye camera, an inertial motion capture device, and time-stamping circuitry to synchronize the inputs from all the sensors for multi-sensory tracking and navigation applications. We have implemented a real-time visual-inertial odometry solution based on this platform, which is shown in Figure 9. The image in the middle shows the view from the fish-eye camera, while the image on the left shows the 2D map view as the device navigates through a large 3D space. This technology allows devices and systems to map a 3D environment, localize within the map, and autonomously navigate in the 3D space. As shown in the right image of Figure 9, such implementations can work well in a relatively large space, which makes them suitable for autonomous robotic navigation applications.

Figure 9. Real-time 3D spatial tracking with six degrees of freedom using visual-inertial odometry. The image in the middle shows the view from the fish-eye camera, and the image on the left shows the 2D map view traced while navigating within a 3D space. The figure on the right shows large-scale 3D mapping and navigation spanning an entire floor of a building.

A number of autonomous robots for various application areas have been introduced that incorporate RealSense devices. As shown in Figure 10, examples include a hotel butler robot from Savioke that autonomously navigates within a hotel to deliver items to the guests Citation[8], a multipurpose Segway personal transportation robot unveiled by Ninebot at the 2016 Consumer Electronics Show, which can interact with the users and the environment Citation[9], and an intelligent home assistant robot demonstrated by Asus at the 2016 Computex show Citation[10].

Figure 10. Examples of autonomous robots equipped with RealSense technology. Left: a hotel butler robot from Savioke; middle: Segway personal transporter robot from Ninebot; right: a personal assistant home robot from Asus.

Unmanned Aerial Vehicles (UAVs), also referred to as drones, are a fast-growing market, with applications ranging from consumer and professional photography and videography to commercial inspections and, in the future, automated deliveries. We have added real-time 3D-sensing capability to drones by integrating RealSense modules. With the real-time depth-imaging technology provided by RealSense, we have implemented a collision avoidance solution on the drone, enabling it to safely and automatically fly around trees and other objects without hitting them. Figure 11 shows a drone from Yuneec with a RealSense device onboard Citation[11], and an image from a demonstration of the automatic collision avoidance function while the drone follows a biker on a trail.

Figure 11. The left image shows the Yuneec Typhoon H drone with an integrated RealSense device, as demonstrated at CES 2016. The right image shows a demonstration of real-time automatic collision avoidance as the drone follows a person biking through trees.
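
To illustrate how a depth image can inform an avoidance decision, the following Python sketch splits a depth frame into left, centre, and right sectors and picks a heading based on the nearest obstacle in each. It is a toy illustration only and is not the method used on the drone. The function names, the 1.5 m safety distance, and the rule of treating sectors without valid depth as blocked are all assumptions.

```python
def sector_clearances(depth_rows, num_sectors=3, min_valid_m=0.1):
    """Split the depth image into vertical sectors and return the nearest valid
    depth (metres) in each; None means the sector had no valid depth pixel."""
    width = len(depth_rows[0])
    clearances = []
    for s in range(num_sectors):
        lo = s * width // num_sectors
        hi = (s + 1) * width // num_sectors
        valid = [z for row in depth_rows for z in row[lo:hi] if z >= min_valid_m]
        clearances.append(min(valid) if valid else None)
    return clearances


def choose_heading(depth_rows, safety_distance_m=1.5):
    """Return 'forward' if the centre sector is clear, otherwise the clearer
    side ('left' or 'right'), or 'stop' if no sector is clear."""
    left, centre, right = sector_clearances(depth_rows, num_sectors=3)

    def is_clear(clearance):
        # Conservative assumption: a sector with no depth data is not considered clear.
        return clearance is not None and clearance >= safety_distance_m

    if is_clear(centre):
        return "forward"
    options = [(c, name) for c, name in ((left, "left"), (right, "right")) if is_clear(c)]
    if not options:
        return "stop"
    return max(options)[1]


if __name__ == "__main__":
    frame = [[3.0, 3.0, 0.8, 0.9, 4.0, 4.0]]  # one row of depth values, in metres
    print(choose_heading(frame))
```

A practical system would combine such per-frame decisions with the vehicle's motion state and a map of the surroundings; the sketch shows only the role played by the depth image.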

4.3. Virtual and augmented reality devices

We now focus on the application of real-time 3D-sensing and visual-inertial tracking technologies in a new class of virtual and augmented reality devices that enable immersive and interactive mixed reality usages. The development of virtual and augmented reality devices and applications has picked up significant pace around the world in recent years. We first review the key definitions of the current systems based on their usages and applications. A virtual-reality device places the user in a virtual environment, generating sensory stimuli (visual, vestibular, auditory, haptic, etc.) that provide a sensation of presence and immersion. An augmented-reality device, on the other hand, places virtual objects in the real world while providing sensory cues to the user that are consistent between the physical and augmented elements. While a virtual-reality device immerses the user within a simulated environment, it also removes the user from the surrounding real world. In contrast, a mixed-reality device can blend real-world elements into the virtual environment.

We have demonstrated interactive mixed-reality applications based on embedded RealSense and visual-inertial spatial and motion tracking algorithms Citation[12]. As an example, a prototype head-mounted device is shown in the left image of Figure 12. The visual-inertial tracking and real-time 3D-sensing capabilities allow the device to map the 3D environment around it, and to localize and track its position with six degrees of freedom. This enables immersive navigation in the virtual space without requiring external tracking systems. As shown in the right image of Figure 12, the RealSense 3D-imaging technology also enables integrating the user's real hands into the simulated environment for direct interaction with and manipulation of the virtual objects in the 3D space. Figure 12 also shows further mixed reality capabilities, as a person standing in front of the user is brought into the virtual world as viewed with the virtual reality headset. Besides enabling natural and immersive interactions, this technology also allows the user to avoid colliding with objects in the physical world while moving about in the virtual world. Finally, Figure 13 demonstrates an example of augmenting the real world with virtually created objects with correct physical interactions, such as collisions, occlusions, and shadows, where a virtual car races on a real kitchen table.

Figure 12. Left image shows an interactive mixed-reality device incorporating RealSense and visual-inertial spatial motion tracking technology. The image on the right shows an example of mixed reality capability of the device, where the 3D images of the user's hands as well as a person standing in front of the user are brought into the virtual world. This capability is also used to allow the user to avoid colliding into physical objects.

Figure 13. Augmentation of the real physical world with virtually rendered 3D objects using a device with embedded RealSense module. Here, a digitally rendered car is shown racing on a real kitchen table and colliding into a physical bowl, with realistic physical effects such as collision with real objects, correct occlusion and shadows, etc.

4.4. Emerging applications

Besides the areas discussed above, numerous other interactive and intelligent systems and applications are being enabled by the real-time 3D-sensing and RGB-D imaging technologies incorporated in RealSense devices. Examples include 3D body scanning for fitness tracking, virtual clothing appliances, gaming peripherals, and sporting and entertainment applications.

5. Summary

We have presented the recent developments in the field of 3D-sensing technologies, systems, and applications. As an example of a commercially deployed platform, we have described the 3D-sensing technologies with real-time RGB-D imaging and 3D spatial tracking capabilities incorporated in the Intel® RealSense cameras, and the associated 3D computer vision and spatial-understanding algorithms and middleware libraries. We have reviewed a number of applications of these technologies in intelligent systems, including a new class of interactive computing devices, autonomous machines such as robots and unmanned aerial vehicles, and immersive mixed-reality devices that blend real world objects into the virtual world and enable natural interactions.

Acknowledgement

The author gratefully acknowledges the contributions of the members of the Perceptual Computing Group at Intel Corporation, as well as collaborations with partners in the computing ecosystem as exemplified in the article.

References