3D human gesture recognition using integral imaging

Synthetic aperture integral imaging recognizes human actions better than monocular acquisition technology.
28 April 2015
Pedro Latorre-Carmona, Filiberto Pla, Eva Salvador-Balaguer and Bahram Javidi

Optical image sensing and visualization technologies in 3D have been researched extensively in fields as diverse as TV broadcasting, entertainment, medical sciences, and robotics.1–4 One promising technology, integral imaging, is an autostereoscopic 3D imaging method that offers a passive and relatively inexpensive way to capture 3D information and visualize it optically or computationally.5–7 Integral imaging belongs to the broader class of multi-view imaging techniques that allow depth analysis of a scene, alongside stereo, time-of-flight, and structured-light strategies.8 Integral imaging has also been used for classification tasks.9 However, we are the first to apply it to action recognition.10,11

Integral imaging provides the 3D profile and range of the objects in a scene using an array of high-resolution imaging sensors or in a synthetic aperture mode (see Figure 1). When a single sensor captures multiple 2D images, it is also possible to obtain 2D images with a larger field of view (FOV). In the synthetic aperture integral imaging mode that we used, a series of sensors is distributed in a grid, or a single sensor is moved to the positions in the grid. The horizontal or vertical distance between two of these positions is called the pitch (p). The 3D image reconstruction can be achieved by computationally simulating the optical back-projection of the elemental images. In Figure 1, cx and cy are the horizontal and vertical sizes of the sensor and f is its focal length. We used a computer-synthesized, virtual pinhole array to inversely map the elemental images into the object space (see Figure 1). Superimposing the properly shifted elemental images created the 3D reconstructed images.


Figure 1. Synthetic aperture integral imaging acquisition and computational reconstruction method. Pitch (p) is the distance between the sensor centers (left), cx and cy represent the horizontal and vertical sizes of the sensor, and f is the sensor focal length (right).
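To make the back-projection step concrete, the sketch below shows a minimal shift-and-sum reconstruction of one depth plane in Python/NumPy. It is an illustration rather than the code used in our experiments: the array shapes, the wrap-around shifting, and the plain averaging are simplifying assumptions, and the pixel shift follows the standard synthetic-aperture relation (pixels × pitch × focal length) / (sensor size × reconstruction distance).

```python
# Illustrative shift-and-sum reconstruction of one depth plane (a sketch, not
# the code used in our experiments; wrap-around shifting and plain averaging
# are simplifying assumptions).
import numpy as np

def reconstruct_plane(elemental, pitch, focal, sensor_size, z):
    """elemental: (K, L, H, W) grid of grayscale elemental images.
    pitch, focal, sensor_size (cx, cy), and z are physical quantities in metres."""
    K, L, H, W = elemental.shape
    cx, cy = sensor_size
    recon = np.zeros((H, W), dtype=np.float64)
    for k in range(K):          # vertical camera index
        for l in range(L):      # horizontal camera index
            # Pixel shift implied by back-projecting camera (k, l) to distance z:
            # shift = (pixels * pitch * focal) / (sensor size * z)
            dy = int(round(k * H * pitch * focal / (cy * z)))
            dx = int(round(l * W * pitch * focal / (cx * z)))
            recon += np.roll(elemental[k, l], (dy, dx), axis=(0, 1))
    return recon / (K * L)      # average the superimposed, shifted images
```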

Our methodology is based on acquiring 3D videos of hand gestures using an integral imaging system formed by an array of 3×3 cameras. We analyzed the potential of gesture recognition using 3D integral imaging and compared the performance to 2D single-camera videos. We processed sectional reconstructed representations of the objects in the scene using gesture recognition strategies. We believe the experiments provide evidence of the feasibility of gesture recognition with integral imaging.

Our setup included a 3×3 array of Stingray F080B/C cameras. Using an IEEE 1394 high-speed serial communication bus, we captured nine synchronized videos at 15fps and a resolution of 1024×768 pixels. We subsequently rectified the acquired videos.12 We then acquired two performances of three different gestures from 10 people. The three gestures, all made by extending the right arm, were: left, deny, and opening and closing the hand (see Figure 2).10,11 Each camera lens was focused at a plane about 2m away, and the depth of field kept all objects and people between 0.5 and 3.5m in focus. The 10 people stood about 2.5m in front of the camera array, and we acquired their gestures in a laboratory with no other movement. In total, we recorded 60 videos corresponding to the three actions performed twice by each of the 10 people. For each video, we reconstructed the 3D volume of the first frame to infer the distance at which the hand was in focus, and then used that distance to reconstruct the volume for the remaining frames. We performed the reconstruction from 1 to 3.5m in 10mm steps. Figure 2 shows, for each of the three gestures, the reconstructed plane where the action is in focus.


Figure 2. Three hand gestures used in our research. Left: Left gesture. Middle: Deny gesture. Right: Open/close gesture.
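The in-focus distance for a video can be found by scanning candidate depths and scoring the sharpness of each reconstructed plane. The following sketch shows one plausible way to do this; the variance-of-the-Laplacian focus measure and the reconstruct_plane helper from the sketch above are illustrative assumptions rather than our exact procedure.

```python
# Illustrative depth search for the plane where the hand is in focus. The
# sharpness measure and the reconstruct_plane helper (defined in the earlier
# sketch) are assumptions for this example.
import numpy as np
import cv2

def find_focus_depth(elemental_first_frame, pitch, focal, sensor_size,
                     z_min=1.0, z_max=3.5, z_step=0.01):
    best_z, best_score = z_min, -np.inf
    for z in np.arange(z_min, z_max + z_step, z_step):    # 1 to 3.5m in 10mm steps
        plane = reconstruct_plane(elemental_first_frame, pitch, focal, sensor_size, z)
        plane8 = cv2.normalize(plane, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        score = cv2.Laplacian(plane8, cv2.CV_64F).var()    # sharpness at this depth
        if score > best_score:
            best_z, best_score = z, score
    return best_z
```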

The method for characterizing and recognizing gestures can be summarized as follows:13 generating and characterizing spatiotemporal interest points (STIPs) for each video (see Figure 3);10,11 quantizing the resulting descriptors into a set of visual words (the codebook); creating a bag-of-words (BoW) representation for each video from its STIPs and the codebook; and classifying unseen videos from their BoW representations.


Figure 3. Examples of the spatiotemporal interest points (STIPs) detected (yellow circles) at three different frames in the ‘open’ gesture video. The different circle sizes represent each STIP's detection scale.
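As an illustration of the codebook and BoW steps above, the sketch below clusters STIP descriptors with k-means and turns each video's descriptors into a normalized word histogram. The 200-word codebook size and the scikit-learn implementation are assumptions made for this example, not the settings reported in our papers.

```python
# Sketch of the codebook and bag-of-words steps. The 200-word codebook and the
# scikit-learn k-means are illustrative choices, not our reported settings.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, n_words=200, seed=0):
    """train_descriptors: (N, D) stack of HOG/HOF descriptors from the training videos."""
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(train_descriptors)

def bow_histogram(video_descriptors, codebook):
    """Quantize one video's STIP descriptors and return a normalized word histogram."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```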

We generated STIPs and, from them, histograms of oriented gradients (HOG) and of optic flow (HOF). We quantized these histograms into visual words through k-means clustering and represented each video by a histogram of codewords.14 To estimate gesture recognition performance, we followed a ‘leave-one-subject-out’ protocol.15,16 We chose support vector machines as the pattern recognition method for classifying the gestures.17 Our results showed10 that integral imaging outperformed acquisition with the central camera of the 3×3 array when comparing the best descriptor in each case (HOF for monocular and HOF+HOG for integral imaging).
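A leave-one-subject-out evaluation with a support vector machine can be sketched as follows. The linear kernel, the C value, and the assumption of one BoW histogram per video are illustrative choices rather than the exact configuration we used.

```python
# Sketch of leave-one-subject-out evaluation with an SVM. The linear kernel and
# C=1.0 are illustrative assumptions, not necessarily the settings we used.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def evaluate_loso(bow_histograms, gesture_labels, subject_ids):
    """bow_histograms: (n_videos, n_words); gesture_labels, subject_ids: (n_videos,)."""
    accuracies = []
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(bow_histograms, gesture_labels, subject_ids):
        clf = SVC(kernel="linear", C=1.0)
        clf.fit(bow_histograms[train_idx], gesture_labels[train_idx])
        accuracies.append(clf.score(bow_histograms[test_idx], gesture_labels[test_idx]))
    return float(np.mean(accuracies))   # mean accuracy over held-out subjects
```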

In summary, 3D information shows potential for improving the accuracy of human gesture recognition. Integral imaging allows us to reconstruct a 3D scene only at the planes where the gesture preferentially appears. This opens the door to recognition strategies that were not previously possible and, eventually, to substantially increased recognition capability. Our next step will be to parallelize the entire process computationally so that it can run in near real time, and to use other gesture recognition descriptors that exploit the focusing capabilities of integral imaging.


Pedro Latorre-Carmona, Filiberto Pla, and Eva Salvador-Balaguer
Institute of New Imaging Technologies
Universitat Jaume I
Castellón de la Plana, Spain

Pedro Latorre-Carmona is a postdoctoral researcher whose interests are 3D integral imaging, pattern recognition, photon-starved visualization, and multispectral image processing.

Bahram Javidi
Department of Electrical and Computer Engineering
University of Connecticut
Storrs, CT

Bahram Javidi received his BS from George Washington University and MS and PhD from the Pennsylvania State University, all in electrical engineering. He is the Board of Trustees Distinguished Professor at the University of Connecticut. He has more than 900 publications, including over 400 peer-reviewed journal articles and over 440 conference proceedings, among them some 120 plenary addresses, keynote addresses, and invited conference papers.


References:
1. X. Xiao, B. Javidi, M. Martinez-Corral, A. Stern, Advances in 3D integral imaging: sensing, display, and applications, Appl. Opt. 52(4), p. 546-560, 2013.
2. M. Cho, M. Daneshpanah, I. Moon, B. Javidi, 3D optical sensing and visualization using integral imaging, Proc. IEEE 99(4), p. 556-575, 2011.
3. R. Martinez-Cuenca, G. Saavedra, M. Martinez-Corral, B. Javidi, Progress in 3D multiperspective display by integral imaging, Proc. IEEE 97(6), p. 1067-1077, 2009.
4. J.-Y. Son, W.-H. Son, S.-K. Kim, K.-H. Lee, B. Javidi, 3D imaging for creating real-world-like environments, Proc. IEEE 101(1), p. 190-205, 2013.
5. J. Arai, F. Okano, M. Kawakita, M. Okui, Y. Haino, M. Yoshimura, M. Furuya, M. Sato, Integral 3D television using a 33-megapixel imaging system, J. Display Technol. 6(10), p. 422-430, 2010.
6. H. H. Tran, H. Suenaga, K. Kuwana, K. Masamune, T. Dohi, S. Nakajima, H. Liao, Augmented reality system for oral surgery using 3D stereoscopic visualization, Lect. Notes Comput. Sci. 6891, p. 81-88, 2011.
7. H. Liao, T. Inomata, I. Sakuma, T. Dohi, 3D augmented reality for MRI-guided surgery using integral videography autostereoscopic image overlay, IEEE Trans. Biomed. Eng. 57(6), p. 1476-1486, 2010.
8. L. Chen, H. Wei, J. Ferryman, A survey of human motion analysis using depth imagery, Pattern Recognit. Lett. 34, p. 1995-2006, 2013.
9. C. M. Do, R. Martínez-Cuenca, B. Javidi, 3D object-distortion-tolerant recognition for integral imaging using independent component analysis, J. Opt. Soc. Am. A 26, p. 245-251, 2009.
10. V. Javier Traver, P. Latorre-Carmona, E. Salvador-Balaguer, F. Pla, B. Javidi, Human gesture recognition using 3D integral imaging, J. Opt. Soc. Am. A 31, p. 2312-2320, 2014.
11. P. Latorre Carmona, E. Salvador-Balaguer, F. Pla, B. Javidi, Integral imaging acquisition and processing for human gesture recognition, Proc. SPIE 9495, p. 94950K, 2015. doi:10.1117/12.2179748
12. Z. Zhang, A flexible new technique for camera calibration, IEEE Trans. Pattern Anal. Mach. Intell. 22, p. 1330-1334, 2000.
13. H. Wang, M. M. Ullah, A. Kläser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, Br. Mach. Vis. Conf., 2009.
14. I. Laptev, On space-time interest points, Int'l J. Comput. Vis. 64, p. 107-123, 2005.
15. K. Schindler, L. J. V. Gool, Action snippets: how many frames does human action recognition require?, IEEE Conf. Comput. Vis. Pattern Recognit. CVPR, 2008.
16. Z. Lin, Z. Jiang, L. S. Davis, Recognizing human actions by learning and matching shape-motion prototype trees, IEEE Trans. Pattern Anal. Mach. Intell. 34, p. 533-547, 2012.
17. N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.