Video face tracking: a work in progress

Despite improvements in performance, as yet no face-tracking algorithm is capable of simultaneously tracking and identifying a relatively large number of faces in high-resolution video frames in real time.
08 September 2014
Tong Zhang and Herman Martins Gomes

Face tracking, usually in conjunction with other methods, is one way of tackling problems such as indexing and summarizing videos, and identifying individuals from large unconstrained video data such as television programs and surveillance video. It has applications in a wide range of domains, including security, health care, retail, home automation, transportation, entertainment, safety, and education. Face-tracking algorithms can be implemented as software on general-purpose machines, or as hardware in specialized video-processing units.

In video face tracking, whether the video is recorded or live, one or more faces are first located in an early frame, either manually or automatically by applying a face detection engine.1 The faces are then sought in successive frames with a tracking algorithm. During tracking, face detection may be invoked periodically to find new faces and to update the locations of tracked faces.
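
The detect-then-track loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `detect` and `track` are hypothetical callables supplied by the caller (any face detector and any single-object tracker), and the re-detection interval and the overlap threshold used to match detections to existing tracks are arbitrary choices, not values taken from the literature.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def track_video(frames: Sequence, detect: Callable, track: Callable,
                detect_every: int = 10, match_iou: float = 0.3) -> List[Dict[int, Box]]:
    """Return, for each frame, a dict mapping track id -> face box.

    detect(frame) -> list of boxes            (hypothetical detector)
    track(prev_frame, frame, box) -> new box  (hypothetical tracker)
    """
    tracks: Dict[int, Box] = {}
    next_id = 0
    history: List[Dict[int, Box]] = []
    prev = None
    for i, frame in enumerate(frames):
        # Propagate every known face from the previous frame.
        if prev is not None:
            tracks = {tid: track(prev, frame, box) for tid, box in tracks.items()}
        # Periodically re-run detection to correct drift and find new faces.
        if i % detect_every == 0:
            for det in detect(frame):
                best = max(tracks, key=lambda t: iou(tracks[t], det), default=None)
                if best is not None and iou(tracks[best], det) >= match_iou:
                    tracks[best] = det      # refresh an existing track
                else:
                    tracks[next_id] = det   # start a new track
                    next_id += 1
        history.append(dict(tracks))
        prev = frame
    return history
```

Matching detections to tracks by box overlap is one simple design choice; a real system would also need a rule for retiring tracks whose faces have left the scene.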

Face-tracking algorithms can be categorized in a variety of ways. For example, they can be grouped by how they track a face (the entire face as a single entity or as individual facial features), or by whether the tracking is in 2D ‘image space’ or 3D ‘pose space.’2 If instead we group them by the underlying method, it becomes apparent that most of the best-known and commonly used face-tracking methods and their variations belong to one of several basic families, which include the KLT feature tracker, Mean Shift/CamShift methods, and Kalman filter/particle filter methods.

The first family is based on the classic Kanade-Lucas-Tomasi (KLT) approach for object tracking, originally proposed in 1981.3 It is built upon three basic assumptions of optical flow: brightness constancy (the brightness of a pixel does not change from frame to frame), temporal persistence (the object does not move much from frame to frame), and spatial coherence (neighboring points have similar motion). The basic idea of KLT is a local search using gradients weighted by an approximation of the second derivative of the image. As it is straightforward to implement and relatively fast, the KLT algorithm has been used as the core tracking method in many existing face-tracking approaches.
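
To make the three assumptions concrete, the following sketch performs a single Lucas-Kanade step for one window. It is a simplified illustration, not the full KLT tracker: images are plain nested lists of gray levels, gradients are central differences, and there is no iteration, feature selection, or sub-pixel warping. The function name and the window size are assumptions made for this example.

```python
def lucas_kanade_step(prev, curr, cx, cy, half=2):
    """Estimate the motion (u, v) of the window centred at (cx, cy).

    Brightness constancy gives Ix*u + Iy*v + It = 0 at each pixel; spatial
    coherence lets us solve for one (u, v) over the whole window by least
    squares. The window must lie at least one pixel inside the image.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # spatial gradient in y
            it = curr[y][x] - prev[y][x]                  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None  # gradients too uniform: motion is ambiguous in this window
    # Solve [[sxx, sxy], [sxy, syy]] @ (u, v) = (-sxt, -syt).
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Synthetic test pattern I(x, y) = x * y, shifted one pixel to the right.
prev = [[x * y for x in range(9)] for y in range(9)]
curr = [[(x - 1) * y for x in range(9)] for y in range(9)]
print(lucas_kanade_step(prev, curr, 4, 4))  # (1.0, 0.0)
```

The 2x2 matrix of summed gradient products plays the role of the second-derivative approximation mentioned above; when it is close to singular, the window contains too little texture to be tracked reliably.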

For example, Ngo and coworkers used the KLT tracker to track points of interest throughout video frames.5 Each face track was then formed by the faces detected in different frames that shared a large enough number of tracked points. However, the KLT algorithm relies only on local information derived from a small window surrounding each point of interest, so large motions can move points outside the local window, where the algorithm can no longer find them. The pyramidal KLT algorithm addresses this problem by starting the tracking at the highest level of an image pyramid (lowest detail) and working down to lower levels (finer detail). Tracking over image pyramids allows large motions to be caught by local windows. A pyramidal Lucas-Kanade implementation is available in the OpenCV library (an open-source C++ code library for computer vision techniques6) for coarse-to-fine optical flow estimation.
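
The reason a pyramid helps can be seen in a small sketch: each level halves the resolution, so a displacement of d pixels at full resolution shrinks to roughly d / 2^L pixels at level L and fits inside a small search window again. The code below only builds the pyramid and measures the shrinking displacement of a synthetic bright block. It uses plain 2x2 averaging rather than the Gaussian smoothing that practical implementations typically apply, and all function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels):
    """Level 0 is the original image; each further level is half the size."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def leftmost_peak_x(img):
    """Column index of the leftmost pixel holding the maximum value."""
    best = max(max(row) for row in img)
    return min(x for row in img for x, v in enumerate(row) if v == best)


def make_frame(block_x, size=16):
    """Synthetic frame: a bright 4x4 block whose left edge is at block_x."""
    return [[1.0 if 4 <= y < 8 and block_x <= x < block_x + 4 else 0.0
             for x in range(size)] for y in range(size)]


prev_pyr = build_pyramid(make_frame(0), levels=4)
curr_pyr = build_pyramid(make_frame(8), levels=4)
shifts = [leftmost_peak_x(c) - leftmost_peak_x(p)
          for p, c in zip(prev_pyr, curr_pyr)]
print(shifts)  # [8, 4, 2, 1]
```

A coarse-to-fine tracker would estimate the one-pixel motion at the top level first, double it, and use it as the starting guess at the next finer level.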

A second family of algorithms is based on the ‘Mean Shift’ analysis technique, which is a non-parametric (i.e., with no assumptions about the probability distributions of the variables being assessed) iterative method for estimating density gradients. It rapidly finds the local maxima (modes) of a density function given discrete data sampled from that function. Comaniciu and co-authors used this technique to perform object tracking in feature space (i.e., space of feature vectors), where Mean Shift iterations are conducted to find the most probable target position in the current frame.7 The resulting tracker is practical, fast, and efficient.
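
The core Mean Shift iteration is short enough to sketch directly. The example below seeks a density mode among 2D sample points using a flat (uniform) kernel: the current estimate is repeatedly replaced by the mean of the samples within a fixed radius until it stops moving. The kernel choice, the bandwidth, and the function name are assumptions made for illustration; the tracker cited above applies the same idea to weighted pixels rather than to a bare point set.

```python
import math


def mean_shift(points, start, bandwidth, tol=1e-6, max_iter=100):
    """Climb to a local density maximum of `points`, starting from `start`."""
    x, y = start
    for _ in range(max_iter):
        # Samples inside the kernel window centred on the current estimate.
        near = [(px, py) for px, py in points
                if math.hypot(px - x, py - y) <= bandwidth]
        if not near:
            break  # empty window: no gradient information here
        nx = sum(p[0] for p in near) / len(near)
        ny = sum(p[1] for p in near) / len(near)
        shift = math.hypot(nx - x, ny - y)  # length of the mean-shift vector
        x, y = nx, ny
        if shift < tol:
            break  # converged on a mode
    return x, y


samples = [(4, 5), (6, 5), (5, 4), (5, 6), (5, 5),  # dense cluster near (5, 5)
           (20, 20), (21, 20)]                      # a second, distant cluster
print(mean_shift(samples, start=(3.5, 4.5), bandwidth=2.0))  # (5.0, 5.0)
```

Starting near the second cluster instead would converge there, which illustrates why the method finds the nearest local maximum rather than a global one.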

Comaniciu and Ramesh applied Mean Shift optimization to face detection and tracking in 2000.8 They computed the Bhattacharyya coefficient (a measure of the overlap between two distributions) as the similarity measure between a face model and candidate patches, and used the spatial gradient of this measure to guide a fast search for the best candidate. The optimization converges in a few iterations, enabling real-time face tracking on a standard PC.
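
As a small worked example of this similarity measure, the sketch below computes the Bhattacharyya coefficient between a model histogram and several candidate histograms and picks the most similar candidate. The four-bin histograms are made-up illustrative data, and the exhaustive comparison stands in for the gradient-guided search of the cited work, which it does not reproduce.

```python
import math


def normalize(hist):
    """Scale bin counts so that they sum to 1 (a discrete distribution)."""
    total = float(sum(hist))
    return [h / total for h in hist]


def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_i * q_i): 1 for identical distributions, 0 for disjoint ones."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def best_candidate(model, candidates):
    """Index and score of the candidate histogram most similar to the model."""
    scores = [bhattacharyya_coefficient(model, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical color histograms: one face model and three candidate patches.
model = normalize([8, 2, 0, 0])
candidates = [normalize([0, 0, 5, 5]),   # no overlap with the model
              normalize([7, 3, 0, 0]),   # very similar to the model
              normalize([5, 5, 0, 0])]   # partly similar
index, score = best_candidate(model, candidates)
print(index, round(score, 3))
```

Because the coefficient grows with similarity, a tracker maximizes it (or, equivalently, minimizes a distance derived from it) over candidate positions.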

The CamShift (continuously adaptive Mean Shift) algorithm, introduced in 1998, extends Mean Shift with an adaptive region-sizing step, which makes it more appropriate for tracking objects whose size and angle may change during a video sequence.9 The Mean Shift and CamShift methods have been employed in a large body of face-tracking work due to their efficiency and robustness. However, algorithms in this family may become trapped in local maxima and are unable to deal with multimodal probability density functions (PDFs).

The third family is made up of Bayes filters for solving state estimation problems. Among these, the Kalman filter is widely known, but it is limited to linear models with additive Gaussian noise. For nonlinear, non-Gaussian situations, the particle filter is often used. This sequential Monte Carlo method is based on point mass (‘particle’) representations of probability densities. That is, the density is approximated directly by a finite number of samples. Given sufficient particles and observations, the particle filter method has the advantage of recovering, to a certain extent, from a loss of track where there is occlusion (overlap) and clutter.10 Compared with Mean Shift, the particle filter is more complex, but it handles occlusion better and can deal with multimodal PDFs. It has been applied extensively in face-tracking approaches, and a large number of solutions have been proposed to address specific issues such as sampling strategy, the ‘drift’ effect, and multi-target tracking.11–13
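
The particle representation can be illustrated with one measurement update followed by resampling, for a one-dimensional state. This is a sketch under several assumptions: the likelihood is Gaussian, the function names are invented for the example, the prediction step (which would move each particle and add random noise) is left out, and the resampling offset, normally drawn at random, is fixed so that the result is reproducible.

```python
import math


def gaussian_likelihood(z, x, sigma):
    """Unnormalized likelihood of measurement z given state x."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)


def reweight(particles, weights, z, sigma):
    """Measurement update: scale each weight by its likelihood, then normalize."""
    new = [w * gaussian_likelihood(z, x, sigma) for x, w in zip(particles, weights)]
    total = sum(new)
    return [w / total for w in new]


def estimate(particles, weights):
    """Weighted mean of the particle set."""
    return sum(x * w for x, w in zip(particles, weights))


def systematic_resample(particles, weights, offset=0.5):
    """Draw len(particles) samples in proportion to the weights.

    `offset` would normally be a random number in [0, 1); it is fixed here
    (an assumption of this sketch) to keep the output deterministic.
    """
    n = len(particles)
    out, cum, i = [], weights[0], 0
    for k in range(n):
        u = (k + offset) / n  # evenly spaced positions in [0, 1)
        while u > cum and i < n - 1:
            i += 1
            cum += weights[i]
        out.append(particles[i])
    return out


print(systematic_resample([0, 1, 2, 3], [0.1, 0.1, 0.7, 0.1]))  # [1, 2, 2, 2]
```

Because the density is stored as samples, the set can hold two separate clusters of particles at once, which is how the method represents multimodal PDFs that a single Gaussian cannot.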

More recent tracking approaches employ an adaptive tracking-by-detection strategy, in which a detector predicts the position of an object and adapts its parameters to the current object's appearance. However, this strategy may fail when there is occlusion. The tracking-learning-detection (TLD) family of methods overcomes this problem by training and updating a detector on the fly with samples of the tracked objects. In the TLD framework proposed in 2010 by Kalal and coworkers, the long-term tracking task (that is, one in which the process should run indefinitely) is decomposed into three sub-tasks: tracking, learning, and detection, where the KLT tracker is employed in the tracking part.15 In particular, tracking and detection are independent processes that exchange information through learning. Kalal and coworkers introduced the P-N learning paradigm, which estimates detection errors and uses these errors to bootstrap the classifier. When they applied the TLD method to face tracking, they obtained two notable experimental results. One is on a 23-minute sitcom, where 54% recall and 75% precision were achieved for tracking one character. The other is on a surveillance video recorded at a shop with a frame rate of 1 frame per second (fps), where they reported 35% recall and 79% precision for tracking one subject.16
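
For readers less familiar with the two figures of merit, the sketch below shows the arithmetic, assuming the usual definitions: precision is the fraction of reported face locations that are correct, and recall is the fraction of true face occurrences that are found. The counts in the example are invented and are chosen only so that the arithmetic lands on 75% and 54%; they are not data from the cited experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and
    false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 54 correct detections, 18 false alarms, 46 misses.
precision, recall = precision_recall(tp=54, fp=18, fn=46)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.54
```

A tracker with high precision but modest recall, as in both reported experiments, rarely reports a wrong face but misses many frames in which the subject appears.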

The probability hypothesis density (PHD) filter, which belongs to the particle filter family, is a multiple-target filter for recursively estimating the number and state of a set of targets given a set of observations. As it propagates only the first-order moment (using a single mean position for each target) instead of the full multi-target posterior probability functions (where the function for each target involves all targets), the PHD filter is a computationally cheaper alternative to optimal multi-target filtering (that is, it reduces the growth in complexity with the number of targets from exponential to linear). Maggio and coworkers applied it to multiple-face tracking in 2007, showing that some face detection errors can be removed, and faces recovered, after a total occlusion.17 The reported average frame rate is 6 fps on a 3 GHz Pentium IV processor.

Object trackers in all of the above families use a variety of features, and selecting the right features plays a critical role. The uniqueness of a feature is key to distinguishing objects easily in the feature space. For face tracking, local descriptors, such as gradients and histograms of gradients, appear in much of the relevant work. Color is also an important feature, since the tones of skin, hair, and other facial attributes are, to a certain degree, distinct from the tones of other regions normally found in a scene. Besides these basic features, other face-tracking schemes may consider appearance-based models or hybrid feature sets.

For video indexing and search tasks, face tracking is often used together with clustering or other unsupervised techniques. For example, in 2013 Zhang and coworkers presented a system to extract temporal face sequences from videos and group them into clusters, with each cluster containing video clips of the same person.18 Their system employs face detection (to locate an initial occurrence of a face) and bi-directional (i.e., forward and backward) face tracking. The face regions found in these two ways are combined into a temporal face sequence, from which representative faces are selected based on face image quality. (A face sequence may contain too many face variations for clustering.) Next, the system extracts appearance and temporal features from the representative faces and performs a similarity analysis. Finally, face sequences belonging to the same person are grouped by semi-supervised agglomerative clustering, which takes as input the similarity matrix resulting from the previous step.
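
The final grouping step can be illustrated with a plain agglomerative clustering over a similarity matrix. Note the assumptions: this sketch is fully unsupervised, uses average linkage, and stops at a fixed similarity threshold, whereas the system described above is semi-supervised and its linkage and stopping rules are not specified here. The matrix values are made up for illustration.

```python
def agglomerative_cluster(sim, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`.

    sim is a symmetric matrix where sim[i][j] is the similarity between
    face sequences i and j. Returns a list of clusters of sequence indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Average similarity over all cross-cluster pairs (average linkage).
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for ia in range(len(clusters)):
            for ib in range(ia + 1, len(clusters)):
                score = linkage(clusters[ia], clusters[ib])
                if score > best_score:
                    best_score, best_pair = score, (ia, ib)
        if best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        ia, ib = best_pair
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters


# Hypothetical similarities between four face sequences.
similarity = [[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]]
print(agglomerative_cluster(similarity, threshold=0.5))  # [[0, 1], [2, 3]]
```

In a semi-supervised variant, known constraints (for example, two faces in the same frame cannot be the same person) would veto or force particular merges.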

Many commercial and non-commercial face-tracking software packages are available for download or trial on the Internet. Figures 1 and 2 show two examples. Non-commercial software usually implements or demonstrates a method or principle that is not yet found in a commercial product but is used to develop novel applications. An early example in this category is the Carnegie Mellon Advanced Multimedia Processing Lab face-tracking library (2000), which provides an algorithm that uses color matching combined with deformable templates.19 Along the same lines, for general object tracking (including faces), OpenCV implements several methods such as Lucas-Kanade, Mean Shift, CamShift, and Kalman filtering. OpenCV is free for both academic and commercial use. FaceTrackNoIR is a free (for non-commercial use only) game interface that integrates existing head and face trackers with game control protocols and tracking stabilization filters.20

Commercial software, on the other hand, consists of libraries and systems specifically designed as part of a business portfolio. Most of the examples in this category do not explicitly describe their underlying technologies, most likely to protect intellectual property, although some operational capabilities (for example, tolerance to face pose variations) are documented.

A number of public databases, comprising both sequences simulated in laboratories and real-world sequences, are available for evaluating the performance of face-tracking algorithms (see Table 1).


Figure 1. Face-tracking demo using the Visage/SDK FaceTrack package, a real-time and configurable face-tracking engine provided by Visage Technologies.4

Figure 2. Demo of the multiplatform SHORE Engine developed by the Fraunhofer Institute, which enables the detection and tracking of objects and faces, as well as the analysis of faces.14

Table 1. Databases for face tracking.

Simulated sequences in laboratories:
    Honda/UCSD video database21
    Stan Birchfield dataset22
    Single23 and multiple17 face datasets
    Chokepoint dataset24
    ICT-3DHP dataset25

Real-world sequences:
    YouTube Celebrities Face Tracking and Recognition Dataset11
    Episodes from TV series26
    Hannah movie dataset27

While plenty of research has been conducted on object tracking, including face tracking, in the last decade or so, and many promising approaches have been proposed, there is as yet no solution capable of simultaneously tracking and identifying a relatively large number of faces in high-resolution video frames in real time. Several research components need further investigation: the design of an efficient video-processing architecture with a parallel computing pipeline, fast and accurate simultaneous tracking of multiple faces, and identification of individuals in face sequences by taking advantage of multiple views of the face.

With the pervasiveness of monitoring cameras installed in public areas, schools, hospitals, workplaces, and homes, video analytics technologies for interpreting this video content are becoming increasingly relevant to people's lives. Remarkable progress has been made in the last two decades on robust and efficient tracking of human faces. However, there is still a significant gap between currently available technologies and the industrial need for a scalable and fast-responding system for tracking and identifying a large number of people.


Tong Zhang
Hewlett-Packard Labs
Palo Alto, CA

Tong Zhang is a principal scientist and project lead in Hewlett-Packard Labs.

Herman Martins Gomes
Federal University of Campina Grande (UFCG)
Campina Grande, Brazil

Herman Martins Gomes is a professor at UFCG.


References:
1. P. Viola, M. Jones, Robust real-time face detection, Int'l J. Comp. Vis. 57(2), p. 137-154, 2004.
2. A. Roy-Chowdhury, Y. Xu, Face tracking, Encyclopedia of Biometrics, Springer, 2009.
3. B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, Int'l Joint Conf. Artific. Intell., p. 674-679, 1981.
4. http://www.visagetechnologies.com/products/visagesdk/facetrack/ Information on FaceTrack on the Visage Technologies website. Accessed 26 August 2014.
5. T. D. Ngo, D.-D. Le, S. Satoh, D. A. Duong, Robust face track finding in video using tracked points, 2008.
6. G. Bradski, A. Kaehler, Learning OpenCV, O'Reilly, 2008.
7. D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, IEEE Conf. Comp. Vis. Pattern Recog. (CVPR), p. 142-149, 2000.
8. D. Comaniciu, V. Ramesh, Robust detection and tracking of human faces with an active camera, Third IEEE Int'l Worksh. Vis. Surveill., 2000.
9. G. R. Bradski, Computer vision face tracking for use in a perceptual user interface, Intel Technol. J., 2nd Quarter, 1998.
10. M. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Trans. Signal Process. 50(2), 2002.
11. M. Kim, S. Kumar, V. Pavlovic, H. Rowley, Face tracking and recognition with visual constraints in real-world videos, IEEE Conf. Comp. Vis. Pattern Recog. (CVPR), 2008.
12. V. Belagiannis, F. Schubert, N. Navab, S. Ilic, Segmentation based particle filtering for real-time 2d object tracking, 12th Eur. Conf. Comp. Vis. (ECCV) IV, p. 842-855, 2012.
13. M. Du, L. Guan, Monocular human motion tracking with the DE-MC particle filter, IEEE Int'l Conf. Acoust. Speech Sign. Process. (ICASSP) 2, 2006.
14. http://www.iis.fraunhofer.de/en/ff/bsy/tech/bildanalyse/shore-gesichtsdetektion.html Information on SHORE on the Fraunhofer Institute website. Accessed 26 August 2014.
15. Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 6(1), 2010.
16. Z. Kalal, K. Mikolajczyk, J. Matas, Face-TLD: tracking-learning-detection applied to faces, IEEE Int'l Conf. Image Process. (ICIP), 2010.
17. E. Maggio, E. Piccardo, C. Regazzoni, A. Cavallaro, Particle PHD filter for multi-target visual tracking, IEEE Int'l Conf. Acoust. Speech Sign. Process. (ICASSP), p. 15-20, 2007.
18. T. Zhang, D. Wen, X. Ding, Person-based video summarization and retrieval by tracking and clustering temporal face sequences, Proc. SPIE 8664, p. 86640O, 2013. doi:10.1117/12.2009127
19. F. J. Huang, T. Chen, Tracking of multiple faces for human-computer interfaces and virtual environments, IEEE Int'l Conf. Multimed. Expo (ICME), 2000.
20. http://facetracknoir.sourceforge.net/home/default.htm The FaceTrackNoIR website. Accessed 29 April 2014.
21. K. C. Lee, J. Ho, M. H. Yang, D. Kriegman, Visual tracking and recognition using probabilistic appearance manifolds, J. Comp. Vis. Image Understanding 99(3), p. 303-331, 2005.
22. S. Birchfield, Elliptical head tracking using intensity gradients and color histograms, Proc. IEEE Conf. Comp. Vis. Pattern Recog. (CVPR), p. 232-237, 1998.
23. E. Maggio, A. Cavallaro, Hybrid particle filter and mean shift tracker with adaptive transition model, IEEE Int'l Conf. Acoust. Speech Sign. Process. (ICASSP), p. 221-224, 2005.
24. Y. Wong, S. Chen, et al., Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition, Comp. Vis. Pattern Recog. Worksh., p. 74-81, 2011.
25. T. Baltrusaitis, P. Robinson, et al., 3D constrained local model for rigid and non-rigid facial tracking, IEEE Conf. Comp. Vis. Pattern Recog. (CVPR), p. 2610-2617, 2012.
26. M. Bäuml, M. Tapaswi, R. Stiefelhagen, Semi-supervised learning with constraints for person identification in multimedia data, IEEE Conf. Comp. Vis. Pattern Recog. (CVPR), p. 3602-3609, 2013.
27. A. Ozerov, J.-R. Vigouroux, L. Chevallier, P. Pérez, On evaluating face tracks in movies, IEEE Int'l Conf. Image Process. (ICIP), 2013.