Video face tracking: a work in progress
Face tracking, usually in conjunction with other methods, is one way of tackling problems such as indexing and summarizing videos and identifying individuals in large volumes of unconstrained video data, such as television programs and surveillance video. It has applications in a wide range of domains, including security, health care, retail, home automation, transportation, entertainment, safety, and education. Face-tracking algorithms can be implemented as software on general-purpose machines, or as hardware in specialized video-processing units.
In video face tracking, whether on recorded or live video, one or more faces are first located in an early frame, either manually or automatically by applying a face detection engine.1 A tracking algorithm then follows the faces through successive frames. During tracking, face detection may be invoked periodically to find new faces and to update the locations of tracked faces.
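The detect-then-track loop described above can be sketched as follows. This is a minimal illustration, not a specific published system: `detect` and `track` are hypothetical callables supplied by the caller, and `redetect_every` is an assumed tuning parameter.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def track_faces(
    frames: List[object],
    detect: Callable[[object], List[Box]],
    track: Callable[[object, Box], Box],
    redetect_every: int = 10,
) -> List[List[Box]]:
    """Return the list of face boxes for every frame.

    `detect` and `track` are hypothetical callables; `redetect_every`
    is an assumed parameter, not a value taken from the article.
    """
    boxes: List[Box] = []
    history: List[List[Box]] = []
    for index, frame in enumerate(frames):
        if index % redetect_every == 0:
            # Periodic detection: refresh locations and pick up new faces.
            boxes = detect(frame)
        else:
            # Otherwise follow each known face from its previous location.
            boxes = [track(frame, box) for box in boxes]
        history.append(boxes)
    return history
```

A practical system would also match re-detected faces to existing tracks so that identities persist; the sketch simply replaces the box list at each detection step.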
Face-tracking algorithms can be categorized in a variety of ways. For example, they can be grouped by how they track a face (the entire face as a single entity or as individual facial features), or by whether the tracking is in 2D ‘image space’ or 3D ‘pose space.’2 If instead we group them by the underlying method, it becomes apparent that most of the best-known and commonly used face-tracking methods and their variations belong to one of several basic families, which include the KLT feature tracker, Mean Shift/CamShift methods, and Kalman filter/particle filter methods.
The first family is based on the classic Kanade-Lucas-Tomasi (KLT) approach for object tracking, originally proposed in 1981.3 It is built upon three basic assumptions of optical flow: brightness constancy (the brightness of a pixel does not change from frame to frame), temporal persistence (the object does not move much from frame to frame), and spatial coherence (neighboring points have similar motion). The basic idea of KLT is a local search using gradients weighted by an approximation of the second derivative of the image. Because it is straightforward to implement and relatively fast, the KLT algorithm has been used as the core tracking method in many existing face-tracking approaches.
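To make the local gradient search concrete, here is a minimal single-step Lucas-Kanade sketch in plain Python. It assumes a synthetic Gaussian blob as the image and central-difference gradients; the function names and the window radius are illustrative choices, and a real tracker would iterate this step with sub-pixel interpolation.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def gaussian_blob(size: int, cx: float, cy: float, sigma: float) -> Image:
    """Synthetic test image: one smooth bright blob centred at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def lucas_kanade_step(prev: Image, curr: Image, cx: int, cy: int,
                      radius: int) -> Tuple[float, float]:
    """Estimate the (dx, dy) motion of the window centred at (cx, cy)."""
    gxx = gxy = gyy = bx = by = 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # gradient in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # gradient in y
            it = curr[y][x] - prev[y][x]                  # temporal difference
            # Accumulate the 2x2 gradient matrix G and the right-hand side b.
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            bx -= ix * it
            by -= iy * it
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture to track")
    # Solve G d = b by Cramer's rule.
    dx = (gyy * bx - gxy * by) / det
    dy = (gxx * by - gxy * bx) / det
    return dx, dy


prev_frame = gaussian_blob(21, 10.0, 10.0, 3.0)
curr_frame = gaussian_blob(21, 10.3, 9.8, 3.0)  # blob moved by (0.3, -0.2)
dx, dy = lucas_kanade_step(prev_frame, curr_frame, 10, 10, 8)
```

The matrix accumulated from gradient products plays the role of the approximate second derivative mentioned above. For this small motion the estimate lands close to the true shift of (0.3, -0.2); it degrades as the motion grows relative to the window, because the brightness-constancy linearization only holds locally.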
For example, Ngo and coworkers used the KLT tracker to track points of interest throughout video frames.5 Each face track was then formed from the faces detected in different frames that shared a large enough number of tracked points. However, because the KLT algorithm relies only on local information derived from a small window surrounding each point of interest, large motions can move points outside the local window, making them impossible for the algorithm to find. The pyramidal KLT algorithm addresses this problem by starting the tracking at the highest level of an image pyramid (lowest detail) and working down to lower levels (finer detail). Tracking over image pyramids allows large motions to be caught by local windows. A pyramidal Lucas-Kanade implementation for coarse-to-fine optical flow estimation is available in the OpenCV library (an open source C++ code library for computer vision techniques6).
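The effect of the pyramid can be illustrated with a short sketch. It is a simplification under stated assumptions: 2x2 block averaging stands in for the smoothing and subsampling a real pyramid would use, and the per-level residuals passed to `coarse_to_fine` are hypothetical values that a local tracker would supply.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]


def downsample(image: Image) -> Image:
    """Halve each dimension by averaging non-overlapping 2x2 blocks."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
          + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
         for c in range(cols)]
        for r in range(rows)
    ]


def build_pyramid(image: Image, levels: int) -> List[Image]:
    """Level 0 is the full-resolution image; higher levels are coarser."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def displacement_at_level(dx: float, dy: float, level: int) -> Tuple[float, float]:
    """A motion of (dx, dy) pixels at level 0 shrinks by a factor of 2**level."""
    scale = 2 ** level
    return dx / scale, dy / scale


def coarse_to_fine(residuals: Sequence[float]) -> float:
    """Combine per-level motion residuals, ordered coarsest first (1D case).

    The running guess is doubled each time it is passed down one level.
    """
    guess = 0.0
    for i, residual in enumerate(residuals):
        guess += residual
        if i < len(residuals) - 1:
            guess *= 2.0
    return guess
```

With three levels, a 12-pixel motion at full resolution is only 3 pixels at the coarsest level, which is small enough for a local window to catch; the finer levels then only need to correct small residuals.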
A second family of algorithms is based on the ‘Mean Shift’ analysis technique, which is a non-parametric (i.e., with no assumptions about the probability distributions of the variables being assessed) iterative method for estimating density gradients. It rapidly locates the maxima of a density function given discrete data sampled from that function. Comaniciu and co-authors used this technique to perform object tracking in feature space (i.e., the space of feature vectors), where Mean Shift iterations are conducted to find the most probable target position in the current frame.7 The approach provides a practical, fast, and efficient solution for tracking objects.
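The mode-seeking behavior can be shown with a small sketch on 2D sample points. It assumes a flat (uniform) kernel and an illustrative bandwidth; it is not the feature-space tracker of Comaniciu and co-authors, only the underlying iteration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_shift(points: List[Point], start: Point, bandwidth: float,
               tol: float = 1e-6, max_iter: int = 100) -> Point:
    """Climb to a local density maximum using a flat (uniform) kernel."""
    x, y = start
    for _ in range(max_iter):
        nearby = [(px, py) for px, py in points
                  if math.hypot(px - x, py - y) <= bandwidth]
        if not nearby:
            break  # no samples in the window: nothing to climb towards
        # Move the window centre to the mean of the samples it contains.
        new_x = sum(p[0] for p in nearby) / len(nearby)
        new_y = sum(p[1] for p in nearby) / len(nearby)
        shift = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if shift < tol:
            break
    return x, y


samples = [(4, 5), (5, 4), (5, 5), (5, 6), (6, 5),   # cluster around (5, 5)
           (20, 20), (21, 20), (20, 21)]             # second, distant cluster
mode_a = mean_shift(samples, start=(3.0, 3.0), bandwidth=3.0)
mode_b = mean_shift(samples, start=(19.0, 19.0), bandwidth=3.0)
```

Starting near the first cluster, the iteration settles on its center; starting near the second, it settles there instead. The result therefore depends on where the search begins, since each run climbs only to the nearest local maximum.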
Comaniciu and Ramesh applied Mean Shift optimization to face detection and tracking in 2000.8 They computed the Bhattacharyya coefficient (a measure of the overlap between two distributions) as the similarity measure between a face model and candidate patches, and used the spatial gradient of this measure to guide a fast search for the best candidate. The optimization converges in a few iterations, enabling real-time face tracking on a standard PC.
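As a rough illustration of the similarity measure, the sketch below computes the Bhattacharyya coefficient between histograms and picks the most similar candidate. The exhaustive comparison in `best_candidate` is a simplification introduced here; the method described above follows the gradient of the measure rather than scoring every patch.

```python
import math
from typing import List, Sequence


def normalize(hist: Sequence[float]) -> List[float]:
    """Scale a histogram so that its bins sum to 1."""
    total = float(sum(hist))
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of sqrt(p_i * q_i): 1 for identical distributions, 0 for disjoint ones."""
    return sum(math.sqrt(a * b) for a, b in zip(normalize(p), normalize(q)))


def best_candidate(model: Sequence[float],
                   candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate histogram most similar to the model."""
    return max(range(len(candidates)),
               key=lambda i: bhattacharyya_coefficient(model, candidates[i]))
```

For example, with a model histogram `[4, 3, 2, 1]`, the candidate `[4, 3, 2, 2]` scores higher than `[1, 2, 3, 4]` or `[0, 0, 0, 1]`, so it would be selected as the best match.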
The CamShift (continuously adaptive Mean Shift) algorithm, introduced in 1998, extends Mean Shift with an adaptive region-sizing step, which makes it more appropriate for tracking objects whose size and angle may change during a video sequence.9 The Mean Shift and CamShift methods have been employed in a large body of face-tracking work due to their efficiency and robustness. However, algorithms in this family may become trapped in local maxima and cannot deal with multimodal probability density functions (PDFs).
The third family is made up of Bayes filters for solving state estimation problems. Among these, the Kalman filter is widely known, but it is limited to linear models with additive Gaussian noise. For nonlinear, non-Gaussian situations, the particle filter is often used. This sequential Monte Carlo method is based on point mass (‘particle’) representations of probability densities. That is, the density is approximated directly as a finite number of samples. Given sufficient particles and observations, the particle filter method has the advantage of recovering, to a certain extent, from a loss of track where there is occlusion (overlap) and clutter.10 Compared with Mean Shift, the particle filter is more complex, but it handles occlusion better and can deal with multimodal PDFs. It has been applied extensively in face-tracking approaches, and a large number of solutions have been proposed to address specific issues such as sampling strategy, the ‘drift’ effect, and multi-target tracking.11–13
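The sample-based representation can be sketched with a minimal bootstrap particle filter for a 1D position. All settings here are assumptions made for illustration: a random-walk motion model, a Gaussian measurement likelihood, a uniform initial spread over `[lo, hi]`, and a fixed random seed for repeatability.

```python
import math
import random
from typing import List, Sequence


def particle_filter_1d(measurements: Sequence[float], n_particles: int = 500,
                       motion_std: float = 1.0, meas_std: float = 1.0,
                       lo: float = 0.0, hi: float = 100.0,
                       seed: int = 0) -> List[float]:
    """Bootstrap particle filter; returns one position estimate per measurement."""
    rng = random.Random(seed)
    # The density is represented by samples, initially spread over [lo, hi].
    particles = [rng.uniform(lo, hi) for _ in range(n_particles)]
    estimates = []
    for z in measurements:
        # Predict: diffuse each particle with a random-walk motion model.
        particles = [p + rng.gauss(0.0, motion_std) for p in particles]
        # Update: weight each particle by the Gaussian measurement likelihood.
        weights = [math.exp(-0.5 * ((z - p) / meas_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # No particle explains the measurement: fall back to equal weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # Estimate: the weighted mean of the samples.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


observed = [20.0 + t for t in range(15)]  # target drifting one unit per step
estimates = particle_filter_1d(observed)
```

Because the particles start spread over the whole range, the filter can lock on without being told where the target is, and the same mechanism lets it represent several hypotheses at once. With a random-walk model the estimate trails a steadily moving target slightly.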
More recent tracking approaches employ an adaptive tracking-by-detection strategy, in which a detector predicts the position of an object and adapts its parameters to the current object's appearance. However, this strategy may fail when there is occlusion. The tracking-learning-detection (TLD) family of methods overcomes this problem by training and updating a detector on the fly with samples of the tracked objects. In the TLD framework proposed in 2010 by Kalal and coworkers, the long-term tracking task (that is, one where the process should run indefinitely) is decomposed into three sub-tasks: tracking, learning, and detection, with the KLT tracker employed in the tracking part.15 In particular, tracking and detection are independent processes that exchange information through learning. Kalal and coworkers introduced the P-N learning paradigm, which estimates detection errors and uses these errors to bootstrap the classifier. When they applied the TLD method to face tracking, they obtained two notable experimental results. The first is on a 23-minute sitcom, where 54% recall and 75% precision were achieved for tracking one character. The second is on a surveillance video recorded at a shop with a frame rate of 1 frame per second (fps), where they reported 35% recall and 79% precision for tracking one subject.16
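For readers less familiar with the two metrics, the sketch below shows how precision and recall follow from counts of correct, spurious, and missed outputs. The counts in the usage note are hypothetical and chosen only to reproduce the percentages quoted above; they are not taken from the cited experiments, whose exact matching criteria may differ.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int,
                     false_neg: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    reported = true_pos + false_pos   # everything the tracker output
    actual = true_pos + false_neg     # everything it should have output
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

With illustrative counts of 540 correct outputs, 180 false alarms, and 460 misses, `precision_recall(540, 180, 460)` gives a precision of 0.75 and a recall of 0.54: three quarters of what the tracker reports is right, but it finds only about half of the true appearances.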
The probability hypothesis density (PHD) filter, which belongs to the particle filter family, is a multiple-target filter for recursively estimating the number and state of a set of targets given a set of observations. Because it propagates only the first-order moment (using a single mean position for each target) instead of the full multi-target posterior probability functions (where the function for each target involves all targets), the PHD filter is a computationally cheaper alternative to optimal multi-target filtering (that is, it reduces the growth in complexity with the number of targets from exponential to linear). Maggio and coworkers applied it to multiple-face tracking in 2007, showing that some face detection errors can be removed, and faces recovered, after a total occlusion.17 The reported average frame rate is 6 fps on a 3 GHz Pentium IV processor.
Object trackers in all of the above families use a variety of features, and selecting the right features plays a critical role. The uniqueness of a feature is key to easily distinguishing objects in the feature space. For face tracking, local descriptors, such as gradients and histograms of gradients, are present in much of the relevant work in the area. Color is also an important feature, since the tones of skin, hair, and other facial attributes are, to a certain degree, distinct from the tones of other regions normally found in a scene. Besides these basic features, other face-tracking schemes may consider appearance-based models or hybrid feature sets.
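As one example of a color feature, the sketch below builds a normalized hue histogram from RGB pixels using Python's standard `colorsys` module. The choice of hue and the number of bins are assumptions made for illustration; published trackers use a range of color spaces and binning schemes.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (red, green, blue)


def hue_histogram(pixels: Sequence[Pixel], bins: int = 8) -> List[float]:
    """Normalized histogram of pixel hues, usable as a simple color feature."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    counts = [0] * bins
    for r, g, b in pixels:
        hue, _sat, _val = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # hue lies in [0, 1); clamp the index in case of rounding at the top edge.
        counts[min(int(hue * bins), bins - 1)] += 1
    return [c / len(pixels) for c in counts]
```

For instance, a patch made of two pure red pixels, one pure green, and one pure blue yields a histogram with half its mass in the first bin and a quarter each in the green and blue bins. Histograms of this kind can then be compared between a face model and candidate regions.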
For video indexing and search tasks, face tracking is often used together with clustering or other unsupervised techniques. For example, in 2013 Zhang and coworkers presented a system that extracts temporal face sequences from videos and groups them into clusters, with each cluster containing video clips of the same person.18 Their system employs face detection (to locate an initial occurrence of a face) and bi-directional (i.e., forward and backward) face tracking. The face regions found in these two ways are combined into a temporal face sequence, from which representative faces are selected based on face image quality. (A full face sequence may contain too many face variations for clustering.) Next, the system extracts appearance and temporal features from the representative faces and performs a similarity analysis. Finally, face sequences belonging to the same person are grouped by semi-supervised agglomerative clustering, which takes as input the similarity matrix resulting from the previous step.
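The final grouping step can be sketched as agglomerative clustering over a similarity matrix. This is a simplified, fully unsupervised stand-in rather than the semi-supervised procedure of the cited system: average linkage and the stopping `threshold` are assumptions introduced here.

```python
from typing import List, Optional, Tuple


def agglomerative_clusters(similarity: List[List[float]],
                           threshold: float) -> List[List[int]]:
    """Average-linkage clustering; merges until no pair exceeds `threshold`."""
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Mean pairwise similarity between the members of two clusters.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = threshold
        best_pair: Optional[Tuple[int, int]] = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = linkage(clusters[i], clusters[j])
                if score > best:
                    best, best_pair = score, (i, j)
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical similarities between four face sequences.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
groups = agglomerative_clusters(sim, threshold=0.5)
```

Here sequences 0 and 1 merge first, then 2 and 3; the two resulting groups are too dissimilar to merge, so the sketch ends with two clusters, one per presumed person.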
Many commercial and non-commercial face-tracking software packages are available for download or trial on the Internet. Figures 1 and 2 show two examples. Non-commercial software usually implements or demonstrates a method or principle that is not yet found in a commercial product but can be used to develop novel applications. An early example in this category is the Carnegie Mellon Advanced Multimedia Processing Lab face-tracking library (2000), which provides an algorithm that uses color matching combined with deformable templates.19 Along the same lines, for general object tracking (including faces), OpenCV implements several methods such as Lucas-Kanade, Mean Shift, CamShift, and Kalman filtering. OpenCV is free for both academic and commercial use. FaceTrackNoIR is a free (for non-commercial use only) game interface that integrates existing head and face trackers with game control protocols and tracking stabilization filters.20
Commercial software, on the other hand, consists of libraries and systems specifically designed as part of a business portfolio. Most of the examples in this category do not explicitly describe their underlying technologies, most likely because of intellectual property protection, although some operational capabilities (for example, tolerance to face pose variations) are specified.
A number of publicly available databases, including both simulated sequences recorded in laboratories and real-world sequences, can be used to evaluate the performance of face-tracking algorithms (see Table 1).
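Table 1. Publicly available databases for evaluating face-tracking algorithms.
|Category|Database|
|Simulated sequences in laboratories|Honda/UCSD video database21|
|Simulated sequences in laboratories|Stan Birchfield dataset22|
|Simulated sequences in laboratories|Single23 and multiple17 face datasets|
|Real-world sequences|YouTube Celebrities Face Tracking and Recognition Dataset11|
|Real-world sequences|Episodes from TV series26|
|Real-world sequences|Hannah movie dataset27|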
Although plenty of research has been conducted on object tracking, including face tracking, in the last decade or so, and many promising approaches have been proposed, there is as yet no solution capable of simultaneously tracking and identifying a relatively large number of faces in high-resolution video frames in real time. Several relevant research components need further investigation: the design of an efficient video-processing architecture with a parallel computing pipeline, fast and accurate simultaneous tracking of multiple faces, and identification of individuals in face sequences by taking advantage of multiple views of the face.
With the pervasiveness of monitoring cameras installed in public areas, schools, hospitals, workplaces, and homes, video analytics technologies for interpreting video content are becoming increasingly relevant to people's lives. Remarkable progress has been made in the last two decades on robust and efficient tracking of human faces. However, there is still a significant gap between currently available technologies and the industrial need for a scalable, fast-responding system for tracking and identifying a large number of people.
Tong Zhang is a principal scientist and project lead in Hewlett-Packard Labs.
Herman Martins Gomes is a professor at UFCG.