Bio-inspired algorithm for online visual tracking

A novel smooth pursuit tracking method can be used to accurately track small targets in aerial videos.
06 October 2016
Mohammed Yousefhussien, N. Andrew Browning and Christopher Kanan

Tracking targets in aerial footage is becoming increasingly important as the applications of unmanned aerial vehicles (UAVs) continue to expand; UAVs are now used in film production, mining, news media, and agriculture. Moreover, the world's security agencies are gathering enormous amounts of UAV video data in which they search for events of interest, e.g., suspicious vehicles. When such a vehicle is found in the data, users often want a UAV to follow it over time. Online visual tracking is therefore required for these endeavors, and the resulting videos can then be studied by a human analyst. Tracking ground vehicles in UAV video streams is especially difficult, however, because the target usually occupies relatively few pixels (compared with other tracking problems) and because its appearance can change drastically (due to changes in lighting conditions, UAV altitude, and perspective).

In many tracking algorithms, the problem is defined as a binary classification task, i.e., the object of interest must be differentiated from the rest of the video frame's content. These algorithms involve training classifiers built on top of visual features, which encode only the target's appearance. Other information, however, can also be used to improve tracking. In primates, 'smooth pursuit' is a type of eye movement used when tracking small moving objects. With this continuous eye movement, the object's motion is counteracted so that the object stays in the center of the field of view. Although saccadic eye movements (i.e., rapid movements of the eye between fixation points) can be elicited in a wide variety of situations, smooth pursuit is only possible when an object is in motion.1

Motivated by the neural circuits that underlie smooth pursuit, we have created the smooth pursuit tracking (SPT) algorithm2 for tracking problems in aerial video data. In this method, we combine object appearance with motion and predicted-location information to improve tracking. Although primates using smooth pursuit are limited to tracking one object at a time, our SPT algorithm can easily track multiple objects simultaneously, with little computational overhead.

In our SPT algorithm, we first generate a top-down appearance saliency map using an online brain-inspired object recognition algorithm known as a gnostic field.3, 4 A gnostic field consists of competing gnostic sets, where each set has a population of template-matching units for a particular class. As input to the gnostic field, we use the convolutional feature maps produced by a convolutional neural network that has been pre-trained on a massive collection of natural images.5 The resulting appearance saliency map has high values in image regions that resemble the target. We also create a motion saliency map by performing background subtraction: we use an average model of a subset of the previous frames, after aligning them to the current frame. In addition, for our location saliency map, we use a Kalman filter to predict the location of the target in the next frame. Finally, we create the smooth pursuit map by multiplicatively combining the appearance, motion, and location maps (see Figure 1).
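
To illustrate the location saliency step, the sketch below is a minimal illustration under our own assumptions, not the paper's implementation: it uses OpenCV's Kalman filter with a constant-velocity state [x, y, vx, vy] and renders the predicted position as a Gaussian bump. The noise settings and the kernel width sigma are illustrative values.

import numpy as np
import cv2

def make_kalman(x0, y0):
    """Constant-velocity Kalman filter over an assumed state [x, y, vx, vy]."""
    kf = cv2.KalmanFilter(4, 2)  # 4 state dimensions, 2 measured (x, y)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = 1e-2 * np.eye(4, dtype=np.float32)      # assumed tuning
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)  # assumed tuning
    kf.errorCovPost = np.eye(4, dtype=np.float32)
    kf.statePost = np.array([[x0], [y0], [0], [0]], np.float32)
    return kf

def location_saliency(kf, frame_shape, sigma=15.0):
    """Gaussian saliency map centered on the Kalman-predicted target location."""
    px, py = kf.predict()[:2, 0]
    h, w = frame_shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))

In a full tracker, kf.correct() would be called with the measured target center on every frame in which the target is visible, so that the prediction stays locked to the track.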


Figure 1. An overview of the smooth pursuit tracking (SPT) algorithm. The inputs to the algorithm are the convolutional feature maps produced by a pre-trained convolutional neural network. Appearance, motion, and location saliency maps are produced and then multiplicatively combined to create the final smooth pursuit map, in which the targets are identified.
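
To make the fusion step concrete, the following minimal sketch assumes grayscale frames that have already been aligned to the current frame; the simple running-average background model and the min-max normalization are our own simplifications of the steps described above.

import numpy as np

def motion_saliency(current_gray, aligned_prev_frames):
    """Background subtraction against the mean of previously aligned frames."""
    background = np.mean(aligned_prev_frames, axis=0)
    return np.abs(current_gray.astype(np.float32) - background)

def normalize(saliency, eps=1e-8):
    """Min-max normalization so each map lies in [0, 1] before fusion."""
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + eps)

def smooth_pursuit_map(appearance, motion, location):
    """Multiplicative combination of the three saliency maps."""
    return normalize(appearance) * normalize(motion) * normalize(location)

# The target estimate is taken at the peak of the fused map:
# ty, tx = np.unravel_index(np.argmax(spm), spm.shape)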

One of the main strengths of our method is its ability to handle long-term occlusions. In primates, smooth pursuit only works when the tracked object is both moving and visible; other mechanisms are therefore required to recover a track that has been lost to occlusion. In that situation, primates use saccades to recover the object's location. We emulate this behavior by running the selective search algorithm to generate bounding-box hypotheses when an occlusion is detected. For each candidate box, the SPT algorithm measures the confidence that the box contains the target, and the box with the highest confidence is selected. The number of box hypotheses and the search radius grow with the length of the occlusion, i.e., SPT searches for boxes farther away the longer the target remains occluded (as sketched below).
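
The recovery step can be sketched as follows. Randomly sampled boxes around the last known location stand in for the selective-search proposals, the growth schedule for the search radius is an assumption, and spt_confidence is a hypothetical callback representing the tracker's confidence scoring.

import numpy as np

def recover_track(last_xy, occlusion_frames, spt_confidence,
                  n_boxes=50, base_radius=20, box_size=(32, 32)):
    """Saccade-like recovery: sample candidate boxes in a radius that grows
    with the occlusion length, and keep the highest-confidence box."""
    radius = base_radius * (1 + occlusion_frames)  # assumed growth schedule
    cx, cy = last_xy
    rng = np.random.default_rng()
    offsets = rng.uniform(-radius, radius, size=(n_boxes, 2))
    boxes = [(cx + dx, cy + dy, *box_size) for dx, dy in offsets]
    scores = [spt_confidence(b) for b in boxes]    # confidence per hypothesis
    return boxes[int(np.argmax(scores))]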

We have also evaluated the ability of SPT to track vehicles in aerial footage from the Video Verification of Identity (VIVID) data set. In particular, we compared SPT's performance with that of seven recently developed trackers, i.e., the Color Visual Tracking,6 Adaptive Structural Local Sparse Appearance,7 L1 (referring to the L1 norm),8 Multiple Instance Learning,9 Kernelized Correlation Filters,10 Online AdaBoost,11 and Structure Preserving Object Tracking12 algorithms. We used standard metrics, such as precision plots, success plots, and center location error (see Table 1; a minimal computation of this metric is sketched after the table), to make these comparisons. Our results indicate that the SPT algorithm can successfully track vehicles for significantly longer periods than the other state-of-the-art algorithms.

Table 1. Average center location error (in pixels) of our SPT algorithm, in single- and multi-target experiments, compared with seven state-of-the-art tracking methods (lower errors indicate better tracking; a dash indicates that no value was reported). CVT: Color Visual Tracking. ASLA: Adaptive Structural Local Sparse Appearance. MIL: Multiple Instance Learning. KCF: Kernelized Correlation Filters. OAB: Online AdaBoost. SPOT: Structure Preserving Object Tracking. L1 refers to the L1 norm in linear algebra.
               SPT2    CVT6    ASLA7   L1 8    MIL9    KCF10   OAB11   SPOT12
Single-target  34.57   170.37  206.93  242.81  153.38  173.72  151.93  –
Multi-target   13.44   316.31  273.55  316.08  235.39  300.27  209.06  180.19
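
The center location error reported in Table 1 is the Euclidean distance between the predicted and ground-truth target centers, averaged over all frames; a minimal version:

import numpy as np

def center_location_error(pred_centers, gt_centers):
    """Mean Euclidean distance (in pixels) between predicted and
    ground-truth target centers across all frames."""
    pred = np.asarray(pred_centers, dtype=np.float64)
    gt = np.asarray(gt_centers, dtype=np.float64)
    return np.linalg.norm(pred - gt, axis=1).mean()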

In summary, we have introduced a bio-inspired online tracking algorithm that we developed specifically for tracking small targets in aerial imagery. We find that our smooth pursuit tracking method outperforms a number of state-of-the-art algorithms by a large margin. We are currently investigating how well our SPT algorithm works with larger objects and non-vehicle targets. In future work, we also plan to study how modular neural networks could be used to train SPT end to end.

The research reported in this article was supported in part by the US Naval Air Systems Command, under contract N68335-14-C-033. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsoring agencies.


Mohammed Yousefhussien, Christopher Kanan
Chester F. Carlson Center for Imaging Science
Rochester Institute of Technology
Rochester, NY

Mohammed Yousefhussien is a Fulbright scholar, working toward a PhD in imaging science. His research interests are centered around computer vision applications using machine learning algorithms.

Christopher Kanan is an assistant professor. He conducts basic and applied research in computer vision and machine learning. He received a PhD in computer science from the University of California, San Diego.

N. Andrew Browning
Scientific Systems Company, Inc.
Woburn, MA

Andrew Browning is a principal research scientist. He received a PhD in cognitive and neural systems from Boston University.


References:
1. C. Rashbass, The relationship between saccadic and smooth tracking eye movements, J. Physiol. 159, p. 326-338, 1961.
2. M. A. Yousefhussien, N. A. Browning, C. Kanan, Online tracking using saliency, Proc. IEEE Winter Conf. Appl. Comp. Vision, 2016. doi:10.1109/WACV.2016.7477569
3. C. Kanan, Recognizing sights, smells, and sounds with gnostic fields, PLOS ONE 8, p. e54088, 2013. doi:10.1371/journal.pone.0054088
4. C. Kanan, Fine-grained object recognition with gnostic fields, Proc. IEEE Winter Conf. Appl. Comp. Vision, p. 23-30, 2014. doi:10.1109/WACV.2014.6836122
5. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. Int'l Conf. Learning Represent., 2015.
6. M. Danelljan, F. S. Khan, M. Felsberg, J. van de Weijer, Adaptive color attributes for real-time visual tracking, Proc. IEEE Conf. Comp. Vision Pattern Recognit., p. 1090-1097, 2014. doi:10.1109/CVPR.2014.143
7. X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, Proc. IEEE Conf. Comp. Vision Pattern Recognit., p. 1822-1829, 2012. doi:10.1109/CVPR.2012.6247880
8. X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Machine Intell. 33, p. 2259-2272, 2011.
9. B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, Proc. IEEE Conf. Comp. Vision Pattern Recognit., p. 983-990, 2009. doi:10.1109/CVPR.2009.5206737
10. J. F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, Proc. Eur. Conf. Comp. Vision, p. 702-715, 2012.
11. H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, Proc. Brit. Machine Vision Conf. 1, p. 47-56, 2006.
12. L. Zhang, L. van der Maaten, Structure preserving object tracking, Proc. IEEE Conf. Comp. Vision Pattern Recognit., p. 1838-1845, 2013. doi:10.1109/CVPR.2013.240