Proceedings Volume 11187

Optoelectronic Imaging and Multimedia Technology VI

Purchase the printed version of this volume at proceedings.com or access the digital version at SPIE Digital Library.

Volume Details

Date Published: 26 December 2019
Contents: 10 Sessions, 54 Papers, 0 Presentations
Conference: SPIE/COS Photonics Asia 2019
Volume Number: 11187

Table of Contents

  • Front Matter: Volume 11187
  • Depth and Light Field
  • Computer Vision
  • Computational Optics
  • Computational Acquisition and Analysis I
  • Computational Acquisition and Analysis II
  • Computer Vision I
  • Computer Vision II
  • Image Processing
  • Poster Session
Front Matter: Volume 11187
This PDF file contains the front matter associated with SPIE Proceedings Volume 11187, including the Title Page, Copyright Information, Table of Contents, Author and Conference Committee lists.
Depth and Light Field
High-resolution and real-time spectral-depth imaging with a compact system
We develop a compact imaging system to enable simultaneous acquisition of the spectral and depth information in real time. Our system consists of a spectral camera with low spatial resolution and an RGB camera with high spatial resolution, which captures two measurements from two different views of the same scene at the same time. Relying on an elaborate computational reconstruction algorithm with deep learning, our system can eventually obtain a spectral cube with a spatial resolution of 1920×1080 and a total of 16 spectral bands in the visible light section, as well as the corresponding depth map with the same spatial resolution. Evaluations on both benchmark datasets and real-world scenes show that our reconstruction results are accurate and reliable. To the best of our knowledge, this is the first attempt to capture 5D information (3D space + 1D spectrum + 1D time) with a miniaturized apparatus and without active illumination.
Realizing high angular resolution multi-view and light-field displays with multi-projection technique
In this paper, a novel method with a high-speed angular steering device is presented to realize high angular resolution for multi-projection based multi-view and light field 3D displays. State-of-the-art multi-projection based multi-view and light field 3D displays usually employ a projector array with a multi-layer structure to achieve high angular resolution, but the projector's physical size still limits the angular resolution of the 3D display. To enhance the angular resolution of multi-view and light field displays, a specially designed angular steering optical device is integrated into the screen part of the display system, which is capable of deflecting rays within a limited angular range to equivalently increase the density of projectors. A high-speed angular steering device is developed to deflect rays by a small angle based on the birefringent effect of liquid crystal (LC), and mainly consists of a cascaded structure of a pi-cell LC and a birefringent LC microprism array. Furthermore, the steering angle control of this device can be well synchronized with the frame rate of the projector array. With time-multiplexing of the steering angle, high angular resolution multi-view and light field 3D displays with a single-layer projector array are implemented. In our experiments, the developed angular steering device with steering angles of 0° and ±1° was successfully integrated into twelve-projector based multi-view and light field display systems to enhance their angular resolution. The experimental results show that the angular resolution of the two different types of 3D displays is significantly enhanced to provide a better visual experience for users.
Monocular depth estimation based on unsupervised learning
Due to their low cost and easy deployment, monocular cameras have always attracted researchers' attention for depth estimation. Owing to the good performance of deep learning in depth estimation, more and more training models have emerged. Most existing works that achieve very promising results belong to supervised learning methods, but the corresponding ground-truth depth data required for training makes training complicated. To overcome this limitation, an unsupervised learning framework is used for monocular depth estimation from videos, which contains a depth network and a pose network. In this paper, better results are achieved by optimizing the training models and improving the training loss. Training and evaluation data are based on the standard KITTI dataset (Karlsruhe Institute of Technology and Toyota Technological Institute). In the end, the results are shown by comparing the different training models used in this paper.
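At the heart of such an unsupervised framework is a view-synthesis loss: the predicted depth and relative pose are used to warp a neighboring frame into the target view, and the networks are trained to minimize the photometric difference. The sketch below illustrates this idea in PyTorch; the helper names, tensor shapes and the plain L1 penalty are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def inverse_warp(src_img, depth, pose, K, K_inv):
    # src_img: (B,3,H,W) source frame; depth: (B,1,H,W) target-view depth
    # pose: (B,3,4) relative motion [R|t]; K, K_inv: (B,3,3) intrinsics and inverse
    b, _, h, w = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1).expand(b, -1, -1)
    cam = (K_inv @ pix) * depth.view(b, 1, -1)            # back-project to 3D
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], 1)
    proj = K @ (pose @ cam_h)                             # transform and re-project
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,       # normalize to [-1, 1]
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss(target, src_img, depth, pose, K, K_inv):
    return (target - inverse_warp(src_img, depth, pose, K, K_inv)).abs().mean()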
A learning-based method using epipolar geometry for light field depth estimation
A novel method is proposed in this paper for light field depth estimation using a convolutional neural network. Many approaches have been proposed for light field depth estimation, but most of them face a trade-off between accuracy and runtime. In order to solve this problem, we propose a method that obtains more accurate light field depth estimation results at a faster speed. First, the light field data is augmented by the proposed method, taking the light field geometry into account. Because of the large amount of light field data, the number of images needs to be reduced appropriately to improve the operation speed while maintaining the confidence of the estimation. Next, the light field images are fed into our network after data augmentation. The features of the images are extracted during this process and are used to calculate the disparity value. Finally, after training, our network can generate an accurate depth map from the input light field image. Using this accurate depth map, the 3D structure of the real world can be accurately reconstructed. Our method is verified on the HCI 4D Light Field Benchmark and on real-world light field images captured with a Lytro light field camera.
Light field SLAM based on ray-space projection model
Yaning Li, Qi Zhang, Xue Wang, et al.
Pose estimation is the key step of simultaneous localization and mapping (SLAM). The relationship between the rays captured by multiple light field cameras can provide more constraints for pose estimation. In this paper, we propose a novel light field SLAM (LF-SLAM) based on ray-space projection model, including visual odometry, optimization, loop closing and mapping. Unlike traditional SLAM, which estimates pose based on point-point correspondence, we firstly utilize ray-space features to initialize camera motion based on light field fundamental matrix. In addition, a ray-ray cost function is presented to optimize camera pose and 3D points. Finally, we exhibit the motion map and 3D reconstruction results from a moving light field camera. Experimental results have verified the effectiveness and robustness of the proposed method.
Computer Vision
Dynamic-stride-net: deep convolutional neural network with dynamic stride
Zerui Yang, Yuhui Xu, Wenrui Dai, et al.
It is crucial to reduce the cost of deep convolutional neural networks while preserving their accuracy. Existing methods adaptively prune DNNs in a layer-wise or channel-wise manner based on the input image. In this paper, we develop a novel dynamic network, namely Dynamic-Stride-Net, to improve residual networks with layer-wise adaptive strides in the convolution operations. Dynamic-Stride-Net leverages a gating network to adaptively select the strides of convolutional blocks based on the outputs of the previous layer. To optimize the selection of strides, the gating network is trained by reinforcement learning. The number of floating-point operations (FLOPs) is significantly reduced by adapting the strides of the convolutional layers without loss of accuracy. Dynamic-Stride-Net reduces the computational cost by 35%-50% with accuracy equivalent to the original model on the CIFAR-10 and CIFAR-100 datasets. It outperforms state-of-the-art dynamic networks and static compression methods.
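As a rough illustration of the gating idea, the toy module below looks at the previous block's output and scores the candidate strides for the next convolutional block; the layer sizes and the arg-max decision are assumptions, and in the paper the stride choice is optimized with reinforcement learning rather than picked greedily.

import torch
import torch.nn as nn

class StrideGate(nn.Module):
    def __init__(self, channels, strides=(1, 2)):
        super().__init__()
        self.strides = strides
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),    # global context of the previous layer's output
            nn.Flatten(),
            nn.Linear(channels, len(strides)),
        )

    def forward(self, x):
        logits = self.head(x)                      # one score per candidate stride
        idx = logits.argmax(dim=1)                 # greedy choice (RL-based sampling in training)
        return [self.strides[i] for i in idx.tolist()], logits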
No-reference image quality assessment based on an objective quality database and deep neural networks
Image quality assessment (IQA) has always been an active research topic since the birth of the digital image, and the arrival of deep learning has made IQA even more promising. However, most state-of-the-art no-reference (NR) IQA methods require regression training on distorted images or extracted features with subjective image scores, so they suffer from insufficient reference image content and training samples with subjective scores due to time-consuming and laborious subjective testing. Furthermore, most convolutional neural network (CNN)-based methods generally transform original images into patches to accommodate the fixed-size input of a CNN, which often alters the image data and introduces noise into the neural network. This paper aims to solve the above problems by adopting new strategies and proposes a novel NR-IQA method based on a deep CNN. Specifically, first, we obtain image data with diverse image content, multiple image sizes, and reasonable distortions by crawling, filtering, and degrading numerous publicly licensed high-quality images from the Internet. Then, we score all the images using an excellent full-reference (FR) IQA algorithm, thereby artificially constructing a large objective IQA database. Next, we design a deep CNN that can accept input images of original sizes from our database instead of patches, and we train the model with the FR-IQA index as the training objective, thus proposing an opinion-unaware (OU) NR-IQA method. Finally, the experimental results show that our method achieves excellent performance: it outperforms state-of-the-art OU NR-IQA models and is comparable to most traditional opinion-aware NR-IQA methods, and even to some FR-IQA methods, on standard subjective IQA databases.
Video quality assessment based on LOG filtering of videos and spatiotemporal slice images
Center-surround receptive fields, which can be well simulated by the Laplacian of Gaussian (LOG) filter, have been found in the cells of the retina and lateral geniculate nucleus (LGN). With center-surround receptive fields, the human visual system (HVS) can reduce visual redundancy by extracting the edges and contours of objects. Furthermore, current research on image quality assessment (IQA) has shown that human perception of image quality can be estimated by the degree of correlation between the extracted perceptual-aware features of the reference and test images. Thus, this paper assesses the quality of a video by measuring the similarity of perceptual-aware features from LOG filtering between the test video and the reference video.

Considering that the spatial and temporal channels of the human visual system both include the second derivative of the Gaussian function, we first construct a three-dimensional LOG (3D LOG) filter to simulate the human visual filter and to extract the perceptual-aware features for the design of VQA algorithms. Moreover, since correlation measurement based on 2D LOG filtering of video spatiotemporal slice (STS) images can capture the distortion of spatiotemporal motion structure accurately and effectively, we also apply 2D LOG filtering to the video STS images and use maximum pooling over the distortions of the vertical and horizontal STS images to improve prediction accuracy.

The performance of the proposed algorithms is validated on the LIVE VQA database. The Spearman rank correlation coefficients of the proposed algorithms are all above 0.82, which shows that our methods outperform most mainstream VQA methods.
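For reference, a 3D Laplacian-of-Gaussian kernel can be sampled directly from the analytic form of the Laplacian of a 3D Gaussian; the kernel size and scale below are placeholders and do not reflect the spatio-temporal parameters tuned in the paper.

import numpy as np

def log_kernel_3d(size=7, sigma=1.5):
    ax = np.arange(size) - (size - 1) / 2.0
    x, y, t = np.meshgrid(ax, ax, ax, indexing="ij")
    r2 = x**2 + y**2 + t**2
    g = np.exp(-r2 / (2 * sigma**2))
    log = (r2 / sigma**4 - 3.0 / sigma**2) * g   # Laplacian of a 3D Gaussian
    return log - log.mean()                      # zero mean: no response on flat regions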
No-reference video quality assessment based on spatiotemporal slice images and deep convolutional neural networks
Most learning-based no-reference (NR) video quality assessment (VQA) methods need to be trained with a large number of subjective quality scores. However, it is currently difficult to obtain a large volume of subjective scores for videos. Inspired by the success of full-reference VQA methods based on spatiotemporal slice (STS) images in extracting perceptual features and evaluating video quality, this paper adopts multi-directional video STS images, which are images composed of multi-directional sections of the video data, to deal with the lack of subjective quality scores. By sampling the video STS images into image patches and adding noise to the quality labels of the patches, a successful NR VQA model based on multi-directional STS images and neural network training is proposed. Specifically, first, we select the subjective database that currently contains the largest number of real distorted videos as the test set. Second, we perform multi-directional STS extraction on the videos and sample local patches from the multi-directional STS images to augment the training sample set. Besides, we add some noise to the quality labels of the local patches. Third, a reasonable deep neural network is constructed and trained to obtain a local quality prediction model for each patch in the STS image, and the quality of an entire video is then obtained by averaging the model prediction results over the multi-directional STS images. Finally, the experimental results indicate that the proposed method tackles the insufficiency of training samples in small subjective VQA datasets and obtains a high correlation with the subjective evaluation.
An efficient stereo matching based on superpixel segmentation
Haichao Li, Ke Han
The traditional semi-global matching methods provide a good trade-off between accuracy and complexity compared with local matching methods and global matching methods; however, they still need to traverse the full disparity search range to find the best matching point. Therefore, they still have a high computational cost, especially for stereo images with a large disparity search range. We propose an efficient semi-global matching method in which the disparity search range is reduced based on 3D plane fitting. Firstly, the simple linear iterative clustering (SLIC) algorithm is adopted to segment the stereo images. Secondly, dense SIFT keypoints are extracted and matched between the left and right images. Thirdly, similar adjacent superpixels are merged based on their gray mean and variance, and for each merged region a 3D plane is fitted based on the matched keypoints. Finally, the pixel-wise disparity search range is limited to several pixels for the more-global matching method, which reduces the computational complexity and yields an accurate disparity map. Experimental results demonstrate that the computational speed of the new semi-global matching method is several times faster than that of the original method, while also offering a more accurate disparity map.
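The range-narrowing step can be pictured as a least-squares plane fit per merged region: the sparse SIFT matches give disparities d at positions (x, y), a plane d = a*x + b*y + c is fitted, and the per-pixel search range is confined to a small margin around the plane's prediction. A minimal sketch (the margin and array layouts are assumptions):

import numpy as np

def plane_disparity_range(match_xy, match_disp, region_xy, margin=3):
    # match_xy: (M,2) keypoint positions, match_disp: (M,) their disparities
    # region_xy: (N,2) pixel positions of one merged superpixel region
    A = np.c_[match_xy, np.ones(len(match_xy))]
    (a, b, c), *_ = np.linalg.lstsq(A, match_disp, rcond=None)   # fit d = a*x + b*y + c
    d_pred = region_xy @ np.array([a, b]) + c
    return np.clip(d_pred - margin, 0, None), d_pred + margin    # narrowed search range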
Computational Optics
Optical coding of SPAD array and its application in compressive depth and transient imaging
Qilin Sun, Xiong Dun, Yifan Peng, et al.
Time-of-flight depth imaging and transient imaging are two imaging modalities that have recently received a lot of interest. Despite much research, existing hardware systems are limited either in terms of temporal resolution or are prohibitively expensive. Arrays of single photon avalanche diodes (SPADs) are promising candidates to fill this gap by providing higher temporal resolution at an affordable cost. Unfortunately, state-of-the-art SPAD arrays are only available at relatively small resolutions and low fill factors. Furthermore, the low fill factor leads to a more ill-posed problem when seeking to realize super-resolution imaging with a SPAD array. In this work, we target hand-crafting the optical structure of the SPAD array to enable its super-resolution design. We particularly investigate optical coding for SPAD arrays, including improving the fill factor of the SPAD array by assembling microstructures and directly modulating the light using a diffractive optical element. Part of this design work has been applied in our recent advances, and here we show several applications in depth and transient imaging.
Lensless wide-field imaging and depth sensing through a turbid scattering layer by round-trip field estimation
Wide-field depth-resolved imaging through scattering media has been a long-standing problem. In this paper, we propose a reference-less compact imaging physical model, in which the 3D light field data embedded in the volumetric speckle stack behind a strong diffuser is explored and analyzed. By utilizing wave optics and a coherent round-trip field estimation method, the scattering matrix of the diffuser is precisely calibrated as a priori knowledge. Then, the multi-slice targets are placed between the light source and the diffuser, and a set of defocused intensity patterns is recorded for recovering the scattered object field. The real object field is extracted by inverse diffraction of the field, employing the conjugate of the calibrated scattering matrix. Wide-field imaging is verified experimentally by recording a resolution chart hidden behind a ground glass. The technique shows great potential in lensless wavefront sensing and non-reference 3D imaging.
Single-pixel depth imaging
Huayi Wang, Liheng Bian, Jun Zhang
Conventional single-pixel imaging (SPI) is unable to directly obtain the target's depth information due to the lack of depth modulation and corresponding decoding. Existing SPI-based depth imaging systems utilize multiple single-pixel detectors to capture multi-angle images, or introduce depth modulation devices such as optical gratings to achieve three-dimensional imaging. These methods require bulky systems and high computational complexity. In this paper, we present a novel and efficient three-dimensional SPI method that does not require any additional hardware compared to the conventional SPI system. Specifically, a multiplexed illumination strategy combining random and sinusoidal patterns is proposed, which is able to simultaneously encode the target's spatial and depth information into a measurement sequence captured by a single-pixel detector. To decode the three-dimensional information from the one-dimensional measurements, we build and train a deep convolutional neural network. The end-to-end framework largely accelerates reconstruction speed, reduces computational complexity and improves reconstruction precision. Both simulations and experiments validate the method's effectiveness and efficiency for depth imaging.
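Conceptually, each single-pixel measurement is the inner product of one illumination pattern with the scene; a mixed sequence of random binary and sinusoidal patterns can be simulated as below (pattern counts, sizes and the specific sinusoid family are illustrative assumptions, not the paper's design).

import numpy as np

def spi_measurements(scene, n_random=256, n_sine=64, seed=0):
    size = scene.shape[0]
    rng = np.random.default_rng(seed)
    patterns = [rng.integers(0, 2, (size, size)).astype(float) for _ in range(n_random)]
    xs = np.arange(size)
    for k in range(1, n_sine + 1):                          # horizontal sinusoidal fringes
        fringe = 0.5 + 0.5 * np.cos(2 * np.pi * k * xs / size)
        patterns.append(np.tile(fringe, (size, 1)))
    y = np.array([(p * scene).sum() for p in patterns])     # bucket-detector readings
    return np.stack(patterns), y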
Computational Acquisition and Analysis I
Joint-designed achromatic diffractive optics for full-spectrum computational imaging
Diffractive optical elements (DOEs) are promising lens candidates in computational imaging because they can drastically reduce the size and weight of imaging systems. Their inherent strong dispersion hinders the direct use of DOEs in full-spectrum imaging, causing an unacceptable loss of color fidelity. State-of-the-art methods for designing diffractive achromats either rely on hand-crafted point spread functions (PSFs) as the intermediate metric, or frame a differentiable end-to-end design pipeline that models a 2D lens with limited pixels and only a few wavelengths.

In this work, we investigate the joint optimization of an achromatic DOE and image processing using a fully differentiable optimization model that maps the actual source image to the reconstructed one. This model includes a wavelength-dependent propagation block, a sensor sampling block, and an image processing block. We jointly optimize the physical height of the DOE and the parameters of the image processing block to minimize the errors over a hyperspectral image dataset. We simplify the rotationally symmetric DOE to a 1D profile to reduce the computational complexity of 2D propagation. The joint optimization is implemented using the automatic differentiation of TensorFlow to compute parameter gradients. Simulation results show that the proposed joint design outperforms conventional methods in preserving image fidelity.
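The joint-design pattern itself is easy to state in code: physical parameters and processing parameters sit in one computational graph and receive gradients from the same reconstruction loss. The self-contained toy below uses a 1D blur kernel as a stand-in for the DOE and a second kernel as a stand-in for the processing block; it only shows the TensorFlow auto-differentiation loop, not the paper's propagation model.

import tensorflow as tf

tf.random.set_seed(0)
signal = tf.random.uniform([1, 128, 1])                       # stand-in for a hyperspectral line
optical = tf.Variable(tf.random.normal([9, 1, 1]) * 0.1)      # stands in for the DOE height profile
processing = tf.Variable(tf.random.normal([9, 1, 1]) * 0.1)   # stands in for the restoration block
opt = tf.keras.optimizers.Adam(1e-2)

for step in range(200):
    with tf.GradientTape() as tape:
        blurred = tf.nn.conv1d(signal, optical, stride=1, padding="SAME")       # "optics"
        restored = tf.nn.conv1d(blurred, processing, stride=1, padding="SAME")  # "processing"
        loss = tf.reduce_mean(tf.square(restored - signal))
    grads = tape.gradient(loss, [optical, processing])
    opt.apply_gradients(zip(grads, [optical, processing]))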
Hole filling algorithm for image array of one-dimensional integrated imaging
Compared to traditional integrated imaging, the one-dimensional integrated imaging system abandons the stereoscopic effect in the vertical direction, leaving only the stereoscopic effect in the horizontal direction. While satisfying the stereoscopic perception, it greatly reduces the storage space of data and alleviates the large attenuation of resolution caused by integrated imaging. However, at present, there is no 3D film source suitable for a one-dimensional integrated imaging system. In order to address this problem, we propose an array image generation and padding algorithm based on a one-dimensional integrated imaging system. First, based on the theory of geometric optics and DIBR (depth-image based rendering), we use depth maps to simulate viewpoint images at arbitrary locations. In the process, we classify the points mapped to the viewpoint image to make the generated viewpoint image more accurate. Secondly, when conducting hole filling, we first use the optical flow method to fill the large holes inside the image. We extract the edge of the hole, compare the depth values on both sides of the edge, and estimate the depth value of the hole by taking the optical flow value on the larger side, so as to calculate the mapping block of the hole on the reference frame. Finally, the other holes are filled using the Criminisi image restoration algorithm.
Computational Acquisition and Analysis II
High-SNR single-pixel phase imaging in the UV+VIS+NIR range
Meng Li, Liheng Bian, Jun Zhang
Phase imaging observes the phase of light that has interacted with the target. Conventional phase imaging methods such as interferometry employ two-dimensional sensors for image capture, resulting in a limited spectral range and low signal-to-noise ratio (SNR). Single-pixel imaging (SPI) provides an alternative solution for high-SNR acquisition of target information over a wide spectral range. However, conventional SPI can only reconstruct light intensity without phase. Existing phase imaging methods using a single-pixel detector require phase modulation, leading to low light efficiency, slow modulation speed and poor noise robustness. In this paper, we propose a novel single-pixel phase imaging method without phase modulation. First, binary intensity modulation is applied, which provides a simplified optical setup and high light efficiency. Second, inspired by phase-retrieval theory, we derive a joint optimization algorithm to reconstruct both the amplitude and phase information of the target from the intensity measurements collected by a single-pixel detector. Both simulations and experiments demonstrate that the proposed method has high SNR, high frame rate, wide spectral range (UV+VIS+NIR) and strong noise robustness. The method can be widely applied in optics, materials science and life science.
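To make the measurement model concrete: with binary patterns p_i and an unknown complex field x, each bucket reading can be modeled as y_i = |<p_i, x>|^2, and both amplitude and phase of x can then be fitted by gradient descent on that model. The toy below is only an illustration of this phase-retrieval-style fit, not the authors' joint optimization algorithm.

import torch

torch.manual_seed(0)
n, m = 256, 1024
patterns = torch.randint(0, 2, (m, n)).float()
true_field = torch.rand(n) * torch.exp(1j * 2 * torch.pi * torch.rand(n))
y = (patterns.to(true_field.dtype) @ true_field).abs() ** 2        # simulated bucket readings

re = torch.randn(n, requires_grad=True)                            # real part of the estimate
im = torch.randn(n, requires_grad=True)                            # imaginary part
opt = torch.optim.Adam([re, im], lr=0.05)
for _ in range(3000):
    opt.zero_grad()
    x = torch.complex(re, im)
    loss = (((patterns.to(x.dtype) @ x).abs() ** 2 - y) ** 2).mean()
    loss.backward()
    opt.step()
amplitude, phase = torch.complex(re, im).detach().abs(), torch.complex(re, im).detach().angle()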
Underwater image color correction algorithm based on scattering statistical characteristics
Qichao Shi, Zongju Peng, Fen Chen, et al.
Different wavelengths of light experience different attenuation during underwater transmission, and the attenuation of the same wavelength of light is also inconsistent in different waters. This loss of spectral information causes varying degrees of color distortion in underwater images. In this paper, a novel underwater color correction algorithm based on scattering statistical characteristics is proposed. The algorithm is based on the fact that scattering exists both in the atmosphere and underwater. Firstly, we statistically analyze three kinds of images and summarize their scattering characteristics. Secondly, a novel spectral compensation strategy is proposed to correct color distortion according to the scattering characteristics. Finally, compared with existing underwater image processing algorithms, the proposed algorithm achieves better subjective and objective results. The algorithm is robust on a large set of real underwater images.
Effective 3D object reconstruction from densely sampled circular light fields
Zhengxi Song, Libing Yang, Qi Wu, et al.
Circular light field imaging is based on images taken along a regular circle with equal spacing. The orientation information in epipolar plane images (EPIs) provides a strong depth cue for the 3D reconstruction task. However, EPIs in circular light fields show slightly distorted sinusoidal trajectories in 3D space. Rather than analyzing such spiral lines with 2D image processing methods, we present an algorithm formulated in 3D. By applying 3D Canny edge detection to densely sampled circular light fields, we obtain a 3D point cloud in the image cube. Furthermore, we utilize the structure tensor to analyze the disparity information in this 3D data. Finally, we build two Hough spaces to reconstruct the depth information and obtain an accurate 3D object. Compared with state-of-the-art image-based 3D reconstruction methods, experimental results show that our method obtains improved reconstruction quality on synthetic data.
No-reference quality assessment for synthesized images based on local geometric distortions
Xiaoyan Ma, Fen Chen, Wenhui Zou, et al.
Depth-image-based rendering (DIBR) techniques are significant for view synthesis. However, such techniques may introduce challenging distortions. Unlike traditional uniform artifacts, the distortions of the synthesized images can be local and non-uniform, and are thus challenging for traditional image quality assessment metrics. To tackle this problem, aiming at the geometric distortions, a no-reference quality assessment method for DIBR-synthesized images is proposed in this paper. First, considering that the hue distribution of disoccluded regions differs from that of natural images, the disoccluded regions are extracted from the hue difference map. Disoccluded regions of different sizes are calculated adaptively by overlapping the hue difference map according to the distortion intensity, based on the progressive layer partitioning principle. Second, edge artifacts are measured as the distance between patches at critical regions and their down-sampled versions, based on the property of scale invariance. Finally, the perceptual quality is estimated by linearly pooling the scores of the two geometric distortions together. The experimental results show that the PLCC, SRCC and RMSE of the proposed model are 0.7613, 0.6965, and 0.4244, respectively. In summary, the proposed metric achieves higher performance and lower computational complexity than other models.
Light field planar homography and its application
Qi Zhang, Xue Wang, Qing Wang
Light field planar homography is essential for light field camera calibration and light field raw data rectification. However, most previous research treats the light field camera as a pinhole camera array and deduces the planar homography matrix of the sub-aperture images, which is analogous to the traditional method. In this paper, we regard the light field as a whole and present a novel light field planar homography matrix based on the multi-projection-center (MPC) model. The projections of points, lines and conics are exploited based on the light field planar homography. In addition, a camera calibration method and a homography estimation method are proposed to verify the light field planar homography. Experimental results on light field datasets verify the performance of the proposed methods.
Solving computer vision tasks with diffractive neural networks
Modern computer vision tasks are achieved by first capturing and storing large-scale images and then performing the processing electronically, a paradigm whose speed and power efficiency are fundamentally limited as data throughput and computational complexity continue to increase. We propose to build an all-optical artificial intelligence for light-speed computing, which performs advanced computer vision tasks during imaging so that the detector can directly measure the computed results. The proposed method uses the diffraction of light to build an optical neural network, where the neuron function is achieved by tuning the optical diffraction with a nonlinear threshold. Since every target scene has different frequency components, the proposed diffractive neural network is trained to perform various filtering operations on different frequency components and achieves different transform functions for the target scenes. We demonstrate that the proposed approach can be used for high-speed detection and segmentation of visually salient objects in microscopic samples and macroscopic scenes, as well as for object classification. The low power consumption, light-speed processing, and high-throughput capability of the proposed approach can serve as significant support for high-performance computing and will find applications in self-driving automobiles, video monitoring, intelligent microscopy, etc.
Computer Vision I
Abnormal events detection method for surveillance video using an improved autoencoder with multi-modal input
This paper introduces an algorithm to solve the anomalous behavior detection problem in surveillance video through an improved autoencoder with multimodal inputs. The network uses 3D convolution and 3D deconvolution, and the decoder adds the feature map of the corresponding encoder layer at specific layers to enhance image detail. Taking the RGB frames and the optical flow as inputs, abnormality scores are calculated from the reconstruction error to locate the abnormal segment. In experiments conducted on the CUHK Avenue dataset, the UCSD Pedestrian dataset and the Behave dataset, our approach performs best compared with the original approach. While improving the AUC, the use of unsupervised learning saves a lot of labeling time, which is more in line with the diversity and unpredictability of abnormal behavior in real life.
Multispectral demosaicing via non-local low-rank regularization
Yugang Wang, Liheng Bian, Jun Zhang
Demosaicing is an essential technique in filter array (FA) based color and multispectral imaging. It aims to recover missing pixels at different spectral bands. Existing methods are limited to specific FAs and local regularization. To enhance generalization to different FA structures and improve reconstruction quality, we present a non-local low-rank regularized demosaicing method based on the non-local grouped sparsity of natural images. Specifically, the optimization model consists of two parts: the regularization term of the image formation model, and the low-rank term of non-local grouped image patches. The two terms ensure that noise and distortion are removed while image details are preserved. The model is solved by weighted nuclear norm minimization within the alternating direction method of multipliers (ADMM) framework. Experiments validate that the proposed algorithm has good generalization performance for both different FA patterns and channel numbers. The reconstruction accuracy is improved compared with existing demosaicing algorithms.
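The low-rank step on each group of similar patches is typically a weighted singular-value shrinkage: larger singular values (the dominant structure) are shrunk less than smaller ones (mostly noise). A simplified sketch of that operation, with an illustrative weighting constant:

import numpy as np

def weighted_nuclear_shrinkage(patch_group, c=2.8, eps=1e-8):
    # patch_group: matrix whose columns are vectorized similar patches
    u, s, vt = np.linalg.svd(patch_group, full_matrices=False)
    weights = c / (s + eps)                  # big singular values get small thresholds
    s_hat = np.maximum(s - weights, 0.0)     # weighted soft-thresholding
    return (u * s_hat) @ vt                  # low-rank estimate of the group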
Attention-guided GANs for human pose transfer
Jinsong Zhang, Yuyang Zhao, Kun Li, et al.
This paper presents a novel generative adversarial network for the task of human pose transfer, which aims at transferring the pose of a given person to a target pose. In order to deal with pixel-to-pixel misalignment due to the pose differences, we introduce an attention mechanism and propose Pose-Guided Attention Blocks. With these blocks, the generator can learn how to transfer the details from the conditional image to the target image based on the target pose. Our network can make the target pose truly guide the transfer of features. The effectiveness of the proposed network is validated on the DeepFashion and Market-1501 datasets. Compared with state-of-the-art methods, our generated images are more realistic with better facial details.
Supervoxel based point cloud segmentation algorithm (Withdrawal Notice)
Fujing Tian, Hua Chen, Mei Yu, et al.
Publisher's Note: This paper, originally published on 18 November 2019, was withdrawn on 26 February 2020 for ethics violations.
Intermediate deep-feature compression for multitasking
Collaborative intelligence is a new strategy for deploying deep neural network models on AI-based mobile devices, running part of the model on the mobile device to extract features and the rest in the cloud. In this case, feature data rather than the raw image needs to be transmitted to the cloud, and the features uploaded to the cloud need to have the generalization capability to complete multiple tasks. To this end, we design an encoder-decoder network to obtain intermediate deep features of an image, and propose a method to make the features complete different tasks. Finally, we use a lossy compression method for the intermediate deep features to improve transmission efficiency. Experimental results show that the features extracted by our network can complete input reconstruction and object detection simultaneously. Besides, with the deep-feature compression method proposed in our work, the reconstructed image quality is good both visually and in quantitative assessment, and object detection also achieves good accuracy.
Computer Vision II
Interactive gigapixel video streaming via multiscale acceleration
Zhan Ma, Peiyao Guo, Yu Meng, et al.
Immersive video applications, in which users freely navigate within a virtualized 3D environment for entertainment, productivity, training, etc., are growing fast. Fundamentally, such a system can be facilitated by an interactive gigapixel video streaming (iGVS) platform, from array camera capture to end-user interaction. This interactive system demands a large amount of network bandwidth to sustain reliable service provisioning, hindering its mass-market adoption. Thus, we propose to segment the gigapixel scene into non-overlapping spatial tiles. Each tile covers only a sub-region of the entire scene. One or more tiles are used to represent the instantaneous viewport of interest to a specific user. Tiles are then encoded at a variety of quality scales using various combinations of spatial, temporal and amplitude resolutions (STAR), which are typically encapsulated into temporally aligned tile video chunks (or simply chunks). Chunks at different quality levels can be processed in parallel for real-time purposes. With such a setup, diverse chunk combinations can be simultaneously accessed by heterogeneous users according to their requests, and viewport-adaptive content navigation in an immersive space can be realized by adapting the multiscale chunks properly under the bandwidth constraints. A series of computational vision models, measuring the perceptual quality of the viewport video in terms of its quality scales, adaptation factors and peripheral vision thresholds, is devised to prepare and guide the chunk adaptation for the best perceptual quality index. Furthermore, in response to the time-varying network, a deep reinforcement learning (DRL) based adaptive real-time streaming (ARS) scheme is developed, learning future decisions from historical network states to maximize the overall quality of experience (QoE) in a practical Internet-based streaming scenario. Our experiments reveal that the average QoE can be improved by about 60%, and its standard deviation can also be reduced by ≈30%, in comparison to the popular Google congestion control algorithm widely adopted in existing systems for adaptive streaming, demonstrating the efficiency of our multiscale accelerated iGVS for immersive video applications.
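Setting the learned policy aside, the underlying per-request decision is a rate allocation: pick a quality level for each viewport tile so that the summed bitrate fits the bandwidth budget while the perceptual value is maximized. The greedy marginal-utility sketch below is a stand-in for intuition only; the paper's DRL-based ARS scheme learns this decision from historical network states.

def allocate_tile_levels(bitrates, utilities, budget):
    # bitrates[t][l], utilities[t][l]: cost and perceptual value of tile t at level l
    levels = [0] * len(bitrates)
    spent = sum(b[0] for b in bitrates)
    while True:
        best, best_gain, best_extra = None, 0.0, 0.0
        for t, l in enumerate(levels):
            if l + 1 < len(bitrates[t]):
                extra = bitrates[t][l + 1] - bitrates[t][l]
                gain = (utilities[t][l + 1] - utilities[t][l]) / max(extra, 1e-9)
                if spent + extra <= budget and gain > best_gain:
                    best, best_gain, best_extra = t, gain, extra
        if best is None:
            return levels                     # no affordable upgrade left
        levels[best] += 1
        spent += best_extra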
Semantic image inpainting with dense and dilated deep convolutional autoencoder adversarial network
The development of generative adversarial networks (GANs) makes it possible to fill missing regions in broken images with convincing details. However, many existing approaches fail to keep the inpainted content and structures consistent with their surroundings. In this paper, we propose a GAN-based inpainting model that can restore semantically damaged images in a visually reasonable and coherent way. In our model, the generative network has an autoencoder frame and the discriminator network is a CNN classifier. Different from the classic autoencoder, we design a novel bottleneck layer in the middle of the autoencoder, which is composed of four dense-net blocks, and each block contains vanilla convolution layers and dilated convolution layers. The kernels of the dilated convolutions are spread out, resulting in an effective enlargement of the receptive field. Thus the model can capture more widely distributed semantic information to ensure the consistency of inpainted images. Furthermore, the reuse of different levels of features in each dense-net block helps the model understand the whole image better and produce a convincing result. We evaluate our model on the public datasets CelebA and Stanford Cars with randomly positioned masks of different ratios. The effectiveness of our model is verified by qualitative and quantitative experiments.
Multiple hidden-targets recognizing and tracking based on speckle correlation method
A light beam is diffused and scattered randomly when it passes through turbid media. Imaging through inhomogeneous samples, such as ground glass, is regarded as a difficult challenge. Here, we propose a method to estimate the number of hidden targets and the pose of multiple targets hidden behind a scattering medium by analyzing the distribution of the autocorrelation of the multi-target speckle. The autocorrelation of multiple targets includes two parts: the autocorrelation of each sub-target and the cross-correlations among all targets. When the targets are located in the same row, the speckle autocorrelation shows a line shape: the autocorrelation of each sub-target overlaps at the center position and the cross-correlations among them are distributed symmetrically on both sides. When the targets are distributed over different rows and columns, the speckle autocorrelation is arranged symmetrically around the center: the autocorrelation of each sub-target overlaps at the center, and the cross-correlations among sub-targets are distributed symmetrically around it. The relative locations among the targets can be estimated by calculating the distances between the cross-correlations of pairs of sub-targets. Both simulation and experimental results successfully prove that our method can reconstruct multiple targets in different columns and rows within the optical memory effect (OME) range. The method is expected to be applied to multi-target recognition, tracking and imaging through scattering media in practical applications, such as biomedical imaging, astronomical observation and military detection.
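The key computation is the speckle autocorrelation, which is convenient to obtain through the Wiener-Khinchin theorem; in the resulting map, the central peak carries the summed autocorrelations of the sub-targets while the symmetric off-center peaks are the cross-correlation terms whose separations encode the relative target positions. A minimal sketch:

import numpy as np

def speckle_autocorrelation(speckle):
    s = speckle - speckle.mean()
    power = np.abs(np.fft.fft2(s)) ** 2                  # power spectrum
    ac = np.fft.fftshift(np.real(np.fft.ifft2(power)))   # Wiener-Khinchin: IFFT of the power spectrum
    return ac / ac.max()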
Image Processing
Deep-learning for super-resolution full-waveform lidar
Full-waveform LIDAR is able to record the entire backscattered signal of each laser pulse and can thus obtain detailed information about the illuminated surface. In full-waveform LIDAR, the system resolution is restricted by the source pulse width and the bandwidth of the data acquisition device. To improve the ranging resolution of the system, we discuss a temporal super-resolution approach based on a deep learning network in this paper. In a full-waveform LIDAR system, each time the emitted laser beam hits a target it separates into a reflected echo signal and a transmitted beam, and the transmitted beam continues in the same direction as the emitted laser. When the transmitted beam reaches the ground, part of it is absorbed by the ground and the rest becomes the final echo signal. Each beam travels a different distance, and the backscattered beams are collected and digitized using low-bandwidth detectors and A/D converters. To reconstruct a super-resolution backscattered signal, we design a deep-learning framework for obtaining LIDAR data with higher resolution, inspired by the excellent performance of convolutional neural networks (CNNs) and residual networks (ResNets) in image classification and image super-resolution. Considering that both image and LIDAR data can be regarded as binary sequences that a machine can read and process in the same manner, we come up with a deep-learning architecture specially designed for super-resolution full-waveform LIDAR. After tuning the hyperparameters and training the network, we find that the deep-learning method is a feasible and suitable way to achieve super-resolution full-waveform LIDAR.
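One plausible realization of such a network is a small 1D residual CNN that maps a low-rate waveform to an upsampled one; the depth, channel count and upscaling factor below are placeholders, since the abstract does not specify the architecture.

import torch
import torch.nn as nn

class WaveformSR(nn.Module):
    def __init__(self, scale=4, channels=32, blocks=4):
        super().__init__()
        self.head = nn.Conv1d(1, channels, 9, padding=4)
        self.body = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
                          nn.Conv1d(channels, channels, 3, padding=1))
            for _ in range(blocks)])
        self.up = nn.Upsample(scale_factor=scale, mode="linear", align_corners=False)
        self.tail = nn.Conv1d(channels, 1, 9, padding=4)

    def forward(self, x):                  # x: (B, 1, T) low-resolution waveform
        feat = self.head(x)
        feat = feat + self.body(feat)      # global residual connection
        return self.tail(self.up(feat))    # (B, 1, scale*T) super-resolved waveform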
Viewport-adaptive 360-degree video coding using non-uniform tile for virtual reality communication
Yufeng Zhou, Ken Chen, Mei Yu, et al.
360° video can provide users with an immersive experience by showing an omnidirectional perspective, which is becoming more attractive to consumers. However, 360° video tends to have a higher resolution, resulting in increased bandwidth requirements for transmission. The characteristics of head-mounted displays (HMDs) provide a new approach to reducing the bandwidth cost of streaming 360° video, namely encoding the 360° video by considering the user's orientation. In this paper, we propose a novel 360° video coding method based on the characteristics of equirectangular projection (ERP) combined with user orientation behavior. Specifically, a non-uniform tile method is designed according to the principle of ERP, which also matches the viewing behavior of users watching 360° video. Additionally, appropriate coding parameters are set according to the positions of different tiles to reduce the redundancy introduced by oversampling and improve the coding efficiency. Experimental results show that the proposed method can significantly reduce the bandwidth requirement of streaming 360° video while maintaining consistent visual quality.
Cloud and snow detection from remote sensing imagery based on convolutional neural network
Hongcai Du, Kun Li, Jianhua Guo, et al.
Cloud and snow detection is one of the most important tasks in remote sensing (RS) image processing. Distinguishing cloud and snow in RS images is a challenging task. The short-wave infrared (SWIR) band has been widely used for ice/snow detection. However, due to the lack of SWIR in high-resolution multispectral images, such as ZY-3 satellite imagery, traditional SWIR-based methods are no longer practical. To address this issue, in this work we propose an effective convolutional neural network (CNN) with a multi-level/scale feature fusion module (MFFM), a channel and spatial attention module (CSAM), and an encoder-decoder network structure for cloud and snow detection from ZY-3 satellite imagery. The MFFM aggregates multiple-level/scale feature maps from the backbone network, ResNet50, to provide representative semantic feature information for cloud and snow detection. The CSAM is used to further refine the semantic feature maps output by the MFFM, giving the network better detection performance. The encoder-decoder structure allows the proposed CNN to restore detailed object boundaries, making the detection results more accurate. Experimental results on the ZY-3 satellite imagery dataset demonstrate that the proposed network can accurately detect cloud and snow, and outperforms several state-of-the-art methods.
Surface defect recognition of varistor based on deep convolutional neural networks
Surface defect recognition is one of the key technologies for varistor quality inspection, which can greatly improve detection efficiency and performance. In order to more accurately identify the surface defects of a varistor body and the pins, a method for identifying the surface defects based on deep convolutional neural networks (CNN) is proposed. The proposed method mainly includes four stages: image acquisition and data set construction, convolutional neural network modeling, CNN training and testing. Firstly, varistor images are acquired, and the body and pins of the varistor are segmented by image segmentation method. The number of samples is increased by data augmentation to make a data set of 5 classes. Secondly, according to the appearance characteristics of varistor, a CNN model is designed for varistor surface defect recognition. Third, using the created data set, the training data set with category labels are input to the proposed CNN for training. Finally, 1200 test samples were tested on the trained model in the test phase and the performance of the proposed algorithm was evaluated using mean average precision. The experimental results show that our method can identify the surface defects of the main body and pins of varistor efficiently and accurately.
Poster Session
Facial action units recognition by de-expression residue learning
Understanding human facial expressions is one of the key steps towards achieving human-computer interaction. A facial expression is a combination of an expressive component, called facial behavior, and the neutral component of a person. The most commonly used taxonomy to describe facial behaviors is the Facial Action Coding System (FACS). FACS segments the visible effects of facial muscle activation into 30+ action units (AUs). We therefore introduce a method to recognize AUs by extracting information about the expressive component through a de-expression learning procedure, called De-expression Residue Learning (DeRL). Firstly, we train a conditional generative adversarial network (cGAN) to filter out the expressive information and generate the corresponding neutral face image. Then, we use the intermediate layers, which contain the action unit information, to recognize AUs. Our work alleviates the problems of AU recognition based on pixel-level differences, which are unreliable due to variations between images (e.g., rotation, translation and lighting changes), or feature-level differences, which are also unstable because the expression information may vary with the identity information. In the experiments, we use data augmentation to avoid overfitting and train a deep network to recognize AUs on the CK+ dataset. The results reveal that our work achieves more competitive performance than several other popular approaches.
Effect of defocus blur on the signal distribution of camera-based remote photoplethysmography
Heart rate can be extracted from facial videos by camera-based remote photoplethysmography (rPPG). For a defocus blurred facial image, the edge of the face is blurred and the pixels near this region will be contaminated by the background light. In this paper, we map rPPG signal quality (rPPGSQ) on faces in videos with different degrees of defocus and propose a method to evaluate the effect of defocus blur on the signal distribution of rPPG. Our results show that the degradation factor (DF) introduced in this paper can evaluate the effect of defocus blur on rPPGSQ effectively, and provide a clear region of high-rPPGSQ that can be selected as the optimized region of interest (ROI) in rPPG applications.
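For context, a bare-bones rPPG trace from an ROI is just the frame-wise mean of a color channel, band-pass filtered to the heart-rate band; the defocus study in the paper then asks how the quality of such a trace varies across facial regions as blur increases. The snippet below is that baseline extraction only, with illustrative filter settings.

import numpy as np
from scipy.signal import butter, filtfilt

def rppg_trace(roi_frames, fps=30.0):
    g = np.array([f[..., 1].mean() for f in roi_frames])                # mean green value per frame
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")  # roughly 42-240 bpm
    return filtfilt(b, a, g - g.mean())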
Optical hash function based on the interaction between multiple scattering media and coherent radiation
An approach for constructing an optical hash function is proposed based on the interaction between multiple scattering media and coherent radiation. Unlike traditional hash functions built on mathematical transformations or complex logic operations, the proposed method employs a multiple scattering medium and Sobel filters for data scrambling and feature extraction. Input data of arbitrary length can be compressed into a fixed-length (256-bit) hash value after cascaded iterative processing. Its security relies on the unpredictable and non-duplicable disorder of the multiple scattering medium; in other words, it is extremely difficult to obtain a multiple scattering medium with a specific internal state or to efficiently simulate the interaction of light with the multiple scattering medium. Simulation results are presented to demonstrate the avalanche effect and collision resistance of the proposed design strategy for the optical hash function.
Binocular camera trap for wildlife detection
Zhongke Xu, Liang Sun, Xinwei Wang, et al.
Camera traps are commonly used in wildlife monitoring. Traditionally, camera traps only capture 2D images of wildlife moving in front of them. However, the size information of the wildlife is lost, which is vital for determining age and gender. To solve this problem, this paper develops a binocular camera trap based on stereo imaging for wildlife detection. The camera trap consists of two cameras, motion sensors, a photosensitive sensor and infrared illumination with a central wavelength of 940 nm. The motion sensors trigger the cameras when animals move past, and pictures are then captured from two different perspectives simultaneously. Meanwhile, the photosensitive sensor perceives the ambient illumination to control the infrared illumination. In this way, the camera trap provides both 2D images of wildlife and their size information obtained by binocular vision. In addition, different from normal binocular cameras placed horizontally, the two cameras are set vertically for convenience of installation and expansion of the dynamic measurement range. As verification, we develop a prototype binocular camera trap to measure a human's height of 178 cm, and the estimation error approaches 2 cm at a distance of 5 m.
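The size estimate follows from the pinhole stereo relations: depth Z = f*B/d from the disparity d, and physical height H = h_pix*Z/f from the silhouette's pixel height. The numbers in the usage comment below are made up to match the reported scale and are not measurements from the paper.

def object_height_from_stereo(disparity_px, pixel_height_px, focal_px, baseline_m):
    depth_m = focal_px * baseline_m / disparity_px      # Z = f * B / d
    return pixel_height_px * depth_m / focal_px         # H = h_pix * Z / f

# e.g. f = 1400 px, B = 0.12 m, d = 33.6 px -> Z = 5.0 m; a 500 px tall silhouette
# then gives H = 1.79 m, i.e. within about 1 cm of a 178 cm person at 5 m.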
Infrared object image instance segmentation based on improved mask-RCNN
In recent years, the traditional automotive industry has begun to enter the field of autonomous driving technology, seeking new breakthroughs. From simple pedestrian and vehicle detection, to the instance segmentation of traffic scenes, to ideal fully intelligent driving, the field has increasingly been taken over by deep learning. Researchers have tried to build a vehicle's control strategy system entirely with computers and have proposed fully convolutional neural networks that replace the fully connected structure with convolutions, realizing the transition from image classification to dense pixel prediction. This was the first step towards using neural networks for scene instance segmentation, and it is also a key step for intelligent driving. However, the performance of the fully convolutional network is not ideal. An important problem is that the pooling layers lose part of the location information while aggregating context. For dense prediction in image instance segmentation, the context (location) information of each pixel is indispensable and is very important for the final classification of pixels. Later researchers therefore proposed three different structures to address this problem: dilated (atrous) convolution structures, encoder-decoder structures, and spatial pyramid structures. This paper analyzes the working principles and characteristics of these structures and compares the differences between the various networks, and combines them with Mask R-CNN to construct a new network structure for image instance segmentation. The main innovations of this paper are as follows.

1. Introduce a generative adversarial network into the field of image segmentation. Following the conditional generative adversarial network idea, the original image is used as the input of the generator, and the generator produces the desired instance segmentation result. The original image is combined with the instance segmentation result produced by the generator, or with the manually labeled segmentation result, as the input to the discriminator. By training the network so that the discriminator cannot distinguish between the images generated by the generator and the manual annotations, the generator can produce satisfactory image segmentation results.

2. Introduce superpixel information of the image. The boundary information obtained by superpixel segmentation is fed into the generator network as a segmentation condition. For the original input image, this paper uses a superpixel segmentation method to obtain the fine contours of the image, and then stacks the superpixel segmentation result with the original image as the input of the generator network.

3. Reconstruct a new image segmentation structure. In image translation models, processing at boundaries is often difficult to do well, so this paper changes the output layer of the generator to K channels (K is the number of classes) to output the results. This paper adopts the encoder-decoder structure of DeconvNet, removes the fully connected layer to reduce the model parameters, and changes the pooling indexing method to a direct stacking structure.
Structural light 3D reconstruction algorithm based on deep learning
With the powerful learning ability of neural networks, fringe patterns can be effectively analyzed by calculating the phase of the fringe pattern. This paper proposes a combination of the phase-shifting (PS) algorithm and a structured light convolutional neural network (SL-CNN) that applies deep learning to structured light 3D reconstruction.
Spectral sensitivity estimation of color digital camera based on color checker
Hanyi Yuan, Junsheng Shi, Siyu Ning
Spectral sensitivity is an important parameter of a color digital camera, which can be used for color correction and spectral reflectance reconstruction. In this paper, a spectral sensitivity estimation method for color digital cameras based on a standard color checker is studied. Under standard illuminant A, the measured camera takes an image of a standard color checker, and the spectral sensitivity is estimated from the camera response values of the color checker patches in the image; the experiment is repeated under different illumination levels of the same source and the estimated results are compared. The experimental results show that, compared with traditional measurement methods, the estimation method in this paper can obtain higher-precision spectral sensitivity, and that the illumination level of the ambient light source affects the estimated results.
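The estimation step itself is a regularized linear inversion: each patch's response is modeled as the integral of (illuminant x reflectance x sensitivity), so stacking the patches gives a linear system in the sampled sensitivity curve. A textbook sketch of that step (Tikhonov regularization and array layouts are assumptions):

import numpy as np

def estimate_sensitivity(reflectances, illuminant, responses, lam=1e-3):
    # reflectances: (N_patches, N_wavelengths); illuminant: (N_wavelengths,)
    # responses: (N_patches,) camera values of one channel for each patch
    A = reflectances * illuminant[None, :]               # effective spectral stimuli
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ responses)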
An L0 regularized framelet based model for high-density mixed-impulse noise and Gaussian noise removal
Images are often corrupted by impulse noise due to transmission errors, malfunctioning pixel elements in camera sensors, or faulty memory locations in the imaging process. This paper proposes a new method for removing mixed impulse noise and Gaussian noise. The proposed method has two phases: the first phase identifies candidate pixels corrupted by impulse noise using median filtering, and the second phase processes the regions with impulse noise and leaves the other pixels untouched, using a mask generated by the first phase. In order to preserve sharp image structures, we propose an L0-regularized framelet model with an L1 fidelity term to recover the images. Numerical results demonstrate that the proposed method is a significant advance over several state-of-the-art techniques in restoration performance.
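Phase one can be as simple as flagging pixels that deviate strongly from their median-filtered value; a minimal sketch with an illustrative window and threshold (the paper's detector may differ):

import numpy as np
from scipy.ndimage import median_filter

def impulse_candidate_mask(img, win=3, thresh=30):
    med = median_filter(img, size=win)                               # rank-order local estimate
    return np.abs(img.astype(float) - med.astype(float)) > thresh    # True = impulse candidate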
Stereo matching using convolution neural network and LIDAR support point grid
This paper proposes a stereo matching method that uses a support point grid in order to compute the prior disparity. Convolutional neural networks are used to compute the matching cost between pixels in the two images. The network architecture is described, as well as the training process. The method was evaluated on the Middlebury benchmark images. The results of the accuracy estimation in the case of using data from a LIDAR as input for the support point grid are described. This approach can be used in multi-sensor devices and can give an accuracy advantage of up to 15%.
Measuring the point spread function of a wide-field fluorescence microscope
Yubing Ma, Qionghai Dai, Jingtao Fan
The point spread function (PSF) of a wide-field fluorescence microscope, which measures the system's impulse response, is a crucial parameter in non-blind deconvolution. To determine the PSF, traditional methods treat a fluorescent bead as a point source, whose optical field distribution it approximates. However, beads of sufficiently small size are often difficult to observe in a microscope due to their low brightness. In this paper, we present a new approach to measure the PSF under the conditions of non-ideal point sources and low signal-to-noise ratio (SNR). We first recorded a focal stack of fluorescent beads and automatically selected those that met certain requirements. Then, we computed a two-dimensional (2D) PSF for each plane at different defocus distances, some of which were fitted with a Gaussian distribution and the rest calculated mainly by averaging the beads. Finally, we combined the 2D PSFs based on the energy distribution to obtain a three-dimensional (3D) PSF. The proposed algorithm has been tested on the Real-time, Ultra-large-Scale imaging at High-resolution (RUSH) macroscope. By performing deconvolution using the PSF derived by this method and by a traditional method respectively, the results show that the proposed algorithm achieves a more accurate measurement of the PSF.
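For the planes fitted with a Gaussian, the width can be recovered from the second moment of the normalized bead image, since for an isotropic 2D Gaussian the expected squared radius equals 2*sigma^2. A simple sketch of that fitting step (background handling is simplified):

import numpy as np

def fit_gaussian_sigma(psf_slice):
    p = psf_slice.astype(float) - psf_slice.min()
    p /= p.sum()                                   # treat intensity as a probability map
    ys, xs = np.indices(p.shape)
    cy, cx = (p * ys).sum(), (p * xs).sum()        # centroid
    var = (p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum() / 2.0
    return np.sqrt(var)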
A competition-based image saliency model
Competition for visual representation is an important mechanism of selective visual attention. Traditional global-distinctiveness-based saliency models usually measure saliency by comparing the differences between image patches in various feature spaces. In this paper, we propose to use an improved neural competition model to replace this comparison. The pairwise competition responses of a patch against all other patches are summed to represent the distinctiveness of that patch. In particular, the competition response is computed by a neural competition model with a dissimilarity bias and gradient-based feature inputs. Experimental results validate that the proposed model is highly effective in saliency detection, outperforming nine state-of-the-art models.
Generation of elemental image array based on photon mapping
Zhengnan Xu, Yan Zhao, Aijia Zhang, et al.
Current algorithms for generating an elemental image array using ray tracing are complex. Although the quality of images generated by ray tracing is high, it requires a great deal of redundant computation. Therefore, in this paper we propose to use a photon mapping algorithm to generate the elemental image array and thereby reduce the amount of computation. First, for a given virtual model, a light source is set, the photon emission vectors are calculated and the photons are traced. A global photon map is used to store the energy and incident direction of the photons. Then, we trace rays that start from the viewpoint and pass through each pixel into the scene, and record the surface material at the first collision point in the photon map. The image for one viewpoint can then be generated from the light estimated with the photon map. The photon map needs to be generated only once for all images of different viewpoints. Finally, the images of the different viewpoints are converted into an elemental image array. Experimental results show that the proposed method significantly improves the generation speed of the elemental image array in complex scenes, especially when multiple images are generated. In simple scenes the proposed method may be slightly slower when generating a single image, but it still shows an advantage when generating multiple images.
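A conceptual sketch of why a single global photon map serves every viewpoint: the map is built once, and each elemental image only issues nearest-photon density queries at its ray hit points. The class names and the flat disc-area radiance estimate below are illustrative assumptions, not the paper's renderer.

# One photon map, many viewpoints: build once, query per ray hit point.
import numpy as np
from scipy.spatial import cKDTree

class PhotonMap:
    def __init__(self, positions, powers):
        self.tree = cKDTree(positions)   # photon hit positions, N x 3
        self.powers = powers             # photon powers, N x 3 (RGB)

    def radiance(self, point, k=50):
        """Standard k-nearest-photon density estimate at a surface point."""
        dist, idx = self.tree.query(point, k=k)
        r = dist.max()                                   # radius of the k-photon disc
        return self.powers[idx].sum(axis=0) / (np.pi * r * r + 1e-12)

# For every viewpoint of the elemental image array, call radiance() at each
# camera ray's first intersection point instead of re-tracing illumination.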
Brain MRI image classification based on transfer learning and support vector machine
Zimeng Li, Yan Zhao, Shigang Wang
In order to improve the accuracy of brain magnetic resonance imaging (MRI) classification and reduce classification time, this paper proposes a brain MRI image classification algorithm that combines transfer learning and a support vector machine (SVM). Firstly, three deep convolutional neural networks pre-trained on the ImageNet database (AlexNet, VGG16 and GoogLeNet) are used as feature extractors for the brain MRI images in the Harvard Medical School website database. The feature extraction process does not require fine-tuning of the transferred networks. Then, the features extracted by each convolutional neural network are concatenated to form a feature vector for each brain MRI image, which is input to the SVM for classification. Finally, the SVM classifies the brain MRI images as healthy, Alzheimer's disease, or stroke. The experimental results show that the classification accuracy reaches 100% and the classification time is only 26 seconds. Compared with brain MRI image classification algorithms reported in the literature, the accuracy of the proposed method is higher by 8.67%, 1.09% and 0.55%, respectively. The proposed method can provide effective help for the diagnosis of brain diseases.
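A minimal sketch of the fusion-and-classification step, assuming the per-network feature matrices have already been extracted from the frozen CNNs; function names, feature shapes, and SVM hyperparameters are illustrative assumptions.

# Concatenate per-network features into one vector per image, then classify with an SVM.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def classify(feat_alexnet, feat_vgg16, feat_googlenet, labels, test_feats):
    """Each feat_* is (num_images, feature_dim); test_feats is a matching list."""
    X_train = np.hstack([feat_alexnet, feat_vgg16, feat_googlenet])
    X_test = np.hstack(test_feats)
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
    clf.fit(X_train, labels)            # labels: healthy / Alzheimer's / stroke
    return clf.predict(X_test)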
Multifunctional image processor based on rank differences signals weighing-selection processing method and their simulation
This paper proposes a new iterative process for sorting an array of signals, which differs from known structures in its uniformity and versatility and allows both direct and inverse sorting of analog or digital signal arrays. Simple relational nodes are the basic elements of the proposed sorting structures. Such elements can be implemented on different element bases, for example on devices that select the maximum or minimum of two analog or digital signals, which can be realized with CMOS current mirrors and carry out the continuous-logic limited-difference function. A homogeneous sorting structure built from such elements, consisting of two layers and a multichannel sample-and-hold device, is presented. Nine signals corresponding to a selection window of a matrix sensor are fed to this structure and sorted in five iterative steps; the rank-ordered signals at the output are then passed through a code-controlled programmable multiplexer that generates an output signal corresponding to the selected rank. Technical parameters of such a relational preprocessor were evaluated. The paper considers the results of design and modeling of CL BC based on current mirrors (CM) for creating picture-type image processors (IP) with parallel matrix inputs and outputs. Such sorting nodes have a number of advantages: high speed and reliability, simplicity, low power consumption, and a high integration level. The inclusion of an iterative signal-sorting node in a modified nonlinear IP structure makes it possible to significantly simplify its design and increase the functional capabilities of such a processor. The simulation results confirm the proposed approaches to the design of iterative-type sorting nodes for analog signals. The power consumption of the processors does not exceed 2 mW, the response and processing times are about 10 μs and can be reduced by an order of magnitude, the supply voltage is 1.8 to 3.3 V, and the operating currents are optimally in the range of 10 to 20 μA. The energy efficiency of the proposed preprocessor with the iterative sorting node is 25×10^9 operations per second per watt, which corresponds to the best technical solutions. We show that, after sorting or comparative level analysis of the signals, a promising opportunity appears to implement image processors with enhanced functionality using the new method of weighting and selecting rank differences of signals. The essence of the method is that, by forming the differences between the rank-ordered signals and the upper level of their range, several resulting output signals can be formed simultaneously, choosing the required difference signals from this set according to control commands and weighting them additionally before summation. We show that this approach and processing method significantly expand the set of operations and functions for image filtering and simplify the hardware implementation of the IP, especially for analog and mixed technologies. We determine a set of basic executable instruction-functions of the processors and present simulation results obtained in Mathcad, PSpice OrCAD and other environments. We discuss a comparative evaluation of various modifications and implementation options for the processor. We analyze the new approach to programmable selection of a function or set of functions, including selection of the required rank differences of signals and their weights. We show the results of designing and modeling the proposed new FPGA implementations of the MIP.
Simulation results show that the processing time in such circuits does not exceed 25 nanoseconds. The circuits are simple and offer a low supply voltage (2.5 V), low power consumption (50 mW) and digital accuracy. Calculations show that, using an Altera Cyclone III family FPGA (EP3C16F484), it is possible to implement an MIP with register memory for a 64×64 image and a 3×3 window in a single chip. For this chip at 2.5 V and a 200 MHz clock frequency, the power consumption is about 200 mW and the per-pixel filter calculation time is about 25 ns.
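As a software model of the signal-level behavior only (not of the current-mirror or FPGA hardware): each 3×3 window is rank-ordered, and the output is formed as the window maximum minus a weighted sum of the differences between that maximum and the rank-ordered samples; the weight vector selects the realized function. Names and the weight convention are assumptions for the sketch.

# Rank-difference weighting-selection filter, modeled in numpy.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rank_difference_filter(img, weights):
    """img: H x W float array; weights: length-9 vector over ranks (ascending order)."""
    win = sliding_window_view(img, (3, 3)).reshape(img.shape[0] - 2, img.shape[1] - 2, 9)
    ranked = np.sort(win, axis=-1)                    # ranks 0..8 within each window
    diffs = ranked[..., -1:] - ranked                 # differences from the window maximum
    return ranked[..., -1] - diffs @ np.asarray(weights, float)

# Example: selecting only the median rank reproduces a 3x3 median filter:
#   w = np.zeros(9); w[4] = 1.0; out = rank_difference_filter(img, w)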
Fast bundle adjustment using adaptive moment estimation
Bundle adjustment (BA) is an important task in feature matching for multiple applications such as image stitching and position mapping. It aims to reconstruct the 8-parameter homography matrix, which is used for perspective transformation between different images. Existing algorithms such as the Levenberg-Marquardt (LM) algorithm and the Gauss-Newton (GN) algorithm require much computation and a large number of iterations. To accelerate reconstruction, we propose a novel BA algorithm based on adaptive moment estimation (Adam). The Adam solver uses the mean and uncentered variance of the gradients from previous iterations to dynamically adjust the gradient direction of the current iteration, which improves reconstruction quality and increases convergence speed. Besides, it requires only first-derivative calculations and thus has low computational complexity. Both simulations and experiments validate that the proposed method converges faster than conventional BA methods.
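A minimal numpy sketch of the idea, assuming point correspondences are given: refine the 8 homography parameters with Adam updates on the transfer (reprojection) error. A numerical gradient is used here for brevity; the paper's analytic first derivative would replace it in practice, and all names are illustrative.

# Adam refinement of an 8-parameter homography from matched point pairs.
import numpy as np

def transfer_error(h, src, dst):
    """Sum of squared reprojection errors; src, dst are N x 2 matched points."""
    H = np.append(h, 1.0).reshape(3, 3)
    p = np.c_[src, np.ones(len(src))] @ H.T
    proj = p[:, :2] / p[:, 2:3]
    return np.sum((proj - dst) ** 2)

def adam_refine(h0, src, dst, lr=1e-3, iters=2000, eps=1e-8):
    """h0: initial 8 parameters (e.g., from a direct linear transform estimate)."""
    h, m, v = np.asarray(h0, float).copy(), np.zeros(8), np.zeros(8)
    for t in range(1, iters + 1):
        # Central-difference gradient; an analytic Jacobian is cheaper in practice.
        g = np.array([(transfer_error(h + d, src, dst) -
                       transfer_error(h - d, src, dst)) / 2e-6
                      for d in np.eye(8) * 1e-6])
        m = 0.9 * m + 0.1 * g                      # first moment (mean of gradients)
        v = 0.999 * v + 0.001 * g * g              # second moment (uncentered variance)
        m_hat, v_hat = m / (1 - 0.9 ** t), v / (1 - 0.999 ** t)
        h -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return np.append(h, 1.0).reshape(3, 3)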
A Lite Asymmetric DenseNet for effective object detection based on convolutional neural networks (CNN)
Recently, convolutional neural networks (CNN) have been widely used in object detection and image recognition for their effectiveness. Many highly accurate classification models based on CNNs have been developed for various machine learning applications, but they are generally computationally costly and require a hardware platform with substantial computing power and memory resources to implement. In order to accurately and efficiently perform object detection with a CNN on a system with limited resources, such as a mobile device, we propose an innovative type of DenseNet: a lightweight convolutional neural network called Lite Asymmetric DenseNet (LA-DenseNet). To compress the CNN model complexity, we replace the 7×7 convolution and 3×3 max-pooling with multiple 3×3 convolutions and a 2×2 max-pooling in the initial down-sampling process, which significantly reduces the computing cost. In the design of the dense blocks, channel splitting and channel shuffling are employed to enhance the information exchange between feature maps and improve the expressive ability of the network. We decompose the 3×3 convolution in the dense block into a combination of 3×1 and 1×3 convolutions, which speeds up the computation and extracts more spatial features through asymmetric convolutions. To evaluate the performance of the proposed approach, we develop an experimental system in which LA-DenseNet is used to extract features and the Single Shot MultiBox Detector (SSD) is used to detect objects. With VOC2007+12 as the training and testing datasets, our model achieves detection accuracy comparable to YOLOv2 with a fraction of its computational cost and memory usage.
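An illustrative PyTorch sketch of the asymmetric decomposition described above, with a 3×3 convolution replaced by a 3×1 followed by a 1×3 convolution; this is not the released LA-DenseNet code, and the layer arrangement is an assumption.

# Asymmetric convolution block: 3x1 then 1x3 in place of a single 3x3.
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3x1 = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0), bias=False)
        self.conv1x3 = nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1), bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv1x3(self.conv3x1(x))))

# Compared with a single 3x3 convolution (9 * C_in * C_out weights), the pair
# uses 3 * C_in * C_out + 3 * C_out * C_out weights, which is smaller whenever
# the channel counts are moderately large.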
Image inpainting using layered fusion and exemplar-based
Based on the original exemplar-based Criminisi algorithm, we propose two improvements to the image inpainting result. First, to address the problem that the matching block found in the optimal-block search may not be optimal, this paper proposes a fusion repair strategy: the first n blocks found in the optimal-block search are selected as matching blocks, and their weighted average is then used as the content of the target block to be repaired. Second, considering the size of the block to be repaired, a layered repair strategy is adopted: the image to be repaired is first downsampled to obtain images at different scales, and the repair then proceeds from the topmost (coarsest) image. The experimental results show that the proposed algorithm improves the repair quality both subjectively and objectively.
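A sketch of the fusion step only, under the usual SSD patch-matching assumption: pick the n best-matching source patches on the known pixels and fill the hole with their weighted average. The priority computation and the multi-scale (layered) pass are omitted, and the inverse-SSD weights are an illustrative choice.

# Fusion repair of one target patch from its top-n matching source patches.
import numpy as np

def fuse_best_patches(target, mask, candidates, n=5):
    """target: patch with holes; mask: True where pixels are known;
    candidates: list of same-size source patches from the intact region."""
    ssd = np.array([np.sum((c[mask] - target[mask]) ** 2) for c in candidates])
    best = np.argsort(ssd)[:n]
    w = 1.0 / (ssd[best] + 1e-8)                     # closer matches get larger weights
    w /= w.sum()
    fused = np.tensordot(w, np.stack([candidates[i] for i in best]), axes=1)
    return np.where(mask, target, fused)             # keep known pixels, fill the rest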
BNU-LCSAD: a video database for classroom student action recognition
With the development and application of digital cameras, especially in education, a great number of digital video recordings are produced in classrooms. Taking Beijing Normal University as an example, 3.4 TB of video is recorded every day in more than 200 classrooms. Such a huge amount of data allows computer vision researchers to automatically recognize students' classroom actions and even evaluate the quality of classroom teaching. To focus action recognition on students, we propose the Beijing Normal University Large-scale Classroom Student Action Database version 1.0 (BNU-LCSAD), the first large-scale classroom student action database for student action recognition; it consists of 10 classroom student action classes drawn from digital camera recordings at BNU. We describe the construction and labeling process of this database in detail. In addition, we provide baseline student action recognition results on the new database using the C3D network.
Improvement of semi-supervised learning in real application scenarios
Due to the high demand of deep learning for data quantity, semi-supervised learning (SSL) has a very important application prospect because of its successful use of unlabeled data. Existing SSL algorithms have achieved high accuracy on the MNIST, CIFAR-10 and SVHN datasets, and even outperform fully supervised algorithms. However, because these three datasets have balanced categories and simple recognition tasks, characteristics that cannot be ignored for classification problems, the effectiveness of SSL algorithms remains uncertain for unbalanced datasets and specific recognition tasks. We analyze the datasets and find that the "disgust" class in the expression dataset and the "discussion" class in the classroom action recognition dataset are both underrepresented compared with the other categories. Therefore, we use a novel SSL model, Deep Co-Training (DCT), to experiment on the expression recognition database (FER2013) as well as our own classroom student action database (BNU-LCSAD), and analyze the effectiveness of the algorithm in these specific application scenarios. Moreover, we use the TSA training strategy when training our model to mitigate the overfitting that is more likely to occur when data categories are unbalanced. The experimental results demonstrate the effectiveness of the SSL algorithm in practical applications and the significance of using TSA.
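Assuming TSA here refers to the training-signal-annealing idea used in recent semi-supervised work (the abstract does not expand the acronym), a minimal sketch of the usual threshold schedule: labeled examples whose predicted probability for the true class already exceeds a gradually rising threshold are masked out of the supervised loss, which slows overfitting to scarce classes. The schedule forms below are illustrative.

# Training-signal-annealing threshold as a function of training progress.
import math

def tsa_threshold(step, total_steps, num_classes, schedule="exp"):
    t = step / max(total_steps, 1)
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1.0 - math.exp(-5.0 * t)
    else:                                   # "exp": release the threshold late in training
        alpha = math.exp(5.0 * (t - 1.0))
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes

# During training, drop the supervised loss of any labeled sample whose
# true-class probability is already above tsa_threshold(step, total_steps, K).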
Efficient spectral confocal meta-lens in the near infrared
Spectral confocal technology is an important non-contact, high-accuracy three-dimensional measurement technique; however, a traditional spectral confocal system usually consists of prisms and several lenses, making it bulky and heavy, and, owing to the chromatic aberration of ordinary optical lenses, it is difficult to focus light perfectly over a wide bandwidth. Metasurfaces are expected to miniaturize conventional optical elements thanks to their superb ability to control the phase and amplitude of an incident wavefront at the subwavelength scale. In this paper, an efficient spectral confocal meta-lens (ESCM) working in the near-infrared spectrum (1300 nm to 2000 nm) is proposed and numerically demonstrated. The ESCM focuses incident light at different focal lengths from 16.7 to 24.5 μm along a perpendicular off-axis focal plane, with the NA varying from 0.385 to 0.530. The meta-lens consists of a group of Si nanofins providing a polarization conversion efficiency larger than 50%, and the phase required for focusing the incident light is well reconstructed by the resonant phase, which is proportional to frequency, together with the wavelength-independent geometric (Pancharatnam-Berry, PB) phase. Such dispersive components can also be used in instruments requiring dispersive devices, such as spectrometers.
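For background, a small numpy sketch of the standard hyperbolic focusing phase that any flat lens must impart for a given wavelength and focal length; this is textbook metasurface background rather than the authors' design code, and the off-axis displacement and the Si-nanofin geometry that realize the phase are not modeled.

# Ideal focusing phase profile of a flat (meta-)lens.
import numpy as np

def focusing_phase(x, y, lam, f):
    """Required phase (radians) at in-plane position (x, y); all lengths in metres."""
    return (2 * np.pi / lam) * (f - np.sqrt(x**2 + y**2 + f**2))

# Example: wrapped phase map over a 20 um aperture for lam = 1.55 um, f = 20 um.
xs = np.linspace(-10e-6, 10e-6, 201)
X, Y = np.meshgrid(xs, xs)
phi = np.mod(focusing_phase(X, Y, 1.55e-6, 20e-6), 2 * np.pi)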