Proceedings Volume 10995

Pattern Recognition and Tracking XXX

Purchase the printed version of this volume at proceedings.com or access the digital version at SPIE Digital Library.

Volume Details

Date Published: 22 August 2019
Contents: 7 Sessions, 25 Papers, 21 Presentations
Conference: SPIE Defense + Commercial Sensing 2019
Volume Number: 10995

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 10995
  • Novel Pattern Recognition Techniques
  • Target Tracking and Classification
  • Correlation Based Recognition
  • Deep Learning Based Recognition
  • Advanced Recognition Techniques
  • Poster Session
Front Matter: Volume 10995
This PDF file contains the front matter associated with SPIE Proceedings Volume 10995, including the title page, copyright information, table of contents, and author and committee lists.
Novel Pattern Recognition Techniques
Improved visible to IR image transformation using synthetic data augmentation with cycle-consistent adversarial networks
Infrared (IR) images are essential for improving the visibility of dark or camouflaged objects. Object recognition and segmentation based on a neural network using IR images can provide more accuracy and insight than color visible images, but the bottleneck is the number of relevant IR images available for training. It is difficult to collect real-world IR images for special purposes, including space exploration, military, and fire-fighting applications. To address this problem, we created color visible and IR images using a Unity-based 3D game editor. These synthetically generated color visible and IR images were used to train cycle-consistent adversarial networks (CycleGAN) to convert visible images to IR images. CycleGAN has the advantage that it does not require precisely matched visible/IR pairs for transformation training. In this study, we found that additional synthetic data can help improve CycleGAN performance. Training using real data alone (N = 20) produced more accurate transformations than training using a combination of real (N = 10) and synthetic (N = 10) data, indicating that synthetic data cannot exceed the quality of real data. Training using real (N = 10) and synthetic (N = 100) data showed almost the same performance as training using real data alone (N = 20); roughly ten times more synthetic data than real data was required to achieve the same performance. In summary, synthetic data can be used with CycleGAN to improve the visible-to-IR conversion performance.
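As a rough illustration of the real/synthetic data-mixing idea described in this abstract, the following minimal Python sketch builds a combined training pool for unpaired CycleGAN training; the directory layout, file extension, and function name are hypothetical, and the N = 10 / N = 100 split mirrors the configuration reported above.

```python
# Minimal sketch of mixing scarce real IR frames with abundant synthetic
# ones; paths, counts, and names are illustrative assumptions.
import random
from pathlib import Path

def build_training_pool(real_dir, synth_dir, n_real=10, n_synth=100, seed=0):
    """Combine real and synthetic IR images into one unpaired pool."""
    rng = random.Random(seed)
    real = sorted(Path(real_dir).glob("*.png"))[:n_real]
    synth = sorted(Path(synth_dir).glob("*.png"))[:n_synth]
    pool = list(real) + list(synth)
    rng.shuffle(pool)  # CycleGAN training is unpaired, so order is irrelevant
    return pool
```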
Multi-task deep learning architecture for semantic segmentation in EO imagery (Conference Presentation)
Semantic segmentation deep learning architectures provide impressive results in segmentation and classification of various scenes. These convolutional networks create deep representations for classification and have extended connected weight sets to improve the boundary characteristics of segmentation. We propose a multi-task architecture for these deep learning networks to further improve their boundary characteristics. Basic edge detection architectures are able to develop good boundaries but cannot fully characterize the necessary boundary information in the imagery. We supplement these deep neural network architectures with specific boundary information to remove boundary features that are not indicative of the boundaries of the classified regions. We use standard semantic segmentation datasets, such as Cityscapes and the MIT Scene Parsing Benchmark, to test and evaluate the network architectures. Compared to the original architectures, we observe an increase in segmentation accuracy and boundary recreation using this approach. The incorporation of multi-task learning helps improve the semantic segmentation results of the deep learning architectures.
Neural network in a multi-agent system for line detection task in images
Lines are among the most informative structural elements in images. For this reason, object detection and recognition problems are often reduced to an edge-detection task. The Radon and Hough transforms are widely used for straight-line detection; however, these methods estimate only the parameters of an infinite straight line, not of a line segment. To address this, it is proposed to split the image into square fragments (blocks) in which straight-line segments are detected. A multi-agent system is used to combine segments into curves and to drop false detections. The use of artificial neural networks (NN) to program part of the agent behavior in the multi-agent system is the main theme of this work.
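The block-splitting step lends itself to a short sketch. The following Python fragment uses OpenCV's probabilistic Hough transform as a stand-in per-block segment detector (the paper's own detector is not specified here); the block size and Hough thresholds are illustrative assumptions.

```python
# Sketch: detect straight-line segments per square block of an edge map
# (e.g., from cv2.Canny); agents would then link segments into curves.
import cv2
import numpy as np

def detect_block_segments(edges, block=64):
    segments = []
    h, w = edges.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            tile = edges[y:y + block, x:x + block]
            lines = cv2.HoughLinesP(tile, 1, np.pi / 180, threshold=30,
                                    minLineLength=20, maxLineGap=5)
            if lines is not None:
                for x1, y1, x2, y2 in lines[:, 0]:
                    # shift back to global image coordinates
                    segments.append((x1 + x, y1 + y, x2 + x, y2 + y))
    return segments
```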
Target tracking and classification directly in compressive measurement for low quality videos
Although compressive measurements save data storage and bandwidth, they are difficult to use directly for target tracking and classification without pixel reconstruction, because a Gaussian random measurement matrix destroys the target location information in the original video frames. This paper summarizes our research on target tracking and classification performed directly in the compressive measurement domain. We focus on one type of compressive measurement based on pixel subsampling: the original pixels in the video frames are randomly subsampled to generate the compressive measurements. Even in this special compressive sensing setting, conventional trackers do not perform satisfactorily. We propose a deep learning approach that integrates YOLO (You Only Look Once) for multiple target tracking and ResNet (residual network) for target classification. Extensive experiments using optical videos in the SENSIAC database demonstrate the efficacy of the proposed approach.
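The pixel-subsampling measurement itself is simple to illustrate. Below is a minimal NumPy sketch, assuming a grayscale frame; the sampling rate is an illustrative assumption, not a figure from the paper. Unlike a Gaussian random projection, the retained pixels keep their original locations, which is what allows detectors such as YOLO to operate on the measurements directly.

```python
# Minimal sketch of pixel-subsampling compressive measurement.
import numpy as np

def subsample_frame(frame, rate=0.25, seed=None):
    """Keep a random fraction of pixels; unmeasured pixels are zeroed.
    Surviving pixels retain their spatial locations."""
    rng = np.random.default_rng(seed)
    mask = rng.random(frame.shape) < rate
    return frame * mask, mask
```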
Compressive object tracking and classification using deep learning for infrared videos
Object tracking and classification in infrared videos are challenging due to large variations in illumination, target size, and target orientation. Moreover, if the infrared videos only provide compressive measurements, tracking and classification directly in the compressive measurement domain become even more difficult, as many conventional trackers and classifiers can only handle frames reconstructed from compressive measurements. This paper summarizes our research on target tracking and classification performed directly in the compressive measurement domain. We focus on one special type of compressive measurement based on pixel subsampling, in which the original pixels in the video frames are randomly subsampled. Even in this special compressive sensing setting, conventional trackers do not perform satisfactorily. We propose a deep learning approach that integrates YOLO (You Only Look Once) for multiple target tracking and ResNet (residual network) for target classification. Extensive experiments using short-wave infrared (SWIR) videos demonstrate the efficacy of the proposed approach even though the training data are very scarce.
Target Tracking and Classification
Small target detection for search and rescue operations using distributed deep learning and synthetic data generation
In search and rescue operations it is important to find the target as soon as possible. Surveillance camera systems and unmanned aerial vehicles (UAVs) are used to support search and rescue. Automatic object detection is important because a person cannot monitor multiple surveillance screens simultaneously around the clock, and the object is often too small to be recognized by the human eye on the surveillance screen. This study used UAVs around the Port of Houston and fixed surveillance cameras to build an automatic target detection system that supports the US Coast Guard (USCG) in finding targets (e.g., a person overboard). We combined image segmentation, enhancement, and convolutional neural networks to reduce the time needed to detect small targets. Comparing the automatic detection system against the human eye, our system detected the target within 8 seconds, whereas the human eye needed 25 seconds. Our system also used synthetic data generation and data augmentation techniques to improve target detection accuracy. This solution may help first responders conduct search and rescue operations in a timely manner.
Automatic pavement crack classification on two-dimensional VIAPIX images
Road maintenance management is a complex task for road authorities. The first prerequisite for evaluation analysis and correct road rehabilitation is precise, up-to-date information about pavement condition and degradation level. Various road crack types have been proposed in the state of the art to provide useful information for pavement maintenance strategies. In this paper we therefore present novel research to automatically detect and classify road cracks in two-dimensional digital images. Our proposed package is composed of two methods: crack detection and crack classification. The first method detects cracks in images acquired by the VIAPIX® system developed by our company ACTRIS, building on our unsupervised approach to road crack detection on two-dimensional pavement images. Then, to categorize each detected crack, the second method of our package is applied. Based on principal component analysis (PCA), it classifies each detected crack into one of three types: vertical, horizontal, or oblique. The obtained results demonstrate the efficiency of our robust approaches in terms of detection and classification quality on a variety of pavement images.
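The PCA-based orientation labeling can be sketched compactly. The following Python fragment assumes a detected crack is given as a set of (row, column) pixel coordinates and classifies it by the dominant eigenvector of the coordinate covariance; the angle thresholds are illustrative assumptions rather than the paper's values.

```python
# Sketch: label a crack as horizontal / vertical / oblique from the
# principal axis of its pixel coordinates.
import numpy as np

def classify_crack(pixels, oblique_band=(22.5, 67.5)):
    pts = np.asarray(pixels, dtype=float)
    pts -= pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts.T))
    principal = eigvecs[:, np.argmax(eigvals)]      # dominant direction
    # angle of the principal axis relative to the horizontal image axis
    angle = abs(np.degrees(np.arctan2(principal[0], principal[1])))
    angle = min(angle, 180 - angle)                 # fold into [0, 90]
    if angle < oblique_band[0]:
        return "horizontal"
    if angle > oblique_band[1]:
        return "vertical"
    return "oblique"
```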
Robust infrared target tracking using thermal information in mean-shift
The basic computational module of the technique is a classic pattern recognition procedure: mean-shift. In the gray-level feature domain, the spatial information of the target is lost when the background brightness histogram is the same as the target histogram. In this paper, we propose a new algorithm that is independent of background contrast, obtained by changing the feature from a conventional brightness-based histogram to a temperature-based histogram. The proposed algorithm can track targets robustly regardless of target-background contrast. The experimental results demonstrate that temperature-based mean-shift outperforms brightness-based mean-shift when tracking an object through successive background variations.
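A minimal sketch of the idea follows, assuming frames have already been converted to 8-bit temperature maps; OpenCV's histogram back-projection and mean-shift are used as the generic machinery, and the bin count and termination criteria are illustrative assumptions.

```python
# Sketch: mean-shift tracking driven by a temperature histogram rather
# than a brightness histogram.
import cv2
import numpy as np

def track(temp_frames, init_window, bins=64):
    x, y, w, h = init_window
    roi = temp_frames[0][y:y + h, x:x + w]
    hist = cv2.calcHist([roi], [0], None, [bins], [0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = init_window
    for frame in temp_frames[1:]:
        # back-project the *temperature* histogram onto the frame
        prob = cv2.calcBackProject([frame], [0], hist, [0, 256], 1)
        _, window = cv2.meanShift(prob, window, criteria)
        yield window
```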
Automatic target recognition using non-negative matrix approximations (Conference Presentation)
The set of orthogonal eigenvectors built via principal component analysis (PCA), while very effective for compression, can often lead to loss of crucial discriminative information in signals. In this work, we build a basis set using non-negative matrix approximations (NNMAs). We are interested in testing radar data with the non-negative basis and an accompanying non-negative coefficient set in order to understand whether NNMAs can produce a more accurate generative model than the PCA basis, which lacks direct physical interpretation. It is hoped that the NNMA basis vectors, while not orthogonal, capture discriminative local components of targets in processed radar data. We test the merits of the NNMA basis representation on the problem of automatic target recognition. Experiments on synthetic radar data are performed and the results are discussed.
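Since an NNMA is essentially a non-negative matrix factorization, a basis can be sketched with scikit-learn; the data, rank, and solver settings below are purely illustrative stand-ins for processed radar signatures.

```python
# Sketch: build a non-negative basis (H) and coefficients (W) so that
# X ~ W @ H, with all factors non-negative.
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.randn(200, 128))       # stand-in for non-negative radar data
model = NMF(n_components=20, init="nndsvd", max_iter=500)
W = model.fit_transform(X)                  # non-negative coefficients
H = model.components_                       # non-negative basis vectors
reconstruction_error = np.linalg.norm(X - W @ H)
```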
Correlation Based Recognition
Taking correlation from 2D to 3D: optical methods and performance evaluation
P. P. Banerjee, U. Abeywickrema, H. Zhou, et al.
Correlation of two-dimensional (2D) images using photorefractive materials is first reviewed. The performance of a joint transform correlator based on photorefractive beam coupling is analyzed by determining the dependence of typical figures of merit, such as the discrimination ratio, peak-to-correlation-plane energy ratio, and peak-to-noise ratio, on the photorefractive gain coefficient and the beam power ratio, using typical reference and signal images. Furthermore, correlation of three-dimensional (3D) images is introduced as the correlation of their 2D digital holograms. Critical figures of merit used to assess 2D image correlation are applied to the correlation of holograms.
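One of the cited figures of merit, the peak-to-correlation-plane energy ratio, is easy to state in code. A minimal sketch, assuming the correlation plane is available as a 2D array and using one common convention of excluding the peak pixel from the energy term:

```python
# Sketch: peak-to-correlation-plane energy ratio of a correlation plane.
import numpy as np

def peak_to_plane_energy(corr_plane):
    c = np.abs(corr_plane) ** 2
    peak = c.max()
    energy = c.sum() - peak      # plane energy excluding the peak pixel
    return peak / energy
```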
Nonparametric kernel smoothing classification to enhance optical correlation decision performances
Matthieu Saumard, Marwa El Bouz, Michaël Aron, et al.
Optical correlation is a well-known pattern recognition method for recognizing an image from a database. It is simple to implement and use, and achieves good performance. However, it makes a global decision based on the location, height, and shape of the correlation peak within the correlation plane, which considerably reduces its robustness. Moreover, correlation is sensitive to rotation and scale, which deform the correlation plane and decrease the method's performance. In this paper, to overcome these problems, we propose and validate a new nonparametric model of the correlation plane. The method is based on a kernel estimate of the regression function used to classify individuals according to the correlation plane. The idea is to enhance the decision by taking into account the shape and the distribution of energy in the correlation plane, relying on calculations of the Hausdorff distance between the target correlation plane and the correlation planes coming from the database. The results show the very good performance of our method compared to others in the literature, in particular a significant rate of correct detection and a very low false-alarm rate.
Convolutional auto-encoder for vehicle detection in aerial imagery (Conference Presentation)
Much research has been done on applying deep learning architectures to detection and recognition tasks. Current work on auto-encoders and generative adversarial networks suggests the ability to recreate scenes based on previously trained data, and it can be assumed that the ability to recreate information implies the ability to differentiate it. We propose a convolutional auto-encoder both for recreating information of the scene and for detecting vehicles within the scene. In essence, the auto-encoder creates a low-dimensional representation of the data projected in a latent space, which can also be used for classification. The convolutional neural network is based on the concept of receptive fields created by the network, which are part of the detection process. The proposed architecture includes a discriminator network connected in the latent space, which is trained for the detection of vehicles. Work in multi-task learning shows that it is advantageous to learn multiple representations of the data from different tasks to help improve task performance. To test and evaluate the network, we use standard aerial vehicle datasets, such as Vehicle Detection in Aerial Imagery (VEDAI) and Columbus Large Image Format (CLIF). We observe that the neural network is able to create features representative of the data and to classify the imagery into vehicle and non-vehicle regions.
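A minimal PyTorch sketch of the shared-latent multi-task idea, an encoder feeding both a decoder (reconstruction) and a classifier head (vehicle vs. non-vehicle); all layer sizes are illustrative assumptions.

```python
# Sketch: convolutional auto-encoder with a classifier on the latent code.
import torch
import torch.nn as nn

class AEClassifier(nn.Module):
    def __init__(self, latent=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, latent))
        self.dec = nn.Sequential(
            nn.Linear(latent, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())
        self.cls = nn.Linear(latent, 2)     # vehicle / non-vehicle

    def forward(self, x):                   # x: (N, 1, 64, 64)
        z = self.enc(x)
        return self.dec(z), self.cls(z)

# joint training would combine a reconstruction loss on dec(z) with a
# classification loss on cls(z)
```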
Hardware design of correlation filters for target detection
Correlation filters, owing to their three prominent advantages, have proven very effective for automatic target detection, biometric verification, and security applications. In this paper, correlation filters are implemented on FPGA hardware in view of their importance in real-time applications. The hardware implementation results are compared with results generated in software and are almost identical, with a negligible variation on the order of 10⁻⁴, as demonstrated in the experimental section, in addition to a valuable reduction in computation time. The hardware design of these filters is implemented in LabVIEW and can subsequently be employed in real-time security applications. The design may be extended to other advanced variants of correlation filters in future work.
Wireless charging pad detection and alignment using a fisheye camera for electric vehicles
The market for electric vehicles is growing day by day, and electric car chargers can often be seen on the pavements of major cities and towns. With this growing market, industry is already looking for the next breakthrough: wireless vehicle charging, much like charging smartphones on wireless charging pads instead of plugging them in. Industry is exploring ways to charge electric vehicles wirelessly while the car is parked over a charger on the ground beneath it. For wireless charging to work, both elements must be well aligned. This paper explores vision-based approaches to automatic recognition, localization, and tracking of an inductive plate for wireless car charging. Visual feedback is provided to a motion control system for accurate charger alignment.
Deep Learning Based Recognition
Suitability of features of deep convolutional neural networks for modeling somatosensory information processing
Deep Learning (DL) has recently led to great excitement and success in AI and has attracted further attention because the features extracted in its early layers have properties similar to those of real neurons in the primary visual cortex. Understanding cortical mechanisms of sensory information processing is important for improving DL systems as well as for developing more realistic simulations of cortical systems. Just as representing speech in the time-frequency domain makes it possible to use DL systems trained on images for speech recognition, tactile information processing can also be modeled and studied using DL. Tactile stimulators, such as those invented by Cortical Metrics, are efficient tools for quantitative sensory testing. To model the features extracted by tactile information processing in the somatosensory cortex more accurately, we propose a novel unsupervised DL method using transfer learning and the principle of contextual guidance. Our approach helps describe the goal of sensory coding in early cortical areas: low-level stimulus features that serve as behaviorally useful building blocks for constructing high-level, behaviorally significant features must rely on principles, such as contextual guidance and transfer-utility, that apply to any sensory modality (visual, auditory, or tactile). We show that the emergent features offer higher accuracy than AlexNet on the Caltech-101 dataset and on classification of textures resembling the tactile stimuli processed by the S1 somatosensory area. Our computational modeling approach can help improve (i) Cortical Metrics approaches, (ii) sensorimotor cortical models, and (iii) deep hybrids of unsupervised and supervised networks.
Optimized training of deep neural network for image analysis using synthetic objects and augmented reality
Thomas Lu, Alexander Huyen, Luan Nguyen, et al.
Acquiring large amounts of data for training and testing Deep Learning (DL) models is time consuming and costly. We present the development of a process for generating synthetic objects and scenes using 3D graphics software. By programming the path and environment in a 3D graphical engine, complex objects and scenes can be generated for training and testing a Deep Neural Network (DNN) model on specific vision tasks. An automatic process has been developed to label and segment objects in the synthetic images and to generate the corresponding ground truth files. DNNs trained with synthetic data have been shown to outperform DNNs trained with real data.
Actionable surveillance identification (ASI)
Christopher R. Bell, Iain Macleod
The Defence Science and Technology Laboratory wished to specify requirements for long-range imaging systems that could be passed to system integrators. We were interested in facial identification and wanted a suitable metric. The UK Home Office produced a test to verify the setup of CCTV cameras based on facial identification, using synthetic faces with a given number of pixels across the face; this is now part of British Standard EN 62676-4:2015. We were interested in how the number of pixels affected the probability of the faces being correctly identified, and ran an observer trial using the synthetic faces pixelated at different resolutions. It was found that the probability of correctly identifying the pixelated faces did not exceed ~60%, however many pixels were used. This led us to suggest a pragmatic pixel count at the 50% probability point (in line with Johnson's criteria) of correctly identifying faces. We have christened this actionable surveillance identification (ASI).
Advanced Recognition Techniques
Robust detection and removal of dust artifacts in retinal images via dictionary learning and sparse-based inpainting
Retinal images are acquired with eye fundus cameras which, like any other camera, can suffer from dust particles attached to the sensor and lens. These particles impede light from reaching the sensor and therefore appear as dark spots in the image, which can be mistaken for small lesions such as microaneurysms. We propose a robust method for detecting dust artifacts using more than one image as input, and a sparse-based inpainting technique with dictionary learning for their removal. The detection is based on a closing operation that removes small dark features; we compute the difference with the original image to highlight the artifacts, and apply a filtering approach with a bank of artifact models of different sizes. Candidate artifacts are identified via non-maxima suppression. Because the artifacts do not change position across images, after processing all input images the candidates that are not in approximately the same position in different images are rejected, and those regions are left unchanged in the image. The experimental results show that our method can successfully detect and remove artifacts while ensuring the continuity of retinal structures such as blood vessels.
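The closing-then-difference detection step can be sketched directly with OpenCV; the structuring-element size below is an illustrative assumption.

```python
# Sketch: morphological closing fills small dark spots; subtracting the
# original image then highlights them as bright responses.
import cv2

def highlight_dark_spots(gray, ksize=15):
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    closed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
    diff = cv2.subtract(closed, gray)   # bright where dark artifacts were
    return diff
```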
Enhanced target recognition employing spatial correlation filters and affine scale invariant feature transform
The spatial domain optimal trade-off Maximum Average Correlation Height (SPOT-MACH) filter has been shown to have advantages over frequency domain implementations of the Optimal Trade-off Maximum Average Correlation Height (OT-MACH) filter, as it can be made locally adaptive to spatial variations in the input image background clutter and normalized for local intensity changes, making the spatial domain implementation resistant to illumination changes. The Affine Scale Invariant Feature Transform (ASIFT) is an extension of previous feature transform algorithms; its features are invariant to six affine parameters: translation (2 parameters), zoom, rotation, and two camera axis orientations. As a result it accurately matches larger numbers of key points, which can then be used for matching between different images of the object under test. In this paper a novel approach is adopted for enhancing the performance of the SPOT-MACH filter by using ASIFT in a pre-processing stage, enabling fully invariant object detection and recognition in images with geometric distortions. An optimization criterion is also developed to overcome the temporal overhead of the spatial domain approach. To evaluate the effectiveness of the algorithm, experiments were conducted on two different data sets, with several test cases created based on illumination, rotational, and scale changes in the target object. The performance of the correlation algorithms was also tested with composite images as references; this results in a well-trained filter with better detection ability even when the target object has undergone large rotational changes.
Optimization of the jamming signal parameters for conical-scan systems
Because of their proliferation to many regions and groups in the world and their easy availability, infrared (IR) guided missiles are probably the most dangerous threats to aircraft platforms. The first generations of IR guided missiles, whether surface-to-air or air-to-air, use non-imaging reticle-based seekers. On the other hand, there are IR countermeasure (IRCM) techniques that an aircraft may apply to protect itself from an approaching IR guided missile. One such technique is the use of a modulated jamming signal; to be effective, however, the parameters of the jammer modulation must be optimized, and the jamming operation may fail to protect the aircraft if the required jamming characteristic is not satisfied. Defining the jammer signal modulation characteristic requires optimizing several parameters. In the present paper, we consider protection of a helicopter platform against conical-scan reticle-based seeker systems and investigate the optimized value intervals of the jamming parameters via self-organizing maps and multidimensional particle swarm optimization. The data for the investigation are retrieved from a MATLAB-coded simulator that includes the guided missile model with a reticle-based conical-scan seeker, the aircraft model with aircraft radiation and motion models, and the jammer system model on the aircraft.
Underwater image quality improvement approach based on an adapted Gabor multi-channels filtering
This paper presents a new approach to improve the identification of underwater fiducial markers for camera pose estimation. The use of marker detection is new in the underwater field and requires new image preprocessing to reach the same performance as in onshore environments. This is a challenging task due to the poor quality of underwater images: images captured in highly turbid environments are strongly degraded by light attenuation and scattering. In this context, dehazing methods are increasingly used, but they are less effective underwater because the scattering of light in water differs from that in the atmosphere, so estimating dehazing parameters on the target image can lead to poor image restoration. For this reason, an object-oriented dehazing method is proposed to optimize the contrast of the markers. The proposed system exploits texture features derived by multi-channel filtering for image segmentation. To achieve this, saliency detection is applied to estimate the visually salient objects in the image; the generated saliency map is passed through a Gabor filter bank, and the significant texture features are clustered by the K-means algorithm to produce the segmented image. Once the different objects of the image are separated, an optimized Dark Channel Prior (DCP) dehazing method is applied to optimize the contrast of each individual object. The implemented system has been tested on a large image dataset acquired during night offshore experiments in turbid waters at 15 meters depth. Results show that object-oriented dehazing improves the success of marker identification in the underwater environment.
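The Gabor filter bank plus K-means clustering stage can be sketched as follows; the number of orientations, kernel parameters, and cluster count are illustrative assumptions.

```python
# Sketch: per-pixel Gabor texture features clustered into segments.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def gabor_segment(gray, n_orient=4, k=3):
    feats = []
    for i in range(n_orient):
        theta = i * np.pi / n_orient
        kern = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5)
        feats.append(cv2.filter2D(gray.astype(np.float32), -1, kern))
    feats = np.stack(feats, axis=-1).reshape(-1, n_orient)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    return labels.reshape(gray.shape)   # per-pixel segment labels
```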
Classroom engagement evaluation using computer vision techniques
Prakash Duraisamy, James Van Haneghan, William Blackwell, et al.
Classroom engagement is one important factor that determines whether students learn from lectures. Most traditional classroom assessment tools are based on summary judgements by students in the form of surveys filled out in class or online once a semester. Such ratings are often biased, do not capture the professor's teaching in real time, and for the most part fail to capture the locus of teaching and learning difficulties: they cannot differentiate whether ongoing poor performance of students is a function of the instructor's lack of teaching skill or the students' lack of engagement in the class. So, to streamline and improve the evaluation of classroom engagement, we introduce human gestures as an additional tool for teaching evaluation alongside other techniques. In this paper we report the results of a novel semi-automatic computer-vision-based approach to obtaining accurate predictions of classroom engagement in classes where students do not have digital devices such as laptops and cellphones during lectures. We conducted our experiment in various classroom sizes at different times of the day and computed engagement through a semi-automatic process (using Azure and manual observation), combining our initial computational algorithms with human judgment to build confidence in the validity of the results. Application of the technique in the presence of distractors such as laptops and cellphones is also discussed.
Poster Session
Automated detection and classification for early stage lung cancer on CT images using deep learning
Since accurate early detection of malignant lung nodules can greatly enhance patient survival, detection of early stage lung cancer in chest computed tomography (CT) scans has been a major research problem for the last couple of decades, making automated lung cancer detection techniques important. However, accurately detecting lung cancer at an early stage is a significant challenge due to substantial similarities in the structure of benign and malignant lung nodules; the major task is to reduce false positive and false negative results. Recent advancements in convolutional neural network (CNN) models have improved image detection and classification for many tasks. In this study, we present a deep learning-based framework for automated lung cancer detection. The proposed framework works in multiple stages on 3D lung CT scan images to detect nodules and determine their malignancy. Considering the 3D nature of lung CT data and the compactness of the mixed link network (MixNet), a deep 3D Faster R-CNN and a U-Net encoder-decoder with MixNet were designed to detect and learn the features of the lung nodule, respectively. For the classification of the nodules, a gradient boosting machine (GBM) with 3D MixNet was proposed. The proposed system was tested against manually drawn radiologist contours on 1200 images obtained from LIDC-IDRI, including 3250 nodules, using statistical measures; LIDC-IDRI comprises equal numbers of benign and malignant lung nodules. On this data set the proposed system achieved a sensitivity of 94%, a specificity of 90%, and an area under the receiver operating curve of 0.99, better results than the existing methods.
A crowd counting method based on multi-column dilated convolutional neural network
Crowd counting is an important part of crowd analysis and is of great significance for crowd control and management. Convolutional neural network (CNN) based crowd counting methods are widely used to address the loss of counting accuracy caused by heavy occlusion, background clutter, head scale variation, and perspective changes in crowd scenes. The multi-column convolutional neural network (MCNN) is a CNN-based crowd counting method that adapts to head scale variation by combining three single-column networks whose convolution kernels have different sizes (large, medium, and small). However, because the MCNN is relatively shallow, its receptive field is limited, which restricts its adaptability to large scale variations. In addition, due to insufficient training data, MCNN requires a cumbersome pre-training strategy in which each single-column network is pre-trained individually before the columns are combined. In this paper, a crowd counting method based on a multi-column dilated convolutional neural network is proposed. Dilated convolution is used to enlarge the receptive field of the network and thus better adapt to head scale variations. Image patches are obtained by randomly cropping the original training images during each training iteration to further expand the training data, so training can be accomplished without tedious pre-training. Experimental results on the ShanghaiTech public dataset show that the counting accuracy of the proposed method is better than that of MCNN, demonstrating that the method is more robust to head scale variations in crowd scenes.
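A minimal PyTorch sketch of the multi-column dilated idea, with three columns of different dilation rates standing in for MCNN's small, medium, and large kernels, fused into a single density map; channel widths and dilation rates are illustrative assumptions.

```python
# Sketch: multi-column dilated CNN for crowd density-map regression.
import torch
import torch.nn as nn

class DilatedColumn(nn.Module):
    def __init__(self, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class MultiColumnDilated(nn.Module):
    """Dilation rates 1/2/4 emulate small/medium/large receptive fields;
    a 1x1 conv fuses the columns into one density map."""
    def __init__(self):
        super().__init__()
        self.cols = nn.ModuleList(DilatedColumn(d) for d in (1, 2, 4))
        self.fuse = nn.Conv2d(3 * 32, 1, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([c(x) for c in self.cols], dim=1))

# the crowd count is approximated by the density-map sum: model(img).sum()
```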
A Chinese acoustic model based on convolutional neural network
Speech recognition has always been one of the research focuses in the field of human-computer communication and interaction. The main purpose of automatic speech recognition (ASR) is to convert speech waveform signals into text. The acoustic model is the main component of ASR, connecting the observed features of the speech signal with the speech modeling units. In recent years, deep learning has become the mainstream technology in speech recognition. In this paper, a convolutional neural network architecture composed of VGG-style layers and the Connectionist Temporal Classification (CTC) loss function is proposed for the acoustic model. Traditional acoustic model training is based on frame-level labels with a cross-entropy criterion, which requires a tedious label alignment procedure. The CTC loss is adopted to automatically learn the alignments between speech frames and label sequences, making training end-to-end. The architecture can exploit the temporal and spectral structures of speech signals simultaneously. Batch normalization (BN) is used to normalize each layer's input and reduce internal covariate shift, and dropout is used during training to prevent overfitting and improve generalization. The speech signal is transformed into a spectral image through a series of processing steps to form the input of the neural network. The input features have 200 dimensions, and the output labels of the acoustic model are 415 Chinese pronunciation units without tone. The experimental results demonstrate that the proposed model achieves character error rates (CER) of 17.97% and 23.86% on the public Mandarin speech corpora AISHELL-1 and ST-CMDS-20170001_1, respectively.
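The end-to-end CTC training glue can be sketched in PyTorch; tensor shapes are illustrative, and reserving the index after the 415 labels for the CTC blank is an assumption about the label layout.

```python
# Sketch: CTC loss over per-frame log-probabilities, with no frame-level
# alignment required.
import torch
import torch.nn as nn

num_labels = 415                       # toneless pronunciation units
ctc = nn.CTCLoss(blank=num_labels)     # reserve index 415 for the blank

# stand-in for the VGG-style front end output: (T, batch, num_labels + 1)
log_probs = torch.randn(100, 8, num_labels + 1,
                        requires_grad=True).log_softmax(dim=2)
targets = torch.randint(0, num_labels, (8, 20))       # label sequences
input_lengths = torch.full((8,), 100, dtype=torch.long)
target_lengths = torch.full((8,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```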
Deep network based 3D hand keypoints prediction from single RGB images
3D hand keypoint prediction is an important and fundamental task in human-computer interaction. In this paper, we present an approach to predict 3D hand keypoints from single RGB images. Single RGB images are very common in daily life, but predicting 3D hand keypoints from them is challenging because of depth ambiguities and occlusions. To deal with these challenges, we exploit deep neural networks. Existing methods that predict 3D hand keypoints from single RGB images mostly separate the task into three stages: hand detection, 2D hand keypoint estimation, and 3D hand keypoint prediction. We follow this idea and focus on the 2D estimation and 3D prediction stages, improving an existing deep-network-based technique to obtain better results. Specifically, we combine convolution and deconvolution networks to obtain a pixel-wise estimate of the 2D hand keypoints, and propose a new loss function to predict 3D hand keypoints from the 2D keypoints. We evaluate our network on several public datasets and obtain better results than several other methods; ablation studies further demonstrate the validity of our network design.
Generative adversarial networks based super resolution of satellite aircraft imagery
Generative Adversarial Networks (GANs) are one of the most popular Machine Learning algorithms developed in recent times, and are a class of neural networks that are used in unsupervised machine learning. The advantage of unsupervised machine learning approaches such as GANs is that they do not need a large amount of labeled data, which is costly and time consuming. GANs may be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image super-resolution and classification. In this work, GANs are utilized to solve the single image super-resolution problem. This approach in literature is referred to as super resolution GANs (SRGAN), and employs a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes the solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and the original photo-realistic images, and the content loss is motivated by the perceptual similarity and not the similarity in the pixel space. This paper presents implementation of SRGAN using Deep convolution network applied to both the aerial and satellite imagery of the aircrafts. The results thus obtained are compared with traditional super resolution methods. The resulting estimates of SRGAN are compared against the traditional methods using peak signal to noise ratio (PSNR) and structure similarity index metric (SSIM). The PSNR and SSIM of SRGAN estimates are similar to traditional method such as Bicubic interpolation but traditional methods are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution.