Proceedings Volume 8183

High-Performance Computing in Remote Sensing

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 11 October 2011
Contents: 6 Sessions, 24 Papers, 0 Presentations
Conference: SPIE Remote Sensing 2011
Volume Number: 8183

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 8183
  • HPC for Remote Sensing and Astronomical Data Processing
  • HPC for Remote Sensing Data Compression
  • HPC for Hyper- and Multispectral Remote Sensing I
  • HPC for Hyper- and Multispectral Remote Sensing II
  • Applications of HPC in Remote Sensing
Front Matter: Volume 8183
Front Matter: Volume 8183
This PDF file contains the front matter associated with SPIE Proceedings Volume 8183, including the Title Page, Copyright information, Table of Contents, and the Conference Committee listing.
HPC for Remote Sensing and Astronomical Data Processing
Fuzzy clustering of large satellite images using high performance computing
Dana Petcu, Daniela Zaharie, Silviu Panica, et al.
Fuzzy clustering is one of the most frequently used methods for identifying homogeneous regions in remote sensing images. In the case of large images, the computational costs of fuzzy clustering can be prohibitive unless high performance computing is used. Therefore, efficient parallel implementations are highly desirable. This paper presents results on the efficiency of a parallelization strategy for the Fuzzy c-Means (FCM) algorithm. In addition, the parallelization strategy has been extended to two FCM variants that incorporate spatial information (Spatial FCM and Gaussian Kernel-based FCM with spatial bias correction). The high-level requirements that guided the formulation of the proposed parallel implementations are: (i) find an appropriate partitioning of large images in order to ensure a balanced processor load; (ii) use collective computations as much as possible; (iii) reduce the cost of communication between processors. The parallel implementations were tested on several test cases, including multispectral images and images with a large number of pixels. The experiments were conducted on both a computational cluster and a BlueGene/P supercomputer with up to 1024 processors. Generally, good scalability was obtained with respect to both the number of clusters and the number of spectral bands.
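As a rough illustration of the partition-and-reduce pattern described in this abstract, the sketch below distributes pixels across MPI ranks and combines partial FCM centroid sums with a collective reduction. It assumes NumPy and mpi4py, uses synthetic data, and is not the authors' implementation.

    # Minimal sketch of row-partitioned Fuzzy c-Means with collective updates.
    # Assumes mpi4py and NumPy; illustrative only, not the paper's code.
    import numpy as np
    from mpi4py import MPI

    def fcm_parallel(local_pixels, n_clusters=5, m=2.0, iters=20, seed=0):
        comm = MPI.COMM_WORLD
        rng = np.random.default_rng(seed)
        n_bands = local_pixels.shape[1]
        # Every rank starts from the same random centroids.
        centroids = rng.random((n_clusters, n_bands))
        comm.Bcast(centroids, root=0)
        for _ in range(iters):
            # Membership degrees for the local block of pixels.
            d = np.linalg.norm(local_pixels[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
            u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
            w = u ** m
            # Local partial sums, combined across ranks with a collective reduction.
            local_num = w.T @ local_pixels            # (clusters, bands)
            local_den = w.sum(axis=0)                 # (clusters,)
            num = np.empty_like(local_num)
            den = np.empty_like(local_den)
            comm.Allreduce(local_num, num, op=MPI.SUM)
            comm.Allreduce(local_den, den, op=MPI.SUM)
            centroids = num / den[:, None]
        return centroids

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        # Each rank holds an equal share of synthetic "pixels" (balanced load).
        local = np.random.default_rng(comm.rank).random((10000, 6))
        c = fcm_parallel(local)
        if comm.rank == 0:
            print(c)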
3D-processor arrays accelerators for high-performance computing in remote sensing applications
A. Castillo Atoche, J. Vazquez Castillo, L. Rizo Dominguez, et al.
This study develops efficient 3D processor array (3D-PA) accelerator units, combined with a HW/SW co-design technique, on an FPGA platform for the real-time enhancement/reconstruction of large-scale remote sensing (RS) imagery for geospatial applications. The addressed architecture implements the previously proposed robust fused Bayesian-regularization (RFBR) enhanced radar imaging method for the solution of ill-conditioned inverse spatial spectrum pattern (SSP) estimation problems. Finally, we show how the proposed 3D-PA accelerators drastically reduce the computational load of real-world geospatial imagery tasks, making them suitable for real-time implementation.
A GPU-accelerated extended Kalman filter
The extended Kalman filter is one of the most widely used techniques for state estimation of nonlinear systems. Its two steps of forecast and data assimilation involve many matrix operations, including multiplication and inversion. As recent graphics processing units (GPUs) have been shown to provide substantial speedup in matrix operations, we explore in this work a GPU-based implementation of the extended Kalman filter. The Compute Unified Device Architecture (CUDA) on the NVIDIA GeForce GTX 590 GPU is used for comparison with a single-threaded CPU counterpart. Experiments were conducted on typical large-scale over-determined systems with thousands of components in states and measurements. Within the GPU memory limit, a speedup of 1386x is achieved for a system with measurements having 5000 components and states having 3750 components. The speedup profile for various combinations of measurement and state sizes will serve as a good reference for future implementations of the extended Kalman filter in real large-scale applications.
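For reference, a plain NumPy sketch of the two extended Kalman filter steps (forecast and data assimilation) that such a GPU implementation accelerates is given below; the matrix sizes and model functions are illustrative assumptions, not the paper's test system.

    # Plain NumPy sketch of one extended Kalman filter cycle (forecast + update).
    # The GPU version described above parallelizes exactly these dense matrix
    # products and the inversion; sizes below are illustrative.
    import numpy as np

    def ekf_step(x, P, f, F, h, H, Q, R, z):
        # Forecast (prediction) step.
        x_pred = f(x)
        P_pred = F @ P @ F.T + Q
        # Data assimilation (update) step.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
        x_new = x_pred + K @ (z - h(x_pred))
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new

    if __name__ == "__main__":
        n, m = 375, 500                               # toy state / measurement sizes
        rng = np.random.default_rng(0)
        F = np.eye(n)
        H = rng.random((m, n))
        Q = 0.01 * np.eye(n)
        R = 0.1 * np.eye(m)
        x = rng.random(n)
        P = np.eye(n)
        z = H @ x + 0.1 * rng.standard_normal(m)
        x, P = ekf_step(x, P, lambda v: v, F, lambda v: H @ v, H, Q, R, z)
        print(x[:5])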
Efficient data storage of astronomical data using HDF5 and PEC compression
Jordi Portell de Mora, Enrique García-Berro, Carlos Estepa, et al.
Future space missions are based on a new generation of instruments that often generate vast amounts of data. Transferring these data to the ground and, once there, between different computing facilities is no easy task. A clear example of such missions is Gaia, a space astrometry mission of ESA. To carry out the data reduction tasks on the ground, an international consortium has been set up. Among its tasks, perhaps the most demanding one is the Intermediate Data Updating, which will have to repeatedly re-process nearly 100 TB of raw data received from the satellite using the latest instrument calibrations available. On the other hand, one of the best data compression solutions is the Prediction Error Coder, a highly optimized entropy coder that performs very well with data following realistic statistics. Regarding file formats, HDF5 provides a completely indexed, easily customizable file with quick and parallel access. Moreover, HDF5 has a friendly presentation format and multi-platform compatibility. Thus, it is a powerful environment for storing data compressed with the above-mentioned coder. Here we show the integration of both systems for the storage of Gaia raw data. However, this integration can be applied to the efficient storage of any kind of data. Moreover, we show that the file sizes obtained using this solution are similar to those obtained using other compression algorithms that require more computing power.
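The sketch below shows, under stated assumptions, how chunked and compressed storage of raw-telemetry-like records can be organized with h5py; the built-in gzip filter stands in for the Prediction Error Coder (which would be registered as a custom HDF5 filter), and the record layout is invented for illustration.

    # Sketch of chunked, compressed storage of raw-telemetry-like records in HDF5.
    # h5py with the built-in gzip filter stands in for the PEC coder described
    # above; the dataset path and field names are illustrative.
    import h5py
    import numpy as np

    record = np.dtype([("time", "<u8"), ("ccd", "<u2"), ("samples", "<u2", (60,))])
    data = np.zeros(100000, dtype=record)
    data["time"] = np.arange(100000)
    data["samples"] = np.random.default_rng(0).integers(0, 4096, (100000, 60))

    with h5py.File("gaia_like_raw.h5", "w") as f:
        dset = f.create_dataset(
            "astro/raw_windows", data=data,
            chunks=(8192,),            # chunking gives quick, indexed partial reads
            compression="gzip",        # stand-in filter; a PEC filter would go here
            compression_opts=4,
            shuffle=True,              # byte shuffle usually helps entropy coders
        )
        dset.attrs["mission"] = "illustrative"

    with h5py.File("gaia_like_raw.h5", "r") as f:
        print(f["astro/raw_windows"][:3]["time"])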
An efficient framework for Java data processing systems in HPC environments
Aidan Fries, Javier Castañeda, Yago Isasi, et al.
Java is a commonly used programming language, although its use in High Performance Computing (HPC) remains relatively low. One of the reasons is the lack of libraries offering HPC-specific functions to Java applications. In this paper we present a Java-based framework, called DpcbTools, designed to provide a set of functions that fill this gap. It includes a set of efficient message-passing data communication functions, providing, when a low-latency network such as Myrinet is available, higher throughput and lower latency than the standard solutions used by Java. DpcbTools also includes routines for launching, monitoring and managing Java applications on several computing nodes, using JMX to communicate with remote Java VMs. The Gaia Data Processing and Analysis Consortium (DPAC) is a real case where scientific data from the ESA Gaia astrometric satellite will be entirely processed using Java. In this paper we describe the main elements of DPAC and its usage of the DpcbTools framework. We also assess the usefulness and performance of DpcbTools through its performance evaluation and the analysis of its impact on some DPAC systems deployed on the MareNostrum supercomputer (Barcelona Supercomputing Center).
HPC for Remote Sensing Data Compression
Geostatistical analysis of Landsat-TM lossy compression images in a high-performance computing environment
Lluís Pesquer, Ana Cortés, Ivette Serral, et al.
The main goal of this study is to characterize the effects of lossy image compression procedures on the spatial patterns of remotely sensed images, as well as to test the performance of job distribution tools specifically designed for obtaining geostatistical parameters (variogram) in a High Performance Computing (HPC) environment. For this purpose, radiometrically and geometrically corrected Landsat-5 TM images from April, July, August and September 2006 were compressed using two different methods: Band-Independent Fixed-Rate (BIFR) and three-dimensional Discrete Wavelet Transform (3D-DWT) applied to the JPEG 2000 standard. For both methods, a wide range of compression ratios (2.5:1, 5:1, 10:1, 50:1, 100:1, 200:1 and 400:1, from soft to hard compression) was compared. Variogram analyses conclude that all compression ratios maintain the variogram shapes and that the higher ratios (more than 100:1) reduce the variance in the sill parameter by about 5%. Moreover, the parallel solution in a distributed environment demonstrates that HPC offers a suitable scientific test bed for time-demanding execution processes, as in geostatistical analyses of remote sensing images.
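As a reminder of the quantity being compared across compression ratios, a serial NumPy sketch of an empirical semivariogram for a single band follows; the HPC contribution of the paper lies in distributing many such computations, which is not shown here.

    # Serial NumPy sketch of an empirical semivariogram for one image band.
    # gamma(h) = half the mean squared difference of pixel pairs at lag h.
    import numpy as np

    def semivariogram_1d_lags(band, max_lag=30):
        """Directional (east-west) semivariogram of a 2-D band for integer pixel lags."""
        lags = np.arange(1, max_lag + 1)
        gamma = np.empty(len(lags))
        for i, h in enumerate(lags):
            diff = band[:, h:] - band[:, :-h]        # pairs separated by h pixels
            gamma[i] = 0.5 * np.mean(diff.astype(np.float64) ** 2)
        return lags, gamma

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Smooth synthetic "reflectance" field standing in for a Landsat-TM band.
        band = rng.random((512, 512))
        for _ in range(5):
            band = 0.25 * (np.roll(band, 1, 0) + np.roll(band, -1, 0)
                           + np.roll(band, 1, 1) + np.roll(band, -1, 1))
        lags, gamma = semivariogram_1d_lags(band)
        print(list(zip(lags[:5], np.round(gamma[:5], 6))))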
Is the CCSDS rice coding suitable for GPU massively parallel implementation?
The Consultative Committee for Space Data Systems (CCSDS) Rice coding is a recommendation for lossless compression of satellite data. It has also been integrated into HDF (Hierarchical Data Format) software for lossless compression of scientific data, and has been proposed for lossless compression of medical images. The CCSDS Rice coding is an approximate adaptive entropy coder. It uses a subset of the family of Golomb codes to produce a simpler, suboptimal prefix code. The default preprocessor is a unit-delay predictor with positive mapping. The adaptive entropy coder concurrently applies a set of variable-length codes to a block of consecutive preprocessed samples. The code option that yields the shortest codeword sequence for the current block of samples is then selected for transmission. A unique identifier bit sequence is attached to the code block to indicate to the decoder which decoding option to use. In this paper we explore the parallel efficiency of the CCSDS Rice code running on Graphics Processing Units (GPUs) with the Compute Unified Device Architecture (CUDA). The GPU-based CCSDS Rice encoder processes several codeword blocks in a massively parallel fashion on different GPU multiprocessors. We parallelized the CCSDS Rice coding by using reduction sum for code option selection, prefix sum for intra-block and inter-block bit stream concatenation, as well as asynchronous data transfer. For NASA AVIRIS hyperspectral data, the speedup is nearly 6x compared to the single-threaded CPU counterpart. The CCSDS Rice coding has too many flow control instructions, which significantly affect the instruction throughput by causing threads of the same CUDA warp to diverge. Consequently, the different execution paths must be serialized, increasing the total number of instructions executed within the same warp. We conclude that this branching and divergence issue is the bottleneck of the Rice coding and leads to a smaller speedup than other entropy coders on GPUs.
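A minimal Python sketch of the per-block logic paraphrased above (unit-delay prediction, positive mapping, and selection of the Golomb-Rice parameter that minimizes the block length) is given below; it is not a CCSDS-compliant codec (no low-entropy options, packing or headers), and only illustrates the branch-heavy core that limits warp efficiency on GPUs.

    # Sketch of per-block Rice coding: unit-delay prediction, positive mapping,
    # and choice of the Golomb-Rice split parameter k minimizing the block length.
    import numpy as np

    def positive_map(pred_err):
        # Map signed prediction errors to non-negative integers: 0,-1,1,-2,2 -> 0,1,2,3,4
        return np.where(pred_err >= 0, 2 * pred_err, -2 * pred_err - 1)

    def rice_block_length(mapped, k):
        # Each sample costs (quotient + 1) unary bits plus k remainder bits.
        return int(np.sum((mapped >> k) + 1 + k))

    def encode_block(samples, k_max=14):
        pred_err = np.diff(samples, prepend=samples[:1]).astype(np.int64)  # unit-delay predictor
        mapped = positive_map(pred_err)
        lengths = [rice_block_length(mapped, k) for k in range(k_max + 1)]
        k_best = int(np.argmin(lengths))             # code option selected per block
        bits = []
        for m in mapped:                              # data-dependent loop -> warp divergence on GPU
            q, r = int(m) >> k_best, int(m) & ((1 << k_best) - 1)
            code = "1" * q + "0"                      # unary quotient, terminating zero
            if k_best:
                code += format(r, f"0{k_best}b")      # k remainder bits
            bits.append(code)
        return k_best, "".join(bits), lengths[k_best]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        block = np.cumsum(rng.integers(-3, 4, 64)) + 1000   # smooth sensor-like samples
        k, stream, nbits = encode_block(block)
        print(k, nbits, len(stream))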
Accelerating arithmetic coding on a graphic processing unit
Liang Chen, Yong Fang, Bormin Huang
The popularity of Graphics Processing Units (GPUs) opens a new avenue for general-purpose computation, including the acceleration of algorithms. Massively parallel computations using GPUs have been applied in various fields by researchers. Arithmetic coding (AC) is widely used in lossless data compression and shows better compression efficiency than the well-known Huffman coding. However, AC has much higher computational complexity due to frequent multiplication and branching operations. In this paper, we implement a block-parallel arithmetic encoder on NVIDIA GPUs using the Compute Unified Device Architecture (CUDA) programming model. The source data sequence is divided into small blocks. Each CUDA thread processes one data block so that data blocks can be encoded in parallel. By exploiting the GPU computational power, a significant speedup is achieved. We show that the GPU-based AC speedup depends on data distribution and size. It is observed that the GPU speedup increases with higher compression ratios, due to the fact that a higher compression ratio corresponds to a smaller compressed output, which reduces the bit stream concatenation time as well as the device-to-host transfer time. Applied to selected test images in the USC-SIPI image database, we obtain speedups ranging from 26x to 42x with compression ratios ranging from 1.4 to 2.7.
High-performance computing in remote sensing image compression
Albert Lin, C. F. Chang, M. C. Lin, et al.
High-performance computing is necessary for remote sensing image compression to achieve real-time output. There are one panchromatic (PAN) band and four multispectral (MS) bands with a total data rate of 970 Mbps in the FORMOSAT-5 Remote Sensing Instrument (RSI). Three Xilinx Virtex-5 FPGAs with external memory are used to perform real-time image data compression based on CCSDS 122.0-B-1. Parallel and concurrent handling strategies are used to achieve high-performance computing in the process.
HPC for Hyper- and Multispectral Remote Sensing I
Parallel implementation of linear and nonlinear spectral unmixing of remotely sensed hyperspectral images
Antonio Plaza, Javier Plaza
Hyperspectral unmixing is a very important task for remotely sensed hyperspectral data exploitation. It addresses the (possibly) mixed nature of pixels collected by instruments for Earth observation, which is due to several phenomena including limited spatial resolution and the presence of mixing effects at different scales. Spectral unmixing involves the separation of a mixed pixel spectrum into its pure component spectra (called endmembers) and the estimation of the proportion (abundance) of each endmember in the pixel. Two models have been widely used in the literature in order to address the mixture problem in hyperspectral data. The linear model assumes that the endmember substances sit side-by-side within the field of view of the imaging instrument. On the other hand, the nonlinear mixture model assumes nonlinear interactions between endmember substances. Both techniques can be computationally expensive, in particular for high-dimensional hyperspectral data sets. In this paper, we develop and compare parallel implementations of linear and nonlinear unmixing techniques for remotely sensed hyperspectral data. For the linear model, we adopt a parallel unsupervised processing chain made up of two steps: i) identification of pure spectral materials or endmembers, and ii) estimation of the abundance of each endmember in each pixel of the scene. For the nonlinear model, we adopt a supervised procedure based on the training of a parallel multi-layer perceptron neural network using intelligently selected training samples, also derived in parallel fashion. The compared techniques are experimentally validated using hyperspectral data collected at different altitudes over a so-called Dehesa (semi-arid environment) in Extremadura, Spain, and evaluated in terms of computational performance using high performance computing systems such as commodity Beowulf clusters.
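A minimal sketch of abundance estimation under the linear mixing model with a non-negativity constraint is shown below, assuming SciPy's NNLS solver and synthetic endmembers; endmember identification (step i) and the nonlinear neural-network chain are not reproduced.

    # Sketch of the linear mixing model and per-pixel abundance estimation with a
    # non-negativity constraint (SciPy NNLS). Endmembers here are synthetic.
    import numpy as np
    from scipy.optimize import nnls

    def unmix_pixel(E, pixel):
        """E: (bands, endmembers) matrix of pure spectra; returns abundances >= 0."""
        a, _residual = nnls(E, pixel)
        return a

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        bands, p = 50, 4
        E = rng.random((bands, p))                                # synthetic endmember spectra
        true_a = np.array([0.5, 0.3, 0.2, 0.0])
        pixel = E @ true_a + 0.001 * rng.standard_normal(bands)   # linear mixture + noise
        est = unmix_pixel(E, pixel)
        print(np.round(est / est.sum(), 3))                       # renormalized ~ sum-to-one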
A comparative analysis of GPU implementations of spectral unmixing algorithms
Sergio Sanchez, Antonio Plaza
Spectral unmixing is a very important task for remotely sensed hyperspectral data exploitation. It involves the separation of a mixed pixel spectrum into its pure component spectra (called endmembers) and the estimation of the proportion (abundance) of each endmember in the pixel. Over the last years, several algorithms have been proposed for: i) automatic extraction of endmembers, and ii) estimation of the abundance of endmembers in each pixel of the hyperspectral image. The latter step usually imposes two constraints on abundance estimation: the non-negativity constraint (meaning that the estimated abundances cannot be negative) and the sum-to-one constraint (meaning that the sum of endmember fractional abundances for a given pixel must be unity). These two steps comprise a hyperspectral unmixing chain, which can be very time-consuming (particularly for high-dimensional hyperspectral images). Parallel computing architectures have offered an attractive solution for fast unmixing of hyperspectral data sets, but these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time. In this paper, we perform an inter-comparison of parallel algorithms for automatic extraction of pure spectral signatures or endmembers and for estimation of the abundance of endmembers in each pixel of the scene. The compared techniques are implemented in graphics processing units (GPUs). These hardware accelerators can bridge the gap towards on-board processing of this kind of data. The considered algorithms comprise the orthogonal subspace projection (OSP), iterative error analysis (IEA) and N-FINDR algorithms for endmember extraction, as well as unconstrained, partially constrained and fully constrained abundance estimation. The considered implementations are inter-compared using different GPU architectures and hyperspectral data sets collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS).
FPGA implementation of endmember extraction algorithms from hyperspectral imagery: pixel purity index versus N-FINDR
Carlos Gonzalez, Daniel Mozos, Javier Resano, et al.
Endmember extraction is an important task for remotely sensed hyperspectral data exploitation. It comprises the identification of spectral signatures corresponding to macroscopically pure components in the scene, so that mixed pixels (resulting from limited spatial resolution, mixing phenomena happening at different scales, etc.) can be decomposed into combinations of pure component spectra weighted by an estimation of the proportion (abundance) of each endmember in the pixel. Over the last years, several algorithms have been proposed for automatic extraction of endmembers from hyperspectral images. These algorithms can be time-consuming (particularly for high-dimensional hyperspectral images). Parallel computing architectures have offered an attractive solution for fast endmember extraction from hyperspectral data sets, but these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power hardware components are essential to reduce mission payload, overcome downlink bandwidth limitations in the transmission of the hyperspectral data to ground stations on Earth, and obtain analysis results in (near) real-time. In this paper, we perform an inter-comparison of the hardware implementations of two widely used techniques for automatic endmember extraction from remotely sensed hyperspectral images: the pixel purity index (PPI) and the N-FINDR. The hardware versions have been developed in field programmable gate arrays (FPGAs). Our study reveals that these reconfigurable hardware devices can bridge the gap towards on-board processing of remotely sensed hyperspectral data and provide implementations that can significantly outperform the (optimized) equivalent software versions of the considered endmember extraction algorithms.
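For context, the sketch below shows the algorithmic core that such hardware implements in the PPI case: projecting every pixel onto random "skewer" vectors and counting how often each pixel is an extreme. It is a NumPy illustration with synthetic data, not a description of the FPGA design.

    # NumPy sketch of the pixel purity index (PPI) core: project every pixel onto
    # random unit vectors ("skewers") and count how often a pixel is extreme.
    import numpy as np

    def ppi_scores(pixels, n_skewers=1000, seed=0):
        """pixels: (n_pixels, n_bands); returns a purity count per pixel."""
        rng = np.random.default_rng(seed)
        n_pixels, n_bands = pixels.shape
        scores = np.zeros(n_pixels, dtype=np.int64)
        skewers = rng.standard_normal((n_skewers, n_bands))
        skewers /= np.linalg.norm(skewers, axis=1, keepdims=True)
        proj = pixels @ skewers.T                      # (n_pixels, n_skewers)
        np.add.at(scores, np.argmax(proj, axis=0), 1)  # extreme pixel per skewer (max)
        np.add.at(scores, np.argmin(proj, axis=0), 1)  # extreme pixel per skewer (min)
        return scores

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        E = rng.random((3, 20))                         # 3 synthetic endmembers, 20 bands
        A = rng.dirichlet(np.ones(3), size=5000)        # random abundances (sum to one)
        cube = A @ E + 0.001 * rng.standard_normal((5000, 20))
        scores = ppi_scores(cube)
        print(np.argsort(scores)[-3:])                  # most frequently extreme pixels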
Lossy hyperspectral image compression with state-of-the-art video encoder
Lucana Santos, Sebastian López, Gustavo M. Callicó, et al.
One of the main drawbacks encountered when dealing with hyperspectral images is the vast amount of data to process. This is especially dramatic when data are acquired by a satellite or an aircraft, due to the limited bandwidth of the channel used to transmit data to a ground station. Several solutions are being explored by the scientific community. Software approaches have limited throughput performance, are power hungry and most of the time do not meet the expectations of real-time applications. From the hardware point of view, FPGAs, GPUs and even the Cell processor represent attractive options, although they present complex solutions and potential problems for on-board inclusion. However, while there is an impetus for developing new architectural and technological solutions, there is plenty of past work that can be exploited to solve present drawbacks. In this scenario, H.264/AVC arises as the state-of-the-art standard in video coding, showing increased compression efficiency with respect to any previous standard, and although mainly used for video applications, it is worthwhile to explore its suitability for processing hyperspectral imagery. In this work, an inductive exercise of compressing hyperspectral cubes with H.264/AVC is carried out. An exhaustive set of simulations has been performed, applying this standard locally to each spectral band and evaluating globally the effect of the quantization factor, QP, in order to determine an optimum configuration of the baseline encoder for INTRA prediction modes. Results are presented in terms of spectral angle as a metric for determining the feasibility of endmember extraction. These results demonstrate that, under certain assumptions, the use of standard video codecs represents a good compromise solution in terms of complexity, flexibility and performance.
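The spectral angle metric mentioned above can be summarized with the short sketch below (NumPy, synthetic spectra); the H.264/AVC coding itself is not reproduced.

    # Sketch of the spectral angle metric used to judge compression quality:
    # the angle between an original and a reconstructed pixel spectrum (radians).
    import numpy as np

    def spectral_angle(x, y, eps=1e-12):
        cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)
        return float(np.arccos(np.clip(cos, -1.0, 1.0)))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        original = rng.random(224)                                    # AVIRIS-like 224-band spectrum
        reconstructed = original + 0.01 * rng.standard_normal(224)    # after lossy coding
        print(spectral_angle(original, reconstructed))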
HPC for Hyper- and Multispectral Remote Sensing II
GPU implementation of JPEG2000 for hyperspectral image compression
Milosz Ciznicki, Krzysztof Kurowski, Antonio Plaza
Hyperspectral image compression has received considerable interest in recent years due to the enormous data volumes collected by imaging spectrometers for Earth observation. JPEG2000 is an important data compression technique which has been successfully used in the context of hyperspectral image compression, in both lossless and lossy fashion. Due to the increasing spatial, spectral and temporal resolution of remotely sensed hyperspectral data sets, fast (on-board) compression of hyperspectral data is becoming a very important and challenging objective, with the potential to reduce the limitations in the downlink connection between the Earth observation platform and the receiving ground stations on Earth. For this purpose, implementations of hyperspectral image compression algorithms on specialized hardware devices are currently being investigated. In this paper, we develop an implementation of the JPEG2000 compression standard on commodity graphics processing units (GPUs). These hardware accelerators are characterized by their low cost and weight, and can bridge the gap towards on-board processing of remotely sensed hyperspectral data. Specifically, we develop GPU implementations of the lossless and lossy modes of JPEG2000. For the lossy mode, we investigate the utility of the compressed hyperspectral images for different compression ratios, using a standard technique for hyperspectral data exploitation such as spectral unmixing. In all cases, we investigate the speedups that can be gained by using the GPU implementations with regard to the serial implementations. Our study reveals that GPUs represent a source of computational power that is both accessible and applicable to obtaining compression results in valid response times in information extraction applications from remotely sensed hyperspectral imagery.
Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies
Jose M. Molero, Ester M. Garzón, Inmaculada García, et al.
Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance, and despite its high computational complexity when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in global or local fashion. This aspect is crucial for the RX implementation, since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of RX on multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability on different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristic (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.
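A NumPy sketch of the global-covariance variant of the RX detector is given below for reference; the local variant discussed in the paper recomputes the mean and covariance over a sliding window, and the data here are synthetic.

    # NumPy sketch of the global RX anomaly detector: the Mahalanobis distance of
    # each pixel spectrum from the global mean under the global sample covariance.
    import numpy as np

    def rx_global(cube):
        """cube: (n_pixels, n_bands); returns one RX score per pixel."""
        mu = cube.mean(axis=0)
        centered = cube - mu
        cov = np.cov(centered, rowvar=False)
        cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cube.shape[1]))  # regularized
        return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        background = rng.multivariate_normal(np.zeros(10), np.eye(10), size=5000)
        background[:5] += 8.0                            # a few implanted anomalies
        scores = rx_global(background)
        print(np.argsort(scores)[-5:])                   # indices of strongest anomalies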
Real time orthorectification of high resolution airborne pushbroom imagery
Advanced architectures have been proposed for efficient orthorectification of digital airborne camera images, including a system based on GPU processing and distributed computing able to geocorrect three digital still aerial photographs per second. Here, we address the computationally harder problem of geocorrecting image data from airborne pushbroom sensors, where each individual image line has its own associated camera attitude and position parameters. Using OpenGL and CUDA interoperability and projective texture techniques, originally developed for fast shadow rendering, image data are projected onto a Digital Terrain Model (DTM) as if by a slide projector placed and rotated in accordance with GPS position and inertial navigation (IMU) data. Each line is sequentially projected onto the DTM to generate an intermediate frame, consisting of a unique projected line shaped by the DTM relief. The frames are then merged into a geometrically corrected, georeferenced orthoimage. To target hyperband systems while avoiding the high-dimensional overhead, we work with an orthoimage of pixel placeholders pointing to the raw image data, which are then combined as needed for visualization or processing tasks. We achieved faster than real-time performance in a hyperspectral pushbroom system working at a line rate of 30 Hz with 200 bands and a 1280-pixel-wide swath over a 1 m grid DTM, reaching a minimum processing speed of 356 lines per second (up to 511 lps), over eleven (up to seventeen) times the acquisition rate. Our method also allows the correction of systematic GPS and/or IMU biases by means of 3D user-interactive navigation.
Design and analysis of algorithms for enhancing the quality and the resolution of Dubai Sat-1 images
DubaiSat-1 (DS1) captures multispectral images with 5-meter resolution using three visible bands, blue (420 to 510 nm), green (510 to 580 nm) and red (600 to 720 nm), and one near-IR band (760 to 890 nm). It also has a panchromatic channel with 2.5-meter resolution (420 to 720 nm). [1] Under certain conditions, degradation in quality might occur in DS1 captured images. The aim of this project is to enhance the quality of the image in terms of resolution, sharpness and color quality. It is well known that the enhancement procedure is a very difficult task due to the significant noise increase resulting from any sharpening action. Moreover, sometimes the color of the captured images might become saturated, so that some areas are given false coloring (i.e., some colors are presented as gray instead of their original colors).
Applications of HPC in Remote Sensing
Efficient GPU implementation of tsunami simulation
Muhammad T. Satria, Bormin Huang, Tung-Ju Hsieh, et al.
Tsunami propagation in the shallow water zone is often modeled by the shallow water equations (also called Saint-Venant equations), which are derived from the conservation of mass and conservation of momentum equations. Adding a friction slope to the conservation of momentum equations enables the system to simulate propagation over the coastal area, which means the system is also able to estimate the inundation zone caused by the tsunami. Applying a Neumann boundary condition and a Hansen numerical filter brings further complexity into the system. We solve the system using the two-step finite-difference MacCormack scheme, which is potentially parallelizable. In this paper, we discuss the parallel implementation of the MacCormack scheme for the shallow water equations on a modern graphics processing unit (GPU) architecture using NVIDIA CUDA technology. On a single Fermi-generation NVIDIA GPU C2050, we achieved a 223x speedup, with the result output at each time step, over the original C code compiled with the -O3 optimization flag. If the experiment only outputs the final time step result to the host, our CUDA implementation achieves around 818x speedup over its single-threaded CPU counterpart.
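As an illustration of the two-step MacCormack structure, a 1-D, frictionless NumPy sketch of the shallow water equations is given below; the paper's 2-D GPU scheme with friction slope, Neumann boundaries and the Hansen filter follows the same predictor-corrector pattern.

    # 1-D, frictionless NumPy sketch of the two-step MacCormack scheme for the
    # shallow water (Saint-Venant) equations with a smooth initial hump.
    import numpy as np

    g = 9.81

    def flux(h, hu):
        return np.array([hu, hu**2 / h + 0.5 * g * h**2])

    def maccormack_step(h, hu, dx, dt):
        U = np.array([h, hu])
        F = flux(h, hu)
        # Predictor: forward differences.
        Us = U.copy()
        Us[:, :-1] = U[:, :-1] - dt / dx * (F[:, 1:] - F[:, :-1])
        Fs = flux(Us[0], Us[1])
        # Corrector: backward differences, then average with the predictor.
        Un = U.copy()
        Un[:, 1:] = 0.5 * (U[:, 1:] + Us[:, 1:]) - 0.5 * dt / dx * (Fs[:, 1:] - Fs[:, :-1])
        # Simple reflective boundaries.
        Un[:, 0], Un[:, -1] = Un[:, 1], Un[:, -2]
        Un[1, 0] = Un[1, -1] = 0.0
        return Un[0], Un[1]

    if __name__ == "__main__":
        nx, dx, dt = 400, 25.0, 0.5
        x = np.arange(nx)
        h = 1.0 + np.exp(-((x - 200.0) ** 2) / 200.0)    # smooth initial hump ("tsunami")
        hu = np.zeros(nx)
        for _ in range(200):
            h, hu = maccormack_step(h, hu, dx, dt)
        print(round(h.max(), 3), round(h.min(), 3))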
GPU acceleration of the WRF Purdue Lin cloud microphysics scheme
Jun Wang, Bormin Huang, Hung-Lung Allen Huang, et al.
The Weather Research and Forecasting (WRF) model is a numerical weather prediction and atmospheric simulation system. It has been designed for both research and operational applications. WRF code can be run in different computing environments ranging from laptops to supercomputers. The Purdue Lin scheme is a relatively sophisticated microphysics scheme in WRF. The scheme includes six classes of hydrometeors: water vapor, cloud water, rain, cloud ice, snow and graupel. In this paper, we accelerate the Purdue Lin scheme on many-core NVIDIA Graphics Processing Units (GPUs). Lately, GPUs have evolved into highly parallel, multi-threaded, many-core processors possessing tremendous computational speed and high memory bandwidth. We discuss how our GPU implementation exploits this massive parallelism, resulting in a highly efficient acceleration of the Purdue Lin scheme. We utilize a low-cost personal supercomputer with 512 CUDA cores on a GTX 590 GPU. We achieve an overall speedup of 156x in the case of one GPU as compared to the single-threaded CPU version. Since the Purdue Lin microphysics scheme is only an intermediate module of the entire WRF model, host-device I/O should not happen, i.e. its input data is already available in the GPU global memory from previous modules and its output data should reside in the GPU global memory for later usage by other modules. The speedup without host-device data transfer time is 692x.
GPU acceleration of WRF WSM5 microphysics
The Weather Research and Forecasting (WRF) model is the most widely used community weather forecast and research model in the world. There are several single-moment ice microphysics schemes in WRF. The mixed-phase WRF Single Moment (WSM) schemes represent the condensation, precipitation, and thermodynamic effects of latent heat release. In this paper, we show our optimization efforts on WSM5. The processing time can be reduced from 16928 ms on the CPU to 48.3 ms using a general-purpose graphics processing unit (GPGPU). Thus, the speedup is 350x without I/O using a single GPU. Taking I/O transfer times into account, the speedup is 202x.
Compute Unified Device Architecture (CUDA)-based parallelization of WRF Kessler cloud microphysics scheme
Jun Wang, Bormin Huang, Hung-Lung Allen Huang, et al.
The Weather Research and Forecasting (WRF) model is the latest-generation numerical weather prediction model. It has been designed to serve both operational forecasting and atmospheric research needs. It proves useful for a broad spectrum of applications at scales ranging from meters to thousands of kilometers. WRF computes an approximate solution to the differential equations which govern the air motion of the whole atmosphere. The Kessler microphysics module in WRF is a simple warm-cloud scheme that includes water vapor, cloud water and rain. The microphysics processes modeled are rain production, fall and evaporation. The accretion and auto-conversion of cloud water are also included, along with the production of cloud water from condensation. In this paper, we develop an efficient WRF Kessler microphysics scheme which runs on Graphics Processing Units (GPUs) using the NVIDIA Compute Unified Device Architecture (CUDA). The GPU-based implementation of the Kessler microphysics scheme achieves a significant speedup of 70x over its CPU-based single-threaded counterpart. The speedup on a GPU without host-device data transfer time is 816x. Since the Kessler microphysics scheme is just an intermediate module of the entire WRF model, GPU I/O should not occur, i.e. its input data should already be available in the GPU global memory from previous modules and its output data should reside in the GPU global memory for later usage by other modules. Thus, the limited scaling of the Kessler scheme with I/O will not be an issue once all modules have been rewritten using CUDA. High-speed WRF running completely on GPUs promises more accurate forecasts in considerably less time.
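The warm-rain core of a Kessler-type scheme can be sketched as below, using the textbook autoconversion and accretion forms with commonly quoted constants (not necessarily the exact values in the WRF module); each grid point is independent, which is what maps naturally onto one GPU thread per point.

    # Sketch of the warm-rain part of a Kessler-type scheme: autoconversion of
    # cloud water to rain and accretion of cloud water by rain (textbook forms).
    import numpy as np

    K1 = 1.0e-3        # autoconversion rate [1/s]
    QC0 = 0.5e-3       # autoconversion threshold [kg/kg]
    K2 = 2.2           # accretion coefficient

    def kessler_warm_rain_tendencies(qc, qr):
        """qc, qr: cloud-water and rain mixing ratios [kg/kg] on any grid (arrays)."""
        autoconversion = K1 * np.maximum(qc - QC0, 0.0)               # qc -> qr above threshold
        accretion = K2 * qc * np.power(np.maximum(qr, 0.0), 0.875)    # collection by rain
        dqc_dt = -(autoconversion + accretion)
        dqr_dt = autoconversion + accretion
        return dqc_dt, dqr_dt

    if __name__ == "__main__":
        # One grid column; each point is independent, matching a thread-per-point
        # GPU decomposition like the one described above.
        qc = np.linspace(0.0, 2.0e-3, 10)
        qr = np.full(10, 1.0e-4)
        dqc, dqr = kessler_warm_rain_tendencies(qc, qr)
        print(np.round(dqr * 3600.0, 6))    # rain tendency per hour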
Development of the GPU-based Stony-Brook University 5-class microphysics scheme in the weather research and forecasting model
The Weather Research and Forecasting (WRF) model is an atmospheric simulation system designed for both operational and research use. This common tool aspect promotes closer ties between the research and operational communities. It contains many different physics and dynamics options reflecting the experience and input of the broad scientific community. The WRF physics categories are microphysics, cumulus parameterization, planetary boundary layer, land-surface model and radiation. Explicitly resolved water vapor, cloud and precipitation processes are included in microphysics. Several bulk water microphysics schemes are available within WRF, with different numbers of simulated hydrometeor classes and different methods for estimating their size distributions, fall speeds and densities. The Stony Brook University (SBU-YLIN) microphysics scheme is a 5-class scheme with predicted riming intensity to account for mixed-phase processes. In this paper, we develop an efficient graphics processing unit (GPU) based SBU-YLIN scheme. The WRF computational domain is a 3D grid laid over the Earth. SBU-YLIN performs the same computation for each spatial position in the whole domain. This repetition of the same computation on different data sets allows the use of the GPU's Single Instruction Multiple Data (SIMD) architecture. The GPU-based SBU-YLIN scheme is compared to a CPU-based single-threaded counterpart. The implementation achieves a 213x speedup with I/O compared to a Fortran implementation running on a CPU. Without I/O the speedup is 896x.
Calculating the electromagnetic scattering of vegetation by Monte Carlo and CUDA
Electromagnetic scattering of vegetation is represented by a double-layer model comprising a vegetation layer and a ground layer. The vegetation layer is composed of discrete leaves which are approximated as ellipsoids. The ground layer is modeled as a random rough surface. Investigation of the scattering field of a single leaf is carried out first. Then the leaves are divided into different groups depending on their orientation. Considering the incoherent addition property of Stokes parameters, the Stokes matrix and the phase matrix of every group are calculated, and these are finally summed to obtain the total scattering coefficient. In the original CPU-based sequential code, the Monte Carlo simulation to calculate the electromagnetic scattering of vegetation takes 97.2% of the total execution time. In this paper we take advantage of the large-scale parallelism of the Compute Unified Device Architecture (CUDA) to create and compute all the groups simultaneously. As a result, a speedup of up to 213x is achieved on a single Fermi-generation NVIDIA GTX 480 GPU.