Proceedings Volume 7872

Parallel Processing for Imaging Applications

John D. Owens, I-Jong Lin, Yu-Jin Zhang, et al.
View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 24 January 2011
Contents: 10 Sessions, 29 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2011
Volume Number: 7872

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 7872
  • Parallel Imaging Systems
  • From Image to Structure I
  • From Image to Structure II
  • From Structure to Image I
  • From Structure to Image II
  • Speed vs. Accuracy Trade-off I
  • Speed vs. Accuracy Trade-off II
  • Imaging Applications
  • Interactive Paper Session
Front Matter: Volume 7872
This PDF file contains the front matter associated with SPIE Proceedings Volume 7872, including the Title Page, Copyright information, Table of Contents, and the Conference Committee listing.
Parallel Imaging Systems
Using a commercial graphical processing unit and the CUDA programming language to accelerate scientific image processing applications
In the past two years the processing power of video graphics cards has quadrupled and is approaching supercomputer levels. State-of-the-art graphical processing units (GPUs) boast theoretical computational performance in the range of 1.5 trillion floating point operations per second (1.5 teraflops). This processing power is readily accessible to the scientific community at relatively small cost. High-level programming languages are now available that give access to the internal architecture of the graphics card, allowing greater algorithm optimization. This research takes the memory-access-expensive portions of an image-based iris identification algorithm and hosts them on a GPU using the C++-compatible CUDA language. The selected segmentation algorithm uses basic image processing techniques such as image inversion, value squaring, thresholding, dilation, and erosion, as well as memory- and computationally intensive calculations such as the circular Hough transform. Portions of the iris segmentation algorithm were accelerated by a factor of 77 over the 2008 GPU results. Some parts of the algorithm ran over 1600 times faster than their CPU counterparts. Strengths and limitations of the GPU Single Instruction Multiple Data (SIMD) architecture are discussed. Memory access times, instruction execution times, programming details, and code samples are presented as part of the research.
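The basic pixel-wise operations listed above map naturally onto one-thread-per-pixel CUDA kernels. As a minimal sketch of that pattern (our illustration, not the paper's code), a binary threshold kernel and its launch might look like:

```cpp
// Minimal sketch of a per-pixel threshold kernel of the kind used as a
// building block in GPU iris segmentation. Each thread handles one pixel.
__global__ void thresholdKernel(const unsigned char* in, unsigned char* out,
                                int width, int height, unsigned char t)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        out[idx] = (in[idx] > t) ? 255 : 0;  // binary threshold
    }
}

// Host-side launch for a W x H image already resident in device memory:
//   dim3 block(16, 16);
//   dim3 grid((W + 15) / 16, (H + 15) / 16);
//   thresholdKernel<<<grid, block>>>(d_in, d_out, W, H, 128);
```

Inversion, squaring, dilation, and erosion follow the same one-thread-per-pixel template; the circular Hough transform additionally requires atomic accumulation into a vote array, which is typically where the memory-access cost discussed in the paper arises.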
Automatic distribution of vision-tasks on computing clusters
Thomas Müller, Binh An Tran, Alois Knoll
In this paper a consistent, efficient, and yet convenient system for parallel computer vision, and in fact also real-time actuator control, is proposed. The system implements the multi-agent paradigm and a blackboard information store. This, in combination with a generic interface for hardware abstraction and integration of external software components, is set up on the basis of the Message Passing Interface (MPI). The system allows for data- and task-parallel processing, and supports both synchronous communication strategies, where data exchange is triggered by events, and asynchronous ones, where data is polled. Also, by duplication of processing units (agents), redundant processing is possible to achieve greater robustness. As the system automatically distributes the task units to available resources, and a monitoring concept allows tasks to be combined and composed into complex processes, efficient parallel vision/robotics applications can be developed quickly. Multiple vision-based applications have already been implemented, spanning academic research and prototypes for industrial automation. The system has recently been released to the scientific community as open source.
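The system wraps MPI in agents and a blackboard store rather than exposing raw message passing; still, a minimal sketch of the underlying data-parallel distribution (hypothetical code, not the system's API) could look like:

```cpp
// Hypothetical sketch of MPI-based data parallelism: the root scatters
// horizontal image stripes to all ranks, each rank processes its stripe,
// and the results are gathered back.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int width = 640, height = 480;
    const int rowsPerRank = height / size;        // assume size divides height
    std::vector<unsigned char> image;
    if (rank == 0) image.resize(width * height);  // root holds the full frame

    std::vector<unsigned char> stripe(width * rowsPerRank);
    MPI_Scatter(image.data(), width * rowsPerRank, MPI_UNSIGNED_CHAR,
                stripe.data(), width * rowsPerRank, MPI_UNSIGNED_CHAR,
                0, MPI_COMM_WORLD);

    // ... each agent processes its stripe here ...

    MPI_Gather(stripe.data(), width * rowsPerRank, MPI_UNSIGNED_CHAR,
               image.data(), width * rowsPerRank, MPI_UNSIGNED_CHAR,
               0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```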
Highly scalable digital front end architectures for digital printing
David Staas
HP's digital printing presses consume a tremendous amount of data. The architectures of the Digital Front Ends (DFEs) that feed these large, very fast presses have evolved from basic, single-RIP (Raster Image Processor) systems to multirack, distributed systems that can take a PDF file and deliver data in excess of 3 Gigapixels per second to keep the presses printing at 2000+ pages per minute. This paper highlights some of the more interesting parallelism features of our DFE architectures. The high-performance architecture developed over the last 5+ years can scale up to HP's largest digital press, out to multiple mid-range presses, and down into a very low-cost single box deployment for low-end devices as appropriate. Principles of parallelism pervade every aspect of the architecture, from the lowest-level elements of jobs to parallel imaging pipelines that feed multiple presses. From cores to threads to arrays to network teams to distributed machines, we use a systematic approach to move bottlenecks. The ultimate goals of these efforts are: to take the best advantage of the prevailing hardware options at our disposal; to reduce power consumption and cooling requirements; and to ultimately reduce the cost of the solution to our customers.
Parallel training and testing methods for complex image processing algorithms on distributed, heterogeneous, unreliable, and non-dedicated resources
Rubén Usamentiaga, Daniel F. García, Julio Molleda, et al.
Advances in the image processing field have brought new methods which are able to perform complex tasks robustly. However, in order to meet constraints on functionality and reliability, imaging application developers often design complex algorithms with many parameters which must be finely tuned for each particular environment. The best approach for tuning these algorithms is to use an automatic training method, but the computational cost of this kind of training is prohibitive, making it infeasible even on powerful machines. The same problem arises when designing testing procedures. This work presents methods to train and test complex image processing algorithms in parallel execution environments. The approach proposed in this work is to use existing resources in offices or laboratories, rather than expensive clusters. These resources are typically non-dedicated, heterogeneous, and unreliable, and the proposed methods have been designed to deal with all these issues. Two methods are proposed: intelligent training based on genetic algorithms and PVM, and a full factorial design based on grid computing which can be used for training or testing. These methods are capable of harnessing the available computational resources, giving more work to more powerful machines, while taking their unreliable nature into account. Both methods have been tested using real applications.
Integrated parallel printing systems with hypermodular architecture
David Biegelsen, Lara Crawford, Minh Do, et al.
We describe here a system consisting of multiple, relatively inexpensive marking engines. The marking engines are interconnected using highly reconfigurable paper paths. The paths are composed of hypermodules (bidirectional nip assemblies and sheet director assemblies), each of which has its own computation, sensing, actuation, and communications capabilities. Auto-identification is used to inform a system-level controller of the potential paths through the system as well as module capabilities. Motion control of cut sheets, which of necessity reside physically within multiple hypermodules simultaneously, requires a new abstraction, namely a sheet controller, which coordinates control of a given sheet as it moves through the system. Software/hardware co-design has provided a system architecture that is scalable without requiring user relearning. Here we describe the capabilities of an exemplary system consisting of 160 modular entities and four marking engines. The throughput of the system is very nearly four times that of a single print engine.
From Image to Structure I
Parallel processing considerations for image recognition tasks
Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows (as diverse as optical character recognition [OCR], document classification, and barcode reading) to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be subdivided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
GPGPU real-time texture analysis framework
M. A. Akhloufi, F. Gariepy, G. Champagne
This work presents a framework for fast texture analysis in computer vision. The speedup is obtained using General-Purpose Processing on Graphics Processing Units (GPGPU). For this purpose, we have selected the following texture analysis techniques: LBP (Local Binary Patterns), LTP (Local Ternary Patterns), Laws texture kernels, and Gabor filters. GPU optimizations are compared to CPU optimizations using MMX/SSE technologies and multi-core parallel programming. The experimental results show a substantial increase in the performance of the proposed algorithms when GPGPU is used, particularly for large image sizes.
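Of the listed descriptors, LBP is the simplest to express as a one-thread-per-pixel kernel. A minimal CUDA sketch of the 3x3 LBP operator (our illustration; the paper also covers LTP, Laws kernels, and Gabor filters) might be:

```cpp
// Sketch of the basic 3x3 LBP operator: each thread compares its pixel's
// eight neighbors to the center value and packs the results into an
// 8-bit code.
__global__ void lbpKernel(const unsigned char* in, unsigned char* out,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    unsigned char c = in[y * width + x];
    // Neighbor offsets, clockwise from the top-left corner.
    const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    unsigned char code = 0;
    for (int i = 0; i < 8; ++i)
        code |= (in[(y + dy[i]) * width + (x + dx[i])] >= c) << i;
    out[y * width + x] = code;
}
```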
A parallel implementation of 3D Zernike moment analysis
Daniel Berjón, Sergio Arnaldo, Francisco Morán
Zernike polynomials are a well-known set of functions that find many applications in image or pattern characterization because they allow the construction of shape descriptors that are invariant to translation, rotation, and scale changes. The concepts behind them can be extended to higher-dimensional spaces, making them also suitable for describing volumetric data, but they have been used less than their properties might suggest due to their high computational cost. We present a parallel implementation of 3D Zernike moment analysis, written in C with CUDA extensions, which makes it practical to employ Zernike descriptors in interactive applications, yielding a performance of several frames per second on voxel datasets about 200³ in size. In our contribution, we describe the challenges of implementing 3D Zernike analysis on a general-purpose GPU. These include how to deal with numerical inaccuracies, given the high precision demands of the algorithm, and how to handle the high volume of input data so that it does not become a bottleneck for the system.
A novel parallel algorithm for airport runway segmentation in satellite images using priority directional region growing strategy based on ensemble learning
Fei Duan, Yu-Jin Zhang
This paper addresses the problem of airport runway segmentation in satellite images with complex background clutter. To this end, we propose a novel ensemble-learning-based parallel runway segmentation algorithm. The contributions of our work can be summarized as follows: (a) we propose the concept of priority directional region growing; (b) we introduce Bresenham's line-generating algorithm into our segmentation task to better exploit a priori structural information; (c) we adopt a two-stage strategy to better segment the regions corresponding to the airport runway, applying the traditional region growing method and our priority directional (two orthogonal directions in our problem) region growing method sequentially; (d) in our runway segmentation algorithm, the ensemble-learning strategy is used to combine the growing results of each detected line segment, and thin side branches of significantly different width are eliminated. To evaluate the effectiveness of our algorithm, extensive simulations are carried out on test images obtained from Google Maps. Our experimental results show that the proposed algorithm can effectively and efficiently segment the airport region, generates relatively neat boundaries for the runways, and clearly outperforms state-of-the-art methods.
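Bresenham's algorithm, referenced in contribution (b), generates the integer pixel coordinates of a line segment using only additions and comparisons. A textbook C version (not the authors' implementation; visit() is a hypothetical hook standing in for the region-growing step applied at each pixel) is:

```cpp
#include <cstdlib>

// Classic all-octant Bresenham line generation: walks integer pixels from
// (x0, y0) to (x1, y1), calling visit() at each one.
void bresenham(int x0, int y0, int x1, int y1, void (*visit)(int, int))
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;  // combined error term
    for (;;) {
        visit(x0, y0);  // e.g., seed the priority directional growing here
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}
```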
Visualization assisted by parallel processing
B. Lange, H. Rey, X. Vasques, et al.
This paper discusses experimental results for our visualization model for data extracted from sensors. The objective is to find a computationally efficient method to produce real-time rendering of a large amount of data. We develop a visualization method to monitor the temperature variance of a data center. Sensors are placed on three layers and do not cover the whole room, so we use a particle paradigm to interpolate the sensor data; the particles model the "space" of the room. In this work we partition the particle set using two mathematical methods, Delaunay triangulation and Voronoi cells, as presented by Avis and Bhattacharya. Particles provide information on the room temperature at different coordinates over time. To locate and update particle data we define a computational cost function, and to evaluate this function efficiently we use a client-server paradigm: the server computes the data and the client displays it on different kinds of hardware. This paper is organized as follows. The first part presents related algorithms used to visualize large flows of data. The second part presents the different platforms and methods that were evaluated in order to determine the best solution for the proposed task. The benchmark uses the computational cost of our algorithm, based on locating particles relative to sensors and on updating particle values, and was run on a personal computer using single-core CPU, multi-core, GPU, and hybrid GPU/CPU programming. GPU programming is a growing research field; it allows real-time rendering instead of precomputed rendering. To improve our results, we also ran our algorithm on a High Performance Computing (HPC) platform, which was used to improve the multi-core method. HPC is commonly used in data visualization (astronomy, physics, etc.) to improve rendering and achieve real-time performance.
From Image to Structure II
A parallel impulse-noise detection algorithm based on ensemble learning for switching median filters
Fei Duan, Yu-Jin Zhang
In this paper, a highly effective and efficient ensemble-learning-based parallel impulse noise detection algorithm is proposed. The contribution of this paper is three-fold. First, we propose a novel intensity homogeneity metric, the Directional Homogeneity Descriptor (DHD), which has very powerful discriminative ability, as demonstrated in our previous work. Second, the proposed algorithm is highly parallel in the feature extraction, classifier training, and testing stages, and the proposed architecture is very suitable for distributed processing. Finally, instead of manually tuning thresholds for each feature, as most works in this research area do, we use a Random Forest to make decisions, since it has been demonstrated to have better generalization ability and performance comparable to SVMs or boosting in classification problems. Another important reason we adopt Random Forest is that it has a naturally parallel structure and a very significant performance advantage over other popular classifiers such as SVMs or boosting (e.g., the overhead of training and testing the model is very low). To the best of our knowledge, this is the first time ensemble learning strategies have been used in the area of switching median filtering. Extensive simulations are carried out on several of the most common standard test images. The experimental results show that our algorithm achieves zero missed detections while keeping the false alarm rate at a rather low level, and that it has great superiority over other state-of-the-art methods.
From Structure to Image I
GPU color space conversion
Patrick Chase, Gary Vondran
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
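For readers unfamiliar with the interpolation itself: the unit cube around a sample point splits into six tetrahedra, and the ordering of the fractional coordinates selects the one containing the point. A sketch of that core for one output channel (our reconstruction of the standard algorithm, not the paper's tuned kernels) is:

```cpp
// Tetrahedral interpolation inside one LUT cube. pXYZ are the eight corner
// node values; fx, fy, fz are the fractional coordinates in [0,1). The
// ordering of fx, fy, fz picks one of six tetrahedra, each needing only
// four of the eight corners.
__device__ float tetraInterp(float fx, float fy, float fz,
                             float p000, float p100, float p010, float p001,
                             float p110, float p101, float p011, float p111)
{
    if (fx >= fy) {
        if (fy >= fz)       // fx >= fy >= fz
            return p000 + fx*(p100-p000) + fy*(p110-p100) + fz*(p111-p110);
        else if (fx >= fz)  // fx >= fz > fy
            return p000 + fx*(p100-p000) + fz*(p101-p100) + fy*(p111-p101);
        else                // fz > fx >= fy
            return p000 + fz*(p001-p000) + fx*(p101-p001) + fy*(p111-p101);
    } else {
        if (fz >= fy)       // fz >= fy > fx
            return p000 + fz*(p001-p000) + fy*(p011-p001) + fx*(p111-p011);
        else if (fz >= fx)  // fy > fz >= fx
            return p000 + fy*(p010-p000) + fz*(p011-p010) + fx*(p111-p011);
        else                // fy > fx > fz
            return p000 + fy*(p010-p000) + fx*(p110-p010) + fz*(p111-p110);
    }
}
```

The branchy structure is exactly what the paper tunes against GPU branching behavior; the memory-layout work it describes governs how cheaply the eight corner values can be fetched.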
Acceleration of the Retinex algorithm for image restoration by GPGPU/CUDA
Yuan-Kai Wang, Wen-Bin Huang
Retinex is an image restoration method that can restore an image's original appearance. The Retinex algorithm uses a Gaussian blur convolution with a large kernel size to compute the center/surround information, then performs pixel-wise log-domain processing between the original image and the center/surround information. The final step of the algorithm normalizes the results of the log-domain processing to an appropriate dynamic range. This paper presents GPURetinex, a data-parallel algorithm devised by parallelizing Retinex on GPGPU/CUDA. GPURetinex exploits the GPGPU's massively parallel architecture and hierarchical memory, distributing work across hierarchical threads, and its implementation is optimized to take full advantage of the properties of GPGPU/CUDA computing. In our experiments, a GT200 GPU and CUDA 3.0 are employed. The experimental results show that GPURetinex can gain a 30x speedup compared with a CPU-based implementation on images with 2048 x 2048 resolution. Our results indicate that using CUDA can deliver acceleration sufficient for real-time performance.
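The log-domain step described above is embarrassingly parallel. A minimal CUDA sketch of it (our illustration, not the GPURetinex source; the Gaussian-blurred surround is assumed to come from a prior pass) might be:

```cpp
#include <math.h>

// Pixel-wise log-domain step of single-scale Retinex:
// R(x) = log(I(x)) - log((G * I)(x)), with the blurred surround
// precomputed. One thread per pixel over a flattened n-pixel image.
__global__ void retinexLogKernel(const float* image, const float* surround,
                                 float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const float eps = 1e-6f;  // guard against log(0)
        out[i] = logf(image[i] + eps) - logf(surround[i] + eps);
    }
}
```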
Performance evaluation of Canny edge detection on a tiled multicore architecture
Andrew Z. Brethorst, Nehal Desai, Douglas P. Enright, et al.
In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, which can parallelize across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, programmer-implemented domain decomposition exhibits higher speed-ups than compiler-managed, loop-level parallelism implemented with OpenMP.
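To make the contrast concrete, a loop-level parallelization of one Canny stage (here a Sobel gradient-magnitude pass) needs only a single OpenMP pragma; this is a generic sketch, not the paper's Tile64 code:

```cpp
#include <cmath>
#include <omp.h>

// Loop-level parallelism: OpenMP splits the row loop of one pipeline stage
// across threads. 'in' is an 8-bit grayscale image, 'mag' the output.
void gradientMagnitude(const unsigned char* in, float* mag,
                       int width, int height)
{
    #pragma omp parallel for
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            int i = y * width + x;
            // Sobel responses: right column minus left, bottom row minus top.
            float gx = (in[i - width + 1] + 2*in[i + 1] + in[i + width + 1])
                     - (in[i - width - 1] + 2*in[i - 1] + in[i + width - 1]);
            float gy = (in[i + width - 1] + 2*in[i + width] + in[i + width + 1])
                     - (in[i - width - 1] + 2*in[i - width] + in[i - width + 1]);
            mag[i] = sqrtf(gx * gx + gy * gy);
        }
    }
}
```

Domain decomposition instead hands each thread a whole subimage (plus halo rows) and runs every Canny stage on it, avoiding the fork/join barrier between stages that the pragma approach implies; that difference is consistent with the higher speed-ups the paper reports for decomposition.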
From Structure to Image II
Video transcoding using GPU accelerated decoder
Due to the growing popularity of portable multimedia display devices and the wide availability of high-definition video content, transcoding high-resolution videos into lower-resolution ones in different formats has become a crucial challenge for PC platforms. This paper presents our study on leveraging the Unified Video Decoder (UVD) provided by the graphics processing unit (GPU) to achieve high-speed video transcoding with low CPU usage. Our experimental results show that off-loading video decoding and video scaling to the GPU can double transcoding speed with only half the CPU usage, compared to in-box software decoders, for transcoding 1080p (1920x1080) video content on an AMD Vision processor with an integrated graphics unit.
Real-time image deconvolution on the GPU
James T. Klosowski, Shankar Krishnan
Two-dimensional image deconvolution is an important and well-studied problem with applications to image deblurring and restoration. Most of the best deconvolution algorithms use natural image statistics that act as priors to regularize the problem. Recently, Krishnan and Fergus provided a fast deconvolution algorithm that yields results comparable to the current state of the art. They use a hyper-Laplacian image prior to regularize the problem, and the resulting optimization problem is solved using alternating minimization in conjunction with a half-quadratic penalty function. In this paper, we provide an efficient CUDA implementation of their algorithm on the GPU. Our implementation leverages many well-known CUDA optimization techniques, as well as several others that have a significant impact on this particular algorithm. We discuss each of these, and make a few observations regarding the CUFFT library. Our experiments were run on an Nvidia GeForce GTX 260. For a single-channel image of size 710 x 470, we obtain over 40 fps, while on a larger image of size 1900 x 1266, we get almost 6 fps (without counting disk I/O). In addition to performance that scales roughly linearly with image size, we believe ours is the first implementation to perform deconvolutions at video rates. Our running times also demonstrate that our GPU implementation is over 27 times faster than the original CPU implementation.
Stitching giga pixel images using parallel computing
Rob Kooper, Peter Bajcsy, Néstor Morales Hernández
This paper addresses the problem of stitching gigapixel images from airborne images acquired over multiple flight paths of Costa Rica in 2005. The set of input images contains about 10,158 images, each of size around 4072x4072 pixels, with very coarse georeferencing information (latitude and longitude of each image). Given the spatial coverage and resolution of the input images, the final stitched color image is 294,847 by 269,195 pixels (79.3 gigapixels) and corresponds to 238.2 gigabytes. An assembly of such large images requires either hardware with large shared memory or algorithms using disk access in tandem with available RAM to provide data for local image operations. In addition to I/O operations, the computations needed to stitch together image tiles involve at least one image transformation and multiple comparisons to place the pixels into a pyramid representation for fast dissemination. The motivation of our work is to explore the utilization of multiple hardware architectures (e.g., multicore servers, computer clusters) and parallel computing to minimize the time needed to stitch gigapixel images. Our approach is to utilize the coarse georeferencing information for initial image grouping, followed by an intensity-based stitching of groups of images. This group-based stitching is highly parallelizable. The stitching process results in image patches that can be cropped to fit a tile of an image pyramid, a data structure frequently used for fast image access and retrieval. We report our experimental results obtained when stitching a four-gigapixel image from the input images at one fourth of their original spatial resolution using a single core of our eight-core server, and our preliminary results for the entire 79.3-gigapixel image obtained using a 120-core computer cluster.
Speed vs. Accuracy Trade-off I
GPU-completeness: theory and implications
This paper formalizes a major insight into a class of algorithms that relates parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g., visual quality, compression rate), and we propose a method for algorithmic classification, modeled on NP-Completeness techniques, applied toward parallel acceleration. We define this class of algorithms as "GPU-Complete" and postulate the necessary properties of an algorithm for admission into this class. We also formally relate this algorithmic space to the space of imaging algorithms. The concept is based upon our experience in the print production area, where GPUs (Graphics Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for the CPU, language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for the GPU, video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a well-defined subset of algorithms, GPU-Completeness is intended to connect parallelism, algorithms, and efficient architectures into a unified framework, showing that multiple layers of parallel implementation are guided by the same underlying trade-off.
A parallel error diffusion implementation on a GPU
Yao Zhang, John Ludd Recker, Robert Ulichney, et al.
In this paper, we investigate the suitability of the GPU for a parallel implementation of pinwheel error diffusion. We demonstrate a high-performance GPU implementation by efficiently parallelizing and unrolling the image processing algorithm. Our GPU implementation achieves a 10-30x speedup over a two-threaded CPU error diffusion implementation, with comparable image quality. We have conducted experiments to study the performance and quality tradeoffs for different image block sizes. We also present a performance analysis at the assembly level to understand the performance bottlenecks.
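For reference, the sequential dependency that pinwheel error diffusion must break is visible in the classic serial Floyd-Steinberg loop below (our sketch of the standard algorithm, not the paper's code; the pinwheel variant instead processes specially shaped image blocks independently):

```cpp
// Serial error diffusion with the classic Floyd-Steinberg weights
// (7/16, 3/16, 5/16, 1/16). Each pixel's quantization error feeds its
// right and lower neighbors, which is what makes the raster order serial.
void floydSteinberg(float* img, int width, int height)  // img values in [0,255]
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int i = y * width + x;
            float old = img[i];
            float q = old < 128.0f ? 0.0f : 255.0f;  // quantize to 1 bit
            img[i] = q;
            float e = old - q;                       // diffuse the error
            if (x + 1 < width)         img[i + 1]         += e * 7 / 16;
            if (y + 1 < height) {
                if (x > 0)             img[i + width - 1] += e * 3 / 16;
                                       img[i + width]     += e * 5 / 16;
                if (x + 1 < width)     img[i + width + 1] += e * 1 / 16;
            }
        }
    }
}
```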
Experience with imaging algorithms on multiple core CPUs
Richard Moore
With the release of an eight-core Xeon processor by Intel and a twelve-core Opteron processor by AMD in the spring of 2010, the increase of multiple cores per chip package continues. Multi-core processors are commonplace in most workstations sold today and are an attractive option for increasing imaging performance. Visual attention models are very compute intensive, requiring many imaging algorithms to be run on images, such as large difference-of-Gaussians filters, segmentation, and region finding. In this paper we present our experience in optimizing the performance of a visual attention model on standard multi-core Windows workstations.
Speed vs. Accuracy Trade-off II
Evaluation of CPU and GPU architectures for spectral image analysis algorithms
Virginie Fresse, Dominique Houzet, Christophe Gravier
Graphics Processing Unit (GPU) architectures are massively used for resource-intensive computation. Initially dedicated to imaging, vision, and graphics, these architectures nowadays serve a wide range of general-purpose applications. The GPU structure, however, does not suit all applications, which can lead to performance shortfalls. The aim of this work is to analyze GPU structures for image analysis applications in multispectral to ultraspectral imaging. The algorithms used for the experiments are multispectral and hyperspectral imaging algorithms dedicated to art authentication. Such algorithms use large amounts of spatial and spectral data, along with both a high number of memory accesses and a need for high storage capacity. Timing performance is compared with a CPU architecture, and a global analysis is made according to the algorithms and the GPU architecture. This paper shows that GPU architectures are suitable for complex image analysis algorithms in multispectral imaging.
Computational scalability of large size image dissemination
Rob Kooper, Peter Bajcsy
We have investigated the computational scalability of the image pyramid building needed for dissemination of very large image data. The sources of large images include high-resolution microscopes and telescopes, remote sensing and airborne imaging, and high-resolution scanners. The term 'large' is understood from a user perspective, meaning either larger than a display size or larger than the memory/disk available to hold the image data. The application drivers for our work are digitization projects such as the Lincoln Papers project (each image scan is about 100-150 MB, or about 5000x8000 pixels, with the total number expected to be around 200,000) and the UIUC library scanning project for historical maps from the 17th and 18th centuries (a smaller number of larger images). The goal of our work is to understand the computational scalability of web-based dissemination using image pyramids for these large image scans, as well as the preservation aspects of the data. We report our computational benchmarks for (a) building image pyramids to be disseminated using the Microsoft Seadragon library, (b) a computation execution approach using hyper-threading to generate image pyramids and utilize the underlying hardware, and (c) an image pyramid preservation approach using various configurations of Redundant Array of Independent Disks (RAID) drives for input/output operations. The benchmarks are obtained with a map (334.61 MB, JPEG format, 17591x15014 pixels). The discussion combines the speed and preservation objectives.
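A minimal sketch of the pyramid-building loop being benchmarked (our illustration; production code tiles each level on disk rather than holding it whole in memory) might be:

```cpp
#include <vector>

// Build a grayscale image pyramid: each level halves the previous one by
// 2x2 box averaging until the image can no longer be halved.
std::vector<std::vector<unsigned char>> buildPyramid(
    std::vector<unsigned char> base, int width, int height)
{
    std::vector<std::vector<unsigned char>> levels;
    levels.push_back(std::move(base));
    while (width > 1 && height > 1) {
        int w2 = width / 2, h2 = height / 2;
        std::vector<unsigned char> next(w2 * h2);
        const std::vector<unsigned char>& prev = levels.back();
        for (int y = 0; y < h2; ++y)
            for (int x = 0; x < w2; ++x) {
                int i = 2 * y * width + 2 * x;  // top-left of the 2x2 block
                next[y * w2 + x] = (prev[i] + prev[i + 1] + prev[i + width]
                                    + prev[i + width + 1] + 2) / 4;  // rounded mean
            }
        levels.push_back(std::move(next));
        width = w2; height = h2;
    }
    return levels;
}
```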
Imaging Applications
Real-time 3D flash ladar imaging through GPU data processing
Chung M. Wong, Christopher Bracikowski, Brian K. Baldauf, et al.
We present real-time 3D image processing of flash ladar data using our recently developed GPU parallel processing kernels. Our laboratory and airborne experience with flash ladar focal planes has shown that, per laser flash, typically only a small fraction of the pixels on the focal plane array actually produce a meaningful range signal. Therefore, to optimize overall data processing speed, the large quantity of uninformative data is filtered out and removed from the data stream prior to the mathematically intensive point cloud transformation processing. This front-end pre-processing, which largely consists of control-flow instructions, is specific to the particular type of flash ladar focal plane array being used and is performed by the computer's CPU. The valid signals, along with their corresponding inertial and navigation metadata, are then transferred to a GPU device to perform range correction, geo-location, and ortho-rectification on each 3D data point, so that data from multiple frames can be properly tiled together either to create a wide-area map or to reconstruct an object from multiple look angles. The GPU parallel processing kernels were developed using OpenCL. Post-processing to perform fine registration between data frames via complex iterative steps also benefits greatly from this type of high-performance computing. The performance improvements obtained using GPU processing to create corrected 3D images and to perform frame-to-frame fine registration are presented.
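The paper's kernels are written in OpenCL; as an illustration only, and in CUDA for consistency with the other sketches in this summary, the per-point geo-location step (a per-frame rotation R and translation t applied to each valid range point) reduces to:

```cpp
// One thread per valid range point: rigid transform into geo-located
// coordinates. R is a 3x3 row-major rotation, t a translation, both
// derived from the frame's inertial/navigation metadata.
__global__ void geolocate(const float3* pts, float3* out, int n,
                          const float* R, float3 t)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 p = pts[i];
    out[i] = make_float3(R[0]*p.x + R[1]*p.y + R[2]*p.z + t.x,
                         R[3]*p.x + R[4]*p.y + R[5]*p.z + t.y,
                         R[6]*p.x + R[7]*p.y + R[8]*p.z + t.z);
}
```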
Advanced MRI reconstruction toolbox with accelerating on GPU
Xiao-Long Wu, Yue Zhuo, Jiading Gai, et al.
In this paper, we present a fast iterative magnetic resonance imaging (MRI) reconstruction algorithm that takes advantage of the prevailing GPGPU programming paradigm. In clinical environments, MRI reconstruction is usually performed via the fast Fourier transform (FFT). However, imaging artifacts (i.e., signal loss) resulting from susceptibility-induced magnetic field inhomogeneities degrade the quality of reconstructed images. These artifacts must be addressed using accurate modeling of the physics of the system coupled with iterative reconstruction. We have developed a reconstruction algorithm with improved image quality at the expense of computation time, and hence an implementation on GPUs achieving significant speedup. In this work, we extend our previous GPU implementation with several new features. First, we enable Sensitivity Encoding for Fast MRI (SENSE) reconstruction (from data acquired using a multi-receiver coil array), which can reduce the acquisition time. In addition, we have implemented a GPU-based total variation regularization in our SENSE reconstruction framework. In this paper, we describe the different optimizations employed at the levels of the algorithm, program code structure, and architecture-specific performance tuning, covering both our MRI reconstruction algorithm and GPU hardware specifics. Results show that the current GPU implementation produces accurate image estimates while significantly accelerating the reconstruction.
Accelerating image recognition on mobile devices using GPGPU
Miguel Bordallo López, Henri Nykänen, Jari Hannuksela, et al.
The future multi-modal user interfaces of battery-powered mobile devices are expected to require computationally costly image analysis techniques. Graphics Processing Units are very well suited to parallel processing, and the addition of programmable stages and high-precision arithmetic provides opportunities to implement energy-efficient complete algorithms. The first mobile graphics accelerators with programmable pipelines are now available, enabling GPGPU implementations of several image processing algorithms. In this context, we consider a face tracking approach that uses efficient gray-scale-invariant texture features and boosting. The solution is based on Local Binary Pattern (LBP) features and makes use of the GPU in the pre-processing and feature extraction phases. We have implemented a series of image processing techniques in the shader language of OpenGL ES 2.0, compiled them for a mobile graphics processing unit, and performed tests on a mobile application processor platform (OMAP3530). In our contribution, we describe the challenges of designing on a mobile platform, present the performance achieved, and provide measurement results for the actual power consumption in comparison to using the CPU (ARM) on the same platform.
Multi-view stereo reconstruction via voxel clustering and optimization of parallel volumetric graph cuts
Yun-Feng Zhu, Yu-Jin Zhang
Traditional multi-view stereo reconstruction via volumetric graph cuts formulates the 3D reconstruction problem as a computationally tractable global optimization using graph cuts. It benefits from a volumetric scene representation, with discrete photo-consistency defined on the edge costs of a weighted graph. Since each discrete voxel is independent, it is natural to parallelize the photo-consistency estimation on multi-core CPUs or a GPU; however, once photo-consistency has been estimated, a parallel optimization method is still needed to obtain the optimized labeling for each voxel. In this paper, we use parallel volumetric graph cuts to solve this problem. Our algorithm has two main steps: a clustering step and a parallel graph cuts optimization step. We also introduce an approach for enhancing the accuracy and speed of existing multi-view 3D reconstruction methods based on volumetric graph cuts. The main idea is to decompose the collected photos into overlapping sets, while the voxels are likewise clustered. Voxel consistency estimation and surface labeling with graph cuts are processed in parallel; since the overlapped voxels may in general receive multiple labels, their labels are constrained to be equal to obtain a unique solution in the parallel graph cuts optimization step.
A GPU accelerated PDF transparency engine
John Recker, I-Jong Lin, Ingeborg Tastl
As commercial printing presses become faster, cheaper, and more efficient, so too must the Raster Image Processors (RIPs) that prepare data for them to print. Digital press RIPs, however, have been challenged, on the one hand, to meet the ever-increasing print performance of the latest digital presses and, on the other, to process increasingly complex documents with transparent layers and embedded ICC profiles. This paper explores the challenges encountered when implementing a GPU-accelerated driver for the open source Ghostscript Adobe PostScript and PDF language interpreter, targeted at accelerating PDF transparency for high-speed commercial presses. It further describes our solution, including an image memory manager for tiling input and output images and documents, a PDF-compatible multiple-image-layer blending engine, and a GPU-accelerated ICC v4 compatible color transformation engine. The result, we believe, is the foundation for a scalable, efficient, distributed RIP system that can meet current and future RIP requirements for a wide range of commercial digital presses.
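At the heart of transparency flattening is repeated compositing of a layer onto a backdrop. A minimal CUDA sketch of premultiplied source-over blending (our illustration of the standard operator, not HP's engine, which also handles PDF's other blend modes and ICC transforms) is:

```cpp
// Premultiplied source-over compositing of one layer onto the backdrop,
// one thread per pixel. 'n' is the pixel count; float4 holds RGBA with
// color channels premultiplied by alpha (stored in w).
__global__ void blendOver(const float4* src, float4* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 s = src[i], d = dst[i];
    float k = 1.0f - s.w;               // fraction of backdrop showing through
    dst[i] = make_float4(s.x + k * d.x, s.y + k * d.y,
                         s.z + k * d.z, s.w + k * d.w);
}
```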
Interactive Paper Session
Infrared small target tracking based on SOPC
Taotao Hu, Xiang Fan, Yu-Jin Zhang, et al.
This paper presents a low-cost FPGA-based solution for a real-time infrared small target tracking system. A specialized architecture is presented, based on a soft RISC processor capable of running a kernel-based mean shift tracking algorithm; the algorithm is realized on a NIOS II soft core using SOPC (System on a Programmable Chip) technology. Though the mean shift algorithm is widely used for target tracking, the original algorithm cannot be applied directly to infrared small target tracking, because an infrared small target carries only intensity information; an improved mean shift algorithm is therefore presented in this paper. How the target is described determines whether it can be tracked by the mean shift algorithm. Because color targets are tracked well by mean shift, we imitate the color image representation and introduce spatial and temporal components to describe the target, forming a pseudo-color image. To improve processing speed, parallel and pipeline techniques are employed: two RAMs store images alternately using ping-pong buffering, and a flash memory stores bulk temporary data. The experimental results show that infrared small targets are tracked stably against complicated backgrounds.
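A minimal C sketch of the ping-pong scheme (hypothetical code, not the authors' SOPC design; acquireFrame and trackTarget stand in for the camera interface and the mean shift step) might be:

```cpp
#include <cstdint>

constexpr int WIDTH = 320, HEIGHT = 256;   // hypothetical frame size

void acquireFrame(uint8_t* buf);           // hypothetical camera-interface hook
void trackTarget(const uint8_t* buf);      // hypothetical mean shift step

static uint8_t bufA[WIDTH * HEIGHT];       // the two "ping-pong" RAMs
static uint8_t bufB[WIDTH * HEIGHT];

void frameLoop()
{
    uint8_t* writeBuf = bufA;  // RAM currently being filled
    uint8_t* readBuf  = bufB;  // RAM currently being processed
    for (;;) {
        // On the SOPC these two steps run concurrently: acquisition
        // hardware fills one RAM while the NIOS II core reads the other.
        acquireFrame(writeBuf);
        trackTarget(readBuf);
        uint8_t* tmp = writeBuf;  // swap roles each frame
        writeBuf = readBuf;
        readBuf  = tmp;
    }
}
```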