Image Convolution Processing: a GPU versus FPGA Comparison
Lucas M. Russo, Emerson C. Pedrino, Edilson Kato
Federal University of Sao Carlos - DC
Rodovia Washington Luís, km 235 - SP-310, 13565-905, São Carlos - São Paulo - Brazil
[email protected]; emerson, [email protected]
Valentin Obac Roda
Federal University of Rio Grande do Norte - DEE
Campus Universitário Lagoa Nova, 59072-970, Natal - Rio Grande do Norte - Brazil
Abstract—Convolution is one of the most important operators used in image processing. With the constant need to increase performance in high-end applications and the rise in popularity of parallel architectures, such as GPUs and the ones implemented in FPGAs, comes the necessity to compare these architectures in order to determine which of them performs better and in which scenarios. In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithms were also implemented in Matlab, using predefined operations, and in C using a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the clock ratio, were taken and are discussed in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200x in comparison to C, 70x in comparison to Matlab and 20x in comparison to the FPGA.
Keywords—Image processing; Convolution; GPU; CUDA; FPGA
I. INTRODUCTION

In 2006 Nvidia Corporation announced a new general purpose parallel computing architecture based on the GPGPU (General-Purpose Computing on Graphics Processing Units) paradigm: CUDA (Compute Unified Device Architecture) [1]. CUDA belongs to the SPMD (single process, multiple data; or single program, multiple data) category of parallel programming: the model is based on the execution of the same program by different processors, supplied with different input data, without the strict coordination requirement among them that the SIMD (single instruction, multiple data) model imposes. Central to the model are the so-called kernels: C-style functions that are executed in parallel by multiple threads and that, when called from the application, dynamically allocate a hierarchical processing structure specified by the user. Interleaved with the execution of the kernels, portions of sequential code are usually inserted in a CUDA program flow. For this reason, it constitutes a heterogeneous programming model.
The CUDA model was conceived to implement the so-called transparent scalability effectively, i.e., the ability of the programming model to adapt itself to the available hardware in such a way that more processors can be used without altering the algorithm and, at the same time, to reduce the development time of parallel or heterogeneous solutions. All the aforementioned model abstractions are particularly suitable and easily adapted to the field of digital image processing, given that many applications in this area operate independently on a pixel-by-pixel or pixel-window basis.
Many years before the advent of the CUDA architecture, Xilinx made available to the market, in 1985, the first FPGA chip [2]. The FPGA is, basically, a highly customizable integrated circuit that has been used in a variety of science fields, such as digital signal processing, voice recognition, bioinformatics, computer vision and digital image processing, and in other applications that require high performance: real-time systems and high performance computing.
The comparison between CUDA and FPGA has been documented in various works in different application domains. Asano et al. [3] compared the use of CUDA and FPGAs in image processing applications, namely two-dimensional filters, stereo vision and k-means clustering; Che et al. [4] compared their use in three application algorithms: Gaussian Elimination, Data Encryption Standard (DES) and Needleman-Wunsch; Kestur et al. [5] developed a comparison for BLAS (Basic Linear Algebra Subroutines); Park et al. [6] analyzed the performance of integer and floating-point algorithms; and Weber et al. [7] compared the architectures using a Quantum Monte Carlo application.
In this work, CUDA and a dedicated FPGA architecture are used and compared on the implementation of convolution, an operation often used in image processing.
II. METHODOLOGY

All CPU (i.e., Matlab and C) and GPU (i.e., CUDA) execution times were obtained from the following configuration:

Hardware: Processor: Intel Core i5 750 (8 MB L2 cache); Motherboard: ASUS P7P55DE-PRO; RAM: 2 x 2 GB Corsair (DDR2-800); Graphics board: XFX Nvidia GTX 295, 896 MB
Software: Windows 7 Professional 64-bit; Visual Studio 2008 SP1
Drivers: Nvidia video driver version 190.38; Nvidia CUDA toolkit version 2.3
FPGA: Cyclone II EP2C35F672 on a Terasic DE2 board; Quartus II 10.1 with SOPC Builder, NIOS II EDS 10.1 and the ModelSim 6.6d simulation tool, for the implementation of the algorithms
Sponsors: FAPESP grants number 2010/04675-4 and 2009/17736-4; DC/UFSCAR; DEE UFRN
978-1-4673-0186-2/12/$31.00 2012 IEEE
The main comparison parameters presented in this article are the execution time and the number of clock cycles of the implemented algorithms. In order to obtain that, different approaches were used according to the architecture profiled.
On C, the Performance Counters were used through the functions QueryPerformanceCounter() and QueryPerformanceFrequency(). The former extracts the current value of the high-resolution counter, and the latter its frequency, so the elapsed time is obtained from the difference between two counter readings divided by the frequency.
On CUDA, the Event Management provides functionality to create, destroy and record an event. Hence, it is possible to measure the amount of time it took to execute a specific part of code, such as a kernel call, in the manner described in [1]. Concerning the clock cycles, the clock() function was used within the kernel to obtain the measure.
On Matlab, a simple approach is provided through the usage of a built-in stopwatch, controlled with the tic and toc syntax. The first starts the timer and the second stops it, displaying the time, in seconds, taken to execute the statements between tic and toc. The Matlab number of clock cycles was not measured, since no simple way to do so was found.
At last, on the FPGA, it is possible to infer the execution time directly from the architecture implemented on it. With the knowledge of the clock rate, explicitly defined by the designer, and the number of clock cycles taken to process the input data, extracted from the waveforms or from the architecture itself, the following expression can be used:

execution time = number of clock cycles / clock frequency (1)
III. CONVOLUTION

Mathematically, convolution can be expressed as a linear combination, or sum of products, of the mask coefficients with the input function:

g(x) = Σ_{s=-a}^{a} w(s) f(x - s) (2)

where f denotes the input function and w the mask, of size 2a + 1. It is implicit that equation (2) is applied for every point in the input function.
It is possible to extend the convolution operation to two dimensions as follows:

g(x, y) = Σ_{s=-a}^{a} Σ_{t=-b}^{b} w(s, t) f(x - s, y - t) (3)
Convolution has a limitation at the boundaries of an input image, since the mask may be positioned in such a way that some mask values do not overlap the input image. Two approaches are commonly used in the context of image processing: padding the edges of the input image with zeros or clamping the edges of the input image with the closest border pixel. In this work the first approach is used, as in Gonzales [8].
Considering an image of size MxN pixels and a mask of size SxT, the multiplication is the most costly operation. (MN)(ST) multiplications are performed and, consequently, the algorithm belongs to O(MNST).
If a mask w(x,y) can be decomposed into w1(x) and w2(y) in such a way that w(x,y) = w1(x) w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2-D convolution can be performed as two 1-D convolutions. In this case the convolution is said to be separable and the algorithmic complexity decays to O(MN(S+T)), allowing for a more flexible implementation. Hence, the separable convolution can be expressed as in equation (4):

g(x, y) = Σ_{s=-a}^{a} w1(s) Σ_{t=-b}^{b} w2(t) f(x - s, y - t) (4)
IV. IMPLEMENTATION

The separable convolution was implemented in C, CUDA and Matlab (built-in function), and the regular convolution [Eq. 3] was implemented in the FPGA. The reason to implement the regular convolution in the FPGA was performance limitations. The separable algorithm, although reducing the total number of operations performed [Eq. 4], requires the image data stream to be processed twice, once for lines and once for columns. Consequently, the column filter alone would take as much time as the regular convolution to process the entire image, because of the time required to fill the shift register and because the streaming interface can transmit only one pixel per clock cycle.
A. C Implementation

The C implementation of the convolution was based on [Eq. 4] and is fairly straightforward. The image was first loaded into memory with the OpenCV C library. Then, for each input pixel, the line convolution (with mask w2, of size 2*b+1) and, afterwards, the column convolution (with mask w1, of size 2*a+1) were applied. The sequential separable algorithm implemented follows:

/* Line Convolution */
for i ← 0 to number_of_lines - 1
    for j ← 0 to number_of_columns - 1
        g(i, j) ← 0
        for l ← -b to b
            if j - l >= 0 and j - l < number_of_columns
                g(i, j) ← g(i, j) + f(i, j - l) * w2(b + l)
            end-if
        end-for
    end-for
end-for

/* Column Convolution */
for i ← 0 to number_of_lines - 1
    for j ← 0 to number_of_columns - 1
        o(i, j) ← 0
        for k ← -a to a
            if i - k >= 0 and i - k < number_of_lines
                o(i, j) ← o(i, j) + g(i - k, j) * w1(a + k)
            end-if
        end-for
    end-for
end-for

B. Matlab Implementation

For Matlab, the conv2() built-in function was used to perform the convolution.
C. CUDA Implementation

On CUDA, the algorithm is implemented through two different kernels: the first part is implemented through the line kernel and the second one through the column kernel. The development of the algorithm was based on the convolution of Podlozhnyuk [9], extended to support any image as input.

Line Convolution Kernel

The threads were grouped in 2-D blocks with size (4x16), 4 lines by 16 columns, and, in turn, the blocks were grouped in a grid with size depending on the dimensions of the input image. Each thread in the block is responsible for fetching six pixels from the input image to per-block shared memory. By doing this, the access to the Global Device Memory is reduced, as long as the memory coalescing restrictions are guaranteed.

Figs. 1 and 2 illustrate the general idea with a block size of (4x4) instead of its real dimensions (4x16) for displaying purposes. The first line of the 2-D block (Fig. 1) is mapped to the first line of the input image. In this way, thread0,0 is mapped to the pixels S0,0 (left apron region), S0,0+block_size, S0,0+2*block_size, S0,0+3*block_size, S0,0+4*block_size, S0,0+5*block_size (right apron region), and so on. Moreover, concerning the actual block size, 384 pixels, 16x6 (number of pixels loaded for each line) x 4 (number of block lines), are loaded to shared memory.

Figure 1. Example of a 2-D block of column size 16 (i.e., 16 threads) for the line and column kernels.

After the loading stage, all threads within a block must synchronize their execution, since, in the next stage, threads are going to access elements that were loaded by other ones. In order to do that, a call to the CUDA API function __syncthreads() is issued and the program can proceed correctly.

In the final stage, each thread is assigned the task of calculating four output pixels, which are in the same positions as the ones the thread was mapped to in the main region (Fig. 2).

Concerning the flexibility of the convolution filter, some images are not multiples of (BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_ROW_X is the number of columns per block and NUM_LOADS_PER_THREAD is the number of pixels fetched from the main region. In order to solve this, the kernel is launched again with the following offset from the beginning of the image:

rowOffset = width - BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD (5)

By doing this, every column between rowOffset and the last column will be calculated, even if some of them were previously calculated. Hence, it is not necessary to determine exactly which was the last calculated column in order to construct the rowOffset, which would increase the verification overhead. Additionally, the memory coalescing requirements are automatically satisfied for NUM_LOADS_PER_THREAD equal to four and BLOCK_SIZE_ROW_X equal to 16.

Column Convolution Kernel

The threads, in the column filter, are divided in 2-D blocks with size 8x16, or 8 lines by 16 columns. As in the line kernel, the grid size depends on the input image and each thread is responsible for fetching six pixels from the input image to shared memory.

The decision to use 16 columns in this block size was made considering the memory coalescing requirements (i.e., half-warp access to contiguous memory positions). Conversely, the 8 lines in the block size tend to reduce the number of apron pixels loaded to shared memory, reducing the ratio (number of apron pixels/number of output pixels) and increasing memory reuse, since those pixels would otherwise be loaded again into other shared memories.

In the same way as the line kernel, the column kernel is launched again for images not multiples of (BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_COLUMN_Y is the number of lines per block and NUM_LOADS_PER_THREAD is the number of fetches per thread from the image main region. Thus, the following offset was used:

columnOffset = height - BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD (6)

D. FPGA Implementation

For the FPGA implementation, the architecture depicted in Fig. 3 was developed with the assistance of SOPC Builder and Verilog HDL coding.

The architecture is responsible for the following functions. A grayscale or binary image in JPEG format, stored on Flash Memory, is converted to RAW format, that is, to a matrix of integer values between 0 (i.e., black) and 255 (i.e., white).

Such conversion was performed by the NIOS II Fast Core [10] softcore processor, the C library libjpeg [11] and the wrapper function for decompressing JPEG images available in [12]. Therefore, this constitutes the application software layer and is controlled by the NIOS II EDS.
Figure 2. Example of an image region with 96 pixels mapped to 16 threads of the line kernel. Pixels with the same color are mapped to the same thread (see Fig. 1).

Figure 3. Pipeline Architecture for Image Processing.

After that, the decompressed image is written to the pixel buffer (SDRAM chip lower addresses) and, in this way, the DMA (Pixel Buffer DMA Controller) is able to access it without interrupting the processor and to transmit the pixels to the remaining of the pipeline (Image Processing Pipeline). Then, the image is processed through various streaming components, with their interfaces called Avalon Streaming Interface [13], constituting the pipeline (Image Processing Pipeline, Fig. 3). Firstly, the image is processed by the User Streaming Component. Following, each pixel (8-bit grayscale) is converted to 30-bit RGB (RGB Resampler). Then, there is a dual-clock queue (Clock Crossing Bridge) acting as a bridge between two clock domains (100 MHz, the general design clock, and 25 MHz, the VGA clock). And, lastly, a VGA controller is used to display the processed image.

The implemented convolution module is interfaced to the rest of the architecture by means of the Avalon Streaming Interface.

This module is based on a finite state machine with four states: DATA_FILL_BUFFER_STATE, DATA_PROCESSING_STATE_1, DATA_PROCESSING_STATE_2 and DATA_END_PROCESSING_STATE.

Firstly, upon the reset signal, every position of the shift register is initialized with the value 0. The first state (DATA_FILL_BUFFER_STATE) consists of reading the input interface at each clock rise and storing the value read in the shift register (Fig. 4). This state lasts until floor(KS/2)*(IW) pixels have been read, or, in other words, until the first valid pixel is positioned in the center gray region coordinate of Fig. 4.

Figure 4. Layout of the convolution module shift register. KS (Kernel Size) denotes the size, in one dimension, of the used kernel; that means, for a 3x3 kernel, KS = 3. IW (Image Width) denotes the width of the input image; that means, for a 640x480 image, IW = 640. The grey area indicates the pixels used in the convolution calculation.

Next, the state is modified to DATA_PROCESSING_STATE_1. The first moment of this state is depicted in Fig. 5.

Figure 5. Example of an image with pixel values ranging from 0 to 255 (left) and the values associated with the shift register with KS = 3 in the first moment of the DATA_PROCESSING_STATE_1 state. The x symbol indicates a value not considered.

From this state until the end of the state machine (i.e., DATA_END_PROCESSING_STATE), the convolution sum (Eq. 3) will be applied to the gray area and, at every clock cycle, an output pixel will be available at the output interface (i.e., Avalon Streaming Source Interface). It is important to mention that this calculation is performed by a parallel combinational circuit sub-module (Convolution Operation, Fig. 6). The state will change to DATA_PROCESSING_STATE_2 when all input pixels are read, or, more specifically, when the number of pixels read equals the number of pixels of the input image.

This state is similar to DATA_PROCESSING_STATE_1 with the exception that, as there are no pixels to be read, the value zero will be forced into the shift register. By doing this, similarly to the first input pixels of the DATA_PROCESSING_STATE_1 state, a border treatment is performed, since the border pixels (i.e., the ones without a complete neighborhood) will have their values convoluted with neighbor pixels or with the value 0. The last moment of this state is similar to the one depicted in Fig. 5, with the exception that the values considered are the ones located near the end of the image (i.e., the bottom-right side).

Lastly, after the calculation of the last pixel, the present state is modified to DATA_END_PROCESSING_STATE. This last state is responsible for resetting the shift register and the counters to their default values (i.e., zero in this case). Next, the state machine returns to its initial state, DATA_FILL_BUFFER_STATE.

It is important to highlight that only kernels up to 5x5 and images up to 640x480 are supported in the FPGA convolution module. This is due to the DE2 board memory and logic elements limitation. Therefore, results involving different kernel and image sizes were estimated considering the module architecture itself.

Figure 6. Convolution module block diagram.

The number of clock cycles was obtained disregarding the loading period (i.e., state DATA_FILL_BUFFER_STATE). After this state, until the end of the state machine, at every clock cycle, one output pixel is available at the output module interface.
V. RESULTS AND COMPARISON

The graphs for the convolution operation comparing the execution time for various gray scale image resolutions for a mask size of 15, as well as the number of clock cycles and the CUDA speedup for Matlab, C and the FPGA architecture, are presented in figures 7, 8 and 9.

For the convolution application, it is possible to observe that the execution time graph (Fig. 7) for all three architectures behaves as an exponential (note the log scale on the y-axis) and that the implementations tend to maintain the same growth rate as the image resolution increases. The exception to this is the Matlab implementation for image resolutions 3300x2400 and 4096x4096, possibly related to cache performance.

Figure 7. Average execution times of the convolution with mask size of 15 and various image resolutions.

The speedup graph (Fig. 9) shows an approximately steady speedup considering CUDA in regard to the other implementations and various image resolutions. Again, an exception to this is the Matlab implementation for the two last image samples. The reason for this good CUDA speedup is the large amount of arithmetic operations, the high granularity and the high resource utilization of the GPU in comparison to C, Matlab and FPGA.

The C implementation did not exploit parallelism and served as a control implementation for comparison to the other, parallel algorithms. Because of that, its execution time and number of clock cycles were worse than those of the other implementations.

The FPGA implementation explored some parallelism with the parallel convolution calculation, but lacked a true parallelism approach because of the FPGA and development board (DE2) resource limitations. Therefore, its execution times were only worse than those of a true parallel implementation (i.e., CUDA). It is interesting to notice that the FPGA number of clock cycles (Fig. 8) was small, even less than CUDA for an image size of 512x512. Besides, the clock rate used in the FPGA was relatively low (i.e., 100 MHz in this design) compared to CUDA (i.e., 1242 MHz for the processor clock). Another limitation of the used FPGA was the absence of multiplier blocks, which would improve significantly the performance for the convolution operation. Hence, using a larger FPGA with more resources and a faster clock should increase the performance of the FPGA and possibly overcome the CUDA implementation.

Figure 8. Number of clock cycles of the convolution with mask size of 15 and various image resolutions.

Figure 9. Speedup of the convolution with mask size of 15 and various image resolutions.

A positive point in favor of the FPGA is that it can operate in a small board, with the peripheral interfaces integrated, while the GPU board needs a PC to be connected to, which is certainly much larger and more power consuming.

VI. CONCLUSIONS

In this paper, we presented a comparison between CUDA, C, Matlab and FPGA for the convolution of grayscale images. Based on the results presented, it is inferable that CUDA presents the best performance in execution time, number of clock cycles and speedup in comparison to C, Matlab and the implemented FPGA architecture, and that this advantage increases with the growth in image resolution. That is due to the fact that CUDA tends to explore better massive amounts of data, such as high resolution images, based on its inherent features such as multiple pipelines, high theoretical peak of GFLOPS and high bandwidth [14].

Regarding the FPGA architecture, it can be seen from the graphs that it performed well, although worse than CUDA, and kept a steady growth in execution time and number of clock cycles. It must be noticed that there are denser FPGAs available that can operate at higher clock rates, which would certainly increase the performance and could even surpass the GPU.

Finally, it is possible to improve the performance of the FPGA algorithms even more. Dividing the input image in various squares and providing that each region is transmitted to parallel convolution modules through different data streams can improve the performance roughly by the number of convolution modules. However, by doing this, more FPGA die area is consumed, which could possibly make it impractical.

ACKNOWLEDGMENTS

The authors are grateful to FAPESP, grants number 2010/04675-4 and 2009/17736-4, to the Department of Computer Science, Federal University of Sao Carlos, and to the Department of Electrical Engineering, Federal University of Rio Grande do Norte, for the support throughout this work.

REFERENCES
[1] Nvidia Corporation. (2009). Nvidia CUDA Programming Guide. [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
[2] Xilinx Co. (2010). Our History. [Online]. Available: www.xilinx.com/company/history
[3] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance Comparison of FPGA, GPU and CPU in Image Processing," in International Conference on Field Programmable Logic and Applications - FPL 2009, Prague, 2009, pp. 126-131.
[4] S. Che, J. Li, J.W. Sheaffer, K. Skadron and J. Lach, "Accelerating Compute-Intensive Applications with GPUs and FPGAs," in Symposium on Application Specific Processors - SASP 2008, Anaheim, 2008, pp. 101-107.
[5] S. Kestur, J.D. Davis and O. Williams, "BLAS Comparison on FPGA, CPU and GPU," in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, 2010, pp. 288-293.
[6] S. J. Park, D.R. Shires and B.J. Henz, "Coprocessor Computing with FPGA and GPU," in DoD HPCMP Users Group Conference, Seattle, 2008, pp. 366-370.
[7] R. Weber, A. Gothandaraman, R.J. Hinde and G.D. Peterson, "Comparing Hardware Accelerators in Scientific Applications: A Case Study," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 58-68, 2011.
[8] R. C. Gonzales and R. E. Woods, "Image Enhancement in the Spatial Domain," in Digital Image Processing, 3rd ed. Prentice Hall, 2008.
[9] V. Podlozhnyuk. (2007, Jun.). Image Convolution with CUDA. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_64_website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
[10] Altera Co. NIOS II Processor. [Online]. Available: www.altera.com/devices/processor/nios2/ni2-index.html
[11] Independent JPEG Group. libjpeg. [Online]. Available: www.ijg.org
[12] Altera Co. Nios II System Architect Design. [Online]. Available: www.altera.com/support/examples/nios2/exm-system-architect.html
[13] Altera Co. (2011). Avalon Streaming Interface, Chap. 5. [Online]. Available: www.altera.com/literature/manual/mnl_avalon_spec.pdf
[14] D. B. Kirk and W. W. Hwu, "Introduction," in Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.