High Level Programming for FPGA Based Image and Video Processing

using Hardware Skeletons

K Benkrid†, D Crookes†, J Smith‡ and A Benkrid†

†School of Computer Science, The Queen’s University of Belfast, Belfast BT7 1NN, UK

‡VisiCom Laboratories, 10052 Mesa Ridge Court, San Diego, CA 92121, USA

 


Abstract

 

In this paper, we present a new approach to developing a general framework for efficient FPGA based Image Processing algorithms. This approach is based on the new concept of Hardware Skeletons. A hardware skeleton is a parameterised description of a task-specific architecture, to which the user can supply parameters such as values, functions or even other skeletons. A skeleton contains built-in rules that will apply optimisations specific to the target hardware at the implementation phase. The framework contains a library of reusable skeletons for a range of common Image Processing operations. The library also contains high level skeletons for common combinations of basic image operations. It normally contains a range of alternative skeletons for the same task, perhaps tailored for different data representations. Given a complete algorithm description in terms of skeletons, an efficient hardware configuration is generated automatically. We have developed a library of hardware skeletons for common image processing tasks, with optimised implementations specifically for Xilinx XC4000 FPGAs.  This paper presents and illustrates our hardware skeleton approach in the context of some common image processing tasks.  It demonstrates our approach to the broader problem of achieving optimised hardware configurations while retaining the convenience and rapid development cycle of an application-oriented, high level programming model.

          The effectiveness of our approach has been successfully demonstrated on VisiCom’s VigraVision™ FPGA based video board.

 

1. Introduction

 

Many modern Image Processing (IP) applications (such as processing video and very large images) are so computationally demanding that special purpose hardware solutions need to be considered.  Reconfigurable hardware in the form of FPGAs can offer the performance advantages of a custom hardware solution, while their inherent reprogrammability feature makes them multi-purpose and reusable.  However, a big disadvantage is the low level, hardware-oriented programming model needed to fully exploit the FPGA’s potential performance.

Despite the great amount of research done on FPGAs, many FPGA-based applications have been algorithm specific. An environment for developing applications needs more than just a library of static FPGA configurations, perhaps parameterisable (e.g. in terms of input data wordlength), since it should allow the user to experiment with alternative algorithms and develop his/her own algorithms. There is a need for bridging the gap between high level application-oriented software and low level FPGA hardware. Many behavioural synthesis tools have been developed to satisfy this requirement [1][2][3]. These tools allow the user to program FPGAs at a high level (e.g. in a C-like syntax) without having to deal with low level hardware details (e.g. scheduling, allocation, pipelining etc.). However, although behavioural synthesis tools have developed enormously, structural design techniques often still result in circuits that are substantially smaller and faster than those developed using only behavioural synthesis tools [4][5].

This paper presents a framework for developing efficient hardware solutions specifically for image processing applications. This framework gives the benefits of an application-oriented, high level programming model, but does not sacrifice significantly the performance of the solution.  Our approach to this is to use a concept which has proved relatively successful in developing software for parallel machines, namely skeletons [6][7][8].  Skeletons are reusable, parameterised fragments or frameworks to which the user can supply components (e.g. functions).  It is common for skeletons to include functions as parameters which are applied by the skeleton to a data set.  The implementation of a skeleton is normally optimised for a specific target machine. 

In this paper we introduce the concept of hardware skeletons. A hardware skeleton is a parameterised description of a task-specific architecture, to which the user can supply parameters such as values, functions (parameterised functional blocks) or even other skeletons. Certain combinations of basic skeletons can form the basis of additional, higher level skeletons. To present the concept, the paper first identifies a useful high level model for describing image processing operations. The common basic tasks, which we identify, will form the basis of a Hardware Skeleton Library. Next, we outline the strategy which the system employs to generate efficient FPGA configurations from a given operation description. A layered implementation of the hardware skeleton library is then presented. Finally, the implementation of our system on a commercial FPGA video board is presented.

 

2. A high level description model for IP operations

 

Many image processing algorithms can be described in terms of a Directed Acyclic Graph (DAG), where vertices represent IP tasks, and the directed edges represent the data flow.

Figure 1.  A hypothetical image processing algorithm modelled as a DAG

 

Nodes are typically simple tasks such as adding two input images, or an image convolution. Common IP tasks can be classified in terms of the locality of their data access requirements into three categories:

·         Point operations: The same operation is applied to each individual pixel of one or many source images to produce a corresponding result pixel in the new image. These include: relational operations (e.g. ‘≥’, ‘≤’, ‘=’), arithmetic operations (e.g. ‘+’, ‘-’, ‘*’, ‘÷’), logical operations (e.g. ‘AND’, ‘OR’) and Look-Up tables.  The operation could either be between two images or between an image and a scalar value.

·         Neighbourhood operations: In neighbourhood operations, a new pixel value is calculated using only the pixel values in the neighbourhood of the original pixel and the weights in a window (e.g. convolution). This is done for all image pixels, and results in a new image. Neighbourhood operations are completely defined by a local operation between corresponding pixels and window values (e.g. multiplication), a global operation (e.g. accumulation) which reduces the window of intermediate results to a single result pixel, and a window (with given shape and coefficients) [9].

·         Global operations: These operations operate globally on the whole image. We can distinguish two common types of simple global operations:

-          Reduction to Scalar (RS): These operate on the whole image to produce a scalar as a result. Examples include count, global maximum, global minimum and global accumulation (Σ).

-          Reduction to Vector (RV): These operate on the whole image to produce a vector as a result. Examples include histogramming and cumulative histogramming.
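To make this classification concrete, the three categories can be sketched in software. The function names below (`point_op`, `neighbourhood_op`, `reduce_to_scalar`) are illustrative only, and are not part of the framework described in this paper:

```python
def point_op(image, f):
    """Point operation: apply f independently to every pixel."""
    return [[f(p) for p in row] for row in image]

def neighbourhood_op(image, window, local_op, global_op):
    """3x3 neighbourhood operation: combine each neighbourhood pixel
    with the corresponding window coefficient via local_op, then
    reduce the nine intermediate results with global_op."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):          # borders left unprocessed
        for j in range(1, w - 1):
            vals = [local_op(image[i + di][j + dj], window[di + 1][dj + 1])
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            acc = vals[0]
            for v in vals[1:]:
                acc = global_op(acc, v)
            out[i][j] = acc
    return out

def reduce_to_scalar(image, op, init):
    """Global reduction-to-scalar (e.g. sum, max) over the whole image."""
    acc = init
    for row in image:
        for p in row:
            acc = op(acc, p)
    return acc
```

For instance, `neighbourhood_op` with multiplication as the local operation and addition as the global operation is a convolution; `reduce_to_scalar` with `max` is the global maximum.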

 

The properties of an item of data (represented by an edge in the DAG) are of two kinds:

·         Data type

This is defined by two properties:

-          Structure: could be an image, a vector or a scalar.

-          Pixel type: which, for the purpose of this work, could be either an integer or a boolean.

·         Data representation

A particular data representation is defined by three properties:

-          The data could be bit serial; bit parallel, with an associated word size; or digit serial, with particular digit and word sizes.

-          If data is in bit serial (or digit serial), it can then be processed either MSB (or MSD) First or LSB (or LSD) First.

-          Number System which, for the purpose of this work, could be one of unsigned integer, 2’s complement, or Signed Digit (SD) number representation [10].

Note that binary representation corresponds to bit parallel with a word size of one (denoted as parallel(1)).

 

A node with a particular set of logical Inputs/Outputs could be implemented by a range of different possible implementations as illustrated, for example, for the ‘Absolute value’ operation in Figure 2.  It is normal (but not compulsory) for the input and output representations to be the same.

 

Figure 2. A DAG node (a) with several possible implementations (b), (c) and (d)

 

The Hardware Skeleton Library will contain parameterised descriptions of architectures not only for the full range of basic operations (nodes), but possibly with different versions for different data representation combinations.

 

3. Implementation strategy

 

The user’s first task will be to represent the algorithm in terms of a DAG, without initially being concerned with data type or data representation considerations. Once this is done, an analysis of the properties of the input and output data formats of the nodes will identify a range of possible implementations of each node.  For instance, the result of an N-bit integer image comparison operation could be either an N-bit integer image or a (1-bit) binary image.  The choice will depend on subsequent processing of the result image, and on what skeletons are available.  As a first step, the user should consider the set of all possible implementations. The library of Hardware Skeletons (e.g. neighbourhood operations, point operations, etc.), in which each component has a set of different implementations (e.g. bit serial, bit parallel), is the basis of this phase. The implementations of the library components are optimised for specific target architectures (e.g. bit parallel adder units based on dedicated fast carry logic on Xilinx 4000). The range of possible implementations generated for a particular IP algorithm depends on the extent of this library.

                To select the optimum skeleton from the set of possible choices, the cost of each candidate skeleton needs to be found, in terms of speed and area. This involves estimating the expected performance, or effectively generating the FPGA configuration for each option, including the application of the optimisations associated with each skeleton. This cost based analysis enables the user to settle on a final DAG with all attributes (data type and representation) defined.  The corresponding FPGA implementation is finally generated, in the form of an EDIF netlist, for the chosen solution. This is performed by a Prolog-based [11] Hardware Description Environment, called HIDE4k, developed at Queen’s University Belfast [5][12][13]. The latter enables highly scaleable and parameterised component descriptions to be written, and generates pre-placed configurations in EDIF format for the Xilinx XC4000 series [14]. The resulting EDIF file is finally fed to the Xilinx Placement And Routing (PAR) tools to generate the FPGA configuration bitstream (see Figure 3).

                Note that during the process of implementing a DAG, the following issues arise:

·         Data representation conversion

Since many data representations might be used within the DAG, data representation converters may be needed to convert between different representations (e.g. from bit serial to bit parallel, or from Signed Digit to two’s complement, etc.).

·         Data synchronisation

When there are two or more inputs to a DAG node (vertex), any branch that arrives earlier than the others should be forced to wait for the slowest branches by adding appropriate delays to the fastest branches. This is performed automatically by our system so that the user does not have to deal with low level data synchronisation issues.
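The delay-balancing rule can be sketched as follows (an illustrative model, not the system’s actual implementation): each input branch of a node is padded with enough delay registers to match the slowest branch.

```python
def balance_delays(branch_latencies):
    """Given the latency (in clock cycles) of each input branch to a
    DAG node, return the number of delay registers to insert on each
    branch so that all inputs arrive at the node together."""
    slowest = max(branch_latencies)
    return [slowest - lat for lat in branch_latencies]
```

For example, three branches with latencies of 3, 7 and 5 cycles need 4, 0 and 2 extra delay stages respectively.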

 

As a result, the user’s programming model is essentially the set of hardware skeletons provided by the Hardware Skeleton Library. These skeletons can be accessed either textually or, even more conveniently, by interacting with a GUI.

 

4. Implementing the Hardware Skeleton Library

 

We have implemented our Hardware Skeleton Library as a hierarchy of three levels of hardware block descriptions. At the bottom level lies the arithmetic cores library (see Figure 4). This provides basic arithmetic units (e.g. adders, multipliers) parameterised for different number representations (e.g. bit serial, bit parallel, 2’s complement, unsigned etc.). Immediately on top of this level, we find the basic image operations library. The latter provides implementations for the basic image operations presented in section 2 above (e.g. basic neighbourhood operations). Finally, the top level provides implementations for high level (compound) skeletons.


 Figure 3. Implementation strategy

 


Figure 4. Hierarchical implementation of the Hardware Skeleton Library

 

The following section considers each of these three levels in more detail.

 

4.1 Arithmetic cores library

This library provides the basic building blocks required for image processing operations (and signal processing in general). It includes adders, multipliers, dividers, shifts and delays. Note that the basic functions required for nearly any signal processing operation include addition/subtraction, shifts and delays. These blocks can then be used to construct the more complicated structures such as multipliers, dividers and maximum/minimum selectors.

                Versions of these cores are provided for different number representations. At the time of writing, the following number representations are supported:

 

 

Representation              Number system       Digit order
· Bit parallel (N bits)     2’s complement      –
· Bit serial                2’s complement      MSB First (MSBF)
· Bit serial                2’s complement      LSB First (LSBF)
· Bit serial                Signed Digit        MSD First (MSDF)

 

 

The implementation of these cores is optimised for a specific target architecture (Xilinx XC4000 FPGAs for our particular case study). This should take advantage of the particular features of the target architecture (e.g. 4-input LUTs, synchronous RAMs, dedicated fast carry logic for XC4000). The core descriptions are held in HIDE4k with rules for core-specific optimisations as part of the core. For instance, a constant coefficient multiplication will apply Canonic Signed Digit (CSD) coding of the multiplier coefficient to reduce the consumed hardware [15][16]. Such optimisations are often not performed by behavioural synthesis tools.
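As an illustration of this optimisation, a minimal CSD recoder can be sketched in software (the function name `csd` is ours, not part of HIDE4k). CSD recoding guarantees no two adjacent non-zero digits, minimising the number of adder/subtractor stages in a constant-coefficient multiplier:

```python
def csd(k):
    """Canonic Signed Digit recoding of a non-negative integer.
    Returns digits in {-1, 0, 1}, least significant first, with no
    two adjacent non-zero digits; each non-zero digit costs one
    adder or subtractor stage in a constant multiplier."""
    digits = []
    while k:
        if k & 1:
            d = 2 - (k & 3)      # 1 if k mod 4 == 1, else -1
            k -= d               # leaves the next bit clear
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

For example, multiplying by 7 (binary 111, three adders) recodes to 8 - 1 (one subtractor): `csd(7)` gives `[-1, 0, 0, 1]`.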

 

4.2 Basic image operations library

This library provides implementations of the basic image operations presented in section 2.

                Consider the case of basic neighbourhood operations. As mentioned in section 2, a neighbourhood operation is defined in terms of a local and global operation. Local operations include multiplication and addition. Global operations include accumulation, maximum and minimum. These form the five basic Image Algebra neighbourhood operations as shown in Table 1 [9].

 

Neighbourhood Operation      Local Op.   Global Op.
Convolution                  *           Σ
Multiplicative maximum       *           Max
Multiplicative minimum       *           Min
Additive maximum             +           Max
Additive minimum             +           Min

Table 1.  Image Algebra core operation set

The architecture of a generic PxQ neighbourhood operation (with a local operation L and a global operation G) requires (Q-1) line buffers, PxQ replicated local operation blocks, and a single PxQ-input global operation block, implemented as a tree of two-input reduction blocks when bit parallel arithmetic is used, as shown in Figure 5.

Figure 5. A general 2D PxQ neighbourhood operation

 

This architecture is parameterisable or scaleable in terms of [17]:

-          The window size (PxQ)

-          The window coefficients

-          The image size (line buffer size dLB)

-          The pixel wordlength

-          The local and global operations (L and G)

-          The number representation (arithmetic type)

A generic description of a neighbourhood operation would then be given by:

 

neighbourhood_op(Arithmetic_type, Local_op, Global_op, Window, Pixel_wordlength, Image_Size)

 

Our HIDE4k system is capable of generating pre-placed FPGA architectures in EDIF format from such a generic description. A ~30K-line EDIF description is generated in 1-2 seconds. The resulting architectures are tailored to the particular neighbourhood operation in hand (e.g. specific window coefficients). Their performance (speed and area) rivals that obtained with a careful hand design [5].
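The dataflow of this architecture can be modelled in software. The sketch below is a hypothetical model (not the HIDE4k generator): it folds the line buffers and window registers into a single delay line, and it does not mask the invalid windows produced at row boundaries, which the real hardware would discard:

```python
from collections import deque

def stream_neighbourhood(pixels, width, window, local_op, global_op):
    """Model of a PxQ neighbourhood pipeline over a raster-ordered
    pixel stream: the (Q-1) line buffers plus window registers of a
    streaming architecture, folded into one delay line of length
    (Q-1)*width + P.  taps[k] holds the pixel seen k cycles ago."""
    Q, P = len(window), len(window[0])
    depth = (Q - 1) * width + P
    taps = deque([0] * depth, maxlen=depth)
    out = []
    for n, p in enumerate(pixels):
        taps.appendleft(p)
        if n >= depth - 1:                  # delay line fully primed
            # the pixel r rows above and c columns left of the newest
            # pixel sits at delay r*width + c; window[0][0] is the
            # top-left (oldest) coefficient
            vals = [local_op(taps[r * width + c],
                             window[Q - 1 - r][P - 1 - c])
                    for r in range(Q) for c in range(P)]
            acc = vals[0]
            for v in vals[1:]:
                acc = global_op(acc, v)
            out.append(acc)
    return out
```

With multiplication and accumulation this is a streaming convolution; substituting `max` or `min` for the global operation gives the other Image Algebra operations of Table 1.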

A common skeleton used at this level is the reduce skeleton, which reduces a set of N inputs into one result using a tree of 2-operand operations (Op), as shown in Figure 6.

Figure 6. Reduce skeleton

 

At the time of writing, the supported 2-input reduction operations (Op) are addition (+), subtraction (-), maximum (max) and minimum (min).
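In software terms, the reduce skeleton is simply a balanced reduction tree; the sketch below is illustrative only:

```python
def reduce_tree(inputs, op):
    """Reduce N inputs to one result with a balanced tree of
    2-input 'op' blocks, as in the reduce skeleton of Figure 6.
    A tree of depth ceil(log2(N)) rather than a linear chain."""
    level = list(inputs)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # odd element passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

In hardware, the tree form matters because it gives a logarithmic rather than linear critical path (or pipeline depth).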

 

4.3 High level (compound) skeletons library

This library contains efficient implementations of a set of compound skeletons. These compound skeletons result from the process of identifying, by experience, common ways of assembling primitive operations and providing optimised implementations of these. To demonstrate this concept, we will present two examples of such compound skeletons.

 

Pipeline skeleton

In this type of operation, two or more IP operations are cascaded in a pipeline, as shown in Figure 7. The input of each pipeline stage is provided by the output of the previous stage.

 

Figure 7. Pipeline skeleton

 

This structure is described by:

 

Op_description = pipeline([Op1_desc, Op2_desc, …, OpN_desc])                                  (1)

 

where OpI_desc {I = 1,2,…, N} is the high level description of each operation in the pipeline.
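In software terms, the pipeline skeleton of (1) is function composition; the sketch below is illustrative, not the HIDE4k implementation:

```python
def pipeline(stages):
    """Compose a list of operations into a single operation, feeding
    each stage's output to the next (the pipeline skeleton of (1))."""
    def run(image):
        for stage in stages:
            image = stage(image)
        return image
    return run
```

In hardware the stages run concurrently on successive pixels, so the composed circuit has the throughput of a single stage at the cost of added latency.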

                For instance, an ‘Open’ operation (see Figure 8) applied to 256x256 images of 8-bits/pixel, using 2’s complement bit parallel arithmetic, would be described by:

 

Open = pipeline(

[neighbourhood(tc_par, add, min, [[0,0,0],[0,0,0],[0,0,0]], 8, 256),

   neighbourhood(tc_par, add, max, [[0,0,0],[0,0,0],[0,0,0]], 8, 256)]

                              )

 

Figure 8. ‘Open’ operation

 

 

Parallel skeleton

A number of common image processing algorithms comprise several concurrent neighbourhood operations (simple or compound) which share the same input image, and whose templates have the same size and shape.  The results of these parallel operations are then combined in a reduce operation. Sobel, Prewitt, Roberts and Kirsch edge detectors [18] are examples of such operations.

Figure 9. Parallel skeleton

The high level description (Par_desc) of this operation would be defined as follows:

 

Par_desc = Op(Par_desc1, Par_desc2)    (2)

 

Par_desc1 and Par_desc2 are defined either recursively as compound operations of the form (2) itself, or as (‘terminal’) pipeline skeletons of the form (1). Note that prefix operators such as ‘+’ and ‘-’ can, for readability, be written in infix form. For instance, a Sobel operation (see Figure 10) will be described by:

 

pipeline([neighbourhood(tc_par, mult, accum, [[1,2,1],[0,0,0],[-1,-2,-1]], 8, Buf_size), abs])

+

pipeline([neighbourhood(tc_par, mult, accum, [[1,0,-1],[2,0,-2],[1,0,-1]], 8, Buf_size), abs])

 

 

Figure 10. Sobel operation

 

In such operations, only one set of line and pixel buffers is needed to synchronise the supply of pixels for all parallel operations instead of allocating separate line buffers for each neighbourhood operation. This is because all neighbourhood operations are applied to the same image.
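A software model of the parallel skeleton (2) runs both branches on the same input and combines the results pixel-wise; the sketch below is illustrative, with names of our own choosing:

```python
def parallel(op, branch1, branch2):
    """Parallel skeleton of (2): apply two image operations to the
    same input image and combine their results pixel-wise with op.
    In hardware, both branches share one set of line/pixel buffers."""
    def run(image):
        a, b = branch1(image), branch2(image)
        return [[op(x, y) for x, y in zip(row_a, row_b)]
                for row_a, row_b in zip(a, b)]
    return run
```

A Sobel detector fits this shape: the two branches are the horizontal and vertical gradient pipelines, and the combining operation is addition of their absolute values.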

Note that the skeletons (1) and (2) can be nested to any depth and interchangeably (FPGA area permitting).

 

5. Hardware implementation on a commercial FPGA based video board

 

To assess the effectiveness of our skeleton-based approach, we have implemented our system on a commercial FPGA based video processing board. The latter is a single slot PCI card which combines video acquisition, FPGA based real-time processing, and display [19]. A functional block diagram of the VigraVision board is given in Figure 11.

Figure 11. Block diagram of the VigraVision video board

 

Bit parallel arithmetic has been chosen to implement the IP operations on the onboard FPGA (XC4013E-3). This choice is motivated by the fact that bit parallel architectures often lead to a better time-hardware product than bit serial ones, mainly because of the dedicated fast carry logic on Xilinx FPGAs [5]. Moreover, in the context of real time video processing, the VigraVision board itself constrains the choice of arithmetic. If bit serial arithmetic is used, a bit clock must be generated from the pixel clock; the bit clock frequency is N times the pixel clock (for N-bit pixels). For practical real time video processing, the luminance pixel sampling rate is 13.5 MHz, implying a bit clock of 108 MHz for 8-bit pixels and 216 MHz for 16-bit pixels. The XC4013E-3 cannot operate at these frequencies, so the architectures used are built from bit parallel skeletons. A compromise in the form of digit serial arithmetic is still possible, but it implies additional hardware for digit clock generation and extra care over data synchronisation. A bit parallel implementation is simpler and can exploit the dedicated fast carry logic efficiently [16].
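The clock-rate constraint above is a simple calculation; as a quick check (with an illustrative helper name of our own):

```python
def bit_clock_mhz(pixel_clock_mhz, wordlength):
    """Bit clock needed for bit serial processing: one bit per cycle,
    so N times the pixel clock for N-bit pixels."""
    return pixel_clock_mhz * wordlength
```

At the 13.5 MHz luminance sampling rate, 8-bit pixels need a 108 MHz bit clock and 16-bit pixels a 216 MHz one, both beyond the XC4013E-3.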

Due to the limited memory resources on the FPGA chip, the line buffers have been implemented using the off-chip DRAM. Part of the FPGA is configured as an interface to the onboard DRAM (FIFOs), while the other part is configured to perform the required image processing operation as shown below:

Figure 12. Block diagram of the FPGA chip configuration

 

If the user wants to generate a complete configuration, including all the low level hardware details, he or she merely has to provide the required high level description. The latter description must conform to the format in (1) and (2) and can be input textually or even graphically. Based on the skeleton library presented above, our HIDE4k system is capable of generating the corresponding efficient FPGA configuration in seconds in the form of an EDIF netlist.

Due to the irregularity of the resulting architectures, the generated EDIF netlist is only partially placed. Once the EDIF description is generated, it is then fed to the Xilinx PAR tools to generate the FPGA configuration bitstream. This may take a long time (~1 hr on a Pentium 233 running Windows 95 with 32 MB of RAM). This is partly due to the fact that the EDIF netlist is only partially placed. Another reason is the small area of the target FPGA (24x24 CLBs only). Nonetheless, the whole process is transparent to the user.

At the application level, the user interfaces to the VigraVision board through a C-callable library (VigraVision ToolBox – VTB DLL). The ToolBox includes hardware initialisation and register control functions, image acquisition functions and image processing functions. For instance, the application developer downloads a particular FPGA configuration by invoking the following function:

 

u_loadXilinx(Xilinx_Chip_ID, Configuration_filename)

 

He or she can then copy the processed image from the video buffer to the host processor, for further analysis, by:

 

u_readRect(LPRECT lpImgRect, LPVOID imgData)

 

where lpImgRect specifies the rectangle in the frame buffer where data is to be read from, and imgData is a pointer to the image transferred to the host memory.

The ToolBox is supported in Microsoft Visual C/C++ 5.0. Its functions can be called from application software linked with the VTB import library, and are accessed via Windows as a DLL.  The resulting video coprocessor (see Figure 13) is based on a library of ready-to-use bitstream configurations that can be downloaded to the FPGA from a high level language (VC++ in our case). This library is extensible over time using our HIDE4k system. Thanks to our skeleton oriented approach, this task is relatively easy to perform and requires little FPGA hardware knowledge. This has been illustrated in this paper by two high level skeletons. Other skeletons can be designed using a similar approach and added to the library.

 

6. Conclusion

 

In this paper, we have presented a framework for FPGA based Image Processing. Central to this framework is the Hardware Skeleton Library which contains a set of high level descriptions of task-specific architectures specifically optimised for Xilinx XC4000 FPGAs. Although extensible, the library is based on a core level containing the operations of Image Algebra.  The library also contains high level skeletons for compound operations, whose implementations include task-specific optimisations. Skeletons are parameterisable, and different skeletons for the same operation can be provided, for instance for different arithmetic representations. This gives the user a range of implementation choices.  This in turn supports experimentation with different implementations and choosing the most suitable one for the particular constraints in hand (e.g. speed and area).  We are investigating the possibility of doing some of this experimentation automatically, but for now we do it manually.  Given a complete algorithm description in terms of skeletons, an efficient hardware configuration is generated automatically by our system.

          Our approach was successfully validated by a real hardware implementation of a video coprocessor on a commercial FPGA based video board, giving real-time processing of video data. This video coprocessor allows for rapid generation of FPGA architectures from very high level, algorithmic, descriptions and opens the way to enabling image processing application developers to exploit the high performance capability of a direct hardware solution, while programming in an application-oriented model.

Note that the skeleton oriented approach is not tied to a particular FPGA chip. Moreover, it may have some applicability for VLSI design. Furthermore, other application domains where there is an established algebra such as numerical processing can also benefit from the skeleton approach.

Full system development will in practice inevitably hit the problem that some particular task is not readily expressed in terms of the skeletons currently in the library.  An ongoing process of skeleton development will therefore always be necessary.  This will require a skilled architecture designer (although less efficient solutions might be possible using existing skeletons); the advantage of our approach is that system builders themselves do not require detailed hardware description skills.

                Future directions include upgrading the system to handle other FPGA series (particularly Xilinx Virtex chips). The extension of the hardware skeleton library, both in supporting more arithmetic types and providing other skeletons for more sophisticated image processing operations (wavelet transform in particular), is being investigated.

 



Figure 13. Overall view of the VigraVision based video coprocessor



7. References

 

[1]     Synopsys Inc., ‘Behavioural Compiler’, Software documentation, 1998. http://www.synopsys.com/products/beh_syn/

[2]     C Level Design Inc, ‘C/C++ Synthesis System Compiler’, Product overview, 1998 http://www.cleveldesign.com/products/

[3]     The Embedded Solutions Limited, ‘Handel C information sheets’, 1999

          http://www.embeddedsol.com

[4]     Hutchings B, Bellows P, Hawkins J, Hemmert S, Nelson B and Rytting M, ‘A CAD suite for High-Performance FPGA design’, FCCM’99, Preliminary Proceedings.

[5]     Benkrid K, ‘Design and Implementation of a High Level FPGA Based Coprocessor for Image and Video Processing’, PhD Thesis, Department of Computer Science, The Queen's University of Belfast, 2000.

[6]     Cole M, ‘Algorithmic Skeletons: structured management of parallel computation’, MIT Press, 1989.

[7]     Darlington J, Ghanem M, and To H W, 'Structured Parallel Programming', In Programming Models for Massively Parallel Computers, IEEE Computer Society Press, pp. 160-169, Sept 1993.

[8]     Michaelson G J, Scaife N R, and Wallace A M, 'Prototyping parallel algorithms in Standard ML', Proceedings of British Vision Conference, Sep 1995.  ftp://ftp.cee.hw.ac.uk/pub/funcprog/msw.bmvc95.ps.Z

[9]     Ritter G X, Wilson J N and Davidson J L, ‘Image Algebra: an overview’, Computer Vision, Graphics and Image Processing, No 49, pp 297-331, 1990.

[10]   Avizienis A, ‘Signed Digit Number Representation for Fast Parallel Arithmetic’, IRE Transactions on Electronic Computers, Vol. 10, pp. 389-400, 1961.

[11]   Clocksin W F and Melish C S, ‘Programming in Prolog’, Springer-Verlag, 1994

[12]   Crookes D, Alotaibi K, Bouridane A, Donachy P and Benkrid A, ‘An Environment for Generating FPGA Architectures for Image Algebra-based Algorithms’, ICIP98, Vol.3, pp. 990-994, 1998.

[13]   Benkrid K, Crookes D, Bouridane A, Corr P and Alotaibi K, ‘A High Level Software Environment for FPGA Based Image Processing’, Proc. IPA'99, IEE Seventh International Conference on Image Processing and its Applications, Manchester, July 1999.  pp. 112-116.

[14]  Xilinx Ltd, XC4000E and XC4000X Series Field Programmable Gate Arrays - Product Specification, 1999. http://www.xilinx.com/partinfo/4000.pdf

[15]   Koren I, ‘Computer arithmetic algorithms’, Prentice-Hall, Inc, pp. 99-126, 1993.

[16]   Benkrid K, Crookes D, Smith J, Benkrid A, 'High Level Programming for Real Time FPGA Based Video Programming', Proc. ICASSP'2000, IEEE International Conference on Acoustic, Speech and Signal Processing, Istanbul, June 2000. Volume VI, pp. 3227-3231.

[17]   Crookes D, Benkrid K, Bouridane A, Alotaibi K and Benkrid A, ‘Design and Implementation of a High Level Programming Environment for FPGA Based Image Processing’, IEE proceedings Vision, Image and Signal Processing, Vol. 147, No. 7, pp. 377-384.

[18]   Russ J C, ‘The Image Processing Handbook’, CRC Press, 1995.

[19]   Visicom Laboratories, ‘The VigraVision PCI video board: user’s manual’, 1998.

          www.visicom.com