Rant on OpenCL on FPGAs From a Point of View of a Software Developer

This is why I haven’t posted any ChaiScript tutorials recently: I’m busy with a project to bring OpenCL into OpenCV on a Cyclone V SoC Development Board.

There are a lot of reports and articles about how FPGAs are so powerful that they can outperform GPUs, and about how OpenCL can unleash an FPGA’s full performance in a way that every developer can use.

Let me tell you this: FPGAs are great. They perform really well. However, they are NOT easy to program (from a software developer’s view), and optimizing code for an FPGA takes a totally different mindset than programming a GPU.

My FPGA development board running SVM (it’s slow because of the overhead of converting CV data into an OpenCL-readable form, not because of the FPGA)

First of all, how can OpenCL C, a language designed to be compiled into machine code, run on an FPGA? It is done by converting your OpenCL code into Verilog, then synthesising it to create the binary the FPGA needs.

What’s the problem then? Easy: it takes forever for the synthesiser to run. The following kernel takes a bit more than an hour to compile, which is forever by software development standards. And the compiler needs more than 3.5 GB of RAM to run.

__kernel void compute(int hisGramSize, __global float* blockRes, __global float* blocks, __global float* svm)
{
    int size = get_global_size(0); // total number of work-items (unused)
    int id = get_global_id(0);     // one work-item per block

    // Dot product of this block's histogram with the matching SVM weights
    float result = 0.f;
    int offset = id*hisGramSize;
    for(int i=offset;i<offset+hisGramSize;i++)
        result += blocks[i]*svm[i];
    blockRes[id] = result;
}
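With compile times like that, it helps to have something to check the FPGA's output against without another trip through the synthesiser. Here is a minimal sketch of a plain C reference for what the kernel above computes. The function name `compute_reference` and the `num_blocks` parameter (standing in for `get_global_size(0)`) are made up for this sketch; they are not part of the project.

```c
#include <stddef.h>

/* CPU reference for the `compute` kernel: one dot product per block.
 * `num_blocks` plays the role of get_global_size(0) and is a name
 * invented for this sketch. Like the kernel, it indexes `svm` with the
 * same global index as `blocks`, so both arrays must hold
 * num_blocks * hisGramSize floats. */
void compute_reference(int hisGramSize, float *blockRes,
                       const float *blocks, const float *svm,
                       size_t num_blocks)
{
    for (size_t id = 0; id < num_blocks; id++) {
        size_t offset = id * (size_t)hisGramSize;
        float result = 0.f;
        for (size_t i = offset; i < offset + (size_t)hisGramSize; i++)
            result += blocks[i] * svm[i];
        blockRes[id] = result;
    }
}
```

Running this on the same input buffers and comparing against `blockRes` read back from the device (with a small tolerance, since floating point sums may be ordered differently in hardware) catches logic mistakes before you pay for another hour of compilation.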

There is no way anyone can rapidly develop and debug OpenCL for an FPGA. Trial and error won’t work here. That’s one downside. What’s more, the kernel can only be compiled offline, instead of the usual online compilation we do in OpenCL to ensure portability across different CPUs/GPUs. 😦
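In practice, offline compilation means the host program loads a precompiled binary from disk instead of handing kernel source to the runtime. In standard OpenCL that binary goes to `clCreateProgramWithBinary` rather than `clCreateProgramWithSource`. The sketch below covers only the file-reading step, in plain C, so it runs without an OpenCL SDK. The helper name `read_kernel_binary` is hypothetical and not taken from the project.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file into a malloc'd buffer and stores its length in
 * *size_out. Returns NULL on any failure (missing file, empty file,
 * short read). The caller frees the buffer.
 * Hypothetical helper for this sketch. */
unsigned char *read_kernel_binary(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    unsigned char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long len = ftell(f);
        if (len > 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = (unsigned char *)malloc((size_t)len);
            if (buf && fread(buf, 1, (size_t)len, f) != (size_t)len) {
                free(buf);   /* short read: treat as failure */
                buf = NULL;
            }
            if (buf)
                *size_out = (size_t)len;
        }
    }
    fclose(f);
    return buf;
}
```

The returned buffer and its length are what you would pass on to `clCreateProgramWithBinary`. On a GPU you would instead read the `.cl` source text and let the runtime compile it at startup.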

Setting up OpenCL on an FPGA is a bit of a pain too. Programming the FPGA is done by running a command on the command line. Annoying. Also, because the kernel is compiled offline, if the runtime version differs from the version of the offline compiler used to build it, everything fails with no warning. Face-palm.

Besides all of the above, there is still one huge downside of an FPGA: you can’t put too much logic into a kernel. Otherwise, one or more of the following can happen:

  1. The synthesiser gives up
  2. There is not enough space on the FPGA to hold all the logic

If the compile passes, you still have the problem that you can’t pack too many compute units onto the FPGA, because the total resources on the FPGA are limited.

Despite these downsides, FPGAs have their upsides:

  1. High performance per watt (even compared to a GPU)
  2. Very high memory throughput
  3. Can do lots of things in parallel where possible, even if we only have one compute unit

Hacking OpenCL code into OpenCV at a hackathon.

10 thoughts on “Rant on OpenCL on FPGAs From a Point of View of a Software Developer”

    1. I ended up bringing my own OpenCL kernel code into OpenCV. The kernel code in OpenCV is way too large to fit on the Cyclone V FPGA.
      My original plan was to implement HoG and SVM in OpenCL, but I ended up implementing only SVM in OpenCL. I simply couldn’t fit HoG on the FPGA.

  1. Thanks for your reply 🙂 I wanted to implement the SURF algorithm, which already comes with OpenCL code. So it’s probably not possible to use it? Could you describe the steps you took to use OpenCV with your own kernel code?
    Did you rewrite the OpenCL code that comes with OpenCV, or did you search the library code for parts that can be implemented efficiently in OpenCL C? Thanks 🙂

    1. What platform are you developing for? The only reason I care about code size is that I’m using a small FPGA. If you are developing for GPUs, there is virtually no code size limit. The same goes if you have a high-end FPGA, where you can fit a lot of logic without filling it entirely.

      What I did is stupid, but I was at a hackathon. I modified the constructor of the original class (HOGDescriptor for me) to initialize OpenCL (load the kernel code and compile it if on a GPU, load the binary if on an FPGA) and other stuff. Then I modified the `compute` method to use the kernel created in the constructor. The proper way to do this, though, is to duplicate the class you are OpenCL-ifying into the ocl namespace, then work on that copy.

      Nah, I wrote my own OpenCL kernel. As I said, the ones in OpenCV are too huge to fit in my FPGA, so I had to write a miniature version. I started out by profiling OpenCV to find where most of the CPU time is spent, then wrote a kernel for that part.
      Also, there is an important catch: OpenCV is designed from the ground up for CPUs, so I needed to undo some optimizations for OpenCL to work. (E.g., there is manual data caching (not prefetching) in OpenCV’s HoG-SVM code. I needed to undo that; otherwise implementing the OpenCL kernel is impossible.)

      1. @TADRIAN

        I got a 4x speedup on the SVM compared to doing it on the CPU. Unfortunately, overall it’s 30x slower than CPU-only because of how I undo OpenCV’s manual caching. The way I did it is CPU intensive. Ideally I should remove the caching completely and create a plain array, but I ran out of time to implement that during the hackathon. 😦
        Theoretically (if there were no overhead from removing the data caching), it should be ~2.5x faster (or at most 4x by applying asynchronous computing).

  2. Is it possible to look at your modified OpenCV code? I’m struggling with the task of using the kernels from orb.cl (I changed to ORB because it runs much faster on an embedded device, e.g. with the NEON flag enabled in OpenCV) with the Cyclone V, because I have trouble finding a way to start. I hope that looking at your code will show me a way to implement my FPGA-accelerated code.

    1. Hello. Sorry for the delayed reply.
      Here is the part we changed in hog.cpp to make it run on the FPGA, with some comments on how it works.
      https://gist.github.com/marty1885/75ca2e236cd5938146ec09f2a56eed0b

      To anyone using this at any time: please do not use it directly in any commercial project. It can, however, be used as a reference for implementing commercial code. (My friend wants to add this restriction because this is a hackathon project. He believes it should not be used commercially.)
