Porting Intel Embree to ARM.


Intel Embree is a high-performance ray tracing kernel for x86/AMD64 architectures. In my experience it can be up to 10x faster than the typical BVH ray-scene intersection tester implemented in PBRT. But it only works on x86 CPUs. Why not port it so it runs on an ARM processor? It sounds fun, doesn’t it?

Picture 1. Embree’s dynamic scene demo running on an ARM board.

Start of the Journey

Reading through Embree’s README, we learn that Embree is implemented and optimized with SIMD and requires at least SSE support to work.

So I started by trying to compile Embree without any modification on an ARM development board, forcing it to use SSE only (so I don’t have to deal with AVX and stuff), to see what happens. Sure enough, GCC can’t find “immintrin.h” and “xmmintrin.h”, which are the headers that define the SSE intrinsics. After some googling I found a header file called SSE2NEON.h that implements some SSE intrinsics on top of NEON intrinsics. It sounds promising. Therefore I downloaded it, stuck it into the project directory, found all the includes of the SSE headers and replaced them with includes of SSE2NEON.h.
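To give an idea of what a header like SSE2NEON.h does, here is a minimal sketch of mapping a 4-float vector type and a few operations onto NEON. The names vec4f, vec4f_load, vec4f_store and vec4f_add are hypothetical and are not taken from SSE2NEON.h; the scalar branch exists only so the sketch also builds on a machine without NEON.

```cpp
#include <cstddef>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// On ARM, the 4-float vector maps directly onto a NEON register type.
typedef float32x4_t vec4f;

inline vec4f vec4f_load(const float* p)     { return vld1q_f32(p); }
inline void  vec4f_store(float* p, vec4f v) { vst1q_f32(p, v); }
inline vec4f vec4f_add(vec4f a, vec4f b)    { return vaddq_f32(a, b); }

#else

// Scalar fallback (assumption: only here so the sketch compiles without NEON).
struct vec4f { float lane[4]; };

inline vec4f vec4f_load(const float* p) {
    vec4f v;
    for (std::size_t i = 0; i < 4; ++i) v.lane[i] = p[i];
    return v;
}

inline void vec4f_store(float* p, vec4f v) {
    for (std::size_t i = 0; i < 4; ++i) p[i] = v.lane[i];
}

inline vec4f vec4f_add(vec4f a, vec4f b) {
    vec4f r;
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

#endif
```

Code written against such a wrapper calls vec4f_add and never touches the platform intrinsics directly, which is roughly how swapping the SSE headers for a compatibility header can work without editing every call site.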

It’s not enough

After that, I compiled Embree again. Unfortunately, a lot of SSE functions were still missing (not only intrinsics, but also memory allocation/free and CPU control functions). The only thing I could do was implement the missing functions on top of NEON intrinsics and POSIX/C APIs. MSDN’s SSE documentation is really great and thorough, and GCC’s NEON intrinsics documentation is good enough for my purpose. They helped a lot in this phase of porting.
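As an illustration of the memory allocation part, aligned allocation in the style of the SSE helpers can be built on POSIX posix_memalign. This is a sketch under assumptions: the names compat_aligned_malloc and compat_aligned_free are hypothetical, and the code in the actual port may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign and free (POSIX)

// Allocate `size` bytes aligned to `alignment` bytes.
// posix_memalign requires the alignment to be a power of two and a
// multiple of sizeof(void*), so small alignments are rounded up.
inline void* compat_aligned_malloc(std::size_t size, std::size_t alignment) {
    if (alignment < sizeof(void*)) alignment = sizeof(void*);
    void* ptr = nullptr;
    if (posix_memalign(&ptr, alignment, size) != 0) return nullptr;
    return ptr;
}

// Memory obtained from posix_memalign is released with plain free().
inline void compat_aligned_free(void* ptr) {
    free(ptr);
}
```

A 16-byte alignment is what 128-bit SIMD loads and stores typically expect, so a call such as compat_aligned_malloc(bytes, 16) covers the common case.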

It’s worth noting that SSE supports double-precision floating point numbers, but NEON (at least the 32-bit ARMv7 flavor) doesn’t. Therefore those SSE intrinsics have to be implemented in plain C. Fortunately there are only 2 such intrinsics used in Embree.
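A plain C/C++ fallback for a two-lane double vector can be as simple as a struct and a loop. The sketch below is only an illustration: the post does not say which two intrinsics were involved, so the type double2 and both functions are hypothetical examples rather than the ones from the port.

```cpp
#include <cstddef>

// Two double-precision lanes, standing in for a 128-bit double vector.
struct double2 { double lane[2]; };

// Widen two consecutive floats to doubles (hypothetical example operation).
inline double2 double2_from_floats(const float* p) {
    double2 r;
    for (std::size_t i = 0; i < 2; ++i) r.lane[i] = static_cast<double>(p[i]);
    return r;
}

// Lane-wise addition (hypothetical example operation).
inline double2 double2_add(double2 a, double2 b) {
    double2 r;
    for (std::size_t i = 0; i < 2; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

With only two lanes the scalar version loses little, which is probably why falling back to plain code is acceptable for a couple of rarely used operations.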

CPU Model Detection

After implementing the missing SSE intrinsics and recompiling the entire project, I ran into another problem. Embree has a built-in feature that detects which instruction sets are supported by the CPU it’s running on (not the one it was compiled on). This is done by executing the “cpuid” instruction on x86 platforms, but ARM doesn’t support this the way x86 does. Because I’m a lazy person, I simply added a new instruction set called NEON to Embree’s instruction set list.

static const int CPU_FEATURE_NEON = 1 << 31;
static const int NEON = CPU_FEATURE_NEON;

and return CPU_FEATURE_NEON unconditionally whenever “getCPUFeatures” in common/sys/sysinfo.cpp is called. I also added the string “NEON” to the other functions that convert CPU feature lists to CPU architecture names.
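The string conversion mentioned above might look roughly like this. It is a sketch, not the code from sysinfo.cpp: the function name featuresToString is hypothetical, and the bit positions of CPU_FEATURE_SSE and CPU_FEATURE_SSE2 are assumptions. Only the CPU_FEATURE_NEON definition is taken from this post.

```cpp
#include <string>

static const int CPU_FEATURE_SSE  = 1 << 0;   // assumed bit position
static const int CPU_FEATURE_SSE2 = 1 << 1;   // assumed bit position
static const int CPU_FEATURE_NEON = 1 << 31;  // as defined in this post

// Build a space-separated list of the feature names set in `features`.
inline std::string featuresToString(int features) {
    std::string out;
    if (features & CPU_FEATURE_SSE)  out += "SSE ";
    if (features & CPU_FEATURE_SSE2) out += "SSE2 ";
    if (features & CPU_FEATURE_NEON) out += "NEON ";
    return out;
}
```

Adding a new instruction set then comes down to one new constant and one new line in each function that names the features.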

Ambiguous Overload?

I ran into a problem right after hacking my way through CPU model detection. The compiler complained that I had 2 functions in common/sys/intrinsics.h with the same signature. How could that be? After hours of staring at the source code, I finally had a look at its #ifdef logic. It assumes that if __X86_64__ is not defined, it must be compiling for a 32-bit system. But I’m compiling for a 64-bit ARM processor, so __X86_64__ is not defined even though it is a 64-bit system (facepalm). Adding -D__X86_64__ to the C++ compiler flags fixes this.
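The underlying issue is that a header uses an x86-only macro to answer the question "is this a 64-bit build?". The sketch below shows one alternative way to key such a guard on pointer width instead. It is not Embree's header: the SKETCH_64BIT macro and the countSetBits function are hypothetical, and the -D__X86_64__ flag described above is what the port actually used.

```cpp
#include <cstddef>
#include <cstdint>

// Treat the target as 64-bit whenever pointers are wider than 32 bits,
// whether the CPU is x86-64 or 64-bit ARM.
#if UINTPTR_MAX > 0xFFFFFFFFu
#define SKETCH_64BIT 1
#endif

// 32-bit overload: number of set bits in v.
inline int countSetBits(unsigned int v) {
    return __builtin_popcount(v);
}

#if defined(SKETCH_64BIT)
// Only declared on 64-bit targets. On many 32-bit targets std::size_t is the
// same type as unsigned int, so declaring this there would redefine the
// function above.
inline int countSetBits(std::size_t v) {
    return __builtin_popcountll(v);
}
#endif
```

Because the guard no longer mentions x86, both overloads exist on 64-bit ARM and calls with either argument type resolve cleanly.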

Assembly Intrinsics

For optimal performance, Embree uses x86 assembly intrinsics, which are definitely not supported on ARM. Fixing this is easy: implement them in plain old C. GCC built-in functions such as “__builtin_clz” came in handy, so I didn’t have to write ARM assembly directly.
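For example, bit-scan operations of the kind x86 exposes as single instructions can be expressed with GCC builtins. A minimal sketch, assuming hypothetical names bitScanForward and bitScanReverse (not the names used in the port):

```cpp
// Index of the lowest set bit. The argument must be non-zero, because
// __builtin_ctz is undefined for 0.
inline int bitScanForward(unsigned int v) {
    return __builtin_ctz(v);
}

// Index of the highest set bit. The argument must be non-zero, because
// __builtin_clz is undefined for 0.
inline int bitScanReverse(unsigned int v) {
    const int bits = static_cast<int>(sizeof(unsigned int) * 8);
    return bits - 1 - __builtin_clz(v);
}
```

The compiler lowers these builtins to whatever the target offers, so the same source works on x86 and ARM without inline assembly.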


After that, everything compiles. But “everything compiles” does not automatically turn into “everything runs”.

I tried to run one of the sample programs Embree provides. It printed “Unsupported CPU” and exited. I recalled that the README file says:

RTC_UNSUPPORTED_CPU The CPU is not supported as it does not support SSE2.

Hmm… Maybe I also need to return CPU_FEATURE_SSE2 when “getCPUFeatures” is called? Let me have a try… Bingo! It runs. Apparently just returning CPU_FEATURE_NEON won’t make Embree think that it’s running on a CPU that supports SSE2. What an idiot I am.
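The failure and the fix can be illustrated with a small bitmask sketch. The bit position of CPU_FEATURE_SSE2 and the names getCPUFeaturesSketch and hasAllFeatures are assumptions made for illustration; only the CPU_FEATURE_NEON definition comes from this post.

```cpp
static const int CPU_FEATURE_SSE2 = 1 << 1;   // assumed bit position
static const int CPU_FEATURE_NEON = 1 << 31;  // as defined earlier in this post

// Hypothetical stand-in for the hacked feature query: report NEON and also
// claim SSE2, since the SSE code paths are emulated on top of NEON.
inline int getCPUFeaturesSketch() {
    return CPU_FEATURE_NEON | CPU_FEATURE_SSE2;
}

// Hypothetical startup check: true only if every required bit is present.
inline bool hasAllFeatures(int available, int required) {
    return (available & required) == required;
}
```

With only the NEON bit set, a check that requires SSE2 fails, which matches the “Unsupported CPU” message; once the SSE2 bit is reported too, the check passes.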


Embree runs on ARM now. WOW, I never thought I’d make it.

I have also compiled my own renderer on ARM and tried to render some stuff. Although Embree is designed for Intel CPUs, the performance is still amazing on ARM.

Picture 2. My renderer running on ARM with Embree as accelerator.


Q1: How fast is it?
A1: Running Embree on a 4-core ARM Cortex-A53 at 1.2GHz with 800MHz DDR3L memory (if I recall correctly) is around 5~6 times slower than on my i5-4210H laptop with dual-channel DDR3L @ 1600MHz, which is impressive for such a small device.

Q2: Which Development Board are You Using?
A2: A DragonBoard 410c with Debian 8 and GCC 4.9. GCC 6.1 works too.


Hair geometry and intersection filters aren’t working. Maybe there are some bugs in my assembly intrinsics or my NEON implementation of SSE. But the basic features still work properly.


Here is the source code of Embree for ARM if anyone wants to have some fun with it. https://github.com/marty1885/embree-arm

Also, please note that while compiling Embree, a single GCC process can take up to 520M of memory. Be really careful if you are compiling it directly on your ARM development board. You might run out of memory if you only have 1G or launch too many processes.

I might go in and fix the bugs and properly implement the parts I hacked in the future. But that’s for the future.

