EDIT: This article was originally published on my personal blog in March ’09, and applies to Generation 1 and 2 hardware. I’m republishing this article (and its predecessor on UtopiaGL) here to serve as a base for future posts on UtopiaGL.
Fine-tuning 3D code can be a black art, particularly if you are new to it. If you’re not familiar with the hardware, it may not be immediately obvious why certain calls are expensive and others aren’t. To complicate matters further, every system has quirks and gotchas that surface along the way while you develop 3D code, and the iPhone is no exception. Here are some optimizations that may save you some head scratching, particularly if you’re new to OpenGL.
EDIT: I have updated some of the points below to reflect the iPhone benchmarking done by Patrick Hogan. Thanks again Patrick. For more details, check out the comments at the bottom.
Timers are Great, Threads are Better
The ‘Hello World’ iPhone OpenGL ES project sets up a simple timer-based run loop, which is a very convenient way of executing your app. You don’t have to worry about synchronization issues, there’s no overhead from context switches (your app executes in a single thread) and debugging is trivial. So why would you want to rework your code to run in a separate thread? The reason is simple: with timers there is potentially a gap of ‘dead time’ where your engine is doing nothing, waiting for the next timer event to fire. This defeats the parallelism you could otherwise achieve between your app and OpenGL. The GPU has enough opportunities to stall when interacting with your app without the additional delay of your app waiting for a specific time at which to execute. As soon as CPU cycles are available to your app, take them immediately, get all frame processing out of the way and issue your calls to OpenGL. From a performance point of view, ideally neither the CPU nor the GPU should ever be stalled if it can be avoided. My engine, UtopiaGL, runs around 15% faster using a thread-based frame update instead of a timer-based approach.
Avoid thread synchronization issues by engineering your code to simply not need synchronization.
EDIT: My current project has turned up some new details here. When I tested on generation 1 hardware, the threaded approach resulted in sporadically choppy updates. I was able to address this by introducing a short sleep per frame, but performance then ended up the same as the timer method. It looks like the OS background tasks get starved of CPU cycles and then all of a sudden shut your app out for a relatively huge chunk of time while they look after themselves. This feels very Brew 1.1!! The timer method is far simpler, so for generation 1 hardware I opt for the timer approach.
Allow OpenGL to run in Parallel
The OpenGL ES pipeline executes in its own thread. When you issue calls to the API, they are not executed immediately; instead they are placed into a command buffer which eventually gets flushed. Calls to the API return immediately, allowing your app to proceed with whatever it’s doing. In theory this means you can execute for a considerable chunk of time in parallel with the graphics hardware. That is as you would expect, but certain calls effectively break this parallelism, and with them the frame rate of your app comes spinning down. The usual suspects are glReadPixels, glTex(Sub)Image, glBuffer(Sub)Data, etc. If you need to use them every frame, be aware that they carry quite a performance hit.
If your app runs within a single thread (i.e. the renderer) then avoid calls to glFlush and never, ever use glFinish. glFinish is particularly bad: it flushes the command buffer and blocks the calling thread until every queued command has finished executing. glFlush submits the buffer without blocking. Both are used to synchronize with the GL driver, but you rarely need to do that. An example of when you might is if you had two rendering threads, both rendering into the same GL context. If you do need synchronization, use glFlush. Pretend glFinish doesn’t exist – you should literally never call it on the iPhone.
Use Vertex Buffer Objects or Not!
Make good use of Vertex Buffer Objects for any static geometry. In theory this allows you to store geometry (including index data) in fast video memory. The iPhone uses shared memory – the CPU and PVR hardware share the same memory – so in this case it simply means you save having to upload your geometry to GL every time you call glDrawElements. The savings here can be very significant. You would be amazed how much geometry you can throw at the iPhone with VBOs before it breaks a sweat.
EDIT: After reading Patrick’s comments below, I went back and looked at my VBO code. Unfortunately the gl driver doesn’t take advantage of VBOs and uses a full copy operation in both the VBO and non-VBO case. Up until now I’d been working off observations with my own shader pipeline. I see an increase in speed using VBOs, but on closer examination the speed increase doesn’t come from the vertices being in a VBO – it comes from the fact that my non-VBO path does more work on the CPU side. This surprised me because I saw a drop of around 8fps with a stress-test scene in the non-VBO case – quite a bit more than I expected from the extra CPU work alone. I expected the result to be in line with VBOs being effective, and that things would run slower without them.

My shader pipeline works like this: geometry is queued for rendering, and when the scene is complete it tries to minimize the number of calls to glDrawElements by rendering from shader buckets. Effectively each shader has a geometry bucket, and any geometry in the same reference frame is packed together and issued to gl in a single call to glDrawElements. Most of the geometry (anything not dynamically generated per frame) lives in its own VBO. In that case I don’t add it to the shader bucket; I just render it on its own with a call to glDrawElements.

Since VBOs don’t seem to offer a speed increase, turning them off and falling back to my packing scheme should give fairly similar speed – but it doesn’t. It does the same amount of work as the VBO case plus the extra work of packing everything. There seems to be no advantage in minimizing calls to glDrawElements – at least not with the scenes I’ve been testing; it seems better to just issue multiple glDrawElements calls. I didn’t look at it too closely originally because my typical scenes were running at 60fps – I saw what I was expecting to see: that VBOs were faster. In my case they were, but not for the reason I assumed.
The tests I did above were not exhaustive and are very particular to my engine and current app, but they agree with Patrick’s findings below. When I remove the code that minimizes glDrawElements calls and issue the same geometry with and without VBOs, I see no real difference in speed. Vertices have position (xyz), texcoord and color attributes – lighting is precached.
Be Cache Aware
Use indexed triangles for geometry, and render them through the glDrawElements call. Sort the tris to maximize vertex cache usage, then sort the vertices into sorted-tri order. Don’t worry too much about having the tris in strip order – make cache usage your sorting metric. Where it makes sense, interleave vertex attributes; for example, all static geometry should have interleaved attributes.
EDIT: I haven’t confirmed it in my engine yet, but it seems strips do outperform lists by a small amount.
Read the Apple and PVR performance guidelines
In short: RTFM. There’s a lot of good advice in those docs. Surprisingly, when I last checked, Vertex Buffer Objects weren’t mentioned in the Apple guidelines – an odd thing to omit.
EDIT: In light of the VBO implementation, this isn’t an odd thing to omit at all. Also, the Apple recommendation to use lists instead of strips doesn’t seem to hold water, although it is fairly close. See the comments below from Patrick. The best advice is: Read the guidelines, but profile your code.
Instruments is an excellent app that comes with Xcode and allows you to profile your app running on-device. You can examine all kinds of details about how your app executes, and zero in on bottlenecks. With respect to OpenGL, you can use it to see which calls are issued a lot and/or take the most time. It is an invaluable tool, and should be your guide for any optimizations.
Every app is different, and even the same engine can perform wildly differently with different content, so when you hit a bottleneck you really need to profile to find out where your app and content are spending most of their time, and focus your optimizations there. That said, many of the above points apply globally, and your app will almost certainly avoid performance pitfalls if you are careful to adhere to them. The iPhone has some serious horsepower under its hood. Initially I was skeptical about just how powerful it was, but after fine-tuning my engine I’m extremely pleased with it. Compared to other mobile 3D devices, the iPhone and iPod Touch are in a class of their own.