iPhone OpenGL ES Performance


EDIT: This article was originally published on my personal blog in March ’09, and applies to Generation 1 and 2 hardware.  I’m republishing this article (and its predecessor on UtopiaGL) here to serve as a base for future posts on UtopiaGL.

Fine-tuning 3D code can be a black art, particularly if you are new to it.  If you’re not familiar with the hardware, it may not be immediately obvious why certain calls are expensive and others aren’t. To complicate matters further, each system has its own quirks and gotchas that surface along the way while developing 3D code.  The iPhone is no exception.  Here are some optimizations that may save you some head-scratching, particularly if you’re new to OpenGL.

EDIT: I have updated some of the points below to reflect the iPhone benchmarking done by Patrick Hogan. Thanks again Patrick. For more details, check out the comments at the bottom.

Timers are Great, Threads are Better

The ‘Hello World’ iPhone OpenGL ES project sets up a simple timer-based run loop, which is a very convenient way of executing your app.  You don’t have to worry about synchronization issues, there’s no overhead involved in context switches (your app executes in a single thread) and debugging is trivial.  So why would you want to rework your code to operate in a separate thread?  The reason is simple: with timers there will potentially be a gap of ‘dead time’ where your engine is doing nothing, waiting for the next timer event to fire.  This defeats the potential parallelism you can achieve between your app and OpenGL.  The GPU has enough opportunities to stall when interacting with your app without the added delay of your app sitting idle until its next scheduled tick.  As soon as CPU cycles are available to your app, take them immediately, get all frame processing out of the way and issue your calls to OpenGL.  From a performance point of view, neither the CPU nor the GPU should ever stall if it can be avoided.  My engine, UtopiaGL, runs around 15% faster using a thread-based frame update instead of a timer-based approach.

Avoid thread synchronization issues by engineering your code so that it simply doesn’t need synchronization.

EDIT: My current project has turned up some new details here. When I tested on generation 1 hardware, the threaded approach resulted in sporadically choppy updates.  I was able to address this by introducing a short sleep per frame, but that left it performing the same as the timer method.  It looks like the OS background tasks get starved of CPU cycles and then suddenly shut your app out for a relatively huge chunk of time while they look after themselves.  This feels very Brew 1.1!!  The timer method is far simpler, so for generation 1 hardware I opt for the timer approach.
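For reference, the shape of the loop itself is tiny. Here is a minimal C-level sketch, assuming hypothetical EngineUpdate() and SwapBuffers() wrappers (the actual Objective-C version appears in the comments below), with the per-frame sleep slotted in at the end:

    #include <unistd.h>    // usleep
    #include <stdbool.h>

    extern volatile bool g_end;  // set by the engine when it wants to exit

    void EngineUpdate(void);     // hypothetical: tick + issue GL calls
    void SwapBuffers(void);      // hypothetical: present the frame

    static void RenderLoop(void)
    {
        while( !g_end )
        {
            EngineUpdate();      // do all frame work as soon as CPU is free
            SwapBuffers();

            // Generation 1 workaround: yield a little CPU each frame so the
            // OS background tasks aren't starved (tune the value per device).
            usleep( 2000 );
        }
    }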

Allow OpenGL to run in Parallel

The OpenGL ES pipeline executes in its own thread.  When you issue calls to the API they are not executed immediately; instead they are placed into a command buffer which eventually gets flushed.  Calls to the API return immediately, allowing your app to proceed with whatever it’s doing.  This means that in theory you should be able to execute for a considerable chunk of time in parallel with the graphics hardware.  This is as you would expect, but certain calls effectively break this parallelism, and with them the frame rate of your app comes spinning down.  The usual suspects are glReadPixels, glTex(Sub)Image2D and glBuffer(Sub)Data.  If you need to use them per frame, be aware that they carry quite a performance hit.

If your app runs within a single thread (i.e. the renderer) then avoid calls to glFlush and never, ever use glFinish.  glFinish is particularly bad.  It flushes the command buffer and blocks the calling thread while it does it.  glFlush does the same thing, less the block.  Both are used to synchronize with the GL driver, but you rarely need to do that.  An example of when you might need it is if you had two rendering threads, both of which were rendering into the same GL context.  If you do need synchronization, use glFlush.  Pretend glFinish doesn’t exist – you should literally never call it on the iPhone.
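To make the two-renderer case concrete, here is a hedged sketch; the pass structure and the external lock on the shared context are assumptions for illustration only:

    #include <OpenGLES/ES1/gl.h>

    // Thread A renders its pass into a context shared with thread B.
    // An external lock (not shown) serializes access to the context.
    void RenderPassA(void)
    {
        // ... issue glDraw* calls for pass A ...

        // Submit the queued commands to the driver without blocking, so
        // they aren't still sitting in the command buffer when thread B
        // takes over the context.
        glFlush();

        // glFinish() would also synchronize, but it blocks this thread
        // until the GPU is completely done. Never call it on the iPhone.
    }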

Use Vertex Buffer Objects or Not!

Make good use of Vertex Buffer Objects for any static geometry. In theory this allows you to store geometry (including index data) in fast video memory.  The iPhone uses shared memory, i.e. the CPU and the PVR hardware share the same memory, so in this case it simply means you save having to upload your geometry to GL every time you call glDrawElements.  The savings here can be very significant.  You would be amazed how much geometry you can throw at the iPhone with VBOs before it breaks a sweat.
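If you haven’t used VBOs before, here is a minimal GL ES 1.1 sketch of the static-geometry path; the function names and parameters are illustrative:

    #include <OpenGLES/ES1/gl.h>

    static GLuint s_vbo, s_ibo;

    // One-time setup: upload static vertex and index data into buffer objects.
    void CreateStaticMesh( const GLfloat* verts, int vertBytes,
                           const GLushort* indices, int indexBytes )
    {
        glGenBuffers( 1, &s_vbo );
        glBindBuffer( GL_ARRAY_BUFFER, s_vbo );
        glBufferData( GL_ARRAY_BUFFER, vertBytes, verts, GL_STATIC_DRAW );

        glGenBuffers( 1, &s_ibo );
        glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, s_ibo );
        glBufferData( GL_ELEMENT_ARRAY_BUFFER, indexBytes, indices, GL_STATIC_DRAW );
    }

    // Per frame: with buffers bound, the pointer arguments become byte offsets.
    void DrawStaticMesh( int indexCount )
    {
        glBindBuffer( GL_ARRAY_BUFFER, s_vbo );
        glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, s_ibo );
        glEnableClientState( GL_VERTEX_ARRAY );
        glVertexPointer( 3, GL_FLOAT, 0, (const GLvoid*)0 );
        glDrawElements( GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, (const GLvoid*)0 );
    }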

EDIT: After reading Patrick’s comments below, I went back and looked at my VBO code. Unfortunately the GL driver doesn’t take advantage of VBOs and performs a full copy operation in both the VBO and non-VBO case. Up until now I’d been working off observations of my own shader pipeline: I see a speed increase using VBOs, but on closer examination the increase doesn’t come from the vertices being in a VBO – it comes from my non-VBO path doing more work on the CPU side. This surprised me, because I saw a drop of around 8fps with a stress-test scene in the non-VBO case – quite a bit more than I expected from the extra CPU work alone. I took that as confirmation that VBOs were effective and that things ran slower without them.

My shader pipeline works like this: geometry is queued for rendering, and when the scene is complete the pipeline tries to minimize the number of calls to glDrawElements by rendering from shader buckets. Each shader has a geometry bucket, and any geometry in the same reference frame is packed together and issued to GL in a single glDrawElements call. Most of the geometry (anything not dynamically generated per frame) lives in its own VBO; that geometry skips the shader bucket and is rendered on its own with a glDrawElements call.

Since VBOs don’t actually offer a speed increase, turning them off and falling back to my packing scheme should give fairly similar speed – but it doesn’t. The non-VBO path does the same amount of GL work as the VBO case, plus the extra CPU work to pack everything. So there seems to be no advantage in minimizing calls to glDrawElements – at least not with the scenes I’ve been testing; it’s better to just issue multiple glDrawElements calls. I didn’t look at this closely originally because my typical scenes were running at 60fps and I saw what I expected to see: VBOs were faster. In my case they were – just not for the reason I assumed.

The tests I did above were not exhaustive and are very particular to my engine and current app, but they agree with Patrick’s findings below. When I remove the code that minimizes glDrawElements calls and issue the same geometry with and without VBOs, I see no real difference in speed. Vertices have xyz, texture-coordinate and color attributes – lighting is precached.

Be Cache Aware

Use indexed triangles for geometry, and render through glDrawElements.  Sort tris to maximize vertex cache usage, then sort the vertices into sorted-tri order.  Don’t worry too much about having the tris in strip order – make cache usage your sorting metric.  Where it makes sense, interleave vertex attributes; all static geometry, for example, should have interleaved attributes.
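As an illustration of interleaving, here is a minimal sketch of the layout and client-state setup under GL ES 1.1; the Vertex struct matches the xyz/texcoord/color attributes mentioned in the VBO section above, and the names are illustrative:

    #include <OpenGLES/ES1/gl.h>

    // Interleaved layout: one vertex's position, texcoord and color sit
    // side by side, so touching a vertex pulls all of its attributes into
    // the same cache line, instead of thrashing between separate arrays.
    typedef struct
    {
        GLfloat x, y, z;      // position
        GLfloat u, v;         // texture coordinate
        GLubyte r, g, b, a;   // precached lighting / color
    } Vertex;

    void SetVertexPointers( const Vertex* verts )
    {
        const GLsizei stride = sizeof( Vertex );
        glEnableClientState( GL_VERTEX_ARRAY );
        glEnableClientState( GL_TEXTURE_COORD_ARRAY );
        glEnableClientState( GL_COLOR_ARRAY );
        glVertexPointer( 3, GL_FLOAT, stride, &verts->x );
        glTexCoordPointer( 2, GL_FLOAT, stride, &verts->u );
        glColorPointer( 4, GL_UNSIGNED_BYTE, stride, &verts->r );
    }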

EDIT: I haven’t confirmed it in my engine yet but it seems strips do outperform lists by a small amount.

Read the Apple and PVR performance guidelines

In short: RTFM.  There’s a lot of good advice in those docs.  Surprisingly, when I last checked, Vertex Buffer Objects weren’t mentioned in the Apple guidelines – an odd thing to omit.

EDIT: In light of the VBO implementation, this isn’t an odd thing to omit at all. Also, the Apple recommendation to use lists instead of strips doesn’t seem to hold water, although it is fairly close. See the comments below from Patrick. The best advice is: Read the guidelines, but profile your code.

Use Instruments

Instruments is an excellent app that comes with Xcode and allows you to profile your app running on-device.  You can examine all kinds of details about how your app executes and zero in on bottlenecks.  With respect to OpenGL, you can use it to see which calls are issued most often and/or take the most time.  It is an invaluable tool, and should be your guide for any optimizations.

Conclusion

Every app is different, and even the same engine can perform wildly differently with different content, so when you hit a bottleneck you really need to profile to find out where your app plus content is spending most of its time, and focus your optimizations there.  That said, many of the above points apply globally, and your app will almost certainly avoid performance pitfalls if you are careful to adhere to them.  The iPhone has some serious horsepower under its hood. Initially I was skeptical about just how powerful it was, but after fine-tuning my engine I’m extremely pleased with it.  Compared to other mobile 3D devices, the iPhone and iPod Touch are in a class of their own.


16 Responses to iPhone OpenGL ES Performance

  1. Patrick says:

    Great tips Kevin, thanks! You mention optimizing for cache. Do you happen to know what the cache size is on the iPhone? I can’t seem to find any info anywhere on that. I’m using NVTriStrip right now and I think it defaults to 10. I’m a little confused by what you mean by sorting vertices in sorted-tri-order. Do you mean in the order the list of vertex-cache-sorted tris index them? One of the Apple docs seems to suggest using sorted triangle lists instead of triangle strips. What are your thoughts/experiences on this?

  2. Kevin Doolan says:

    @Patrick

    Good question. The honest answer is I don’t know what the exact number is. I’m planning on doing some proper tests to see if I can find a sweet spot. I’ll post whatever results I find.

    You can also think in terms of typical cache lines (this idea is actually primarily what I’m working off, based on what’s suggested in the PVR docs I’ve read). With cache lines, when you access memory location X, all of X’s immediate neighbours get read into the cache line along with X. This is the idea behind interleaving vertex attributes instead of having separate arrays for each attribute. With separate arrays you don’t make good use of cache lines because you may overwrite a cache line with each attribute you read, meaning your cache could be in a perpetual state of thrash!

    Re sorted-tri-order:

    Exactly as you said. Once the tris are sorted you go through each one and add its vertices to a new vertex list, marking each vertex as added and skipping any that have already been added.
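
    In C (the Vertex type and names are just for illustration), that pass might look like:

        #include <stdlib.h>

        // Walk the cache-sorted index list, emit each vertex the first time it
        // appears, and remap the indices as we go.
        void ReorderVertices( const Vertex* in, Vertex* out,
                              unsigned short* indices, int indexCount, int vertCount )
        {
            int* remap = (int*)malloc( vertCount * sizeof( int ) );
            for( int i = 0; i < vertCount; ++i ) remap[i] = -1;  // not yet added

            int next = 0;
            for( int i = 0; i < indexCount; ++i )
            {
                int v = indices[i];
                if( remap[v] < 0 )            // first appearance of this vertex
                {
                    remap[v] = next;
                    out[next++] = in[v];      // append in encounter order
                }
                indices[i] = (unsigned short)remap[v];
            }
            free( remap );
        }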

    Re strips vs lists:

    The idea behind strips is a compact topology: the connectivity info isn’t explicitly specified – it’s baked into the vertex order. You simply look at the order the vertices are specified in and you can extract the triangles. This means you only have to issue a single call to glDrawArrays – no explicit index data necessary.

    It looks like the strip method is a winner until you see real-world cases. The problem is that general meshes (for example a mesh that has been decimated and transformed in various ways, and isn’t a regular grid) won’t yield long strips. You end up with very short strips that have to be joined together by inserting dead tris (actually dead vertices) that don’t rasterize, but do eat transform bandwidth; otherwise you’ll be issuing huge numbers of API calls to render each small strip individually.

    The question now is this: is the overhead of those dead tris greater than the overhead of just specifying triangles in a list? In general I think the answer is yes, for a few reasons. Extra vertices eat bandwidth and processing time needlessly. Any caches involved are also affected, because you’re dragging these dead vertices along for the ride. And vertices take a lot more space than triangle indices, especially if you’re using uint16 for indices.

    You never get the ideal situation with strips. You would have to profile on a case by case basis, but I prefer a simpler, general approach.

    So we’re back to sorted triangle lists, with no dead vertices needed to stitch strips together.

  3. Patrick says:

    I’ll look forward to seeing your conclusions on the cache question. I’ll likely run some tests myself when I have my engine in a state where I can switch between options easily.

    Hmm. Maybe I’m not looking at this right re: strips vs. lists, but as I understand it there are no additional vertices involved in stitching strips together to form one giant strip – only extra indices. So there’s no extra T&L overhead, and everything I’ve read suggests that eliminating degenerate triangles by index is an extremely efficient hardware operation, so no processing of their vertices is even attempted. Even a very fragmented mesh would probably have fewer indices as strips than as lists. Now, I’m sure some meshes would be contrary, but in general strips seem to have 30-50% of the indices that lists do. Of course there are cache considerations, but again I’d imagine by the very nature of strips it’s pretty cache friendly, since more often than not the first 2 indices (and thus vertices) of the current triangle are the last 2 of the previous triangle. Although I could also imagine some cache unfriendly situations.

    Please correct me if I’m just being obstinately dense. :)

  4. Kevin Doolan says:

    Apologies (I’m the one being dense!), I think I see what you mean now. The two ways of using strips I’m familiar with are:

    1) The traditional, explicit method which uses glDrawArrays. You set up your various vertex arrays in strip order, complete with any necessary stitching, and then you call glDrawArrays using GL_TRIANGLE_STRIP. The overhead here for stitching is any additional vertices that have to be added to the vertex arrays to account for degenerate tris. When the Apple recommendation refers to strips vs triangle lists, I understood strips in this context to mean glDrawArrays strips.

    2) The implicit method which uses triangle lists: you set up your various vertex arrays without any stitching required and you call glDrawElements passing in GL_TRIANGLES, together with the index array which specifies all the triangles. Within the index array you order tris in strip order, but you don’t need any stitching. You leave it to the cache/driver to leverage the fact that you have multiple strips in there. This is what I understand to be the ‘triangle list’ method mentioned in the Apple recommendation.

    From what you’ve said you use a variant of method (2). You set the usual vertex arrays (no degenerates) and issue a call to glDrawElements using GL_TRIANGLE_STRIP passing in the indices which do have degenerates. Is that right? That sounds cool – I completely overlooked it. Have you seen a performance gain over lists using this method?

    I had put this issue to bed long ago and resigned myself to GL_TRIANGLES – now I’m not so sure. There is a possibility that the driver is optimised for triangle lists rather than stripified indices, but it’s definitely worth a stress test…
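
    For concreteness, a minimal sketch of that variant with illustrative index values: two strips stitched by repeating the boundary indices, drawn with one glDrawElements call:

        #include <OpenGLES/ES1/gl.h>

        // Strip 1 covers vertices 0-3, strip 2 covers vertices 4-7. Repeating
        // index 3 and index 4 creates degenerate (zero-area) triangles that the
        // hardware rejects cheaply, so both strips go out in a single call.
        static const GLushort kStitched[] =
        {
            0, 1, 2, 3,   // strip 1
            3, 4,         // degenerate stitch
            4, 5, 6, 7    // strip 2
        };

        void DrawStitchedStrips(void)
        {
            glDrawElements( GL_TRIANGLE_STRIP,
                            sizeof( kStitched ) / sizeof( kStitched[0] ),
                            GL_UNSIGNED_SHORT, kStitched );
        }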

  5. Patrick says:

    You made me go look it up since I can never get glDrawArrays and glDrawElements straight. :)

    Yes, you’re correct. I am using glDrawElements as you describe. But of course that means passing in a list of indices which takes up some bandwidth (but might be worth it anyway).

    Then there’s the option of using the above completely in VBOs which would eliminate even that overhead for static meshes. Additionally, since the iPhone uses shared memory for graphics memory, I suspect the GL_OES_mapbuffer extension can be used to map the VBOs onto client memory allowing one to update vertex/tex/normal information in-place for animation without much performance penalty.

    Unfortunately I have not yet had the time to test the various options with benchmarks. But it’s worth doing a pretty comprehensive suite to put this little device through its paces. When I have some numbers I’ll pass them on.

  6. Kevin Doolan says:

    Excellent. Agreed on OES_mapbuffer – if you avoid mapping the buffer while gl is reading from it, given the shared memory model you should be able to enjoy the same benefits as static VBO geometry.
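
    A minimal sketch of that path, assuming a VBO created with GL_DYNAMIC_DRAW and that GL has finished reading it for the previous frame (otherwise double-buffer the VBOs); the callback is illustrative:

        #include <OpenGLES/ES1/gl.h>
        #include <OpenGLES/ES1/glext.h>

        // Update animated vertices in-place through a mapped buffer, avoiding
        // a separate client-side copy on the shared-memory architecture.
        void UpdateAnimatedMesh( GLuint vbo, void (*writeVerts)(void* dst) )
        {
            glBindBuffer( GL_ARRAY_BUFFER, vbo );
            void* dst = glMapBufferOES( GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES );
            if( dst )
            {
                writeVerts( dst );                    // fill this frame's vertices
                glUnmapBufferOES( GL_ARRAY_BUFFER );
            }
        }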

  7. Patrick says:

    Well, I finally ran some numbers and found a rather consistent benefit to strips. It seems I gain around a 5-6% speed increase for strips over lists. This is with a mesh rendered with vertices, normals and one set of texture coordinates.

    I was rather surprised by something else though. VBOs offered no noticeable speed increase. At first I was puzzled, but then I found a comment on one of the iPhone Developer forums stating that the iPhone VBO implementation is terrible and offers no benefit whatsoever. Since it uses shared memory anyway, mapping on to client memory is pretty much equivalent. Other comments confirmed that strips trump lists.

    Also, I found out that the iPhone has:
    16K/16K data/instruction cache
    32 bytes cache-line size
    no post T&L cache
    no GPU index buffer support

    See also:
    https://devforums.apple.com/thread/10200?tstart=0
    https://devforums.apple.com/thread/9781?tstart=15

  8. Kevin Doolan says:

    Lists are out, strips are in!

    Re VBOs: Wow, that’s not at all what I was expecting. So that’s why they’re not mentioned in the Apple recommendations! I need to go back and find out why my code runs faster with VBOs than without. The non-VBO pipeline does more work on the CPU side, but I wouldn’t have expected that to account for the difference. I attributed it to GL having to copy the vertices over, which it shouldn’t have to do in the VBO case. From reading the comments on the Apple forum it looks like the driver is doing a copy even with VBOs.

    No vertex cache – that adds up I guess. The docs all seem oriented around cache lines (although they don’t say it), so all their suggestions apply to the CPU side. Cache order now becomes important for the redundant copy operation the driver performs.

    Excellent work Patrick – I’ve got some homework to do on my pipeline – now I just need some time to squeeze it in. I’ll make some updates to the blog article based on what you’ve found. Thanks for passing this on – much appreciated.

  9. Kevin Doolan says:

    I went back and did a quick test with my VBO code and confirmed your VBO findings – no advantage. The reason I saw a difference was because of the code in my shader pipeline to minimize calls to glDrawElements (for non-VBO geometry). If I disable that code, and just issue glDrawElements calls without trying to reduce them, it runs at exactly the same speed as the VBO case. There seems to be no advantage in trying to reduce draw calls.

  10. csuzv says:

    Hi Kevin, nice article, very useful. Regarding running OpenGL ES in a different thread, I was wondering about using the following pattern. Run the simulation in thread A. Run the renderer in thread B. The sim "publishes" a bunch of information into a buffer for the renderer, just the minimal data that the renderer needs to do its stuff – objects positions etc. So my question would be, would thread A be able to run during the period where thread B is waiting for OpenGL to return? In particular, if you are not using such functions as glReadPixels, but are doing some more vanilla rendering? Just wondering about your thoughts on this. (And Patrick too, and anyone else who may offer any experience.)

  11. Kevin Doolan says:

    I haven’t experimented with this, but in principle it sounds like it may offer a speed-up, although on a single-core device I’m not sure how big the win would be, if any. It will depend on how much processing you’re doing, I imagine. Also, if thread A is publishing information to thread B, you can set it up so that no synch is required by using a double buffer – while A feeds into B.buffer1, B is reading from B.buffer0, and so on.

    I have set my engine up similarly – I have a GraphicsPipeline object which is a buffered pipeline like your thread B. Right now it executes in the sim thread, so I don’t split it off – it’s single buffered and single threaded. The idea is that if a dual-core iPhone shows up (or I port the engine somewhere with multiple cores) I can recompile with a double buffer and split it off into its own renderer thread, where it can execute on one core while the sim executes on the other.

    Best bet is to experiment with it and see how it works out for your specific app.
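
    A minimal sketch of that double buffer, with illustrative types; the single flip at the frame boundary is the only synchronization point, and on a real device you would want to guard it (e.g. flip only while the renderer is blocked in its buffer swap):

        // The sim fills frames[write] while the renderer reads frames[read];
        // the indices swap once per frame at one agreed point.
        typedef struct
        {
            float positions[256][3];   // whatever the renderer needs
            int   objectCount;
        } FrameData;

        static FrameData    frames[2];
        static volatile int readIndex = 0;  // renderer reads frames[readIndex]

        // Sim thread: build the next frame into the buffer the renderer
        // isn't using, then flip.
        void SimPublishFrame(void)
        {
            FrameData* dst = &frames[readIndex ^ 1];
            // ... fill dst with this tick's positions etc ...
            readIndex ^= 1;   // flip at the frame boundary
        }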

  12. csuzv says:

    Yes agreed, with two hardware threads the advantages are clear, with a single thread it may not be so apparent. I am seeing _semwait_signal_nocancel showing up at around 3.5% in my Instruments CPU sampler coming from EAGLContext presentRenderBuffer, so perhaps this time could be used. Will try it out soon.

  13. bryan says:

    Nice article Kevin. I’m having some problems setting up a threaded rendering loop in my application. Do you have any good tips, know a good resource or have an example I can take a look at? Trying to bump up my framerate a bit, and any help would be appreciated.

  14. Kevin Doolan says:

    I replied to Bryan’s question above privately via email (I’m up to my eyes right now and didn’t have time to log into my blog etc). Anyway, here is my reply since others have also asked for more info:

    The code for the threaded approach is almost identical to the timer approach. Here is what my code looks like:

    // Timer-based update method
    - (void)update
    {
        // tick + render
        UtopiaAppWrapper::Update();
        [_glView swapBuffers];
    }

    // Thread-based update method
    - (void)update_threaded:(UtopiaGLAppDelegate *)sender
    {
        NSAutoreleasePool* pool = [NSAutoreleasePool new];
        [NSThread setThreadPriority:1.0];
        [_glView setCurrentContext];
        while( !g_end ) // UtopiaGL system sets this when it wants to exit
        {
            // tick + render
            UtopiaAppWrapper::Update();
            [_glView swapBuffers];
        }
        [pool release];
        [NSThread exit];
    }

    In applicationDidFinishLaunching I do this…

    - (void)applicationDidFinishLaunching:(UIApplication *)application
    {
        // Various init stuff here…

        // init app
        if( UtopiaAppWrapper::Init( rect.size.width, rect.size.height ) )
        {
            if( !g_threaded )
            {
                // create rendering timer
                [NSTimer scheduledTimerWithTimeInterval:(1.0 / kFPS) target:self selector:@selector( update ) userInfo:nil repeats:YES];
            }
            else
            {
                // threaded update
                _thread = [[NSThread alloc] initWithTarget:self selector:@selector( update_threaded: ) object:nil];
                [_thread start];
            }
        }
        else
        {
            // Error handling…
        }
    }

    So by setting a global variable (g_threaded) it can run off either a timer or a thread using the same core code, i.e. UtopiaAppWrapper::Update().

    There is an issue with synchronizing incoming messages, but I worked around it by making a queue that gets eaten by the engine thread and fed by the main app thread – very low tech, and it works well.
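
    The queue can be very low tech indeed. A minimal pthread sketch, with illustrative types and sizes: the main UI thread pushes touch events, and the engine thread drains them once per frame:

        #include <pthread.h>

        typedef struct { int type; float x, y; } Event;

        #define QUEUE_SIZE 64
        static Event           s_queue[QUEUE_SIZE];
        static int             s_head = 0, s_tail = 0;
        static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

        // Main thread: called from touchesBegan/Moved/Ended.
        void QueuePush( const Event* e )
        {
            pthread_mutex_lock( &s_lock );
            int next = (s_tail + 1) % QUEUE_SIZE;
            if( next != s_head ) { s_queue[s_tail] = *e; s_tail = next; } // drop if full
            pthread_mutex_unlock( &s_lock );
        }

        // Engine thread: call in a loop at the top of each frame until it returns 0.
        int QueuePop( Event* e )
        {
            int got = 0;
            pthread_mutex_lock( &s_lock );
            if( s_head != s_tail ) { *e = s_queue[s_head]; s_head = (s_head + 1) % QUEUE_SIZE; got = 1; }
            pthread_mutex_unlock( &s_lock );
            return got;
        }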

  15. Rich says:

    Hi Kevin.

    I was googling around for info on vertex buffer cache sizes and happened upon your site here and got to reading. :) I had actually attempted in my own iPhone engine a while back to adapt the methodology from my PC engine of threading out all game code and only stalling for the game thread when translating data from “server” to “client” (this is a handy model for me, since my game code is already all run as a server, with the engine supporting full server-client multiplayer). But I found that this did not help performance on the iPhone at all, because the main stall was actually outside of my entire game/render loop.

    So this led me to do what you’ve suggested in the comments here, and thread out the whole freakin’ renderer. I was a bit frightened at this prospect initially, but it seemed to work great, and similar to your results, I noticed a few more frames per second.

    The huge downside that made me not go with the method was that I noticed touches appeared quite sporadic, and that the main thread was not getting the messages through quite as smoothly as before. This, I figured, was due to my new rendering thread choking the whole process.

    It seems like any measure of lowering thread priority or sleeping the thread would either not provide adequate performance improvements, or just end up with jagged touch messages still. I used a mutex between my game/render thread checking of touches from a message queue and the main thread putting messages into the queue, and it typically worked out so that touches were delivered during the render and read immediately on the loop around, so I can’t imagine any kind of queue/latency issue was responsible. Seemed to genuinely be a starvation problem.

    So I’m curious, did you not run into this problem in your implementation? Or did you find a magic solution?

  16. Kevin Doolan says:

    Hi Rich – I too found the same artifacts with threading – namely the slight stutter with touch events. It was eased slightly by introducing a small sleep per frame, but as the projects I dealt with grew in complexity the advantages of threading decreased, and I ended up falling back to a timer method. Right now I use either a timer or CADisplayLink, depending on the OS version. This works out better, especially for very processor-intensive apps. It may be possible to work around the stutter, but I didn’t find a way quickly, and moved on.
