
Instanced Rendering

There has been a lot of interest in AMD's new Mantle API recently. One of the major benefits of Mantle is its ability to submit an order of magnitude more draw batches to the GPU than traditional APIs like OpenGL or DirectX, thanks to reduced driver overhead. OpenGL drivers have to do a significant amount of validation work whenever state changes, which puts the driver at risk of becoming CPU bound. As a consequence, fast GPUs may easily run idle, waiting for the CPU to send over rendering tasks.

However, OpenGL offers a pile of extensions to deal with this problem as well. As an example, a neat extension called GL_ARB_draw_instanced is even part of the core specification from version 3.1 on. Instanced rendering provides a way to render many copies or similar versions of a primitive with a single draw call. As opposed to rendering the same object multiple times by looping over glDrawArrays(), the call glDrawArraysInstanced() just takes the loop count as an additional parameter and - with the help of a shader program - can often achieve the same effect as an application loop with just a single call to OpenGL.

Instanced rendering is just one brick in the wall of possible overhead reductions. If you aim to reduce overhead, you may also want to provide as much state as possible in advance (e.g. using uniform buffers or texture arrays) before you start to submit draws.

Using Instanced Rendering

The host side

The OpenGL API provides two new functions for instanced rendering, which extend glDrawArrays() and glDrawElements() with an instance count parameter: glDrawArraysInstanced() and glDrawElementsInstanced() work exactly the same way as their non-instanced counterparts, except that they send n copies of the primitive down the pipeline.

This is what the OnRender() method of tinySG's csgIndexedTriset node looks like:

      void csgIndexedTriset::OnRender (csgRenderAction *A)
      {
        A->ApplyDeferredState ();

        // Bind VBOs (or set vertex array pointers if VBOs are disabled)
        if ( !BindBuffers(A->GetPipeID()) )
          return;

        if (m_numInstances <=1 ) {
            glDrawArrays(GL_TRIANGLES,                      // prim type
                         0,                                 // start index in VBO
                         m_indices[CSG_VA_VERTEX].size());  // numVerts
        } else {
        
            glDrawArraysInstanced (GL_TRIANGLES,            // prim type
                         0,                                 // start index in VBO
                         m_indices[CSG_VA_VERTEX].size(),   // numVerts
                         m_numInstances);                   // Pass instance count
        }
        UnbindBuffers (A->GetPipeID());
      }
    
Note the else-branch: Compared to legacy glDrawArrays(), the changes are minimal. If the node has an instance count larger than 1, the triangles of the triangle set are passed to glDrawArraysInstanced(). Even the if-else could be omitted when an instance count of one is used for regular draws.

Up until now, you may still wonder what is good about rendering the very same triangles 100 times, without different transformations or materials. Well - obviously nothing. To make glDrawArraysInstanced() useful, you need a shader that treats each instance differently.

The shader side

The GLSL vertex shader has access to a new, built-in variable called gl_InstanceID. Starting at zero, it is incremented with every instance. The nice thing is that it is up to the shader to do anything based on the instance ID:
  • The shader may set different transformations based on the instanceID, causing a plant to be rendered at different positions (a simple example is shown below).
  • The instanceID may select a texture from a texture array or serve as an offset into an array/atlas texture.
  • The pixel color may be calculated based on the instanceID (the fur images have used this to control transparency and fake shadowing).
The following GLSL vertex shader is an example for rendering a forest composed of 10000 trees, using the instanceID to move each tree to a different position. It is intentionally kept simple - no lighting, no random/noise - just a regular grid of trees, all utilising the same texture:

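   // ** Vertex Shader **
   uniform float dist = 1.5;   // grid spacing between neighbouring trees

   void main ()
   {
      // break instanceID into 2D raster coords (100 x 100 grid):
      float x = float(gl_InstanceID / 100);
      float z = float(gl_InstanceID % 100);

      vec4 vertex = gl_Vertex;
      // y was undefined in the original listing; the trees stay on the ground plane
      vertex.xyz += vec3 (x, 0.0, z) * dist;
      gl_Position = gl_ModelViewProjectionMatrix * vertex;
      gl_TexCoord[0] = gl_MultiTexCoord0;
      vec3 normal = gl_NormalMatrix * gl_Normal;   // unused here (no lighting)
   }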
	 

Examples

For the sake of simplicity, the forest looks a bit regular. Nevertheless, it is easy to add some randomness to the positions using a noise function. The fragment shader could also index different layers of an array texture to create different trees.

If you are still sceptical about the benefits of instanced rendering, I'd suggest leaning back and thinking of more use cases. The images below benefit from instanced rendering, although it may not be obvious. Click on the images to see larger versions.
Fur rendering:
The teddy bear is composed of 16 textured shells, like a Russian matryoshka. The texture is opaque and coloured at hair positions and transparent in between. Layers are blended together. Click to enlarge.
Minecraft-like landscape:
The landscape has been created from a tinyTerrain-generated mesh, using the voxeliser plugin. It is composed of over 500,000 cubes, rendering at interactive framerates.
The entire scene geometry is composed of a single instanced cube (12 triangles). The shader fetches color and position information from a shader storage buffer and applies some random variance.

Particle simulation:
Screenshot created with the upcoming tinyParticles system, running an n-body gravity simulation.
The physics are calculated using an OpenGL compute shader. Visualisation is done directly on the particle buffers, rendering a point and a line strip with an instance count set to the number of particles.

Particles may merge when they collide, joining their masses. Thus, instanced rendering is useful to avoid updating any geometry (just update the instance count).
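The bookkeeping behind "just update the instance count" could look like the sketch below. Please read it as an illustration only: tinyParticles does its physics in a compute shader, while this is plain CPU-side C++, and the Particle layout, mergeParticles() and the momentum-conserving merge rule are assumptions that do not come from tinySG.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical particle record; the real buffer layout is not shown in the article.
struct Particle {
    float pos[3];
    float vel[3];
    float mass;
};

// Merge particle b into particle a (masses add up, position and velocity are
// mass-weighted), then move b behind the live range by swapping it with the
// last live particle. Returns the new live count, which would be passed as
// the instance count of the next draw. Requires a != b, both < liveCount.
inline std::size_t mergeParticles(std::vector<Particle>& p, std::size_t liveCount,
                                  std::size_t a, std::size_t b) {
    const float m = p[a].mass + p[b].mass;
    for (int k = 0; k < 3; ++k) {
        p[a].pos[k] = (p[a].pos[k] * p[a].mass + p[b].pos[k] * p[b].mass) / m;
        p[a].vel[k] = (p[a].vel[k] * p[a].mass + p[b].vel[k] * p[b].mass) / m;
    }
    p[a].mass = m;
    std::swap(p[b], p[liveCount - 1]);   // keep live particles contiguous
    return liveCount - 1;
}
```

The geometry (one point, one line strip) never changes; only the number of live records at the front of the buffer does.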

Velocity field:
The vector field is just one line of geometry, rendered as 4096 instances. The instanceID serves as an index into a 3D texture keeping all length/azimuth/elevation data for direction vectors.
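As a rough sketch of the index arithmetic involved, the C++ below maps an instance ID to a 3D texel coordinate and turns a length/azimuth/elevation triple into a direction vector. The 16 x 16 x 16 layout (which happens to give 4096 texels), the angle convention (radians, y up, azimuth measured in the x/z plane) and the names texelFromInstance() and directionFromPolar() are assumptions for illustration; the article does not spell them out.

```cpp
#include <cmath>

struct Vec3   { float x, y, z; };
struct Texel3 { int i, j, k; };

// Linear instance ID -> 3D texel coordinate in an n * n * n texture.
inline Texel3 texelFromInstance(int instanceID, int n) {
    return { instanceID % n, (instanceID / n) % n, instanceID / (n * n) };
}

// length / azimuth / elevation (radians) -> Cartesian vector, y pointing up.
inline Vec3 directionFromPolar(float length, float azimuth, float elevation) {
    const float c = std::cos(elevation);
    return { length * c * std::cos(azimuth),
             length * std::sin(elevation),
             length * c * std::sin(azimuth) };
}
```

In a vertex shader the texel coordinate would both position the line instance in the grid and address the texture fetch, so no per-instance vertex data is needed.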
An even better example would be to animate a bunch of sphere particles flying through the vector field. A single, instanced sphere provides hundreds of particles, while a texture serves as the position buffer and is updated per frame on the host side.

Additional ideas on tinySG's future project list include

  • Cloud volumetric rendering: There are algorithms based on shader magic applied to a bunch of overlapping ellipsoids to produce pretty realistic clouds.
  • Graphtals: Plant growth based on grammars. The building blocks of each plant are self-similar and could be well suited for instances.
  • City details: Render thousands of street lights, buildings, moving cars, trees, etc. in a SimCity-like environment.
  • Minecraft: Sceneries like in Minecraft seem to be a perfect match for instanced rendering.
  • Crowd animations: Have warrior creatures that use the same mesh, but read individual bone animations from a set of textures or buffer objects using the instanceID as an index.

Pitfalls

Be careful when rendering instances of individual polygons or sub-objects: The instances of one primitive are rendered sequentially before proceeding to the next primitive. This influences both blending and Z-buffer contents.

An example of what can happen if individual polygons are rendered is shown below. The wire frame torus on the left shows the mesh structure as well as the layering. If you look real closely, you can also see color gradients that imitate self-shadowing (inner shells have darker colors).
The middle image is rendered with quad strips along the ring, looping over 18 strips to close the torus. All instances of one strip are rendered before proceeding to the next strip. This causes writes to the Z-buffer, introducing artifacts on inner strips rendered after adjacent outer strips.

The artifacts are easily avoided by rendering an entire torus shell with just one primitive (i.e. all triangles in one shot using glDrawArraysInstanced with GL_TRIANGLES as shown in the code above). The rightmost image shows this - click on the images to see larger versions.
Left: Torus rendered with GL_LINES and glDrawArraysInstanced. Both triangulation and shell layering are visible.
Middle: Artifacts caused by stacking individual strips instead of entire shells. Image rendered with a csgQuadmesh node.
Right: Each skin is rendered all at once in a single instance using a csgTriset node. Instances create shells from inside out.
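The difference in submission order can be made visible without any OpenGL at all. The following C++ sketch just records the order in which (strip, shell) pairs would be emitted in both setups; Emit, perStripOrder(), perShellOrder() and shellsInsideOut() are hypothetical helpers, and the model assumes that the instances of one draw call are processed in ascending instance order.

```cpp
#include <cstddef>
#include <vector>

struct Emit { int strip; int shell; };

// One instanced draw call per strip: all shells (instances) of a strip
// are emitted before the next strip starts.
inline std::vector<Emit> perStripOrder(int strips, int shells) {
    std::vector<Emit> out;
    for (int s = 0; s < strips; ++s)        // application loop, one draw call each
        for (int i = 0; i < shells; ++i)    // instances inside that draw call
            out.push_back({s, i});
    return out;
}

// One instanced draw call for the whole shell: every instance emits
// all strips before the next (outer) shell starts.
inline std::vector<Emit> perShellOrder(int strips, int shells) {
    std::vector<Emit> out;
    for (int i = 0; i < shells; ++i)        // instances of the single draw call
        for (int s = 0; s < strips; ++s)    // all geometry of one shell
            out.push_back({s, i});
    return out;
}

// True if no inner shell is emitted after an outer one.
inline bool shellsInsideOut(const std::vector<Emit>& order) {
    for (std::size_t n = 1; n < order.size(); ++n)
        if (order[n].shell < order[n - 1].shell)
            return false;
    return true;
}
```

With 18 strips and 16 shells, only perShellOrder() keeps the shells strictly inside-out; perStripOrder() starts again at the innermost shell for every strip, which is where the Z-buffer artifacts in the middle image come from.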

Scenegraph operations such as picking may be an issue with instances as well, because they usually work on a scene database. If geometry is created on the fly by a shader, operations on the scenegraph naturally work for the first instance only.
tinySG does not handle these cases yet. One strategy to deal with this problem is to use transform feedback buffers to get access to all geometry created by the shader stages. This also takes care of regular shader-based geometry manipulations.

Performance and conclusions

Finally, let's have a look at performance. The introduction states that instanced rendering is all about reducing the number of draw batches submitted to OpenGL. So, what happens if we draw the forest with a loop? tinySG offers a LoopGroup node, which traverses its children n times. Using two nested LoopGroups, Transform nodes can achieve the same offset structure as the shader example above.
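To make the comparison concrete, here is a small C++ sketch that produces the tree offsets both ways and counts the draw calls each variant would need. It does not use tinySG's LoopGroup or Transform nodes; GridOffset, loopOffsets() and instancedOffsets() are stand-ins invented for this example, and the counter merely represents calls to glDrawArrays() or glDrawArraysInstanced().

```cpp
#include <cstddef>
#include <vector>

struct GridOffset { float x, z; };

// Two nested application loops: one draw call per tree.
inline std::vector<GridOffset> loopOffsets(int rows, int cols, float dist, int& drawCalls) {
    std::vector<GridOffset> out;
    drawCalls = 0;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            out.push_back({ static_cast<float>(i) * dist, static_cast<float>(j) * dist });
            ++drawCalls;                    // stands in for one glDrawArrays()
        }
    }
    return out;
}

// Instanced variant: offsets are derived from the instance ID, as in the
// vertex shader, and the whole forest needs a single draw call.
inline std::vector<GridOffset> instancedOffsets(int rows, int cols, float dist, int& drawCalls) {
    std::vector<GridOffset> out;
    drawCalls = 1;                          // stands in for one glDrawArraysInstanced()
    for (int id = 0; id < rows * cols; ++id) {
        out.push_back({ static_cast<float>(id / cols) * dist,
                        static_cast<float>(id % cols) * dist });
    }
    return out;
}
```

For the 100 x 100 forest both functions yield identical offsets, but the loop version submits 10000 draw calls where the instanced version submits one.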

If you prefer to work with indexed VBOs, no problem: GL_ARB_draw_instanced comes with glDrawElementsInstanced(), which allows the use of indexed vertex data. However, tinySG always "flattens" its indexed vertex attributes before downloading them into a VBO anyway and goes with glDrawArraysInstanced() instead, for reasons explained in the notes on index management.

Keep rendering,
Christian


Acknowledgements:

  • Instance rendering is specified in this ARB extension.
  • The GLSL functions used to create procedural noise were written by Stefan Gustavsson and posted on OpenGL.org.
  • The Moai dataset is available at grabcad for non-commercial use.
  • The teddy has been found at code.google.
  • The Tycho star catalogue, based on data from the Hipparcos satellite, is by courtesy of ESA.


Copyright by Christian Marten, 2014
Last change: 29.11.2014