Getting rid of multiple draw calls using modern OpenGL

Filed in GPU | OpenGL

Hi all. I haven't posted here for quite a long time due to loads of work, mostly on a new vector text engine for Idomoo's rendering engine (maybe I will shed some light on it later). This weekend finally gave me an opportunity to write something new. I will talk about how, using modern OpenGL, we can reduce the number of draw calls to a bare minimum.
If you haven't been living in a cave for the last couple of years you have probably heard of some of the new core extensions shipped with OpenGL 4.3. Specifically, indirect drawing commands like glMultiDrawArraysIndirect and glMultiDrawElementsIndirect bring quite revolutionary possibilities, which were outlined in great detail in the GDC 2014 talk by NVIDIA and AMD engineers and by one of the authors of "OpenGL SuperBible" 6th edition, Graham Sellers, here. So I won't go into the boring technical details of these new routines; instead I want to showcase a concrete real-life problem which I encountered and finally solved using indirect drawing.
So let's begin from a simple fact upon which we probably all agree: there is driver overhead when issuing many draw calls, and we, as graphics programmers, are in constant search for techniques that reduce this overhead to a minimum. Unfortunately, with the older API we didn't have much choice beyond baking similar geometries into a single vertex buffer as much as possible, which was a nice optimization but not sufficient when the scene had a great multitude of different materials and transformations. The newer OpenGL API (3.3) came with instanced rendering routines, which added the ability to "clone" draw calls on the GPU for the same geometry. That was pretty cool, because using the instance ID we could assign a different transform to each instance via UBOs. But those methods still had one important feature missing: the ability to use unique geometry for each instanced primitive inside the batch. Indirect drawing adds that missing link in the chain. The feature is hard to overestimate, as with a proper architecture design it basically allows us to submit the whole scene in a single draw call. Yes, it is not a joke; this is exactly what glMultiDrawArraysIndirect and glMultiDrawElementsIndirect are designed for.

In my case I needed something less ambitious. I had a case where, to draw a specific geometry, I needed to issue 2 draw calls. The reason was that the geometry came from 2 separate sources which composed the final triangle mesh. Well, you would probably ask, why not merge those into a single VBO? The problem was that the shader processing of those 2 geometries was different as well. Let's call the geometry data from source 1 DataA and the data from the second source DataB. So, to successfully render that type of mesh I needed to issue a first draw call using DataA with its unique fragment shader, and afterwards a second draw call with DataB using another unique fragment shader program. Typical instanced drawing wouldn't help here as both geometries are different. Moreover, even with both geometries batched into the same VBO, I still needed to figure out how to switch fragment shader functions when the second block (DataB) starts being processed by the pipeline.

Initially I started with atomic counters. The plan was to set an atomic counter to zero at the start of the draw, then increment it on each geometry shader invocation. Why the geometry shader? Because in some sources it is stated that counting vertices in the vertex shader isn't reliable because of the vertex cache. That is, if you have 100 vertices in the VBO, the atomic counter won't necessarily get incremented 100 times. (Please let me know if this is not correct.) In the geometry shader I would pass a uniform with the number of vertices of DataA. Comparing that number with the current value of the atomic counter should have given me (more on that in a moment) a way to know when the vertex stream of DataB began entering the pipeline. Something like this:


#version 430 core

layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;

layout(binding = 0, offset = 0) uniform atomic_uint ac;

in VertexData
{
    vec3 v_uvs;

} vertex_data[];

uniform int dataASize;

out vec3 v_uvs;
out flat float isDataB; // 0 - process dataA , 1 - process dataB

void main(){

    memoryBarrierAtomicCounter();

    uint counter = atomicCounter(ac);
    float switcher = 0.0;
    if( (counter * 3) > dataASize )
    {
        switcher = 1.0;
    }else{
        atomicCounterIncrement(ac);
    }
    isDataB = switcher;

    //// here just emitting the primitive
    //// with the data from the previous stage
}

As you can see, in this geometry shader I compare the current atomic counter value to the "dataASize" uniform, which equals the number of vertices in the DataA block. If the comparison evaluates to true, the "switcher" varying is set to 1 and sent to the fragment shader. In the fragment shader, if switcher equals 0 I call method A, and if it equals 1 I call method B. That was the plan, but it didn't really work. I am still not sure why, and I submitted a question about this problem on StackOverflow, but for some reason the atomic counter value being compared to dataASize never held the exact value (value == dataASize) at the moment the expression evaluated to true. In other words, when if( (counter * 3) > dataASize ) became true, counter*3 was already holding a number far bigger than dataASize. All this resulted in incorrect fragment processing. I thought it had something to do with atomic counter synchronization; as you can see, I even tried to use a memory barrier, but it didn't help. The chances are high that I have missed something in how the geometry stage gets invoked on the GPU compared to the fragment shader, and I will be glad if someone reading this could point out where I was wrong. But eventually I recalled reading in OpenGL SuperBible 6 about indirect drawing.

After closely examining how glMultiDrawArraysIndirect works I came to the conclusion that it is a perfect match to squeeze my special mesh rendering into a single draw call. The method accepts a buffer bound to GL_DRAW_INDIRECT_BUFFER, which is essentially a command buffer describing every primitive in the batch submitted to glMultiDrawArraysIndirect. The structure per primitive instance looks like this:


struct DrawArraysIndirectCommand
{
	GLuint count;        // number of vertices to draw
	GLuint primCount;    // number of instances to draw (more than one results in an instanced draw)
	GLuint first;        // from which vertex to start (offset)
	GLuint baseInstance; // which instance data to use
};

To submit unique geometry for each sub-draw of the indirect draw call we need to supply one such command per data block. In my case I had DataA and DataB residing in a single VBO with the following layout:

|-------- DataA vertex data --------|-------- DataB vertex data --------|

So the commands for such a VBO would be as follows:


	DrawArraysIndirectCommand Command[2];

	Command[0].count        = dataA_verticesCount;
	Command[0].primCount    = 1;
	Command[0].first        = 0;
	Command[0].baseInstance = 0;

	Command[1].count        = dataB_verticesCount;
	Command[1].primCount    = 1;
	Command[1].first        = dataA_verticesCount; // start at the end of dataA
	Command[1].baseInstance = 1;
The first part of the problem is solved – now the draw call "knows" what geometry data to assign to each sub-draw. What is left is to find a way to tell the fragment shader when DataB is being processed, so that it knows to switch functions. The solution is to create another VBO which contains only the indices of the batch primitives. So, for example, because I am going to draw 2 primitives (DataA and DataB) in this indirect draw call, this so called drawID buffer holds just {0,1}. But how do we cause that buffer to be advanced once per instance rather than once per vertex (which is the default)? glVertexAttribDivisor() comes to the rescue. Setting glVertexAttribDivisor(attribute_index, 1) means the next element in the buffer is fetched only when the current sub-draw is done being processed, which means that 0 is fetched while DataA is rendered and 1 is fetched while DataB is rendered (see the host-side sketch after the fragment shader below). This drawID I pass as a varying into the fragment shader, which looks something like this:


#version 430 core

in flat int dataID;

void main(){

    if(dataID == 0){
        // process fragments for dataA primitive
    }else{
        // process fragments for dataB primitive
    }
}
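
For completeness, here is a minimal host-side sketch of how the pieces above could be wired together. The buffer names, the attribute location and the helper function are my own assumptions for illustration, not code from the actual engine:


#include <GL/glew.h>

// Hypothetical host-side setup for the single indirect draw; "commands" is the
// DrawArraysIndirectCommand array filled earlier (names are illustrative only).
void SetupAndDrawIndirect(const DrawArraysIndirectCommand commands[2])
{
	const GLint DRAW_ID_ATTRIB_LOCATION = 5; // assumed free attribute slot

	// 1. Upload the two commands into the indirect command buffer.
	GLuint indirectBuffer = 0;
	glGenBuffers(1, &indirectBuffer);
	glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
	glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(DrawArraysIndirectCommand) * 2,
	             commands, GL_STATIC_DRAW);

	// 2. Upload the per-sub-draw IDs {0,1}. The divisor makes the attribute advance
	//    per instance, and the baseInstance of each command selects which element
	//    a given sub-draw sees (0 for dataA, 1 for dataB).
	const GLint drawIDs[2] = { 0, 1 };
	GLuint drawIDBuffer = 0;
	glGenBuffers(1, &drawIDBuffer);
	glBindBuffer(GL_ARRAY_BUFFER, drawIDBuffer);
	glBufferData(GL_ARRAY_BUFFER, sizeof(drawIDs), drawIDs, GL_STATIC_DRAW);
	glEnableVertexAttribArray(DRAW_ID_ATTRIB_LOCATION);
	glVertexAttribIPointer(DRAW_ID_ATTRIB_LOCATION, 1, GL_INT, 0, (void*)0);
	glVertexAttribDivisor(DRAW_ID_ATTRIB_LOCATION, 1);

	// 3. DataA and DataB live in the same VBO, so this single call draws both.
	glMultiDrawArraysIndirect(GL_TRIANGLES, nullptr, 2 /*drawcount*/, 0 /*tightly packed*/);
}

In the vertex shader that attribute would be declared as an integer input (e.g. layout(location = 5) in int drawID;) and forwarded to the fragment shader as the flat dataID varying shown above.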

As you can see, my fragment shader is now able to distinguish when it processes a DataA primitive and when it works on DataB fragments. I can't give out the actual source code for all of the above, mostly because it is not plain OpenGL API code but a heavily abstracted "in-house" API which would be hard to follow. Also, some of the code is commercial and cannot be shared. But those who need a fast kickoff with indirect draws should take a look at the demos in G-Truc's OpenGL Samples Pack (430), which has a nice yet simple demo of indirect drawing. In my case glMultiDrawArraysIndirect helped to solve a whole bunch of problems which I would have had in the different render passes of our engine had I not succeeded in getting rid of the double draw calls for that specific mesh type.


Fast JavaScript with Emscripten SDK

Filed in Performance

This time it will be something different. I must admit I am not a JavaScript guy at all. I used to program on the Flash platform for several years, and also coded some PHP and ASP.NET, but general web development, especially JavaScript, is not something that ever attracted me. But as we all know, during the past several years the HTML5 buzz and the decline of the Flash platform led many people to return to the realms of JavaScript. My major interest was and is 3D graphics, and therefore WebGL is definitely something I consider the web 3D platform of the near future. I started playing with the raw WebGL API and also with Mr.Doob's three.js. After writing 3D APIs and engines in C/C++ during the last 2 years this stuff is really enjoyable, but also somewhat irritating. What I find irritating is the JavaScript language. I have 2 major problems with it: 1) lack of a proper OOP model, 2) bad performance. So I started looking into different approaches to code JavaScript without writing JavaScript, so that I could have a C++/Java like OOP design plus better performance. I tried things like GWT, TypeScript and Haxe. All those platforms definitely solve the OOP problem but not really the performance one, as in fact they all generate plain JavaScript, sometimes with a lot of junk code.
Then I stumbled upon Emscripten and ASM.js. I won't dive into a detailed explanation of how it works, but basically it makes it possible to cross-compile a C/C++ codebase to highly optimized JavaScript (ASM.js). The pipeline roughly looks like this: the source code is compiled by the Clang compiler into LLVM bitcode, which is then cross-compiled (in most cases) to ASM.js by Emscripten. The SDK looks really impressive. There are pretty descriptive docs for the starter and an active Google group. There are ready-to-deploy installers which set up the whole toolchain (Python, LLVM, Clang etc.) automatically. It took me roughly 20 minutes to compile my first "Hello World". Emscripten has a good set of features. It allows, for example, bidirectional communication between C/C++ and JavaScript objects in a very easy way, and it supplies several different ways of doing it. I liked very much embind, which is modeled after the Boost::Python scripting API and allows really easy and flexible exposure of C/C++ data structures and methods to a JavaScript interface.
So I decided to write some tests to get an idea of how useful the SDK is and how much performance gain is possible. I decided to cross-compile the GLM math lib. I like this lib and use it in many projects. It is a header-only template library. Maybe it is not the fastest math library in the world (the lib author's words), but it is fast compared to many other high performance math libraries; I benchmarked it against the Eigen library and it outperformed it something like 6-8x on square matrix operations. Well, yeah, not really important for this story… So first I tried to get an idea of how big the overhead of frequent calls into Emscripten-generated JavaScript code is compared to normal JS calls. (Remember, Emscripten generates JavaScript which is ASM.js; read more about it on the Emscripten wiki pages.) I was curious about such a test as my first immediate idea was to port the GLM library to JavaScript. On the JavaScript side I took the THREE.js engine's vector library.

First I wrote a simple method in C++ which performs a 2-component vector addition and returns the result to the caller:


#include "glm.hpp"
#include <emscripten/bind.h>

using namespace emscripten;
using namespace glm;

vec2 glm_add(const vec2& a, const vec2& b){

	return a + b;
}

EMSCRIPTEN_BINDINGS(my_module) {

	value_array<vec2>("vec2")
        .element(&vec2::x)
        .element(&vec2::y)
        ;

    function("glm_add", &glm_add);
}

When it is compiled to JavaScript, the input arguments to the glm_add method are just arrays. The same goes for the return value.
Then on the JavaScript side I ran that method in a loop 1 million times:


	var startTime = new Date().getTime();
	var tv  = new THREE.Vector2( 0, 0 );
	var tv1 = new THREE.Vector2( 2, 2 );
	var tv2 = new THREE.Vector2( 3, 3 );
	var tv3 = new THREE.Vector2( 0, 0 );
	for(i = 0 ; i < 1000000; ++i){

	    tv3.add( Module.glm_add([2,2],[3,3]) );

	}
	document.write('Time:' + (new Date().getTime() - startTime) );

The average result was 470 ms.

Then I did the same but using the THREE.js API:


	var startTime = new Date().getTime();
	var tv  = new THREE.Vector2( 0, 0 );
	var tv1 = new THREE.Vector2( 2, 2 );
	var tv2 = new THREE.Vector2( 3, 3 );
	var tv3 = new THREE.Vector2( 0, 0 );
	for(i = 0 ; i < 1000000; ++i){

	    tv3.add( tv.addVectors( tv1, tv2 ) );

	}
	document.write('Time:' + (new Date().getTime() - startTime) );

The result was ~7 ms!!! What's going on here? Well, I recall, several years ago, doing something similar with the Adobe Alchemy SDK (later renamed to FLASCC), which allowed cross-compiling C/C++ code to ActionScript3. The problem there was the overhead of calls across the interface layer, and the same issue shows up here. Calling Module.glm_add([2,2],[3,3]) in a loop causes a huge overhead for the interpreter. I can't really explain exactly how it happens, as frankly I have no idea how the JavaScript engine calls ASM.js routines, but to prove I was right I decided to make another test. This time I moved the 1 million loop of vector additions to the C++ side of glm_add(). So the C++ code started looking like this:


vec2 glm_add(const vec2& a, const vec2& b, const int numRuns){

      vec2 c(0.0f);
      for(int i = 0; i < numRuns; ++i){
          c += a + b;
      }
      return c;
}

I parameterized the number of loop iterations (numRuns) to make sure the compiler won't optimize the loop away.

On the JavaScript side I call the method just once:


for(i = 0 ; i < 1; ++i){

    Module.glm_add([2,2],[3,3],1000000);

}

In this benchmark the Emscripten version's result was ~2 ms, which is more than 3 times faster than the THREE.js counterpart!

What can be concluded from these tests? The conclusion is simple: if you wish to squeeze the maximum from Emscripten and ASM.js, keep regular JavaScript -> ASM code interactions to a minimum. As I see it, regular JavaScript should probably only serve as a container for the whole app, with extremely minimal communication with the ASM modules. Performance critical sections must be executed completely on the Emscripten-generated code side with no interference from outside. That is, no external JavaScript routines should be responsible for frequently invoking performance critical ASM code blocks. For the test above, I think that if there is a need to compute a big number of vectors fast, they can all be packed into an array which is passed as a single argument into Module.glm_add(). This way we perform only a single function call, and the buffer data is marched through very quickly by the ASM module via fast consecutive access.
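
For example, a minimal sketch of such a batched entry point could look like this. The function and module names (glm_add_batch, batch_module) are hypothetical, not part of the benchmark code above:


#include <vector>
#include "glm.hpp"
#include <emscripten/bind.h>

using namespace emscripten;
using namespace glm;

// 'data' is assumed to hold interleaved vector pairs: ax, ay, bx, by, ...
// All additions happen inside a single call, so the JS <-> ASM.js boundary is crossed only once.
vec2 glm_add_batch(const std::vector<float>& data){

	vec2 acc(0.0f);
	for(size_t i = 0; i + 3 < data.size(); i += 4){
		acc += vec2(data[i], data[i + 1]) + vec2(data[i + 2], data[i + 3]);
	}
	return acc;
}

EMSCRIPTEN_BINDINGS(batch_module) {

	// vec2 is assumed to be registered as a value_array already (see my_module above)
	register_vector<float>("FloatVector"); // exposes std::vector<float> to JavaScript

	function("glm_add_batch", &glm_add_batch);
}

On the JavaScript side the vector data would be pushed into the exposed FloatVector once, and a single call to Module.glm_add_batch would replace the million small ones.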

By the way, if you think some of my assumptions here are wrong or the tests are inaccurate, I am open to your feedback.


OGLPlus tutorial: Deferred Renderer

Filed in 3D | GPU | OGLPlus | OpenGL

This time we will make something more challenging with OGLPlus as the OpenGL API. Well, not really challenging, but still not as simple as drawing a rectangle: we will build a simple deferred renderer.

First, for the newcomers, a brief explanation of what a deferred renderer actually is. Imagine you decide to render a scene with, let's say, 100 light sources using the typical (so called forward rendering) approach. Many years ago, when there was only the fixed pipeline, you would end up with just 8 lights at most - that was the limit of what the OpenGL API exposed. Then came the new API with shaders, and it became possible to render as many lights as the instruction count of your shader model supported and, of course, as much as your hardware was capable of processing at an acceptable frame rate. With SM4 and SM5 you can really get away with a huge number of lights; I can't say exactly how many, but there is enough room for 100 for sure. Even with this advance we are still stuck with the second problem - performance. With forward rendering you process the lights in a loop per object draw call. This way, if you have 100 primitives to draw with 100 lights located in the scene, you loop over the lights 100 times for each rendered object, probably in the fragment shader if you care about quality. It's an enormous overhead, and after just 20-30 lights you will start noticing the performance drop.

A deferred renderer solves this issue by breaking the render loop into 2 major passes. 1) The geometry is drawn into a custom framebuffer (often called a G-Buffer). In that pass geometry info like positions, normals, texture coordinates and tangents is stored into texture render targets, as this data will be used in the second pass. 2) The second pass, executed in screen space, uses the textures from the previous one as inputs. In this pass the geometry info of the whole scene is extracted from the textures and used by the lighting algorithms to shade the pixels. This way we compute the lights for the whole scene just once per frame, saving many precious GPU cycles. There are some drawbacks to deferred rendering, such as MSAA, transparency and more. Most of these are solvable with more sophisticated algorithms (light pre-pass rendering is one of those).

Here, just for the sake of a proof of concept, I used the Deferred Renderer demo from OpenGL SuperBible 6th edition as my reference. I picked it as it showcases a fresh and compact approach which is possible with the GL 4.2 API. For example, traditionally a G-Buffer would use at least 3 color attachments to meet the needs of all the geometry data, but GLSL 420 introduced numeric packing/unpacking functions which allow us to "squeeze" several numbers into one. This trick can save us additional texture attachments. Later you'll see how we pack position, color, normals and uvs.
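
Before diving into the OGLPlus code, here is how the two passes look at a conceptual level. This is only an illustrative sketch in raw OpenGL calls with made-up names (gBufferFbo, geometryPassProgram and so on), not the code of the demo itself:


#include <GL/glew.h>

// Illustrative two-pass deferred frame; the handles are assumed to have been
// created during setup. A sketch only, not the demo's actual code.
void RenderFrameSketch(GLuint gBufferFbo,
                       GLuint geometryPassProgram, GLuint lightingPassProgram,
                       GLuint gBufferTex0, GLuint gBufferTex1)
{
	// Pass 1: rasterize the scene geometry into the G-Buffer attachments.
	glBindFramebuffer(GL_DRAW_FRAMEBUFFER, gBufferFbo);
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
	glUseProgram(geometryPassProgram);
	// ... draw the scene meshes here; positions/normals/albedo end up in textures ...

	// Pass 2: shade once per screen pixel using the stored geometry data.
	glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
	glUseProgram(lightingPassProgram);
	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, gBufferTex0); // first G-Buffer attachment
	glActiveTexture(GL_TEXTURE1);
	glBindTexture(GL_TEXTURE_2D, gBufferTex1); // second G-Buffer attachment
	// ... draw a fullscreen quad; its fragment shader evaluates all lights once per pixel ...
}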

I am not going to explain it line by line; you can read it in greater detail in the book. I will comment on some critical parts only.

Application setup:

We begin with the application setup. There are a couple of dependencies you should take care of: OpenGL context creation and image loading. For the first one see my previous posts where I explain how to configure GLFW (I used it in this demo as well). Now for the second - image loading. OGLPlus contains methods for PNG image loading, but it expects the user to have libpng linked. So you should have libpng and zlib on your machine and configure them the same way you would the context creation library. To use this demo you must configure libpng as I use it for texture loading. And of course you will need GLEW as well.


      #include <iostream>        // std::cerr is used in the error handler below
      #include <GL/glew.h>
      #include "GL/glfw.h"
      #include <oglplus/all.hpp>
      #include "DeferredRenderer.h"
      using namespace oglplus;

  int main(int argc, char* argv[])
  {

	/// init window
	if(!glfwInit()){
		throw ;
	}
	glfwOpenWindowHint(GLFW_OPENGL_VERSION_MAJOR, 4);
	glfwOpenWindowHint(GLFW_OPENGL_VERSION_MINOR, 2);
	glfwOpenWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_COMPAT_PROFILE);
	glfwOpenWindowHint(GLFW_FSAA_SAMPLES,0);
	if(!glfwOpenWindow(764,468,8,8,8,8,24,8,GLFW_WINDOW)){

		throw;
	}

	glfwSetWindowTitle("Deferred Renderer" );
	glfwSetWindowPos(900, 300);
	glfwSwapInterval(1);

	/// init context
	if(glewInit() != GLEW_OK)
	{
		glGetError();

		return 0;
	}

	DeferredRenderer *deferredTest = new DeferredRenderer(764,468);

	Context gl;

	try{

		while(true){

			deferredTest->Render();

			if(glfwGetKey(GLFW_KEY_ESC)||false == glfwGetWindowParam(GLFW_OPENED)){

				break;

			}
			glfwSwapBuffers();

		}

	}catch(oglplus::Error& err)
	{
		std::cerr <<
			"Error (in " << err.GLSymbol() << ", " <<
			err.ClassName() << ": '" <<
			err.ObjectDescription() << "'): " <<
			err.what() <<
			" [" << err.File() << ":" << err.Line() << "] ";
		std::cerr << std::endl;
		err.Cleanup();
	}

	delete deferredTest;

	glfwTerminate();
	exit(EXIT_SUCCESS);
	return 0;
}

This is the main entry point for the application. All it does is create the OpenGL context, init GLEW and spawn the DeferredRenderer object, which is then called in the rendering loop.

Deferred Renderer:

DeferredRenderer class makes use of two utilities I wrote for the sake of convenience:

ShadersInline.h – contains all the shaders as strings.
OGLPlane.h – a class wrapping plane geometry with a simple interface for rendering and transformation. We will use it for drawing our scene geometry as well as for the full screen quad.

DeferredRenderer.h


#pragma once
#ifndef SAS_DEFERRED_RENDERER_H
#define SAS_DEFERRED_RENDERER_H

#include <oglplus/gl.hpp>
#include <oglplus/all.hpp>
#include <oglplus/bound/texture.hpp>
#include <oglplus/bound/framebuffer.hpp>

#include "OGLPlane.h"
namespace oglplus{
	class DeferredRenderer{

	public:

		DeferredRenderer(int width , int height);

		void Render();
		~DeferredRenderer(void);

	private:

	inline	float RandomFloat(float min, float max)
		{
			float r = (float)rand() / (float)RAND_MAX;
			return min + r * (max - min);
		}

		Context gl;
		AutoBind<Texture> _gfboTex0;
		AutoBind<Texture> _gfboTex1;
		AutoBind<Texture> _gfboTexDepth;
		AutoBind<Texture> _floorTex;

		AutoBind<Framebuffer> _gfbo2;

		////////Shapes   ////////////

		//Floor rendering plane :

	   OGLPlane *_floorPlane;
	   OGLPlane *_screenQuad;

		/////////  Buffers   ////////////
	   Buffer _lightUBO;

		////////// Shaders  ////////////////
		VertexShader   _geomPassVertShader;
		FragmentShader _geomPassFragShader;

		VertexShader   _resolvePassVertShader;
		FragmentShader _resolvePassFragShader;

		Program _geomProg,_resolveProg;

		////////////   Math  /////////////////

		LazyUniform<Mat4f> _projection_matrixUniform, _camera_matrixUniform;

		GLint _viewportW;
		GLint _viewportH;

#pragma pack (push, 1)
		struct light_t
		{
			Vec3f         position;
			unsigned int        : 32;       // pad0
			Vec3f         color;
			unsigned int        : 32;       // pad1
		};
#pragma pack (pop)

	};
}

#endif

The DeferredRenderer interface is pretty simple. We declare the framebuffer and its attachments, the vertex/fragment shaders and their respective programs. We also declare pointers to two OGLPlane objects which will be instantiated dynamically in the class body. At the bottom, the struct light_t is used to upload the data of multiple lights into a uniform buffer object (UBO). Note the padding: it's needed here because the buffer uses the std140 layout in GLSL, which enforces alignment rules for different data types (see OpenGL SuperBible 6 for more details).

DeferredRenderer.cpp

Now let's go step by step over DeferredRenderer.cpp. I will try to explain all the major parts.

First we initiate the constructor with defaults:


using namespace oglplus;
DeferredRenderer::DeferredRenderer(int width , int height)
	:_viewportW(width),_viewportH(height),

	_gfboTex0(Texture::Target::_2D,0),
	_gfboTex1(Texture::Target::_2D,0),
	_floorTex(Texture::Target::_2D,0),
	_gfboTexDepth(Texture::Target::_2D,0),
	_gfbo2(Framebuffer::Target::Draw),
	_projection_matrixUniform(_geomProg,"ProjectionMatrix"),
	_camera_matrixUniform(_geomProg,"CameraMatrix")

{ ...

The constructor accepts the viewport width and height as params and initiates the constructors of the textures, G-Buffer and matrix uniforms.

Next we set up our G-Buffer:


	_gfbo2.Bind();
	// Tex 0:
	//_gfboTex0.Image2D(0,PixelDataInternalFormat::RGBA32UI , _viewportW ,_viewportH , 0 , PixelDataFormat::RGBAInteger,PixelDataType::UnsignedInt,nullptr);
	_gfboTex0.Storage2D(1,PixelDataInternalFormat::RGBA32UI , _viewportW ,_viewportH  );
	GLuint tid = Expose(_gfboTex0).Name();
	assert(tid);
	_gfboTex0.MinFilter(TextureMinFilter::Nearest);
	_gfboTex0.MagFilter(TextureMagFilter::Nearest);
	_gfboTex0.WrapS(TextureWrap::Repeat);
	_gfboTex0.WrapT(TextureWrap::Repeat);

	//Tex 1:
	//	_gfboTex1.Image2D(0,PixelDataInternalFormat::RGBA32F , _viewportW ,_viewportH , 0 , PixelDataFormat::RGBA,PixelDataType::Float,nullptr);
	_gfboTex1.Storage2D(1,PixelDataInternalFormat::RGBA32F , _viewportW ,_viewportH );

	assert( Expose(_gfboTex1).Name());
	_gfboTex1.MinFilter(TextureMinFilter::Nearest);
	_gfboTex1.MagFilter(TextureMagFilter::Nearest);
	_gfboTex1.WrapS(TextureWrap::Repeat);
	_gfboTex1.WrapT(TextureWrap::Repeat);

	//Depth:

	//	_gfboTexDepth.Image2D(0,PixelDataInternalFormat::DepthComponent32F , _viewportW ,_viewportH , 0 , PixelDataFormat::DepthComponent,PixelDataType::Float,nullptr);
	_gfboTexDepth.Storage2D(1,PixelDataInternalFormat::DepthComponent32F ,  _viewportW ,_viewportH  );
	assert( Expose(_gfboTexDepth).Name());
	_gfboTexDepth.MinFilter(TextureMinFilter::Nearest);
	_gfboTexDepth.MagFilter(TextureMagFilter::Nearest);

	/// Init GBUFFER:

	_gfbo2.AttachTexture(FramebufferAttachment::Color,_gfboTex0,0);
	_gfbo2.AttachTexture(FramebufferAttachment::Color1,_gfboTex1,0);
	_gfbo2.AttachTexture(FramebufferAttachment::Depth,_gfboTexDepth,0);

	assert(_gfbo2.IsComplete());
	_gfbo2.Unbind(Framebuffer::Target::Draw);

The G-Buffer framebuffer has 3 texture attachments - 2 color and one for the depth buffer. Pay attention to the internal formats. The first attachment uses RGBA32UI, which means each component is a 32-bit (4-byte) unsigned integer. This format allows us to do the packing: we will pack color (albedo), normals and a material_id (not used in this demo as I render only a single mesh) into the single RGBA32UI attachment by compressing two 16-bit numbers into one 32-bit value for each property, except material_id which occupies a whole channel. See the gbuffer_pass_frag shader in ShaderInline.h for how it's done.

Another detail is the usage of Storage2D. I intentionally left the old-school Image2D calls commented out to show the 2 ways of texture initialization. Storage2D wraps glTexStorage2D, which is much shorter than glTexImage2D, but be warned that using glTexStorage makes the texture's format and dimensions immutable. That means, if you plan to resize the texture at runtime, don't use glTexStorage.
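
To illustrate the kind of packing involved, here is a tiny C++ sketch using GLM's packing helpers; the values are just an example, and the shader itself uses the equivalent GLSL built-ins (packHalf2x16/unpackHalf2x16):


#include <cstdio>
#include <glm/glm.hpp>
#include <glm/gtc/packing.hpp>

int main()
{
	// Two 16-bit half floats squeezed into one 32-bit unsigned integer,
	// e.g. two color components sharing a single channel of the RGBA32UI attachment.
	glm::uint packed   = glm::packHalf2x16(glm::vec2(0.25f, 0.75f));
	glm::vec2 restored = glm::unpackHalf2x16(packed);

	std::printf("packed = 0x%08X, restored = (%f, %f)\n", packed, restored.x, restored.y);
	return 0;
}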

Next we set up the lights UBO. It will contain info for each light in the scene, wrapped in the light_t struct:


  
    _lightUBO.Bind(Buffer::Target::Uniform);
	Buffer::Data(Buffer::Target::Uniform,  NUM_LIGHTS ,(light_t*)0,BufferUsage::DynamicDraw);
	
	BufferRawMap buffMap(Buffer::Target::Uniform,0,NUM_LIGHTS,BufferMapAccess::Write|BufferMapAccess::InvalidateBuffer);


	light_t * lights = reinterpret_cast<light_t *>(buffMap.RawData());
	assert(lights);
	for (int i = 0; i < NUM_LIGHTS; i++)
	{
		float i_f = ((float)i - 7.5f) * 0.1f + 0.3f;
		float rX = RandomFloat(-250.0f,250.0f);  
		float rY = RandomFloat(-250.0f,250.0f); 
		
		lights[i].position =Vec3f(rX,rY,-750.0f);
		
		lights[i].color =
			Vec3f(cosf(i_f * 14.0f) * 0.5f + 0.8f,
			sinf(i_f * 17.0f) * 0.5f + 0.8f,
			sinf(i_f * 13.0f) * cosf(i_f * 19.0f) * 0.5f + 0.8f);


	}
	buffMap.Unmap();
	_lightUBO.Unbind(Buffer::Target::Uniform);



Essentially, a UBO allows us to pass an array into GLSL. The same can be achieved with 1-dimensional textures; I have never benchmarked one against the other, but I found the UBO easier to use and update. Here we initialize the buffer first, then map it to a pointer in order to access its contents. In the for loop we fill a light_t struct for each of the lights with the light position and color. All this data will be accessed in the second pass during the Phong shading computation. Now we are done with the G-Buffer and the lights UBO. The next step is shader loading:

    


	_geomPassVertShader.Source(gbuffer_pass_vert);
	_geomPassFragShader.Source(gbuffer_pass_frag);
	_resolvePassVertShader.Source(light_pass_vert);
	_resolvePassFragShader.Source(light_pass_frag);

	try{
		_geomPassVertShader.Compile();
		_geomPassFragShader.Compile();

		_geomProg.AttachShader(_geomPassVertShader);
		_geomProg.AttachShader(_geomPassFragShader);
		_geomProg.Link();

		_resolvePassVertShader.Compile();
		_resolvePassFragShader.Compile();
		_resolveProg.AttachShader(_resolvePassVertShader);
		_resolveProg.AttachShader(_resolvePassFragShader);
		_resolveProg.Link();

	}catch(Error &er){
		throw er;
	}


All the shader strings are located in ShadersInline.h. I won't paste them here as it's too much code. Basically we have got 4 shaders, 2 for each program. The first program (called _geomProg) executes the first pass and the second, as you have already guessed, executes the second one.

Next comes geometry. We create one big plane (500×500) to render as a primitive in world space and another unit plane which is used to render the full screen quad in the second pass.

 
    _floorPlane = new OGLPlane(500,500, _geomProg);
    _floorPlane->Init();
	_screenQuad = new OGLPlane(1,1,_resolveProg);
	_screenQuad->Init();

Both planes are supplied with their respective programs (the programs with which they are used). I don't like this kind of coupling, but the OGLPlus constructors for Uniforms require it.

The last thing left is to load a texture for our plane primitive:

    auto   pngImage  =images::PNGImage("demo1.png");

	assert(pngImage.Height() > 0 && pngImage.Width() > 0); // assert the texture dims are valid

	_floorTex.Storage2D(10,PixelDataInternalFormat::RGB8,	pngImage.Width(),	pngImage.Height());
	_floorTex.SubImage2D(0,0,0,pngImage.Width(),	pngImage.Height(),PixelDataFormat::RGB ,PixelDataType::UnsignedByte,pngImage.RawData());
	_floorTex.MinFilter(TextureMinFilter::LinearMipmapLinear);
	_floorTex.MagFilter(TextureMagFilter::Linear);
	_floorTex.WrapS(TextureWrap::ClampToEdge);
	_floorTex.WrapT(TextureWrap::ClampToEdge);
	_floorTex.GenerateMipmap();
	_floorTex.Active(0);

You can load any other texture, but beware of the number of channels. Here I used RGB as the texture has no alpha channel. If you load a 32-bit PNG don't forget to change the texture internal format to RGBA8 and the PixelDataFormat to RGBA.
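
For instance, for a texture that does have an alpha channel, the two upload calls above would change roughly like this (a sketch of the RGBA variant, not code from the demo):


	// RGBA variant of the texture upload (assuming the PNG carries an alpha channel):
	_floorTex.Storage2D(10, PixelDataInternalFormat::RGBA8, pngImage.Width(), pngImage.Height());
	_floorTex.SubImage2D(0, 0, 0, pngImage.Width(), pngImage.Height(),
	                     PixelDataFormat::RGBA, PixelDataType::UnsignedByte, pngImage.RawData());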

Now let’s put it all together in a render loop:


float rot = 0.0f ;
void DeferredRenderer::Render(){

	static const GLuint uint_zeros[4] = { 0, 0, 0,0 };
	static const GLfloat float_zeros[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
	static const GLfloat float_ones[4] = { 1.0f, 1.0f,1.0f, 1.0f };

	  Context::ColorBuffer draw_buffs[2] = {
		FramebufferColorAttachment::_0,
		FramebufferColorAttachment::_1
	};

	//////  draw geom in world space:

	Bind(_gfbo2,	Framebuffer::Target::Draw);
	gl.DrawBuffers(draw_buffs);
	gl.Viewport(_viewportW,_viewportH);
	gl.ClearColorBuffer(0,uint_zeros);
	gl.ClearColorBuffer(1,float_zeros);
	gl.ClearDepthBuffer(float_ones[0]);

	_geomProg.Use();
	_projection_matrixUniform.Set(CamMatrixf::PerspectiveY(Degrees(45.0f),(float) _viewportW /(float) _viewportH,0.1f,10000));
	_camera_matrixUniform.Set(CamMatrixf::LookingAt(Vec3f(0.0f),Vec3f(0.0f,0.0f,-1.0f),Vec3f(0.0f,1.0f,0.0f)));

	_floorPlane->Rotate(90.0f,rot+=1.5f,0.0f);
	_floorPlane->Translate(0.0f,0.0f,-850.0f);

	Texture::Active(0);
	Bind(_floorTex,Texture::Target::_2D);

	gl.Enable(Capability::DepthTest);
	gl.DepthFunc(CompareFunction::LEqual);

	_floorPlane->Draw();

	///--------------------   resolve to screen quad -------------------------------------///

	_resolveProg.Use();

	//Bind default FrameBuffer
	Framebuffer::BindDefault(Framebuffer::Target::Draw);
	gl.Viewport(_viewportW,_viewportH);
	gl.DrawBuffer(ColorBuffer::Back);

	//Bind first GBUFFER attachment to a sampler:
	Texture::Active(0);
	Bind(_gfboTex0,Texture::Target::_2D);
	//Bind second GBUFFER attachment to a sampler:
	Texture::Active(1);
	Bind(_gfboTex1,Texture::Target::_2D);

	gl.Disable(Capability::DepthTest);

	_lightUBO.BindBase(Buffer::IndexedTarget::Uniform, 0);
	BufferRawMap buffMap(Buffer::Target::Uniform,0,NUM_LIGHTS,BufferMapAccess::Write|BufferMapAccess::InvalidateBuffer);


	light_t * lights = reinterpret_cast<light_t *>(buffMap.RawData());
	for (int i = 0; i < NUM_LIGHTS; i++)
	{
		float i_f = ((float)i - 7.5f) * 0.1f + 0.3f;
		// t = 0.0f;
		float rX = RandomFloat(-250.0f,250.0f);  
		float rY = RandomFloat(-250.0f,250.0f); 

		lights[i].position =Vec3f(rX,rY,-800.0f);

		lights[i].color =
			Vec3f(cosf(i_f * 14.0f) * 0.5f + 0.8f,
			sinf(i_f * 17.0f) * 0.5f + 0.8f,
			sinf(i_f * 13.0f) * cosf(i_f * 19.0f) * 0.5f + 0.8f);


	}
	buffMap.Unmap();
    _screenQuad->Draw();

	_lightUBO.UnbindBase(Buffer::IndexedTarget::Uniform, 0);

	Texture::Unbind(Texture::Target::_2D);

}

The render loop is divided into 2 stages. First we render the scene geometry normally into the G-Buffer; during this stage the G-Buffer attachments are filled with data about the rendered geometry. In the second pass the scene geometry is replaced with a simple unit plane screen quad. The quad covers the whole screen so we can do a so called post-processing pass. In this pass the data from the previous stage helps to "reconstruct" the geometry properties in screen space and calculate the scene lighting correctly. During the second pass we also update our lights UBO. You don't have to do that if the lights in your scene are static (in that case you can just use light maps and forget about this tutorial ;) ). A free tip - don't forget to unmap the UBO after updating it, as access to it from the shader is blocked as long as it stays mapped to the CPU.

That's all folks. Here you can see the result and download the sources. And don't forget to visit OGLPlus.org and grab the API.

The code.

Result:


OGLPlus tutorial

Filed in 3D | OpenGL

I already mentioned in some older posts that I started playing around with the OGLPlus library. The lib is a really cool piece of software: no need for procedural C-style OpenGL coding, everything is wrapped into template classes. I recommend it to anyone who seeks to use the OpenGL API in an OOP way. In this article I am going to show a rather simple demo, but it sheds some light on how to set up a single vertex buffer to hold several attributes, in this case position and color. I haven't found an example of this in the OGLPlus examples pack, so I decided to write it myself.

For those who are new to OpenGL I will explain. The main reason to pack different attributes into a single buffer is transfer bandwidth: the more separate attribute buffers you use, the more bandwidth is required to transfer the data. Also, updating the data in one single buffer is probably more efficient than doing it across several ones; this is called memory locality. Now, there is an ongoing argument about the most efficient way to pack the data in a single buffer. Basically there are two ways: 1) interleaved vertex data, 2) separate data blocks (which I use in this demo). Most docs suggest the first option, but based on many experts' opinions it may vary across hardware and OpenGL implementations. The sure thing is that using a single data buffer is much better for performance than splitting it across several VBOs. In this demo I will create a plane with position and color vertex attributes packed into a single vertex array as 2 consecutive blocks. Let's see how it's done.

1. Setup the OGLPlus examples solution.

The OGLPlus SDK has a prebuilt Visual Studio 2012 solution residing in the "oglplus\etc\msvc11\" directory. Before using it you should configure GLEW and a window management library, which can be FREEGLUT or GLFW. I personally prefer the second one. One of the reasons I prefer GLFW over GLUT is the ability to set up Full Screen Anti-Aliasing (FSAA); you can't do it with GLUT. Anyhow, in this post I have explained how it is done.

2. The code.

Once everything is set up, we create an empty .cpp file and get to business. I am using the OpenGL 4.2 API for this tutorial, but anything from 3.3 onward will do fine.

We start from Window and OpenGL context creation:


int main(int argc, char* argv[])
{

/// init window ///////////////////
 if(!glfwInit()){
 throw ;
 }
 glfwOpenWindowHint(GLFW_OPENGL_VERSION_MAJOR, 4);
 glfwOpenWindowHint(GLFW_OPENGL_VERSION_MINOR, 2);
 glfwOpenWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_COMPAT_PROFILE);
 glfwOpenWindowHint(GLFW_FSAA_SAMPLES,8);
 if(!glfwOpenWindow(764,468,8,8,8,8,24,8,GLFW_WINDOW)){

  throw;

 }

 glfwSetWindowTitle("Demno1" );
 glfwSetWindowPos(900, 300);
 glfwSwapInterval(1);

/// init context //////////////////
 if(glewInit() != GLEW_OK)
 {
 glGetError();

return 0;
 }

Here we first create a window using the GLFW library. Once the window is up we initialize the GLEW lib to retrieve function pointers to the OpenGL API methods.

Next, let's declare several important OGLPlus objects:


    //OGL setup  /////////
	Context gl;
	VertexShader vs;
	FragmentShader fs;
	Program prog;
	VertexArray triangle;
	Buffer verts;

Most of these are self-explanatory. The Context class contains many core OpenGL methods like glClear(), glEnable/glDisable, etc.; in fact most of the general purpose OpenGL methods that change OpenGL state are accessed via Context. The Buffer object is also pretty interesting because it can be any type of buffer: VBO, UBO, TBO, transform feedback buffer etc.

Now let's add vertex and fragment shaders for our program. Those can be loaded from a text file, but for the sake of simplicity I put them inline. One design flaw I can point out in this respect is the existence of separate VertexShader, FragmentShader, GeometryShader etc. classes; I think it could be possible to unify all these types into a single ShaderProgram class. But well, maybe I am missing something.

static const char* vert_shader =

		"#version 420 core \n"
		"layout(location = 0) in vec3 Position;"
		"layout(location = 1) in vec3 Color;"
		"uniform float deltaTime = 0.0;"
		"uniform mat4 model;"
		"uniform mat4 view;"
		"uniform mat4 proj;"
		"smooth out vec3 vColor;"

		"void main(void){"

		"mat4 MVP = proj * view * model;"
		"vColor = Color * abs( cos(deltaTime));"
		"gl_Position = MVP * vec4(Position,1.0);"
		"}"
		;

	static const char* frag_shader=
		"#version 420 core \n"
		"out vec4 fragColor;"
		"smooth in vec3 vColor;"
		"void main(void){"

		" fragColor = vec4(vColor.rgb,1);"
		"}"
		;

Nothing fancy is going on in these shaders. The vertex shader has position and color as input attributes. It also has 4 uniforms; 3 of them are transform matrices which, in a real life scenario, could be unified into a final MVP matrix to save uniform calls. The additional uniform is "deltaTime", which is passed each frame to do some fancy color animation across the mesh's surface. The vertex shader outputs a varying, the interpolated color, to be used in the fragment shader.

The fragment shader is the simplest in the world. It receives the varying vColor from the previous stage and outputs it to the framebuffer. That's it.

We need to compile and link the shaders from above with our program:

vs.Source(vert_shader);
	fs.Source(frag_shader);

	try{
		vs.Compile();
		fs.Compile();
		// attach the shaders to the program
		prog.AttachShader(vs);
		prog.AttachShader(fs);
		// link and use it
		prog.Link();

	}catch(Error &er){
		throw er;
	}

	prog.Use();

As you can see, the code is pretty straightforward. The shaders are compiled, then attached to the program, and eventually the program is linked.

Before we get to the buffers let’s first finish off with math stuff:

	//=======================  Setup matrices =========================//
	LazyUniform<Mat4f> *projection_matrixUniform, *camera_matrixUniform, *model_matrixUniform;
	projection_matrixUniform = new LazyUniform<Mat4f>(prog,"proj");
	camera_matrixUniform = new LazyUniform<Mat4f>(prog,"view");
	model_matrixUniform = new LazyUniform<Mat4f>(prog,"model");

	projection_matrixUniform->Set(CamMatrixf::PerspectiveY(Degrees(45.0f), 764.0f/468.0f,0.1f,10000));

	camera_matrixUniform->Set(CamMatrixf::LookingAt(Vec3f(0.0f),Vec3f(0.0f,0.0f,-1.0f),Vec3f(0.0f,1.0f,0.0f)));

	Mat4f modelMatr = ModelMatrixf::Translation(0.0f,0.0f,-200.0f);

	modelMatr       =  modelMatr * ModelMatrixf::Scale(50.0f,50.0f,50.0f);

	model_matrixUniform->Set(modelMatr);

We create 3 lazy uniforms for the perspective, camera (view) and model (object) matrices. LazyUniform is "lazy" because it doesn't query the uniform location until it is used. Afterwards we create the matrices themselves. The code is simple and resembles math APIs like GLM. If you have no idea what all this matrix stuff is about, I would strongly suggest learning it, as it is the most fundamental knowledge any graphics programmer should possess. We submit matrix data to a uniform by calling the Set() method. I update all 3 uniforms already at this stage; later, as you will see, only the model matrix will be updated in the render loop to rotate the plane.

Let's move on:


	// bind the VAO for the triangle
	triangle.Bind();

	GLfloat triangle_verts[24] = {
      /*  positions */
		1.0, 1.0, 0.0,
		-1.0, 1.0, 0.0,
		1.0,-1.0, 0.0,
		-1.0,-1.0, 0.0,
        /* colors  */
		1.0f, 0.0f, 0.0f,
		0.0f, 1.0f, 0.0f,
		0.0f, 0.0f, 1.0f,
		0.0f, 0.0f, 1.0f
	};

We bind our VAO, as we will soon need to bind our VBO to it, and next you can see the actual vertex array. It has 24 numbers in total. As I said at the beginning, we are going to pack both positions and colors into a single array. So first let's do the math. Our plane is going to have 4 vertices in total. If you think it should be 6, you are right: in the most naive technique you would create 2 triangles, each one with its own vertices, and render them using the GL_TRIANGLES command. But you can actually spare 2 extra vertices by sharing 2 vertices of the previous triangle with the next one. This kind of draw is called GL_TRIANGLE_STRIP. So now that this is clear, let's see what we have: 4 vertices, each with 3 numbers for position (x,y,z) and 3 numbers for color (r,g,b). All in all each vertex has 6 numbers, which is 24 in total. As you can see from the array above, we put the positions block first and the colors block right after it. In the interleaved scenario we would insert a color line after each position line.

Now, let's set up our single VBO:

    // bind the VBO for the triangle vertices
	verts.Bind(Buffer::Target::Array);
	// upload the data
	Buffer::Data(
		Buffer::Target::Array,
		24,
		triangle_verts
		);
	// setup the vertex attribs array for the position:
	VertexAttribArray vert_attr(prog, "Position");
	vert_attr.Enable();
	vert_attr.Pointer(3,DataType::Float,false,0,(void*)0);

// setup the vertex attribs array for the color:
	size_t colorOffset = sizeof(GLfloat) * 3 /* num components per vertex */ * 4 /* num vertices */;
	VertexAttribArray color_attr(prog,"Color");
	color_attr.Enable();
	color_attr.Pointer(3,DataType::Float,false,0,(void*)colorOffset);

If you have ever written a VBO setup with 'raw' OpenGL you will immediately see what's going on here. First we bind the VBO, then we upload the array data into it. Now comes the interesting part. OpenGL has to know where in the array the position data is and where the color data resides. It has no way to figure this out automatically, as it "sees" just an array of numbers. So we have to point it to the blocks.

First we setup vertex attribute pointer for the positions block:

    VertexAttribArray vert_attr(prog, "Position");
	vert_attr.Enable();
	vert_attr.Pointer(3,DataType::Float,false,0,(void*)0);

The vert_attr constructor takes prog and the attribute name in the vertex shader, which are needed to retrieve the attribute location. In fact, in OpenGL 4.2 we don't need this, as we have direct access to the attrib locations once we specify them explicitly in the shader, as shown in the shaders above. vert_attr.Pointer wraps glVertexAttribPointer(), taking the following params: values per vertex, data type, normalized or not, stride, and the offset in bytes to where the attrib pointer should point. In our case the last 2 params are zero. We don't use a stride at all, as it is only needed when setting up interleaved vertex data, and the offset is zero too because the position block begins at the start of the array.

Next we set the color array pointer, and here we do use an offset to point to the beginning of the color block. colorOffset is calculated as follows: because we know that the color block starts at the end of the position block, we must figure out how many bytes the position block occupies (to find its end). The position data is 12 numbers in total (4 vertices * 3 numbers), and because we need to convert that into bytes we multiply it by 4, which is the number of bytes in the GLfloat data type.

Now OpenGL knows exactly where to look for position and color data when fetching it into vertex shader attributes.
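
For comparison, if we had chosen the interleaved layout mentioned earlier, the same two attribute pointers would use a non-zero stride. This is only a sketch of that alternative; the array below is hypothetical and not the one used in this demo:


	// Interleaved layout: x,y,z,r,g,b per vertex, so the stride is 6 floats.
	GLfloat interleaved_verts[24] = {
		 1.0f,  1.0f, 0.0f,   1.0f, 0.0f, 0.0f,
		-1.0f,  1.0f, 0.0f,   0.0f, 1.0f, 0.0f,
		 1.0f, -1.0f, 0.0f,   0.0f, 0.0f, 1.0f,
		-1.0f, -1.0f, 0.0f,   0.0f, 0.0f, 1.0f
	};

	Buffer::Data(Buffer::Target::Array, 24, interleaved_verts);

	const GLsizei stride = sizeof(GLfloat) * 6; // bytes from one vertex to the next

	VertexAttribArray vert_attr(prog, "Position");
	vert_attr.Enable();
	vert_attr.Pointer(3, DataType::Float, false, stride, (void*)0);

	VertexAttribArray color_attr(prog, "Color");
	color_attr.Enable();
	// each vertex's color starts 3 floats after its position
	color_attr.Pointer(3, DataType::Float, false, stride, (void*)(sizeof(GLfloat) * 3));

Both layouts end up delivering the same data to the shader; only the bookkeeping on the CPU side differs.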

Let’s move on:


	gl.Enable(Capability::DepthTest);
	gl.ClearDepth(1.0f);
	gl.ClearColor(0,1,0,1);
	gl.PolygonMode(PolygonMode::Fill);
	float counter = 0.0f;
	try{

		while(true){

			gl.Clear().ColorBuffer().DepthBuffer();

			modelMatr =  modelMatr * ModelMatrixf::RotationA(Vec3f(0.0f,1.0f,0.0f),Degrees(0.5f));
			model_matrixUniform->Set(modelMatr);

			Uniform(prog,"deltaTime").Set(glfwGetTime());

			gl.DrawArrays(PrimitiveType::TriangleStrip, 0, 4);

			if(glfwGetKey(GLFW_KEY_ESC)||false == glfwGetWindowParam(GLFW_OPENED)){

				break;

			}
			glfwSwapBuffers();

		}

	}catch(oglplus::Error& err)
	{
		std::cerr <<
			"Error (in " << err.GLSymbol() << ", " <<
			err.ClassName() << ": '" <<
			err.ObjectDescription() << "'): " <<
			err.what() <<
			" [" << err.File() << ":" << err.Line() << "] ";
		std::cerr << std::endl;
		err.Cleanup();
	}

	// exit:

	//glfwTerminate();
	exit(EXIT_SUCCESS);
	return 0;
}

This, as you have already guessed, is the render loop. Pay attention to how I set up GL states such as depth test, clear color and depth values using the Context instance. Then at the beginning of the render loop I clear the buffers with just one line of code:

gl.Clear().ColorBuffer().DepthBuffer();

Then we update the model matrix with a rotation and finally issue the draw call:

	gl.DrawArrays(PrimitiveType::TriangleStrip, 0, 4);

Notice that the PrimitiveType equals TriangleStrip, so we can draw 2 triangles with just 4 vertices.

Here is the result:

Next time we will learn how to create 3D shapes and manipulate them using OGLPlus.
Happy new year!


Wrapping up the year 2013

Filed in 3D | Uncategorized

So last night another year passed away, and now we start a crazy horse ride into 2014. In retrospect, I could probably summarize the last year as one of the toughest, most challenging and most fascinating years of my life, in terms of career as well as personal life. It was also one of the fastest years to pass before my eyes. In January 2013 I joined Idomoo, where all my effort since then has been put into the development of the company's rendering technology. There was a huge learning curve which I really enjoyed and benefited a lot from in terms of knowledge.

The past year was extremely challenging because for me it was a sharp and deep switch from the world of Flash, Java and C# (high level) to the dark area of "close to the metal" programming: low level graphics development with C and C++. Though I had coded C/C++ occasionally in the past, it was probably only during the last year that I got seriously involved in full scale software engineering with these languages. The initial feeling was as if you were dropped from a fast flying jet during a storm with the main parachute screwed up. There was so much to learn and figure out about "the C/C++ way" of doing things, about the standards, differences between compilers, cross platform issues, performance optimization techniques, useful libs, etc. Sometimes, being stuck for hours on nasty "access violation" or "segmentation fault" bugs, which are probably the hardest for noobs, I was spitting blood and recalling the merry days of ActionScript3 and other pleasant languages. Today most of the hardest stuff, so I believe, is behind me, but every day it seems like the deeper I get, the more stuff I realize I have no idea of. C++ is very deceptive: at some point you start being confident in what you do and how, but then every now and again you get it right in the face, realizing how much is still left to learn and understand about the language.

Entering 2014 I am mostly concerned with technological challenges related to software and hardware performance, but I also feel excited, as my 2014 roadmap is full of new challenges in the field of graphics programming and my TODO list is full of interesting cool stuff which I have been planning to put my hands on for a long time.

