It is one thing to think that the code you are writing is not generating bad code, and a completely different thing for it to actually not generate bad code. You want to be sure, so you have to give the code real data to transform. I am not one of those people with great imaginations, so I decided to write a software triangle rasterizer.
Having written a few in my younger days, my options were more limited than if I had not; the world wasn't my oyster. I was forced to design one around the things that I 'knew', so I started my design around the number one bottleneck in modern CPUs: memory access. The other thing I wanted to do was to leverage SIMD instructions, as that was the task I set out to do in the first place.
I would be computing multiple pixels, or fragments if you will, simultaneously. I chose the 2x2 quad as my primitive for 4-wide SIMD vectors; this makes sense, as GPU hardware works like this for various reasons. This also scales nicely to wider SIMD vectors: 8-wide can do two side-by-side quads simultaneously, and 16-wide can do a whole 4x4 block at once. We want to avoid super-wide spans like 16x1 pixels, because they would be wasted on most triangles.
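To make the mapping concrete, here is one way the lanes could be laid out over pixels; these offset tables are an illustration of the idea, not the actual code:

// Hypothetical lane-to-pixel offsets for a 16-wide vector covering a 4x4
// block as four 2x2 quads: lanes 0-3 are the top-left quad, lanes 4-7 the
// top-right quad, lanes 8-11 the bottom-left, lanes 12-15 the bottom-right.
static const int laneOffsetX[16] = { 0,1,0,1,  2,3,2,3,  0,1,0,1,  2,3,2,3 };
static const int laneOffsetY[16] = { 0,0,1,1,  0,0,1,1,  2,2,3,3,  2,2,3,3 };
// An 8-wide vector would use the first two quads (lanes 0-7) to cover a
// 4x2 region; a 4-wide vector is exactly one 2x2 quad (lanes 0-3).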
4x4 is also nice for 32-bit buffers: 16 pixels times 4 bytes each is 64 bytes, which happens to be the L1 cache line size on many contemporary CPU architectures. If we align our buffers, and thus the 4x4 blocks, to 64 bytes, our memory access just got reasonably efficient.
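As a sketch of what that means in practice, assuming C++17 and a block-linear layout where each 4x4 block occupies 16 consecutive pixels (the helper names here are my own illustration):

#include <cstdint>
#include <cstdlib>

// Allocate a 32-bit color buffer where every 4x4 block (16 pixels * 4 bytes
// = 64 bytes) starts on a 64-byte boundary, i.e. on its own L1 cache line.
uint32_t* allocateColorBuffer(int width, int height)
{
    // round dimensions up to multiples of 4 so the buffer tiles evenly
    int alignedWidth  = (width  + 3) & ~3;
    int alignedHeight = (height + 3) & ~3;
    size_t bytes = size_t(alignedWidth) * alignedHeight * sizeof(uint32_t);
    return static_cast<uint32_t*>(std::aligned_alloc(64, bytes));
}

// Offset of the 4x4 block containing pixel (x, y) when blocks are stored
// linearly in memory (16 consecutive pixels per block).
size_t blockOffset(int x, int y, int alignedWidth)
{
    int blocksPerRow = alignedWidth / 4;
    return (size_t(y / 4) * blocksPerRow + (x / 4)) * 16;
}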
Now we are taking good advantage of the vector units in our CPU. The next problem to solve is how to use the multiple CPU cores. The obvious solution is called 'binning': the framebuffer is split into a number of tiles which are processed individually. 128x128 is a good size, as it is not too small and still fits into the L2 cache of most CPUs, leaving some cache for textures and other input.
When the vertices are transformed, the resulting coordinates can be used for binning the resulting triangles. Binning can be done in either clip or screen coordinates. Screen coordinate binning should not require explanation, so a few words about clip coordinate binning: in clip coordinates the ratios x/w and y/w determine the bin, or bins, the triangle belongs to.
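A minimal screen-coordinate binning sketch, assuming 128x128 tiles and a per-bin vector of triangle indices (all names here are illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int BIN_SIZE = 128; // 128x128 tiles fit in L2 with room to spare

// Append the triangle index to every bin its screen-space bounding box
// touches; a tighter edge test could reject bins the triangle misses.
void binTriangle(std::vector<std::vector<uint32_t>>& bins,
                 int binsPerRow, int binsPerColumn, uint32_t triangleIndex,
                 float x0, float y0, float x1, float y1, float x2, float y2)
{
    int minX = std::max(0, int(std::min({x0, x1, x2})) / BIN_SIZE);
    int minY = std::max(0, int(std::min({y0, y1, y2})) / BIN_SIZE);
    int maxX = std::min(binsPerRow - 1,    int(std::max({x0, x1, x2})) / BIN_SIZE);
    int maxY = std::min(binsPerColumn - 1, int(std::max({y0, y1, y2})) / BIN_SIZE);
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
            bins[y * binsPerRow + x].push_back(triangleIndex);
}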
The last step is called 'resolve', where each bin is resolved by a discrete CPU thread. This has a nice effect on the CPU cache, as the many triangles processed in the same CPU thread end up written to the same area of memory. One CPU core thus keeps accessing the same L2 cache and does not need to share it with other threads, which reduces on-chip overhead significantly.
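Dispatch could be as simple as an atomic work counter over the bins; a sketch, not the project's actual scheduler:

#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Each worker grabs the next unresolved bin. Because one thread resolves a
// whole bin, its 128x128 tile stays hot in that core's L2 cache.
void resolveAllBins(int binCount, const std::function<void(int)>& resolveBin)
{
    std::atomic<int> nextBin { 0 };
    unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&] {
            for (int bin; (bin = nextBin.fetch_add(1)) < binCount; )
                resolveBin(bin); // rasterize every triangle binned to this tile
        });
    }
    for (std::thread& w : workers)
        w.join();
}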
Enough theory! Screenshot!
As can be observed, the number of features isn't so great at this time. There is depth buffering going on, some perspective-correct gradients and stuff like that. There is texture mapping as well (not shown), and it is quite trivial to add more gradients and use them in different creative ways. The inner loops are still hand-written, but if I ever get serious about this they should be compiled from a higher-level shading language. I will never get serious about this, though, as there is no place for software rendering these days. I wrote this for fun and to test the math library. :)
Here's what the prototype inner loop looks like:
// convert the integer barycentric weights into floating-point
float32x16 c0 = convert<float32x16>(i0);
float32x16 c1 = convert<float32x16>(i1);
float32x16 c2 = convert<float32x16>(i2);
// compute which fragments are inside the triangle
mask32x16 colorMask = (i0 & i1 & i2) < zero;
// compute w coordinate for perspective correction
float32x16 w = 1.0f / (c0 * tri.w[0] + c1 * tri.w[1] + c2 * tri.w[2]);
// depth test
float32x16 depth = (c0 * tri.depth[0] + c1 * tri.depth[1] + c2 * tri.depth[2]) * w;
mask32x16 depthMask = depth < *depthBuffer;
mask32x16 writeMask = colorMask & depthMask;
// TODO: skip the block when the writeMask is zero (the block is not visible)
// compute (r, g, b) by interpolating the per-vertex colors
float32x16 red   = (c0 * tri.color.x[0] + c1 * tri.color.x[1] + c2 * tri.color.x[2]) * w;
float32x16 green = (c0 * tri.color.y[0] + c1 * tri.color.y[1] + c2 * tri.color.y[2]) * w;
float32x16 blue  = (c0 * tri.color.z[0] + c1 * tri.color.z[1] + c2 * tri.color.z[2]) * w;
// pack the fragment color into a 32 bit pixel format
int32x16 r = convert<int32x16>(red);
int32x16 g = convert<int32x16>(green);
int32x16 b = convert<int32x16>(blue);
int32x16 a = 0xff;
int32x16 color = b | (g << 8) | (r << 16) | (a << 24);
// write depth and color into the framebuffer
*colorBuffer = select(writeMask, color, *colorBuffer);
*depthBuffer = select(writeMask, depth, *depthBuffer);
This code computes 16 pixels per iteration. The nice property of using an SoA (structure-of-arrays) layout for the vertex attributes and uniforms is that the code can be written the same way scalar code would be written. Observe how the red color component is computed above; the code looks completely scalar, but since the computation is done in vector registers the "scalar" computation happens to 16 pixels simultaneously. This makes it very easy to write more complicated shaders, and it scales perfectly to the available SIMD register width. The key is the data layout in memory!
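For reference, the SoA triangle setup assumed by the loop above might look something like this (the field names are my guess to match the mock-up, not the actual library):

// One triangle's setup data in SoA form: each attribute stores its three
// per-vertex values, so an interpolation in the inner loop reads like a
// scalar expression while operating on 16 fragments at once.
struct TriangleSetup
{
    float w[3];     // per-vertex w for perspective correction
    float depth[3]; // per-vertex depth
    struct
    {
        float x[3]; // per-vertex red
        float y[3]; // per-vertex green
        float z[3]; // per-vertex blue
    } color;
};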
The code above is a mock-up, since I wanted to expose the internal workings in more detail; the actual code uses helpers to hide some of the complexity. For example, computing an interpolated variable does not write out the barycentric equation every time as we do here. The color packing could be abstracted as well, but we don't want a function call in our inner processing loop, so the correct code would have to be resolved at compilation time. The best approach to this is to generate the binary code at runtime using either a simple JIT compiler or, even better, existing tooling like LLVM. I am not too keen on going there at this time, since this is a non-profit hobby fun project, but that's what I'd do if I ever wanted to take this prototype to the next level. For now the compiled code runs nearly 100% in registers and doesn't do any unnecessary spilling; primary goal achieved: the math library at least doesn't suck!
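Such a helper might look along these lines; this is a guess at the shape, not the actual interface:

// hypothetical helper: perspective-correct interpolation of one attribute,
// hiding the barycentric expression that the mock-up writes out long-hand
static inline float32x16 interpolate(float32x16 c0, float32x16 c1, float32x16 c2,
                                     const float v[3], float32x16 w)
{
    return (c0 * v[0] + c1 * v[1] + c2 * v[2]) * w;
}

// usage: float32x16 red = interpolate(c0, c1, c2, tri.color.x, w);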
One neat feature I must add is early-z. The 'heart' of the rasterizer can easily classify 4x4 (or any other size) blocks into three cases: fully inside the triangle, trivially rejected as outside the triangle, or crossing a triangle edge. When a block is fully covered, the block's farthest depth can be stored in a coarse depth buffer, and any block that is about to be rasterized can then have its nearest depth tested against the coarse depth to reject blocks that are not visible. That will be a fun feature, but I need more complicated test scenes for this, seriously.
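A sketch of one common formulation of the idea, with one depth value per block (names and layout are assumptions, not implemented code):

#include <algorithm>

// One depth value per 4x4 block: the farthest depth of the last fully
// covered write. With a less-than depth test, a block whose nearest
// incoming depth is farther than this value cannot be visible.
float* coarseDepth; // (width / 4) * (height / 4) entries, initialized to 1.0f

bool blockPotentiallyVisible(int blockIndex, float incomingMinDepth)
{
    return incomingMinDepth < coarseDepth[blockIndex];
}

void updateCoarseDepth(int blockIndex, float coveredMaxDepth)
{
    // only safe to tighten when the triangle fully covers the block
    coarseDepth[blockIndex] = std::min(coarseDepth[blockIndex], coveredMaxDepth);
}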
Other optimizations: when a block is fully inside the triangle there is no need to compute the colorMask, which is used to mask out writes to the color buffer. The code does not write pixels out one by one; we process 16 pixels simultaneously, so we write them all out simultaneously. Remember: the cost is the same for 1 or 16 pixels because they reside in the same L1 cache line, which is the smallest unit the CPU can transfer across the memory bus anyway.
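In the fully covered case the inner loop sketch from above would shrink to something like this (same hypothetical types as before):

// block fully inside the triangle: the coverage mask is all ones, so only
// the depth test gates the write; all 16 pixels still go out in one store
mask32x16 writeMask = depth < *depthBuffer;
*colorBuffer = select(writeMask, color, *colorBuffer);
*depthBuffer = select(writeMask, depth, *depthBuffer);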
Performance? 500K triangles on a 1920x1080 buffer render at 60 fps easily on an i7-4770 CPU. 3840x2160 can render 200K triangles at 60 fps, too, on the same CPU. I don't have comprehensive charts or anything like that at this time, since coding is still ongoing, but the results are promising.
The effect of resolution is actually smaller than anticipated. The number of triangles is the more limiting factor, as the transformation and binning code is still not optimized at all (those operations run in multiple threads, but that's it). The triangle setup code is still scalar; we could at least set up 4, 8 or 16 triangles simultaneously. Also, at these resolutions with 500K-1M triangles, the triangles are very small, only 2-4 pixels in size, while the block we are processing is 4x4. But reducing the block size doesn't give any benefits (tested), so we are going with these dimensions.
I compiled a 64-bit Linux demo for SSE4.1. It can be found here.
Update: on an i9-7900X the performance nearly doubles with AVX-512 over AVX2 when the fragments are expensive enough. With only a depth test and Gouraud shading, memory bandwidth is the limiting factor on performance, and AVX-512 is only 25% faster. This means AVX-512 leaves more headroom for more expensive shaders. I read the early reports about Skylake-X CPUs thermally throttling when AVX-512 is in heavy use, but I did not encounter this effect with my setup; I have AIO liquid cooling with enough airflow in the case, and all CPU cores run at 100% utilization consistently without throttling. Amazing CPU for the price, even if it is a bit steep, but for once you get what you pay for. Intel did not pay me to advertise their products, but they could, wink wink.