Discover advances in Metal for A15 Bionic

Dave Roberts: Hi. I’m Dave Roberts from the GPU Software team at Apple.

I’m really excited to share some of the updates to the GPU in the new Apple A15 Bionic chip.

Later on, Katelyn Hinson, my colleague on the GPU Software team, will tell you all about the new Metal features of the GPU.

She’ll also show you how to use those features in your Metal apps while exploring some cool use cases.

The A15 Bionic is a powerful new platform for your Metal apps and games, with updates to the CPU, GPU, Neural Engine, and other user experience-enhancing technologies.

The A15 GPU builds upon the same tile-based deferred renderer and unified memory architecture as the A14 Bionic.

While we’ve made many microarchitectural improvements in various areas, there are some important changes for performance that I should highlight.

The A15’s GPU has up to five shader cores, and that fifth core provides a performance boost of 25 percent at the same GPU core frequency.

The shader cores now have double the F32 floating-point math units, which can boost GPU performance on math-heavy workloads.

The A15’s GPU also makes the UI even more responsive and extends battery life even further.

And you get all of these great improvements for free without any modifications to your code.

But that’s not everything! We brought some brand-new features to the A15 GPU that you can use to make your Metal apps even better.

And all of these new capabilities belong to a new Metal feature set known as AppleGPUFamily8.

For the rest of the talk, Katelyn and I will focus on these new features and explain what they are, why they are useful, and cover the changes to the new Metal API and shading language that support them.

First up, the A15’s new graphics processing features.

Lossy compression, which reduces your app’s texture memory usage with minimal impact on image quality.

This new A15 feature gives you the same texture memory bandwidth savings as lossless compression.

I’ll show you how to use lossy compression in your Metal apps with some more details in a moment.

Later on, Katelyn will show you how the A15 GPU extends existing support for sparse textures by including rendering to both sparse depth and stencil textures.

Katelyn will also cover a new compute-specific feature: SIMD group shuffle and fill.

The A15 adds these new instructions to the GPU core instruction set.

She’ll explain this feature and show you how to improve your app performance by reducing compute kernel execution time for applicable use cases such as image processing.

I’ll start by taking a closer look at lossy compression.

To better understand lossy compression, it’s worth revisiting lossless compression.

The A12 Bionic first introduced lossless texture compression in 2018, and the A14 Bionic added further improvements to the feature in 2020.

Lossless texture compression saves memory bandwidth, which in turn saves power, so your apps can do even more on a single battery charge.

Lossless compression always preserves texture detail.

In fact, your apps might already take advantage of lossless compression on the A12 Bionic and later.

Check out the tech talk, “Discover Metal enhancements for A14 Bionic” and the “Optimizing Texture Data” article on developer.apple.com for more details about lossless compression.

Lossy takes texture compression to the next level on the A15 Bionic.

In addition to the bandwidth savings that lossless compression gives you, lossy compression uses just half the memory footprint of an uncompressed texture.

Lossy compression preserves texture quality wherever it’s possible.

And best of all, you can easily apply this to your render targets on the A15 to take full advantage of those memory savings.

You can enable lossy compression by simply setting a texture descriptor’s new compression type property to lossy.

So why use lossy compression? Well, compression saves significant texture memory bandwidth, whether you choose to use lossless or lossy.

It’s the compression unit that saves the bandwidth by compressing texture data before it’s written to memory.

When you use lossless compression, the GPU must perfectly preserve texture detail.

So Metal cannot guarantee any amount of compression and must allocate enough memory to cover the full uncompressed texture size.

However, when you use lossy compression, textures use just half the memory footprint of lossless.

If the A15 GPU cannot losslessly compress a texture to fit within that 50 percent smaller memory footprint, it’ll reduce the fidelity of regions of the texture so that it does.

Lossy compression supports most pixel formats and texture types, and you can use it on your render targets.

In many cases, you can enable it on textures without any further modifications to your app.

I recommend your apps enable lossy compression wherever the quality tradeoff is acceptable to you.

The easiest place to enable it is your final render target where you’re least likely to notice a loss in quality.

Consider using lossy compression for intermediate render targets and use the memory savings for other things, such as increasing texture resolution.

And be sure to review your postprocessing chain to find render target candidates that may benefit from lossy compression.

Here are some of the use cases in detail.

Take a look at the visual difference if I enable lossy compression for just the final render target.

This split image compares lossless on the left to lossy on the right, and the differences are pretty subtle.

Here’s an image that shows the per-pixel differences between lossless and lossy compression.

The black pixels represent no difference; blue to green pixels represent small differences; and red pixels represent the biggest changes.

The red and yellow pixels in this image illustrate a few isolated regions that have the largest difference in the final render.

If I zoom into one of the regions with the scooter, I have a hard time seeing any difference between the left and the right images.

Intermediate render targets also work well with lossy compression.

Here’s a side-by-side view of the puddle’s reflection that compares lossless and lossy compression.

If I switch to the per-pixel difference representation again, the lossy compressed reflection has only minimal differences from the lossless version.

Plus you can increase the resolution of the texture to add more detail with the memory you save with lossy compression.

For example, here’s a high-resolution reflection that shows more detail than the lossless version, all while using the same amount of memory.

The right side of this demo uses lossy compression for every renderable texture in the scene.

When it’s in motion, the scene looks very stable, and it’s very difficult to detect the difference if you compare it to the lossless version on the left.

Metal makes it easy to use lossy compression in your apps.

Here’s how.

Start by initializing a texture descriptor as usual, then set the compression type property to lossy.

Next, set the storage mode to private.

Finally, create the texture.

And your app can now take full advantage of lossy compression and the savings it offers.
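
If you want to see those steps together, here’s a minimal sketch in Swift; the pixel format, usage flags, and function name are illustrative assumptions, not code from this talk.

```swift
import Metal

// A minimal sketch, assuming a device that supports lossy compression (A15).
// The pixel format and usage flags are illustrative choices.
func makeLossyRenderTarget(device: MTLDevice,
                           width: Int, height: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba16Float,
        width: width,
        height: height,
        mipmapped: false)
    descriptor.compressionType = .lossy  // opt in to lossy compression
    descriptor.storageMode = .private    // lossy textures require private storage
    descriptor.usage = [.renderTarget, .shaderRead]
    return device.makeTexture(descriptor: descriptor)
}
```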

Note that you can create lossy-compressed textures in most configurations, with a few exceptions.

For example, you can use lossy compression for most common texture types, including 2D, 3D, array, and cube.

But the feature doesn’t support some of the less common types.

Similarly, lossy compression supports most common pixel formats, but not the formats with packed color channels.

Lossy compression supports textures as render targets, in blit operations, and when you access them with sample and read operations.

However, note that you cannot populate a lossy texture with shader write operations.

Lossy compression only supports textures in private storage.

You cannot use shared or managed storage modes.

And lastly, lossy textures work with other common features like MSAA, sRGB, and mipmapping.

Check out the Metal feature set tables on developer.apple.com for more details about lossy compression support.

So in summary, lossy compression saves the same bandwidth as lossless compression while also saving 50 percent of the texture memory.

You can save significant amounts of memory depending on the use case and how much you choose to use lossy compression in your apps.

Lossy compression aims to preserve texture detail, only slightly reducing the quality of regions where the compressed texture data doesn’t fit.

And lastly, lossy compression supports common texture types, common pixel formats, and all GPU access modes other than shader writes, making it easy for you to use lossy compression.

Thanks for listening, and now I’ll hand over to Katelyn.

Katelyn Hinson: Thanks, Dave.

I’m excited to introduce the new sparse textures extensions in the A15 Bionic.

Sparse textures are a great way to create high-resolution textures while managing your memory budget in Metal.

A13 Bionic first introduced sparse texture support, allowing you to map and unmap texture tiles on the GPU timeline.

For more details on how to use sparse textures in your apps, refer to the talks from fall 2019 and 2020.

A15 Bionic extends sparse support by including depth and stencil attachments.

The guiding principle of a sparse texture is, “Don’t allocate what you won’t use.” For example, this app doesn’t need to map the tiles behind the UI elements.

With sparse depth and stencil textures, this app can always leave these obscured tiles unmapped.
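
For context, here’s a rough Swift sketch of how an app creates a sparse texture and maps tiles on the GPU timeline; the heap size, formats, and region are illustrative assumptions.

```swift
import Metal

// A rough sketch: create a sparse depth texture from a sparse heap, then map
// only the tiles the app will actually render to. The heap size is
// illustrative; in practice, size it from the device's sparse tile size.
func makeSparseDepthTexture(device: MTLDevice, size: Int) -> MTLTexture? {
    let heapDescriptor = MTLHeapDescriptor()
    heapDescriptor.type = .sparse
    heapDescriptor.storageMode = .private
    heapDescriptor.size = 16 * 1024 * 1024   // backs only the tiles you map
    guard let heap = device.makeHeap(descriptor: heapDescriptor) else { return nil }

    let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .depth32Float, width: size, height: size, mipmapped: true)
    textureDescriptor.storageMode = .private
    textureDescriptor.usage = [.renderTarget, .shaderRead]
    return heap.makeTexture(descriptor: textureDescriptor)
}

// Map one tile region on the GPU timeline; unmapping uses .unmap instead.
func mapTiles(commandBuffer: MTLCommandBuffer, texture: MTLTexture, region: MTLRegion) {
    guard let encoder = commandBuffer.makeResourceStateCommandEncoder() else { return }
    encoder.updateTextureMapping(texture, mode: .map, region: region, mipLevel: 0, slice: 0)
    encoder.endEncoding()
}
```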

You can optimize shadow maps with sparse depth attachments.

If you’re not familiar with shadow mapping, check out Metal’s deferred lighting sample which uses this technique.

A shadow pass renders the shadow map from the light’s perspective, and the lighting pass reads it back.

The texels sampled from the shadow map are always within the projected frustum.

This scenario is a perfect candidate for a sparse texture.

A large portion of the shadow texture does not need to be mapped, as the lighting pass does not sample these tiles.

Here’s a scene that uses shadow mapping and its rendered shadow map.

The app can recover the memory in the tiles outside the view frustum since it doesn’t need to write or read back from those tiles.

Cascaded shadow mapping is a more advanced technique for shadows that uses multiple individual shadow maps to cover the scene more efficiently.

It allocates higher resolution shadow maps near the camera and lower resolution maps farther from the camera.

For example, this scene uses three overlapping shadow maps.

Each shadow map has the same texture resolution and is mapped to increasingly larger areas the farther it is from the camera.

The green highlighted areas in the shadow map represent the texels that the lighting pass samples.

The lighting pass nonuniformly samples from these tiles — represented as a heat map — with blue tiles undersampled and red oversampled.

You can use sparse tiled shadow maps to replace these textures with a single surface that has adjustable resolution based on sampling rate.

With sparse tiled shadow maps — or STSM — you create a single sparse depth surface.

Instead of using a fixed-resolution texture, the surface has mapped tiles across the sparse mipmap chain.

This technique only maps the tiles needed to match the desired sampling rate.

Here’s an illustration of the physical resolution for each tile in its relative mip level.

You can freely and efficiently adjust the resolution of your shadow map across a scene by mapping tiles across different mips.

Here are the main steps of the STSM technique.

First, generate the density map based on the sampling rate.

Then construct the surface and map tiles according to the density map.

And then render to and sample from the adaptive surface.

To generate the density map, the geometry pass populates a density map buffer for the other passes.

The first step is to get the sampling rate across the shadow map.

The expected sample density is calculated by tracking the shadow space derivatives of rendered geometry.

The fragment shader uses atomics to store the derivatives in a 2D grid, collecting the sampling rates across the shadow UV space.
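
As a hypothetical sketch in the Metal shading language, the density pass might look like this; the grid layout, bindings, and rate formula are assumptions for illustration, not the demo’s actual shader.

```metal
#include <metal_stdlib>
using namespace metal;

struct GeometryOut {
    float4 position [[position]];
    float2 shadowUV;   // interpolated shadow-map coordinates
};

// Each fragment derives a sampling rate from the shadow-space derivatives and
// atomically records it in a 2D density grid.
fragment void accumulate_density(GeometryOut in            [[stage_in]],
                                 device atomic_uint *grid  [[buffer(0)]],
                                 constant uint2 &gridSize  [[buffer(1)]])
{
    // Screen-space derivatives of the shadow UV: small derivatives mean the
    // lighting pass needs a denser shadow map in this region.
    float2 dx = dfdx(in.shadowUV);
    float2 dy = dfdy(in.shadowUV);
    float rate = 1.0f / max(max(length(dx), length(dy)), 1e-6f);

    uint2 cell = uint2(saturate(in.shadowUV) * float2(gridSize - 1));
    uint index = cell.y * gridSize.x + cell.x;

    // Keep the highest requested sampling rate per grid cell.
    atomic_fetch_max_explicit(&grid[index], uint(rate), memory_order_relaxed);
}
```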

Once you have the density map, use it to lay out the tiles for your sparse depth texture and generate a table of contents buffer.

This table of contents buffer will be used by your lighting pass.

First, map the tiles of the depth texture by iteratively dividing the surface and scheduling the mapping of each mip level, starting with the bottom mip.

To figure out which tiles to map, start by checking the sampling rate of the current mip from the density map.

In this example, the density map indicates the current mip is inadequate, but the next mip level is a good fit.

In this case, you promote the whole tile by mapping the next mip and unmapping the current mip.

Here’s another, more complicated scenario.

The density map indicates the current mip is satisfactory for at least one quadrant.

For the next mip, the sample rate in the density map meets the target for two quadrants, while the remaining two are under the target rate.

In this case, map the tile in the current mip and map half the tiles in the next.

Next, a compute shader writes a 2D table that translates between UV and mip levels, which is stored in our table of contents, or TOC, buffer.

Before sampling a texel from the STSM, the lighting pass reads the TOC buffer, indexing the table to get the mip.

The mip is then used as an explicit LOD parameter when you sample the shadow map.
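
Here’s a sketch of what that lookup could look like in a lighting shader, assuming a hypothetical TOC layout that stores one mip index per grid cell.

```metal
#include <metal_stdlib>
using namespace metal;

constant uint kTOCDim = 64;   // assumed TOC grid resolution

// Look up the mapped mip for this shadow UV, then sample with an explicit LOD.
static float sample_stsm(texture2d<float> shadowMap,
                         sampler shadowSampler,
                         constant uchar *toc,
                         float2 uv)
{
    // Index the table of contents to find which mip is mapped at this UV.
    uint2 cell = uint2(saturate(uv) * float(kTOCDim - 1));
    float mip = float(toc[cell.y * kTOCDim + cell.x]);

    // Sample the sparse shadow map at that explicit mip level.
    return shadowMap.sample(shadowSampler, uv, level(mip)).x;
}
```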

The next step is to render the sparse shadow map.

First, cull the shadows by using the TOC buffer, then encode the indirect draw commands to render to the sparse depth texture.

Render the surface by filling each mip level individually with an indirect command buffer (ICB).

ICBs are the perfect fit for this task since compute passes can, in parallel, cull and sort each shadow geometry mesh against the resident areas.

The compute shader encodes draw commands into individual ICBs by testing meshes against the bounding volumes of the tiles.

For large objects that stretch across the shadow map, the shader tests the object against the relevant tiles of each mip; and it encodes a draw command to the mip’s ICB if at least one tile overlaps.

If a mip doesn’t have any tiles that overlap, don’t emit a draw command for the object to that mip’s ICB.

The shadow-cull compute pass encodes the optimized set of draw commands for each object by running all intersection tests in parallel compute threads.
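
Here’s a hypothetical sketch of such a cull-and-encode kernel for a single mip; the mesh layout, bounds test, and bindings are assumptions rather than the demo’s actual code.

```metal
#include <metal_stdlib>
using namespace metal;

struct MeshInfo {
    float4 bounds;        // shadow-space AABB: (min.x, min.y, max.x, max.y)
    uint vertexStart;
    uint vertexCount;
};

// The ICB is passed in through an argument buffer.
struct MipICB {
    command_buffer icb [[id(0)]];
};

static bool overlaps(float4 a, float4 b) {
    return a.x < b.z && b.x < a.z && a.y < b.w && b.y < a.w;
}

// One thread per mesh: test the mesh's shadow-space bounds against this mip's
// resident tiles and encode a draw only if at least one tile overlaps.
kernel void shadow_cull(device const MeshInfo *meshes        [[buffer(0)]],
                        device const float4   *residentTiles [[buffer(1)]],
                        constant uint         &tileCount     [[buffer(2)]],
                        device MipICB         *mip           [[buffer(3)]],
                        uint meshIndex [[thread_position_in_grid]])
{
    MeshInfo mesh = meshes[meshIndex];

    for (uint t = 0; t < tileCount; ++t) {
        if (overlaps(mesh.bounds, residentTiles[t])) {
            render_command cmd(mip->icb, meshIndex);
            cmd.draw_primitives(primitive_type::triangle,
                                mesh.vertexStart, mesh.vertexCount, 1, 0);
            return;
        }
    }
    // No overlap: leave the slot empty so the GPU skips it.
}
```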

Since the red mesh was closest to the camera, it had the biggest impact at 11 tiles on our shadow map.

Compare that to the orange mesh, the furthest from the camera, which had the smallest impact at three tiles.

Once the indirect draw commands are complete, the STSM is ready to be sampled in the lighting pass.

This table compares STSM to shadow mapping and cascaded shadow mapping.

The STSM sample rate and effective quality are about the same as the single shadow map’s, but STSM uses far less memory.

In fact, it uses less than a percent of the memory footprint for the same resolution.

I hope this deep dive gives you some ideas on how to create efficient, high-quality shadows in your Metal apps with sparse tiled shadow maps.

Lastly, I’m excited to introduce the new additions to Metal compute in A15: SIMD shuffle and fill.

In modern image processing, convolution kernels are applied for filters like edge detection, blur, and sharpen.

Here’s a convolution applied to an image from our modern rendering demo.

Workloads like this one are typically limited by texture sampling or reading from thread group memory, leaving the GPU’s math units underutilized.

Apple silicon provides a rich set of SIMD instructions that can be used in Metal compute shaders to help optimize these workloads.

As threads of a SIMD group run concurrently in lockstep, SIMD group functions exploit this lockstep execution to share data between the threads of the group.

For more information about existing SIMD functions, please see the talks for A13 Bionic and A14 Bionic, where they were introduced.

Now let’s discuss the new SIMD instructions available.

New to A15 Bionic is support for SIMD and quad shuffle and fill.

These instructions are designed to improve sliding-window image operations, like the edge detection convolution previously shown.

These functions optimize compute workloads by sharing data across neighboring threads in a given SIMD group without using memory.

First, let’s look at the behavior of quad shuffle down, first supported in A13.

The data buffer has contents A, B, C, and D, which are loaded into the registers of a quad’s threads: zero, one, two, and three.

When applying a shuffle down with a shift of one, threads zero, one, and two get the data from threads one, two, and three.

The computed quad lane ID doesn’t wrap around, so thread three in the result has the unshifted value D.

Instead, if the quad shuffle and fill down instruction is used, a fill buffer is provided to update thread three in the result.

Now the fill data of thread zero is shuffled to the output data in thread three.

Similarly, for a quad shuffle up and fill with a shift of two, we see A and B are shuffled to threads two and three, and the data from the fill buffer is shuffled into the output’s lower lanes.

On Apple silicon, SIMD groups are composed of 32 threads.

And the same shuffle and fill behavior can be applied across a full SIMD group, where the lower delta lanes are filled with the upper lanes of the fill data.

The new SIMD and quad shuffle and fill instructions also have an optional modulo argument.

This allows for user-specified vector widths.

For a modulo of eight, the SIMD group is effectively split into four vectors.

The data buffer values are first shuffled up two indices, and the fill data is shuffled into each set of eight threads.
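
Here’s a minimal Metal shading language sketch of these calls; the buffer bindings are illustrative, and the kernel assumes a single 32-thread SIMD group for clarity.

```metal
#include <metal_stdlib>
using namespace metal;

// A minimal sketch of the new shuffle-and-fill calls on A15. 'data' and
// 'fill' stand in for the data and fill buffers in the diagrams above.
// Dispatch a single 32-thread threadgroup for this demo.
kernel void shuffle_fill_demo(device const float *input  [[buffer(0)]],
                              device float       *output [[buffer(1)]],
                              ushort lane [[thread_index_in_simdgroup]])
{
    float data = input[lane];        // this SIMD group's data vector
    float fill = input[lane + 32];   // the fill vector

    // Shift down by one: lane i receives lane i+1's data, and the top lane
    // receives fill's lane 0 instead of keeping its own value.
    float down = simd_shuffle_and_fill_down(data, fill, ushort(1));

    // Shift up by two with a modulo of eight: the group acts as four 8-wide
    // vectors, and lanes 0-1 of each vector are filled from fill's top lanes.
    float up8 = simd_shuffle_and_fill_up(data, fill, ushort(2), ushort(8));

    output[lane] = down + up8;
}
```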

Let’s use these new instructions in a real example: the kernel used for edge detection of our modern rendering image can be optimized using SIMD shuffle and fill.

In order to generate the final result, a 5-by-5 convolution kernel is applied to the input image.

The output image is split into a set of SIMD groups, where each SIMD group is a 4-by-8 chunk, each thread writing to a single output.

Let’s focus on generating the output for a single SIMD group.

For the 5-by-5 convolution, each thread must read a 5-by-5 block of pixels from the input.

For each 4-by-8 SIMD group, an 8-by-12 region must be sampled in the compute shader.

The naive implementation of this convolution would require 25 samples for each output thread.

This results in significant sampling overlap across our SIMD group.

This can be optimized by the shuffle and fill instructions, eliminating duplicate samples within a SIMD group and sharing data through register shuffles.

Let’s see why we would need to read from these locations.

First A is loaded, which is a 4-by-8 window where each thread in the SIMD group samples a single pixel; then B for the top-right window, C for the bottom left, and finally D for the bottom right.

The red-outlined rectangle indicates the destination region of our SIMD group in the output image.

Through four samples per thread, the 8-by-12 input region was loaded across our SIMD group without any overlapping samples.

Refocusing on the 5-by-5 region for thread zero, these samples can be represented as a 5-by-5 neighborhood.

Quad shuffle and fill down can be used to access the first row of neighbors, first shuffling down A’s data and filling with B’s data.

Then the 32-wide vectors from the previous row are shuffled down for the next row.

As the data is shuffled down a full row, a fill vector is needed to shuffle the samples from C and D into the upper lanes of the 32-wide vector.

Using the same approach, SIMD and quad shuffle down can be used to get the remaining samples in the 5-by-5 region.

Once the full neighborhood has been shuffled, these samples are used as the input to the edge detection algorithm.

While the naive implementation sampled the full neighborhood for each thread, with the new SIMD and quad shuffle and fill instructions, the number of samples for each SIMD group is reduced by 84 percent, eliminating overlapping samples across neighboring threads.
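
To make the pattern concrete, here’s a simplified one-dimensional analogue in the Metal shading language, not the demo’s full 5-by-5 kernel; it assumes a 32-by-1 threadgroup so SIMD lanes map to consecutive x coordinates, and it omits bounds handling for brevity.

```metal
#include <metal_stdlib>
using namespace metal;

// Each thread samples once per window position; neighbor samples arrive
// through register shuffles instead of extra texture reads.
kernel void row_average_3(texture2d<float, access::read>  src [[texture(0)]],
                          texture2d<float, access::write> dst [[texture(1)]],
                          uint2 gid [[thread_position_in_grid]])
{
    const uint kSimdWidth = 32;   // SIMD group width on Apple silicon

    // A: this thread's own pixel. B: the pixel one SIMD group to the right,
    // used as fill data so the upper lanes still get valid neighbors.
    float a = src.read(gid).x;
    float b = src.read(uint2(gid.x + kSimdWidth, gid.y)).x;

    // Neighbors at +1 and +2 come from higher lanes; B fills the lanes that
    // would otherwise fall off the end of the SIMD group.
    float n1 = simd_shuffle_and_fill_down(a, b, ushort(1));
    float n2 = simd_shuffle_and_fill_down(a, b, ushort(2));

    // Three input samples per output pixel, but only two texture reads per
    // thread; the rest is shared through registers.
    dst.write(float4((a + n1 + n2) / 3.0f), gid);
}
```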

Using the new SIMD operations, many common image processing and machine learning algorithms can apply the same approach to share data efficiently across SIMD groups.

And that’s it for SIMD shuffle and fill.

Let’s recap what we’ve learned.

Lossy compression is an easy-to-enable feature that saves memory footprint and bandwidth while maintaining quality for your textures.

Sparse depth and stencil textures help you create efficient, high-quality shadow maps.

The new SIMD shuffle and fill compute instructions reduce overlapping samples and speed up sliding-window image operations for machine learning and image processing applications.

And finally, all Metal apps get an additional boost in performance, responsiveness, and power savings from the overall architectural improvements of the A15 Bionic GPU.

Thank you for watching.
