Date: 2024-02-09

Key Takeaways

RDNA 4, AMD's next-gen graphics architecture, is expected to bring technology improvements and a performance boost.

RDNA 4 is expected to be AMD's next-generation graphics architecture, bringing technology improvements along with a boost in performance. We're now beginning to get a clearer picture of what RDNA 4 might bring to the table when it launches, which is expected to be sometime this year. As is typical for AMD, we expect higher-end models to arrive first, with more affordable models coming later.

The most concrete sign of what's to come has appeared thanks to AMD's work on LLVM. For those unfamiliar, LLVM is a set of compiler and toolchain technologies, and AMD is adding support for its upcoming GPUs so that the toolchain is compatible with them at launch. Because a compiler needs to understand a GPU's instruction set architecture (ISA) to generate code for it, these patches reveal what is changing in the ISA, and an analysis from Chips and Cheese went through those changes. They represent architectural improvements to the RDNA graphics architecture, and don't tell us about the raw performance of the GPUs that will eventually be released.

As a primer, six months ago it was rumored that AMD had canceled its flagship GPUs that were expected to usher in RDNA 4. There were also rumors that the company would switch back to a monolithic die rather than using a multi-chip module (MCM), which the company debuted with RDNA 3. That's not inherently a bad thing, as Nvidia uses monolithic dies for its Ada Lovelace architecture. Still, it's a step back for AMD, a company that has been focusing significant resources on chiplet designs and advanced packaging technologies for both GPUs and CPUs.

Major AI changes

AMD wants to catch up to Nvidia

The biggest changes (in my opinion) highlighted by Chips and Cheese are the AI-related changes to the ISA. Basically, GPUs are incredibly powerful for AI due to their mathematical capabilities and support for parallel operations, thanks to the hundreds or even thousands of cores that they have onboard. Matrices are often at the heart of neural networks, and they're an efficient way to store, represent, and manipulate data. GPUs have specialized matrix math instructions, and over time, we've seen that lower-precision data types are accurate enough for many AI workloads.

RDNA 3 introduced WMMA (Wave Matrix Multiply Accumulate) instructions, and RDNA 4 will improve the efficiency of these while adding instructions that support 8-bit floating-point formats (FP8 and BF8), where floating-point inputs were previously limited to 16-bit precision. 8-bit precision means increased computational efficiency and reduced memory usage, which will help speed up AI performance in general. Less bandwidth is also required to transfer data in an 8-bit format than in a 16-bit format, which will again improve AI performance.
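To make the storage argument concrete, here is a minimal Python sketch of a matrix multiply-accumulate (D = A x B + C) alongside the bytes needed to hold one tile at each precision. It is plain Python rather than actual WMMA code, and the 16 x 16 tile size and the helper names `tile_bytes` and `matmul_accumulate` are assumptions chosen for illustration.

```python
# Toy illustration only: plain Python, not AMD's WMMA instructions.
TILE = 16  # assumed square tile size, for illustration


def tile_bytes(rows, cols, bits_per_element):
    """Bytes needed to store a rows x cols tile at the given precision."""
    return rows * cols * bits_per_element // 8


def matmul_accumulate(a, b, c):
    """Return D = A x B + C for matrices given as lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
        for i in range(n)
    ]


bytes_16 = tile_bytes(TILE, TILE, 16)
bytes_8 = tile_bytes(TILE, TILE, 8)
print(bytes_16, bytes_8)  # prints "512 256"
```

Halving the element size halves both the storage for each tile and the data that has to move to feed the same multiply-accumulate.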

On top of that, there are also improvements to handling sparse matrices. Sparse matrices are matrices with a lot of zero elements. Because any multiplication by zero contributes nothing to the final sum, those operations can be skipped entirely. Storage and bandwidth can be reduced because of the compressibility of these matrices too, which incentivizes handling sparse matrices differently. With RDNA 4, there are new SWMMAC (Sparse Wave Matrix Multiply Accumulate) instructions designed to handle sparse matrices more efficiently. The image below, shared by Nvidia, demonstrates how handling sparse matrices differently can help.

Sadly, we can't really infer how much this will improve performance. All that we can say is that it will, but by how much isn't exactly clear. Chips and Cheese suggests that sparsity could increase performance by as much as 2x, but that's also an estimate. With sparsity handled this way, though, there are potentially big improvements to memory usage, bandwidth efficiency, computation, and even energy efficiency.
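As a rough illustration of why sparsity can halve the work, here is a minimal Python sketch of a 2:4 structured-sparsity scheme, where each group of four values keeps at most two non-zeros plus their positions. This assumes a format like the one Nvidia has described publicly; the source doesn't confirm what format RDNA 4's SWMMAC instructions use, and the function names are hypothetical.

```python
# Toy sketch, assuming a 2:4 structured-sparsity format (an assumption,
# not a confirmed description of RDNA 4's SWMMAC instructions).


def compress_2_of_4(row):
    """Compress a row where each group of 4 has at most 2 non-zeros.

    Returns (values, indices): two stored values per group, plus each
    value's position inside its group (the metadata).
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError("group has more than 2 non-zero values")
        # Pad with zero-valued slots so every group stores exactly 2 entries.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        indices.extend(kept)
    return values, indices


def sparse_dot(values, indices, dense):
    """Dot product that only touches stored values (half the multiplies)."""
    total = 0
    for slot, (value, index) in enumerate(zip(values, indices)):
        group_start = (slot // 2) * 4  # two stored slots per group of 4
        total += value * dense[group_start + index]
    return total


row = [0, 3, 0, 5, 7, 0, 0, 0]
dense = [1, 2, 3, 4, 5, 6, 7, 8]
values, indices = compress_2_of_4(row)
print(values, indices)                     # [3, 5, 7, 0] [1, 3, 0, 1]
print(sparse_dot(values, indices, dense))  # 61, same as the dense dot product
```

The compressed row stores four values instead of eight and needs four multiplies instead of eight, which is where an "up to 2x" figure would come from under this assumed format.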

Improvements to prefetching and coherency

Increased prefetch distance and software-directed prefetching

Software prefetching in GPUs is a technique used to improve the efficiency and performance of memory access by anticipating the data and instructions that will be needed soon and loading them into the cache before the processing units actually request them. This proactive approach aims to reduce memory access latency, a common bottleneck in computing performance. By ensuring that necessary data is already in cache, closer to where computation occurs, it minimizes the time the GPU spends waiting for data to be fetched from main memory.

As spotted by Chips and Cheese, RDNA 4 seems to be increasing the initial prefetch distance from 64 x 128 bytes to 256 x 128 bytes. This covers 32 KB of code as opposed to the original 8 KB. RDNA 4 also appears to add instructions that may allow software to direct prefetching. Software prefetching is more commonly seen on CPUs than on GPUs, as it can work out to be computationally expensive on the GPU.
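The arithmetic behind those figures is easy to check. The short Python sketch below treats the 128-byte unit as a cache line, which is an assumption on our part, and `prefetch_window_kb` is a hypothetical helper name.

```python
# Assumption: the 128-byte unit in the prefetch distance is one cache line.
CACHE_LINE_BYTES = 128


def prefetch_window_kb(lines):
    """Size in KB of the code covered by a prefetch distance in lines."""
    return lines * CACHE_LINE_BYTES / 1024


print(prefetch_window_kb(64))   # 8.0  (the original distance)
print(prefetch_window_kb(256))  # 32.0 (the reported RDNA 4 distance)
```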

More flexible coherency handling

Additionally, coherency handling will likely become more flexible. Coherency handling in GPUs refers to the management of data consistency across various caches within the GPU and between the GPU and CPU. In systems with multiple processors or cores (like CPUs and GPUs), all components need a consistent view of data, especially when different parts of the system may read from or write to the same memory locations.

In RDNA 3, memory access instructions had a Globally Coherent (GLC) bit that could be set to make the access globally coherent. When the GLC bit is set for a data load instruction, it forces the load to bypass the local L0 and L1 caches, which are private to a Compute Unit (CU) or Shader Engine (SE), and instead go directly to the L2 cache. The L2 cache is shared across multiple CUs or SEs, making it a central point for ensuring coherency.

This approach effectively ensures that a load operation retrieves the most recent version of the data, reflecting any writes that may have been made by threads running on other CUs or SEs. By using the L2 cache as a coherency point, the GPU can maintain a consistent view of memory across its many cores, which is crucial for parallel processing tasks where multiple cores work on different parts of a problem but need to share results or data updates.
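The stale-read problem that the GLC bit avoids can be sketched with a toy cache model in Python. This is a deliberately simplified illustration and not how the hardware is implemented: it assumes one private L0 per compute unit, a single shared L2, write-through stores, and no L1 at all, and the `ToyGpuMemory` class is hypothetical.

```python
class ToyGpuMemory:
    """Toy model: one shared L2 plus a private L0 per compute unit (CU).

    Assumptions for illustration only: write-through stores, no L1,
    and no invalidation of other CUs' private caches.
    """

    def __init__(self, num_cus):
        self.l2 = {}
        self.l0 = [dict() for _ in range(num_cus)]

    def store(self, cu, addr, value):
        # Write-through: update this CU's L0 and the shared L2.
        self.l0[cu][addr] = value
        self.l2[addr] = value

    def load(self, cu, addr, glc=False):
        if glc:
            # Coherent load: bypass the private cache and read the shared L2.
            return self.l2.get(addr)
        if addr in self.l0[cu]:
            # Private cache hit: may return a stale value.
            return self.l0[cu][addr]
        value = self.l2.get(addr)
        self.l0[cu][addr] = value  # fill the private cache on a miss
        return value


mem = ToyGpuMemory(num_cus=2)
mem.store(0, 0x10, 1)
print(mem.load(1, 0x10))            # 1 (CU 1 now caches the value)
mem.store(0, 0x10, 2)               # CU 0 updates the same address
print(mem.load(1, 0x10))            # 1 (stale hit in CU 1's private cache)
print(mem.load(1, 0x10, glc=True))  # 2 (coherent load reads the shared L2)
```

The trade-off is visible even in this toy: the coherent load always sees the latest write, but it gives up the faster private-cache hit to do so.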

A change with RDNA 4 could mean that instead of going through the GPU-wide L2 cache, data could be shared across threads through the L1 cache. Dependent threads would need to run on the same SE, but in theory, there are still performance gains to be had from making more use of the L1 cache.

RDNA 4 is expected to arrive sometime this year

We hope, anyway

With AMD starting to publish patches that ensure compatibility in LLVM with RDNA 4, and with rumors heating up over the last few months, it seems inevitable that the first RDNA 4 GPUs will arrive this year. While we're not sure if they'll be some of the best GPUs when they launch, it's clear that at least some of the architectural improvements are pretty big. These are significant AI improvements, and if leveraged, they may help AMD close the gap with Nvidia, even if only a little.

If you're looking to buy a GPU for AI, though, it's extremely likely that you still won't be buying an AMD card, even with RDNA 4. It may help with certain workloads, but Nvidia will almost certainly retain its crown. We'll be ready and waiting, though; competition is always good for consumers.