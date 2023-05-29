Arm is the company that designs pretty much all of the CPU cores that end up being used in your Android smartphone, and every year it announces new iterations that will later find their way into chipsets like that year's flagship Snapdragon or the next flagship MediaTek Dimensity. This year, it's releasing a flagship Cortex-X4 core, a Cortex-A720 performance core, and a Cortex-A520 efficiency core. These cores form the basis of the company's new Arm v9.2 compatible designs and the company's Total Compute Solution for 2023, or TCS23. On top of that, we're also seeing a new DynamIQ Shared Unit and an updated Immortalis-G720 GPU. Bigger still is a complete transition towards 64-bit computing, with none of these cores supporting 32-bit.

All three of the new cores are microarchitectural successors to last year's and are primarily focused on introducing IPC and efficiency gains.

64-bit only: "Mission accomplished"

One of the biggest changes in this year's Total Compute Solution from Arm is the transition to 64-bit only. While last year's A510R1 supported the 32-bit AArch32 execution mode, as did the A710 that launched with TCS22 last year, this year, Arm's cores are AArch64 only. The clock has been ticking for 32-bit applications on Android, particularly since Google itself has mandated that all apps updated since 2019 are uploaded as 64-bit binaries.
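For readers who want to see what "64-bit only" means at the binary level, here is a minimal sketch that reads the header of a native library and reports whether it targets AArch32 or AArch64. It relies only on the standard ELF header layout (the class byte at offset 4 and the machine field at offset 18). The function name `describe_elf_header` is hypothetical and used purely for illustration; it is not part of any Android or Arm tool.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# EI_CLASS byte (offset 4 of the ELF identification block).
EI_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}
# e_machine values defined by the ELF specification.
E_MACHINE_NAMES = {40: "Arm (AArch32)", 183: "AArch64"}


def describe_elf_header(header: bytes) -> tuple[str, str]:
    """Return (word size, target machine) for the first 20 bytes of an ELF file.

    Hypothetical helper for illustration only.
    """
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    elf_class = EI_CLASS_NAMES.get(header[4], "unknown")
    # EI_DATA byte (offset 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    machine = E_MACHINE_NAMES.get(e_machine, f"other ({e_machine})")
    return elf_class, machine
```

In practice, you would pass in the first 20 bytes of a `.so` file extracted from an APK. A library that reports "32-bit" and "Arm (AArch32)" is the kind of code that these new cores can no longer execute.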

As Arm puts it, the 64-bit transition is considered "mission accomplished." The reason for this is that the Chinese app market is what held back the rest of the industry in the transition, but the vast majority of apps on Chinese app stores are now 64-bit compliant, too.

The reason for the delay was the lack of a homogenized application ecosystem, meaning that different app stores held developers to different standards. After Arm worked with various app stores in China and repeatedly warned that a shift was coming, though, those app stores have been encouraging developers to switch, too.

The time has now seemingly come for that transition to happen in its entirety, and there are going to be a few more months until we see these Arm cores in new chipsets, anyway.

Arm Cortex-X4: Even more performance and better efficiency

Arm's X series of cores diverged from its A series a number of years ago, with the philosophy being that it's a powerful core that is allowed to guzzle a bit more power when it needs it. Typically, chipset makers will only include one or two of these at most, as they're power hungry despite the capabilities that they offer.

As you can see from the above graph, the Cortex-X4 is the most powerful Arm core yet, but those computation capabilities come at the cost of power consumption. The Cortex-X4 is similar to last year's X3, and as Arm puts it, can even be run at the same frequencies as last year's core and use up to 40% less power. It's less than 10% larger in physical size and the most efficient Cortex-X core ever built.

As for where those IPC improvements come from, there are a number of front-end and back-end improvements to the X4. In those front-end improvements, a large amount of work was put into re-writing and improving on branch predictions, as incorrect branch predictions are costly, performance-wise. Arm also promises that an L2 cache size of 2MB yields higher performance, not so much in benchmarks but in real-world usage.

The new Cortex-X4 core increases the number of Arithmetic Logic Units (ALUs) from 6 to 8, adds a branch unit (for a total of 3), adds an extra Multiply-Accumulate unit, and pipelines floating-point divide and square root operations.

As for the back end, there are a number of improvements, too. Load-store address generation has gone from three instructions to four per cycle, as the load-store pipe was taken and split up. The L1 translation lookaside buffer has also been doubled in size, and there are bank conflict improvements.

All of this comes together to bring an impressive performance uplift in Arm's Cortex-X4: you can expect an average performance improvement of 15%. In the power and performance curve shared by Arm, the X4 extends ahead of the X3 in both performance and power consumption. In other words, that 15% performance improvement comes at a pretty significant power draw. It's worth mentioning, though, that it's not quite an apples-to-apples comparison; the Cortex-X3 came with 1MB of L2 cache last year, which means that should a manufacturer stick to the same L2 cache size this year, there may not necessarily be a 15% performance uplift.
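To put those two headline figures side by side, here is a rough back-of-the-envelope sketch of performance per watt. It uses only the numbers quoted in this article (a 15% average performance gain, and up to 40% less power when the X4 is held to the X3's level). The 30% peak power increase is a made-up placeholder, since Arm did not give us an exact peak power figure, and the function name is hypothetical.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance per watt versus the previous core.

    perf_ratio and power_ratio are both expressed relative to the X3
    (1.0 means "same as last year").
    """
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / power_ratio


# Case 1: X4 held to the X3's performance, using Arm's "up to 40% less power".
iso_perf_gain = perf_per_watt_gain(1.0, 1.0 - 0.40)

# Case 2: X4 at its peak. The 15% gain is Arm's average figure; the 1.30
# power ratio is a HYPOTHETICAL placeholder, not a number from Arm.
peak_gain = perf_per_watt_gain(1.15, 1.30)

print(f"Iso-performance perf/W vs X3: {iso_perf_gain:.2f}x")
print(f"Peak perf/W vs X3 (hypothetical power): {peak_gain:.2f}x")
```

The takeaway is that the efficiency story depends entirely on where on the curve a chipset maker chooses to run the core. At peak, performance per watt only matches the X3 if power rises by no more than the same 15% that performance does.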

One thing is for sure, though, and it's that if you're running the X4 at maximum speed, it will likely be a major power guzzler. We may see some OEMs this year continue to do what they did last year and throttle many of this year's chipsets out of the box. For example, OnePlus and Oppo both do this, and with those power efficiency gains when running at the same performance points as the X3, it's likely that there will be benefits for those companies to continue doing so. We may not see that 15% performance uplift across the board, but we may see further efficiency improvements instead for next year's chipsets.

Arm Cortex-A720: Balancing performance and power consumption

While Arm's X series of cores are typically allowed to run a bit wild, the A series of cores aims to balance power consumption against performance. With the Cortex-A720, Arm promises a 20% more efficient core, with increased performance at the same power as the A715 from last year.

As for where this year's A720 improvements come from, most of them are in the front end. Pipelines have been shortened with one cycle removed from the branch mispredict engine, with this single cycle drop being said to account for a 1% increase in benchmarks. Benchmarks typically result in the fewest branch mispredicts, meaning that this will likely improve overall real-world performance by a more significant (but largely immeasurable) amount.
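A simple textbook model helps show why a one-cycle shorter mispredict penalty matters more outside of benchmarks. The sketch below treats cycles per instruction as a base cost plus a mispredict term. Every number in it (base CPI, branch fraction, mispredict rates, and the 11-to-10 cycle penalty) is a hypothetical value chosen for illustration, not a figure from Arm.

```python
def cycles_per_instruction(base_cpi: float, branch_fraction: float,
                           mispredict_rate: float, penalty_cycles: int) -> float:
    """Textbook CPI model: base cost plus average mispredict stall per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def speedup_from_shorter_penalty(base_cpi: float, branch_fraction: float,
                                 mispredict_rate: float,
                                 old_penalty: int, new_penalty: int) -> float:
    """Speedup at a fixed clock when only the mispredict penalty changes."""
    old = cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, old_penalty)
    new = cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, new_penalty)
    return old / new


# HYPOTHETICAL workloads: a predictable benchmark loop versus branchier app code.
benchmark_like = speedup_from_shorter_penalty(0.5, 0.2, 0.01, 11, 10)
real_world_like = speedup_from_shorter_penalty(0.5, 0.2, 0.05, 11, 10)

print(f"benchmark-like speedup:  {benchmark_like:.4f}x")
print(f"real-world-like speedup: {real_world_like:.4f}x")
```

Under these assumptions, the workload with more mispredicts sees the larger gain from the same one-cycle saving, which is the intuition behind the claim that real-world code should benefit more than benchmarks.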

In the out-of-order core, we see a number of structural improvements that help to improve performance without impacting the area taken up by the core or its efficiency. For starters, just like in the X4, floating point divides and square root operations are now pipelined. There are also faster transfers from floating point, NEON, and SVE2 numbers to integers and other overall improvements to speed up processing.

Arm shared the above graph to illustrate how the A720 compares to last year's A715 in performance and efficiency, measured in SPECint_base2006 on the same process node and at the same frequency. Cache sizes remain the same, too, so it's very much an apples-to-apples comparison.

In terms of power consumption, the A720 remains much in line with last year's model, though it ekes out a little bit more performance at the same power levels. With the A720, like with the X4, Arm appears to be focusing more on highlighting how it's getting better performance out of last year's power constraints rather than continuously increasing the power that these cores are capable of.

Arm Cortex-A520: Doubling down on efficiency

Of course, when it comes to Arm's cores, it's not all about performance. With the X series putting everything into raw computational power and the A7xx balancing computational needs and power draw, the A5xx series focuses purely on efficient processing. It's the lowest power per area Arm v9.2 core and builds on that same merged-core architecture that we saw introduced with the A510.

What this merged-core architecture means is that two cores can be grouped into a "complex" and share some resources. The L2 cache, the L2 translation lookaside buffer, and vector datapaths are shared within this complex. To be clear, this does not mean cores have to be bundled in pairs, and a one-core complex can be assembled for peak performance. In fact, one of the TCS23 core layouts that Arm showed us involved a single X4 core, five A720 cores, and three A520 cores, meaning that at least one A520 core is in isolation.
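The arithmetic behind that "at least one A520 core is in isolation" remark can be sketched in a few lines. The snippet below greedily pairs cores into two-core complexes and leaves any remainder in a single-core complex. This is only one possible arrangement (a chipset maker could equally use all single-core complexes), and the function name is hypothetical.

```python
def group_into_complexes(core_count: int, cores_per_complex: int = 2) -> list[int]:
    """Greedily fill complexes, returning the number of cores in each one.

    Hypothetical helper: this mirrors the pairing described in the article,
    not an actual Arm configuration tool.
    """
    if core_count < 0 or cores_per_complex < 1:
        raise ValueError("invalid core or complex size")
    full, leftover = divmod(core_count, cores_per_complex)
    complexes = [cores_per_complex] * full
    if leftover:
        # An odd core out sits alone and does not share its L2 or vector datapath.
        complexes.append(leftover)
    return complexes


# Three A520 cores, as in the layout Arm showed.
print(group_into_complexes(3))
```

With three efficiency cores, pairing leaves one two-core complex and one core on its own, which is the minimum number of isolated cores for an odd count.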

The A520 is an efficiency-first design, and like the other cores, Arm focused largely on improving that efficiency at the same power points as the last generation. This includes improving branch predictions while also removing or scaling down some performance features, with the lost performance recovered through greater efficiency. Interestingly as well, Arm has removed the third ALU that was in the A510, saving power in issuing logic and forwarding results.

In real-world results, it seems that the A520 isn't as large of a jump from its predecessor as the A720 and the X4 are. Much of its capability at lower power levels overlaps with the A510 in the above graph, and it's only at the upper echelons of performance that we see efficiency gains. The divergence in performance and power between the two cores is promising, but it's unclear if we'll see any actual real-world benefits when comparing the A520 to the A510. After all, it's hard to properly measure performance and efficiency differences between the two in the real world.

DSU-120: Up to 14 cores of computational goodness

The DynamIQ Shared Unit, or DSU, integrates one or more cores with an L3 memory system, control logic, and external interfaces in order to form a multicore cluster. It's essentially Arm's fabric that allows all of these cores to communicate with each other and share resources, and as such, it's a fairly important piece of the puzzle for any chipset maker looking to build a chip with Arm's core designs.

Building on DSU-110, Arm has made a number of improvements to DSU-120 that will serve to benefit the whole chip that it's included on. For starters, there are now up to 14 cores per cluster (up from 12) and support for up to 32MB of L3 cache. It also greatly improves efficiency in a number of key areas, including in the event of cache misses, while also reducing power leakage.

In a way, Arm's DSU is the backbone of TCS23, as it forms the basis of how these cores interact with each other and share data. Any improvements here will benefit the entire cluster, but it seems most of the changes are related to power consumption and efficiency.

Efficiency is the new goal

The industry has seemingly been shifting for a while, but the main first impression I get from these cores is that efficiency is now the name of the game. While we were told about how much faster the X4 core is and how it's the company's fastest core ever, they were very quick to note the efficiency improvements of running it at last year's peak performance instead.

Across the board, every performance gain was underpinned by just how much more efficient that component was too, and more or less, all the changes of the DSU were in efficiency and power leakage. Performance is important, but it really feels like the industry as a whole is trying to make current computational levels more efficient rather than going for massive performance increases year-on-year.

We expect that these cores will form the basis of the MediaTek Dimensity 9400 and the Qualcomm Snapdragon 8 Gen 3, but in what formation remains to be seen. As previously mentioned, Arm talked about using a 1+5+3 core layout in its own internal testing, but that doesn't mean it's what partners like MediaTek and Qualcomm are looking to do themselves.