ARM announces the Cortex-A78 CPU and Mali-G78 GPU designs for high-end Android smartphones
As part of its TechDay 2020, ARM has made three major announcements. The headline announcement is the Cortex-X Custom program (CXC), containing the new Cortex-X1 CPU core. The Cortex-X1 aims to bring higher peak performance than any Cortex-A series CPU, while breaking the envelope of the Cortex-A series’ PPA. The other two announcements that ARM made were a lot more routine. The Cortex-A78 CPU and the Mali-G78 GPU are now official, and they act as the successors of the Cortex-A77 CPU and the Mali-G77 GPU respectively. Let’s cover these announcements one by one:
With the Cortex-A78, ARM’s key focus was on efficiency demands, such as demands for longer battery life, new mobile form factors, and shrinking SoC areas. Sustained performance is the keyword here for the Cortex-A78, while the Cortex-X1 shoots for the stars with its goal of achieving maximum short-term peak performance.
ARM says the Cortex-A78 represents the “very best” of its drive for high-end performance at best-in-class efficiency. These are not just empty words either. For the last couple of years, the Cortex-A76 and the Cortex-A77 have shown best-in-class energy efficiency and best-in-class PPA (performance, power, and area). They did not have the design required to compete with Apple’s A-series chips on performance, but because of their lower power draw, their energy efficiency was at worst the same as Apple’s and at best even higher.
The A78’s performance improvements cover the use cases of productivity, communication, security and camera-based tasks, advanced gaming, XR, and ML-based experiences.
In sustained performance, the Cortex-A78 brings double-digit improvements. It provides a 20% improvement in sustained performance over its predecessor, the Cortex-A77, in the same mobile thermal power envelope. AnandTech went through the numbers and explained that the 20% figure is a combination of 7% higher IPC over the A77, while the remaining 13% performance gains are credited to the 5nm process, on which the next-generation SoCs will all be fabricated. ARM notes the importance of sustained performance by saying that mobile devices have a limited capacity to dissipate power, and sustained performance avoids power throttling for applications demanding a lot of power. This, in turn, improves the UX by avoiding lag or frame drops.
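As a rough check on that breakdown, the two contributions can be combined either by simple addition or by compounding. The sketch below assumes the IPC and process gains are independent multiplicative factors, which the article does not state; the function name `combined_gain` is made up for illustration.

```python
def combined_gain(*gains):
    """Compound fractional gains multiplicatively (assumes they are independent)."""
    total = 1.0
    for gain in gains:
        total *= 1.0 + gain
    return total - 1.0


ipc_gain = 0.07      # IPC uplift over the A77, per the breakdown above
process_gain = 0.13  # remainder attributed to the 5nm process

additive = ipc_gain + process_gain
compounded = combined_gain(ipc_gain, process_gain)

print(f"additive: {additive:.1%}, compounded: {compounded:.1%}")
```

Either way the result lands at roughly 20%, so the quoted figure is consistent with both readings.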
The push on power efficiency translates into higher energy efficiency, as the two are related, but different concepts. According to ARM, at high-performance points, such as those that are the peak for current mobile devices, the Cortex-A78 offers 50% energy savings over 2019 devices at the same performance as the Cortex-A77. This is impressive and it makes the A78 the most energy-efficient Cortex-A CPU ARM has ever designed.
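The distinction between power and energy is easy to show with numbers: energy is average power multiplied by the time a task takes. The sketch below uses hypothetical wattage and duration values (they are not ARM's figures) to show what a 50% energy saving at the same performance implies.

```python
def energy_joules(avg_power_watts, duration_s):
    """Energy consumed by a task: average power multiplied by run time."""
    return avg_power_watts * duration_s


# Hypothetical baseline: a core drawing 2 W that finishes a task in 10 s.
baseline_power_w = 2.0
task_time_s = 10.0
baseline_energy = energy_joules(baseline_power_w, task_time_s)

# Same performance means the same task time, so halving the energy
# implies halving the average power over that period.
new_energy = 0.5 * baseline_energy
new_power_w = new_energy / task_time_s

print(f"baseline: {baseline_energy:.0f} J at {baseline_power_w:.1f} W")
print(f"50% saving: {new_energy:.0f} J at {new_power_w:.1f} W")
```

If the newer core also finished the task sooner, it could draw the same power and still use less energy, which is why the two metrics are quoted separately.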
ARM’s focus on sustained performance will benefit the next wave of mobile innovation such as new form factors (foldable phones) as well as improved “digital immersion” through 5G. The reality check is that this is not the case for the current generation, and it won’t matter much even in the next generation.
One use case that will be improved by the Cortex-A78 is AAA mobile gaming, when combined with ARM’s own new Mali-G78 GPU. The combination of the two aims to bring high-fidelity gaming experiences to mobile. Their greater performance will, when coupled with 5G’s fast speed and high bandwidth, enable premium gaming on mobile. The A78’s efficiency has a benefit here, as it will provide longer battery life for extended gaming. ARM says it’s also working with the ecosystem to further enhance performance and build richer gaming experiences, and gives an example of its work with Unity to bring Burst Compiler to Android.
Machine learning (ML) performance is another priority for ARM. The CPU is the first-choice processor for ML computing on mobile, although these days high-end SoCs come with separate neural processing units (NPUs). ARM’s CPUs support the most popular real-world ML applications and use cases on smartphones, such as social media filters, dictation, and security. The Cortex-A78 uses 8% less power on average for ML-based tasks compared to the A77, which ARM says leads to an official 10% efficiency improvement.
ARM Cortex-A78 – Architecture
The ARM Cortex-A78 has the same architecture as the previous generation (it’s still an ARM v8.2 core). ARM did, however, add microarchitectural features that aim to push performance higher in an area- and power-efficient manner. ARM is saving area and power while maintaining the needed performance levels. Again, ARM’s focus in the Cortex-A series remains on area and power efficiency rather than peak performance, which is now a job taken up by the Cortex-X program.
The Cortex-A78’s performance improvements are enabled through additional microarchitectural features that optimize width and depth. The instruction decode width remains at 4-wide, same as the A77 and the A76. (The Cortex-X1’s decode width, on the other hand, is 5-wide, while Apple’s A13 has a 7-wide decode width.) ARM has improved branch prediction bandwidth and accuracy, and has added more instruction fusion cases. These microarchitectural improvements enable a 7% increase in single-thread performance over the A77.
Efficiency has been maximized by shrinking structures that deliver little performance for their area, such as the L1-I and L1-D caches. ARM has also optimized existing structures to consume less power, such as the branch prediction structures. ARM says this leads to 4% less power (performance per mW) and 5% less area (performance per mm²) compared to the A77.
The A78 keeps the focus on sustained performance at best-in-class efficiency at the cluster level. A DynamIQ cluster of 4x Cortex-A77 and 4x Cortex-A55 CPUs can be upgraded to 4x A78 cores and 4x A55 cores. This provides 20% sustained performance improvements in 15% less area. Applications that require several high-performance threads in parallel, such as high-fidelity gaming, will benefit due to the sustained performance push.
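Taken together, the two cluster-level figures imply a larger gain in performance per unit of area than either number suggests on its own. The sketch below is plain arithmetic on the quoted 20% and 15% figures; the helper name `perf_per_area_gain` is made up for illustration and the result is not a number ARM has published.

```python
def perf_per_area_gain(perf_gain, area_reduction):
    """Relative change in performance per unit area.

    perf_gain: fractional performance increase (0.20 means +20%).
    area_reduction: fractional area decrease (0.15 means 15% smaller).
    """
    return (1.0 + perf_gain) / (1.0 - area_reduction) - 1.0


gain = perf_per_area_gain(perf_gain=0.20, area_reduction=0.15)
print(f"sustained performance per unit area: +{gain:.1%}")
```

That works out to roughly 41% more sustained performance per unit of silicon area, assuming both figures apply to the same cluster configuration and workload.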
ARM notes the enhanced area efficiency of the A78 DynamIQ cluster makes it ideal for foldable phones and multiple and larger displays. Another focus is on getting smartphones 5G-ready through performance and energy improvements. 5G supposedly provides “far faster speeds”, “far lower latency”, and “far faster and more ubiquitous connectivity for mobile devices for high-bandwidth applications”. This may be the case a few years from now, but at present, most of these benefits aren’t noticeable for end consumers.
Overall, the Cortex-A78 is a solid product. Next-generation flagship SoCs will incorporate multiple A78 cores to complement the single Cortex-X1 core that has higher power and area requirements, and some value-oriented SoCs will even opt to skip the Cortex-X1 entirely. For the mid-range SoC market, the A78 will be the CPU core of choice for 2021 SoCs, and its focus on sustained performance is welcome.
ARM’s Mali series of GPUs hasn’t been nearly as successful as its Cortex series of CPUs, to put it mildly. The Mali GPUs have been consistently beaten on both performance and power efficiency by Apple’s custom GPUs and Qualcomm’s custom Adreno GPUs, year after year. Last year’s launch of the new Valhall architecture and the Mali-G77 GPU did nothing to change that, sadly. SoCs featuring the Mali-G77 included the Exynos 990 and the MediaTek Dimensity 1000L. Both of them, unfortunately, appeared to have weak implementations that meant their GPU performance could not compete with Qualcomm’s Adreno 650 GPU, never mind Apple’s class-leading GPUs in the Apple A12 and A13. Mali has lagged behind for years, and its improvements haven’t been enough to change the status quo in the mobile GPU space.
Nevertheless, ARM is nothing if not optimistic. It notes that its partners have shipped over one billion Mali GPUs annually, making Mali the number one shipped GPU in the world. This number will only increase, supposedly, as many more different types of devices enable graphic-intensive use cases such as advanced mobile gaming and XR (VR and AR). According to ARM, this makes Mali the most widely used GPU for mobile development across the ecosystem.
ARM notes that in 2019, it announced its first GPU based on the Valhall architecture, the Mali-G77. In 2020, the G77 is being succeeded by the Mali-G78, which is also based on the Valhall architecture. ARM says it’s the most performant GPU for premium mobile devices to date and, ironically, calls that a fact supported by the numbers, but the numbers don’t back it up. The G78 brings a 25% improvement in performance over the G77, which is meager, to say the least. The gap in peak GPU performance between the G77 and the Apple A13’s GPU was significant, which means the G78 won’t be able to catch up to the A13, never mind the upcoming Apple A14’s GPU. Qualcomm will also remain one step ahead because of its own incremental performance improvements.
Game-changing graphics and all-day gaming on mobile are already possible on other GPUs, so ARM’s marketing here rings a little hollow.
The Mali-G78 is built with developers and the end-user in mind, according to ARM. It enables high-quality mobile gaming experiences with console games now available on mobile. The G78 brings longer battery life to premium mobile devices. It also brings a further ML performance boost for more complex gaming, video, camera, and security ML features on mobile devices.
ARM is bullish about the prospect of mobile gaming. Mobile gaming accounted for more than 46% of the global games market in 2019, reaching $68.2 billion in revenue. It’s also set to continue growing over the next few years as it will outpace both PC and console gaming. More premium gaming titles are coming to mobile and users expect a similar experience on mobile compared to consoles.
To make these experiences possible, the Mali-G78 comes with the necessary performance boost. It has a 15% performance density improvement for gaming content compared to the G77. For the same amount of area as the previous generation, the G78 will provide more performance. This boost is made possible by four key features:
- Support for up to 24 cores
- Asynchronous Top Level
- Tiler improvements
- Improved fragment dependency tracking
While the G77’s maximum core count was 16, ARM has raised the maximum on the G78 to 24 cores. Of course, just because there is a maximum doesn’t mean mobile chip vendors will actually incorporate 24 cores. The widest variant of the G77 we have seen so far is the Mali-G77MP11 on the Exynos 990, while the Dimensity 1000 has a Mali-G77MC9.
ARM believes Asynchronous Top Level to be a game-changing feature for GPU performance. It lets the GPU’s top level run at a different clock frequency from the shader cores, which is said to squeeze as much performance out of mobile games as possible.
Tiler improvements, on the other hand, add an extra layer of quality to mobile games. Games brought over from PC and console often have extremely complicated assets and sophisticated scenes, which cause performance sticking points and bottlenecks. Tiler improvements reduce the vertex load on the GPU for these complex scenes and assets. This improves performance for complicated console-like gaming content.
ARM has also enhanced the fragment dependency tracking on the G78. This particularly affects mobile games with complex gaming scenes involving smoke, trees, and grass. The results are that ARM has seen up to 17% performance improvements on top mobile games compared to the G77.
The Mali-G78 has 10% better energy efficiency than its predecessor. Again, that won’t be enough to catch up either with Qualcomm or with Apple. ARM’s goals here seem particularly conservative. The Asynchronous Top Level feature plays an important role in energy efficiency, as it enables a reduction in power, thus enabling content to be generated in a sustainable way. When a device is already outputting content at the desired frame rate, the shader cores can clock down to save energy. Raising the Top Level’s frequency to keep up uses a bit more energy, but the energy saved by reducing the frequency of the shader cores is far higher. That’s because the shader cores use 90-95% of the GPU’s energy budget.
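The trade-off between the Top Level and the shader cores can be put into rough numbers. The sketch below is a toy model, not ARM’s: only the 90-95% shader share comes from the text, while the 10% shader saving, the 20% Top Level increase, and the function name `net_energy_change` are hypothetical values picked for illustration.

```python
def net_energy_change(shader_share, shader_saving, top_level_increase):
    """Net fractional change in total GPU energy.

    shader_share: fraction of the energy budget used by the shader cores.
    shader_saving: fractional energy saved in the shader cores (hypothetical).
    top_level_increase: fractional extra energy spent in the top level (hypothetical).
    """
    top_level_share = 1.0 - shader_share
    new_total = (shader_share * (1.0 - shader_saving)
                 + top_level_share * (1.0 + top_level_increase))
    return new_total - 1.0


# Evaluate at both ends of the 90-95% shader share quoted above.
for share in (0.90, 0.95):
    change = net_energy_change(share, shader_saving=0.10, top_level_increase=0.20)
    print(f"shader share {share:.0%}: net energy change {change:+.1%}")
```

Under these assumed inputs the GPU comes out ahead even though the Top Level’s energy use rises twice as fast in relative terms, simply because the shader cores dominate the budget.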
Better energy efficiency in the G78 is also achieved thanks to a new fused multiply-add (FMA) unit. It’s been completely redesigned from the ground up, leading to a 30% energy reduction for the unit. The FMA unit is responsible for most of the calculations that happen inside a GPU, and that’s why it made sense for ARM to target it for energy reductions.
A GPU’s parallel data processing capability makes it suitable for running ML workloads, although ARM does acknowledge that the CPU and NPU remain the primary processors for ML. As use cases get more complex, some workloads will be offloaded to the GPU. The main ML use cases for the GPU are linked to security features on the device, different camera and video modes, and applications with AR features.
ML on the GPU enables experiences such as face tracking within the photo or video frame, games that use AR features, and more. For these ML-based tasks, the Mali-G78 features a 15% average performance improvement for various ML workloads compared to the G77. The G77 brought a 60% improvement in ML performance over previous generations, so the year-over-year improvement this year is much smaller. Asynchronous Top Level is vital in boosting ML performance, as the independent clocking of the shader cores helps with the various ML use cases on the GPU.
Then, there is the announcement of the Mali-G68. This is nothing but a narrower variant of the Mali-G78, just as the Mali-G57 was a narrower variant of the Mali-G77. ARM says this is the first sub-premium Mali GPU for 2021 devices. It has all of the G78’s features such as tiler improvements and the new FMA unit in the execution engine but supports up to 6 cores instead of 24. Near-premium performance at a lower cost is the aim of this GPU.
ARM developed this sub-premium GPU tier after listening to feedback from partners who wanted premium features across their portfolio of devices. The G68 has a lower silicon area, as expected, and brings high-performance gaming to a wider audience of developers and consumers.
Finally, ARM mentions its developer partnerships. It says it makes it easy for developers to optimize their content to run better on Mali GPUs (in theory). One example is the Performance Advisor. Another is ARM’s collaboration with Unity to bring the Burst Compiler to Android. Details on this can be read in the source article.
Mali-G78 – Outlook
The outlook for the Mali-G78 is bleak. It seems as if ARM just isn’t interested in making substantial year-over-year performance improvements in the mold of those Apple is making now and Qualcomm made in the past. While Qualcomm’s rate of improvement has also slowed, its baseline is higher than ARM’s. It looks bad for the Android ecosystem when reviewers state with numerical evidence that the A13’s GPU’s sustained performance is higher than the Snapdragon 865’s peak performance. The performance delta between Apple and Android GPUs is only growing wider.
The G78, therefore, isn’t a magic solution to solve ARM’s Mali GPU woes and bring them to the top of the performance charts. It will still be ranked below Apple and Qualcomm’s GPUs. It will be the default choice for some SoCs just because it’s ARM’s stock GPU IP, and custom solutions have barriers to entry and cost more as well.
Next year, it’s doubtful whether Samsung Systems LSI will actually end up using the Mali-G78. Samsung has been a high-profile customer of Mali GPUs, but last year, it signed a partnership with AMD to bring the RDNA GPU architecture to its mobile SoCs in 2021. If that roadmap remains on track – and at this point we have no reason to suspect it isn’t on track – then the Exynos 990’s successor will feature an AMD RDNA GPU instead of a Mali GPU. It will, indeed, be a big design loss for ARM. Even other vendors such as MediaTek have more options these days. Imagination Technologies’ new A-series GPU architecture has a design target for higher performance than the G78, and it’s possible that MediaTek switches away from Mali in the future. Qualcomm, of course, has no reason to abandon its Adreno GPU efforts, which still remain best-in-class in terms of performance and efficiency when talking exclusively about the Android smartphone market.
Thus, it’s clear that ARM will need to increase the rate of yearly improvements in Mali GPUs to make a real difference in the mobile GPU market. If it can’t do this, it faces the risk of being made an afterthought in the premium flagship mobile GPU space.
ARM Ethos N78
Finally, ARM has also announced the Ethos N78 neural processing unit (NPU). It’s the successor of the N77 NPU. It delivers greater on-device ML capabilities and up to 25% more performance efficiency. Configurability is also a strength, as available configurations range from 1 TOP/s up to 10 TOP/s. For more details, check out ARM’s blog post. This NPU will probably have limited design wins as Qualcomm, Samsung, HiSilicon, and MediaTek all have their own Neural Processing Units/AI Engines.