Samsung Austin R&D Center reveals details of its unreleased Exynos M6 CPU microarchitecture
We know that the custom CPU core project at Samsung’s Austin Research & Development Center (SARC) came to an end in October 2019. For a project that was promoted with such fanfare with the launch of the Exynos M1-featuring Exynos 8890 in 2016, it was a sad end. Why did SARC fold up the project? The Exynos M5 custom core, featured in the Exynos 990 SoC, is the last Samsung-designed fully custom core for the foreseeable future, and in hindsight, it’s easy to see why Samsung gave up on custom cores: they simply weren’t competitive enough. It is now known that the Exynos M5 core has a 100% power efficiency deficit against ARM’s Cortex-A77, which says a lot. Yet, it didn’t have to turn out that way. The Exynos M1 and Exynos M2 designs showed some promise, and the custom CPU core project was, at that time, viewed as important for the sake of competition in the mobile CPU space. The Exynos M3 was a big downturn despite the major IPC increase, and the Exynos M4 and Exynos M5 failed to keep up with ARM’s stock CPU IP. What were the microarchitectural changes in the next custom core, the cancelled Exynos M6?
Up until now, the answer to that question was unknown. Now, though, the SARC CPU development team has presented a paper titled “Evolution of the Samsung Exynos CPU Architecture” (which we came to know via AnandTech) at the International Symposium on Computer Architecture (ISCA), an ACM/IEEE conference. It reveals a lot of details about previous Exynos M series CPUs as well as the architecture of the cancelled Exynos M6.
The paper presented by SARC’s CPU development team details the team’s efforts over its eight-year existence, and also reveals key details of the custom ARM cores ranging from the Exynos M1 (Mongoose) to the current-generation Exynos M5 (Lion), and even the unreleased Exynos M6 CPU, which, prior to cancellation, was expected to feature in the Exynos 990’s 2021 successor.
Samsung’s SARC CPU team was established in 2011 to develop custom CPU cores, which were then featured in Samsung Systems LSI’s Exynos SoCs. The first Exynos SoC to use a custom core was the Exynos 8890, which was featured in 2016’s Samsung Galaxy S7. The custom cores remained a part of Exynos SoCs until the Exynos 990 with the Exynos M5 cores, which powers the Exynos variants of the Samsung Galaxy S20. (The upcoming Exynos 992, likely to appear in the Galaxy Note 20, is expected to use ARM’s Cortex-A78 and not the Exynos M5.) However, SARC had completed the Exynos M6 architecture before the CPU team learned in October 2019 that it was being disbanded, with the disbandment taking effect in December.
The ISCA paper features an overview table of the microarchitectural differences between Samsung’s custom CPU cores from the Exynos M1 to the Exynos M6. Some of the well-known characteristics of the design had been disclosed by the company in its initial M1 CPU architecture deep dive at the Hot Chips 2016 event. At Hot Chips 2018, Samsung gave a deep dive on the Exynos M3. The paper now also details the architecture of the Exynos M4 and the Exynos M5 cores, as well as that of the M6.
AnandTech notes that one key characteristic of Samsung’s designs over the years was that they were all based on the same blueprint RTL that started with the Exynos M1 Mongoose core. Samsung continued to make improvements in the functional blocks of the cores over the years. The Exynos M3 represented a change from the first iterations as it substantially widened the core in several respects, going from a 4-wide design to a 6-wide mid-core. (The Apple A11, A12, and A13, on the other hand, have a 7-wide decode, while the Cortex-A76, A77, and A78 are 4-wide. The Cortex-X1 increases the decode width to 5-wide.)
The report also makes some disclosures regarding the Exynos M5 and the M6 that weren’t public before. For the Exynos M5, Samsung made bigger changes to the cache hierarchy of the cores, replacing private L2 caches with a new, bigger shared cache. It also discloses a change in the L3 structure from a 3-bank design to a 2-bank design with lower latency.
The cancelled M6 core would have been a bigger jump in terms of the microarchitecture. SARC had made large improvements such as doubling the L1 instruction and data caches from 64KB to 128KB – AnandTech notes that this is a design choice that has only been implemented by Apple’s A-series cores so far, starting with the Apple A12.
The L2’s bandwidth would have doubled to 64B/cycle, while the L3 would have grown from 3MB to 4MB. The Exynos M6 would have been an 8-wide decode core. As noted by AnandTech, this would have been the widest commercial microarchitecture currently known in terms of decode. However, even though the core was much wider, the integer execution units didn’t see a lot of change. One complex pipeline added a second integer division capability, while the load/store pipelines remained the same as on the M5, with one load unit, one store unit, and one load/store unit. The floating-point/SIMD pipelines would have gained a fourth unit with FMAC capabilities. The L1 DTLB was increased from 48 pages to 128 pages, and the main TLB was doubled from 4K pages to 8K pages (32MB coverage).
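The 32MB coverage figure is simply the number of TLB entries multiplied by the page size. A quick sketch of that arithmetic follows, assuming 4KB pages (the page size isn't stated here; it is the value that makes 8K entries come out to 32MB). The function name is ours, for illustration only.

```python
# TLB reach = number of entries x page size.
# Assumption: 4 KiB pages (not stated in the article; it is the value
# consistent with "8K pages (32MB coverage)").
PAGE_SIZE_BYTES = 4 * 1024


def tlb_reach_mib(entries: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Return how much memory (in MiB) a TLB with `entries` entries can map."""
    return entries * page_size / (1024 * 1024)


m5_main_tlb = tlb_reach_mib(4 * 1024)  # 4K entries on the M5
m6_main_tlb = tlb_reach_mib(8 * 1024)  # 8K entries on the M6
m6_l1_dtlb = tlb_reach_mib(128)        # 128-entry L1 DTLB on the M6

print(f"M5 main TLB reach: {m5_main_tlb} MiB")
print(f"M6 main TLB reach: {m6_main_tlb} MiB")
print(f"M6 L1 DTLB reach:  {m6_l1_dtlb} MiB")
```

Under the same assumption, the M5’s 4K-entry main TLB would cover 16MB, and the M6’s 128-entry L1 DTLB would cover 512KB.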
The Exynos M6 would have represented another significant change from its predecessors by increasing the out-of-order window of the core for the first time since the M3. There would have been larger integer and floating-point physical register files, and the ROB (Reorder Buffer) would have grown from 228 to 256 entries. AnandTech notes that one important weakness of the custom Exynos cores is still present on the M5 and would have been present on the M6 as well: deeper pipeline stages that result in an expensive 16-cycle mispredict penalty, higher than the 11-cycle mispredict penalty of ARM’s CPU cores. The SARC paper goes into even more depth on the branch predictor design, showcasing the CPU core’s Scaled Hashed Perceptron based design. This design was improved continuously over the years and implementations, raising branch accuracy and reducing the mispredicts per kilo-instruction (MPKI). SARC presents a table that shows the amount of storage that the branch predictor’s structures take up within the front-end. The paper also details the core’s prefetching technologies, the introduction of a µOP cache in the M5, and the team’s efforts to harden the core against security vulnerabilities such as Spectre.
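To get a feel for why the mispredict penalty matters, a back-of-the-envelope sketch helps: the cycles lost to mispredicts per thousand instructions are roughly the MPKI multiplied by the penalty. The MPKI value below is a hypothetical placeholder, not a figure from the paper, and the model ignores any overlap with other stalls.

```python
def mispredict_cycles_per_kilo_instr(mpki: float, penalty_cycles: int) -> float:
    """Approximate cycles lost to branch mispredicts per 1,000 instructions."""
    return mpki * penalty_cycles


HYPOTHETICAL_MPKI = 3.0   # assumption for illustration only
EXYNOS_PENALTY = 16       # cycles, as quoted for the M5/M6
ARM_PENALTY = 11          # cycles, as quoted for ARM's cores

exynos_cost = mispredict_cycles_per_kilo_instr(HYPOTHETICAL_MPKI, EXYNOS_PENALTY)
arm_cost = mispredict_cycles_per_kilo_instr(HYPOTHETICAL_MPKI, ARM_PENALTY)

# MPKI the deeper pipeline would need in order to lose no more cycles
# than the shallower one does at HYPOTHETICAL_MPKI.
break_even_mpki = HYPOTHETICAL_MPKI * ARM_PENALTY / EXYNOS_PENALTY

print(f"Exynos-style core: {exynos_cost} cycles lost per 1,000 instructions")
print(f"ARM-style core:    {arm_cost} cycles lost per 1,000 instructions")
print(f"Break-even MPKI for the 16-cycle core: {break_even_mpki}")
```

In this simple model, a core with a 16-cycle penalty needs a branch predictor with roughly 31% fewer mispredicts just to break even with an 11-cycle core, which helps explain why so much effort went into the predictor.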
Efforts to improve memory latency in the custom Exynos cores were also detailed by SARC in the paper. In the Exynos M4, the SARC team included a load-load cascade mechanism that reduced the effective L1 latency from four cycles to three on subsequent loads. The M4 core also introduced a path bypass with a new interface from the CPU cores directly to the memory controllers, which avoided traffic through the interconnect. According to AnandTech, this explained some of the bigger latency improvements the publication was able to measure with the Exynos 9820. The Exynos M5 introduced a speculative cache lookup bypass, which issues a request to both the interconnect and the cache tags simultaneously. This can save latency on a cache miss, because the memory request is already underway by the time the miss is detected. The average load latency was also continuously improved over the generations, from 14.9 cycles on the M1 to 8.3 cycles on the M6.
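The idea behind the speculative cache lookup bypass can be illustrated with a toy latency model: if the memory request only starts after the tags report a miss, the two latencies add up, whereas issuing both at once lets them overlap. All cycle counts and function names below are hypothetical placeholders, not SARC’s implementation or measured figures.

```python
def miss_latency_serial(tag_lookup: int, memory: int) -> int:
    """Memory request starts only after the cache tags report a miss."""
    return tag_lookup + memory


def miss_latency_speculative(tag_lookup: int, memory: int) -> int:
    """Memory request is issued alongside the tag lookup, so the two overlap."""
    return max(tag_lookup, memory)


TAG_LOOKUP_CYCLES = 20   # hypothetical
MEMORY_CYCLES = 200      # hypothetical

serial = miss_latency_serial(TAG_LOOKUP_CYCLES, MEMORY_CYCLES)
speculative = miss_latency_speculative(TAG_LOOKUP_CYCLES, MEMORY_CYCLES)

print(f"Serial miss latency:      {serial} cycles")
print(f"Speculative miss latency: {speculative} cycles")
print(f"Cycles saved on a miss:   {serial - speculative}")
```

The toy model only covers the miss case, and it ignores the cost of speculative requests that turn out to be unnecessary when the lookup hits.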
While the above microarchitectural characteristics are quite technical, CPU enthusiasts will be familiar with the term Instructions Per Clock (IPC), which measures how much work a core completes per clock cycle (it’s one of the two main factors determining single-thread CPU performance, the other being the clock speed of the core). Integer IPC and floating-point IPC both contribute to overall IPC. The SARC team managed an average annual IPC improvement of 20% from the M1 to the M6. The M3, in particular, represented a big percentage improvement in IPC, although it was let down by other factors. The Exynos M5 represented a 15-17% improvement in IPC, while the unreleased Exynos M6 has been disclosed to have an average IPC of 2.71 versus 1.06 for the M1, representing a 20% improvement over the M5.
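The 20% annual figure can be sanity-checked from the two IPC endpoints quoted above. The sketch below assumes one generation per year, so that the M1 to the M6 spans five steps; the implied M5 IPC is derived from the quoted 20% uplift and is not a number taken from the paper.

```python
def per_generation_growth(start_ipc: float, end_ipc: float, generations: int) -> float:
    """Compound growth rate per generation between two IPC values."""
    return (end_ipc / start_ipc) ** (1 / generations) - 1


M1_IPC = 1.06
M6_IPC = 2.71
STEPS = 5  # M1 -> M2 -> M3 -> M4 -> M5 -> M6

total_gain = M6_IPC / M1_IPC
growth = per_generation_growth(M1_IPC, M6_IPC, STEPS)

# Derived value: the M5 IPC implied by "20% improvement over the M5".
implied_m5_ipc = M6_IPC / 1.20

print(f"Total IPC gain M1 -> M6: {total_gain:.2f}x")
print(f"Average gain per generation: {growth:.1%}")
print(f"Implied M5 IPC: {implied_m5_ipc:.2f}")
```

The compound rate comes out at a little over 20% per generation, which lines up with the average the team quotes.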
Brian Grayson, the paper’s presenter, answered questions about the program’s cancellation during the Q&A session. He said the team had always been on-target and on-schedule with performance and efficiency improvements with each generation. (Does that mean that the targets weren’t high enough in the first place?) The team’s biggest difficulty, on the other hand, was having to be extremely careful with future design changes, as the team didn’t have the resources to start from scratch or to completely rewrite a block. With hindsight, the team would have made different choices in the past with some of the design directions. In stark contrast, ARM has multiple CPU teams working in different locations that actually compete with each other. This allows for “ground-up re-designs” such as the Cortex-A76. The Cortex-A77 and the Cortex-A78 are the direct successors of the A76.
The SARC team had ideas for improvements for upcoming cores such as the hypothetical Exynos M7. However, it was supposedly someone very high up at Samsung who decided to cancel the custom core program. As AnandTech notes, the custom cores weren’t competitive in terms of power efficiency, performance, and area usage (PPA) compared to ARM’s CPUs of any particular generation. Last month, ARM announced the Cortex-X Custom program featuring the new Cortex-X1, a next-generation core intended for 2021 mobile devices. It has a design philosophy of breaking the Cortex-A PPA envelope and going for absolute performance instead. The Exynos M6, therefore, would have had a tough time competing with it. Even so, it seems Samsung won’t adopt the Cortex-X1 in the Exynos 992 and will go only with the Cortex-A78 + Cortex-A55 combo there, although the Cortex-X1 may be adopted in next year’s Galaxy S flagship.
The SARC team still designs custom interconnects and memory controllers for Samsung Systems LSI. It was also working on custom GPU architectures, but Samsung Systems LSI signed a deal with AMD to use AMD’s next-generation RDNA graphics architecture in future Exynos GPUs, starting in 2021.
Overall, the custom CPU core project was an illuminating lesson for mobile chip vendors on what can go wrong. The SARC CPU team had high ambitions of competing with Apple, which is the undisputed leader in the mobile CPU space. Unfortunately, it failed to compete with ARM, never mind Apple. The issues could have been solved, but year after year, SARC’s efforts were a step or two behind, and that showed in shipping products such as the Exynos 9810 variants of the Samsung Galaxy S9. Now, all major Android mobile chip vendors will use ARM’s stock CPU IP from 2021, and this list includes Qualcomm, Samsung, MediaTek, and HiSilicon. The fight will be taken to Apple with cores such as the Cortex-X1, not custom ARM cores designed from scratch.