Geekbench CEO Fireside Chat pt.2: OEMs Cheating on Benchmarks, Custom Cores, and Honest Manufacturers
Geekbench 4 was just released, and it has been making waves in the tech industry. It shows substantial improvements over Geekbench 3, and has seen a highly favourable reception.
The new version of Geekbench has resulted in a considerable shakeup of where devices rank in relation to each other. We’ve seen Qualcomm Snapdragon 820 chips fall, and designs built on ARM’s own cores rise, leading to questions as to what changes were made to the benchmark to cause this shift.
We had an opportunity to sit down with the CEO of Primate Labs, John Poole, to interview him about the launch of Geekbench 4, and about the mobile ecosystem as a whole. In part 2 we talk about OEMs cheating on benchmarks, custom cores, graphics performance and the input manufacturers have in the design of benchmarks.
Steven: On that note, there’s a bit of a sentiment right now that the whole “real men use custom cores” mentality that Qualcomm is pushing is more marketing than actual fact. A lot of people seem to think that Qualcomm would be better off using ARM’s cores even though they have their own custom ones. That it is more about trying to show “oh, we’re using our own thing, no one else has this”, rather than actual performance improvements.
John: There are a lot of smart people at Qualcomm and the Snapdragon used to be a great chip. The 800 was a great chip. All the chips that came before, [like] the S4, those are great chips. I think what’s happened is basically Apple showed up with a 64-bit [chip], coming back to that, and everybody panicked. I think what happened was there were roadmaps in place where it’s like “Okay, it’s 2013, the spec’s just been ratified. 2014 is when we’re going to start talking about shipping chips”, and then basically it’s like “Oh crap, we need to do this as soon as possible.” Because I mean you can have the argument whether or not you need 64-bit in a phone, but there was all sorts of other great stuff in ARMv8, so it just made sense, and as much as they’re not competing with Apple, they’re competing with Apple.
Steven: I think that is actually something that gets lost a lot. People don’t realize that it’s more than just the fact that it is 64-bit. More than just the 4GB limit, which Android doesn’t have anyway because of LPAE, Large Physical Address Extension. So that wasn’t even a concern and yet that seems to be what people focus on.
John: Yeah, it’s one of those things where you get wrapped up. It’s like “It’s 64-bit… and a bunch of other really cool stuff”, and all everybody ever thinks of is like “It’s 64-bit, why do I need 64-bit in my phone?” And you can see with the first versions, I was going back when I was doing some background research when I was writing up my article, I looked at what Ars Technica was writing around the time of iOS 7 and the iPhone 5s. It took until iOS 7.1 or 7.2 to iron out a lot of these little memory bugs. There were all sorts of issues because part of the downside of 64-bit is like “Hey, your pointers are now twice as big. This is great.” I think I’m not ready to write Qualcomm off yet. The 810 definitely had a lot of issues, and that they used…
Steven: Well, that issue is not just Qualcomm you’re talking about. Also Samsung with their own custom cores. How custom is it really? It seems to perform relatively closely to an A72 core.
“I think part of it is that people are still dealing with the aftershocks of taking their roadmaps, throwing them out, and scrambling”
John: But I think part of it is that people are still dealing with the aftershocks of taking their roadmaps, throwing them out, and scrambling. Now that things have settled down, now that they’ve had time to deal with that, I’m optimistic but then again as I was saying before, I’m optimistic about Zen. There may be times where company X is better off to use off-the-shelf cores and use the implementations that ARM gives you. At the same time, I think the diversity is good. I think you look at the x86 world, and basically we’ve got a monoculture now, and it’s all Intel all the time, as much as AMD has done some great things in the past. When we’re looking at our database, because we’ve got all the uploads from all the results, it’s predominantly Intel. Basically, you’ve got people using AMD because they love the AMD brand.
Steven: It’s like a 75/25 split, or maybe even a little bit more now.
John: I’d have to go look at it but I think it was an even bigger skew than that.
Steven: But speaking of Intel and AMD and your new GPU benchmark, what do you think of the integrated GPU market? There seem to be some interesting things happening there with Iris Pro and AMD potentially bringing back more Fusion cores to go with Zen.
John: Going back to what we were talking about before about Tick-Tock and how Intel was like “Here’s our new architecture, here’s our process improvement.” The integrated cores have basically been like tick-tick-tick-tick-tick. Every time you go to IDF, they’re like “Oh, here’s all the cool stuff. Yeah, okay, we’re making stuff faster, and Ivy Bridge is going to be like five percent faster than Sandy Bridge, isn’t this awesome?”
Steven: I mean, from Sandy Bridge to Skylake, we’re looking at what, a 10 times improvement for integrated GPUs?
John: Yeah. I mean like I’ve got a MacBook Pro as my primary laptop, and it’s got the 750M in there from Nvidia, and there are times where the Iris Pro in there (and this is Haswell) is competitive. The thing that you’ve got when you start going to integrated GPUs is one of the big things that always killed GPU compute early on was transfer bandwidth. You have to take all your data, shove it across a PCIe bus, and then it has to sit on a card, and once the card has figured it out you have to pull it all back down, and you’re sucking down through like this little tiny straw. PCIe 4.0 is a big improvement over what we used to have, but if it’s all sitting in the same memory, with OpenCL 2 you can just basically say to your GPU “Oh, here’s a pointer to the huge array. Go have fun.” It makes it a lot more compelling, and we’re seeing that a lot especially on mobile.
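To put rough numbers on the “tiny straw”, here is a minimal back-of-envelope sketch in Python. It is not taken from Geekbench: the function names, the 1 GB payload, the kernel times, and the choice of a PCIe 3.0 x16 link at its theoretical peak (about 15.75 GB/s) are all assumptions made for illustration.

```python
from typing import Optional

# Illustrative assumption: theoretical peak of a PCIe 3.0 x16 link, in GB/s.
PCIE3_X16_GBPS = 15.75


def round_trip_copy_ms(payload_gb: float, link_gbps: float) -> float:
    """Time to push a buffer to a discrete card and pull the result back."""
    one_way_s = payload_gb / link_gbps
    return 2 * one_way_s * 1000.0


def total_ms(kernel_ms: float, payload_gb: float,
             link_gbps: Optional[float]) -> float:
    """Kernel time plus copy time.

    link_gbps=None models a GPU that shares memory with the CPU,
    so it is handed a pointer and no copy is needed.
    """
    copy_ms = 0.0 if link_gbps is None else round_trip_copy_ms(payload_gb, link_gbps)
    return kernel_ms + copy_ms


if __name__ == "__main__":
    # Hypothetical workload: 1 GB of data. The discrete GPU is assumed to run
    # the kernel twice as fast (20 ms vs 40 ms) but has to pay for the copies.
    discrete = total_ms(kernel_ms=20.0, payload_gb=1.0, link_gbps=PCIE3_X16_GBPS)
    integrated = total_ms(kernel_ms=40.0, payload_gb=1.0, link_gbps=None)
    print(f"discrete:   {discrete:.1f} ms")
    print(f"integrated: {integrated:.1f} ms")
```

Under these made-up figures the copies dominate the discrete card's total, which is the effect John is describing; with a small payload or a long-running kernel the balance can tip the other way.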
“[Integrated graphics] work really well, and I’d say for the vast majority of people they’re fine”
Whenever we go to a conference that’s talking about doing compute on GPUs or any power efficiency type thing, we’re seeing huge wins by using the GPU in terms of both power and in terms of time. It’s faster and uses less power. So, I think the integrated graphics, if I wasn’t doing CUDA work, if I wasn’t doing this sort of development work where it’s great to have as much variety as possible, I would just go with integrated graphics. They work really well, and I’d say for the vast majority of people they’re fine. They run cooler. Now I mean Nvidia Pascal is an amazing architecture. I know Acer is going crazy and shoving those two 1080s, but when they announced their mobile lineup and they basically said “It’s the same as the desktop, we’ve just underclocked a little bit.”
Steven: That’s a little crazy.
John: Yeah. You can get a 1070, and it takes one PCIe connector. It’s not like the two 8 pins that you used to have to do. I remember when Nvidia started really pushing down this path with the 700 series, and you could get the 750 and it was all bus powered and you actually could play games on it, and it was great. I’m hoping with integrated graphics they continue along the path, and what I’d love to see is AMD, like I know AMD’s been pushing the APU concept for a while but unfortunately they’ve had good GPUs with OK CPUs. If we can get a good GPU and CPU in the same package, I think that would be really exciting. I’d love to see systems built on that.
Steven: That’s pretty much what Iris Pro is right now, you just don’t see many chips. You don’t see very many people using it, partially because it’s more expensive.
John: It’s more expensive, and I think people still have the whole…
Steven: They want the little GPU sticker!
John: Yeah, and I think people just get a little irrational sometimes when they’re talking about performance. And this is something that we’ve seen too. People have preconceived notions of how things should perform and sometimes they’re based on fact, and sometimes they’re just based off, I hate to say, prejudices. They’re sort of like “Well, discrete GPUs are better. I’m not gonna buy a system with an integrated GPU.” Okay, so when is it going to be slower? “Well that thing I do.” OK, can you quantify that? “No. Screw you.” But I’ve done some gaming on integrated graphics, and I mean yeah you can’t run the Crysises of the world at full tilt…
Steven: But you can run Civ 5, XCOM…
John: Exactly. Which is often all you need. I think they will continue to get better, and when it gets to a point in time when all graphics are integrated and you’ve all got this nice big homogeneous memory space, I’m hoping more and more apps can use this compute stuff that we’re pushing. That’s why we’re pushing this, and I’m hoping companies like AMD and Nvidia can continue to push this as well. I think there’s some really compelling stuff in there.
Steven: Jumping back a bit to the Nvidia Denver stuff, we see a lot of chips with different specialities, especially Denver with its fantastic performance with anything that has a loop. Have you done anything different with the benchmark to highlight it or avoid it?
John: One of the things we did when we did Geekbench 3 was, I mean a lot of our stuff was more loopy. It was smaller kernels, it was a lot more… I don’t want to say simplistic because we were still trying to figure out like “What Photoshop does, let’s do what Photoshop does. What this other application does. Let’s do that”, but a lot of that stuff was running on nice uniform memory access in a loop, so Denver or some other cores might have been at an advantage because they’re doing dynamic binary translation. It’s kinda like the Transmeta of the 21st century. I believe they bought a bunch of their patents. My assumption is basically that Denver is sort of the brainchild of Transmeta or at least some of the Transmeta bits. Nvidia and Intel picked up parts of Transmeta. Different parts, and I saw something on Twitter yesterday where someone from Intel was talking about working on the power modeling team for, basically, Intel’s Denver. They were talking about like the dynamic binary translation, and this person on LinkedIn was talking about it and doing power performance analysis, and I’m like oh, that’s an interesting thing to talk about. So I mean Intel has got something like this presumably too under the covers. So when we did Geekbench 4, one of the things is we started looking at bigger kernels.
We started looking at “What are larger code bases that people are using?” and we moved away from the loop structure and all that. And this is kind of what we’re seeing with the fallout. Some things in Geekbench 3 and 4 slot in roughly the same place. But some of these chips that might have had an unnatural advantage because we had a particular style of kernel that was more predominant, they’re getting knocked down because we’re using these larger code bases where you’ve got much larger code footprints. We’re evicting stuff in and out of icache a lot more than we used to, so they’re not seeing the same benefit. That’s one of the things that we did. Let’s move away from these more loopy type things where they might still be able to model performance, but maybe not exactly. Let’s use a larger code base, and then if we’re running what the phone, or what the tablet, or what the laptop is running anyway, and the chip still does well, then who cares, because you’re doing the same thing. Whereas before we were trying to sort of synthesize things down into smaller and smaller kernels that, again, would work well on these 1st, 2nd, 3rd generation smartphones. We were a little more open to gaming that way, and not to say that any of these companies were sorta like “Let’s get the best Geekbench score. Let’s completely change our system, change our CPU design so that we score really well on Geekbench.”
Steven: Speaking of gaming the benchmark, with Geekbench 3 there were a lot of companies attempting to find a way to artificially boost their scores relative to where they probably should have been. You even had to exclude them from your official performance charts. Has there been anything done in Geekbench 4 to try to prevent that from happening again?
John: One of the things with Geekbench 3 was that once you start getting onto newer phones, since we had these splits between the mobile dataset and the desktop dataset, that you can see like if you were running on an iPhone 6S or a Galaxy S6, you’re talking maybe 20 seconds. It was fast. Now we’re talking three, four, five minutes.
Steven: Which is still not bad.
John: It’s still not bad. It’s reasonable. I mean, if you’re running benchmarks every day when you’re developing them it gets a little tedious, but I mean for the average consumer we think that’s an acceptable increase in runtime. Five minutes isn’t the end of the world. I can’t remember what 3DMark is at. I know there have been times I’ve been bored running 3DMark.
Steven: I think they’re also launching a new version or just launched a new version.
“Once we have these large runtimes if you start gaming things by ramping up your clock speeds (…) you’re going to start putting actual real danger in the phone.”
John: If you’re going to game it by like “Oh, it’s Geekbench, high performance mode on”, you won’t get as much out of it. You might still get a couple percent, but is it really worth it? And we had some frank discussions with the companies that were doing this, and basically said we’ve got a separate build that does not look like Geekbench, it strips out all the identifiers, it strips out all the stuff, it looks like “Bob’s Mini Golf Putt” or something like that, and we run that, and once we start seeing differences that’s where you start going like “Ok, what’s going on here?”, and we’re not seeing much of that now. Companies seem to have gotten wise to the fact that we stood up and we called them out. I know Samsung and Sony were really bad, but we had discussions with them and it seemed to be that they kind of understand now that it’s not the way to go. Because I remember AnandTech called them out, Ars Technica called them out…
Steven: It’s in the public sphere.
John: Yeah, it’s not a winning strategy.
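The disguised-build check John describes boils down to running the same workloads under two identities and comparing the results. A minimal sketch of that comparison in Python follows. It is not Primate Labs’ actual tooling: the function names, the example scores, and the 3% tolerance are hypothetical, chosen only to illustrate the idea.

```python
def score_uplift(labeled_score: float, disguised_score: float) -> float:
    """Relative gain of the recognizable build over the disguised build."""
    if disguised_score <= 0:
        raise ValueError("disguised_score must be positive")
    return (labeled_score - disguised_score) / disguised_score


def looks_boosted(labeled_score: float, disguised_score: float,
                  tolerance: float = 0.03) -> bool:
    """Flag a device when the build that identifies itself as the benchmark
    scores more than `tolerance` above the build with identifiers stripped.

    The tolerance is an assumed allowance for ordinary run-to-run noise.
    """
    return score_uplift(labeled_score, disguised_score) > tolerance


if __name__ == "__main__":
    # Hypothetical scores for two devices: (labeled build, disguised build).
    devices = {"device_a": (4200.0, 4000.0), "device_b": (4040.0, 4000.0)}
    for name, (labeled, disguised) in devices.items():
        verdict = "investigate" if looks_boosted(labeled, disguised) else "ok"
        print(name, verdict)
```

In practice one would likely average several runs per build before comparing, since thermal state and background activity also move scores around.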
Steven: It definitely causes quite a few issues. Speaking of communicating with the manufacturers, how much input do they have into upcoming versions? Do you speak with them a lot about stuff like what chips are coming up and how to prepare for them?
John: Generally speaking what we’ll do is when we do a new benchmark, we run what we call pre-release source licensing. I know Futuremark calls it a BDP, benchmark development program. So we’ll reach out to people and basically say “Hey, we’ve got a new version coming out in a year or a year and a half or something like that. Would you like to license the source code either post release or pre-release?” and some companies are like “Yeah, we’ll want to play around with that, just we’ll do it after release.” Other companies are really keen to see what’s going on, and they’ll provide feedback either at a very high-level saying “Oh, we think this is really exciting.” One of the things we’re talking about for Geekbench 5 is machine learning, because everybody now is talking about doing machine learning, TensorFlow and all this stuff. That’s something that people have already said to us “Oh, that would be really exciting”.
For the Geekbench 4 timeframe, it would be like “Oh, we really want to see SQLite” or “We like this benchmark, but we want to make it better” or “We want to see this or that.” And then the problem is that we’ve got 10 to 15 of these companies all talking to us at once, and the problem for us is we’d like to think that everybody is sort of like “Yeah, we want to see you guys make a really great benchmark” because that makes the industry better as a whole and at the same time you never quite know if someone’s like “Well, we really want you to do this because our chip does this really well, and everybody else does it really poorly”, and just balancing those requests and trying to come up with a reasonable alternative.
Steven: And to some extent you saw that in the PC gaming market recently, with the tessellation performance difference between Nvidia and AMD, and tessellation usage being bumped up like crazy on GameWorks games.
John: And that’s the sort of thing we want to avoid. For 3 we had a much smaller program, but we were growing bigger and we’re becoming more and more of an industry standard benchmark, so for Geekbench 4 we had a lot of input, and generally what we found was stepping back and trying to evaluate things as objectively as possible, looking at “Is this a good benchmark? Okay. Is this a benchmark that’s going to benefit the person who is suggesting it? Is this a change that’s going to benefit the person who is suggesting it or is it going to benefit other people?” There’s one company, it’s almost at the point where all their changes are benefiting their competitors more than them. And they’re like “No, this is the right thing to do.” And they come to us and it’s just like “We really want you to do this”, and I’m like “Okay, well you know this makes everybody else look ten percent better, and you don’t change”, and they’re like “No, it’s completely the right thing to do.” There are good actors out there. There are good companies that really do care about the quality of the code we’re producing. There are others who are sort of like “We’d really like to see you do this”, and it’s like “Well, that’s not going to benefit anyone else.” It’s definitely a balancing act, and it’s us quite frankly saying no a lot more than we say yes to things. Because at the end of the day it’s really great being able to work with these companies and it’s really great taking their input, but at the end of the day, our primary concern is “Is our benchmark objective?” and “Is it doing the right thing for our users?”
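The question John keeps asking of each request, “who does this change actually benefit?”, can be made concrete with a small calculation. The Python sketch below is only an illustration of that reasoning, not a description of how Primate Labs evaluates proposals: the function names, the vendor labels, and the scores are all made up.

```python
from statistics import median
from typing import Dict


def relative_gains(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-vendor relative score change caused by a proposed benchmark change."""
    return {vendor: (after[vendor] - before[vendor]) / before[vendor]
            for vendor in before}


def proposer_advantage(before: Dict[str, float], after: Dict[str, float],
                       proposer: str) -> float:
    """Proposer's relative gain minus the median gain of everyone else.

    A clearly positive value suggests the change mainly helps whoever asked
    for it; zero or negative suggests it helps others at least as much.
    """
    gains = relative_gains(before, after)
    others = [gain for vendor, gain in gains.items() if vendor != proposer]
    return gains[proposer] - median(others)


if __name__ == "__main__":
    # Hypothetical scores mirroring John's example: everyone else looks
    # ten percent better and the proposer does not change.
    before = {"vendor_a": 1000.0, "vendor_b": 1000.0, "vendor_c": 1000.0}
    after = {"vendor_a": 1000.0, "vendor_b": 1100.0, "vendor_c": 1100.0}
    print(proposer_advantage(before, after, proposer="vendor_a"))
```

A number like this only frames the conversation; whether the underlying workload is a good benchmark is still the judgment call John describes.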
Steven: What was the process of developing Geekbench 4 like? What would you like to improve? How can you make the results more relevant to actual performance?
John: Some of the feedback we got was that Geekbench 3, the integer section in particular, was very cryptography and very compression heavy. So obviously we wanted to move right away from that.
Steven: To some extent, the new requirement for encryption has resulted in a boost in importance for that.
John: Yeah, and that’s one of the reasons why we kept the AES benchmark in Geekbench 4 and why we give it so much weight is if you look at some of the reviews for low end handsets, now that you’ve started requiring encrypted file systems, now that you start to require that everything is going through AES, if you’ve got [AES] instructions, you can look at usability charts and it’s a night and day difference. Your phone is faster, it runs better, it uses less power. So those sorts of things are important, but the concern was that there was an unhealthy mix towards that direction, so that was one of the things we wanted to address. We also wanted to pull in, as I said before, larger more ambitious codebases. I think we’ve got a good coverage. A lot of people like to talk about how everything should be big and complicated, well, there are still times where small tight loops are really important, like image editing, that sort of thing. The Photoshops of the world, they’re working in small tight loops.
Steven: It’s the same thing over and over again, and that’s why it’s so good for GPU accelerated compute.
John: Exactly. And as time goes on, hopefully more and more of that stuff can get offloaded, leaving the CPUs to do the big complicated messy things. We wanted to pull in some bigger workloads, like the LLVM workload which we sort of use as a stand-in for JIT compilation. You’re taking this big piece of source code and spitting out binary code. That’s a big hairy problem and it’s something that you do. Web browsers use LLVM not so much, and this is the problem with a three-year lead time: everybody goes off and it’s like “We’re going to make our own JIT”, but it’s still representative. You’re still doing the same sort of operations on the same sort of data structures, and that’s really great.
We’re doing more stuff with HTML because that’s still big. People are still using web browsers, not everything’s an app now. Those sorts of things, making it more and more relevant. Also, making sure that when we start looking at things, going back to Andrei’s comment about how the rankings are now proper, taking a look at other sorts of performance. Taking a look at how do apps perform, how do, you know, these other things perform, and making sure that we’re not completely disagreeing with people here. Making sure that like “Okay, so web browsers are doing this, well on our HTML benchmarks they should kind of be in the same ballpark.” If we’re showing Intel is up here and ARM is down here when it’s actually the reverse, that’s no good.
That’s it for part two. Stay tuned as there is a lot of in-depth information and entertaining tidbits coming in part three of our fireside chat. We hope you enjoyed the chat and learned a thing or two!