Leak Suggests 'RTX 4090' Could Have 75% More Cores Than RTX 3090
Ada pushes the envelope, with TSMC 5nm and a big jump in core counts
A leaker going by @davideneco25320 on Twitter has shared some very specific details about Nvidia's next-generation Ada (aka Lovelace) GPUs, including SM counts and the names of each new die. If the data is accurate (and given the recent Nvidia hack, it very well could be), Ada will be a massive upgrade over Ampere, the RTX 30-series, especially for the flagship GPU. Still, this is leaked data that cannot be fully verified, so take these figures with a grain of salt.
"I made a little chart." pic.twitter.com/zilwXgi0va (March 1, 2022)
The leak shows that Nvidia will not be changing its nomenclature for the Ada generation, keeping the same two-letter prefix and three-digit number scheme as Ampere. AD102 denotes the flagship GPU, likely destined for an RTX 4090 or Titan-class card, with AD103 following as the next most powerful die (perhaps for a potential RTX 4080). AD104 and AD106 cover the midrange dies (i.e., RTX 4070 and RTX 4060), and AD107 will fill out the entry-level market for Nvidia's Ada GPUs (i.e., something like an RTX 4050).
Note also that these codenames suggest Nvidia will use Ada rather than the previously rumored Lovelace name, so that's how we'll refer to the future GPUs for now.
One thing that has changed significantly is the number of SMs in Ada. The flagship AD102 die will supposedly tip the scales at a whopping 144 SMs. By way of comparison, Ampere's GA102 only has 84 SMs, so this is a 71% increase in SM count, which should carry over to CUDA cores, RT cores, TMUs, and other elements. That would be one of the largest jumps we've ever seen in a single generation.
If Nvidia keeps the number of CUDA cores per SM the same for Ada, we could be looking at 18,432 CUDA cores for the flagship card. Nvidia's upcoming RTX 3090 Ti 'only' has 10,752 CUDA cores, using the full GA102 chip. Of course, we'll also see lesser variants that use partially harvested AD102 chips, and while 144 SMs may be the maximum, we wouldn't be surprised to see 10–20% of the SMs disabled for some graphics card models.
The SM counts for the other chips aren't nearly as high, though the numbers are still very respectable. AD103 will supposedly have the same 84 SMs as GA102, a 40% jump over GA103's 60 SMs. AD104 follows suit with 60 SMs, matching GA103 and offering 25% more SMs than GA104. AD106 sits a bit closer to GA106, with 36 SMs, a 20% uplift. Finally, AD107 will supposedly feature just 24 SMs, again a respectable 20% jump in SM count compared to GA107.
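To put those figures in context, here's a quick back-of-the-envelope calculation of the rumored SM uplifts and the CUDA core counts they would imply. Note the assumption that Ada keeps Ampere's 128 FP32 cores per SM; that hasn't been confirmed.

```python
# Leaked Ada SM counts vs. the comparable Ampere dies (rumored figures, not confirmed specs).
ampere_sms = {"GA102": 84, "GA103": 60, "GA104": 48, "GA106": 30, "GA107": 20}
ada_sms    = {"AD102": 144, "AD103": 84, "AD104": 60, "AD106": 36, "AD107": 24}

CUDA_CORES_PER_SM = 128  # assumption: Ada keeps Ampere's 128 FP32 cores per SM

for (ga, ga_sm), (ad, ad_sm) in zip(ampere_sms.items(), ada_sms.items()):
    uplift = (ad_sm / ga_sm - 1) * 100
    cores = ad_sm * CUDA_CORES_PER_SM
    print(f"{ad}: {ad_sm} SMs ({cores:,} CUDA cores), +{uplift:.0f}% SMs vs. {ga}")
```

Run as-is, that reproduces the numbers above: 18,432 cores and a 71% uplift for AD102, down to 20% uplifts for AD106 and AD107.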
If these leaks and rumors prove accurate, we can expect flagship cards like a future RTX 4090 and RTX 4080 to pack some incredible performance improvements over the current RTX 30-series. It's certainly a larger jump than Ampere over Turing, at least in some respects. The RTX 3080, for example, had the same 68 SMs as the RTX 2080 Ti, though there were plenty of other changes.
The above doesn't account for any additional performance improvements coming from the Ada architecture itself, which could bring further benefits. It has been rumored for some time that Ada will jump ship from Samsung back to TSMC and its N5 5nm node. That alone should provide significant improvements in efficiency and transistor count over Ampere, and may also unlock higher clock speeds.
Power limits could also rise for Ada GPUs thanks to the new 16-pin power connectors being developed and produced right now for future PCIe 5.0 graphics cards. With a maximum power delivery of 600W from a single plug, that would give Nvidia a ton of headroom to boost performance on Ada GPUs.
Ada may also be the first PCIe 5.0-compliant graphics solution, and while the increase in PCIe bandwidth might not matter too much, it certainly won't hurt performance. What we don't know is how much Nvidia plans to change the fundamental building blocks in Ada. For example, Turing had 64 FP32 cores and 64 INT32 cores per SM, which could run concurrently on different data. Ampere changed things so that the INT32 cores became combined INT32/FP32 cores, potentially doubling FP32 throughput.
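To illustrate the difference, here's a minimal sketch of per-SM FP32 issue rates under that description; it's a simplified model of the datapaths described above, not Nvidia's actual scheduler.

```python
# Simplified per-SM model: Turing pairs 64 dedicated FP32 units with 64 dedicated INT32 units,
# while Ampere has 64 dedicated FP32 units plus 64 units that can do either FP32 or INT32.
def fp32_ops_per_clock(dedicated_fp32, flexible_units, int_share):
    """FP32 ops issued per clock, where int_share is the fraction of the
    flexible units occupied by INT32 work (0.0 = none, 1.0 = all)."""
    return dedicated_fp32 + flexible_units * (1.0 - int_share)

turing = fp32_ops_per_clock(64, 0, 0.0)            # 64: INT32 runs on its own separate units
ampere_pure_fp = fp32_ops_per_clock(64, 64, 0.0)   # 128: theoretical doubling on pure FP32 work
ampere_mixed = fp32_ops_per_clock(64, 64, 0.5)     # 96: half the flexible units busy with INT32
print(turing, ampere_pure_fp, ampere_mixed)        # 64 128.0 96.0
```

In other words, the doubling only shows up when the shared datapath isn't busy with integer work, which is why real-world gains land well short of 2x.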
Ampere also features 3rd generation Tensor cores and 2nd generation RT cores for ray tracing. Ada will likely use 4th generation Tensor cores and 3rd generation RT cores. What will that mean? We don't have exact details, but Ada will almost certainly deliver far more performance than the current Ampere GPUs. There might be more CUDA, Tensor, and/or RT cores per SM, or the internal pipelines may simply be revamped to improve throughput.
Memory is another big factor in GPU performance, and it could play an even bigger role in improving frame rates considering how many SMs Ada may have. GDDR6+ and GDDR7 are already on Samsung's roadmap, promising substantial bandwidth improvements over GDDR6X, and Nvidia will likely use one or both of these new standards if they're ready in time for Ada production. After all, the more cores you have, the more memory bandwidth you need to feed them all.
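As a rough illustration of why that matters, peak memory bandwidth is simply the per-pin data rate times the bus width. The RTX 3090 figures below are real specs; the faster data rate is a purely hypothetical example of what next-gen memory might need to deliver to keep pace with a roughly 71% larger GPU.

```python
def peak_bandwidth_gbps(data_rate_gbps_per_pin, bus_width_bits):
    """Peak memory bandwidth in GB/s: per-pin data rate (Gbps) x bus width (bits) / 8."""
    return data_rate_gbps_per_pin * bus_width_bits / 8

rtx3090 = peak_bandwidth_gbps(19.5, 384)       # RTX 3090: 19.5 Gbps GDDR6X on a 384-bit bus -> 936 GB/s
hypothetical = peak_bandwidth_gbps(24.0, 384)  # hypothetical faster memory on the same bus -> 1152 GB/s
print(f"{rtx3090:.0f} GB/s vs. {hypothetical:.0f} GB/s")
```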
Generally speaking, Nvidia has improved performance on its fastest GPUs by around 30% with previous architectures, but with the change in process node and massively increased core counts, plus a potentially higher power limit, it's not unrealistic to expect even bigger improvements from Ada.
Will the RTX 4090 (or whatever it ends up being called) deliver twice the performance of the RTX 3090? That's ambitious but certainly not out of reach: 75% more cores with higher clock speeds and/or a more efficient architecture would do the trick. We'll find out more later this year, as Ada is expected to launch in the September timeframe.
Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.
hotaru.hino
The RTX 3090 had over twice as many shaders as the RTX 2080 Ti, but certainly didn't perform twice as fast.
USAFRet
hotaru.hino said: The RTX 3090 had over twice as many shaders as the RTX 2080 Ti, but certainly didn't perform twice as fast.
The same as in drive performance. 2x benchmark numbers do not equal 2x actual performance.
blppt
Hopefully we get a card that can actually handle ray tracing this time. Third time's the charm?
spongiemaster
hotaru.hino said: The RTX 3090 had over twice as many shaders as the RTX 2080 Ti, but certainly didn't perform twice as fast.
Nvidia changed the definition of a CUDA core going from Turing to Ampere, so while it could have twice the performance in certain compute workloads, in games that require a combination of INT and FP calculations there weren't really twice as many execution units. Nvidia itself stated that Ampere was a maximum of 1.7 times faster than Turing in rasterized graphics. Also, memory bandwidth didn't come close to doubling between Turing and Ampere.
hotaru.hino
spongiemaster said: Nvidia changed the definition of a CUDA core going from Turing to Ampere, so while it could have twice the performance in certain compute workloads, in games that require a combination of INT and FP calculations there weren't really twice as many execution units. [...]
Then let's look at some values. Or just one, because I'm feeling lazy.
Looking at page 13 (or 19 in the PDF) of the Turing whitepaper, there's a graph of the mix of INT and FP instructions in various games. I'm just going to pick out Far Cry 5 from this. Taking that graph, let's assume there were 0.4 INT instructions for every 1 FP instruction. Laid onto one of Turing's SMs, this means that for every 64 FP instructions there are only about 25 INT instructions, or a utilization rate of roughly 70%. For Ampere, since some CUDA cores can switch between FP and INT, we can balance which ones do what for better utilization. Doing some math, the best utilization in Far Cry 5's example works out to 90 FP instructions plus 36 INT instructions, using 126 out of 128 CUDA cores versus about 89 on Turing.
So right off the bat, without adding any more CUDA cores, Ampere should have about a 1.4x lead over Turing in this example. How much does Ampere gain in practice? 1.28x, based on TechPowerUp's 4K benchmark.
And sure, Ampere doesn't have 2x the memory bandwidth, but it has almost double the L1 cache, which should soak up the deficiency.
Either way, my commentary is pointing out the odd conclusion the article seems to hint at: that NVIDIA only needs to add 75% more shaders for double the performance.
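For anyone who wants to check that arithmetic, here's a minimal script reproducing the utilization model sketched in the comment above, with roughly 0.4 INT32 ops per FP32 op taken as the assumed workload mix:

```python
# Reproduce the whole-SM utilization math from the comment above.
INT_PER_FP = 0.4  # assumed Far Cry 5 instruction mix: ~0.4 INT32 ops per FP32 op

# Turing SM: 64 FP32 units + 64 INT32 units. The FP32 side saturates first.
turing_fp = 64
turing_int = turing_fp * INT_PER_FP              # ~25.6 INT32 ops issued alongside
turing_util = (turing_fp + turing_int) / 128     # ~70% of the 128 units busy

# Ampere SM: 128 units, 64 of which can run either FP32 or INT32.
ampere_fp = 128 / (1 + INT_PER_FP)               # ~91 FP32 ops when filled in a 1:0.4 ratio
ampere_int = ampere_fp * INT_PER_FP              # ~37 INT32 ops (fits within the 64 flexible units)
speedup = ampere_fp / turing_fp                  # ~1.43x theoretical FP32 throughput advantage

print(f"Turing utilization ~{turing_util:.0%}, theoretical Ampere advantage ~{speedup:.2f}x")
```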
sizzling
blppt said: Hopefully we get a card that can actually handle ray tracing this time. Third time's the charm?
Not had any problems with ray tracing on my 3080.
spongiemaster
hotaru.hino said: Then let's look at some values. Or just one, because I'm feeling lazy. [...]
Below is a comparison between 1/4 of a Turing SM and 1/4 of an Ampere SM [image].
As you can see, the number of FP32 CUDA cores doubled in Ampere, which is how Nvidia claims twice as many CUDA cores. However, you can also see that not everything doubled. Whereas each quarter of a Turing SM could concurrently perform 16x INT32 and 16x FP32 calculations, in an Ampere quarter-SM one 16-wide datapath can do either INT32 or FP32 while the other can only do FP32. For purely FP32 workloads you could see up to a theoretical doubling of performance, but for purely INT32 workloads you wouldn't see any performance improvement at all. Because workloads are usually more floating-point oriented than integer based (Nvidia claims 36 integer operations for every 100 floating point), Nvidia chose this layout to favor floating-point performance. In the real world of mixed game workloads, you're going to land somewhere between those 0 and 100% figures, with Nvidia claiming a maximum 70% improvement over Turing.
digitalgriffin
hotaru.hino said: The RTX 3090 had over twice as many shaders as the RTX 2080 Ti, but certainly didn't perform twice as fast.
I agree. Even if you could develop a driver that could dish out draw calls fast enough, there's a limit to how well the scheduler on the GPU can feed the cores. While more cores handle complex objects better, most objects aren't that complex. When you have 100 trees in the background, you need them to be simple, so scenes are composed of many low-to-moderate-complexity objects. This is also part of the thinking behind variable rate shading improvements. As a result, most draw calls underutilize the GPU's full potential.
This is part of the genius of Unreal's new engine. They reduce the complexity of a scene enormously in software by eliminating unseen details and simplifying the rest. The math behind it is genius. Mesh reduction has never been a simple CS problem, yet they reduced it to a simple O(n*n) problem, where n represents the number of layers of reduction.