It's no secret that the recent leaps in AI, including ChatGPT and Stable Diffusion, are impressive. These tools can create text, images, video, and more from a text prompt, with very little user input. The other thing they have in common is that they run in the cloud, on somebody else's computer, and can get expensive once subscription fees are taken into account. To save some money, you can run many AI tasks on your home computer, from running LLMs to working with the datasets that train them.

But exactly what hardware do you need to power these advanced algorithms? Well, while you don't necessarily need the best CPU to run deep learning tasks, you probably will need one of the best graphics cards. That's because the new technology in GPUs, like Nvidia's Tensor cores, is custom designed for accelerating AI tasks. You'll also want a hefty amount of VRAM, so that more data can sit in active memory, which saves time while training. That's why we've put together this list of the best GPUs for deep learning tasks, to make your purchasing decisions easier.

Our picks of the best graphics cards for deep learning use

MSI Gaming X Slim GeForce RTX 4070 Ti

Editor's choice: Plenty of power with a relatively affordable price tag

The MSI Gaming X Slim GeForce RTX 4070 Ti features fourth-generation Tensor cores, which are purpose-built for accelerating AI tasks. It's a cost-effective way of getting into deep learning models, but will hit VRAM limits on advanced tasks.

Pros:
240 fourth-generation Tensor cores
12GB of GDDR6X VRAM with 504.2GB/s of bandwidth
Huge heatsink with three fans

Cons:
Not enough VRAM for every AI task you might want to run
285W TDP requires a hefty PSU and cooling

$785 at Newegg

When picking a graphics card for deep learning tasks, the most important specification is how many AI accelerator cores are onboard. These purpose-built cores perform very efficient matrix multiplication, which is the most time-intensive type of calculation in any deep neural network. Until very recently, the only graphics cards with these types of cores came from Nvidia, with its Tensor Cores. With the Ada Lovelace architecture in Nvidia's 4000-series, these are fourth-generation and are greatly improved from their introduction in the Tesla V100. In RTX 4070 Ti models like this MSI Gaming X Slim, you get 240 Tensor Cores, which is plenty for running matrix calculations. They're actually so fast that they sit idle roughly half of the time during GPT-3-sized training, with the bottleneck being how fast data can arrive from global memory.

Thankfully, with 12GB of GDDR6X supplying 504.2GB/s of bandwidth, the RTX 4070 Ti can fill the Tensor cores back up with data pretty quickly. It's not as fast as the dedicated server cards that have up to three times the memory bandwidth, but it's fast enough for homelab use. And with 128KB of shared memory per SM and 48MB of L2 cache, we can use larger tile sizes to further reduce global memory access. That makes it faster to train any models whose data chunks fit into the L2 cache, as they only have to be loaded once. The main drawback of the RTX 4070 Ti is the 12GB of VRAM, which is enough for training smaller models or image generation tasks, but not really enough for using transformers. It's also a multi-slot design, even in this slim version, so fitting several into a PC chassis can be tricky. This could be resolved by using PCIe extenders, which don't hurt the performance of deep learning tasks as much as you'd think.
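To make the tiling point concrete, here is a rough Python sketch of how tile size relates to global memory traffic in a matrix multiplication. It assumes FP16 values (2 bytes each), square tiles, and that the whole 128KB of shared memory is available to one pair of tiles, which real kernels rarely get. The function names are made up for illustration.

```python
import math

SHARED_MEM_BYTES = 128 * 1024  # 128KB of shared memory per SM, as quoted above
BYTES_PER_VALUE = 2            # assumption: FP16 values


def max_square_tile(shared_bytes: int, bytes_per_value: int) -> int:
    """Largest T such that one T x T tile of A and one of B fit side by side."""
    return math.isqrt(shared_bytes // (2 * bytes_per_value))


def global_loads(n: int, tile: int) -> int:
    """Values read from global memory for an n x n by n x n matmul.

    There are (n/tile)^2 output tiles, each needing n/tile pairs of input
    tiles, and every pair holds 2 * tile * tile values. Edge tiles are
    counted as if padded to full size.
    """
    tiles_per_side = math.ceil(n / tile)
    return tiles_per_side ** 3 * 2 * tile * tile


if __name__ == "__main__":
    print(max_square_tile(SHARED_MEM_BYTES, BYTES_PER_VALUE))  # 181
    # Quadrupling the tile edge cuts global memory reads by the same factor.
    print(global_loads(4096, 32) / global_loads(4096, 128))    # 4.0
```

The takeaway is only the trend: bigger tiles mean each value fetched from global memory is reused more often, which is why larger shared memory and cache sizes help.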

ASUS TUF Gaming RTX 4070 OC

Best value: The best priced GPU with Tensor cores

$640 (was $680, save $40)

The ASUS TUF Gaming RTX 4070 OC is a great 1440p gaming card, but it's also well suited to deep learning tasks like image generation or running local text-based LLMs, as it has a large number of fourth-gen Tensor cores and 12GB of VRAM.

Pros:
184 fourth-generation Tensor cores for accelerating AI workflows
12GB of GDDR6X supplying 504.2GB/s of bandwidth
Large heatsink with three fans

Cons:
12GB isn't enough for every AI task
3.25-slot design makes it hard to use multiple GPUs

$640 at Amazon

The RTX 4070, like the more expensive Ti variant, has fourth-generation Tensor Cores which make short work of deep learning workflows. However, this time around, it has 184 of them, so it will have around 77% of the power available for those speedy matrix calculations. The L2 cache is also smaller, at 36MB, but that's still large enough to limit the number of calls to global memory. With 504.2GB/s of bandwidth from the 12GB of GDDR6X memory, it can keep the same size of dataset in active memory, and get to it just as fast as the Ti variant.
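The comparison above is just a ratio of Tensor core counts, which a few lines of Python can reproduce. The counts are the ones quoted in this list. Treating core count as a stand-in for performance is a simplification that ignores clock speed and memory bandwidth.

```python
# Tensor core counts as quoted in this list
TENSOR_CORES = {"RTX 4070": 184, "RTX 4070 Ti": 240, "RTX 4090": 512}


def relative_cores(card: str, baseline: str = "RTX 4070 Ti") -> float:
    """Ratio of a card's Tensor core count to the baseline card's count."""
    return TENSOR_CORES[card] / TENSOR_CORES[baseline]


if __name__ == "__main__":
    print(round(relative_cores("RTX 4070"), 2))  # 0.77
    print(round(relative_cores("RTX 4090"), 2))  # 2.13
```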

This particular model from ASUS has its cores factory overclocked to 2,580MHz, which might make a slight difference in AI-accelerated workflows. Overclocking the memory beyond its default 21Gbps would have more of an impact. The same limitations as the Ti variant also apply, namely that the size of the dataset you can fit into 12GB of memory limits the type of training you can do. Adding a second GPU of the same type and using parallel workflows can help, and that's where the RTX 4070 brings some improvements. With a 200W TDP, it's easier to both power and keep cool, and the 3.25-slot cooler with three axial fans will keep the core and memory at comfortable temperatures over the long stretches that training an AI model can take.

MSI Suprim Liquid X GeForce RTX 4090

Premium pick: Lots of VRAM and Tensor cores

The MSI Suprim Liquid X GeForce RTX 4090 card is a slim, watercooled variant of Nvidia's flagship, with a 240mm radiator to keep the card cool under any workload. That'll come in handy during AI tasks, which can take significant time to complete.

Pros:
512 fourth-generation Tensor cores for AI tasks
24GB of VRAM with 1,008GB/s of bandwidth
Watercooled for thermal performance

Cons:
450W power requirement
Need a large case to fit the radiator

$2200 at Newegg

When we reviewed the Nvidia GeForce RTX 4090 in its Founders Edition form at launch, we called it "the untouchable king of performance." That was based on gaming workloads, which it demolished at 4K resolution, with enough power left for 8K gaming if you have the monitor to display it. It also comes with a hefty 450W TDP, but we saw that it didn't go much over 400W during gaming loads, and with AI workloads leaving the Tensor Cores idle for roughly half the time, it's a fair bet that it won't get anywhere near that TDP. The FE variant, with its heatsink and two fans, stayed under 65C even under 420W workloads, and this particular MSI Suprim Liquid X comes with an AIO watercooler with a 240mm radiator to wick away heat from the core and memory. I expect it will stay well under that 65C mark during workloads, which means your expensive RTX 4090 should last longer than if it were running hotter.

As for AI workloads, the RTX 4090 has enough power for the trickiest jobs, like using transformers to train LLMs, with 512 Tensor Cores providing over twice the power of the RTX 4070 Ti. And with 24GB of GDDR6X and a 384-bit memory bus, it brings 1,008GB/s of bandwidth to your deep learning needs. That's double that of the RTX 4070 Ti, and two-thirds of the bandwidth of the substantially more expensive server-class GPUs with Tensor Cores. Make no mistake about it, this is the GPU that you should aim for when doing deep learning, and the only reason it's not getting the top pick in this list is that it's often out of stock everywhere, as companies buy them by the pallet load to run their own AI tasks.
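The bandwidth figure follows from the bus width and the per-pin data rate of the memory. Here is a minimal sketch of that arithmetic. It assumes the RTX 4090's GDDR6X runs at the same 21Gbps per pin quoted earlier for the RTX 4070, and the function name is illustrative.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width times per-pin rate, over 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


if __name__ == "__main__":
    # 384-bit bus at an assumed 21Gbps per pin
    print(memory_bandwidth_gbs(384, 21))  # 1008.0
```

This is a theoretical peak. Real workloads see less, depending on access patterns and how much of the data is served from cache.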

Nvidia H100

Spare no expense: For server-grade tasks

The Nvidia H100 is specifically built for AI-accelerated workflows in workstation or server installs, as it doesn't have any graphics output ports. With 80GB of VRAM, it can tackle advanced tasks like transformers or training LLMs for other uses.

Pros:
80GB of VRAM
PCIe 5.0
51 teraFLOPS of FP32 performance

Cons:
Costs as much as a midrange sedan
No fans, so it has to rely on server fans

$30100 at Amazon

The Nvidia H100 PCIe 80GB is one of the latest AI-focused professional graphics cards from the company, built to chew through AI-accelerated tasks in a server setting with up to eight of these expensive GPUs running in parallel. According to Tim Dettmers, it brings twice the relative performance of the RTX 4090 in 16-bit training, 16-bit inference, and 8-bit inference tasks. It has 456 Tensor Cores and 2TB/s of memory bandwidth from 80GB of HBM2e memory, and it's also the first GPU to support PCIe 5.0 for faster connections to the motherboard. It also supports NVLink, which directly connects the GPUs together, so they bypass the motherboard and CPU when passing data between them.
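One way to see why memory bandwidth matters as much as raw compute is a simple roofline estimate: a kernel can't run faster than the compute peak, nor faster than bandwidth multiplied by the work it does per byte fetched. The sketch below uses the 51 teraFLOPS FP32 figure and the 2TB/s bandwidth quoted for the H100 in this list. Tensor core throughput is higher than the FP32 figure, so treat the numbers as illustrative only, and the function names as hypothetical.

```python
PEAK_TFLOPS = 51.0   # FP32 figure quoted for the H100 in this list
BANDWIDTH_TBS = 2.0  # memory bandwidth quoted for the H100, in TB/s


def attainable_tflops(flops_per_byte: float,
                      peak: float = PEAK_TFLOPS,
                      bandwidth: float = BANDWIDTH_TBS) -> float:
    """Roofline bound: the lower of the compute peak and the bandwidth limit."""
    return min(peak, bandwidth * flops_per_byte)


def ridge_point(peak: float = PEAK_TFLOPS,
                bandwidth: float = BANDWIDTH_TBS) -> float:
    """FLOPs per byte above which a kernel stops being memory-bound."""
    return peak / bandwidth


if __name__ == "__main__":
    print(ridge_point())           # 25.5
    print(attainable_tflops(4))    # 8.0, memory-bound
    print(attainable_tflops(100))  # 51.0, compute-bound
```

Kernels that do little work per byte loaded sit on the sloped part of the roofline, which is the situation where the compute units wait on memory.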

With a 350W maximum TDP, it draws power from a 16-pin PCIe cable. The two-slot thermal solution is passive, which is expected for a server-class GPU like this. Using it in a desktop workstation will require some ingenuity to get enough airflow to keep it cool. It's not just a hardware solution, as it comes with a five-year subscription to Nvidia AI Enterprise, which is a fully featured AI software platform with over 100 frameworks, pretrained models to get started quicker, and more to help AI professionals do their job. This is the current pinnacle of AI-accelerated GPUs, and is more versatile than the Tensor Processing Units (TPUs) that Google uses in Google Cloud for AI training. The only real drawback to these graphics cards is the price, which is as much as a family car. Then again, for companies invested in AI training, the only thing that matters is how quickly that training can be completed, and that's where the H100 excels.

Nvidia Tesla V100

Best last-gen server card: Server-class AI computing at an affordable price

The Tesla V100 was the first graphics card to feature Tensor cores, which are designed for accelerating AI workflows and deep learning models. It's a few years old at this point but is still capable, and is a great starting point for building a server for deep learning tasks.

Pros:
16GB of VRAM
640 first-generation Tensor cores
Relatively low 350W power requirement

Cons:
No active cooling
No display outputs

$1470 at Amazon

The Nvidia Tesla V100 was the first graphics card to feature the Volta architecture, and the very first with Tensor Cores to accelerate AI workflows. Its GV100 chip has 640 Tensor Cores enabled for accelerating AI calculations. Now, it's worth mentioning that this first-generation Tensor Core isn't directly comparable to the second, third, or fourth generation cores, as they were improved and gained added functionality with each new release. Still, that's more Tensor Cores than an RTX 4090, and its 16GB of HBM2 memory on a 4,096-bit bus pushes 897GB/s of bandwidth. That's a colossal amount, and will work wonders with image generation tasks.

Unfortunately, it will struggle with transformers, as those tasks are best with at least 24GB of memory to fit the huge datasets they need, but it will still get you going on your deep learning journey. It's also got a relatively small L2 cache of 6MB, so it will be fetching data from global memory more often than newer graphics cards. With a two-slot, passively cooled design, the V100 is usable in workstations or servers, as long as you plan for enough airflow. It's powered by two 8-pin PCIe connectors for a total board power of 350W.

Zotac Gaming GeForce RTX 3090 Trinity OC

Best last-gen consumer card: Very capable for machine learning

$2070 (was $2400, save $330)

The Zotac Gaming GeForce RTX 3090 Trinity OC is the best value proposition from the Ampere architecture, with many third-generation Tensor cores to accelerate AI tasks and 24GB of VRAM for fairly large data sets.

Pros:
328 third-generation Tensor cores
24GB of VRAM with 936.2GB/s of bandwidth
Large heatsink with three fans

Cons:
Three-slot thickness

$2070 at Amazon

The Nvidia 3000-series is still a capable force in AI tasks and the GeForce RTX 3090 is one of the best, if you can find one for sale these days. This model from Zotac has a huge 24GB of GDDR6X memory, which, when coupled with the 384-bit bus, means 936.2GB/s of bandwidth for speedy memory access for deep learning datasets. That's higher than most of the other entries on this list, but comes with a price tag to match the performance. With only 6MB of L2 cache, it will be fetching data from global memory more often, but that's going to be helped by the huge bandwidth numbers, so the performance won't suffer that much.

With a 350W TDP fed by two 8-pin PCIe cables, it will be easy to power and won't require a new ATX 3.0 compliant PSU to run. The large 3-slot heatsink and three fans will keep it cool, especially with the on/off cycle of the Tensor Cores as they wait for more data to be fed from the memory. The older Tensor Cores are less powerful than those in Nvidia's 4000-series, but the more important factor with this card is the 24GB of VRAM, which enables the use of the latest LLMs and, most likely, the datasets they will need for some time to come. That's because until consumer GPUs go higher than 24GB, AI scientists will be aiming to fit their models into that amount of memory.

XFX Speedster MERC310 AMD Radeon RX 7900XTX Black 24G

Best AMD option: 24GB of VRAM and ROCm support

The XFX Speedster MERC310 AMD Radeon RX 7900XTX Black is a monster of a GPU with 24GB of VRAM, and its recent support for PyTorch 2.0.1 and the ROCm open software platform makes it viable for deep learning. The large memory capacity means it's especially suited for training large language models (LLMs).

Pros:
192 AI accelerators
24GB of VRAM with up to 960GB/s of bandwidth
Large cooler with three fans

Cons:
Limited community support as Nvidia is more widespread

$972 at Amazon
$970 at Newegg

The main reason that most of the graphics cards on this list are from Nvidia is that Tensor Cores make a vast difference in how fast GPUs can handle deep learning tasks. While earlier AMD architectures like RDNA and RDNA2 had great silicon with high FP16 performance and high memory bandwidth, the lack of AI accelerators made them a non-starter for professional use. With RDNA3, AMD introduced AI Accelerators, its version of Tensor Cores, with 192 on the flagship Radeon RX 7900 XTX. With 24GB of GDDR6 memory, a 384-bit bus, and 96MB of L3 cache, this graphics card can reach up to 3,500GB/s of effective memory bandwidth when data is served from Infinity Cache. That covers the AI accelerators, memory bandwidth, and cache requirements for deep learning tasks, and we already know that AMD is good for FLOPS performance. The only piece of the puzzle missing is software support, as all the hardware in the world can't help you without something to run on it.

AMD GPUs use the ROCm software platform to run the widely used PyTorch framework for building deep learning models. Until RDNA3, ROCm didn't offer any AI acceleration on consumer GPUs, so while it was usable, it was slower than alternative graphics cards at the same price. Now, with the release of ROCm 5.7.1 for Ubuntu Linux, two GPUs get support for PyTorch 2.0.1 with acceleration: the Radeon RX 7900 XTX and the Radeon Pro W7900. With the 24GB of VRAM on this card from XFX, you have ample space to train LLMs or run other deep learning tasks.
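Once PyTorch is installed, a quick check like the one below can confirm whether it sees the GPU. This is a minimal sketch, not an official AMD or PyTorch recipe. It relies on ROCm builds of PyTorch exposing the GPU through the same `torch.cuda` calls used on Nvidia hardware, with `torch.version.hip` set on those builds. The function name is illustrative.

```python
def describe_accelerator() -> str:
    """Report which GPU backend, if any, the installed PyTorch build can use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "PyTorch found no supported GPU, so work will run on the CPU"

    # ROCm builds set torch.version.hip; CUDA builds leave it as None.
    backend = "ROCm/HIP" if getattr(torch.version, "hip", None) else "CUDA"
    return f"{backend} device: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_accelerator())
```

If this reports the CPU fallback on a supported card, the usual suspects are an unsupported ROCm version or a PyTorch build that wasn't compiled for ROCm.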

ASRock Phantom Gaming Intel Arc A770

Best Intel option: Surprisingly capable, especially at image generation tasks

$310 (was $330, save $20)

The ASRock Phantom Gaming Intel Arc A770 is Intel's flagship discrete GPU, with 16GB of VRAM for large data sets, and 512 of Intel's version of Tensor cores for accelerating AI workflows.

Pros:
16GB of VRAM with 512GB/s of bandwidth
512 Intel tensor cores
Relatively low 225W power requirement

Cons:
Need access to ReBar or Smart Access Memory for best performance

$320 at Amazon
$310 at Newegg

While AMD graphics cards only recently got support for AI acceleration, the discrete Arc cards from Intel came with the company's own version of Tensor Cores straight out of the gate. These cores are called Intel Xe Matrix Extensions Engines, or Intel XMX Engines for short. The Intel Arc A770 comes with 512 XMX Engines, which are used for XeSS upscaling in games that support it. They're also general purpose AI accelerators, and can be used for deep learning tasks. And with 16GB of VRAM and a decent 512GB/s of bandwidth, you can use relatively large models for LLMs or image generation.

The new SYCL Joint Matrix Extension lets Intel XMX be used in the same way as Nvidia's Tensor Cores, accelerating deep learning frameworks like TensorFlow and libraries like oneDNN. Intel has a robust developer team that has been cranking out AI tools, drivers, and a full ecosystem of AI software. They've got in-depth guides for getting deep learning software like TensorFlow running on Arc GPUs, and for anything else you might need to know. The one big drawback is that you need a relatively new motherboard and CPU that support Resizable BAR, because Intel has said that the performance of Arc GPUs won't be great without it.

What you need to know about picking a GPU for deep learning tasks

When picking a graphics card for deep learning tasks, it's important to know which specifications are relevant, and in which order they matter. One of the leading voices in making deep learning more accessible is Tim Dettmers, and we used his expert advice for picking our choices. The primary factor should be the number of Tensor cores, which are only found on Nvidia graphics cards from the Volta architecture onwards, and on consumer graphics cards starting from Turing, the Nvidia RTX 2000-series. With the Ada Lovelace architecture, Tensor cores are in their fourth generation, and as they have been improved each time, the latest graphics cards are the best to pick up. Then memory bandwidth comes into play, then cache configurations, and finally FLOPS. The other thing to remember is that the amount of VRAM dictates the tasks you can run, with 12GB being a minimum for image generation, and 24GB for work with transformers.
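As a rough way to sanity-check those VRAM thresholds, you can estimate how much memory a model's weights alone will take. The sketch below assumes 2 bytes per parameter (FP16) and reserves 20% of VRAM for everything else. Both are assumptions for illustration, the 7-billion-parameter model is hypothetical, and training needs considerably more memory than this weights-only estimate suggests.

```python
GIB = 1024 ** 3


def weights_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in GiB (FP16 by default)."""
    return params_billions * 1e9 * bytes_per_param / GIB


def fits(params_billions: float, vram_gib: float,
         bytes_per_param: int = 2, usable_fraction: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of VRAM."""
    return weights_gib(params_billions, bytes_per_param) <= vram_gib * usable_fraction


if __name__ == "__main__":
    print(round(weights_gib(7), 2))  # 13.04
    print(fits(7, 12))               # False
    print(fits(7, 24))               # True
```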

If you're only just getting to grips with deep learning tasks, you don't want to dive in at the deep end. That's why my recommendation for starting out is an Nvidia GeForce RTX 4070 Ti, like the MSI Gaming X Slim model. With 12GB of VRAM, it's got enough for image generation workloads, and the fourth-gen Tensor cores will chew through tasks, saving you time. For moving on to transformers, or to generate images or other LLM outputs faster, I recommend any Nvidia GeForce RTX 4090 model that you can find in stock, such as the MSI Suprim Liquid X on this list. With 24GB of VRAM you'll be able to use larger datasets, and the increase in Tensor cores will be noticeable. The reasons that make it such a good buy for home users are also why it's so hard to find in stock, as companies have been buying them in droves to power their own AI aspirations.

If money is no object, and you're making serious income from your deep learning tasks, the Nvidia H100 is the best server-class GPU you can buy as a consumer to accelerate AI tasks. With 80GB of VRAM, you can load significantly larger datasets into memory, opening access to tasks that you can't achieve on desktop-class cards. And to round off your deep learning rig, you'll want to use one of the best motherboards to tie everything together. Here, you're probably going to want to look for stability and longevity, as you won't want to risk overclocking, which would be disastrous if it failed part-way through training a model.