Hi, this is Wayne again with a topic “QAT: Speeding SSL with NGINX”.
Now, I’ve reached out to Intel with a lot of questions about QuickAssist over the last few months, and about accelerators in general and the accelerator strategy, because it’s not just the cores, it’s also the accelerators. And it’s like, okay, help me understand that, help me get a little closer to the metal. So Intel sponsored my trip to Innovation 2023, which helped make this video possible, which is awesome. Thanks, Intel. And just full disclosure: when I say accelerator, QuickAssist is essentially the accelerator I’m talking about. It’s the best foot forward from Intel. It’s fabulous, competitors don’t have anything that can touch it, and once you’re using it, it’s going to be hard to go back to a world without it.
QuickAssist, QAT. It’s part of Intel Xeon server processors, and it isn’t about CPU cores. It sounds like a riddle, right? Like, you do processing without a CPU core? But accelerators like this are no joke. Surely you don’t think the future of compute is just cramming more and more cores onto a silicon die without adding anything else, right? Right. Let’s take a closer look.
So, computation without a CPU core: a GPU, that’s the first thing you should think of. Graphics processing unit, graphics card, or sometimes the GPU is built into the CPU. It accelerates drawing the UI, video playback, gaming, rendering, you name it. But it didn’t take long for clever programmers to figure out how to use GPUs for other kinds of compute, not necessarily graphics oriented. Famously cryptocurrency, not all that long ago, but now AI is also using GPU resources pretty heavily. You may have heard of Quick Sync. This tech has been in Intel CPUs that have a built-in GPU for many generations now, and even if you had an add-in GPU, you’d still leave your CPU’s built-in GPU enabled.
I mean, it’s beneficial, but why? Well, video editing and playback software can leverage the GPU built into the Intel CPU, the Quick Sync hardware, to accelerate rendering a movie or to extend the battery runtime of a laptop, because that’s always going to use less power than a discrete GPU. Streaming video, same kind of thing. It’s built into the CPU silicon, but it’s not a CPU core. It’s not even necessarily being used to drive a monitor.
Usually it’s doing some kind of graphics compute, although not necessarily, and you use it even if you’ve got a discrete GPU in your system, a discrete add-in card. It’s no coincidence that the word “quick” appears in QuickAssist, or Intel QAT, because engineers looked at how valuable Quick Sync was in desktop use cases. QuickAssist dates from 2015, and I’d say it’s been robust and mature since at least 2018. It appears in Atom and Xeon D embedded CPUs that are typically used for edge and network applications, and in server-class Xeon CPUs as well.
I mean, for a while Intel even offered QuickAssist PCIe cards just like this one, which is on loan to me from Microsoft SQL MVP Glenn Berry (that’s databases). This one dates from 2018 or 2019, and this is basically built into Sapphire Rapids server CPUs, some of them anyway, and what’s built into the CPU is even faster and better than this card. This QAT accelerator is very well documented and generally very well supported on both Linux and Windows Server operating systems, and even on a workstation, if you’re going to run a server CPU in a workstation. And the use cases are surprisingly not narrow. We talk about SQL Server for databases, but it’ll do a lot more.
How does that work? Well, Glenn Berry and I have done a lot of experiments with Microsoft SQL Server and QAT in particular, and, you know, SQL Server is licensed by the core. So suppose you had a Xeon Gold 6448H CPU. That’s 32 cores, and it has two QAT devices per socket. Licensing costs are generally pretty high, and given the performance trade-off, you’re better off with a single-socket server than a dual-socket server for most use cases.
So you’re going to have two QAT devices. If you have this 32-core CPU and it’s under a pretty significant load, meaning the CPU is fully loaded, and you have the QAT hardware and the Enterprise Edition of SQL Server, the CPU impact of running a backup with compression is going to be pretty minimal, especially if the backup target, the device where you’re storing the backup, isn’t under pressure from the actual database system. This is huge if you’re a database administrator, because the users don’t know you’re running a backup.
The system doesn’t feel like it slows to a crawl while you’re running a backup, in other words. Now, because of the way SQL Server is licensed, you couldn’t bump up from, say, 32 to 48 cores and reserve 16 cores for handling background jobs. That’s just not how the license works. If you get a 48-core system, you’re paying for a 48-core license. So QuickAssist is huge. Even Intel, in their own marketing, gets wrapped up in talking about other kinds of accelerators along with QAT: AMX, AVX-512. Understand that those other facilities on a given CPU are not really QAT, though they can help feed QAT. QAT is literally uncore, what Intel refers to as uncore. It’s an accelerator that’s not part of a CPU core. It is literally this PCIe device crammed into the CPU silicon. It’s memory-mapped I/O, so the CPU, for its part, is really just handling some bookkeeping and traffic-cop work for shuttling around large blocks of data, which doesn’t have a lot of overhead to it.
So QAT in this context is hugely valuable. Check out some of our past discussions for more info on that. So, okay, I’ve convinced you on SQL Server backups when you’re running with a QuickAssist accelerator in hardware under the right hardware conditions.
But what if I told you that you can use the QuickAssist accelerator to do common networking functions and even network cryptography? Now, time to haul out the PCIe cards again. What separates a $1,500 network card from a $300 network card? Basically, how smart the network card is: how much work can be done on the network card, and how much pressure it can take off of the CPUs, like our PCIe accelerator. So let’s talk TLS. You know the lock icon on the Intel website? That’s TLS encryption. A busy web server based around a dual-socket 8490H just sets up and tears down 40,000-plus connections per second, and the CPU usage is entirely dominated by this encryption setup and teardown task.
Our test system here is based around an Intel E810 interface. In this configuration, I can’t even saturate dual 100 Gbit connections with our NGINX test application.
Is it set up inefficiently? No, it’s just that we don’t have enough cores that are fast enough to saturate more than about 78 Gbit/s per channel. The Intel E810 does support some offload, meaning the network card will help with some types of acceleration, but it can’t really help much with anything as high level as TLS encryption.
In our web server use case here, we’re at 100% CPU utilization and not able to peg 200 Gbit/s of interface speed with two 60-core CPUs. It sounds like something is broken, because this is table stakes as far as networking goes. But no: unaccelerated TLS is computationally expensive, even though it’s ubiquitous. The connection setup and teardown is going to take a lot of compute. Now, granted, most of the time when you’re running a web application like this, you’re not setting up and tearing down each connection. The clients will hold the connection open so they don’t have to set it up and tear it down a lot. But if you’re running a service or something stateless, yeah, it could actually do this.
Now, hang on. Before we start talking about TLS and border gateway appliances that could handle this TLS offload, let’s talk about QAT. How I did this I’ll talk about in a second, but we had 128 cores pegged before, and our bottleneck was on the CPU side. Now let’s rerun using our QAT accelerator: we’re only using 20 cores to set up and tear down our TLS connections, so I’ve got about 100 cores left over to run my application. This represents about a 5x acceleration, and it has also moved the bottleneck to the networking stack or the networking interface, where I want it. Now, as I was putting this together, I was trying to figure out the easiest way to show this.
It’s like, oh, let’s go to the terminal and run things in the terminal. That’s not super amazing. The thing that you want to look for is the Intel device plugins. I saw that at Innovation for Red Hat, and it’s part of the ecosystem for Ubuntu, but you can also run it with Docker and Kubernetes. In fact, the Intel device plugins for Kubernetes was updated just last week. I mean, it’s been updated continuously, but there were a couple of features I was waiting on. The device plugins actually make it easy to use not just QAT but also Intel FPGAs, their GPUs, DSA, and everything else. You set up your host, you install the drivers on your host, and then you turn Kubernetes loose with it, or in our case Docker.
If you need a cheat sheet for Docker, or you just look for Docker resources, there are a lot of Docker containers on Docker Hub from Intel. Some of them date back three, four, five years, and some of them haven’t been updated in three years, which is not super encouraging. But you can find commercial projects that are using QAT and don’t have anything to do with Intel, and you can take a peek inside their Dockerfile and understand what’s going on. Open Visual Cloud is one of them.
You can take a look at their Dockerfile and see how they’re setting up the QAT accelerator on their Debian host container, or whatever container their thing is based on, and get some clues, because that’s basically what I did to set this up. So my apples-to-apples comparison was basically just setting up NGINX the normal way, using the reference NGINX container, and then using this container, or a modified version of this container, that loads the QAT drivers. So I have to set up QAT on the host and then run through and do all of that, and then the benchmark is just how fast I can set up and tear down TLS connections. Now, admittedly, in the real world, unless you’re running a microservice, a web server is not going to be setting up and tearing down TLS connections as fast as possible.
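To make “how fast I can set up and tear down TLS connections” concrete, here is a minimal sketch in Python of what such a measurement loop looks like. This is not the tool used in the video; the function names and the lab hostname in the usage note are hypothetical, and a single-threaded Python client will not come close to loading a big server. It only illustrates the shape of the test: a fresh TCP connection and a full TLS handshake every iteration, with no connection reuse.

```python
import socket
import ssl
import time
from typing import Callable


def tls_handshake_once(host: str, port: int = 443, verify: bool = True) -> None:
    """Open a new TCP connection, complete one TLS handshake, then close it."""
    ctx = ssl.create_default_context()
    if not verify:
        # Assumption: only for lab servers using self-signed certificates.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as raw:
        # wrap_socket runs the handshake before returning (the default),
        # so leaving the with-block means setup and teardown both happened.
        with ctx.wrap_socket(raw, server_hostname=host):
            pass


def handshakes_per_second(connect: Callable[[], None], count: int) -> float:
    """Call `connect` `count` times back to back and return the rate."""
    start = time.perf_counter()
    for _ in range(count):
        connect()
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

Usage would look something like `handshakes_per_second(lambda: tls_handshake_once("nginx.lab.example", 443, verify=False), 1000)`, where the hostname is a placeholder for your own test server. Run it once against the stock container and once against the QAT-enabled one, and watch CPU usage on the server while you do.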
Even if you have millions of people hitting the web server, they’re not establishing new connections every time they connect to the server. They are reusing connections that are already open, which is considerably less CPU overhead. So, depending on what your application is, you may see a 25 or 30% performance bump, or, if you really are running microservices, it’s like the 90%-plus bump that we’re seeing here. I mean, you can go from using most of your 100-plus cores for setting up and tearing down TLS connections to using 10 or 15 cores to do that. So if you want to replicate the result or, more importantly, experiment with QAT, and you’ve got the hardware, then check out the Level1 forum thread to help you get this set up.
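The gap between a 25 to 30% bump and a 90%-plus bump is just a question of how much of your CPU time is handshake work to begin with. A rough sketch of that arithmetic in Python follows. The function names are made up for illustration, and the inputs in the usage note (an 8x reduction in handshake CPU cost, a 30% handshake share) are assumptions chosen to land near the figures quoted above, not measurements.

```python
def cpu_saving(handshake_fraction: float, handshake_speedup: float) -> float:
    """Fraction of total CPU time freed when only the handshake part is offloaded.

    handshake_fraction: share of CPU time spent on TLS setup/teardown (0..1).
    handshake_speedup: how many times cheaper that work becomes on the CPU.
    """
    return handshake_fraction * (1.0 - 1.0 / handshake_speedup)


def cores_after_offload(busy_cores: float,
                        handshake_fraction: float,
                        handshake_speedup: float) -> float:
    """Cores still busy after offload, for a machine with `busy_cores` pegged."""
    return busy_cores * (1.0 - cpu_saving(handshake_fraction, handshake_speedup))
```

With the assumed 8x figure, a pure handshake benchmark (`handshake_fraction=1.0`) on 100 busy cores drops to `cores_after_offload(100, 1.0, 8)`, which is in the 10 to 15 core range mentioned above. An application that only spends 30% of its CPU on handshakes frees roughly a quarter of its CPU with the same accelerator. Same hardware, very different headline number.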
You can run this on a single machine. You don’t have to have a Kubernetes cluster, although the device plugins for Kubernetes are really super cool. Running this on a single machine to experiment with it, you can see for yourself the acceleration difference you get from QAT, and it’s pretty darn impressive. This is what I’m looking at when I say I’m absolutely floored by these QuickAssist capabilities. If you think it’s going to be hard to modify your application to do what I did to show you this:
No, I did this with Docker. I swapped one bone-stock NGINX container for another NGINX container that Intel engineers have lovingly hand-tuned for QAT. And in a two-socket system I can have two QAT devices per socket, for a total of four. It works pretty well. If for some reason you can’t use Docker or a containerization system, Intel’s documentation on GitHub is top-notch. You can check out their project and put it together for yourself.
So I ask you: does it make economic sense to just keep throwing cores at these kinds of problems? Even Intel’s competitors say no. But your options there are more expensive networking infrastructure, or other accelerators, or even network appliances where, when the internet traffic comes into your data center, the appliance is the thing that does TLS and encryption. F5 Networks, Cisco, they all offer appliances to do that. But in our infrastructure-as-code world, moving more of that stuff off of specialized appliances onto more generic hardware is advantageous, not to mention it costs less. I mean, Intel does want to charge more for CPUs that have these kinds of accelerators, but you can save $1,000 on a single network card.
It also remains to be seen whether the market will decide that this type of acceleration belongs on the uncore part of a CPU or on a PCIe add-in card. But I’ll tell you, putting this kind of thing on a PCIe add-in card is a little bit more of an uphill battle. You still have a lot of data that you have to move around and transform, and it’s going to be more efficient if you don’t have to move that data very far. On-CPU is a lot closer than down a PCIe connection. And QAT is so mature, so much more robust, and so well executed.
It makes me understand what Intel engineers are talking about when they say that accelerators as a competitive advantage for Intel is a thing in 2023. I mean, QAT is here to stay, and it’s their best example of an accelerator. Now, I’ve only shown you QAT for backup compression and networking acceleration, and our networking example really is more about encryption than networking, but we are able to leverage it for networking. And like Quick Sync and GPUs, QuickAssist is far more flexible than what I’m showing you. It’s just that for this networking use case, you can get a 5x for free, basically. Well, free-ish. The most I can fault Intel for is not including QAT in far more CPUs.
It’s in a lot of the Atom and Xeon D CPUs, Xeon D CPUs being meant for networking and edge compute appliances, so not super expensive. But for Sapphire Rapids, the big-boy server CPUs, it’s enabled in a little less than half of the 50 Sapphire Rapids SKUs that I looked at. I’d like to see it in every Xeon server CPU, even the Xeon W workstation CPUs, so that developers can, you know, take it on for themselves.
There are a lot of developers out there working with Docker and other containerization technologies on their workstation, and they want to be able to try it on their workstation, see the acceleration, and then move it into production. The most important takeaway here is that this is a lot of compute that can really accelerate some workloads in a way that even adding dozens of CPU cores doesn’t necessarily offset, and that’s a big part of Intel’s roadmap for the future. Yeah, they’ve got cores, but they need to add other stuff to feed the cores, and QAT as an accelerator feeds the cores. Makes sense. I understand now.
If you want to check whether your existing Xeon system has QAT, because it’s been there for a lot of generations, you might be able to use it if you’re already using Docker or Linux for your host. For that or anything else, be sure to check out the writeup on the Level1 forums. I’m W, I’m signing out, and I’ll see you there.
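For the “does my existing Xeon have it” question on a Linux host, here is one possible first-pass check, sketched in Python. It is a heuristic and not Intel’s documented detection method: it walks the PCI devices exposed under sysfs and lists Intel-vendor devices whose PCI class is “co-processor”, which is how QAT endpoints generally present themselves. Other Intel co-processor devices could match too, so confirm any hit against the forum writeup or Intel’s QAT documentation. The function name is hypothetical.

```python
from pathlib import Path

INTEL_VENDOR_ID = "0x8086"
# PCI base class 0x0b (processor), subclass 0x40 (co-processor).
# Assumption: QAT endpoints report this class; other devices may as well.
COPROCESSOR_CLASS_PREFIX = "0x0b40"


def find_candidate_accelerators(sysfs_root: str = "/sys/bus/pci/devices"):
    """Return (pci_address, device_id) pairs for Intel co-processor devices."""
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found  # not Linux, or sysfs not mounted
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            device_id = (dev / "device").read_text().strip()
        except OSError:
            continue  # entry vanished or is unreadable; skip it
        if vendor == INTEL_VENDOR_ID and pci_class.startswith(COPROCESSOR_CLASS_PREFIX):
            found.append((dev.name, device_id))
    return found


if __name__ == "__main__":
    for address, device_id in find_candidate_accelerators():
        print(f"{address}: Intel co-processor, device id {device_id}")
```

An empty result doesn’t prove the silicon lacks QAT, since the device can be disabled in firmware or simply not enabled on that SKU. It just tells you the host isn’t exposing one right now.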