On quadrupling throughput of our Quality of Service

by Michael Tremer, October 15, 2019, Updated August 15, 2020

There have been improvements to our Quality of Service (or QoS) which I am very excited about.

Our QoS was sometimes a bottleneck. Enabling it could cut your bandwidth in half if you were unlucky. That was normally not a problem for larger users of IPFire: if you are running a 1 Gigabit/s connection, you probably do not need any QoS in the first place, or your hardware is fast enough to handle the extra load.

For the smaller users this was, however, becoming more and more of a problem. Smaller systems like the IPFire Mini Appliance are designed to be small (the clue is in the name) and to be very energy-efficient. And they are. They are popular with users with a standard DSL connection of up to 100 Megabit/s, which is very common in Germany. You have nothing to worry about here. But if you are lucky enough to have a faster Internet connection, then this hardware, and other appliances that we have sold before, might be running out of steam. There is only so much you can get out of them.

In the test case for this, a six-year-old IPFire Eco Appliance with an Intel D-525 processor with two physical cores and Intel Hyper-Threading was connected to a 400 MBit/s Internet connection. Without QoS enabled, that bandwidth was easily handled at a CPU load of around 20%. After QoS was enabled, the throughput collapsed to around 120 MBit/s at less than 50% CPU load. Clearly that drop was too steep, and there were still CPU cycles left to use.

And that is where the search began...

Finding the bottleneck

IPFire is already heavily tuned. We have spent many development hours over the years getting everything optimised as much as possible, so there were not too many obvious things left to review again.

The QoS is split into three parts: The first one is to classify packets when they arrive. We check whether any of the configured rules match and, if so, mark the packet as belonging to a certain class. Packets are then sent through the whole firewall ruleset and either allowed to pass or dropped. After the firewall has decided to which interface the packet will be sent, it is enqueued in a so-called qdisc until there is free bandwidth and it can be sent.
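To make this a little more concrete, here is a rough sketch of what those three steps look like with standard iptables and tc commands. The interface name red0, the port and the class numbers are illustrative assumptions, not IPFire's actual configuration:

    # Step 1: classify the packet when it arrives (mangle table)
    iptables -t mangle -A PREROUTING -p tcp --dport 443 -j MARK --set-mark 0x10

    # Step 2: the packet traverses the normal firewall ruleset and is
    # either accepted or dropped there.

    # Step 3: on the outgoing interface, an HTB qdisc holds the packet
    # in a class until bandwidth is available
    tc qdisc add dev red0 root handle 1: htb default 20
    tc class add dev red0 parent 1: classid 1:10 htb rate 50mbit ceil 100mbit
    tc class add dev red0 parent 1: classid 1:20 htb rate 10mbit ceil 100mbit
    tc filter add dev red0 parent 1: protocol ip handle 0x10 fw flowid 1:10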

It turned out that there was room for improvement in all of those steps.

iptables is now able to decide in one go into which class to sort a packet. Before, this used to be a two-step process where the packet was marked first and only later moved into the correct class. Doing it in one step brings a small advantage, which unfortunately is barely noticeable (less than a one percent increase), but it makes our code cleaner, because we previously had problems with interference from the Intrusion Prevention System and NAT, which need to mark packets, too.
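As a hedged sketch (the port and the class numbers are again made up for illustration), the difference looks roughly like this:

    # Old: two steps - mark the packet, then let a tc "fw" filter move
    # the marked packet into the class
    iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j MARK --set-mark 0x20
    tc filter add dev red0 parent 1: protocol ip handle 0x20 fw flowid 1:20

    # New: one step - the CLASSIFY target sorts the packet straight into the class
    iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j CLASSIFY --set-class 1:20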

ToS

The second improvement is to drop support for changing the ToS (Type of Service) bits. Those mark whether a packet should be forwarded with lowest latency, lowest cost, and so on. They are, however, not evaluated by any ISP, and it is just a waste of CPU cycles to rewrite them: the whole checksum of the IP header needs to be recalculated, which adds a lot of overhead for a useless operation.

Applications like VoIP phones and Skype set these bits on the client side, so rules can still be used to find this traffic when it comes from the local network and put it into the right class. That is very handy for crystal-clear VoIP calls!
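For illustration, assuming a phone marks its traffic as Expedited Forwarding, a rule of roughly this shape matches those bits on the way out without rewriting anything (the class number is an assumption):

    # Sort DSCP "EF" traffic that the VoIP phone has already marked into
    # the high-priority class
    iptables -t mangle -A POSTROUTING -o red0 -m dscp --dscp-class EF -j CLASSIFY --set-class 1:10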

Breakthrough!

That again improved throughput only very little. The biggest breakthrough was the realisation that a processor core was busier waiting for other cores to finish handling a packet than processing packets itself. It wasted time on waiting, which is not great when you do not have the fastest processor anyway.

To process incoming packets - which was only ever the bottleneck - we sent them through a device called the Intermediate Queueing device (IMQ). That is a small driver which never made it into the Linux kernel and was patched into the IPFire kernel by us. It always worked flawlessly, but evidently was not good enough to be merged into the mainline kernel. After reading this post, you will know why.

More advanced appliances use multiple queues when they receive packets from the network. Cheaper network adapters do not support this, but the better ones do. That way, each processor core can handle packets from its own queue, which means not only one processor core is used, but all of them at the same time.
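Whether a network adapter supports multiple receive queues can be checked with ethtool; eth0 here simply stands in for whatever the interface is called on your system:

    # Show how many queues ("channels") the hardware supports and how many are in use
    ethtool -l eth0

    # Request four combined queues so that packets can be spread over four cores
    ethtool -L eth0 combined 4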

Now, the packets of a TCP connection have an order. Therefore, there needs to be synchronisation between the processor cores to reassemble the data in the right order before running the layer 7 filter over it. For that, processor cores need to talk to each other, and that is normally a very expensive operation. It is extremely slow on Intel Atom processors, but rather fast on Intel Xeon processors. I do not need to go into detail on how that relates to power consumption.

The QoS, however, mostly does not care about TCP. It works on layer 3, which is only IP. IP itself is stateless, so less synchronisation is needed, which allows us to process more packets.

The IMQ driver has 38 mentions of the word "lock" in its source code. It uses locks excessively and has always been criticised for its poor handling of SMP systems. The implementation that comes with the Linux kernel - and works slightly differently - has only two mentions and hence uses a lot less locking.
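The mechanism that ships with the kernel for this job is the IFB (Intermediate Functional Block) device. Assuming that is what replaces IMQ here, a minimal sketch of redirecting ingress traffic to such a device, so that a normal egress qdisc can shape it, looks roughly like this (interface names are again assumptions):

    # Create an ifb device and bring it up
    modprobe ifb numifbs=1
    ip link set ifb0 up

    # Redirect everything arriving on red0 to ifb0 ...
    tc qdisc add dev red0 handle ffff: ingress
    tc filter add dev red0 parent ffff: protocol all u32 match u32 0 0 \
        action mirred egress redirect dev ifb0

    # ... where a regular qdisc, for example HTB, can shape it
    tc qdisc add dev ifb0 root handle 2: htb default 20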

I will not bore you with any more technical detail. This post is probably long enough already. I am bad at keeping things short. If you are really keen, have a look at the patchset.

Let's fast-forward to the results

The same appliance is now forwarding packets at a whopping 390 MBit/s with a CPU utilisation of around 40-50%. CPU usage therefore has not gone down, but throughput has reached the maximum it could possibly be. Goal achieved!

Naturally, IPFire is able to shift more than 400 MBit/s of traffic now, but this was the real-world test. In the lab I was able to saturate a 10G link with almost no noticeable change in CPU utilisation on an IPFire Enterprise Appliance, which is brilliant, too.

Thank you for your help - please donate!

All of this was done over two intense afternoons: sitting down with a cup of tea and analysing something until you find the root cause of it. Garnish it all with some nice improvements to the web UI and there you are: IPFire has massively improved its performance on small systems and even on larger ones.

All of this will come with IPFire 2.23 - Core Update 137 which will be released in a couple of weeks.

This would not have been possible if I did not have access to the hardware and the Internet connection. Thank you very much for your support - you know who you are. If you want to support our project, too, and help me pay for my tea, please donate. It is very important to be able to work on things like this!