diff --git a/docs/source/tuning.rst b/docs/source/tuning.rst
index d8a6fc6b2..4b58fb70d 100644
--- a/docs/source/tuning.rst
+++ b/docs/source/tuning.rst
@@ -94,8 +94,8 @@ The device driver also prints out some PCIe-related information when it attaches
 Note that ``lspci`` reports ``LnkSta: Speed 8GT/s (ok), Width x16 (ok)``, indicating that the link is running at the max supported speed and max supported link width. If one of those is reported as ``(degraded)``, then further investigation is required. If ``(ok)`` or ``(degraded)`` is not shown, then compare ``LnkSta`` with ``LnkCap`` to see if ``LnkSta`` reports lower values. In this case, ``lspci`` reports ``LnkCap: Port #0, Speed 8GT/s, Width x16``, which matches ``LnkSta``. It also reports ``MSI: Enable+ Count=32/32``, indicating that all 32 MSI channels are active. Some motherboards do not fully implement MSI and limit devices to a single channel. Eventually, Corundum will migrate to MSI-X to mitigate this issue, as well as support more interrupt channels. Also note that ``lspci`` reports ``MaxPayload 512 bytes``---this is the largest that I have seen so far (on AMD EPYC), most modern systems report 256 bytes. Obviously, the larger, the better in terms of PCIe overhead.
 
-Any processes that access the NIC should be pinned to the NIC's NUMA node. The NUMA node is shown both in the ``lspci`` and driver output output (``NUMA node: 3``) and it can be read from sysfs (``/sys/class/net/<interface>/device/numa_node``). Use ``numactl -N <node>`` to run programs on that NUMA node. For example, ``numactl -N 3 iperf3 -s``. If you're testing with ``iperf``, you'll want to run both the client and the server on the correct NUMA node.
+Non-uniform memory access (NUMA) is another potential pitfall to be aware of. Systems with multiple CPU sockets will generally have at least one NUMA node associated with each socket, and some CPUs, like AMD EPYC, have internal NUMA nodes even with a single CPU. For best performance, any processes that access the NIC should be pinned to the NIC's local NUMA node. If packets are stored in memory located on a different NUMA node, then there will be a performance penalty associated with the NIC accessing that memory via QPI, UPI, etc. Use ``numactl -s`` to get a list of all physical CPUs and NUMA nodes on the system. If only one node is listed, then no binding is required. If you're running a CPU with internal NUMA nodes such as AMD EPYC, make sure that the BIOS is set up to expose the internal NUMA nodes. The NUMA node associated with the network interface is shown both in the ``lspci`` and driver output (``NUMA node: 3``), and it can also be read from sysfs (``/sys/class/net/<interface>/device/numa_node``). Use ``numactl -l -N <node>`` to run programs on a specified NUMA node, for example, ``numactl -l -N 3 iperf3 -s``. Recent versions of ``numactl`` also support automatically determining the NUMA node from the network device name, so in this case ``numactl -l -N netdev:enp129s0 iperf3 -s`` would run ``iperf`` on the NUMA node that ``enp129s0`` is associated with. It's important to make sure that both the client and the server are run on the correct NUMA node, so it's probably a better idea to manually run ``iperf3 -s`` under ``numactl`` than to run ``iperf3`` as a system service that could potentially run on any NUMA node. On Intel CPUs, `PCM <https://github.com/intel/pcm>`_ can be used to monitor QPI/UPI traffic to confirm that processes are bound to the correct NUMA nodes.
 
 It's also advisable to go into BIOS setup and disable any power-management features to get the system into its highest-performance state.
 
-Notes on the performance evaluation for the FCCM paper: the servers used are Dell R540 machines with dual Intel Xeon 6138 CPUs and all memory channels populated, and ``lspci`` reports ``MaxPayload 256 bytes``. The machines have two NUMA nodes, so only one CPU is used for performance evaluation to prevent traffic from traversing the UPI link (can use Intel PCM to confirm use of the correct NUMA node). On these machines, a single ``iperf`` process would run at 20-30 Gbps with 1500 byte MTU, or 40-50 Gbps with 9000 byte MTU. The Corundum design for those tests was configured with 8192 TX queues and 256 RX queues.
+Notes on the performance evaluation for the FCCM paper: the servers used are Dell R540 machines with dual Intel Xeon 6138 CPUs and all memory channels populated, and ``lspci`` reports ``MaxPayload 256 bytes``. The machines have two NUMA nodes, so only one CPU is used for performance evaluation to prevent traffic from traversing the UPI link. On these machines, a single ``iperf`` process would run at 20-30 Gbps with 1500 byte MTU, or 40-50 Gbps with 9000 byte MTU. The Corundum design for those tests was configured with 8192 TX queues and 256 RX queues.
 
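
To illustrate the ``LnkSta``/``LnkCap`` check described in the patched text above, the relevant fields can be filtered out of the verbose ``lspci`` output. The PCI address ``81:00.0`` is only an example (chosen to match the ``enp129s0`` interface name used above); substitute the address of your own NIC::

    # locate the NIC, then dump its link status and capabilities, payload
    # settings, interrupt configuration, and NUMA node
    # (81:00.0 is an example address; adjust for your system)
    lspci | grep -i ethernet
    sudo lspci -vvv -s 81:00.0 | grep -E 'LnkCap|LnkSta|MaxPayload|MSI|NUMA node'

Comparing the ``LnkSta`` line against ``LnkCap`` in this output is enough to spot a degraded link even on systems where ``lspci`` does not print the ``(ok)``/``(degraded)`` annotations.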
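A minimal sketch of the NUMA pinning workflow described above, assuming the interface is called ``enp129s0`` and sysfs reports NUMA node 3 (both values are examples taken from this section; substitute your own)::

    # read the NIC's local NUMA node from sysfs
    cat /sys/class/net/enp129s0/device/numa_node

    # run the iperf3 server bound to that node, with local memory allocation
    numactl -l -N 3 iperf3 -s

    # or let a recent numactl resolve the node from the device name
    numactl -l -N netdev:enp129s0 iperf3 -s

    # pin the iperf3 client to its own NIC's local node in the same way
    numactl -l -N netdev:<client interface> iperf3 -c <server address>

Binding both memory (``-l``) and CPUs (``-N``) keeps the packet buffers and the application on the NIC's node, which is what avoids the QPI/UPI traversal penalty mentioned above.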