Sendmsg + raw ethernet + multiple frames

I am using Linux 3.x and a modern glibc (2.19).

I would like to send multiple Ethernet frames without switching back and forth between kernel and user space.

My MTU is 1500, and I want to send 800 KB of data. I fill in the receiver address like this:

struct sockaddr_ll socket_address;
socket_address.sll_ifindex = if_idx.ifr_ifindex;
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...

      

After that I can call sendto/sendmsg 800 KB / 1500 ~= 500 times and everything works fine, but that costs roughly 500 * 25 user space <-> kernel transitions per second, which I want to avoid.

I tried initializing struct msghdr::msg_iov with the relevant information, but I get the error "Message too long"; it looks like msghdr::msg_iov cannot describe anything larger than the MTU.

So, is it possible to send many raw Ethernet frames from user space on Linux in a single call?

PS: I read the data (800 KB) from a file into memory, so struct iovec suits me well: I can create a suitable number of Ethernet headers and use two iovec entries per 1500-byte packet, one pointing at the data and one pointing at the Ethernet header.
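For illustration, the layout I have in mind looks roughly like this (eth_hdrs, filebuf, frame, NUM_FRAMES, and sockfd are placeholder names; socket_address is the struct sockaddr_ll from above):

/* illustrative sketch only: two iovec entries per 1500-byte frame */
struct ether_header eth_hdrs[NUM_FRAMES];   /* prebuilt Ethernet headers (net/ethernet.h) */
struct iovec iov[2];
struct msghdr msg;

memset(&msg, 0, sizeof(msg));
iov[0].iov_base = &eth_hdrs[frame];              /* header of this frame        */
iov[0].iov_len  = sizeof(struct ether_header);
iov[1].iov_base = filebuf + frame * 1500;        /* 1500-byte slice of the data */
iov[1].iov_len  = 1500;

msg.msg_name    = &socket_address;
msg.msg_namelen = sizeof(socket_address);
msg.msg_iov     = iov;
msg.msg_iovlen  = 2;

/* still only one frame (<= MTU) per call -- exactly the limitation described above */
sendmsg(sockfd, &msg, 0);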


1 answer


Wow.

My last company built real-time video encoding hardware. In the lab, we had to blast 200 MB/second across a bonded link, so I have some experience with this. What follows is based on that.

Before you can tune, you have to measure. You say you don't want to make multiple syscalls, but can you prove with timing measurements that the overhead is actually significant?

I use a wrapper routine around clock_gettime that returns the time with nanosecond precision (i.e. (tv_sec * 1000000000) + tv_nsec). Call it [here] "nanotime".
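A minimal version of such a wrapper might look like this (using CLOCK_MONOTONIC; the exact clock choice is up to you):

#include <stdint.h>
#include <time.h>

// return the current time as a single 64-bit nanosecond value
static inline uint64_t
nanotime(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ((uint64_t) ts.tv_sec * 1000000000ULL) + ts.tv_nsec;
}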

So any given syscall requires a measurement:

tstart = nanotime();
syscall();
tdif = nanotime() - tstart;

      

For each of send/sendto/sendmsg/write, do this with a small amount of data, to make sure you are not blocking [or use O_NONBLOCK if applicable]. This gives you the syscall overhead.
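For example (sockfd and buf are placeholders; this assumes the socket is non-blocking or is known not to block for a payload this small):

uint64_t tstart, tdif;
char buf[64];                            // small payload, placeholder data

tstart = nanotime();
write(sockfd, buf, sizeof(buf));         // or send/sendto/sendmsg
tdif = nanotime() - tstart;

printf("syscall overhead: %llu ns\n", (unsigned long long) tdif);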

Why are you going directly to raw Ethernet frames? TCP [or UDP] is usually fast enough, and modern NICs can do the envelope wrap/strip in hardware. I would like to know whether there is some particular situation that requires Ethernet frames, or whether you were not getting the performance you wanted and came up with this as a solution. Remember, you are doing 800 KB/s (~1 MB/s) and my project was doing 100x-200x more than that over TCP.

Why not use two simple write calls to the socket? One for the header, one for the data [all 800 KB]. write can be used on a socket and does not have the EMSGSIZE error or limit.

Also, why do you need the header in a separate buffer? When you allocate your buffer, just do:

size_t datamax = 800 * 1024;  // or whatever
size_t buflen = sizeof(struct header) + datamax;
char *buf = malloc(buflen);

while (1) {
    ssize_t datalen = read(fdfile, &buf[sizeof(struct header)], datamax);
    if (datalen <= 0)
        break;  // EOF or read error
    // fill in header ...
    write(fdsock, buf, sizeof(struct header) + datalen);
}

      

This even works for the ethernet frame case.

Another thing that can help is to use setsockopt to increase the size of the kernel buffer for your socket. Otherwise, you can send the data, but it will be dropped in the kernel before the receiver can drain it. More on this below.
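For example (the 4 MB figure is purely illustrative; the kernel also clamps this value to net.core.wmem_max):

#include <sys/socket.h>

int sndbuf = 4 * 1024 * 1024;            // illustrative size, tune to taste

if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
    perror("setsockopt(SO_SNDBUF)");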

To measure wire performance, add a few fields to your header:

u64 send_departure_time;  // set by sender from nanotime
u64 recv_arrival_time;  // set by receiver when the packet arrives

      

So the sender sets the departure time and does the write [just the header, for this test]. Call this packet Xs. The receiver stamps it when it arrives, and immediately sends a message back to the sender [call it Xr] with its own departure stamp and the contents of Xs. When the sender receives that, it stamps it with the arrival time.

With the above, we now have:

T1 -- time packet Xs departed sender
T2 -- time packet Xs arrived at receiver
T3 -- time packet Xr departed receiver
T4 -- time packet Xr arrived at sender

      

Assuming you do this on a relatively quiet link with little or no other traffic, and you know the link speed (e.g. 1 Gbps), with T1/T2/T3/T4 you can calculate the overhead.
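A sketch of that arithmetic (treating T1..T4 as uint64_t nanotime values; pkt_bytes is a placeholder for the size of the test packet; T1/T4 are on the sender's clock and T2/T3 on the receiver's, so any clock offset between the two machines cancels out of the round trip):

uint64_t rtt_ns      = (T4 - T1) - (T3 - T2);   // round trip minus receiver turnaround
uint64_t wire_ns     = pkt_bytes * 8;           // at 1 Gbps, roughly one bit per nanosecond
uint64_t overhead_ns = (rtt_ns / 2) - wire_ns;  // per-direction NIC/driver/stack cost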

You can repeat the measurement for TCP/UDP vs. raw Ethernet. You may find that raw frames are not buying you as much as you think. Once again, can you prove it with a precise measurement?

I "invented" this algorithm while working at the aforementioned company, only to find out that it was already part of a video standard for sending raw video across a 100 Gb Ethernet NIC, where the NIC does the time stamping in hardware.

Another thing you may want to add is flow control. This is similar to what bittorrent does, or what the PCIe bus does.

When PCIe bus nodes first start up, they report how much free buffer space they have available for "blind writes". That is, the sender is free to blast up to that much without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. The sender can add this value back to the blind write limit and keep going.

For your purposes, the blind write limit is the target's kernel socket buffer size.
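A sender-side sketch of that credit scheme (every name here -- have_data, next_chunk_len, send_chunk, recv_ack_nonblocking, RECEIVER_BUFSIZE -- is a hypothetical placeholder, not a real API):

// credits = bytes we may still send without waiting for an ACK;
// start at the receiver's advertised buffer size
size_t credits = RECEIVER_BUFSIZE;

while (have_data()) {
    size_t chunk = next_chunk_len();
    if (chunk > credits)
        chunk = credits;

    if (chunk > 0) {
        send_chunk(chunk);
        credits -= chunk;
    }

    // non-blocking poll for an ACK carrying the number of bytes the receiver drained
    ssize_t drained = recv_ack_nonblocking();
    if (drained > 0)
        credits += (size_t) drained;
}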

UPDATE

Based on some additional information from your comments [the actual system configuration should probably go, in more complete form, as an edit at the bottom of your question]:

You do have a need for a raw socket and for sending Ethernet frames. You can reduce the overhead by setting a larger MTU via ifconfig, or with an ioctl call using SIOCSIFMTU. I recommend the ioctl. You probably do not need to set the MTU to 800 KB; your NIC has a practical limit. You can likely increase the MTU from 1500 to 15000 easily enough. That alone would reduce the syscall overhead by 10x, which might be "good enough".
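A minimal sketch of the ioctl route ("eth0" and the 9000-byte value are only examples; the NIC and its driver decide what they will actually accept):

#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>

struct ifreq ifr;

memset(&ifr, 0, sizeof(ifr));
strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   // interface name: example
ifr.ifr_mtu = 9000;                            // requested MTU: example value

if (ioctl(sockfd, SIOCSIFMTU, &ifr) < 0)       // sockfd: any socket on the system
    perror("ioctl(SIOCSIFMTU)");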



You probably will have to use sendto/sendmsg. The two write calls were based on converting to TCP/UDP. But I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you will find that most example code for what you want uses sendto. sendmsg may seem like less overhead for you, but it may cause more overhead for the kernel. Here is an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html
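The gist of the sendto approach, reusing the sockaddr_ll setup from the question (sockfd, frame, and framelen are placeholders; frame already holds the Ethernet header plus payload, and CAP_NET_RAW/root is required for AF_PACKET sockets):

#include <arpa/inet.h>            // htons
#include <linux/if_ether.h>       // ETH_P_ALL
#include <linux/if_packet.h>      // struct sockaddr_ll, AF_PACKET
#include <sys/socket.h>

int sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

// one frame (header + up to MTU bytes of payload) per call
if (sendto(sockfd, frame, framelen, 0,
           (struct sockaddr *) &socket_address,
           sizeof(struct sockaddr_ll)) < 0)
    perror("sendto");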

In addition to improving the syscall overhead, a larger MTU might improve the efficiency of the "wire", although that does not seem to be a problem in your use case. I have some experience with CPU + FPGA systems and the communication between them, but I am still puzzled by one of your comments about "not using a wire". The FPGA connected to the Ethernet pins of the CPU I sort of get. More precisely, do you mean the FPGA pins connected to the Ethernet pins of the NIC card / CPU chip?

Are the CPU/NIC on the same board, with the FPGA pins connected via board traces? Otherwise, I do not understand "not using a wire".

However, once again, I must say that you need to be able to measure your performance before blindly trying to improve it.

Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize it may not be worth it, and doing so may actually hurt performance more severely in other areas that you did not realize were a factor when you started.

As an example, I once worked on a system that had such a severe performance problem that the system did not work. I suspected the serial port driver was slow, so I recoded it from a high-level language (e.g. C) into assembler.

I increased the driver's performance by 2x, but the system's performance improved by only 5%. It turned out that the real problem was other code accessing non-existent memory, which merely caused a bus timeout and slowed the system down measurably [it did not generate an interrupt that would have made it easy to find, as it would on modern systems].

That is when I learned the importance of measurement. I had made my optimization based on an educated guess rather than hard data. After that: lesson learned!

Nowadays, I never attempt a major optimization until I can measure it first. In some cases I add an optimization that I am "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I sometimes find that the new code is actually slower and I have to revert the change. But that is the point: I can prove/disprove it with hard performance data.

What CPU are you using: x86, ARM, MIPS, etc.? At what clock speed? How much DRAM? How many cores?

What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is its maximum clock rate? Is the FPGA dedicated entirely to logic, or do you also have a CPU inside it, such as a MicroBlaze, NIOS, or ARM? Does the FPGA have access to its own DRAM [and how much]?

If you increase the MTU, can the FPGA handle it, from either a buffer/space standpoint or a clock speed standpoint? If you increase the MTU, you may need to add an ack/sync protocol, as I suggested in the original post.

Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between the CPU and the FPGA.

This may be mitigated, purely as a side effect of sending small packets. If you increase the MTU too much, you might overrun the FPGA. In other words, it is the very overhead you are trying to optimize away that allows the FPGA to keep up with the data rate.

This is what I meant by the unintended consequences of blind optimization. It can have unintended and worse side effects.

What is the nature of the data being sent to the FPGA? You are sending 800KB, but how often?

I am assuming this is not the FPGA firmware itself, for a few reasons. You said the firmware is almost full already [and it is receiving the Ethernet data]. Also, firmware is usually loaded via an I2C bus, a ROM, or an FPGA programmer. So, am I correct?

You are sending the data to the FPGA from a file. This implies it is only sent once, when your CPU application starts up. Is that correct? If so, no optimization is needed, because it is an init/startup cost with little impact on the running system.

So, I have to assume that the file gets loaded many times, possibly a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall, not just its overhead but also the optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy is 64 KB, depending on the characteristics of the filesystem or the underlying disk.

So, if you are looking to reduce overhead, reading the data from a file may cost considerably more than having the application generate the data [if that is possible].
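For illustration, reading the file into the staging buffer in 64 KB chunks (filebuf and fdfile are placeholders; 64 KB is only the rule of thumb mentioned above, not a measured optimum for your system):

#define RDCHUNK (64 * 1024)              // rule-of-thumb size, measure to confirm

char *p = filebuf;
ssize_t n;

while ((n = read(fdfile, p, RDCHUNK)) > 0)
    p += n;
if (n < 0)
    perror("read");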

The kernel syscall interface is designed to have very low overhead. Kernel programmers [I happen to be one] spend a lot of time ensuring that the overhead stays low.

You say that your system uses a lot of CPU time for other things. Can you measure those other things? How is your application structured? How many processes? How many threads? How do they communicate? What are the latency and throughput? You may be able to find [and probably will find] larger bottlenecks, recode those, and get an overall reduction in CPU usage that far exceeds the maximum benefit you will get from the MTU tweak.

Trying to optimize away the syscall overhead might turn out like my serial port optimization: a lot of effort, and yet disappointing overall results.

When considering performance, it is important to consider it from the point of view of the overall system. In your case, that means the CPU, the FPGA, and everything else in it.

You say the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they have not that the FPGA is almost out of space, or would you otherwise have moved them? Is the FPGA firmware 100% complete, or is there more RTL to be written? If you are at 90% space utilization in the FPGA and you will need more RTL, you may want to consider moving to an FPGA part with more room for logic, possibly with a higher clock rate.

At my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We used virtually 100% of the logic space and required the part's maximum clock rate. The vendor told us that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of that, we strained the vendor's development tools: place and route would frequently fail and have to be rerun to get correct placement and meet timing.

So, when an FPGA is almost full of logic, place and route can be hard to achieve. That might be a reason to consider a larger part [if possible].
