Modern network cards can take a long data buffer (bigger than the MTU of the link) and segment it in hardware, sticking a set of common headers on the front of each segment. This comes in a number of variants and under a number of names: Large Send Offload, TCP Segmentation Offload, and various tunnel offloads. (My favourite would have to be UFO - UDP Fragmentation Offload.)
In Linux, GSO - Generic Segmentation Offload - provides a lot of software infrastructure for these various offloads. GSO gives you a few perks even if you don't have hardware support: it allows you to do most of the work with a large buffer, and only split it right before you hand it to hardware.
Now SCTP is a bit of a special case. Per 90017accff61 ("sctp: Add GSO support"):
"SCTP has this pecualiarity [sic] that its packets cannot be just segmented to (P)MTU. Its chunks must be contained in IP segments, padding respected. So we can't just generate a big skb, set gso_size to the fragmentation point and deliver it to IP layer."
So, if SCTP wants to get the advantages of GSO, they need to do some magic to allow a buffer (skb) to be split at the right spots. To do this, they create fragments and do GSO on the fragments rather than by splitting a long linear buffer.
As the commit message mentions, normally the skb has its gso_size property set to the point at which it should be split. Now, here's where things get interesting. Because SCTP needs to signal that it's splitting somewhere else, it overrides the meaning of the gso_size property. If gso_size == GSO_BY_FRAGS then we're dealing with an skb that should be split on fragments; otherwise it should be split normally using the value of gso_size.
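To make the two meanings of gso_size concrete, here is a minimal userspace sketch. It is not kernel code: struct toy_skb and toy_segment_lengths() are hypothetical names invented for illustration, and the real skb is far more involved. The only thing taken from the kernel is the sentinel value, 0xffff. The sketch just computes how long each resulting segment would be under either interpretation.

```c
#include <stddef.h>

/* Sentinel value named in the post; the kernel sets GSO_BY_FRAGS to 0xffff. */
#define GSO_BY_FRAGS 0xffffu

/* Hypothetical, heavily simplified stand-in for an skb (not the kernel struct). */
struct toy_skb {
    unsigned int gso_size;   /* a real split point, or GSO_BY_FRAGS */
    size_t payload_len;      /* total payload bytes in the big buffer */
    const size_t *frag_len;  /* per-fragment lengths, used with GSO_BY_FRAGS */
    size_t nr_frags;         /* number of entries in frag_len */
};

/*
 * Write the length of each resulting segment into out[] (capacity max_out).
 * Returns the number of segments, or 0 if out[] is too small or the
 * input makes no sense.
 */
size_t toy_segment_lengths(const struct toy_skb *skb, size_t *out, size_t max_out)
{
    size_t n = 0;

    if (skb->gso_size == GSO_BY_FRAGS) {
        /* Control signal: one segment per pre-built fragment. */
        if (skb->nr_frags > max_out)
            return 0;
        for (size_t i = 0; i < skb->nr_frags; i++)
            out[n++] = skb->frag_len[i];
        return n;
    }

    /* Data signal: gso_size really is the point at which to split. */
    if (skb->gso_size == 0)
        return 0;
    for (size_t left = skb->payload_len; left > 0; ) {
        size_t seg = left < skb->gso_size ? left : skb->gso_size;

        if (n == max_out)
            return 0;
        out[n++] = seg;
        left -= seg;
    }
    return n;
}
```

With gso_size set to 1000 and a 2500-byte payload, this yields segments of 1000, 1000 and 500 bytes. With gso_size set to GSO_BY_FRAGS, the segment lengths are simply whatever the fragment list says, and the numeric value 0xffff is never used as a size.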
This is set up in 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes"), and GSO_BY_FRAGS is set to 0xffff.
Now this requires that every user of gso_size checks for the GSO_BY_FRAGS case. Most do, usually via special handlers. Today, however, I came across one that didn't: the token bucket filter (tbf) queuing discipline. There, the missing check could cause massive performance regressions when using SCTP with the tbf qdisc. (For what it's worth, I have no idea why you would want to do that, but it's still a bug.)
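Here is a rough illustration of how this class of bug bites. This is a toy model, not the tbf code and not the fix I proposed. It assumes, purely for illustration, a consumer that wants to know whether every segment of a GSO packet fits under some length limit. The struct and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sentinel value named in the post; the kernel sets GSO_BY_FRAGS to 0xffff. */
#define GSO_BY_FRAGS 0xffffu

/* Hypothetical, simplified description of a GSO packet (not a kernel struct). */
struct toy_gso_pkt {
    unsigned int gso_size;   /* a real split point, or GSO_BY_FRAGS */
    size_t hdr_len;          /* headers stuck on the front of each segment */
    const size_t *frag_len;  /* in this toy model, each already includes headers */
    size_t nr_frags;         /* number of entries in frag_len */
};

/* Buggy: always reads gso_size as a length, so the sentinel looks like 65535 bytes. */
bool toy_fits_naive(const struct toy_gso_pkt *p, size_t max_len)
{
    return p->hdr_len + p->gso_size <= max_len;
}

/* Checks the sentinel first and, if it is set, looks at the real fragment sizes. */
bool toy_fits_frag_aware(const struct toy_gso_pkt *p, size_t max_len)
{
    if (p->gso_size == GSO_BY_FRAGS) {
        for (size_t i = 0; i < p->nr_frags; i++) {
            if (p->frag_len[i] > max_len)
                return false;
        }
        return true;
    }
    return p->hdr_len + p->gso_size <= max_len;
}
```

For a packet marked GSO_BY_FRAGS whose fragments are 1400 and 900 bytes, the naive check against a 1514-byte limit fails, because it compares the header length plus 65535 against the limit, while the frag-aware check passes. For an ordinary packet with a real gso_size the two functions agree.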
I proposed a fix, but it's likely there are other cases.
This is a good example of in-band signalling: control signals (GSO_BY_FRAGS) and data signals share the one channel (the gso_size variable). It shows the downside: you need to check for both cases everywhere. It also shows the upside: no extra field was required in the data structure to support this case, and that is a big thing in the network stack.