Recently my home server and VM host started randomly losing network connectivity. From the outside it still seemed to be working, but I was unable to access it remotely in any way. According to the switch, the Ethernet adapter was still up, so the issue had to be somewhere in software.
It wouldn’t be the first time a driver turned out to be the cause of a hanging network connection. In the past I’ve been burned by buggy WiFi drivers on both Linux and Windows computers.
After digging a bit into the system logs, I stumbled on the following:
vmhost kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
vmhost kernel:   TDH                  <0>
vmhost kernel:   TDT                  <1>
vmhost kernel:   next_to_use          <1>
vmhost kernel:   next_to_clean        <0>
vmhost kernel: buffer_info[next_to_clean]:
vmhost kernel:   time_stamp           <10fbc2f81>
vmhost kernel:   next_to_watch        <0>
vmhost kernel:   jiffies              <10fbc3871>
vmhost kernel:   next_to_watch.status <0>
vmhost kernel: MAC Status             <40080083>
vmhost kernel: PHY Status             <796d>
vmhost kernel: PHY 1000BASE-T Status  <7800>
vmhost kernel: PHY Extended Status    <3000>
vmhost kernel: PCI Status             <10>
vmhost kernel: e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
vmhost kernel: vmbr0: port 1(eth0) entered disabled state
The kernel would then reset the network adapter and, after a bit, hit the same hang again, until the machine went completely offline.
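To get a feel for how often this happens, the hang reports can be counted in the kernel log. The snippet below is only a rough sketch: the helper name count_unit_hangs is made up for this post, and it simply counts lines containing the message text from the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (name invented for illustration): counts the
# e1000e hang reports in kernel log text read from stdin.
# It matches the message text seen in the log excerpt above.
count_unit_hangs() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are no matches, so "|| true" keeps the helper usable
    # in scripts that run with "set -e".
    grep -c 'Detected Hardware Unit Hang' || true
}
```

On a systemd machine it could be fed from the journal, e.g. `journalctl -k | count_unit_hangs` for the current boot or `journalctl -k -b -1 | count_unit_hangs` for the previous one, assuming the journal is kept across reboots. `dmesg | count_unit_hangs` should work as well.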
Looking for answers online I stumbled upon this ServerFault thread.
Cause
Reading through different sources and bug reports, it seems the best way to reproduce the issue is to put the device under high-bandwidth load, i.e. streaming large amounts of data that would saturate the interface.
In my case it usually happened while I was streaming media from the local Plex server to a device. Due to the way the network is set up, the Windows VM that runs the Plex instance has to fetch the media file from an NFS share on a separate device, transcode it in the VM and then send it to the playback device.
This adds up to a lot of network traffic, usually ~40-100 Mbit/s, depending on the playback device and the quality of the source media file.
The same issue seemed to show up when streaming games via Steam Link to our Apple TV. The connection is wired, but it’s not uncommon for the network to drop. I think it’s related to the same issue, but I’ll keep an eye on it to see whether it still happens after the fix.
Possible fix
A possible fix seems to be disabling GSO (Generic Segmentation Offload), TSO (TCP Segmentation Offload) and GRO (Generic Receive Offload) on the network interface*:
ethtool -K eth0 gso off gro off tso off
I have applied this to my setup and I’m waiting to see whether it actually solves the issue in the long run.
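To double-check that the change took effect, the current feature flags can be read back with `ethtool -k` (lowercase k). The snippet below is a minimal sketch with a made-up helper name, offloads_disabled. It assumes the usual `ethtool -k` output, where the three features are listed as generic-segmentation-offload, generic-receive-offload and tcp-segmentation-offload.

```shell
#!/bin/sh
# Hypothetical helper (name invented for illustration): reads
# `ethtool -k <iface>` style output on stdin and succeeds only if
# GSO, GRO and TSO are all reported as "off".
offloads_disabled() {
    awk '
        /^(generic-segmentation-offload|generic-receive-offload|tcp-segmentation-offload):/ {
            seen++                      # one of the three features we care about
            if ($2 != "off") bad++      # still enabled
        }
        END {
            # All three must be present and none may be enabled.
            if (seen == 3 && bad == 0) exit 0
            exit 1
        }
    '
}
```

Usage would be something like `ethtool -k eth0 | offloads_disabled && echo "offloads are off"`. Keep in mind that settings changed with `ethtool -K` do not survive a reboot, so the command has to be re-applied at boot. On a Debian-style ifupdown setup, one option is a `post-up` line for the interface in /etc/network/interfaces, but that is an assumption about the setup and I haven’t verified it here.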
Footnotes
* These options are related to offloading packet segmentation to the network interface controller to reduce CPU usage on the machine. More details can be found in this Wikipedia article.