// Intel 82574L NICs: network hangs / ASPM Bug / e1000 driver

A few days ago, I ran into an ugly bug on several Scientific Linux 6.3 hosts (so it should also affect RHEL 6.3 and CentOS 6.3). The network hangs while the system itself stays up, running and responsive. “Just” no network. Restarting the affected network interfaces is not enough; only a complete reboot brings the Intel 82574L-based network cards back to life (those NICs are onboard on the Supermicro X9SCM-F and X8SIL mainboards of the affected hosts, so I can't simply replace them). The logs showed entries like the following:

[...]
Jan 24 09:52:35 host2 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jan 24 09:52:35 host2 kernel: Hardware name: X9SCL/X9SCM
Jan 24 09:52:35 host2 kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Jan 24 09:52:35 host2 kernel: Modules linked in: fuse autofs4 sunrpc vboxpci(U) vboxnetadp(U) vboxnetflt(U) vboxdrv(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd uinput raid1 sg microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000e ext4 mbcache jbd2 fpu aesni_intel cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt raid10 sd_mod crc_t10dif ahci video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jan 24 09:52:35 host2 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Jan 24 09:52:35 host2 kernel: Call Trace:
Jan 24 09:52:35 host2 kernel: <IRQ>  [<ffffffff8106a1e7>] ? warn_slowpath_common+0x87/0xc0
Jan 24 09:52:35 host2 kernel: [<ffffffff8101c0fa>] ? intel_pmu_enable_all+0xba/0x160
Jan 24 09:52:35 host2 kernel: [<ffffffff8106a2d6>] ? warn_slowpath_fmt+0x46/0x50
Jan 24 09:52:35 host2 kernel: [<ffffffff8144792d>] ? dev_watchdog+0x26d/0x280
Jan 24 09:52:35 host2 kernel: [<ffffffff814476c0>] ? dev_watchdog+0x0/0x280
Jan 24 09:52:35 host2 kernel: [<ffffffff8107d2c7>] ? run_timer_softirq+0x197/0x340
Jan 24 09:52:35 host2 kernel: [<ffffffff810a0910>] ? tick_sched_timer+0x0/0xc0
Jan 24 09:52:35 host2 kernel: [<ffffffff8102adad>] ? lapic_next_event+0x1d/0x30
Jan 24 09:52:35 host2 kernel: [<ffffffff81072991>] ? __do_softirq+0xc1/0x1e0
Jan 24 09:52:35 host2 kernel: [<ffffffff81095510>] ? hrtimer_interrupt+0x140/0x250
Jan 24 09:52:35 host2 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jan 24 09:52:35 host2 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Jan 24 09:52:35 host2 kernel: [<ffffffff81072775>] ? irq_exit+0x85/0x90
Jan 24 09:52:35 host2 kernel: [<ffffffff814f1fa0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jan 24 09:52:35 host2 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Jan 24 09:52:35 host2 kernel: <EOI>  [<ffffffff812ec17e>] ? acpi_idle_enter_c1+0xa3/0xc1
Jan 24 09:52:35 host2 kernel: [<ffffffff812ec15d>] ? acpi_idle_enter_c1+0x82/0xc1
Jan 24 09:52:35 host2 kernel: [<ffffffff813f6c67>] ? cpuidle_idle_call+0xa7/0x140
Jan 24 09:52:35 host2 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Jan 24 09:52:35 host2 kernel: [<ffffffff814d109a>] ? rest_init+0x7a/0x80
Jan 24 09:52:35 host2 kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
Jan 24 09:52:35 host2 kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Jan 24 09:52:35 host2 kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Jan 24 09:52:35 host2 kernel: ---[ end trace 1f3cc9d5dfc619c0 ]---
Jan 24 09:52:35 host2 kernel: e1000e 0000:02:00.0: eth1: Reset adapter
[...]

After some googling, I found a useful bug report and a mailing list thread. Three postings in particular are quite informative:

It seems that the ASPM (Active State Power Management) implementation of the Intel 82574L is broken. The corresponding Linux driver “e1000e” therefore has this chip on its ASPM blacklist and disables ASPM when the system boots. However, some side effect re-enables the NIC's ASPM state L1 after a network connection is established. This does not happen on all Linux flavors and kernel versions, but it happens at least on Scientific Linux 6.3 with kernel 2.6.32-279.19.1.

Workaround: disable the NIC's ASPM after the system boots

A quick workaround is to manually disable the NIC's ASPM after the system has booted and the network has “stabilized” (e.g. after a few minutes). The following command disables ASPM for a device:

setpci -s <ID-of-device> CAP_EXP+10.b=40
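The value 40 is a hexadecimal byte written to the low byte of the PCI Express Link Control register. As far as I understand the register layout, bits 1:0 select the ASPM states (00 means disabled) and bit 6 is the common clock setting that lspci shows as CommClk+. If you would rather keep whatever the other bits currently are, a read-modify-write is an option. The following is only a sketch; the helper names aspm_state and clear_aspm_bits are made up for illustration and are not part of pciutils:

```shell
# Sketch only: both helper names are hypothetical, not part of pciutils.

# Decode the ASPM Control field (bits 1:0) of a LnkCtl low byte.
# Argument: the byte in hex without 0x prefix, as printed by setpci.
aspm_state() {
    case $(( 0x$1 & 0x03 )) in
        0) echo "Disabled" ;;
        1) echo "L0s" ;;
        2) echo "L1" ;;
        3) echo "L0s L1" ;;
    esac
}

# Print the byte with only the ASPM bits cleared; all other bits are kept.
clear_aspm_bits() {
    printf '%02x\n' $(( 0x$1 & 0xfc ))
}

# Intended use on a real host (as root). Left commented out so that
# sourcing this file never touches hardware:
#   cur=$(setpci -s 02:00.0 CAP_EXP+10.b)
#   setpci -s 02:00.0 CAP_EXP+10.b="$(clear_aspm_bits "$cur")"
```

For example, clear_aspm_bits 43 prints 40, which is exactly the value used in the command above.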

You can use lspci -vnn to get the device's ID (its PCI address, the first field of the line, 02:00.0 in the following example output):

[root@host2 ~]# lspci -vnn | grep '82574'
02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
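Matching on the product name works, but the numeric vendor:device pair in brackets ([8086:10d3] in the output above) is less ambiguous. A minimal sketch, assuming your cards report the same pair (other 82574 variants may use a different one); the function name find_82574_addrs is hypothetical:

```shell
# Sketch only: find_82574_addrs is a made-up helper name.
# Reads `lspci -nn` output on stdin and prints the PCI address (first field)
# of every line carrying the vendor:device pair [8086:10d3].
find_82574_addrs() {
    awk '/\[8086:10d3\]/ { print $1 }'
}

# Intended use on a real host:
#   lspci -nn | find_82574_addrs
```

lspci can also filter by itself with lspci -d 8086:10d3, which avoids the text matching entirely.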

Example: I used /etc/rc.local to disable ASPM on the device with ID 02:00.0 five minutes after the system boots, by putting the following lines at the end of the file:

# workaround for Intel 82574L bug, see http://bit.ly/1565w6I for details
printf '%s\n' 'setpci -s 02:00.0 CAP_EXP+10.b=40' | at now + 5min

Use lspci -vvvv -s <ID-of-device> if you want to check if ASPM is really disabled (look for “LnkCtl: ASPM Disabled”):

[root@host2 ~]# lspci -vvvv -s 02:00.0
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
[...]
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
[...]
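If you want to run this check from a script (for example to log whether the workaround is still in effect), the LnkCtl line can be tested automatically. A small sketch, assuming the lspci output format shown above; aspm_is_disabled is a hypothetical helper name:

```shell
# Sketch only: aspm_is_disabled is a made-up helper name.
# Reads `lspci -vv -s <ID-of-device>` output on stdin.
# Returns 0 if the LnkCtl line reports "ASPM Disabled", non-zero otherwise.
# Only the LnkCtl line is inspected, because the LnkCap line also mentions ASPM.
aspm_is_disabled() {
    grep 'LnkCtl:' | grep -q 'ASPM Disabled'
}

# Intended use on a real host (as root):
#   if lspci -vv -s 02:00.0 | aspm_is_disabled; then
#       echo "ASPM is off"
#   else
#       echo "ASPM is still (or again) enabled"
#   fi
```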

I hope this helps someone else in some way. :-)

Comments

dtonhofer
No. 1 @ 2013/02/12 14:27

82574L!! One of my bugbears.

May be fixed soon:

Dean Nelson says on 2012-12-13 in https://bugzilla.redhat.com/show_bug.cgi?id=697036

In the mean time, I've created a few RHEL6.3 kernel rpms with the two
commits applied. And you can find them at the following URL
http://people.redhat.com/dnelson/.bbaf48b55803a7eb34e330530d/

No. 2 @ 2013/02/12 17:41

This really sounds like the very same issue as my friend Kris reported at http://blog.krisk.org/2013/02/packets-of-death.html

Kris also has an update (including details from Intel) here http://blog.krisk.org/2013/02/packets-of-death-update.html

I hope that helps!

-Jared

No. 3 @ 2013/02/12 19:54

@Jared Smith: Thanks for the information but no, this is another problem.

The broken ASPM blacklist of the e1000 driver seems to be fixed in version 2.1.4 (and RHEL 6.4 should ship this version of e1000). It seems that Ubuntu backported the fix to e1000 v1.4.4, so the problem appears to exist only on Red Hat platforms.

Also interesting: Packet drop issues may occur in some 82574 and 82583-based adapters, EEPROM fix

Guest
No. 4 @ 2013/02/18 22:12

Hey guys, don't confuse the e1000 with the e1000e driver!

Intel 82574L uses the e1000e driver.

Jeroen
No. 5 @ 2013/06/12 14:51

I have the same problem with: 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller (rev 09)

Curiously ASPM is disabled already, but the same problem occurs with almost the exact same backtrace, same queue timeout. This is on Ubuntu 13.04's 3.8 kernel, by the way.

Once I start transferring large files (i.e. 1GB+ in a few seconds), it'll consistently lock up the network card after 30 seconds or so.

The link does on occasion reestablish itself after 10 or so seconds, but it'll be interminably slow from thereon out. Only a reboot will fix it, rmmod+modprobe might do the trick the first time it happens.

It's not limited to i82574L or that kernel series, that's for sure. :( There's a 2-line patch for the r8169 driver, sadly not yet in mainline:

https://patchwork.kernel.org/patch/2403501/

(And yes, it says it's about an AMD Vi pagefault. That happens to be the error you'll see preceding the net queue timeout in this particular case. But same slowpath backtrace, same everything else).

No. 6 @ 2013/06/27 20:24

Thanks for sharing this knowledge. I would like to add that if you update the NIC driver from Intel using the EPEL repo, you get the same result: the ASPM feature is disabled.

Regards

No. 7 @ 2013/08/26 19:13

Hi Andreas,

Thanks for posting this. I had recently switched from using ESXi to virtualize my home/SOHO servers, and switched to KVM on CentOS 6.4. After about a week of use, I had horrible packet loss problems for an entire weekend. Tried several of the “e1000e” / “82573L” “Fixes” only to have the problem come back after 5-10 minutes or heavy traffic through the adapter.

Your fix with “setpci” has worked like a champ for me so far. If this holds up, looks like I might be buying you a coffee!

Thanks again, Merci beaucoup!

Mich165
No. 8 @ 2013/10/26 02:07

Your Super fix WORKS!! THANK YOU!!! had big problems with this finally its working!!! Wohoo

You're the King!

Thanks again!! Mich

Andy Linton
No. 9 @ 2014/04/14 08:27

I created a script to do this (/usr/local/sbin/e1000e-fix). It gets called from /etc/rc.local at the end of the boot sequence.

#!/bin/sh

# workaround for Intel 82574L bug, see http://bit.ly/1565w6I for details

#
# Wait a while after boot - not sure why we want this
# asjl 14 April 2014
#
#sleep 120

PROGNAME=$(basename "$0")
#
# Find each Intel 82574L-based network card
#
for DEVICE in $(lspci -vnn | grep '82574' | cut -d' ' -f 1)
do
        #
        # log the ASPM state before
        #
        lspci -vvvv -s "${DEVICE}" | grep LnkCtl | tr '\t' ' ' | logger -t "${PROGNAME}: ${DEVICE}"
        #
        # disable ASPM for a device
        #
        setpci -s "${DEVICE}" CAP_EXP+10.b=40
        #
        # log the ASPM state after
        #
        lspci -vvvv -s "${DEVICE}" | grep LnkCtl | tr '\t' ' ' | logger -t "${PROGNAME}: ${DEVICE}"
done

The messages I see at boot time look like:

Apr 14 16:16:26 berwick e1000e-fix: 01:00.0:   LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
Apr 14 16:16:26 berwick e1000e-fix: 01:00.0:   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
Apr 14 16:16:27 berwick e1000e-fix: 04:00.0:   LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
Apr 14 16:16:27 berwick e1000e-fix: 04:00.0:   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+

No. 10 @ 2014/05/08 09:30

Many thanks! Coffee is on its way :-)
