// Intel 82574L NICs: network hangs / ASPM Bug / e1000 driver

A few days ago, I ran into an ugly bug on different Scientific Linux 6.3 hosts (therefore this should also affect RHEL 6.3 and CentOS 6.3). The network hangs while the system itself is up, running and responsive. “Just” no network. Restarting the affected network interfaces is not enough, only a complete reboot brings the Intel 82574L-based network cards back to life (those NICs are onBoard on the Supermicro X9SCM-F and X8SIL mainboards of the affected hosts, so I can't simply change them). The logs showed entries like the following:

[...]
Jan 24 09:52:35 host2 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jan 24 09:52:35 host2 kernel: Hardware name: X9SCL/X9SCM
Jan 24 09:52:35 host2 kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Jan 24 09:52:35 host2 kernel: Modules linked in: fuse autofs4 sunrpc vboxpci(U) vboxnetadp(U) vboxnetflt(U) vboxdrv(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd uinput raid1 sg microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000e ext4 mbcache jbd2 fpu aesni_intel cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt raid10 sd_mod crc_t10dif ahci video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jan 24 09:52:35 host2 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Jan 24 09:52:35 host2 kernel: Call Trace:
Jan 24 09:52:35 host2 kernel: <IRQ>  [<ffffffff8106a1e7>] ? warn_slowpath_common+0x87/0xc0
Jan 24 09:52:35 host2 kernel: [<ffffffff8101c0fa>] ? intel_pmu_enable_all+0xba/0x160
Jan 24 09:52:35 host2 kernel: [<ffffffff8106a2d6>] ? warn_slowpath_fmt+0x46/0x50
Jan 24 09:52:35 host2 kernel: [<ffffffff8144792d>] ? dev_watchdog+0x26d/0x280
Jan 24 09:52:35 host2 kernel: [<ffffffff814476c0>] ? dev_watchdog+0x0/0x280
Jan 24 09:52:35 host2 kernel: [<ffffffff8107d2c7>] ? run_timer_softirq+0x197/0x340
Jan 24 09:52:35 host2 kernel: [<ffffffff810a0910>] ? tick_sched_timer+0x0/0xc0
Jan 24 09:52:35 host2 kernel: [<ffffffff8102adad>] ? lapic_next_event+0x1d/0x30
Jan 24 09:52:35 host2 kernel: [<ffffffff81072991>] ? __do_softirq+0xc1/0x1e0
Jan 24 09:52:35 host2 kernel: [<ffffffff81095510>] ? hrtimer_interrupt+0x140/0x250
Jan 24 09:52:35 host2 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jan 24 09:52:35 host2 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Jan 24 09:52:35 host2 kernel: [<ffffffff81072775>] ? irq_exit+0x85/0x90
Jan 24 09:52:35 host2 kernel: [<ffffffff814f1fa0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jan 24 09:52:35 host2 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Jan 24 09:52:35 host2 kernel: <EOI>  [<ffffffff812ec17e>] ? acpi_idle_enter_c1+0xa3/0xc1
Jan 24 09:52:35 host2 kernel: [<ffffffff812ec15d>] ? acpi_idle_enter_c1+0x82/0xc1
Jan 24 09:52:35 host2 kernel: [<ffffffff813f6c67>] ? cpuidle_idle_call+0xa7/0x140
Jan 24 09:52:35 host2 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Jan 24 09:52:35 host2 kernel: [<ffffffff814d109a>] ? rest_init+0x7a/0x80
Jan 24 09:52:35 host2 kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
Jan 24 09:52:35 host2 kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Jan 24 09:52:35 host2 kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Jan 24 09:52:35 host2 kernel: ---[ end trace 1f3cc9d5dfc619c0 ]---
Jan 24 09:52:35 host2 kernel: e1000e 0000:02:00.0: eth1: Reset adapter
[...]

After some googleing, I found a useful Bug-Report and a mailing list thread. Especially three postings are quite informative:

It seems that the ASPM of the Intel 82574L is broken. The corresponding Linux driver “e1000” therefore has this chip on its ASPM blacklists and disables it when the systems boots. However, there is some side effect which re-enabled the NIC'S ASPM state L1 after a network connection was established. This does not happen on all Linux flavors and kernel versions, but it happens at least on Scientific 6.3 with kernel 2.6.32-279.19.1.

Workaround: disable the NIC's ASPM after the system boots

A quick workaround is to manually disable the NIC'S ASPM after the system booted and the network “stabilized” (e.g. after a few minutes). The following command disables ASPM for a device:

setpci -s <ID-of-device> CAP_EXP+10.b=40

You can use lspci -vnn to get the device ID (first number of the line, 02:00.0 in the following example output):

[root@host2 ~]# lspci -vnn | grep '82574'
02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]

Example: I used /etc/rc.local to disable ASPM on the device with ID 02:00.0, five minutes after the system boots by putting the following lines at the end of the file:

# workaround for Intel 82574L bug, see http://bit.ly/1565w6I for details
printf '%s\n' 'setpci -s 02:00.0 CAP_EXP+10.b=40' | at now + 5min

Use lspci -vvvv -s <ID-of-device> if you want to check if ASPM is really disabled (look for “LnkCtl: ASPM Disabled”):

[root@host2 ~]# lspci -vvvv -s 02:00.0
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
[...]
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
[...]

I hope this helps someone else in some way. :-)

// Microsoft Internet Explorer 8: Fix the weird textarea scrolling bug

MSIE 8 brings some new, weird bugs. One of the most annoying one targets <textarea>s with a percentage value for the width CSS property (ironically in standards mode only, the problem disappears in IE7 compatibility view or quirks mode). When a textarea has got enough content to offer scrollbars and the user already scrolled a little bit, every keystroke will because a disturbing auto-scroll effect (even with flickering scrollbars under certain conditions like onkeyup/down events or border CSS property values). This is a real show-stopper which makes editing longer texts very uncomfortable for IE8 users.

The following file shows the problem. Simply download and open it with IE8.1)

msie8-textarea-bug.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>MSIE8 textarea scroll bug/fail demo</title>
</head>
<script type="text/javascript">/*<![CDATA[*/
function checkMode() {
    var standard = false;
    if (document.compatMode) {
        if (document.compatMode === "BackCompat") {
            m = "Quirks";
        } else if (document.compatMode === "CSS1Compat") {
            m = "Standards Compliance";
            standard = true;
        } else {
            m = "Almost Standards Compliance";
        }
        if (standard === false) {
           alert("ATTENTION: The document is being rendered in "+m+" Mode - there will be NO bug!");
        }
        //alert("The document is being rendered in "+m+" Mode.");
    }
}
checkMode();
/*]]>*/</script>
<style type="text/css">
textarea {
    height: 180px;
    width: 50%; /* this will trigger the bug */
}
</style>
<body>
  <form name="foobar" method="post" action="">
    <textarea>
Please scroll down a little bit and try to edit text. E.g.
change the line by clicking into the text, type "asdf",
go to another line, type "asdf" and so on.
 
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
    </textarea>
</body>
</html>

I don't know how to get rid of the flickering scroll bars,2) but there is a workaround for the automatic scroll effect:

  • The bug is triggered by a percentage value of the CSS width property.
  • To get rid of it, you need to define a fixed px width.
  • If you do not want to loose the possibility to use flexible widths for <textarea>, you can get the needed percentage value through the back-door by using min-width and max-width with the same value afterwards. You may use browser specific hacks to hide/overwrite the fixed width from IE6 which is not affected by this bug but does not understand min/max-width.

Example:

textarea {
    /* width for non-IE browsers */
    width: 50%; 
 
    /* width for IE
       NOTE: - "\9" at the end is a CSS hack to address only IE (all versions).
             - "#" in front is a CSS hack to address IE6/7 */
    width: 500px\9; /* fix the bug */
    /* get the needed percentage value in IE7 and IE8 */
    min-width: 50%\9;
    max-width: 50%\9;
    /* get the needed percentage value in IE6 */
    #width: 50%;
}

Hope that helps. :-)

1)
tested on WinXP SP3 32bit
2)
this is thankfully just a minor, “cosmetic” problem

// 204 = 1223

Die Gleichung sieht komisch aus, ist total falsch und macht keinen Sinn. Jedenfalls für einen normalen Menschen. Die Jungs die den Internet Explorer entworfen haben scheinen dies aber anders zu sehen.

Heute war ein wenig Ajax angesagt. Das Skript hat den HTTP-Status-Code einer XML-Antwortdatei abgefangen, welches bei einer gültigen Anfrage ohne Ergebnis keine Ausgabe bringt und nur den 204 Code sendet, welcher für no content steht – also genau die ggf. vorliegende Situation “gültige Anfrage aber kein Inhalt” beschreibt. Soweit so gut. Dies sollte nun abgefangen werden um direkt den darauffolgenden Request abzusetzen:

} else if (http_request.status == 204) {

*kabooom*. Hat überall funktioniert, nur im IE nicht. Nach einigem Debuggen staunte ich nicht schlecht, als ein

alert(http_request.status);

in jedem Browser 204 ausgab, im IE allerdings ein 1223 produzierte. Also schnell Google angeworfen und mal wieder die Bestätigung eingeholt, das der IE – auch in der Version 7 – ein Biest ist. Er nimmt sich nämlich einfach die Freiheit in machen Fällen den Status-Code von 204 auf 1223 umzuschreiben. Zum kotzen. Um ein zusätzliches

|| http_request.status == 1223

kam ich in dem Skript also nicht herum. :-/

I'm no native speaker (English)
Please let me know if you find any errors (I want to improve my English skills). Thank you!
QR Code: URL of current page
QR Code: URL of current page start (generated for current page)