[svlug] dma_intr <-- Strange DMA hard drive error with Via chipset
Marat BN
maratbn at yahoo.com
Fri Dec 17 02:34:55 PST 2004
Dudes,

We're seeing a serious problem on a Via chipset system running dual
hard drives under the Logical Volume Manager (LVM) in DMA mode. From
time to time, the following errors get dumped onto the console and into
the kernel log:
*********************
Nov 30 14:49:38 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 14:49:38 localhost kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 30 14:49:38 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 14:49:38 localhost kernel: hda: dma_intr: error=0x04 { DriveStatusError }
Nov 30 14:51:55 localhost smartd[1174]: Device: /dev/hda, ATA error count increased from 594 to 596
***************
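For what it's worth, the hex values in these messages look like plain
bit masks, and the names in braces are just the bits that are set. Here
is a minimal sketch of that decoding. It assumes the standard ATA
status/error register layout; the bit names are the ones the kernel
printed above plus the usual names for the remaining status bits, and
the decode() helper is a hypothetical name of my own, not anything from
the kernel source.

```python
# Sketch only: bit positions assume the standard ATA register layout.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Only the two error bits that actually show up in the log are named here.
ERROR_BITS = {
    0x04: "DriveStatusError",  # command aborted
    0x80: "BadCRC",            # CRC error on the interface during a DMA transfer
}


def decode(value, names):
    """Return the names of the bits set in value, in table order."""
    found = [name for bit, name in names.items() if value & bit]
    # Report any set bit that the table does not name instead of hiding it.
    unknown = value & ~sum(names)
    if unknown:
        found.append("unknown(0x%02x)" % unknown)
    return found


if __name__ == "__main__":
    print("status=0x51", decode(0x51, STATUS_BITS))
    print("error=0x84 ", decode(0x84, ERROR_BITS))
    print("error=0x04 ", decode(0x04, ERROR_BITS))
```

Run on the values from the log, this gives back the same names the
kernel printed, so at least the messages are internally consistent.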
It appears there is a function in the kernel called dma_intr which
prints out the error messages above. But what do you think these error
messages mean? The system also tends to crash frequently. Check out the
following:
*************************
Nov 30 15:40:49 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 15:40:49 localhost kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 30 15:40:49 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 15:40:49 localhost kernel: hda: dma_intr: error=0x04 { DriveStatusError }
Nov 30 15:56:22 localhost sshd(pam_unix)[4343]: session opened for user guard by (uid=500)
Nov 30 15:57:27 localhost sshd(pam_unix)[4343]: session closed for user guard
Nov 30 15:57:37 localhost sshd(pam_unix)[4415]: session opened for user guard by (uid=500)
Nov 30 15:59:53 localhost su(pam_unix)[4593]: session opened for user root by guard(uid=500)
Nov 30 16:01:03 localhost su(pam_unix)[4593]: session closed for user root
Nov 30 16:01:23 localhost kernel: Unable to handle kernel paging request at virtual address 89868286
Nov 30 16:01:23 localhost kernel: printing eip:
Nov 30 16:01:23 localhost kernel: c0135e87
Nov 30 16:01:23 localhost kernel: *pde = 00000000
Nov 30 16:01:23 localhost kernel: Oops: 0000
Nov 30 16:01:23 localhost kernel: CPU: 0
Nov 30 16:01:23 localhost kernel: EIP: 0060:[<c0135e87>] Not tainted
Nov 30 16:01:23 localhost kernel: EFLAGS: 00010093
Nov 30 16:01:23 localhost kernel:
Nov 30 16:01:23 localhost kernel: EIP is at (2.4.20-6crusoe)
Nov 30 16:01:23 localhost kernel: eax: 89868286 ebx: 00000102 ecx: c1a4ef18 edx: c1a4ef28
Nov 30 16:01:23 localhost kernel: esi: 00000008 edi: 00000000 ebp: c1a4ef84 esp: cb1bfe2c
Nov 30 16:01:23 localhost kernel: ds: 0068 es: 0068 ss: 0068
Nov 30 16:01:24 localhost kernel: Process AGM (pid: 4837, stackpage=cb1bf000)
Nov 30 16:01:24 localhost kernel: Stack: c030e458 00000004 c1a4ef18 00000000 00000007 00000007 00002202 000001d2
Nov 30 16:01:24 localhost kernel: 000001d2 00000000 c0138c9e c1a4c12c 000001d2 cb1be000 00000001 c01392e1
Nov 30 16:01:24 localhost kernel: 000001d2 c027efcc 00000300 c013a87f c027efc0 00000000 00000001 00000001
Nov 30 16:01:24 localhost kernel: Call Trace: [<c0138c9e>] (0xcb1bfe54))
Nov 30 16:01:24 localhost kernel: [<c01392e1>] (0xcb1bfe68))
Nov 30 16:01:24 localhost kernel: [<c013a87f>] (0xcb1bfe78))
Nov 30 16:01:24 localhost kernel: [<c012ca68>] (0xcb1bfeb4))
Nov 30 16:01:24 localhost kernel: [<c012cf71>] (0xcb1bfed8))
Nov 30 16:01:24 localhost kernel: [<c0115bf2>] (0xcb1bff08))
Nov 30 16:01:24 localhost kernel: [<c012ee20>] (0xcb1bff2c))
Nov 30 16:01:24 localhost kernel: [<c0123b3f>] (0xcb1bff4c))
Nov 30 16:01:24 localhost kernel: [<c011fd3c>] (0xcb1bff78))
Nov 30 16:01:24 localhost kernel: [<c011fc5d>] (0xcb1bff7c))
Nov 30 16:01:24 localhost kernel: [<c010a92f>] (0xcb1bffa0))
Nov 30 16:01:24 localhost kernel: [<c0115a70>] (0xcb1bffb0))
Nov 30 16:01:24 localhost kernel: [<c0109368>] (0xcb1bffb8))
Nov 30 16:01:24 localhost kernel:
Nov 30 16:01:24 localhost kernel:
Nov 30 16:01:24 localhost kernel: Code: 8b 00 43 39 d0 75 f9 8b 44 24 08 89 da 8b 78 24 8b 40 44 89
**********************************
We suspect that the problem with hard drive DMA access corrupted the
kernel and executables, causing them to crash. However, we checked the
md5sums of the kernel and several executables (even though executables
should not cause a total system crash like the one above), and found
that they match the corresponding md5sums on sister systems perfectly.
As you can see, this log was made with kernel 2.4.20-6, but I got a
similar crash with 2.4.26. Does anybody have any ideas as to what may
be causing the crash?
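In case anyone wants to repeat the md5sum comparison, it amounts to
roughly the following. This is only a sketch of the idea, not the exact
procedure we ran; md5_of() and mismatches() are hypothetical helper
names, and the file list to check is whatever you choose to pass in.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(local, reference):
    """Return the paths whose checksum differs from, or is absent in, reference.

    Both arguments map a file path to its hex md5 digest, e.g. one built
    on the suspect machine and one built on a sister system.
    """
    return sorted(p for p, s in local.items() if reference.get(p) != s)
```

An empty list from mismatches() corresponds to what we saw: every
checked file is identical to the copy on the sister system.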
The "dma_intr" error messages go away if we turn off DMA with
"hdparm -d0 /dev/hdX"; however, the system may still crash even with
DMA turned off.
The "dma_intr" errors seem to occur only on our systems with the Via
chipset. If we boot the drives on a different system with a different
chipset, the "dma_intr" message does not appear.
Thanks a lot for your time. I hope some of you have some pointers on
what's going on here.
Marat