[svlug] dma_intr <-- Strange DMA hard drive error with Via chipset

Marat BN maratbn at yahoo.com
Fri Dec 17 02:34:55 PST 2004


Dudes,

We're seeing a serious problem on a Via chipset system running dual hard
drives with Logical Volume Manager (LVM) in DMA mode.  From time to time,
the following errors are dumped onto the console and into the kernel log:

*********************
Nov 30 14:49:38 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 14:49:38 localhost kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 30 14:49:38 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 14:49:38 localhost kernel: hda: dma_intr: error=0x04 { DriveStatusError }
Nov 30 14:51:55 localhost smartd[1174]: Device: /dev/hda, ATA error count increased from 594 to 596
***************
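
For anyone decoding the hex values by hand, here is a minimal Python sketch
of how status=0x51 and error=0x84 map onto the names printed in braces.  The
bit table is an assumption based on the standard ATA status and error
register layout and the names the 2.4 IDE driver prints; decode_status and
decode_error are illustrative names, not kernel functions.

```python
# Assumed bit layout of the ATA status register, with the names the
# Linux 2.4 IDE driver prints for each bit.
STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the flag names for an ATA status byte."""
    if value & 0x80:
        # When the busy bit is set, the other bits are not meaningful.
        return ["Busy"]
    return [name for bit, name in STATUS_BITS if value & bit]


def decode_error(value):
    """Return the flag names for an ATA error byte."""
    names = []
    aborted = bool(value & 0x04)
    if aborted:
        names.append("DriveStatusError")
    if value & 0x80:
        # Bit 7 reads as a CRC error when the command was also aborted,
        # and as a bad sector otherwise.
        names.append("BadCRC" if aborted else "BadSector")
    if value & 0x40:
        names.append("UncorrectableError")
    if value & 0x10:
        names.append("SectorIdNotFound")
    if value & 0x02:
        names.append("TrackZeroNotFound")
    if value & 0x01:
        names.append("AddrMarkNotFound")
    return names


print(decode_status(0x51))
print(decode_error(0x84))
print(decode_error(0x04))
```

Under this reading, error=0x84 is the abort bit plus the CRC bit.  As far as
I understand, in Ultra DMA modes that bit is set when a transfer fails its
interface CRC check between the drive and the controller.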

There appears to be a function in the kernel called dma_intr that prints
the error messages shown in the log above.  But what do you think these
messages mean?  The system also tends to crash frequently.  Check out the
following:

*************************
Nov 30 15:40:49 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 15:40:49 localhost kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 30 15:40:49 localhost kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 30 15:40:49 localhost kernel: hda: dma_intr: error=0x04 { DriveStatusError }
Nov 30 15:56:22 localhost sshd(pam_unix)[4343]: session opened for user guard by (uid=500)
Nov 30 15:57:27 localhost sshd(pam_unix)[4343]: session closed for user guard
Nov 30 15:57:37 localhost sshd(pam_unix)[4415]: session opened for user guard by (uid=500)
Nov 30 15:59:53 localhost su(pam_unix)[4593]: session opened for user root by guard(uid=500)
Nov 30 16:01:03 localhost su(pam_unix)[4593]: session closed for user root
Nov 30 16:01:23 localhost kernel: Unable to handle kernel paging request at virtual address 89868286
Nov 30 16:01:23 localhost kernel:  printing eip:
Nov 30 16:01:23 localhost kernel: c0135e87
Nov 30 16:01:23 localhost kernel: *pde = 00000000
Nov 30 16:01:23 localhost kernel: Oops: 0000
Nov 30 16:01:23 localhost kernel: CPU:    0
Nov 30 16:01:23 localhost kernel: EIP:    0060:[<c0135e87>]    Not tainted
Nov 30 16:01:23 localhost kernel: EFLAGS: 00010093
Nov 30 16:01:23 localhost kernel:
Nov 30 16:01:23 localhost kernel: EIP is at  (2.4.20-6crusoe)
Nov 30 16:01:23 localhost kernel: eax: 89868286   ebx: 00000102   ecx: c1a4ef18   edx: c1a4ef28
Nov 30 16:01:23 localhost kernel: esi: 00000008   edi: 00000000   ebp: c1a4ef84   esp: cb1bfe2c
Nov 30 16:01:23 localhost kernel: ds: 0068   es: 0068   ss: 0068
Nov 30 16:01:24 localhost kernel: Process AGM (pid: 4837, stackpage=cb1bf000)
Nov 30 16:01:24 localhost kernel: Stack: c030e458 00000004 c1a4ef18 00000000 00000007 00000007 00002202 000001d2
Nov 30 16:01:24 localhost kernel:        000001d2 00000000 c0138c9e c1a4c12c 000001d2 cb1be000 00000001 c01392e1
Nov 30 16:01:24 localhost kernel:        000001d2 c027efcc 00000300 c013a87f c027efc0 00000000 00000001 00000001
Nov 30 16:01:24 localhost kernel: Call Trace:   [<c0138c9e>]  (0xcb1bfe54))
Nov 30 16:01:24 localhost kernel: [<c01392e1>]  (0xcb1bfe68))
Nov 30 16:01:24 localhost kernel: [<c013a87f>]  (0xcb1bfe78))
Nov 30 16:01:24 localhost kernel: [<c012ca68>]  (0xcb1bfeb4))
Nov 30 16:01:24 localhost kernel: [<c012cf71>]  (0xcb1bfed8))
Nov 30 16:01:24 localhost kernel: [<c0115bf2>]  (0xcb1bff08))
Nov 30 16:01:24 localhost kernel: [<c012ee20>]  (0xcb1bff2c))
Nov 30 16:01:24 localhost kernel: [<c0123b3f>]  (0xcb1bff4c))
Nov 30 16:01:24 localhost kernel: [<c011fd3c>]  (0xcb1bff78))
Nov 30 16:01:24 localhost kernel: [<c011fc5d>]  (0xcb1bff7c))
Nov 30 16:01:24 localhost kernel: [<c010a92f>]  (0xcb1bffa0))
Nov 30 16:01:24 localhost kernel: [<c0115a70>]  (0xcb1bffb0))
Nov 30 16:01:24 localhost kernel: [<c0109368>]  (0xcb1bffb8))
Nov 30 16:01:24 localhost kernel:
Nov 30 16:01:24 localhost kernel:
Nov 30 16:01:24 localhost kernel: Code: 8b 00 43 39 d0 75 f9 8b 44 24 08 89 da 8b 78 24 8b 40 44 89
**********************************
We suspect that the problem with hard drive DMA access corrupted the kernel
and executables, causing them to crash.  However, we checked the md5sums of
the kernel and several executables (even though executables should not cause
a total system crash like the one above), and found that they exactly match
the corresponding md5sums on sister systems.  As you can see, this log was
made with kernel 2.4.20-6, but I got a similar crash with 2.4.26.  Does
anybody have any ideas as to what may be causing the crash?
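
The md5sum comparison against a sister system can be scripted along these
lines.  This is only a sketch: md5_of_file and differing_files are
hypothetical helper names, and the dictionaries of path-to-checksum pairs
are assumed to have been collected on each machine beforehand.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(local_sums, reference_sums):
    """List paths whose local checksum is absent from, or differs from,
    the reference (sister-system) checksums."""
    return sorted(
        path
        for path, checksum in local_sums.items()
        if reference_sums.get(path) != checksum
    )
```

A matching sum only says that this particular read of the file came back
intact; it does not rule out an earlier or later read returning bad data.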

The "dma_intr" error messages go away if we turn off DMA with
"hdparm -d0 /dev/hdX"; however, the system may still crash even with DMA
turned off.
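
To check whether the messages really stop after "hdparm -d0", the kernel log
can be tallied before and after the change.  Below is a rough Python sketch;
the regular expression is written against the log lines quoted in this
message, and tally_dma_intr is an illustrative name.

```python
import re
from collections import Counter

# Matches entries such as "hda: dma_intr: status=0x51".  The \s+ tolerates
# a line wrap between "dma_intr:" and the code.
DMA_INTR = re.compile(r"(hd[a-z]): dma_intr:\s+(status|error)=(0x[0-9a-fA-F]+)")


def tally_dma_intr(log_text):
    """Count dma_intr entries per (drive, kind, code) in a block of log text."""
    counts = Counter()
    for drive, kind, code in DMA_INTR.findall(log_text):
        counts[(drive, kind, code.lower())] += 1
    return counts
```

Running it over the log captured on the Via system and over the log from
another chipset would give per-drive counts that can be compared directly.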

The "dma_intr" errors seem to occur only on our systems with the Via
chipset.  If we boot the drives on a different system with a different
chipset, the "dma_intr" message does not appear.

Thanks a lot for your time; I hope some of you have some pointers on
what's going on here.

Marat 



		