NV+ lockup during quotacheck [Assistance Needed]

This forum will be deprecated soon. Please post issues pertaining to boot, installation, upgrade, and expansion in the appropriate forums above.

Moderators: chirpa, c3po

Similar topics


NV+ lockup during quotacheck [Assistance Needed]

Postby dberger » Wed Jan 28, 2009 10:14 pm

I got home to find my 3 brand new Samsung 1TB drives had been delivered. I pulled the first disk on my 3x500GB X-RAID NV+, mounted a new drive, and inserted it, and went down to make dinner.

About an hour later I came up to discover the NV+ was non-responsive - no response to network pings, frontview, or ssh, one flashing drive led and two solid ones, but no activity.

With little other choice, I pulled the plug and did a hard restart. The unit started checking quotas, got to about 25%, and locked up - again vanishing off the net.

Another hard boot - with a bit of panic setting in.

Again, got to about 25% on the quota check, and suddenly stopped responding to pings.

Now I'm getting seriously nervous.

I hard-boot it again, and this time as soon as it starts responding to pings, I start attempting to ssh in.

It gets to about 10% on the quota check before I get a shell. top reveals that quotacheck is running, amd mount reveals that my data volume is mounted. I start looking around to see if all the data looks in-tact, when the system again drops off the net - the LCD panel reads 24.7%.

One more hard-boot, ssh in again, this time, kill quotacheck.

Boot proceeds "normally" - and the raid volume starts re-syncing. I noticed that the volume was actually mounted read-only (despite the mount entry claiming read-write), but mount -o remount,rw works, and the volume is writable.

As I type, the re-sync is 5% done, the load average on the box is around 4, and I'm crossing everything I can cross that the resync finishes successfully.

If anyone from Netgear wants additional data for further diagnostics, I'm more than willing to provide it.

The unit is running Radiator 4.1.4 - a couple of potentially relevant log entries:

* this is me - removing a disk *

Wed Jan 28 21:02:46 PST 2009 Disk remove event occurred on SATA channel 1.

Wed Jan 28 21:02:46 PST 2009 Disk fail event occurred on SATA channel 1. If this disk is used in a redundant volume (RAID level 1, 5, or X-RAID), that volume is unprotected, and an additional disk failure may render that volume dead. You should replace the failed disk as soon as possible. Note that some disks may inadvertantly report failure. If you feel this is the case, rebooting the NAS device will automatically resync the disk to the RAID volume. If you get further failure messages, you should replace the disk immediately. If this disk is used in a RAID 0 volume, your volume is now dead as RAID 0 does not provide disk failure protection.

Wed Jan 28 18:47:31 PST 2009 Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.

Wed Jan 28 18:50:38 PST 2009 A SATA reset has been performed on one or more of your disks that may have affected the RAID parity integrity. It is recommended that you perform a RAID volume resync from the RAID Settings tab ( accessible in the Volumes page => Volume tab in FrontView ). The resync process will run in the background, and you can continue to use the ReadyNAS in the meantime.

Wed Jan 28 19:33:25 PST 2009 Access to the disk on channel (??) is producing I/O errors. Although the array is still redundant, please replace this drive as soon as possible, as it is likely to fail soon.

Edit - the unit once again vanished from the net about 10% into the resync. Nothing in /var/log/syslog between

Jan 28 21:04:38 pensive RAIDiator: Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.\n\n[Wed Jan 28 21:04:28 PST 2009]

And

Jan 28 21:38:14 pensive syslogd 1.4.1#10: restart.
Jan 28 21:38:14 pensive kernel: klogd 1.4.1#10, log source = /proc/kmsg started.
Jan 28 21:38:14 pensive kernel: Linux version 2.6.17.8ReadyNAS (root@calzone) (gcc version 3.3.5 (Infrant 3.3.5-1)) #1 Fri Sep 19 15:04:06 PDT 2008

After getting 20% through the sync, it locked up again - the last lines in /var/log/syslog were:

Jan 28 21:44:37 pensive RAIDiator: Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.\n\n[Wed Jan 28 21:44:26 PST 2009]
Jan 28 22:37:26 pensive named[377]: listening on IPv4 interface lo, 127.0.0.1#53
Jan 28 22:39:01 pensive cnid_dbd[27277]: [main.c:207]: I:CNID: Setting uid/gid to 0/0
Jan 28 22:39:01 pensive cnid_dbd[27277]: [main.c:305]: I:CNID: Startup, DB dir /backup/.AppleDB

Another hard boot/ssh/kill quotacheck. This time I'm leaving the volume mounted read only, and killing off all unnecessary services (afpd, samba, mt-daapd, fuppesd).

If it's died again when I get up in the morning I'll try going back to the Seagate 500GB drive. I don't believe the new drives are defective, as I've tried two of them, but perhaps there's some interop issue with 4.1.4 and the Samsung Drives.
dberger
ReadyNAS Newbie
 
Posts: 31
Joined: Thu Oct 25, 2007 3:25 pm
ReadyNAS: NV+

Re: NV+ lockup during quotacheck [Assistance Needed]

Postby dberger » Thu Jan 29, 2009 7:34 am

Came in this morning to see "Kernel Panic" displayed on the LCD.

Hard power cycle displays "Welcome to the ReadyNAS World", then "Booting..." but no drive activity and the unit hasn't started answering pings.

A tail of the logs shows that the init completed around 2 am, and that the new drive started experiencing errors seconds after it was added to the array.

Jan 28 22:49:06 pensive RAIDiator: Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.\n\n[Wed Jan 28 22:48:54 PST 2009]
Jan 28 23:42:47 pensive named[373]: listening on IPv4 interface lo, 127.0.0.1#53
Jan 29 01:36:01 pensive mountd[727]: /c/media exported to both 192.168.0.4 and 192.168.0.0/24, arbitrarily choosing options from first
Jan 29 01:36:03 pensive mountd[727]: /c/home exported to both 192.168.0.4 and 192.168.0.0/24, arbitrarily choosing options from first
Jan 29 01:52:38 pensive kernel: Put drive into channel 0, allow access to it.
Jan 29 01:52:38 pensive kernel: Sending init done event.
Jan 29 01:52:38 pensive kernel: disk_init_finished on 0,how_many=1,selected=1,Present=1, state=0, init_hd=0, sync=0, change_cfg=0
Jan 29 01:52:38 pensive kernel: 1 new drive checked on chn 0.
Jan 29 01:52:38 pensive kernel: WRITE erased flag 0=0
Jan 29 01:52:38 pensive kernel: COD changed position, new drive is smaller: 1953525168, 976773168.
Jan 29 01:52:38 pensive kernel: Drive 0 passed init and test, add to array.
Jan 29 01:52:38 pensive kernel: sata_hotplug: /sbin/hotplug xraid_disk_init_finished hdcUser mode helper start.
Jan 29 01:52:38 pensive kernel: done do_sata_hotplug

then these three lines repeat a bunch:

Jan 29 01:52:40 pensive kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Jan 29 01:52:40 pensive kernel: ide: failed opcode was: unknown
Jan 29 01:52:40 pensive kernel: Read command with DRQ, should have try to flush leftover data, do nothing for XR.

... repeats ...

Jan 29 01:53:22 pensive kernel: hdc: SAMSUNG HD103UJ (s/n:S13PJDWS108544), ATA DISK drive (ATAEXT)
Jan 29 01:53:22 pensive kernel: hdc: max request size: 512KiB
Jan 29 01:53:22 pensive kernel: hdc: use capacity 1953525168 sectors (1000204 MB)
Jan 29 01:53:22 pensive kernel: Drive support hpa, still should not change max addr.
Jan 29 01:53:22 pensive kernel: hdc: 1953108616 sectors (999991 MB) w/32767KiB Cache, CHS=65535/255/63
Jan 29 01:53:22 pensive kernel: 2req = 0 00000000,00000000,00000000,00000000 0
Jan 29 01:53:22 pensive last message repeated 124 times
Jan 29 01:53:22 pensive kernel: Waiting for non-IO result from PIO on hdc
Jan 29 01:53:22 pensive kernel: hdc: lost interrupt
Jan 29 01:53:22 pensive kernel: Unknow token for padre_startstop, ignore...
Jan 29 01:53:22 pensive kernel: (caller: 801b802c,80036a1c,80032140,800321a4,80014948,800362e8,802882d8,f810632c).
Jan 29 01:53:22 pensive kernel: enable_irq after time out on irq 32
Jan 29 01:53:22 pensive kernel: enable_irq(32) unbalanced from 801b7ed4
Jan 29 01:53:22 pensive kernel: hdc: cache flushes supported
Jan 29 01:53:22 pensive kernel: New drive information on chn 0 updated.
Jan 29 01:53:22 pensive kernel: Saved ID buffer from 1
Jan 29 01:53:22 pensive kernel: Saved ID buffer from 2
Jan 29 01:53:22 pensive kernel: Saved ID buffer from 3
Jan 29 01:53:22 pensive kernel: ext.
Jan 29 01:53:22 pensive last message repeated 2 times
Jan 29 01:53:22 pensive kernel: 1 WARNING: COD diff,disk 0 and 8, raid_disks 0 != 2
Jan 29 01:53:22 pensive kernel: 1 WARNING: COD diff,disk 0 and 9, raid_disks 0 != 2
Jan 29 01:53:22 pensive kernel: 1 WARNING: COD diff,disk 0 and 10, raid_disks 0 != 2
Jan 29 01:53:22 pensive kernel: 1 WARNING: COD diff,disk 0 and 11, raid_disks 0 != 2
Jan 29 01:53:22 pensive kernel: Auto sync is on
Jan 29 01:53:22 pensive kernel: X_RAID: recovery thread got woken up ...
Jan 29 01:53:22 pensive kernel: New = 0, source drives = 6, current/active=2/3
Jan 29 01:53:22 pensive kernel: Prepare sync raid 1, source_image=6, sync=2, ready=0
Jan 29 01:53:22 pensive kernel: Start sync:0, source_image=2/0, between 2 to 4096002
Jan 29 01:53:22 pensive kernel: sata_hotplug: /sbin/hotplug xraid_sync_started hdcUser mode helper start.
Jan 29 01:53:22 pensive kernel: done do_sata_hotplug
Jan 29 01:53:22 pensive kernel: hdc: cache flushes supported
Jan 29 01:53:22 pensive kernel: hde: cache flushes supported
Jan 29 01:53:22 pensive kernel: hdg: cache flushes supported
Jan 29 01:53:23 pensive RAIDiator: RAID sync started on volume C.\n\n[Thu Jan 29 01:53:21 PST 2009]
Jan 29 01:54:16 pensive kernel: Sync:2, source_image=2/2, between 2 to 4096002,all_tgt=6
Jan 29 01:55:13 pensive kernel: Start sync raid 4 from 4096002 to 976752688

Jan 29 01:57:47 pensive kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
Jan 29 01:57:47 pensive kernel: hdc: drive_cmd: error=0x04 { DriveStatusError }
Jan 29 01:57:47 pensive kernel: ide: failed opcode was: 0xb0
Jan 29 01:57:47 pensive kernel: Drive at position 0 found status 51, need retry.

then later

Jan 29 04:00:31 pensive RAIDiator: Reallocated sector count has increased in the last day.\n\nDisk 2:\n Previous count: 0\n Current count: 23\n\nATA error count has increased in the last day.\n\nDisk 2:\n Previous count: 0\n Current count: 41\n\nGrowing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk.

and just before the tail stops

Jan 29 05:03:38 pensive kernel: Drive at position 1 found status 51, need retry.
Jan 29 05:03:38 pensive kernel: hdc: status timeout: status=0xd0 { Busy }
Jan 29 05:03:38 pensive kernel: ide: failed opcode was: 0xe5
Jan 29 05:03:38 pensive kernel: hdc: status timeout: status=0xd0 { Busy }

I pulled the new disk and put back the 500GB seagate, and the system started to boot, but locked up about 1% into the quota check, before I could get an ssh session established.

A second attempt and I've managed to kill quotacheck - top is still refreshing (so the unit is still on the network), but tail -f /var/log/syslog hasn't returned yet, and I can't establish another ssh session.
dberger
ReadyNAS Newbie
 
Posts: 31
Joined: Thu Oct 25, 2007 3:25 pm
ReadyNAS: NV+

Re: NV+ lockup during quotacheck [Assistance Needed]

Postby dberger » Thu Jan 29, 2009 8:58 am

Some additional data. I got the ReadyNAS booted with two drives after killing quotacheck, and did a graceful reboot with volume scan. The fsck has been running for 50 minutes over a 1TB volume, and has found illegal blocks in a handful of inodes and restarted after clearing them.

It's at 89% as I type, and found that cnid2.db in /a/frequently/referenced/path referenced one of the cleared inodes (so far, only two files have been affected, this one, and a photo which I should have backed up thanks to CrashPlan).

The fsck has stopped updating at 89.8%, the drive activity light is quiet. The unit is still responding to pings, but my ssh sessions have both become non-responsive.

Now I'm starting to seriously worry about the safety of my data, and the trustworthiness of the ReadyNAS in general.

After several minutes, my ssh sessions responded again - the fsck completed (and left the process list) but the unit still displays FS check: 100.0%, and shortly thereafter my ssh sessions locked up again. Still answering pings, still no disk activity light.
dberger
ReadyNAS Newbie
 
Posts: 31
Joined: Thu Oct 25, 2007 3:25 pm
ReadyNAS: NV+

Re: NV+ lockup during quotacheck [Assistance Needed]

Postby dberger » Thu Jan 29, 2009 10:46 am

After about 30 minutes of no disk activity, I powered down the ReadyNAS.

My current theory around what's happening is that one of my 500GB Seagate drives chose his exact moment to fail. There were no SMART (or other) warnings on the drive prior to the log entry I posted above, so pretty much no warning. I had a 1/3 chance of randomly choosing that drive to replace, but I didn't - so now the array has one failed drive, and one failing drive.

Which means I'm getting comfortable with the idea that I'll need to get my offsite backup onsite soon.

In a last-ditch effort to save the data on the volume that isn't in the offsite backup set, I've removed the drive that reported SMART errors, and am currently in the process of dd_rescue'ing it to another drive (the 500GB drive I removed last night, when this shower of sparks started).

I'm not holding my breath, though - as when I attached the drive to another linux box, it's got an unknown partition table.

My plan is to see how well dd_rescue does, and where it finds errors - if I'm lucky, and the drive trouble is in the first couple hundred megs of the disk, I'll re-create the partition table by hand, and maybe copy over the system partition from the last known good drive.
dberger
ReadyNAS Newbie
 
Posts: 31
Joined: Thu Oct 25, 2007 3:25 pm
ReadyNAS: NV+

Re: NV+ lockup during quotacheck [Assistance Needed]

Postby dberger » Sat Jan 31, 2009 12:11 pm

The story doesn't have a happy ending - myrecovery managed to recover much of the failing drive, but in the end, there was too much file system corruption to get anything back. It looks like Murphy made sure that many of the failed blocks (media errors) were in the middle of ext3 filesystem metadata - after fsck had cleared several thousand inodes, there was nothing left on the drive.

I have most of the data backed up, but of course I'm realizing the handful of things I didn't.

There are two morals to the story:

1. before attempting an in place upgrade, reboot your readynas and perform a volume check - in my case, it might have forced the failing drive to reveal itself before I tried swapping out one of the others.

2. a RAID system that doesn't provide disk scrubbing is only teasing you with data security, not actually providing it.
dberger
ReadyNAS Newbie
 
Posts: 31
Joined: Thu Oct 25, 2007 3:25 pm
ReadyNAS: NV+


Return to General Issues with Boot, Installation, Upgrade, and Expansion

Similar topics


Who is online

Users browsing this forum: No registered users and 6 guests