I got home to find my 3 brand new Samsung 1TB drives had been delivered. I pulled the first disk from my 3x500GB X-RAID NV+, mounted a new drive, inserted it, and went down to make dinner.
About an hour later I came up to discover the NV+ was unresponsive - no response to network pings, FrontView, or ssh. One drive LED was flashing and two were solid, but there was no activity.
With little other choice, I pulled the plug and did a hard restart. The unit started checking quotas, got to about 25%, and locked up - again vanishing off the net.
Another hard boot - with a bit of panic setting in.
Again, got to about 25% on the quota check, and suddenly stopped responding to pings.
Now I'm getting seriously nervous.
I hard-boot it again, and this time as soon as it starts responding to pings, I start attempting to ssh in.
It gets to about 10% on the quota check before I get a shell. top reveals that quotacheck is running, and mount reveals that my data volume is mounted. I start looking around to see if all the data looks intact, when the system again drops off the net - the LCD panel reads 24.7%.
One more hard boot, ssh in again, and this time kill quotacheck.
Boot proceeds "normally" - and the raid volume starts re-syncing. I noticed that the volume was actually mounted read-only (despite the mount entry claiming read-write), but mount -o remount,rw works, and the volume is writable.
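For anyone who hits the same mismatch between the mount listing and the real state: on older Linux systems the mount command typically reports what is in /etc/mtab, which can be stale, while /proc/mounts shows what the kernel actually has mounted. The following is only a rough sketch of how that could be checked from the ssh shell. The helper name mount_mode is made up for illustration, /c is an assumed mount point for the data volume (substitute whatever mount shows on your unit), and it assumes awk is available on the box.

```shell
# mount_mode: hypothetical helper, not part of RAIDiator.
# Prints "ro" or "rw" for a mount point, using the kernel's view in
# /proc/mounts by default. A different file can be passed as $2.
mount_mode() {
    mountpoint="$1"
    mounts_file="${2:-/proc/mounts}"
    # Fields per line: device mountpoint fstype options dump pass
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            mode = "rw"
            for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
        }
        # If several entries match, the last one wins.
        END { if (mode != "") print mode; else exit 1 }
    ' "$mounts_file"
}
```

Calling mount_mode /c would print ro or rw, and return a non-zero status if nothing is mounted there. If it prints ro while mount claims rw, the remount described above (mount -o remount,rw followed by the mount point) is the step that made the volume writable here.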
As I type, the re-sync is 5% done, the load average on the box is around 4, and I'm crossing everything I can cross that the resync finishes successfully.
If anyone from Netgear wants additional data for further diagnostics, I'm more than willing to provide it.
The unit is running RAIDiator 4.1.4 - a couple of potentially relevant log entries:
* this is me - removing a disk *
Wed Jan 28 21:02:46 PST 2009 Disk remove event occurred on SATA channel 1.
Wed Jan 28 21:02:46 PST 2009 Disk fail event occurred on SATA channel 1. If this disk is used in a redundant volume (RAID level 1, 5, or X-RAID), that volume is unprotected, and an additional disk failure may render that volume dead. You should replace the failed disk as soon as possible. Note that some disks may inadvertantly report failure. If you feel this is the case, rebooting the NAS device will automatically resync the disk to the RAID volume. If you get further failure messages, you should replace the disk immediately. If this disk is used in a RAID 0 volume, your volume is now dead as RAID 0 does not provide disk failure protection.
Wed Jan 28 18:47:31 PST 2009 Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.
Wed Jan 28 18:50:38 PST 2009 A SATA reset has been performed on one or more of your disks that may have affected the RAID parity integrity. It is recommended that you perform a RAID volume resync from the RAID Settings tab ( accessible in the Volumes page => Volume tab in FrontView ). The resync process will run in the background, and you can continue to use the ReadyNAS in the meantime.
Wed Jan 28 19:33:25 PST 2009 Access to the disk on channel (??) is producing I/O errors. Although the array is still redundant, please replace this drive as soon as possible, as it is likely to fail soon.
Edit - the unit once again vanished from the net about 10% into the resync. Nothing in /var/log/syslog between
Jan 28 21:04:38 pensive RAIDiator: Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.\n\n[Wed Jan 28 21:04:28 PST 2009]
And
Jan 28 21:38:14 pensive syslogd 1.4.1#10: restart.
Jan 28 21:38:14 pensive kernel: klogd 1.4.1#10, log source = /proc/kmsg started.
Jan 28 21:38:14 pensive kernel: Linux version 2.6.17.8ReadyNAS (root@calzone) (gcc version 3.3.5 (Infrant 3.3.5-1)) #1 Fri Sep 19 15:04:06 PDT 2008
After getting 20% through the sync, it locked up again - the last lines in /var/log/syslog were:
Jan 28 21:44:37 pensive RAIDiator: Disk initialization started. The estimated time of completion is 8 hour(s) and 49 minute(s), at which time you will be notified via email. You can also check the progress in Frontview in the Volumes -> RAID Settings tab. Please do not shutdown the system while the initialization is in progress.\n\n[Wed Jan 28 21:44:26 PST 2009]
Jan 28 22:37:26 pensive named[377]: listening on IPv4 interface lo, 127.0.0.1#53
Jan 28 22:39:01 pensive cnid_dbd[27277]: [main.c:207]: I:CNID: Setting uid/gid to 0/0
Jan 28 22:39:01 pensive cnid_dbd[27277]: [main.c:305]: I:CNID: Startup, DB dir /backup/.AppleDB
Another hard boot/ssh/kill quotacheck. This time I'm leaving the volume mounted read-only, and killing off all unnecessary services (afpd, samba, mt-daapd, fuppesd).
If it has died again when I get up in the morning, I'll try going back to the Seagate 500GB drive. I don't believe the new drives are defective, as I've tried two of them, but perhaps there's some interop issue between 4.1.4 and the Samsung drives.
