Other, less common, failure modes include returning bad data successfully, which only ZFS can detect; returning inaccurate sense data, precluding correct telemetry generation; and, most infuriatingly of all, working correctly but with excessively high latency. This is perhaps the most common, and also among the worst, disk drive failure modes: the endless timeout. This failure mode seems to be caused by firmware or controller issues; the drive simply never responds to any request, or to certain requests. Sometimes, whether because of a firmware fault or a hardware defect, a disk drive will simply “go away”. When they all work properly, the determination that a disk has failed is very easy to make: fmd and ZFS agree that the device is broken, all documented tools report that fact, and the device is automatically taken out of use and its fault LED turned on. This mechanism is responsible for (don’t laugh) turning on the LED for the faulty disk and turning it back off again when the disk is replaced.

Our HBA will transport the error status (normally CHECK CONDITION for SCSI devices) back to sd(7D), where we will generate a REQUEST SENSE command to obtain further details from the disk drive. Most enterprise drives will retry for a few seconds before giving up; some consumer-grade devices will keep trying more or less forever. This won’t happen if the request that triggered it is eligible to be retried, and by default, illumos’s SCSI stack will retry most commands many, many times before giving up (we’ve greatly reduced this behaviour in SmartOS). The FAILFAST option used by ZFS will abort commands immediately if it can be determined that the underlying device has been physically removed from the system or is otherwise known to be unreachable. One option is to ignore the surprise hotplug requirement in favour of something like cfgadm(1M). Another option is to declare that a given hardware configuration requires the presence of disk drives in certain bays, and treat the absence of one as a fault.
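To make the interaction between retries and FAILFAST concrete, here is a minimal sketch of the decision as described above. It is a toy model, not the illumos sd(7D) code: the names `Command`, `Action`, `next_action` and `attempts_until_failure` are hypothetical, and the retry count is an arbitrary illustrative number rather than any real default.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    COMPLETE = "complete"  # command succeeded, nothing more to do
    RETRY = "retry"        # reissue the command to the device
    FAIL = "fail"          # give up and propagate the error upward


@dataclass
class Command:
    # All fields are hypothetical stand-ins for per-command state.
    retries_left: int        # illustrative budget, not a real default
    retryable: bool = True   # is this request eligible to be retried?
    failfast: bool = False   # was the request issued with FAILFAST?


def next_action(cmd: Command, check_condition: bool,
                device_unreachable: bool) -> Action:
    """Decide what to do after one attempt at a command."""
    if not check_condition and not device_unreachable:
        return Action.COMPLETE
    # FAILFAST: abort immediately once the device is known to be gone.
    if cmd.failfast and device_unreachable:
        return Action.FAIL
    # Otherwise keep retrying while the command is eligible and has budget.
    if cmd.retryable and cmd.retries_left > 0:
        cmd.retries_left -= 1
        return Action.RETRY
    return Action.FAIL


def attempts_until_failure(cmd: Command, device_unreachable: bool) -> int:
    """Count attempts made against a device that always returns an error."""
    attempts = 1
    while next_action(cmd, True, device_unreachable) is Action.RETRY:
        attempts += 1
    return attempts
```

In this model, a retryable command with a budget of five retries is attempted six times against a persistently failing device, while the same command issued with FAILFAST against an unreachable device is attempted once. The larger the retry budget, the longer the error takes to reach anything that could diagnose it.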

One additional discourteous failure mode highlights the fundamental challenge of diagnosis especially well. Unfortunately, the courteous failure mode I’ve just detailed is exceedingly rare. These failure modes are not academic; I’ve seen at least three of them in the field. It’s easy to see that satisfying this requirement and diagnosing the vanishing disk drive failure are mutually exclusive. As part of retiring the broken device, ZFS will also select a spare and begin resilvering onto it from the other devices in the same vdev (whether by mirroring or reconstructing from parity). Notice that many of these root causes actually have nothing to do with the disk drive and will recur (often intermittently) on the same phy or bay, or in some cases on arbitrary phys or bays, even after the “faulty” disk drive is replaced. But few if any enclosures have this feature, and a quick scan of our list of possible root causes shows that it wouldn’t be terribly effective anyway: even distinguishing the removal case doesn’t tell us whether the disk drive, the enclosure, the HBA, the backplane, or one of several firmware and software components is the true source of the problem.
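The option of declaring that certain bays must be populated, and why it only gets us part of the way, can be sketched in a few lines. This is purely illustrative: `missing_bay_faults` and `POSSIBLE_ROOT_CAUSES` are hypothetical names, and the bay numbers are made up. The point is that the check can say which bay is empty but cannot narrow the list of suspects.

```python
# Candidate root causes named in the text; an absent disk alone cannot
# distinguish among them.
POSSIBLE_ROOT_CAUSES = (
    "disk drive",
    "enclosure",
    "HBA",
    "backplane",
    "firmware or software component",
)


def missing_bay_faults(required_bays, occupied_bays):
    """Report a fault for each required bay with no disk present.

    Both arguments are iterables of hypothetical bay identifiers.
    Every fault carries the full suspect list, because absence by
    itself says nothing about which component is to blame.
    """
    missing = sorted(set(required_bays) - set(occupied_bays))
    return [{"bay": bay, "suspects": POSSIBLE_ROOT_CAUSES} for bay in missing]


# Example: a configuration that requires bays 0-3, with bay 2 empty.
faults = missing_bay_faults({0, 1, 2, 3}, {0, 1, 3})
```

Here `faults` contains a single entry for bay 2, and its suspect list is the same as it would be for any other bay, so replacing the disk remains a guess.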


