Last weekend, an Exadata storage server flashdisk entered the predictive failure state.
The flashdisk is used by the flashcache and has a griddisk which is a member of a normal
redundancy diskgroup.
Identify the four steps you must perform to replace this flashdisk.
Identify the griddisk on the predictive failure flashdisk and drop it from the associated ASM
diskgroup.
Verify that the griddisk located on the predictive failure flashdisk has been successfully dropped
from the associated ASM diskgroup.
Drop the flashcache on the cell and re-create it using all but the predictive failure flashdisk.
Safely power off the cell containing the predictive failure flashdisk.
Replace the predictive failure flashdisk.
Power up the cell containing the replaced flashdisk and activate all griddisks.
Drop the flashcache on the cell and re-create it using all flashdisks.
Create a new griddisk on the replaced flashdisk.
Add the griddisk back into the ASM diskgroup to which it belonged.
*Exadata monitors the number of media and other disk/flash failures (e.g. an I/O write failure
due to physical media damage). If too many of those occur, Exadata ‘predicts’ that the device will
soon fail and takes it out of the system.
*Exadata Server, which runs on the storage cells, monitors disk health and performance. If a disk’s
performance degrades, the server can put that disk into proactive failure mode. It also monitors for
predictive failures based on the disk’s SMART (Self-Monitoring, Analysis and Reporting Technology)
data. In both cases, the Exadata Server notifies XDMG to take those disks offline.
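The monitoring logic described above can be pictured with a small Python sketch. This is only an illustration of the decision, not Exadata's actual implementation: the names DiskHealth, classify and should_notify_xdmg, and both threshold values, are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from enum import Enum


class DiskStatus(Enum):
    NORMAL = "normal"
    PROACTIVE_FAILURE = "proactive failure"
    PREDICTIVE_FAILURE = "predictive failure"


@dataclass
class DiskHealth:
    media_errors: int      # media and other I/O failures counted so far
    smart_alert: bool      # True if SMART data flags an impending failure
    avg_latency_ms: float  # recent average I/O latency


# Hypothetical thresholds, for illustration only; the real limits are
# internal to the storage server software and are not given in the text.
MAX_MEDIA_ERRORS = 10
MAX_LATENCY_MS = 50.0


def classify(health: DiskHealth) -> DiskStatus:
    """Map health readings to one of the states described in the text."""
    # Too many failures, or a SMART warning: the disk is predicted to fail.
    if health.smart_alert or health.media_errors > MAX_MEDIA_ERRORS:
        return DiskStatus.PREDICTIVE_FAILURE
    # Degraded performance alone leads to proactive failure mode.
    if health.avg_latency_ms > MAX_LATENCY_MS:
        return DiskStatus.PROACTIVE_FAILURE
    return DiskStatus.NORMAL


def should_notify_xdmg(status: DiskStatus) -> bool:
    """In both failure modes XDMG is asked to take the disks offline."""
    return status is not DiskStatus.NORMAL
```

For example, should_notify_xdmg(classify(DiskHealth(0, True, 1.0))) evaluates to True, because a SMART alert alone is enough to mark the disk as a predictive failure in this sketch.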
When a faulty disk is replaced on the storage cell, the Exadata Server will re-create all grid disks
on the new disk. It will then notify XDMG to bring those grid disks online, or to add them back to
their disk groups in case they were already dropped.
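The replacement flow in the previous paragraph can be sketched the same way. This is a toy model under stated assumptions, not the real cell software: GridDisk, PhysicalDisk, replace_disk, the state strings and the action labels "online" and "add" are all hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GridDisk:
    name: str
    state: str  # "online", "offline" or "dropped" (labels are assumptions)


@dataclass
class PhysicalDisk:
    name: str
    grid_disks: List[GridDisk] = field(default_factory=list)


def replace_disk(old: PhysicalDisk,
                 new_name: str) -> Tuple[PhysicalDisk, List[Tuple[str, str]]]:
    """Re-create the old disk's grid disks on a new disk.

    Returns the new disk plus the list of (action, grid disk name) requests
    that would be sent to XDMG: "online" for grid disks that were only
    offline, "add" for grid disks that had already been dropped.
    """
    new_disk = PhysicalDisk(new_name)
    requests: List[Tuple[str, str]] = []
    for gd in old.grid_disks:
        # The grid disk is re-created under the same name on the new disk.
        new_disk.grid_disks.append(GridDisk(gd.name, "online"))
        action = "add" if gd.state == "dropped" else "online"
        requests.append((action, gd.name))
    return new_disk, requests
```

The point of the sketch is the branch on the old grid disk's state: a grid disk that was merely taken offline is brought back online, while one that was already dropped from its disk group has to be added back.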
*ASM is a critical component of the Exadata software stack. It also behaves a bit differently than
in non-Exadata environments. It still manages your disk groups, but builds those with grid disks. It
still takes care of disk errors, but also handles predictive disk failures. It doesn’t like external
redundancy and ACFS, but it makes the disk group smart scan capable.