Friday, February 8, 2013

LSI MegaRAID failed drive replacement

Installing command line tool

The command line tool for managing RAID devices should be located at /opt/MegaRAID/MegaCli/MegaCli64. If it is not, you can download the RPM from IBM. It is an architecture-independent package; I have personally used it successfully on Red Hat and openSUSE. The one I'm using for this documentation is ibm_utl_sraidmr_megacli-8.04.08_linux_32-64.zip. Unzip the archive, then run rpm -Uvh on both packages (one for the library and one for the actual command line tool).

Getting the big picture

To start with, the following command queries all adapters and returns information about the virtual drives defined, their status and all the physical drives they are made of. The command outputs a lot of information, so I usually grep for a few keywords to shorten the text. Here is the command and an extract of the output:
./MegaCli64 -LDPDInfo -aALL | egrep "Adapter|Virtual Disk|Name|RAID|State|^Number|^Span|PD:|^Device|Firmware|^$"
Adapter #0

Number of Virtual Disks: 2
Virtual Disk: 0 (target id: 0)
Name:
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
State: Optimal
Number Of Drives:9
Span Depth:1
Number of Spans: 1
Span: 0 - Number of PDs: 9
PD: 0 Information
Device Id: 15
Firmware state: Online

PD: 1 Information
Device Id: 16
Firmware state: Online

PD: 2 Information
Device Id: 17
Firmware state: Online

...

PD: 8 Information
Device Id: 23
Firmware state: Online

Virtual Disk: 1 (target id: 1)
Name:data2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
State: Degraded
Number Of Drives:24
Span Depth:1
Number of Spans: 1
Span: 0 - Number of PDs: 24
PD: 0 Information
Device Id: 42
Firmware state: Online

PD: 1 Information
Device Id: 43
Firmware state: Online

...

PD: 22 Information
Device Id: 64
Firmware state: Online

PD: 23 Information
Device Id: 65
Firmware state: Online

Adapter #1

Number of Virtual Disks: 1
Virtual Disk: 0 (target id: 0)
Name:
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
State: Optimal
Number Of Drives:11
Span Depth:1
Number of Spans: 1
Span: 0 - Number of PDs: 11
PD: 0 Information
Device Id: 8
Firmware state: Online

PD: 1 Information
Device Id: 9
Firmware state: Online

...

PD: 10 Information
Device Id: 18
Firmware state: Online
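
If one line per virtual disk is enough, the listing above can be condensed further with awk. This is only a sketch: summarize_vds is a hypothetical helper name, and it assumes the labels look like the ones in the outputs shown in this post ("Adapter #", "Virtual Disk" or "Virtual Drive", and "State").

```shell
# Sketch: condense MegaCli -LDPDInfo output (read on stdin) to one line
# per virtual disk. summarize_vds is a hypothetical helper, not part of MegaCli.
summarize_vds() {
    awk '
        # Remember the adapter number from lines such as "Adapter #0".
        /^Adapter #/ { adapter = $2; sub(/^#/, "", adapter) }
        # Remember the virtual disk number; accept both label spellings.
        /^Virtual (Disk|Drive):/ { vd = $3 }
        # "State" starts at column 0, so "Firmware state" lines are ignored.
        /^State[ ]*:/ {
            state = $0
            sub(/^State[ ]*:[ ]*/, "", state)
            print "adapter=" adapter " vd=" vd " state=" state
        }
    '
}
```

It would be used as ./MegaCli64 -LDPDInfo -aALL | summarize_vds. Fed the extract above, it prints adapter=0 vd=0 state=Optimal, adapter=0 vd=1 state=Degraded and adapter=1 vd=0 state=Optimal, which makes the degraded volume easy to spot.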

Finding the drive to replace

It's a good idea to start by gathering information about the adapter:


/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -AdpAllInfo -aALL

Adapter #0

==============================================================================
                    Versions
                ================
Product Name    : PERC H700 Integrated
Serial No       : 18P02M3
FW Package Build: 12.10.2-0004

...

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 5
  Disks           : 4
  Critical Disks  : 0
  Failed Disks    : 0
...



So we now know that the current machine has one adapter: Adapter 0. In the following commands, we will therefore specify -a0 for adapter 0. Next, we get the enclosure information:
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -EncInfo -a0

    Number of enclosures on adapter 0 -- 1

    Enclosure 0:
    Device ID                     : 32
    Number of Slots               : 6
    Number of Power Supplies      : 0
    Number of Fans                : 0
    Number of Temperature Sensors : 0
    Number of Alarms              : 0
    Number of SIM Modules         : 0
    Number of Physical Drives     : 4
    Status                        : Normal
    Position                      : 0
    Connector Name                : Unavailable
    Enclosure type                : SES
    FRU Part Number               : N/A
    Enclosure Serial Number       : N/A
    ESM Serial Number             : N/A
    Enclosure Zoning Mode         : N/A
    Partner Device Id             : 65535

    Inquiry data                  :
        Vendor Identification     : DP
        Product Identification    : BACKPLANE
        Product Revision Level    : 1.07
        Vendor Specific           : 18NJ5VP


Exit Code: 0x00
So we have one adapter (a0) and one enclosure with a device ID of 32. We now query the logical drive information:
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -LDInfo -LALL -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :server
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 557.75 GB
Mirror Data         : 557.75 GB
State               : Degraded
Strip Size          : 64 KB
Number Of Drives per span:2
Span Depth          : 2
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
This shows us the RAID level used (Primary-1 with a span depth of 2, i.e. RAID 10, a stripe of mirrors) and the status of this RAID device: Degraded.
Now we look for the defective drive with the following command. The field to watch is "Firmware state: Failed".
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDList -a0

Adapter #0

...

Enclosure Device ID: 32
Slot Number: 3
...

Firmware state: Failed

...
The output has been truncated to show only relevant information. So in our case, it's the drive in slot number 3 of enclosure ID 32 that needs to be replaced.
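
On a controller with many drives, scanning the full -PDList output by eye gets tedious. The lookup can be sketched with awk, assuming the three labels appear exactly as in the extract above and in that order for each drive. failed_drives is a hypothetical helper name, not a MegaCli feature.

```shell
# Sketch: read MegaCli -PDList output on stdin and print "enclosure:slot"
# for every drive whose firmware state starts with "Failed".
# failed_drives is a hypothetical helper name.
failed_drives() {
    awk -F': *' '
        /^Enclosure Device ID:/ { encl = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/      { if ($2 ~ /^Failed/) print encl ":" slot }
    '
}
```

It would be used as sudo ./MegaCli64 -PDList -a0 | failed_drives. For the machine in this post it should print 32:3, which is the pair that goes into -PhysDrv in the commands of the next section.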

Replacing the drive

We prepare the drive for replacement by setting it offline, marking it as missing and preparing it for removal:
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDOffline -PhysDrv\[32:3\] -a0

Adapter: 0: EnclId-32 SlotId-3 state changed to OffLine.

Exit Code: 0x00
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDMarkMissing -PhysDrv\[32:3\] -a0

EnclId-32 SlotId-3 is marked Missing.

Exit Code: 0x00
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDPrpRmv -PhysDrv\[32:3\] -a0


Prepare for removal Success

Exit Code: 0x00
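
Since the three commands above always run in the same order with the same enclosure, slot and adapter, they can be wrapped in a small function. This is a sketch under assumptions: prepare_removal, the MEGACLI variable and the DRY_RUN switch are names introduced here for illustration, and the function needs to run as root (or have sudo added) when it is not in dry-run mode.

```shell
# Path to the tool, taken from the location mentioned at the top of this post.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli64}

# Sketch: run the offline / mark missing / prepare-for-removal sequence.
# $1 = enclosure id, $2 = slot number, $3 = adapter number.
# With DRY_RUN=1 the commands are only printed, not executed.
prepare_removal() {
    for action in -PDOffline -PDMarkMissing -PDPrpRmv; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s %s -PhysDrv[%s:%s] -a%s\n' "$MEGACLI" "$action" "$1" "$2" "$3"
        else
            # Stop at the first failing step instead of carrying on blindly.
            "$MEGACLI" "$action" "-PhysDrv[$1:$2]" "-a$3" || return 1
        fi
    done
}
```

Running DRY_RUN=1 prepare_removal 32 3 0 first prints the three commands, so the enclosure and slot can be double-checked before anything is taken offline.
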
Now it's time for the physical replacement.

Then, if everything went smoothly, you should see the array being rebuilt:
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDInfo -PhysDrv\[32:3\] -a0

Enclosure Device ID: 32
Slot Number: 3
Drive's postion: DiskGroup: 0, Span: 1, Arm: 1
Enclosure position: N/A
Device Id: 3

...
Firmware state: Rebuild

...
We can query the controller to see the actual rebuild progress:
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDRbld -ShowProg -PhysDrv\[32:3\] -a0

Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 7% in 3 Minutes.

Exit Code: 0x00
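
Rather than re-running that command by hand, the percentage can be pulled out of the progress line and polled in a loop. The sketch below assumes the progress line keeps the "Completed N%" wording shown above. rebuild_percent, wait_for_rebuild and the MEGACLI variable are hypothetical names, and the loop has to run as root.

```shell
# Path to the tool, taken from the location mentioned at the top of this post.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli64}

# Sketch: print the rebuild percentage found on stdin, or nothing at all
# when the input has no "Completed N%" line.
rebuild_percent() {
    sed -n 's/.*Completed \([0-9][0-9]*\)%.*/\1/p'
}

# Sketch: poll until the controller stops reporting a progress line.
# $1 = enclosure id, $2 = slot number, $3 = adapter number,
# $4 = seconds between polls (defaults to 60).
wait_for_rebuild() {
    while pct=$("$MEGACLI" -PDRbld -ShowProg "-PhysDrv[$1:$2]" "-a$3" | rebuild_percent)
          [ -n "$pct" ]
    do
        echo "rebuild at ${pct}%"
        sleep "${4:-60}"
    done
}
```

For example, wait_for_rebuild 32 3 0 300 would report the progress every five minutes. The loop also ends if MegaCli cannot be run at all, so the state of the virtual drive should still be checked afterwards.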

Checking

Eventually, you should see something like this:
/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -PDRbld -ShowProg -PhysDrv\[32:3\] -a0

Device(Encl-32 Slot-3) is not in rebuild process

Exit Code: 0x00 

/opt/MegaRAID/MegaCli> sudo ./MegaCli64 -LDInfo -LALL -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :server
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 557.75 GB
Mirror Data         : 557.75 GB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives per span:2
Span Depth          : 2
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only



Exit Code: 0x00
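
The final check can also be scripted, for instance for a monitoring job. This is a minimal sketch, assuming the "State" line looks like the one in the -LDInfo output above; all_optimal is a hypothetical helper name.

```shell
# Sketch: read MegaCli -LDInfo output on stdin and succeed only if at least
# one "State" line was seen and every one of them reads "Optimal".
all_optimal() {
    awk -F': *' '
        /^State[ ]*:/ {
            value = $2
            sub(/[ \t\r]+$/, "", value)   # drop trailing whitespace
            seen++
            if (value != "Optimal") bad++
        }
        END { exit !(seen > 0 && bad == 0) }
    '
}
```

It would be used as sudo ./MegaCli64 -LDInfo -LALL -a0 | all_optimal && echo "all virtual drives optimal". Treating empty input as a failure is a deliberate choice in this sketch, so that a missing or broken MegaCli does not look like a healthy array.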