Uncovering SAN: 2010

Storage virtualization is the new wave in the IT industry. Most customers are now migrating to a virtualized platform so that they can prepare themselves for the next big thing: the cloud!

When it comes to moving data from physical storage to a virtualized platform, the pain point is how to migrate with little or no downtime.

1) Migration with less downtime: Complete the cabling and zoning, reflecting the new target pwwns, before the downtime begins. When the maintenance window starts, shut down the application using the volume being migrated (this clears any SCSI reservation). Present the LUN to the virtualization appliance, create a virtual volume from it, and present that volume to the host. For virtual volume creation, refer to the applicable array config guide.
Next, log in to the host and scan for the new volumes; the device paths for the volumes may change depending on the OS. Make the necessary changes on the host so that the application can use the newly presented virtualized LUNs.
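The host-side scan step differs per OS. As one illustration only, here is a minimal sketch for a Linux host, where writing "- - -" to each HBA's scan file under /sys/class/scsi_host asks the kernel to rescan all channels, targets and LUNs. The function names and the dry_run flag are made up for this example, and other operating systems use entirely different tools.

```python
from pathlib import Path


def scsi_scan_files(sysfs_root="/sys/class/scsi_host"):
    """Return the per-HBA 'scan' control files found under sysfs_root."""
    root = Path(sysfs_root)
    return sorted(p / "scan" for p in root.glob("host*") if (p / "scan").exists())


def rescan_all(sysfs_root="/sys/class/scsi_host", dry_run=True):
    """Write the wildcard '- - -' (channel, target, LUN) to every scan file.

    With dry_run=True nothing is written; the paths are only reported.
    """
    touched = []
    for scan_file in scsi_scan_files(sysfs_root):
        if not dry_run:
            scan_file.write_text("- - -\n")  # needs root on a real host
        touched.append(str(scan_file))
    return touched


if __name__ == "__main__":
    for path in rescan_all(dry_run=True):
        print("would rescan via", path)
```

Run it with dry_run=True first to see which HBAs would be rescanned, and switch to dry_run=False only inside the maintenance window.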

2) Migration with no downtime: Migration can be done with no downtime using VMware Storage vMotion. Present the new virtualized LUN to the ESX host, then migrate from the physical datastore to the virtualized datastore using Storage vMotion.

If a customer complains about a performance issue and the analysis falls to the SAN side, I would suggest first getting a complete topology map of the customer's environment. Identify the hosts that are having issues and find the switch ports where they are logged in. To do this, get the pwwn of each HBA port using a utility such as HBAnyware (for Emulex) or SANsurfer (for QLogic), then locate that pwwn on the switch with switchshow (Brocade) or show flogi database (Cisco). Next, check the zoning for these initiators and identify the target pwwns they are zoned with. Determine whether the initiator and target are logged in to the same switch or to different switches in the fabric. Once the initiator switch ports, target switch ports and ISLs are identified, check them for port errors.
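The lookup described above (find the port a pwwn is logged in to, then see whether initiator and target share a switch) can be sketched in a few lines. The switchshow text below is an invented sample that only imitates the general layout; real output varies by Fabric OS version, and all function names and WWNs here are hypothetical.

```python
import re

WWN_RE = re.compile(r"(?:[0-9a-f]{2}:){7}[0-9a-f]{2}", re.IGNORECASE)

# Invented switchshow-style samples, one per switch in the fabric.
FABRIC = {
    "sw1": """\
switchName: sw1
switchWwn:  10:00:00:05:1e:aa:bb:01
Index Port Address Media Speed State   Proto
============================================
  0    0   010000   id    N4   Online   FC  F-Port  10:00:00:00:c9:aa:bb:01
  1    1   010100   id    N8   Online   FC  E-Port  10:00:00:05:1e:aa:bb:02 "sw2" (downstream)
""",
    "sw2": """\
switchName: sw2
switchWwn:  10:00:00:05:1e:aa:bb:02
Index Port Address Media Speed State   Proto
============================================
  4    4   020400   id    N4   Online   FC  F-Port  50:06:01:60:aa:bb:cc:01
  5    5   020500   id    N8   Online   FC  E-Port  10:00:00:05:1e:aa:bb:01 "sw1" (upstream)
""",
}


def logins_by_wwn(switchshow_text):
    """Map each WWN on a port line to (port index, port type)."""
    logins = {}
    for line in switchshow_text.splitlines():
        fields = line.split()
        match = WWN_RE.search(line)
        # Port lines start with a numeric index; header lines do not.
        if not match or not fields or not fields[0].isdigit():
            continue
        port_type = next((f for f in fields if f.endswith("-Port")), "unknown")
        logins[match.group(0).lower()] = (int(fields[0]), port_type)
    return logins


def locate(pwwn, fabric):
    """Return (switch, port index) where pwwn is logged in as an F-Port, else None."""
    for switch, text in fabric.items():
        entry = logins_by_wwn(text).get(pwwn.lower())
        if entry and entry[1] == "F-Port":
            return switch, entry[0]
    return None


def same_switch(initiator, target, fabric):
    """True only if both pwwns were found and sit on the same switch."""
    a, b = locate(initiator, fabric), locate(target, fabric)
    return a is not None and b is not None and a[0] == b[0]


if __name__ == "__main__":
    print(locate("10:00:00:00:c9:aa:bb:01", FABRIC))
    print(same_switch("10:00:00:00:c9:aa:bb:01", "50:06:01:60:aa:bb:cc:01", FABRIC))
```

When same_switch is False, the traffic crosses at least one ISL, so the E-Port lines on both switches belong on the list of ports to check for errors.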

For Brocade the command is porterrshow; for Cisco, check the statistics for individual switch ports using show interface <interface #>. The counters are explained below, along with the action to take if a counter is increasing.

Frames tx/rx – Counters representing the number of frames transmitted (tx) and received (rx). This is the place to gauge the traffic on the port.

enc_in - 8b/10b encoding errors inside frames. Words inside frames are encoded; if this encoding is corrupted or an error is detected, enc_in is incremented. If this counter is increasing, the SFP/cable needs to be checked or replaced.
crc_err - The sending port computes a cyclic redundancy check (CRC) value over the frame and appends it to the frame. The receiving port runs the same calculation and compares the result; a mismatch is counted as crc_err. Also see bad_eof below. This is generally a sign of an external hardware problem. Suggested actions would be to replace the cable or SFP, move the cable to another port, or run porttest.
bad_eof - Frames received with a bad end-of-frame (EOF) delimiter. After a loss-of-synchronization error, continuous-mode alignment allows the receiver to reestablish word alignment at any point in the incoming bit stream while the receiver is operational. Such realignment is likely (but not guaranteed) to result in code violations and subsequent loss of synchronization. Under certain conditions, it may be possible to realign an incoming bit stream without loss of synchronization. If such a realignment occurs within a received frame, detection of the resulting error condition depends on a higher-level function (e.g., invalid CRC, missing EOF delimiter).
enc_out - 8b/10b encoding errors in words (ordered sets) outside the Fibre Channel frame. Words outside frames are encoded; if this encoding is corrupted or an error is detected, enc_out is incremented. This is a sign of a hardware problem. Suggested actions would be to replace the cable or SFP, move the cable to another port, or run porttest.
Disc c3 – Discard class 3 errors could be generated by a switch when devices send frames without performing a FLOGI first or send frames to an invalid destination. This error is just reporting that such a discard occurred.
Link fail – If a port remains in the LR Receive State for a period of time greater than a timeout period (R_T_TOV), a Link Reset Protocol Timeout shall be detected which results in a Link Failure condition (enter the NOS Transmit State). The link failure also indicates that loss of signal or loss of sync lasting longer than the R_T_TOV value was detected while not in the Offline state.
Loss sync – Synchronization failures on either bit or transmission-word boundaries are not separately identifiable; both are counted as loss-of-synchronization errors.
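Since what matters is whether a counter is increasing, it helps to compare two samples taken some minutes apart. The sketch below is one way to do that, assuming the counter values have already been pulled out of porterrshow into dictionaries. The names are made up for illustration, and the k/m/g suffix handling assumes the switch abbreviates large counts that way.

```python
# Actions paraphrased from the counter descriptions above.
ACTIONS = {
    "enc_in": "check or replace the SFP/cable",
    "crc_err": "replace the cable or SFP, move the cable to another port, or run porttest",
    "enc_out": "replace the cable or SFP, move the cable to another port, or run porttest",
}

# Assumed abbreviations for large counts (e.g. '1.5k').
SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def to_count(value):
    """Convert a counter such as '12' or '1.5k' to an approximate integer."""
    value = value.strip().lower()
    if value and value[-1] in SUFFIX:
        return int(round(float(value[:-1]) * SUFFIX[value[-1]]))
    return int(value)


def increasing(before, after):
    """Return {counter: delta} for counters that grew between two samples."""
    deltas = {}
    for name, new in after.items():
        delta = to_count(new) - to_count(before.get(name, "0"))
        if delta > 0:
            deltas[name] = delta
    return deltas


def advise(before, after):
    """One line per growing counter, with the suggested action if one is known."""
    grown = increasing(before, after)
    return [
        f"{name} +{delta}: {ACTIONS.get(name, 'investigate further')}"
        for name, delta in sorted(grown.items())
    ]


if __name__ == "__main__":
    first = {"crc_err": "0", "enc_out": "1.5k", "disc_c3": "4"}
    second = {"crc_err": "12", "enc_out": "1.5k", "disc_c3": "9"}
    for line in advise(first, second):
        print(line)
```

Abbreviated values lose precision, so a counter that stays at the same abbreviated value may still have moved slightly; clearing the counters before sampling avoids that ambiguity.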
=========
Sample output of porterrshow and show interface <interface #> is pasted below.

sw1:root> porterrshow
          frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy
       tx     rx      in    err    g_eof shrt   long   eof     out   c3    fail    sync   sig
     =========================================================================================================

sw1MDS9509# show interface fc4/1
fc4/1 is up
    Hardware is Fibre Channel, SFP is short wave laser w/o OFC (SN)
    Port WWN is 20:c1:00:0c:85:72:86:00
    Admin port mode is FX
    snmp link state traps are enabled
    Port mode is F, FCID is 0x1d0000
    Port vsan is 4
    Speed is 2 Gbps
    Transmit B2B Credit is 7
    Receive B2B Credit is 16
    Receive data field Size is 2112
    Beacon is turned off
    5 minutes input rate 256 bits/sec, 32 bytes/sec, 1 frames/sec
    5 minutes output rate 256 bits/sec, 32 bytes/sec, 1 frames/sec
      1956158 frames input, 62600416 bytes
        0 discards, 0 errors
        0 CRC, 0 unknown class
        0 too long, 0 too short
      1956158 frames output, 62600632 bytes
        0 discards, 0 errors
      10 input OLS, 3 LRR, 0 NOS, 0 loop inits
      10 output OLS, 6 LRR, 7 NOS, 6 loop inits
      16 receive B2B credit remaining
      7 transmit B2B credit remaining
      7 low priority transmit B2B credit remaining
    Interface last changed at Thu Aug 25 01:01:51 2011
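For the Cisco side, the fields of interest can be pulled out of saved show interface output with a few regular expressions. This is a rough sketch matched against the sample above only; the field names and the review function are invented for this example, and other NX-OS/SAN-OS releases may word these lines differently.

```python
import re

SAMPLE = """\
fc4/1 is up
    Transmit B2B Credit is 7
    Receive B2B Credit is 16
      1956158 frames input, 62600416 bytes
        0 discards, 0 errors
        0 CRC, 0 unknown class
      1956158 frames output, 62600632 bytes
        0 discards, 0 errors
      16 receive B2B credit remaining
      7 transmit B2B credit remaining
      7 low priority transmit B2B credit remaining
"""

PATTERNS = {
    "crc": r"(\d+) CRC",
    "tx_b2b_credit": r"Transmit B2B Credit is (\d+)",
    "rx_b2b_credit": r"Receive B2B Credit is (\d+)",
    "tx_b2b_remaining": r"^\s*(\d+) transmit B2B credit remaining",
    "rx_b2b_remaining": r"^\s*(\d+) receive B2B credit remaining",
}


def parse_interface(text):
    """Extract error counters and B2B credit figures as integers."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            stats[name] = int(match.group(1))
    # Assumes the input section is listed before the output section.
    pairs = re.findall(r"(\d+) discards, (\d+) errors", text)
    for direction, (discards, errors) in zip(("input", "output"), pairs):
        stats[f"{direction}_discards"] = int(discards)
        stats[f"{direction}_errors"] = int(errors)
    return stats


def review(stats):
    """Return notes for values that deserve a closer look."""
    notes = []
    if stats.get("crc", 0) > 0:
        notes.append("CRC errors present: check the cable/SFP")
    if stats.get("tx_b2b_remaining") == 0:
        notes.append("no transmit B2B credits remaining: look into credit starvation")
    return notes


if __name__ == "__main__":
    parsed = parse_interface(SAMPLE)
    print(parsed)
    print(review(parsed) or "nothing flagged")
```

As with the Brocade counters, a single reading says little; parsing the output at intervals and comparing the results shows which values are actually moving.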

Additional command output that needs to be checked for port errors.

Brocade - porterrshow, portshow <port#>
Cisco - show port internal all interface fc4/1 (displays all internal counters for the specified interface)

The troubleshooting approach above is meant to identify physical-layer issues. Refer to the links below for advanced fabric troubleshooting.

Congestion http://ranjith-san.blogspot.com/2011/09/identifying-and-troubleshooting.html

Credit Starvation

Slow Draining devices

Marginal links

Friday 22 October 2010

Encapsulating to virtualized storage platform with no downtime

Wednesday 25 August 2010

Troubleshooting SAN Performance issues