Hello all,

I'm facing a serious problem I haven't been able to track down:

  • one DSS server, still at 5.0.DB49000000.3278, with a QLE2462 FC HBA (single uplink, target mode) and various (approx. 20) FC groups, each with one or two LVs and a single FC node assigned. Some other NAS shares, no iSCSI. Though it shouldn't matter here, we're using an external LDAP server for authentication.
  • QLogic FC switch
  • one active SLES10SP2 server (current patch level), a QLE2462 HBA with a single link to the FC switch, Xen 3.2, and several Xen DomUs configured to use NPIV-based disks - one virtual FC adapter per DomU (see the sketch right after this list)
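
For reference, this is roughly what that NPIV setup boils down to at the kernel level - the Xen hotplug scripts do the equivalent once per DomU. The host number and WWNs below are made up, and the SLES10SP2 NPIV backport may expose a slightly different sysfs path than the mainline fc_host interface:

  # physical HBA port (host number and WWNs are examples)
  cat /sys/class/fc_host/host1/port_name
  # create a virtual (NPIV) port: "<wwpn>:<wwnn>" as hex digits
  echo '2101001b32a4d5e6:2001001b32a4d5e6' > /sys/class/fc_host/host1/vport_create
  # the vport appears as an additional fc_host with its own SCSI host
  ls /sys/class/fc_host/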


Upon medium FC load (i.e. (re-)installing a VM) I'm facing the situation that suddenly the mapping of (DSS) LVs to (DomU) disks changes while the DomUs are up. This situation then persists until I fully power down both the Xen server and the DSS.

When the problem occurs, I see the following after power-cycling the Xen server:

  • the Dom0 creates the NPIV adapter with the proper FC address
  • the DSS logs show no change in FC target mappings
  • the UID of the disk (via "lsscsi" on the Dom0) is wrong - it's that of another LV on the DSS (see the check below this list)
  • the VM boots from the wrong disk image - i.e. from the "wrong disk" - with the corresponding side effects.
  • the DomU still uses the same NPIV device (same virtual FC HBA address) as configured.
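
For completeness, this is the check behind the "lsscsi" observation above. Host and device names are examples; newer lsscsi versions can print the identifier directly via -i, and the scsi_id syntax differs between udev versions (SLES10 still uses the old -s form, as far as I know):

  # list all disks, with their SCSI identifiers where supported
  lsscsi -i
  # query the unit serial / WWN of one disk (old udev syntax, path below /sys)
  /sbin/scsi_id -g -u -s /block/sdb
  # the persistent names udev derived from those identifiers
  ls -l /dev/disk/by-id/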

No matter what I do on the Dom0, the problem persists.
As soon as I take down the DSS as well, the problem goes away... until... suddenly... it's back. It currently looks to be related to the amount of stress put on the FC link, but:

The DSS system (dual quad-core Xeon) is mostly idle; the load maxes out at 5 to 6, and usually we're at a load of 1 to 2. CPU usage is always low.

The Xen server (also a dual quad-core Xeon) is mostly idle as well.

From time to time I've noticed the following "extremes":

  • The LV-to-VM-disk relation gets messed up so badly that the running VMs write to entirely wrong file systems/disks - after recovery I have to run fscks all over the place, and sometimes I need to restore from backup because things are corrupted too badly.
  • At least once, all of a sudden the FC connection broke away from the Dom0's point of view - the syslog on the Dom0 was overflowing with SCSI write error messages, and the VMs reacted accordingly :-/ (see the watch loop right after this list).
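
A crude watch loop like the following should at least give me a timestamp the next time it happens (paths follow the mainline fc_host sysfs layout; the log file name is made up):

  #!/bin/bash
  # once a minute: log the port state of every FC (and NPIV) host,
  # plus a running count of SCSI error lines in the syslog
  while sleep 60; do
      for h in /sys/class/fc_host/host*; do
          echo "$(date '+%F %T') ${h##*/}: $(cat "$h/port_state")"
      done
      echo "$(date '+%F %T') SCSI errors so far: $(grep -ci 'scsi.*error' /var/log/messages)"
  done >> /var/log/fc-watch.log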

Now comes the hard part - isolating the root cause.

First assumption (from the "FC breaks down" event): the FC target on the DSS restarted, messing up the link or re-assigning internal mappings.
Pro: the SCSI link broke away once, and the allocations have been messed up a number of times now.
Contra: I see no messages in the DSS debug logs - there *should* be messages in fc_target.log when the FC target restarts, shouldn't there? And why do I need to restart the DSS to clear the mess - shouldn't restarting the Xen server re-associate the NPIV-to-LUN mappings correctly?
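
For what it's worth, this is how I've been searching the log bundle downloaded from the DSS web GUI (I'm assuming the bundle contains plain-text logs; file names may vary):

  # look for any sign of a target restart or link event
  grep -iE 'restart|reset|link.*(up|down)' fc_target.log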

Second assumption: The DSS FC target gets corrupted internally.
Pro: sudden wrong mappings, persistent until I reboot the DSS.
Contra: shouldn't that have hit others as well?

Third assumption: The FC subsystem (Qlogic driver incl. NPIV support) on the Xen server messes up.
Pro: I've seen these problems when running everything on a single Xen server, but not when doing the "heavy workload" on a second (identical) Xen server on the same SAN/DSS.
Contra: Why doesn't a reboot of the Xen server clear the situation? Even a plain driver reload (sketched below) doesn't help.
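
For reference, the driver reload that fails to clear the state (DomUs have to be shut down first so their NPIV vports get torn down; qla2xxx is the standard QLogic FC driver module - whether the SLES NPIV backport keeps additional state elsewhere is an open question to me):

  xm shutdown -a -w      # stop all DomUs and wait, releasing their vports
  modprobe -r qla2xxx    # unload the QLogic FC driver
  modprobe qla2xxx       # reload it - the ports log back in to the fabric
  lsscsi -i              # check whether the LUN-to-ID mapping is sane again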

Could it be that a bogus driver on the Xen server messes up the HBA on the DSS so badly that I need to restart the DSS, too?

Has anyone seen effects similar to mine?

Does anyone at Open-E know whether the "new" DSS release includes fixes that apply to the symptoms I'm seeing?

I'm at a loss - I simply can neither deduce nor debug the cause of this. HELP!

With regards,

Jens