
Thread: FC trouble

  1. #1
    Join Date: May 2008
    Location: Hamburg, Germany
    Posts: 108

    FC trouble

    Hello all,

    I'm facing a serious problem I haven't been able to track down:

    • one DSS server, still at 5.0.DB49000000.3278, with a QLE2462 FC HBA with a single uplink in target mode, various (approx. 20) FC groups (each with one or two LVs and a single FC node assigned), plus some NAS shares, no iSCSI. Though it shouldn't matter here, we're using an external LDAP for authentication.
    • QLogic FC switch
    • one active SLES10SP2 server (current patch level), QLE2462 HBA with a single link to the FC switch, Xen 3.2, several Xen DomUs configured to use NPIV-based disks (one virtual FC adapter per DomU)


    Under medium FC load (i.e. (re-)installing a VM), I suddenly find that the mapping of (DSS) LVs to (DomU) disks changes while the DomUs are up. This situation then persists until I fully power down both the Xen server and the DSS.

    When the problem occurs, I see the following after power-cycling the Xen server:

    • the Dom0 creates the NPIV adapter with the proper FC address
    • the DSS logs show no change in FC target mappings
    • the UID of the disk (via "lsscsi" on the Dom0) is wrong (it's that of another LV of the DSS) - see the check sketch after this list
    • the VM boots and uses the wrong disk image (thus booting from the "wrong disk") with the corresponding side effects.
    • the DomU still uses the same NPIV device (same virtual FC HBA address) as configured.
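
    To make that "wrong UID" observation reproducible, here is roughly the little check I run on the Dom0 - only a sketch, and it only assumes the standard /dev/disk/by-id layout; the baseline file name and its format are my own convention, nothing DSS- or Xen-specific:

    #!/usr/bin/env python
    # Sketch: snapshot the /dev/disk/by-id -> /dev/sdX mapping on the Dom0 and
    # compare it against a previously saved baseline. BASELINE is a hypothetical
    # file of "<by-id name> <device>" lines written while the mappings were
    # known to be correct.
    import os

    BY_ID = "/dev/disk/by-id"
    BASELINE = "/root/lun-map.txt"

    def current_map():
        # map every SCSI by-id name to the block device it currently resolves to
        result = {}
        for name in sorted(os.listdir(BY_ID)):
            if name.startswith("scsi-"):
                result[name] = os.path.realpath(os.path.join(BY_ID, name))
        return result

    def load_baseline(path):
        baseline = {}
        with open(path) as fh:
            for line in fh:
                name, dev = line.split()
                baseline[name] = dev
        return baseline

    if __name__ == "__main__":
        now = current_map()
        if not os.path.exists(BASELINE):
            # first run: store the snapshot to compare against later
            with open(BASELINE, "w") as fh:
                for name, dev in sorted(now.items()):
                    fh.write("%s %s\n" % (name, dev))
            print("baseline written to %s" % BASELINE)
        else:
            for name, dev in sorted(load_baseline(BASELINE).items()):
                if now.get(name) != dev:
                    print("CHANGED: %s  was %s  now %s" % (name, dev, now.get(name)))

    The first run (while everything is healthy) writes the baseline; running it again after an incident prints every disk whose identity no longer resolves to the device it used to.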

    No matter what I do on the Dom0, the problem persists.
    As soon as I take down the DSS as well, the problem goes away... until... suddenly... it's back. It currently looks to be related to the amount of stress put on the FC link, but:

    The DSS system (dual quad-core Xeon) is mostly idle; load maxes out at 5 to 6, and usually we're at a load of 1 to 2. CPU usage is always low.

    The Xen server (dual quad-core Xeon, too) is mostly idle as well.

    From time to time I've noticed the following "extremes":

    • The LV-to-VM-disk relation gets messed up so badly that (while the VMs are running) the VMs write to entirely the wrong file systems/disks - after system recovery I have to run fscks all over the place, and sometimes I need to restore from backup because things are too badly damaged.
    • At least once, the FC connection suddenly broke away from the Dom0's point of view - syslog on the Dom0 was overflowing with SCSI write error messages, and the VMs reacted accordingly :-/.

    Now comes the hard part - isolating the root cause.

    First assumption (from the "FC breaks down" event): the FC target on the DSS restarted, messing up the link or re-assigning internal mappings.
    Pro: the SCSI link broke away once, and the allocations have been messed up a number of times now.
    Contra: I see no messages in the DSS debug logs - there *should* be messages in fc_target.log when the FC target restarts, shouldn't there? And why do I need to restart the DSS to clear the mess - shouldn't restarting the Xen server re-associate NPIV to LUNs correctly?

    Second assumption: The DSS FC target gets corrupted internally.
    Pro: sudden wrong mappings, persistent until I reboot the DSS.
    Contra: Shouldn't that have hit others as well?

    Third assumption: The FC subsystem (Qlogic driver incl. NPIV support) on the Xen server messes up.
    Pro: I've seen these problems when running everything on a single Xen server, but not when doing the "heavy workload" on a second (identical) Xen server on the same SAN/DSS.
    Contra: Why doesn't a reboot of the Xen server clear the situation?

    Could it be that a bogus driver on the Xen server messes up the HBA on the DSS so badly that I need to restart the DSS, too?
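
    For what it's worth, this is roughly how I verify on the Dom0 which (virtual) WWPNs the QLogic driver is presenting at any given moment - a minimal sketch that relies only on the standard fc_host sysfs attributes:

    #!/usr/bin/env python
    # Sketch: list every FC host (physical HBAs and NPIV vports) the Dom0 kernel
    # currently knows about, so the virtual WWPN configured for a DomU can be
    # compared with what the driver actually presents.
    import os

    FC_HOST = "/sys/class/fc_host"

    def read_attr(host, attr):
        try:
            with open(os.path.join(FC_HOST, host, attr)) as fh:
                return fh.read().strip()
        except IOError:
            return "n/a"

    for host in sorted(os.listdir(FC_HOST)):
        print("%s  WWPN=%s  WWNN=%s  state=%s" % (
            host,
            read_attr(host, "port_name"),
            read_attr(host, "node_name"),
            read_attr(host, "port_state")))

    So far the vport WWPNs have always matched the DomU configuration (see the last point in the list above), which is exactly what makes this so confusing.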

    Has anyone seen effects similar to mine?

    Does anyone at Open-E have information that the "new" DSS level includes fixes that apply to the symptoms I see?

    I'm at a loss - I simply cannot deduce or debug the cause of this. HELP!

    With regards,

    Jens

  2. #2

    Send the logs in to support; I want to take a look at them to see if there is anything in the dmesg, error, critical error, and test logs that can help.
    All the best,

    Todd Maxwell


    Follow the red "E"
    Facebook | Twitter | YouTube

  3. #3

    What LUN numbers are you using?
    I had problems with VMware when I was using the same LUN numbers; VMware did not like that.
    Check the knowledge base; there was a post about the LUN number problem.

  4. #4

    Quote Originally Posted by To-M
    Send the logs in to support; I want to take a look at them to see if there is anything in the dmesg, error, critical error, and test logs that can help.
    Logs will be coming on Saturday or Monday; I currently have no access to them.

    I have had a close look at those files and couldn't find anything obvious or related to the affected LVs (although I'm no experienced *DSS* engineer, I feel knowledgeable in the Linux area - I've seen Linux break out of its egg shell and was on the mailing list when Andrew S. T. was arguing with Linus T.).

    Thanks for looking into this.

    Jens

  5. #5

    Quote Originally Posted by symm
    What LUN numbers are you using?
    I had problems with VMware when I was using the same LUN numbers; VMware did not like that.
    Check the knowledge base; there was a post about the LUN number problem.
    As all the (first) disks are boot disks, they're on LUN 0 of their respective groups. I even receive a warning (or was that an error?) when there's no LUN 0 defined.

    I've checked the knowledge base and couldn't find the article you were referring to, but stumbled over article 162... not the one you meant, but possibly related to this case. That said, I've seen absolutely no errors in the logs, especially no kernel errors.

    I'll get back when I have more information.

    With regards
    Jens

  6. #6

    Hi jmo,

    I know you're using a Xen server, and I'm not sure if this will apply to you, but I've seen this in both VMware and Virtual Iron.
    The LUN number must be different across all LUNs, even in different targets.

    http://communities.vmware.com/message/701181#701181

  7. #7

    Yes, this is even true for iSCSI with VMware; example below:

    Target 0:

    Lun 1
    Lun 2

    Target 1:

    Lun 3
    Lun 4

    and so on.
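
    If you want to double-check that on the initiator side, something along these lines should do it - just a rough sketch, assuming the standard Linux sysfs layout where every SCSI device shows up under /sys/bus/scsi/devices as host:channel:target:lun:

    #!/usr/bin/env python
    # Rough sketch: flag LUN numbers that are reused by more than one target,
    # based on the H:C:T:L device entries the initiator kernel exposes in sysfs.
    import os
    from collections import defaultdict

    SCSI_DEVICES = "/sys/bus/scsi/devices"

    luns = defaultdict(set)  # LUN number -> set of (host, channel, target) tuples
    for entry in os.listdir(SCSI_DEVICES):
        parts = entry.split(":")
        if len(parts) != 4 or not all(p.isdigit() for p in parts):
            continue  # skip host adapters and other non-device entries
        host, channel, target, lun = (int(p) for p in parts)
        luns[lun].add((host, channel, target))

    for lun, targets in sorted(luns.items()):
        if len(targets) > 1:
            print("LUN %d is presented by %d targets: %s" % (lun, len(targets), sorted(targets)))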
    All the best,

    Todd Maxwell



  8. #8

    Quote Originally Posted by To-M
    Yes, this is even true for iSCSI with VMware; example below:

    Target 0:

    Lun 1
    Lun 2

    Target 1:

    Lun 3
    Lun 4

    and so on.
    Is this true even when using Fibre Channel?

  9. #9

    Is it true that we can't use LUN 0 in VMware under Fibre Channel? I have a customer doing exactly that right now, and so far (since yesterday) it hasn't been an issue. Is it an issue?

  10. #10

    That was for iSCSI, and it does not apply to the new DSS V6.
    All the best,

    Todd Maxwell


