
Thread: How to return a failed node to the cluster

  1. #1

    Default How to return a failed node to the cluster

    I have a two-node Active-Active cluster in production.

    Open-E v7.0up10.9101.7637

    First node named "dss1"
    Second node named "dss2"

    Network interfaces on both nodes:
    - management GUI = eth0
    - Storage Client Access = bond0
    - Storage Client Access = bond1
    - Volume Replication = bond2

    Logical Volumes:
    dss1: lv0000, lv0001
    dss2: lv0002, lv0003, lv0004

    iSCSI targets:
    dss1: dss1.target0
    dss2: dss2.target0

    Replication source settings:
    lv0000 dss1=source dss2=destination
    lv0001 dss1=source dss2=destination
    lv0002 dss1=destination dss2=source
    lv0003 dss1=destination dss2=source
    lv0004 dss1=destination dss2=source

    Replication tasks:
    dss1: "VM-Data1" for lv0000, Status = Running
    "VM-File-Data" for lv0001, Status = Running
    "VM-Arh-Data_reverse" was automatically created by the system for reverse lv0002 replication from dss2, Status = Stopped
    "VM-Sql-Data_reverse" was automatically created by the system for reverse lv0003 replication from dss2, Status = Stopped
    "VM-Data2_reverse" was automatically created by the system for reverse lv0004 replication from dss2, Status = Stopped

    dss2: "VM-Data1_reverse" was automatically created by the system for reverse lv0000 replication from dss1, Status = Stopped
    "VM-File-Data_reverse" was automatically created by the system for reverse lv0001 replication from dss1, Status = Stopped
    "VM-Arh-Data" for lv0002, Status = Running
    "VM-Sql-Data" for lv0003, Status = Running
    "VM-Data2" for lv0004, Status = Running

    !!!!!

    A few days ago the "dss1" node failed. Its resources came under the control of the "dss2" node. Now "dss2" hosts:
    - Virtual IPs: All
    - iSCSI targets: dss1.target0 (lv0000, lv0001)
    dss2.target0 (lv0002, lv0003, lv0004)
    - Replication tasks: "VM-Data1_reverse", Status = Running
    "VM-File-Data_reverse", Status = Running
    "VM-Arh-Data", Status = Running
    "VM-Sql-Data", Status = Running
    "VM-Data2", Status = Running

    On the failed node I had to replace the motherboard and the disk controller.
    After that, I re-installed Open-E, repeating the original configuration:
    - Open-E version, build and licences
    - Network connections in the same ports of the same switches
    - Network settings
    - Host name
    - Volume group
    - Logical volumes (the same name, type and size). The only difference from the initial config is that I set up lv0000 and lv0001 on "dss1" as "destination" for the replication tasks, because I'm afraid replication could run in the wrong direction.

    Now I want to return "dss1" to the cluster.
    I can ping all IP addresses from one node to the other.
    Host binding from "dss2" to "dss1" is reachable.

    BUT

    1) I cannot set up host binding from "dss1" to "dss2". The error is "Too many bound hosts on remote host."
    2) None of the replication tasks is working on node "dss2". When I tried to manually restart the replication task "VM-Arh-Data", it failed with the error "Status: Error 32: Cannot find mirror server logical volumes".
    3) Every 10 minutes the "dss2" node writes an error to the log: "Connection to host 'dss1-host' lost. Please check communication route between local computer and host 'dss1-host'", although all pings in all directions for all IP addresses are successful (see the connectivity sketch after this list).
    4) I have some recommendations from the support team:
    1. Recreate the RAID. = done
    2. Re-install the Open-E DSS (activate the license, apply small updates if needed). = done
    3. The old configuration should be applied automatically on the new Open-E; if not, create logical volumes of exactly the same size as on the primary node. = done
    4. Create volume replication tasks, paying attention to which node should be configured as source and which as destination, then start the replication tasks. = cannot do because of the error
    5. If possible, wait until the data is consistent. = cannot do
    6. Verify the failover configuration; if it is OK, start the failover on the "primary node". = cannot do
    But the replication tasks are not working, neither automatically nor manually.
    5) I have saved the settings of the "dss1" node in a .cnf file. Can I use it?
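
    Since a successful ICMP ping does not prove that the replication service itself is reachable over the replication interface, a quick way to narrow down problem 3) is to compare ping with a TCP connection attempt to the peer. The sketch below is a generic connectivity probe, not an Open-E utility; the peer address and the port number are placeholders that would need to be replaced with the bond2 address of "dss1" and whatever port the replication service actually listens on (the port shown is an assumption, not a confirmed Open-E value).

    # Generic reachability probe: compares an ICMP ping with a TCP connect attempt.
    # PEER and PORT are placeholders -- substitute the replication-interface
    # address of the peer node and the real replication service port.
    import socket
    import subprocess

    PEER = "192.0.2.10"   # placeholder: replication (bond2) address of dss1
    PORT = 7788           # placeholder port -- NOT a confirmed Open-E value

    def icmp_ping(host):
        """Return True if one ICMP echo request gets a reply (Linux ping)."""
        result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                                stdout=subprocess.DEVNULL)
        return result.returncode == 0

    def tcp_reachable(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port can be established."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        print("ICMP ping:", icmp_ping(PEER))
        print("TCP port %d:" % PORT, tcp_reachable(PEER, PORT))

    If ping succeeds but the TCP connection fails, the issue sits above basic IP reachability (for example, the replication service not accepting connections because the hosts are not bound), which is the kind of distinction plain pings cannot show.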

    Does anyone know how to return a failed node to the cluster?

  2. #2
    Join Date: Oct 2010
    Location: GA
    Posts: 935

    Default

    If you have previously saved settings and configuration, you should restore them. Before you restore, be sure the logical volumes are the same size as before.
    This process may or may not work; it depends on what has changed since before the rebuild.
    If it does not work, you will need to stop the cluster services, manually create whatever is missing, and then restart the cluster.

  3. #3

    Default

    Dear Gr-R, thank you for the answer.
    Excuse me, but your recommendations are applicable to a test lab, not to production. You position your product as "Enterprise-Class Storage Software for Every Business".
    What does "may or may not work" mean in a production environment with many virtual machines hosted on the storage?
    What does "stop the cluster services" mean in production?
    It means "Denial of Service" for all services, with unpredictable results at unpredictable times.
    My cluster config is fully consistent with the manufacturer's recommendations.
    Failure of one of the nodes in a two-node cluster is a fully supported event.
    I think that any enterprise solution must have a standard procedure for returning a failed node to the cluster.
    My new node's hardware configuration is absolutely identical to the old hardware configuration.
    My new node's software version is absolutely identical to the old software version.
    I described the current configuration of each of the nodes.

    Now, as a customer, I want to receive detailed, correct instructions on how to return the new node to the cluster according to my config, not according to some other situation.
    Instead of this, you and your support team give me vague instructions like "manually create what is missing".
    In my post I described in detail what I have.
    I am just asking you to specify what I am missing.
    What is the error "Too many bound hosts on remote host."?
    What is the error "Connection to host 'dss1-host' lost", although all pings in all directions for all IP addresses are successful and host binding from "dss2" to "dss1" is reachable?
    I would gladly send a detailed diagram of my configuration and any logs, but I do not have the rights to post attachments. I am ready to explain everything that is not clear, in words.

    Now I will consider your basic recommendation to restore the config from the saved state. It is consistent with the production logical configuration; I have not made any changes to the configuration since saving it.
    BUT, as I wrote above, I fear replication in the wrong direction. Today all source volumes are hosted on the "dss2" node as sources. After restoring the config from the saved state, lv0000 and lv0001 will be hosted on the "dss1" node as sources. But on "dss1" they are blank volumes. If, for some reason, replication goes in the wrong direction, I will lose all data on lv0000 and lv0001.

    How safe is recovery from the saved state in this situation???

    In fact, I have already assembled a manual configuration identical to the saved settings, except for the replication source settings. Why do you think that restoring is better than my manual config?

    Which is safer, restoring or manual config???

    With respect
    Andy_kr

  4. #4
    Join Date: Oct 2010
    Location: GA
    Posts: 935

    Default

    I say "may or may not work" because there is no general way to add a clean (unconfigured) node to a production cluster. In some cases it is possible; for instance, replacing a secondary node in an active/passive configuration with a valid settings.cnf to restore is possible. Replacing a primary node is a bit more complicated, because the resources are not in their default locations, and trying to bind a host to a secondary (destination by default) system without tasks does not work. This is why you get the bound-host error.
    The "connection to host lost" message is the secondary system trying to reach the primary system, but the nodes are not bound for replication and no source tasks are configured on the primary node.
    If the current primary machine is marked as destination, you should do your restore. The idea here is to get the replication tasks and host bindings back in proper order so you can sync data from the secondary to the primary.

    EDIT** If you have an open support case with our support team, let me know via private message and I will investigate if you like.
    Last edited by Gr-R; 11-26-2014 at 05:16 PM.

  5. #5

    Default will continue

    Yes, Gr-R, I have an open support case (Ticket#1042512), but the support team's answers are very, very slow.
    So I'm trying to find my own solution. Yesterday, in my test lab, I tried to reproduce the failure situation:

    1. Create a test A-A cluster (the addressing below is sanity-checked in the sketch after this procedure):
    Open-E v7.0up10.9101.7637

    First node named "dss-t1"
    Second node named "dss-t2"

    Network interfaces:
    dss-t1:
    - management GUI: eth0 = 192.168.2.217
    - Volume Replication: eth1 = 192.168.134.217
    - Storage Client Access: bond0 = 192.168.81.217
    dss-t2:
    - management GUI: eth0 = 192.168.2.218
    - Volume Replication: eth1 = 192.168.134.218
    - Storage Client Access: bond0 = 192.168.81.218

    Logical Volumes:
    dss-t1: lv0000, lv0001
    dss-t2: lv0000, lv0001

    iSCSI targets:
    dss-t1: dss-t1.target0
    dss-t2: dss-t2.target0

    Replication source settings:
    lv0000 dss-t1=source dss-t2=destination
    lv0001 dss-t1=destination dss-t2=source

    Replication tasks:
    dss-t1: "lv0000-repl" for lv0000, Status = Running, Consistent
    "lv0001-repl_reverse" was automatically created by the system for reverse lv0001 replication from dss-t2, Status = Stopped

    dss-t2: "lv0000-repl_reverse" was automatically created by the system for reverse lv0000 replication from dss-t1, Status = Stopped
    "lv0001-repl" for lv0001, Status = Running, Consistent

    2 auxiliary paths via eth1 and bond0 on each node
    1 ping node = IP address of the iSCSI client
    dss-t1 VIP = 192.168.181.217
    dss-t2 VIP = 192.168.181.218

    dss-t1 iSCSI resources: dss-t1.target0
    dss-t2 iSCSI resources: dss-t2.target0

    2. Start the cluster
    3. Connect iSCSI client to the cluster
    4. Save the configuration of each node in a .cnf file
    5. Start some disk operations on the client with lv0000 from "dss-t1.target0"
    6. Power failure on "dss-t1"
    Result: all resources came under the control of "dss-t2" without interruption of the disk operations

    Then (on the failed node)
    7. Re-create RAID
    8. Re-install Open-E repeating the original configuration:
    - Open-E version, build
    - Network connections in the same ports of the same switches
    - Network settings
    - Host name
    - Volume group
    - Logical volumes (the same name, type and size).
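
    As a sanity check on the addressing plan from step 1 (and on the re-created network settings in step 8), the following sketch verifies that, for each role, the two nodes' addresses sit in the same subnet. It is a generic helper written for this post, not part of Open-E, and the /24 prefix length is an assumption, since the netmasks are not listed above.

    # Sanity check: for each network role, confirm both nodes' addresses share
    # a subnet. The /24 prefix is an assumption (netmasks are not given above).
    import ipaddress

    ADDRESSES = {
        "management (eth0)":         ("192.168.2.217",   "192.168.2.218"),
        "volume replication (eth1)": ("192.168.134.217", "192.168.134.218"),
        "storage access (bond0)":    ("192.168.81.217",  "192.168.81.218"),
        "virtual IPs":               ("192.168.181.217", "192.168.181.218"),
    }

    PREFIX = 24  # assumed prefix length for all interfaces

    def same_subnet(a, b, prefix=PREFIX):
        """True if both addresses fall into the same network of the given prefix."""
        net_a = ipaddress.ip_network("%s/%d" % (a, prefix), strict=False)
        net_b = ipaddress.ip_network("%s/%d" % (b, prefix), strict=False)
        return net_a == net_b

    if __name__ == "__main__":
        for role, (ip1, ip2) in ADDRESSES.items():
            status = "OK" if same_subnet(ip1, ip2) else "MISMATCH"
            print("%-28s %-17s %-17s %s" % (role, ip1, ip2, status))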

    Now I want to return "dss-t1" to the cluster.
    I can ping all IP addresses from one node to the other.
    Host binding from "dss-t2" to "dss-t1" is reachable.

    BUT

    1) I cannot set up host binding from "dss-t1" to "dss-t2". The error is "Too many bound hosts on remote host."
    2) None of the replication tasks is working on node "dss-t2".

    So the situation in the lab is exactly the same as in production.


    Then I tried your recommendation to restore "dss-t1" from the saved configuration.
    After restoring:
    - Host binding from "dss-t1" to "dss-t2" is reachable.
    - Host binding from "dss-t2" to "dss-t1" is reachable.
    - lv0000 on "dss-t1" is not attached to dss-t1.target0, and I cannot attach it with the same SCSI ID as on "dss-t2" because of the error "This SCSI ID is used for another logical volume. Please choose another SCSI ID". It is left without attachment.
    - lv0001 on "dss-t1" is not attached to dss-t2.target0, and I cannot attach it with the same SCSI ID as on "dss-t2" because of the same error. It is left without attachment.
    - Replication source settings came back as:
    lv0000 dss-t1=source dss-t2=source
    lv0001 dss-t1=source dss-t2=source

    which I corrected to:
    lv0000 dss-t1=destination dss-t2=source
    lv0001 dss-t1=destination dss-t2=source

    but
    - None of the replication tasks is working on node "dss-t2". When I tried to manually restart a replication task, it failed with the error "Status: Error 25: Cannot find mirror server logical volumes".
    - When I tried to restart the cluster on "dss-t2", it failed with the error "Failover cluster could not perform requested operation (error code #VI1000)". And now I cannot start the cluster on either "dss-t1" or "dss-t2".
    In fact, I have lost my cluster!!! Thank goodness this is only the test cluster. I think that today the only way to return cluster functionality is to re-create the whole cluster configuration from scratch. This also means "Denial of Service" for all clients for the duration of the volume synchronization, which may be very long for big volumes (see the rough estimate below).
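
    To put a rough number on that last point: the initial synchronization time is roughly the volume size divided by the effective replication throughput. The figures below are assumptions chosen only to illustrate the arithmetic, not measurements from this cluster.

    # Rough estimate of initial volume synchronization time.
    # All figures are illustrative assumptions, not data from this cluster.
    volume_size_tb = 2.0   # assumed logical volume size, in terabytes
    link_gbit = 1.0        # assumed replication link speed, in Gbit/s
    efficiency = 0.7       # assumed usable fraction of the link during sync

    volume_bits = volume_size_tb * 1e12 * 8             # TB -> bits
    throughput_bits_s = link_gbit * 1e9 * efficiency    # usable bits per second

    hours = volume_bits / throughput_bits_s / 3600
    print("~%.1f hours to synchronize %.1f TB over a %.1f Gbit/s link at %.0f%% efficiency"
          % (hours, volume_size_tb, link_gbit, efficiency * 100))

    With these assumptions a single 2 TB volume already takes roughly 6 hours to become consistent, so several large volumes can easily mean a day or more of degraded redundancy.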

    What do you think about this?
    What is missing in my actions?
    What is wrong in my actions?

    And, coming back to the beginning: what is the way to return cluster functionality without stopping and re-creating the cluster?

  6. #6

    Default

    In addition:

    I have just tried to re-create the cluster. When attempting to remove the iSCSI resources from the "dss-t2" resource pool, I received the error "Failover cluster could not perform requested operation (error code #HARCE03)".

  8. #8
    Join Date: Oct 2010
    Location: GA
    Posts: 935

    Default

    As I mentioned previously, there is no way to completely restore a failed primary node; there is only a small chance of restoring a secondary node, and even that depends on the circumstances.
    Because of the communication required and the configuration files that must be updated, it will need to be done manually. To avoid downtime, you may temporarily configure your hosts to connect to the physical IP for this (a sketch of repointing a Linux initiator to a physical portal follows below).
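
    For the "connect to the physical IP temporarily" step, on a Linux initiator this amounts to logging the iSCSI session out of the virtual IP portal and logging in against the node's physical storage address. The sketch below drives the standard open-iscsi iscsiadm commands from Python; the portal addresses and the full target IQN are placeholders (the thread only gives short target names such as dss-t2.target0, so the IQN shown is hypothetical), and initiators on Windows or VMware would make the equivalent change in their own iSCSI configuration instead.

    # Temporarily repoint a Linux open-iscsi initiator from the cluster's
    # virtual IP to a node's physical storage IP. All values below are
    # placeholders -- substitute your real portal addresses and target IQN.
    import subprocess

    VIRTUAL_PORTAL = "192.168.81.100"    # placeholder: cluster virtual IP
    PHYSICAL_PORTAL = "192.168.81.218"   # placeholder: physical bond0 IP of the surviving node
    TARGET_IQN = "iqn.2014-11.example:dss-t2.target0"  # hypothetical full IQN

    def iscsiadm(*args):
        """Run an iscsiadm command and raise if it fails."""
        subprocess.run(["iscsiadm", *args], check=True)

    # Discover targets on the physical portal and log in there first, so the
    # host keeps at least one path to the storage.
    iscsiadm("-m", "discovery", "-t", "sendtargets", "-p", PHYSICAL_PORTAL)
    iscsiadm("-m", "node", "-T", TARGET_IQN, "-p", PHYSICAL_PORTAL, "--login")

    # Then drop the session that goes through the virtual IP.
    iscsiadm("-m", "node", "-T", TARGET_IQN, "-p", VIRTUAL_PORTAL, "--logout")

    With dm-multipath on the initiator, adding the physical-portal path before logging out of the virtual-IP path keeps I/O flowing during the switch; without multipath, the change should be scheduled in a maintenance window.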
