
Thread: A-A-Sync + failover with power-off

  1. #1

A-A-Sync + failover with power-off

    Hi,

this is a general question. We have had the same recovery problem three times after a complete power loss.

    Our equipment: two Open-E DSS V7 nodes in different buildings, working as an A-A-Sync cluster.
    - sync line: point-to-point, 10 Gbit
    - storage network line: 10 Gbit
    - admin network line: 2 Gbit
    We have 3 aux paths and some ping nodes in the storage network.
    The cluster works well: when we trigger a failover, it succeeds, we can maintain the passive system, and then we can fail back again. All good.

    Each node is secured by its own local UPS devices, which provide battery power for 15-30 minutes in case of line power loss.

    The problem:

    We had three site-wide power losses in all of our buildings during the last months (at night, on a weekend, of course, bad luck ;-)).
    And the behavior of the nodes/cluster was the same each time:

    After ten-plus minutes, the first node went down when its batteries were exhausted.
    Another ten minutes later(!), the second one followed, after its UPS shut down.
    The interval between them should be enough to complete a failover.

    We expected to be able to restart the cluster and fail back after repowering the nodes.
    But this didn't happen.

    We managed to re-create all sync tasks after we repowered the nodes.
    But we found the cluster unable to start, and we needed the help of technical support (they did something via the remote support line that we cannot do via the admin interface).

    Question: Bug or feature?

    Our current idea: the nodes did not seem to keep the cluster's state from before the power loss of the second node.
    So I think that I misunderstand how failover works, at least in some detail. Any hints?

    Robert

  2. #2


    If a source node is rebooted and it can find its destination, replication will resume. This can be verified in the web GUI under STATUS --> Tasks. Emails are also sent when replication is interrupted, as well as at the various steps until synchronization completes.

    If the destination node is rebooted alone, replication will resume when it is back online.

    If for some reason both machines are rebooted together (not recommended), the volumes will be marked as destination on each side in order to protect data integrity against a possible split-brain scenario between the nodes. This is similar to your case: the last node standing lost power, and when the nodes came back online the volumes were set to destination, because we do not want to start a failover that could cause a split brain.
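
    To make that behavior concrete, here is a minimal sketch of the boot-time role decision described above, written in Python. Every name in it (Role, role_after_boot, peer_confirms_state) is a hypothetical illustration, not DSS V7 code or its API.

    Code:
    # Minimal sketch of the boot-time role decision described above.
    # All names are hypothetical; DSS V7's actual implementation is not public.

    from enum import Enum

    class Role(Enum):
        SOURCE = "source"            # volume serves data and replicates outward
        DESTINATION = "destination"  # volume only receives replication

    def role_after_boot(previous_role: Role, peer_confirms_state: bool) -> Role:
        """Decide a volume's role when a node comes back online.

        peer_confirms_state is True only when the other node stayed up the
        whole time (or a clean handover recorded the roles), so the pre-outage
        state is known to be safe to resume.
        """
        if peer_confirms_state:
            # One-sided reboot: replication simply resumes with the old roles.
            return previous_role
        # Both nodes went down without a handover: neither side can prove the
        # other did not accept writes in the meantime, so demote to DESTINATION
        # and wait for an operator to promote the correct side.
        return Role.DESTINATION

    if __name__ == "__main__":
        # Source rebooted while the destination stayed up: the role survives.
        print(role_after_boot(Role.SOURCE, peer_confirms_state=True))   # Role.SOURCE
        # Both nodes lost power together: everything comes back as destination.
        print(role_after_boot(Role.SOURCE, peer_confirms_state=False))  # Role.DESTINATION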

    In a failover configuration, if the source node is rebooted before the destination, this will cause a failover event and will require human intervention to move resources back to the original source node.
    All the best,

    Todd Maxwell



  3. #3


    Thanks for your answer, but it is not clear yet, sorry.

    Two questions:

    1. General case: when we know in advance about an upcoming power loss (planned work on the power line, for instance), is there a way to shut both nodes down in an orderly fashion and bring them back up as a cluster after power returns? In which order must we shut them down, and in which order must we repower them? Or do we have to stop the cluster and all replication tasks beforehand, and re-create everything after repowering?

    2. In our case of an unplanned power loss, we found all volumes on both sides marked as destinations. That is correct behavior. We then changed the appropriate ones to sources and started all replication tasks successfully. All good until then. After some time, when all replication tasks were in sync, we started the cluster, and this failed; more precisely, the start process never finished, so we had to reboot the systems to break the deadlock. After the reboot, the state was again destinations on both sides. We did the same thing again, with no success. Then we called Open-E support, and after hours and two more failed attempts they succeeded with some magic via remote support. Is there any way to solve this without Open-E remote support? Remember that we successfully re-created all replications, and everything was in sync. Everything looked the same as when we originally created the cluster, but here we couldn't start it.
    The Open-E technician didn't understand it either.

  4. #4


    Quote Originally Posted by the_nipper View Post
    1. General case: when we know in advance about an upcoming power loss (planned work on the power line, for instance), is there a way to shut both nodes down in an orderly fashion and bring them back up as a cluster after power returns? In which order must we shut them down, and in which order must we repower them? Or do we have to stop the cluster and all replication tasks beforehand, and re-create everything after repowering?
    In such a case, it would be best to stop the cluster and replication, then bring both nodes down. Alternatively, you can fail over to one machine, bring down the passive machine, and finally shut the active machine down. Then you can start the active one again, wait a while, and start the passive node.
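
    As a rough sketch of those two orderings, something like the following; every helper function here is a placeholder for a manual step in the web GUI or for your own power-management tooling, not an Open-E command:

    Code:
    # Hypothetical orchestration sketch for a planned outage of an A-A-Sync pair.
    # None of these helpers are Open-E commands; each one stands in for a manual
    # step in the web GUI or for your own UPS/remote-management scripting.

    import time

    def stop_cluster() -> None:
        print("cluster services stopped on both nodes")

    def stop_replication_tasks() -> None:
        print("all volume replication tasks stopped")

    def trigger_failover(to_node: str) -> None:
        print(f"resources moved to {to_node}")

    def shutdown_node(node: str) -> None:
        print(f"{node}: shut down")

    def start_node(node: str) -> None:
        print(f"{node}: powered on")

    def wait_until_ready(node: str) -> None:
        print(f"{node}: volumes and replication tasks back online")

    def planned_outage_stop_first(node_a: str, node_b: str) -> None:
        """Option 1: stop the cluster and replication, then shut both nodes down."""
        stop_cluster()
        stop_replication_tasks()
        shutdown_node(node_b)  # once everything is stopped, the order is free
        shutdown_node(node_a)

    def planned_outage_failover_first(active: str, passive: str) -> None:
        """Option 2: fail over to one machine, stop the passive node, then the active one."""
        trigger_failover(to_node=active)
        shutdown_node(passive)
        shutdown_node(active)

    def planned_restart(first: str, second: str) -> None:
        """Start one node, give it time to settle, then start the other."""
        start_node(first)
        wait_until_ready(first)
        time.sleep(60)  # "wait a while" before bringing the second node up
        start_node(second)

    if __name__ == "__main__":
        planned_outage_failover_first(active="node-a", passive="node-b")
        planned_restart(first="node-a", second="node-b")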

    Quote Originally Posted by the_nipper View Post
    2. In our case of an unplanned power loss, we found all volumes on both sides marked as destinations. That is correct behavior. We then changed the appropriate ones to sources and started all replication tasks successfully. All good until then. After some time, when all replication tasks were in sync, we started the cluster, and this failed; more precisely, the start process never finished, so we had to reboot the systems to break the deadlock. After the reboot, the state was again destinations on both sides. We did the same thing again, with no success. Then we called Open-E support, and after hours and two more failed attempts they succeeded with some magic via remote support. Is there any way to solve this without Open-E remote support? Remember that we successfully re-created all replications, and everything was in sync. Everything looked the same as when we originally created the cluster, but here we couldn't start it.
    In such cases, clearing the replication cache can be helpful, as both nodes went down suddenly without having a chance to hand over their resources.
