Hello,

One of my OE7 environments crashed over the weekend. I don't know exactly what happened (it's on the other side of the world, so I cannot take a close look), but I think they lost power and the UPS kept the OE7 nodes running but not the switch (it looks like the guy there changed the power cables)...

It all started with a flood of emails like:
System:Cluster: Heartbeat service packets not delivered on time. This might be caused by one or both nodes under heavy load or by current failover triggering policy with its timing settings being too low for the current network configuration.

I logged on and realized that the store was still there but not accessible. So I rebooted Node2 (passive), but this did not solve my problem because Node1 was slow as hell. So I decided to reboot Node1 too.

After that, it was not possible to start the cluster:
Cluster:Logical volume lv0000 can not be served on dss123123123 because it is outdated and this operation could put your data at risk. If you wish to activate this logical volume anyway please go to Volume Replication Manager and switch replication to source manually.
and
sshd:fatal: Write failed: Broken pipe [preauth]


Both nodes were configured as Source and I was not able to change that. I tried for maybe 20 minutes to set Node2 back to Destination, but I failed.
Finally I stopped the whole cluster configuration and reconfigured it (Virtual IPs, Replication, AuxPaths, ...), and since then the storage has been back online and working fine.



What do you think happened?
As I wrote, I think the Switch lost power. The two OE Nodes use a bond for Storage Replication and the cables are connected directly without a switch...


In the past I had some power issues and total crashes there, but in all cases I had no problems starting the Volume Replication and Cluster correctly. This time it was different, and I am trying to understand what happened (and solve it).


Best regards,
Manuel


Additional Question:
What will happen if some files are written to the cluster and it fails before all data has been replicated (not consistent)? Can I switch Source/Destination by mistake and start the replication, or will OE realize from the metadata that I have to switch Source/Destination? Background: in most cases after a power-related crash, both nodes come back online as "Source" in VolRep.