Yes, Gr-R, I have opened a support case (Ticket #1042512), but the support team's answers are very slow.
So I'm trying to find a solution on my own. Yesterday, in my test lab, I tried to reproduce the failure scenario:
1. Create a test Active-Active cluster:
Open-E v7.0up10.9101.7637
First node named "dss-t1"
Second node named "dss-t2"
Replication tasks:
dss-t1:
- "lv0000-repl" for lv0000 (Status: Running, Consistent)
- "lv0001-repl_reverse", created automatically by the system for reverse replication of lv0001 from dss-t2 (Status: Stopped)
dss-t2:
- "lv0000-repl_reverse", created automatically by the system for reverse replication of lv0000 from dss-t1 (Status: Stopped)
- "lv0001-repl" for lv0001 (Status: Running, Consistent)
2 auxiliary paths (via eth1 and bond0) on each node
1 ping node = the IP address of the iSCSI client
dss-t1 VIP = 192.168.181.217
dss-t2 VIP = 192.168.181.218
2. Start the cluster
3. Connect the iSCSI client to the cluster (see the client-side sketch after the results below)
4. Save the configuration of each node to a .cnt file
5. Start some disk operations on the client against lv0000 via "dss-t1.target0" (also covered in the sketch below)
6. Power failure on "dss-t1"
Result: all resources failed over to "dss-t2" without interrupting the client's disk operations.
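For reference, this is roughly what I run on the client for steps 3 and 5 (a minimal sketch: it assumes a Linux client with open-iscsi installed, and /dev/sdb is only a placeholder for whatever device the initiator actually creates for lv0000):

```python
#!/usr/bin/env python3
"""Steps 3 and 5 from the client side: log in to the cluster's iSCSI
target through the VIP, then generate continuous writes so the failover
can be observed. Assumes a Linux client with open-iscsi installed;
/dev/sdb is a placeholder for the device the initiator actually creates."""
import os
import subprocess
import time

VIP = "192.168.181.217"   # dss-t1 VIP from the cluster config above
DEV = "/dev/sdb"          # placeholder: the LUN for lv0000 behind dss-t1.target0

# Step 3: discover all targets behind the VIP and log in.
subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", VIP],
               check=True)
subprocess.run(["iscsiadm", "-m", "node", "-p", VIP, "--login"], check=True)

# Step 5: keep writing 1 MiB blocks to the LUN. This DESTROYS data on
# the device, so it is for the lab only. Assumes the LUN is >= 1 GiB.
block = os.urandom(1024 * 1024)
fd = os.open(DEV, os.O_WRONLY)
offset = 0
while True:
    os.pwrite(fd, block, offset)
    os.fsync(fd)                                   # push the write out to the target
    offset = (offset + len(block)) % (1024**3)     # wrap around at 1 GiB
    time.sleep(0.1)
```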
Then, on the failed node:
7. Re-create the RAID
8. Re-install Open-E, repeating the original configuration:
- Open-E version, build
- Network connections on the same ports of the same switches
- Network settings
- Host name
- Volume group
- Logical volumes (the same name, type and size; see the illustrative sketch below).
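Just to illustrate what "the same name, type and size" means here: DSS V7 recreates the volume group and logical volumes through its web GUI, but they are LVM-based underneath, so on a plain Linux box the equivalent would look roughly like this (/dev/sda and the 100 GiB size are placeholders, not my actual values):

```python
#!/usr/bin/env python3
"""Illustration only: the GUI-driven volume recreation on the rebuilt
node, expressed as the equivalent plain-Linux LVM commands. /dev/sda
and the 100 GiB size are placeholders."""
import subprocess

RAID_DEV = "/dev/sda"   # placeholder for the freshly rebuilt RAID device

subprocess.run(["vgcreate", "vg00", RAID_DEV], check=True)
for lv in ("lv0000", "lv0001"):
    # Same name and exactly the same size as on the surviving node.
    subprocess.run(["lvcreate", "-L", "100G", "-n", lv, "vg00"], check=True)
```

As I understand it, the replication tasks look up their mirror volumes by name and size, so any mismatch here would leave them unable to find the mirror.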
Now I want to return "dss-t1" to the cluster.
I can ping all IP addresses from one node to the other (checked with the small sweep shown below).
Host binding from "dss-t2" to "dss-t1" is reachable.
BUT
1) I cannot set up host binding from "dss-t1" to "dss-t2". The error is "Too many bound hosts on remote host."
2) None of the replication tasks are running on node "dss-t2".
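For completeness, this is roughly how I checked the reachability mentioned above (only the two VIPs are listed; the physical and auxiliary-path addresses are checked the same way):

```python
#!/usr/bin/env python3
"""Minimal reachability sweep between the nodes. Only the two VIPs from
the lab config are listed; the physical and auxiliary-path addresses
are appended in the same way."""
import subprocess

ADDRESSES = ["192.168.181.217", "192.168.181.218"]

for ip in ADDRESSES:
    ok = subprocess.run(["ping", "-c", "3", "-W", "2", ip],
                        stdout=subprocess.DEVNULL).returncode == 0
    print(f"{ip}: {'reachable' if ok else 'UNREACHABLE'}")
```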
So the lab now shows exactly the same situation as production.
Then I tried your recommendation to restore "dss-t1" from the saved configuration.
After restoring:
- Host binding from "dss-t1" to "dss-t2" is reachable.
- Host binding from "dss-t2" to "dss-t1" is reachable.
- lv0000 on "dss-t1" is not attached to dss-t1.target0, and I cannot attach it with the same SCSI ID as on "dss-t2" because of the error "This SCSI ID is used for another logical volume. Please choose another SCSI ID". It is left without an attachment.
- lv0001 on "dss-t1" is not attached to dss-t2.target0, and it fails with the same SCSI ID error, so it is also left without an attachment. (See the client-side SCSI ID check after this list.)
- Replication source settings were:
  lv0000: dss-t1 = source, dss-t2 = source
  lv0001: dss-t1 = source, dss-t2 = source
  and I corrected them to:
  lv0000: dss-t1 = destination, dss-t2 = source
  lv0001: dss-t1 = destination, dss-t2 = source
but
- None of the replication tasks are running on node "dss-t2". When I tried to restart a replication task manually, it failed with the error "Status: Error 25: Cannot find mirror server logical volumes".
- When I tried to restart the cluster on "dss-t2", it failed with the error "Failover cluster could not perform requested operation (error code #VI1000)". Now I cannot start the cluster on either node, neither "dss-t1" nor "dss-t2".
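Regarding the SCSI ID conflict above, one thing I could check from the client side is which SCSI IDs the initiator actually sees. On a Linux client the udev by-id links show them (a small sketch):

```python
#!/usr/bin/env python3
"""List the SCSI IDs the Linux iSCSI client actually sees, by reading
the udev-created /dev/disk/by-id links. Useful for finding which
volume currently holds the ID that the GUI refuses to reuse."""
import os

BY_ID = "/dev/disk/by-id"

for name in sorted(os.listdir(BY_ID)):
    if name.startswith("scsi-"):
        target = os.path.realpath(os.path.join(BY_ID, name))
        print(f"{name} -> {target}")
```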
In fact, I have lost my cluster!!! Thank goodness this is only the test cluster. As far as I can tell, the only way to restore cluster functionality today is to re-create the entire cluster configuration from scratch. That also means a denial of service for all clients for the whole duration of the volume synchronization, which can be very long for big volumes (see the rough estimate below).
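To put a rough number on that synchronization time (the 8 TiB volume size and the ~110 MB/s effective rate of a dedicated 1 GbE replication link are assumptions for the sake of the example, not my actual figures):

```python
#!/usr/bin/env python3
"""Rough, illustrative estimate of full-resync time. The 8 TiB volume
size and ~110 MB/s effective 1 GbE rate are assumptions for the example."""
volume_bytes = 8 * 1024**4          # 8 TiB volume
rate_bytes_s = 110 * 1000**2        # ~110 MB/s over a dedicated 1 GbE link
hours = volume_bytes / rate_bytes_s / 3600
print(f"Estimated full resync: {hours:.1f} hours")   # about 22 hours
```

So for multi-terabyte volumes we are talking about the better part of a day of degraded service.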
What do you think about this?
What step am I missing?
What am I doing wrong?
And, coming back to the original question: how can I restore cluster functionality without stopping and re-creating the whole cluster?