I say it may or may not work because there is generally no way to add a clean (unconfigured) node to a production cluster. In some cases it is possible; for instance, replacing a secondary node in an active/passive configuration works when you have a valid settings file to restore. Replacing a primary node is more complicated, because the resources are not in their default locations, and trying to bind a host to a secondary (destination by default) system without tasks does not work. This is why you get the bound-host error.
The "connection to host is lost" message is the secondary system trying to reach the primary system, but the nodes are not bound for replication and no source tasks are configured on the primary node.
If the current primary machine is marked as the destination, you should perform your restore. The idea is to get the replication tasks and host bindings back in the proper order so you can sync data from the secondary to the primary.
EDIT: If you have an open support case with our support team, let me know via private message and I will investigate if you like.
Yes, Gr-R, I have opened a support case (Ticket#1042512), but the support team's answers are very, very slow.
So I'm trying to find my own solution. Yesterday in my test lab I tried to reproduce the failure situation:
1. Create a test A-A (active-active) cluster:
Open-E v7.0up10.9101.7637
First node named "dss-t1"
Second node named "dss-t2"
Replication tasks:
dss-t1: "lv0000-repl" for lv0000, Status = Running, Consistent
"lv0001-repl_reverse", automatically created by the system for reverse replication of lv0001 from dss-t2, Status = Stopped
dss-t2: "lv0000-repl_reverse", automatically created by the system for reverse replication of lv0000 from dss-t1, Status = Stopped
"lv0001-repl" for lv0001, Status = Running, Consistent
2 auxiliary paths via eth1 and bond0 on each node
1 ping node = the IP address of the iSCSI client
dss-t1 VIP = 192.168.181.217
dss-t2 VIP = 192.168.181.218
2. Start the cluster
3. Connect the iSCSI client to the cluster
4. Save the configuration of each node to a .cnt file
5. Start some disk operations on the client against lv0000 via "dss-t1.target0" (see the sketch after this list)
6. Power failure on "dss-t1"
Result: all resources came under the control of "dss-t2" without the disk operations being interrupted.
Then (on the failed node)
7. Re-create RAID
8. Re-install Open-E, repeating the original configuration:
- Open-E version and build
- Network connections to the same ports of the same switches
- Network settings
- Host name
- Volume group
- Logical volumes (same names, types, and sizes)
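For reference, the "disk operations" in step 5 were just a continuous write/read/verify loop against the iSCSI-backed volume. A minimal sketch of that kind of load, assuming a Linux client with the lv0000 LUN mounted at an example path, could look like this:

```python
# Hypothetical sketch of the client-side load used in step 5: keep writing and
# re-reading blocks on the iSCSI-backed volume so that I/O is in flight when the
# power-fail happens. The mount point and sizes are examples, not real values.
import hashlib
import os
import time

MOUNT_POINT = "/mnt/lv0000"        # assumed mount point of the lv0000 LUN on the client
BLOCK = b"failover-test " * 4096   # ~56 KiB payload per file
RUN_SECONDS = 600                  # keep the load running across the power failure

expected = hashlib.sha256(BLOCK).hexdigest()
deadline = time.time() + RUN_SECONDS
i = 0
while time.time() < deadline:
    path = os.path.join(MOUNT_POINT, f"io-{i % 100:03d}.bin")
    with open(path, "wb") as f:    # write a block and force it down to the LUN
        f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:    # read it back and verify the checksum
        ok = hashlib.sha256(f.read()).hexdigest() == expected
    print(f"{time.strftime('%H:%M:%S')} cycle {i}: {'OK' if ok else 'MISMATCH'}")
    i += 1
    time.sleep(1)
```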
Now I want to return "dss-t1" to the cluster.
I can ping all IP addresses from one node to the other (see the ping sketch below).
The host binding from "dss-t2" to "dss-t1" is reachable.
BUT
1) I cannot set up a host binding from "dss-t1" to "dss-t2". The error is "Too many bound hosts on remote host."
2) None of the replication tasks are working on node "dss-t2".
So the situation in the lab is exactly the same as the situation in production.
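For reference, the ping check mentioned above was nothing more than pinging every cluster address of the peer from one node to the other; roughly the equivalent of this, run from any Linux host on the same networks (the address list is only an example):

```python
# A quick reachability sweep: one ICMP echo to every cluster address of the peer
# node (Linux "ping" syntax). The address list is an example; the VIPs are the
# ones from the test lab, the physical/auxiliary addresses are placeholders.
import subprocess

PEER_ADDRESSES = [
    "192.168.181.217",   # dss-t1 VIP
    "192.168.181.218",   # dss-t2 VIP
    # physical and auxiliary-path addresses of the peer node would go here
]

for ip in PEER_ADDRESSES:
    result = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                            stdout=subprocess.DEVNULL)
    print(f"{ip}: {'reachable' if result.returncode == 0 else 'UNREACHABLE'}")
```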
Then I tried your recommendation to restore "dss-t1" from the saved configuration.
After restoring:
- The host binding from "dss-t1" to "dss-t2" is reachable.
- The host binding from "dss-t2" to "dss-t1" is reachable.
- lv0000 on "dss-t1" is not attached to dss-t1.target0, and I cannot attach it with the same SCSI ID as on "dss-t2" because of the error "This SCSI ID is used for another logical volume. Please choose another SCSI ID". I left it unattached.
- lv0001 on "dss-t1" is not attached to dss-t2.target0, and I cannot attach it with the same SCSI ID as on "dss-t2" because of the same error. I left it unattached.
- Replication source settings were:
lv0000: dss-t1 = source, dss-t2 = source
lv0001: dss-t1 = source, dss-t2 = source
which I fixed to:
lv0000: dss-t1 = destination, dss-t2 = source
lv0001: dss-t1 = destination, dss-t2 = source
but
- None of the replication tasks are working on node "dss-t2". When I tried to manually restart a replication task, it failed with the error "Status: Error 25: Cannot find mirror server logical volumes".
- When I tried to restart the cluster on "dss-t2", it failed with the error "Failover cluster could not perform requested operation (error code #VI1000)". And now I cannot start the cluster on either "dss-t1" or "dss-t2".
In fact, I have lost my cluster! Thank goodness this is only a test cluster. I think the only way to get the cluster working again now is to re-create the entire cluster configuration from scratch. That also means a denial of service for all clients for the whole duration of the volume synchronization, which may be very long for big volumes.
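Just to put a rough number on that synchronization window, using example figures rather than measurements from this cluster:

```python
# Back-of-the-envelope estimate of the downtime window: the initial full sync of
# one logical volume over the replication link. Both numbers are assumptions.
volume_gib = 4096        # e.g. a 4 TiB logical volume
link_mib_per_s = 100     # roughly what a 1 GbE replication link sustains

hours = volume_gib * 1024 / link_mib_per_s / 3600
print(f"Full sync of {volume_gib} GiB at {link_mib_per_s} MiB/s: about {hours:.1f} hours")
# -> about 11.7 hours, and that is per volume if the syncs cannot run in parallel
```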
What do you think about this?
What is missing in my actions?
What is wrong in my actions?
And, returning to the original question, what is the way to restore cluster functionality without stopping and re-creating the cluster?
I have just tried to re-create the cluster. When attempting to remove the iSCSI resources from the "dss-t2" resource pool, I received the error "Failover cluster could not perform requested operation (error code #HARCE03)".
As I mentioned previously, there is no way to completely restore a failed primary node; there is only a small chance of restoring a secondary node, and even that depends on the circumstances.
Because of the communication and configuration-file updates involved, this will need to be done manually. To avoid downtime, you may temporarily configure your hosts to connect to the physical IP for this; a rough example follows.
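For instance, on a Linux initiator using open-iscsi, temporarily moving a session from the cluster VIP to a node's physical address could look roughly like the sketch below. The target IQN and the physical address are placeholders, not values from your configuration; a Windows initiator would do the same through the iSCSI Initiator control panel.

```python
# Rough sketch for a Linux open-iscsi initiator: log out of the portal that uses
# the cluster VIP, then discover and log in again via the node's physical IP.
# The target IQN and both portal addresses are placeholders.
import subprocess

TARGET_IQN = "iqn.2014-01.example:dss-t2.target0"   # placeholder IQN
VIP_PORTAL = "192.168.181.218:3260"                 # VIP the client normally uses
PHYS_PORTAL = "192.168.181.12:3260"                 # assumed physical IP of the node

def run(*cmd):
    print("+", " ".join(cmd))                        # show each command before running it
    subprocess.run(cmd, check=True)

# drop the session that goes through the virtual IP
run("iscsiadm", "-m", "node", "-T", TARGET_IQN, "-p", VIP_PORTAL, "--logout")
# discover the target on the physical address and log in there instead
run("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PHYS_PORTAL)
run("iscsiadm", "-m", "node", "-T", TARGET_IQN, "-p", PHYS_PORTAL, "--login")
```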