Originally Posted by TheEniGMa
We have been experimenting with two new boxes running DSS, providing iSCSI failover for our five VMware ESXi 4 servers.
I am just curious how the iSCSI failover heartbeat works and what kinds of failures it detects. What I really want to know is whether it detects a simple failure in the iSCSI service (for example, if the iSCSI target daemon crashes) or only detects when the whole box goes down and stops responding to ping.
Well, my experience hasn't been all that confidence-inspiring. See this post, for example:
http://forum.open-e.com/showthread.php?t=1604
I have a setup very similar to yours: two boxes with iSCSI failover configured, providing LUNs to VMware ESX 4 servers. On three different occasions, the iSCSI service has failed for some reason. The failure announces itself with about 40 e-mail messages that look like this:
Code:
2009/08/31 00:03:01|------------[ cut here ]------------
2009/08/31 00:03:01|CPU 4
2009/08/31 00:03:01|Pid: 22488, comm: iscsi-scstd Not tainted 2.6.27.10-oe64-00000-g9b2116f #12
2009/08/31 00:03:01|RIP: 0010:[<ffffffffa01cad78>] [<ffffffffa01cad78>] session_free+0x138/0x140 [iscsi_scst]
2009/08/31 00:03:01|RSP: 0000:ffff88007faed848 EFLAGS: 00210286
2009/08/31 00:03:01|RAX: ffff880106f69088 RBX: ffff880106f68000 RCX: 0000000000000000
2009/08/31 00:03:01|RDX: ffff880106f69000 RSI: 00000000ffffffff RDI: ffff880106f68000
2009/08/31 00:03:01|RBP: 000005d8210f0040 R08: ffff880106f68000 R09: ffff880115f8e920
2009/08/31 00:03:01|R10: ffff88012e6ee4d0 R11: 0000000000000000 R12: ffff88011efa2e20
2009/08/31 00:03:01|R13: ffff88007faedb78 R14: 0000000000000000 R15: ffff88011efa2e08
2009/08/31 00:03:01|FS: 0000000000000000(0000) GS:ffff88012f8fd9c0(0063) knlGS:00000000f7cbb6c0
2009/08/31 00:03:01|CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
2009/08/31 00:03:01|CR2: 00000000093c2000 CR3: 000000007fada000 CR4: 00000000000006a0
2009/08/31 00:03:01|DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2009/08/31 00:03:01|DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2009/08/31 00:03:01|Process iscsi-scstd (pid: 22488, threadinfo ffff88007faec000, task ffff880116d00290)
2009/08/31 00:03:01|Stack: ffff880106f68000 ffffffffa01cb0d9 0000200000000001 0000000000000000
2009/08/31 00:03:01|0100000200010000 ffff88011c2d8000 000005d8210f0040 0000000100000000
2009/08/31 00:03:01|000005d8210f0040 0000080000000000 0000000000000800 ffff88011efa2e00
accompanied by a sinking feeling in the pit of your stomach as you realize it's midnight and you have to drive to the datacenter. Why drive in? Because so far the only way I've been able to recover is a hard reset, i.e., pushing the power button on the primary server.
When the error occurs, all VMware guests start reporting I/O timeouts, and the LUNs disappear and are marked "inaccessible". The Open-E boxes are still reachable via the web interface and ping, so the software never detects a failure and never fails over from the primary to the secondary. I have attempted a manual failover from primary to secondary, and it never completes; it gets stuck somewhere in the middle. If I try to gracefully shut down the primary DSS server, it also hangs and never powers off completely (I can watch it stall in the console messages). The only time the virtual IP migrates from primary to secondary is when I kill the primary with the power button on the front of the box. Once the virtual IP moves to the secondary, I can usually re-scan in VMware and, hey presto, the LUNs are back.
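For what it's worth, the root of the detection problem seems to be that the heartbeat watches the box, not the service: the host still answers ping and serves the web interface while iscsi-scstd is dead, so nothing trips. If you want an independent early warning for exactly this case, probing the iSCSI TCP port from a machine outside the cluster will catch a dead listener that ping misses. Here is a minimal Python sketch; it assumes the target listens on the default iSCSI port 3260, and the portal address is a placeholder you'd replace with your own virtual IP:
Code:
import socket
import sys

PORTAL_HOST = "192.168.0.10"   # placeholder: your DSS virtual IP
ISCSI_PORT = 3260              # default iSCSI target port
TIMEOUT_SECS = 5

def iscsi_port_alive(host, port, timeout):
    """Return True if the iSCSI target accepts a TCP connection."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if iscsi_port_alive(PORTAL_HOST, ISCSI_PORT, TIMEOUT_SECS):
        print("iSCSI target is accepting connections")
        sys.exit(0)
    # The box may still answer ping while iscsi-scstd is dead --
    # exactly the failure mode described above.
    print("iSCSI target port NOT responding; check iscsi-scstd")
    sys.exit(1)
Run it from cron and alert on a non-zero exit. A successful TCP connect doesn't prove the target is healthy (a wedged daemon can still accept connections), but a refused or timed-out connect is a much stronger signal than ping, and would at least have me in the car before the 40th e-mail arrives.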
That's my experience, and it's possible that Open-E has fixed this or at least improved it. I'm still running v6.whatever.3535, because the other issue I've run into is upgrades: the official recommendation is to shut down ALL iSCSI access (i.e., all of the VMs), then update one DSS, reboot it, update the other, reboot it, and then restore access. Not very conducive to a 24/7 uptime operation.
Again, this may be an unusual experience. There are certainly people on this forum more experienced than I am, whom I've learned from, and you should take their experiences into account as well.