Originally Posted by TheEniGMa
We have been experimenting with two new boxes running DSS, providing iSCSI failover for our five VMware ESXi 4 servers.
I am just curious how the iSCSI failover heartbeat works and what kinds of failures it detects. What I really want to know is whether it detects a simple failure in the iSCSI service (for example, if the iSCSI target daemon crashes) or only detects when the whole box goes down and stops responding to ping.
Well, my experience hasn't been all that confidence-inspiring. See this post, for example:
http://forum.open-e.com/showthread.php?t=1604
I have a setup very similar to yours: two boxes with iSCSI failover configured, providing LUNs to VMware ESX 4 servers. On three different occasions, the iSCSI service has failed for some reason. The failure announces itself with about 40 e-mail messages that look like this:
Code:
2009/08/31 00:03:01|------------[ cut here ]------------
2009/08/31 00:03:01|CPU 4
2009/08/31 00:03:01|Pid: 22488, comm: iscsi-scstd Not tainted 2.6.27.10-oe64-00000-g9b2116f #12
2009/08/31 00:03:01|RIP: 0010:[<ffffffffa01cad78>] [<ffffffffa01cad78>] session_free+0x138/0x140 [iscsi_scst]
2009/08/31 00:03:01|RSP: 0000:ffff88007faed848 EFLAGS: 00210286
2009/08/31 00:03:01|RAX: ffff880106f69088 RBX: ffff880106f68000 RCX: 0000000000000000
2009/08/31 00:03:01|RDX: ffff880106f69000 RSI: 00000000ffffffff RDI: ffff880106f68000
2009/08/31 00:03:01|RBP: 000005d8210f0040 R08: ffff880106f68000 R09: ffff880115f8e920
2009/08/31 00:03:01|R10: ffff88012e6ee4d0 R11: 0000000000000000 R12: ffff88011efa2e20
2009/08/31 00:03:01|R13: ffff88007faedb78 R14: 0000000000000000 R15: ffff88011efa2e08
2009/08/31 00:03:01|FS: 0000000000000000(0000) GS:ffff88012f8fd9c0(0063) knlGS:00000000f7cbb6c0
2009/08/31 00:03:01|CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
2009/08/31 00:03:01|CR2: 00000000093c2000 CR3: 000000007fada000 CR4: 00000000000006a0
2009/08/31 00:03:01|DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2009/08/31 00:03:01|DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2009/08/31 00:03:01|Process iscsi-scstd (pid: 22488, threadinfo ffff88007faec000, task ffff880116d00290)
2009/08/31 00:03:01|Stack: ffff880106f68000 ffffffffa01cb0d9 0000200000000001 0000000000000000
2009/08/31 00:03:01|0100000200010000 ffff88011c2d8000 000005d8210f0040 0000000100000000
2009/08/31 00:03:01|000005d8210f0040 0000080000000000 0000000000000800 ffff88011efa2e00
accompanied by a sinking feeling in the pit of your stomach as you realize it's midnight and you have to drive to the datacenter. Why drive in? Because so far the only way I've been able to recover is a hard reset, i.e., pushing the power button on the primary server.
When the error occurs, all VMware guests start reporting I/O timeouts, and the LUNs disappear and are marked "inaccessible". The Open-E boxes are still reachable via the web interface and ping, so the software never detects a failure and never fails over from the primary to the secondary. I have attempted a manual failover from primary to secondary, and it never completes; it gets stuck somewhere in the middle. If I try to gracefully shut down the primary DSS server, it also hangs and never powers off completely (I can watch it stall in the console messages). The only time the virtual IP migrates from primary to secondary is when I kill the primary with the power button on the front of the box. Once the virtual IP moves to the secondary, I can usually re-scan in VMware and, hey presto, the LUNs are back.
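For what it's worth, the root of the detection problem seems to be that the heartbeat watches the box, not the service: the host still answers ping and serves the web interface while iscsi-scstd is dead, so nothing trips. If you want an independent early warning for exactly this case, probing the iSCSI TCP port from a machine outside the cluster will catch a dead listener that ping misses. Here is a minimal Python sketch; it assumes the target listens on the default iSCSI port 3260, and the portal address is a placeholder you'd replace with your own virtual IP:
Code:
import socket
import sys

PORTAL_HOST = "192.168.0.10"   # placeholder: your DSS virtual IP
ISCSI_PORT = 3260              # default iSCSI target port
TIMEOUT_SECS = 5

def iscsi_port_alive(host, port, timeout):
    """Return True if the iSCSI target accepts a TCP connection."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if iscsi_port_alive(PORTAL_HOST, ISCSI_PORT, TIMEOUT_SECS):
        print("iSCSI target is accepting connections")
        sys.exit(0)
    # The box may still answer ping while iscsi-scstd is dead --
    # exactly the failure mode described above.
    print("iSCSI target port NOT responding; check iscsi-scstd")
    sys.exit(1)
Run it from cron and alert on a non-zero exit. A successful TCP connect doesn't prove the target is healthy (a wedged daemon can still accept connections), but a refused or timed-out connect is a much stronger signal than ping, and would at least have me in the car before the 40th e-mail arrives.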
That's my experience, and it's possible that Open-E has fixed this or at least improved it. I'm still running v6.whatever.3535, because the other issue I've run into is upgrades: the official recommendation is to shut down ALL iSCSI access (i.e., all of the VMs), then update one DSS, reboot it, update the other, reboot it, and then restore access. Not very conducive to a 24/7 uptime operation.
Again, this may be an unusual experience. There are certainly people on this forum more experienced than I am, whom I've learned from, and you should take their experiences into account as well.