
Thread: iSCSI autofailover is awesome--what are its limitations?

  1. #1

    iSCSI autofailover is awesome--what are its limitations?

    So, I've been playing with iSCSI autofailover. It's awesome and easy to use if you follow the step-by-step how-to PDF (the how-to is easier to follow than the white paper). I use more aggressive failover timings so that Windows clients won't time out during a failover (some Windows systems will time out if attached storage doesn't respond within about half a minute).

    Failover--from unplugging a network cable until the logs say failover is complete--takes just under 30 seconds. While playing music in Windows from the iSCSI volume being failed over, there's a pause for a few seconds, but then the music continues without any errors.
    I have:
    Warn time: 2000 ms
    Dead time: 4000 ms
    Init time: 5000 ms
    Keep-alive time: 500 ms
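
    A rough back-of-the-envelope on these timings (a sketch only — the exact semantics of Open-E's warn/dead/init timers are my assumption, not documented in this thread):

    ```python
    # Hedged sketch: assumed semantics of the failover timers quoted above.
    # All values are the ones from the post, in milliseconds.
    warn_time = 2000    # assumed: log a warning after this much heartbeat silence
    dead_time = 4000    # assumed: declare the peer dead after this much silence
    init_time = 5000    # assumed: standby waits this long before taking over
    keep_alive = 500    # assumed: heartbeat interval between the nodes

    # Earliest the standby could begin acquiring resources after the cable is pulled:
    detect_and_init_ms = dead_time + init_time
    print(f"takeover can begin after ~{detect_and_init_ms / 1000:.1f} s")

    # The post reports ~16 s until "Node is now active" and ~28 s until the logs
    # say failover is complete, so under these assumptions most of the time goes
    # to resource acquisition, not failure detection.
    resource_acquisition_s = 16 - detect_and_init_ms / 1000
    print(f"implied resource-acquisition time: ~{resource_acquisition_s:.0f} s")
    ```

    If that reading is right, tightening the dead/init timers further would shave only a few seconds; the bulk of the failover window is the secondary bringing the resources up.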


    It could be that failover is a little faster than 28 seconds, but I'm not sure. Is data being accessed on the secondary system when the logs on the secondary say "iSCSI Failover: Secondary node: Acquiring resources. Node is now active."? If so, then failover takes only 16 seconds.

    What are the limitations of iSCSI failover? Can you add new volumes/tasks to be failed over without disconnecting your clients? Can you update one half of the cluster to a newer version of Open-E DSS and then fail over and upgrade the other half? Does storage expansion work? What sort of performance are people getting when using autofailover?


    BTW, if using a SAN, whether iSCSI or Fibre Channel, on a Windows platform, it's recommended to add or change the parameter (of data type REG_DWORD) "TimeOutValue" in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk to 60 (decimal) or more. This is the time in seconds Windows waits for the storage to return data before it starts generating errors. It should be at least twice your expected failover time, so it's wise to set TimeOutValue to 180 or more.
    (more info on this is found here: http://publib.boulder.ibm.com/infoce...ut_198ovw.html )
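
    For reference, the registry change described above can be applied from an elevated command prompt roughly like this (a sketch — pick a value of at least twice your expected failover time, and reboot for it to take effect):

    ```shell
    :: Set the Windows disk I/O timeout to 180 seconds (REG_DWORD, decimal).
    :: Run from an elevated (Administrator) command prompt.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue /t REG_DWORD /d 180 /f

    :: Verify the value that is now set:
    reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue
    ```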

  2. #2


    Can you add new volumes/tasks to be failed over without disconnecting your clients?

    This feature's ETA is around December of '09, but it could be January.

    Can you update one half of the cluster to a newer version of open-e dss and then failover and upgrade the other half?

    We are looking to do this as well but have no ETA, though we might escalate it before Q1 of '09.

    Does storage expansion work?

    Yes, but you still have to stop the service and the replication. No ETA on removing that limitation yet.
    All the best,

    Todd Maxwell


    Follow the red "E"
    Facebook | Twitter | YouTube

  3. #3
    Join Date
    Sep 2007
    Posts
    80


    Can you update one half of the cluster to a newer version of open-e dss and then failover and upgrade the other half?
    This is certainly a question I wish to explore as well - having the ability to pull half of an iSCSI active/standby system out of production for maintenance (even if it were a manual failover) with no loss whatsoever of access for LUN users (esp. ESX hosts, etc.). Another option being explored in another thread (though rather poorly by me!) is how to do this more cost-effectively, so the standby system is rated lower than the primary (normally OK in DR situations). Cheers.

  4. #4


    OK, I really need time to place this in the KB, but I got this rough draft from the engineers. I might have to make some modifications, but I wanted to fire it off just to get it going.

    Shon - I might be able to test this with the VMware server that I have, but if not, can you test it for me?

    Maybe something for the release notes noting that volume replication/failover will not work between these versions during this upgrade...

    ----------------------------------------------------------------
    How to update your DSS V6 with Auto Failover.

    It is absolutely imperative to verify that both of the DSS V6 systems are up to a version that allows this, and it is highly recommended to stop the Auto Failover service for the updates until officially noted.

    If you wish to proceed regardless of our recommendation to shut down the service then proceed with the following:

    1) Save all settings in Maint. > Misc., selecting all options.
    2) Stop the secondary system.
    3) Update the secondary system to the new version.
    4) Start the secondary system.
    5) Wait for replication to be consistent.
    6) Start a manual failover.
    7) Stop the primary system.
    8) Upgrade the primary system.
    9) Start the primary system.
    10) Wait for replication to be consistent.
    11) Start failback.

    (This is the part - I need to verify with ESX.)

    This should not drop any data and all dropped connections should automatically reconnect.
    All the best,

    Todd Maxwell



  5. #5


    This is a process I have tested over and over...with mixed results...

    I have 2 x DSS 6 systems running with iSCSI auto-failover in a vSphere 4 environment. I have had to upgrade these systems 3 times.

    DSS 5 to DSS 6 beta
    DSS 6 beta to DSS 6 up4
    DSS 6 up4 to DSS 6 up6 (only halfway completed).

    I will tell you this... it hasn't always been as smooth as I would have liked. The procedure I use is similar to what Todd has laid out. However, some of the issues I have encountered have been uncommon.

    The secondary server is always the easiest: just save the config, reboot, install, and reload the config. After the second reboot you can restart the failover service and all is well.

    Upgrading the primary has been cumbersome for me and at times fatal. I've corrupted 2 VMware Active Directory servers and a couple of misc. servers because of the downtime. I will say this: those upgrades were always from a beta version to a final version, so this isn't a true estimate.

    When upgrading the primary server, start the failover to the secondary, then reboot, install, reload, etc. Here's where it gets tricky: when I have tried to fail back, everything must be well, otherwise it goes belly-up during failback and my iSCSI failover service stops altogether, which disconnects all 7 of my ESX servers... not good!

    Some of the issues I have had are related to NIC renumbering, defaulting back to a 1500-byte MTU, and even switching back to the IETD iSCSI target. I'll say this again: those were beta versions of DSS 6.

    I did update my secondary server last night to update 6, and it was about as smooth as it could be. I am going to try the primary server today after I finish Storage vMotioning my critical virtual machines to a holdover iSCSI target. I will post an update on how it went. Keep in mind this is a full DSS V6 up4 to V6 up6 update, so there should be no surprises.

    Please feel free to contact me if you have any questions about this, as I have had to do it many times on a live production system, so I have gone through the pains. Keep in mind we also have an EMC Clariion with SSDs for a heavy-I/O AIX database, and the upgrades/HBA replacements for that are just as painful, so don't think everything should be point-and-click and you're done. If it were always that easy... most of us wouldn't have jobs :-) Planning, testing, planning, and testing can be your best friends for these procedures. Of course, the beauty of VMware is Storage vMotion: set up a simple DSS Lite iSCSI target and move your production servers off the main SAN while you're upgrading... without downtime!

    Anyway... I'll be posting the results.

    Take care,

    Jason

    jsinclair@vmind.com

  6. #6
    Join Date
    Feb 2009
    Posts
    142


    Quote Originally Posted by jsinclair
    This is a process I have tested over and over...with mixed results...

    Anyway... I'll be posting the results.

    Take care,

    Jason

    jsinclair@vmind.com
    Jason, have things improved for you since this initial old post? We are about to get our second Open-E license and are putting the hardware together so we can do autofailover.

    Thanks for sharing your experience!

  7. #7
    Join Date
    Feb 2009
    Posts
    142


    Also, I was curious whether the 2 DSS servers doing autofailover have to be EXACTLY the same? We are trying to use a 3ware 9690 (with 512 MB cache) we already have for our secondary DSS, and our primary DSS has an Areca 1680i (with 2 GB cache). It would be great if we could reuse some old hardware and save some bucks. I'm thinking if they are configured the same at the volume level they should be OK? Any thoughts otherwise?

    Thanks!

  8. #8


    The secondary system does not have to be identical (CPU, RAM, RAID controller...), but it should have performance comparable to the primary.
    All the best,

    Todd Maxwell


    Follow the red "E"
    Facebook | Twitter | YouTube
