
Thread: Storage Saga So far

  1. #1

    Storage Saga So far

    Pre-phase: tried Openfiler on desktop machines used as servers; it worked pretty well.
    Test phase: tried Openfiler on two HP DL185 G5s with an HP Smart Array P410 with 512 MB BBWC. Got "false" hard-drive failures that required a reboot, along with SCSI abort-task errors.

    Phase 1:
    Switched to Open-E. It was pretty stable.
    Switched from file I/O to block I/O; drives started getting marked bad again.
    Updated to a newer Open-E version; drives still got marked bad, but they would come back good after a reboot.

    Resolved by updating the firmware on the HP Smart Array P410 and on the SATA drives. This is a known issue with certain firmware versions of the P410 in combination with SATA drives.

    Phase 2:
    About 7 months ago, vSphere locked up; called VMware support. The problem was traced to an Open-E LUN.
    Turned off all VMs and hosts (some from vSphere and some through SSH to the ESXi consoles). Rebooted, resynced the arrays, and the problem was resolved.
    - Failover would not work.
    - All IPs from Open-E got added as paths. Failover may not have finished because the non-virtual IPs had active I/O.

    Phase 3:
    About a month ago, had the same problem.
    * Took the dynamic iSCSI targets out of VMware (the sketch below this list is a way to confirm which portals the initiators actually end up using).
    * Changed the ping node list to only switches that should be up 100% of the time.
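
    To double-check that the hosts really are only using the virtual IPs, this is a rough script I can run from a Linux test initiator with open-iscsi (my own sketch, not an Open-E tool; the VIP list is just an example to swap for the real virtual IPs). It parses "iscsiadm -m session -P 1" and flags any session whose current portal is not a VIP.

    Code:
    #!/usr/bin/env python
    # Rough sketch: list the portal IP each open-iscsi session is using and
    # flag anything that is not one of the DSS virtual IPs.
    import re
    import subprocess

    # Example VIPs only -- substitute the virtual IPs of your failover pair.
    EXPECTED_VIPS = {"10.0.0.50", "10.0.0.51"}

    def current_portals():
        """Parse 'Current Portal: <ip>:<port>,<tpgt>' lines from iscsiadm."""
        out = subprocess.check_output(["iscsiadm", "-m", "session", "-P", "1"])
        return re.findall(r"Current Portal:\s*([\d.]+):\d+", out.decode())

    if __name__ == "__main__":
        for ip in current_portals():
            note = "OK" if ip in EXPECTED_VIPS else "NOT a VIP -- check targets"
            print("%s  %s" % (ip, note))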

    Phase 4:
    A few days ago, had the same problem and had to shut down all VMs.
    * Open-E support noticed I had NIC bonding with NICs using two different drivers.
    * I have changed the bonding so the NICs use the same driver/device.

    Question: what else should I look for to prevent the LUN lockup? Failover would not work in these instances.
    What other recovery actions should I take?
    (If this happens next time, I will probably just fail over physically by disconnecting all network cables from the primary.)

    Thanks

  2. #2
    Join Date: Oct 2010, Location: GA, Posts: 935


    Richard,
    Make sure you have aux connections set on the VIPs, check the HP array logs, and set up email alerts (a rough monitoring sketch follows).
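
    For the array side, a small cron job along these lines can watch the controller status and mail you -- a rough sketch only, assuming hpacucli is installed and a local mail relay is reachable; the addresses are placeholders.

    Code:
    #!/usr/bin/env python
    # Rough sketch: poll the Smart Array status with hpacucli and mail the
    # output if any status line is not "OK". Addresses are placeholders.
    import smtplib
    import subprocess
    from email.mime.text import MIMEText

    ALERT_FROM = "dss-monitor@example.com"   # placeholder
    ALERT_TO = "admin@example.com"           # placeholder

    def controller_status():
        # "ctrl all show status" prints a status line per controller/cache/battery.
        return subprocess.check_output(
            ["hpacucli", "ctrl", "all", "show", "status"]).decode()

    def send_alert(body):
        msg = MIMEText(body)
        msg["Subject"] = "Smart Array status warning"
        msg["From"] = ALERT_FROM
        msg["To"] = ALERT_TO
        smtplib.SMTP("localhost").sendmail(ALERT_FROM, [ALERT_TO], msg.as_string())

    if __name__ == "__main__":
        status = controller_status()
        bad = [l for l in status.splitlines()
               if "Status:" in l and "Status: OK" not in l]
        if bad:
            send_alert(status)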
    call me if you need to,
    Greg

  3. #3


    I've got the auxiliary connection set on the virtual IP interfaces, the management interfaces, and the replication interfaces.

    I noticed on the secondary node that I have eth0 and eth3 bonded together. I'm going to make sure they are on the same chipset. The secondary node has two dual-port NIC cards, but the numbering may not be what you'd expect: eth0 might be on port 1 of one card, eth2 and eth3 on the other, and eth4 back on the first card again. I'm actually pretty sure that's how it is, because the add-on card has both ports connected by crossover. The sketch below is how I plan to check.
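
    Here's a quick sketch (it just reads sysfs on a Linux box, nothing Open-E specific) to sort out the numbering: it prints each interface's driver and PCI address. Ports on the same dual-port card show the same PCI bus/device and differ only in the function digit, and bond members should show the same driver.

    Code:
    #!/usr/bin/env python
    # Rough sketch: print each physical NIC's driver and PCI address from sysfs.
    # Ports on the same dual-port card share the PCI bus/device number and
    # differ only in the function (.0 / .1).
    import os

    SYS_NET = "/sys/class/net"

    for iface in sorted(os.listdir(SYS_NET)):
        dev = os.path.join(SYS_NET, iface, "device")
        if not os.path.exists(dev):      # skip lo, bond0 and other virtual devices
            continue
        driver = os.path.basename(os.readlink(os.path.join(dev, "driver")))
        pci = os.path.basename(os.readlink(dev))   # e.g. 0000:05:00.0
        print("%-6s driver=%-10s pci=%s" % (iface, driver, pci))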

  4. #4


    The other weird things:

    1. Failover would not finish

    2. The primary node was showing more than one iSCSI connection per LUN per client (under the Maintenance connections view). Now it only shows one iSCSI connection per LUN, and failover works.

    Questions:
    Is there a way to stop and restart just the SCST service without failing over?
    Is there a "safe" way to shut down the primary node in case it does not want to complete a failover?

  5. #5


    I've noticed the VMware/LUN locking tends to happen when very disk-intensive operations are attempted.

    I believe the November problem happened when I was experimenting with an offsite backup job. The backup job succeeded, but very soon after it finished, the logs stop.

    The outage in mid-December happened when I was trying to move an entire VM through the vSphere Client.

  6. #6


    I had also deleted some snapshots recently.

  7. #7
    Join Date: Oct 2010, Location: GA, Posts: 935


    Deactivated from DSS V6?

    I suspect load on the controller is causing the locks.
    As things go forward, time will tell.

  8. #8


    I suspect HP's firmware. There are some updates to be applied.
    It looks like their Smart Array controller does not play nicely with SATA disks.

    The confusing part is that they have two different revision numbers for the firmware update. It could be based on the hardware revision of the controller. I'll try searching by serial number and see if it is more specific; if not, I'll do a chat session with HP support.

    I'll also see if I can budget for all SAS disks and rebuild the array / reinstall Open-E.
    FYI, the snapshots I deleted were VM snapshots, not Open-E snapshots.

  9. #9


    HP has an advisory for the SATA disks and Smart Array controller I was using.

    http://h20565.www2.hp.com/portal/sit...4892.492883150

  10. #10
    Join Date: Oct 2010, Location: GA, Posts: 935


    Good find!
    It seems either a firmware update on the controller or firmware updates on the drives themselves can resolve the HP issue.
