Anyone have any idea on a timeline for when we might expect to see some compatibility with VMware vSphere 4?
I recently upgraded ESX 3.5 to ESX 4 to test its compatibility with DSS, and it seems they rewrote their iSCSI stack, because there is some funky stuff going on.
I get dropped connections to LUNs, many path failures, and reservation conflicts on the 4.0 server. When it occurs and the LUN mapping is dropped, the ESX 3.5 host loses its connection to it as well.
I can help the cause by providing Open-E with access to my production ESX 4 and DSS servers - just let me know.
Thanks for offering us the use of your systems; we may need to come back to that later. I did look at your logs and would like to verify a few things concerning your bond. It looks like eth2 and eth3 have different speed settings, which can cause issues. Are these NICs also the same in terms of chipset? As we noted in the release notes, we have seen issues when bonding NICs with different chipsets, so we recommend running a stress test; a speed mismatch can cause the same symptoms. Check the switch, or force the speed in the console from the Console Tools menu under Modify driver settings, then test again.
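If it helps, a quick way to confirm a speed mismatch from a Linux shell (the DSS console does not expose one, so treat this as a sketch of the idea; the interface names are assumptions to adjust for your bond):

    # Compare the negotiated link speed of each bond member.
    bond_members = ["eth2", "eth3"]

    speeds = {}
    for nic in bond_members:
        # The kernel reports the negotiated speed in Mbit/s here (link must be up).
        with open("/sys/class/net/%s/speed" % nic) as f:
            speeds[nic] = int(f.read().strip())

    print(speeds)
    if len(set(speeds.values())) > 1:
        print("WARNING: bond members negotiated different speeds; "
              "force a common speed on the switch or in Modify driver settings")

The same information is visible with ethtool if you prefer to check by hand.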
Also, you might want to change the iSCSI daemon settings per the steps below, as I see PDU issues in your dmesg logs (more information about PDUs can be found via Google). This happens for LUNs 1, 2, and 3:
iscsi_trgt: data_out_start(1037) unable to find scsi task 4f1b11f 8a93
iscsi_trgt: cmnd_skip_pdu(454) 4f1b11f 1e 0 4096
Try setting the following in the console:
Ctrl-Alt-W
Select Tuning Options
Select iSCSI Daemon options
Select Target Options
Select a target
Set MaxRecvDataSegmentLength and MaxXmitDataSegmentLength to 65536.
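For reference, the iscsi_trgt messages above come from iSCSI Enterprise Target, which DSS is built on. On a stock IET install the same settings would sit in the target stanza of /etc/ietd.conf, roughly as below; the IQN and LUN path here are placeholders, and on DSS you should only change these values through the console steps above:

    Target iqn.2001-04.com.example:dss.target0
        Lun 0 Path=/dev/vg00/lv00,Type=fileio
        MaxRecvDataSegmentLength 65536
        MaxXmitDataSegmentLength 65536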
Response from someone in the VMware community regarding my uploaded vmkernel logs:
__________________________________________________
There seems to be an issue with storage:
Jun 3 01:01:18 vmhost-1 vmkernel: 3:15:42:27.023 cpu4:4239)WARNING: iscsi_vmk: iscsivmk_TaskMgmtAbortCommands: vmhba33:CH:0 T:1 L:2 : Abort task response indicates task with itt=0x1107006 has been completed on the target but the task response has not arrived ...
Jun 3 01:01:18 vmhost-1 vmkernel: 3:15:42:27.272 cpu4:4239)WARNING: iscsi_vmk: iscsivmk_ConnSetupScsiResp: vmhba33:CH:0 T:1 CN:0: Task not found: itt 17854470
17854470 (dec) = 0x1107006 (hex)
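The cross-reference is easy to verify, e.g. in a Python shell:

    # itt from the second log line, decimal -> hex
    print(hex(17854470))   # '0x1107006', matching itt=0x1107006 in the abort warning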
1- There is an IO timeout (i.e. the storage is not responding to the IO in time), which causes the ESX iSCSI initiator to send an abort for that IO.
2- It appears that the storage responds to the abort with "task does not exist", but later the storage sends a response for that IO task anyway. That is a violation of the iSCSI protocol, and the ESX initiator drops the connection. This seems to keep happening very often.
The ESX 3.5 software iSCSI initiator would just ignore that case, but the ESX 4 initiator is very strict about protocol violations.
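To make "strict" concrete, here is a toy sketch (illustrative only, not VMware's actual code) of how the two initiators treat a late response for a task the target already claimed did not exist:

    # itts we aborted and the target answered "task does not exist" for
    aborted_and_denied = {0x1107006}

    def on_task_response(itt, strict=True):
        if itt in aborted_and_denied:
            if strict:
                # ESX 4-style: a stale response after "task does not exist"
                # is a protocol violation -> drop the iSCSI connection.
                raise ConnectionError("protocol violation: stale response for itt 0x%x" % itt)
            # ESX 3.5-style: silently discard the stale response.
            return
        # ...normal completion path...

    on_task_response(0x1107006, strict=False)   # ignored, session carries on
    try:
        on_task_response(0x1107006, strict=True)
    except ConnectionError as e:
        print(e)   # connection drop, as seen in the vmkernel log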
It appears you are using Open-E DSS; I do not think it is certified with ESX 4 yet. Could you post the version of DSS you are using?
Thanks for looking into this - I have forwarded it to our engineers to verify. We are watching this thread. Also, send me new logs once you have made the changes from the tuning options.
We too are facing a similar scenario; however, we have not upgraded our ESX environment to vSphere 4.0, as I noticed DSS was not on the HCL for vSphere 4.0 storage. I really want to be able to upgrade to vSphere 4.0, so I am hoping they can resolve your issue and get recertified. I was not too optimistic about how quickly that would happen, though: support told me they may look at testing and recertifying sometime later this year, but as of yet no testing had been done with vSphere 4.0. I am going to continue to monitor your thread for progress on resolving this issue. Hopefully, since you are already on vSphere 4 and need DSS to be compatible, it doesn't take until the end of the year.
I am also curious what version of DSS you are running?
Were you experiencing any iSCSI timeout errors when using ESX 3.5?
What RAID level are you running on your DSS SAN?
We fought iSCSI timeout issues with DSS for many months. I tried configuring the DSS SAN with all of the recommended performance settings specified by Open-E support. Things got a little better, but when the SAN was under moderate to heavy load we would still get CMD Abort and Task Not Found errors. What turned out to be our ultimate solution was rebuilding our RAID set as RAID 10. Previously I was using RAID 5 and had my VMs split across two DSS SANs, so each carried 50% of the ESX load. Now, with the SANs at RAID 10, I am able to run ALL of my VMs on one SAN with better performance and NO iSCSI timeouts or errors under any load.
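The standard write-penalty arithmetic explains most of that gain. A back-of-the-envelope sketch (the disk count and per-disk IOPS are illustrative assumptions, not measurements of our hardware):

    disks = 8
    iops_per_disk = 150   # rough figure for a single spindle

    write_penalty = {
        "RAID 5": 4,    # read data + read parity + write data + write parity
        "RAID 10": 2,   # write to both halves of the mirror
    }

    for level, penalty in write_penalty.items():
        print(level, disks * iops_per_disk // penalty, "theoretical random-write IOPS")
    # RAID 5 -> 300, RAID 10 -> 600: roughly twice the write headroom,
    # consistent with the timeouts disappearing after the rebuild.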
I'm running DSS build 3513. I have two servers running DSS; each server has three RAID sets, and I map their LUNs to a single target. The RAID setups on each server range across RAID 1, RAID 5, and RAID 10.
I also run these targets over a 10GbE network, so maybe that's why I never ran into timeouts. Are you using a similar setup?
On ESX 3.5u3 I was not getting this same issue; however, I was experiencing finicky connection drops due to the NetXen 10GbE NICs in my systems.
This weekend I will be installing Intel dual-port CX4 10GbE NICs in my machines to alleviate that problem.