We have a DSSv6 with 16 x 1TB SATA disks set up in RAID 6, and a single LUN connected to a three-node Server 2008 R2 cluster.
Our SAN has four 1Gbps network cards: one for management and one connected to each host via a crossover cable.
The SAN is used to store the VHD files for our Hyper-V virtual servers; we have nine virtual servers on the SAN (three on each host).
Recently, after being up for about 40 days, SAN performance went from good to terrible. The statistics page showed that the load had increased from an average of 2-4 up to 4-8.
I downloaded the logs and opened up the tests file; hdparm was reporting speeds of around 10MB/s on the RAID.
As soon as I had a maintenance window I shut down all of the virtual machines and restarted the SAN, after which hdparm was reporting around 400MB/s on the RAID.
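For anyone who wants to reproduce the check outside the tests file: as far as I can tell the figures come from the standard hdparm timing switches, so something like this should give comparable numbers (a sketch only - /dev/sda is a guess at how the RAID volume is presented and may differ on your unit):

# -T times cached reads (memory/bus), -t times buffered sequential reads from the array
# run against an otherwise idle array if possible, as concurrent I/O skews the result
hdparm -tT /dev/sda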
I booted the virtual servers and performance was good again.
This lasted for seven days until the issue recurred. I restarted the SAN again last night and performance is now fine again.
I've raised a support ticket but haven't heard much back so far.
Anyone else had this or know what the problem might be?
Restarting the SAN once a week or even once a month really isn't a viable solution for me.
Thanks for your post! I've posted the tests you mentioned below; critical_errors was empty, and dmesg.2 has quite a lot of text in it - do you know what I should be looking for?
Our hardware is:
1 x Intel Xeon 5410 Quad Core 2.33GHz, 12MB Cache, 1333MHz FSB
4GB DDR2-667 ECC FB-DIMM
16 x Barracuda ES.2 SATA 3.0-Gb/s 1-TB
LSI MegaRAID SAS 8704EM2 controller
Performance has been poor again today... I can't understand what's going on here; we've made no config changes.
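(The first block below is the protocol counter section from the log bundle - I believe it's the same data you'd get live from the /proc/net/snmp counters, sketched here in case it's worth re-running to see whether the Tcp retransmit figure is still climbing:)

# dump protocol-level counters (Ip/Icmp/Tcp/Udp); compare two runs a few minutes apart
netstat -s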
error parsing /proc/net/snmp: Success
Ip:
294868332 total packets received
7 with invalid addresses
0 forwarded
0 incoming packets discarded
294807638 incoming packets delivered
169314537 requests sent out
Icmp:
20 ICMP messages received
1 input ICMP message failed.
ICMP input histogram:
destination unreachable: 20
25 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 25
IcmpMsg:
InType3: 20
OutType3: 25
Tcp:
64551 active connections openings
67024 passive connection openings
46 failed connection attempts
1492 connection resets received
40 connections established
292407920 segments received
166946801 segments send out
29576 segments retransmited
0 bad segments received.
1640 resets sent
Udp:
2395572 packets received
6 packets to unknown port received.
0 packet receive errors
2338133 packets sent
UdpLite:
*-----------------------------------------------------------------------------*
ifconfig -a
*-----------------------------------------------------------------------------*
P.S. I've not tried a memtest as the SAN is in use... although if it's unavoidable I could try it at the weekend - I'd prefer not to do this, though.
How have you been determining that performance is poor?
Do you have any network monitoring on the switch ports to see the level of network activity during periods of poor performance?
Have you investigated the VMs themselves? Are any of them used as a file server?
Are you running perfmon on the VMs during the poor performance to see if they may be contributing to it?
Nothing appears to be wrong with your setup. I will say this, though: don't let hdparm fool you. Just because hdparm says 400MB per second does not mean your array can sustain that!
Until Open-E provides an iozone-type test, it's difficult to gauge the true performance of your disk subsystem from the local point of view. On new configurations, I test them under Linux before deploying Open-E on the hardware.
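For what it's worth, my pre-deployment test is along these lines (a sketch only - the mount point is made up, and the file size should be several times the RAM in the box so the cache can't hide the disks):

# sequential write then read with a file larger than RAM
# -s 16g file size, -r 1m record size, -i 0 write/rewrite test, -i 1 read/reread test
iozone -s 16g -r 1m -i 0 -i 1 -f /mnt/test/iozone.tmp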
Eth3 is connected to our local network and used to manage the SAN. I've unzipped dmesg.2 and posted the bottom section below.
Hi enealDC, all of our VMs are very slow at the moment; one is a file server, one is Exchange, one is MSSQL, etc. I should point out that they had all been running fine before this.
None of the virtual machines seems to be under any extra load, and the statistics page on the SAN shows that network activity over the links to the hosts is about the same during periods of poor performance.
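(If raw numbers are more useful than the statistics page, I can also grab the interface byte counters from the console - two snapshots a few seconds apart give a rough throughput figure; a sketch:)

# snapshot per-interface byte counters, wait, snapshot again and diff the totals by hand
cat /proc/net/dev; sleep 10; cat /proc/net/dev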
Thanks for the help on this guys,
Jon
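(For reference, the "bottom section" below is just the tail of the unpacked file - roughly this, give or take the line count:)

# last chunk of the unpacked dmesg file
tail -n 200 dmesg.2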
iscsi-scst: ***ERROR*** Connection with initiator iqn.1991-05.com.microsoft:hvh01.domain unexpectedly closed!
scst: Using security group "Default_iqn.2009-11:san01.target1" for initiator "iqn.1991-05.com.microsoft:hvh01.domain"
iscsi-scst: Negotiated parameters: InitialR2T Yes, ImmediateData No, MaxConnections 1, MaxRecvDataSegmentLength 65536, MaxXmitDataSegmentLength 65536,
iscsi-scst: MaxBurstLength 262144, FirstBurstLength 65536, DefaultTime2Wait 2, DefaultTime2Retain 20,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048
e1000e: eth1 NIC Link is Down
e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
e1000: eth2: e1000_watchdog_task: NIC Link is Down
iscsi-scst: ***ERROR*** Connection with initiator iqn.1991-05.com.microsoft:hvh02.domain unexpectedly closed!
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
scst: Using security group "Default_iqn.2009-11:san01.target1" for initiator "iqn.1991-05.com.microsoft:hvh02.domain"
iscsi-scst: Negotiated parameters: InitialR2T Yes, ImmediateData No, MaxConnections 1, MaxRecvDataSegmentLength 65536, MaxXmitDataSegmentLength 65536,
iscsi-scst: MaxBurstLength 262144, FirstBurstLength 65536, DefaultTime2Wait 2, DefaultTime2Retain 20,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048
iscsi-scst: ***ERROR*** Connection with initiator iqn.1991-05.com.microsoft:hvh03.domain unexpectedly closed!
scst: Using security group "Default_iqn.2009-11:san01.target1" for initiator "iqn.1991-05.com.microsoft:hvh03.domain"
iscsi-scst: Negotiated parameters: InitialR2T Yes, ImmediateData No, MaxConnections 1, MaxRecvDataSegmentLength 65536, MaxXmitDataSegmentLength 65536,
iscsi-scst: MaxBurstLength 262144, FirstBurstLength 65536, DefaultTime2Wait 2, DefaultTime2Retain 20,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048
e1000e: eth1 NIC Link is Down
e1000e: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
0000:04:00.1: eth1: 10/100 speed: disabling TSO
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
e1000e: eth0 NIC Link is Down
e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
0000:04:00.0: eth0: 10/100 speed: disabling TSO
e1000e: eth1 NIC Link is Down
e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000e: eth0 NIC Link is Down
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth2: e1000_watchdog_task: NIC Link is Down
e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000e: eth0 NIC Link is Down
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
iscsi-scst: ***ERROR*** Connection with initiator iqn.1991-05.com.microsoft:hvh01.domain unexpectedly closed!
scst: Using security group "Default_iqn.2009-11:san01.target1" for initiator "iqn.1991-05.com.microsoft:hvh01.domain"
iscsi-scst: Negotiated parameters: InitialR2T Yes, ImmediateData No, MaxConnections 1, MaxRecvDataSegmentLength 65536, MaxXmitDataSegmentLength 65536,
iscsi-scst: MaxBurstLength 262144, FirstBurstLength 65536, DefaultTime2Wait 2, DefaultTime2Retain 20,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048
e1000e: eth1 NIC Link is Down
e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
iscsi-scst: ***ERROR*** Connection with initiator iqn.1991-05.com.microsoft:hvh02.domain unexpectedly closed!
BTW, on the statistics page, what does "load" actually mean? CPU, memory and network statistics all seem the same as normal, but "load" has pretty much doubled.
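(My guess is that it's the standard Linux load average - runnable tasks plus tasks blocked in uninterruptible disk I/O - which would explain it doubling while CPU, memory and network all look normal, but I'd like to confirm. From the console that would be the same figure as:)

# 1-, 5- and 15-minute load averages, plus running/total task counts
cat /proc/loadavg
uptime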