I just got the message that the issue was caused by an outdated MegaRAID driver, revision 6.12, in the latest version of Open-E V6, which does not support CacheCade.
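For anyone who wants to check this on their own box: the version of the loaded megaraid_sas module can be read with standard Linux commands (a sketch, assuming you have console access to the appliance):

    # Show the version of the currently loaded MegaRAID SAS driver module
    modinfo megaraid_sas | grep -i '^version'
    # Or read it from sysfs, if the module exports it there
    cat /sys/module/megaraid_sas/version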
I'm sure that if there is a new driver we can provide it in a small update (for Premium Support only), but are there release notes from LSI stating that this can happen if you don't have the latest driver?
My intention was only to benefit from the performance improvement of CacheCade, "just to use it".
The hardware vendor wrote to me that the data corruption was caused by an outdated driver in Open-E (confirmed by Open-E). Driver 6.12 was released by LSI in mid-2011, and LSI CacheCade in March 2012; therefore CacheCade is not supported by that driver version. LSI told him that if disabling CacheCade stopped the data corruption, this is an indication of that kind of problem. Both Open-E and LSI have confirmed the problem.
I just want to know whether you can confirm this too (from the Open-E point of view), and when an LSI CacheCade-compatible version will be released by Open-E.
Currently I'm stuck on V6 because I have enabled the NFS HA feature. Is this available in V7 in the meantime (so that I can upgrade)?
Sorry, I've had more than enough trouble with this stuff.
We have thousands of V6 and V7 systems in production, and we would have updated the LSI MegaRAID SAS drivers based on the LSI release notes if this were true.
I have checked our support DB and have not found any records of this issue in any support tickets. What was the ticket in which Open-E stated the same as LSI?
We don't make the driver, and CacheCade is an add-on feature for certain LSI controllers. Normally the firmware controls CacheCade, along with some parts of the driver, to run the algorithms it uses for hot data.
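If it helps, the firmware package and driver actually in use can be read off the controller with LSI's own utilities; a hedged sketch, assuming MegaCli or StorCLI is installed on the system:

    # MegaCli: dump adapter info and pick out the firmware and driver lines
    MegaCli -AdpAllInfo -aALL | grep -iE 'FW Package|Driver'
    # StorCLI equivalent for controller 0 (shows the FW Package Build)
    storcli /c0 show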
I would also like to know the LSI case in which they stated this; you can submit a support ticket to Open-E with the contact information.
We have several webinars about LSI CacheCade; I did one with an LSI engineer on March 28th, 2012, and nothing was mentioned about an outdated driver that could cause corruption.
Again, we don't make the drivers, the vendors do, so DSS would not be involved here. If there was corruption, it could be that the SSD for the cache was configured as RAID 0 rather than RAID 1: if you lose an SSD in a RAID 0, then yes, you can have corruption for the writes that were in flight when the SSD went bad, but not for the reads.
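To illustrate the point about the cache SSDs: creating the CacheCade virtual drive as RAID 1 means one failing SSD should not take dirty write data with it. A rough sketch with StorCLI (the enclosure:slot IDs 252:4-5 are placeholders for your two SSDs; check the syntax against your StorCLI version):

    # Create a mirrored (RAID 1) write-back CacheCade VD from two SSDs
    storcli /c0 add vd cachecade type=raid1 drives=252:4-5 wb
    # Verify the new CacheCade VD and its RAID level
    storcli /c0/vall show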
Yes, I saw the email and what LSI stated: the release was in March of last year, so we did have the driver then, as I did the video with LSI on March 28th of last year. Again, we get the drivers from LSI and provide the small updates for them. It looks like you already got the small updates from Janusz.
I've also had massive problems with data corruption on a production system. My setup is two DSS V7 boxes as an iSCSI SAN in an active/active failover/replication cluster. I think I've now narrowed the cause down to LSI firmware v23.16.0-0018 having a problem with the CacheCade write cache. With the cache turned off, or read-only, there is no issue. As soon as I set up a write cache, my VMs start catastrophically corrupting, particularly after write-intensive operations, e.g. Windows Updates. I lost about 20 production servers to this: a complete disaster.
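For anyone in the same spot, the workaround that held for me was keeping the cache read-only until this is resolved; roughly like this with StorCLI (vX/vY are placeholders, and if your build won't change the policy on an existing CacheCade VD you may have to delete and recreate it in write-through mode):

    # Set the CacheCade VD to write-through, i.e. read caching only
    storcli /c0/vX set wrcache=wt
    # Or detach SSD caching from a data volume entirely
    storcli /c0/vY set ssdcaching=off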
Todd, I opened support ticket #1035676 on 11th July about this issue, so I'm surprised you say you aren't aware of it. I had CacheCade up and running with no issues prior to this firmware upgrade, so it's true that it's generally very stable, but I'd strongly advise anyone running a CacheCade write cache not to upgrade to the latest firmware.
I'd agree with Todd that it's unlikely to be a driver issue, as all the caching is handled by the RAID card and is invisible to the host system. However, it seems odd that if you Google "23.16.0-0018 corruption", this thread is one of only two hits you get, so I do wonder if there is something about the combination of this firmware and DSS that causes an issue. One would think that if the problem were more general, LSI would be aware of it and would have pulled the firmware by now.
I'm still testing, but the problem does seem to go away if I revert to an older LSI firmware.
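For the record, this is roughly how the older package goes back on (StorCLI; mr_fw_old.rom is a placeholder for whichever older image you use, and noverchk is needed to allow a downgrade):

    # Flash an older firmware image onto controller 0, skipping the version check
    storcli /c0 download file=mr_fw_old.rom noverchk
    # A reboot is typically needed before the downgraded package takes effect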
This case of yours is being investigated by the QA team, and we did contact LSI here in the USA; they do not know of any issues, and they also wanted us or you to reproduce this and let them know. We have a huge customer base using these controllers, and I am not defending them, just saying that until we can prove it is the LSI firmware and reproduce the issue, we cannot inform them. Also, did you submit a support ticket with LSI about this? Again, we get the drivers from them, and until this is reproducible I can't say for sure whether it is the driver or the firmware.