Corrupted data caused by RAID 6 rebuild
Hello,
Two weeks ago we had a totally corrupted DSS6 failover system.
Our Hardware:
2 identical storage servers with Open-E DSS6
1x Intel X5550
24 GB RAM
2x 1 GbE Intel NICs on board
1x dual-port 10 GbE CX4 Intel NIC
16x 1 TB Western Digital RE3 HDDs
1x ICP5165BR RAID controller (Adaptec), latest BIOS build!
RAID config: RAID 6
We use 2 Citrix XenServer hosts to access the DSS6 over 10 GbE CX4.
The XenServer hosts run 15 Windows Server 2003 VMs.
The DSS6 storage servers are in failover mode with multiple targets.
So much for the setup...
One day, an HDD failed in our primary DSS6.
So we replaced the HDD with a new one of the same type.
The swap was done by hot-plugging the disk in the running system.
Everything was fine; the RAID controller started the RAID 6 rebuild.
Up to this point our data was not(!) corrupted.
Suddenly we noticed that some VMs on XenServer were running into bluescreens.
At that moment the rebuild status showed about 30% and the Adaptec reported no error!
DSS showed no error either! Everything looked fine...
And from then on, with every additional 1% of the rebuild, more VMs developed severe problems.
At that point we realized that the rebuild was destroying our data.
We ran a manual failover and started restoring the backups to the DSS slave.
After restoring the backups we did not fail back, so that we could investigate the "error" on the primary system.
Three days later an HDD failed in the second DSS6 and the Adaptec started an automatic rebuild (during the night, of course...). By the time we noticed, the rebuild status was already at about 30%.
Again many VMs crashed and our data was corrupted once more. :mad:
So we called Adaptec and sent them log files.
We called Open-E and sent them log files, too.
There was no error in the log files!
Adaptec has no idea, and I don't think they will find anything...
My opinion is that there is a bug in the latest firmware build for the ICP5165BR.
The data became corrupted only during the rebuild, and the rebuild is done entirely by the RAID controller...
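Just to illustrate why a rebuild bug hits the data directly: with a single failed disk, the controller recreates every missing block from the surviving blocks of the stripe (in the simplest case the XOR of the remaining data blocks and the P parity) and writes the result onto the replacement disk. A rough sketch of that reconstruction in Python, generic RAID math only, not anything Adaptec-specific:

def rebuild_missing_block(surviving_blocks):
    # Reconstruct one lost block as the XOR of all surviving blocks in
    # the stripe (remaining data blocks plus the P parity block).
    # If the controller gets this step (or the Q-based recovery) wrong,
    # it writes garbage straight onto the new disk.
    missing = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for i, byte in enumerate(block):
            missing[i] ^= byte
    return bytes(missing)

If the firmware computes this wrongly for even a fraction of the stripes, the rebuilt disk silently contains garbage, which would match exactly what we saw.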
Does anyone else have problems like this?
And for those who have the same RAID controller: be aware!
greetings
rogerk
Rogerk, I can confirm your findings
I can reproduce it the following way:
1. Make a RAID 6
2. Put some data on it
3. Then run verify_fix
Do this until it reports "different sectors found" in RaidEvtA.log.
At that point the filesystem is damaged and some data is gone, regardless of which filesystem is used (XFS or ext3/ext4).
We have had this every couple of weeks now with the ICP5165BR.
There was no indication in the logs, but it was strange that verify_fix found hundreds of thousands of different sectors while all disks reported fine, i.e. not a single media error etc.
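For anyone who wants to watch how fast those mismatches grow: the count can be pulled out of an exported copy of the event log with a few lines of Python. The exact message wording is only an assumption based on the "different sectors found" entries mentioned above, so adjust the pattern to whatever your controller actually writes:

import re
import sys

PATTERN = re.compile(r"different sectors", re.IGNORECASE)

def count_mismatches(path):
    # Count verify_fix mismatch messages in an exported RaidEvtA.log copy.
    hits = 0
    with open(path, errors="replace") as log:
        for line in log:
            if PATTERN.search(line):
                hits += 1
    return hits

if __name__ == "__main__":
    print(count_mismatches(sys.argv[1]))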
Controllers found: 1
Controller #1
==============
Firmware : 5.2-0 (15753)
Staged Firmware : 5.2-0 (15753)
BIOS : 5.2-0 (15753)
Driver : 1.1-5 (2461)
Boot Flash : 5.2-0 (15753)
According to our vendor, this controller has an error in the parity algorithm. There is a non-public newer firmware at Adaptec (they are aware of the problem; the vendor got this info from Adaptec), but this newer firmware would not be compatible with our mainboard.
Thus, they sent us a new controller, an Adaptec 51645.
BTW, this cannot be a problem in the host OS, because verify_fix runs entirely on the controller (IMHO).
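For reference, this is roughly the math such a verify pass has to get right: for every stripe, P is the XOR of the data blocks and Q is a Reed-Solomon sum over GF(2^8), and verify compares the stored parity against the recomputed values. A minimal sketch of the textbook algorithm (what the firmware is supposed to do, not its actual code):

def gf_mul(a, b):
    # Multiply two bytes in GF(2^8) with the RAID 6 polynomial 0x11d.
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return result

def raid6_pq(data_blocks):
    # Compute the P and Q parity blocks for one stripe.
    # data_blocks is a list of equal-length byte strings, one per data disk.
    p = bytearray(len(data_blocks[0]))
    q = bytearray(len(data_blocks[0]))
    for i, block in enumerate(data_blocks):
        coeff = 1
        for _ in range(i):          # coeff = g^i with generator g = 2
            coeff = gf_mul(coeff, 2)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def stripe_is_consistent(data_blocks, stored_p, stored_q):
    # What a verify pass conceptually does: recompute and compare.
    p, q = raid6_pq(data_blocks)
    return p == stored_p and q == stored_q

If the generator tables or the GF multiplication in the firmware are off, verify will report huge numbers of "different sectors" even though every disk returns exactly what was written, which fits what we are seeing.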
I was wondering why more people don't have this problem. We are running Debian Lenny, though, because we wanted console access to the NAS.
That kernel was 2.6.26; now I am using 2.6.32 from lenny-backports, and the problem still exists.