Another couple orders of magnitude performance: RDMA with Infiniband or 10Gbe
Okay. So, is there a plan to implement SCSI over RDMA? This, coupled with solid-state disks (either SATA/SAS-based with RAM, or the new Intel SSDs, or even PCIe-based), or even just a server loaded up with RAM (even more soon, with MetaRAM), might bring the IOPS up above 100,000. That would be completely insane. Right now, with a couple of GbE ports teamed via MPIO (I know, not the best), 4GB of system RAM, and a 12-disk RAID 5 array (with 2GB in the RAID controller), we get about 2400 IOPS with the SQLIO benchmark.
I've read that Infiniband (with RDMA) can, in some cases, provide 3x the IOPS of Fibre Channel. Has anyone used IP over Infiniband (IPoIB)? What sort of IOPS are you getting?
Is Open-E looking at doing any sort of RDMA/iSER sort of work? (Or even FCoE?)
I got to thinking about this because if we're going to SSDs in the next couple years, it'd be nice to have a protocol that can take advantage of the low latencies possible. And x4 Infiniband PCIe cards are about the same price (and bandwidth) as 10Gbe cards are (you can find them pretty easily for under $1000, way less than that refurbished, obviously).
Well, whether you use Infiniband or 10Gbe, RDMA (vs. TCP/IP) should grab you a bunch more IOPS just by getting rid of a lot of protocol that's unnecessary on a little SAN (esp. for point-to-point, no switch) where you aren't likely to drop any data.
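To put rough numbers on why latency matters here: by Little's law, the IOPS you can sustain is bounded by the number of outstanding I/Os divided by the round-trip latency of each I/O. Here's a back-of-the-envelope sketch in Python. The function name and the latency figures are made up for illustration; they are not measurements of any particular fabric or protocol.

```python
def max_iops(latency_us: float, queue_depth: int = 1) -> float:
    """Upper bound on IOPS from Little's law.

    throughput = outstanding I/Os / per-I/O round-trip latency
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return queue_depth * 1_000_000 / latency_us


if __name__ == "__main__":
    # Hypothetical per-I/O latencies in microseconds (illustrative only).
    scenarios = [("higher-latency path", 400.0), ("lower-latency path", 40.0)]
    for label, latency_us in scenarios:
        for queue_depth in (1, 8):
            bound = max_iops(latency_us, queue_depth)
            print(f"{label}: {latency_us:.0f} us, QD={queue_depth} -> {bound:,.0f} IOPS max")
```

The point is just that cutting per-I/O latency by 10x raises the ceiling by 10x at the same queue depth, which is why trimming protocol overhead matters more once the disks themselves stop being the slow part.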
Also, what are the highest IOPS that you (i.e. Open-E and any of your customers out there) are getting, using Fibre Channel or iSCSI?
Also, one feature that'd be AWESOME would be support for making a RAM disk out of a chunk of system memory.
I mean, Sun and HP both have (eight-way quad Opteron) boxes with up to 64 DDR2 slots. MetaRAM has 8GB DDR2 RAM sticks, and they're coming out with 16GB ones (which they claim are compatible with such Opteron chipsets), so that is potentially 1TB of system RAM!!!
This should enable ridiculous (ludicrous?) SAN/NAS performance. Some customers would like to be able to carve out even a dozen GB of RAM to use as a scratch disk or for some really crazy database application.
This RAM disk would help us to tweak system performance maximally and to help us prepare for SSDs.
And, it'd enable us to "legitimately" claim equally ridiculous performance specs in our marketing.
Eventually, it'd be cool to have the RAM disks replicated via iSCSI failover or something like that to make it partially sane to use even in production environments.
I just wanted to add my voice to all who are asking for RDMA with Infiniband support.
It would be greatly appreciated if updates on the ETA for this could be posted as they become available.
Generally, is there a roadmap or planned-features list available?
We are working on this for our partners, but publicly we might not provide this for competitive reasons, as even some of our competitors don't do this. We have been thinking of providing this information, but we need more time to think about the implications associated with this type of announcement.
I've been thinking about this (and experimenting on Debian Linux and our little network), and the perfect use for this sort of thing would be storing the swap partition (or file) for systems that already boot from the SAN. I mean, if the main server goes down, you're going to shut down the clients that booted from that server anyway, and you don't have to worry about keeping the data in the swap file, since it's just working as an extension of the system's RAM. Also, you could relatively easily enable auto-failover for this situation, since the ramdisk is just a block device. Granted, you'd have to recreate the ramdisk every time you rebooted, but you could just make that part of the startup procedure. Plus, in Linux, it's easy to make multiple swap partitions!
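For what it's worth, here's a rough sketch of what the client-side startup step might look like, written as a small Python helper that only builds the commands (it doesn't run anything). It relies on the standard `mkswap` and `swapon -p <priority>` tools; the device paths, the priorities, and the helper name are all placeholders I made up, so adjust them for your own setup.

```python
import shlex


def swap_setup_commands(devices):
    """Build the commands that enable each swap device at boot.

    devices: list of (device_path, priority, needs_format) tuples.
    A remote ramdisk loses its contents when the server reboots, so it
    needs mkswap again; an existing local swap partition does not.
    The kernel uses higher-priority swap areas first.
    """
    commands = []
    for device, priority, needs_format in devices:
        if needs_format:
            commands.append(["mkswap", device])
        commands.append(["swapon", "-p", str(priority), device])
    return commands


if __name__ == "__main__":
    # Hypothetical device names: a remote ramdisk exported as a block
    # device (preferred) and a local disk partition as the fallback.
    plan = swap_setup_commands([
        ("/dev/remote_ramdisk", 10, True),
        ("/dev/local_swap", 1, False),
    ])
    for argv in plan:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Giving the remote ramdisk the higher priority means the kernel fills it first and only spills to the slower local swap when it's full.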
Here's an interesting paper on using an RDMA-connected (infiniband) network block device to store the swap partition on a ramdisk on a remote system:
Apparently, their performance was so good that certain tasks they tested (like quicksort) took only 40% longer to execute than with local memory, while swapping to a file on disk took 20 times longer. Here's another presentation they did:
I might try setting this up myself in debian (may possibly try to get this working with drbd failover, just to see if it works...). I'll let you guys at open-e know what it's like.
Also, going through the TCP/IP software stack instead of RDMA means that you use 3x the memory bandwidth (I think), which is fine at 100 MB/s, but you start running into problems at 1-2 GB/s.
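Quick arithmetic on that, as a Python sketch. The model is simply "memory traffic = wire rate x number of times each byte crosses the memory bus". The pass counts (3 for the TCP/IP path, 1 for RDMA) are just my guess from above, not measured values, and the function name is made up.

```python
def memory_traffic_mb_s(wire_rate_mb_s: float, passes_over_memory_bus: int) -> float:
    """Memory-bus traffic generated by moving data at a given wire rate."""
    return wire_rate_mb_s * passes_over_memory_bus


if __name__ == "__main__":
    TCP_PASSES = 3   # assumed: extra buffer copies in the software stack
    RDMA_PASSES = 1  # assumed: data placed directly into the target buffer

    for wire_rate in (100, 1000, 2000):  # MB/s
        tcp = memory_traffic_mb_s(wire_rate, TCP_PASSES)
        rdma = memory_traffic_mb_s(wire_rate, RDMA_PASSES)
        print(f"{wire_rate} MB/s on the wire: TCP/IP ~{tcp:.0f} MB/s, "
              f"RDMA ~{rdma:.0f} MB/s of memory traffic")
```

Under those assumptions, 2 GB/s on the wire turns into roughly 6 GB/s of memory traffic on the TCP/IP path, which is where it starts to hurt.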
Have you guys heard of "Managed Flash Technology"? It's software (originally developed for Linux) that sits between a hardware flash device (or array of SSDs) and any filesystem. Basically, it turns random writes into sequential writes. It costs money, so Open-E would have to license it, but it could allow Open-E to compete with the big boys when it comes to random-write IOPS using just regular SSDs.
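In case the "random writes into sequential writes" part sounds like magic, here's a toy Python model of the general idea (log-structured remapping). To be clear, this is not how Managed Flash Technology is actually implemented, which I don't know; the class and its fields are invented for illustration, and there's no garbage collection or persistence.

```python
class LogStructuredRemapper:
    """Toy model: every logical write goes to the next free physical block."""

    def __init__(self, num_physical_blocks: int):
        self.num_physical_blocks = num_physical_blocks
        self.next_free = 0   # head of the append-only log
        self.mapping = {}    # logical block -> current physical block
        self.storage = {}    # physical block -> data

    def write(self, logical_block: int, data: bytes) -> int:
        """Append data to the log and remap the logical block to it."""
        if self.next_free >= self.num_physical_blocks:
            raise RuntimeError("log full; a real system would reclaim stale blocks here")
        physical = self.next_free
        self.next_free += 1
        self.storage[physical] = data
        self.mapping[logical_block] = physical  # the old copy becomes stale
        return physical

    def read(self, logical_block: int) -> bytes:
        """Follow the mapping to the most recent copy of the block."""
        return self.storage[self.mapping[logical_block]]


if __name__ == "__main__":
    remapper = LogStructuredRemapper(num_physical_blocks=8)
    # Scattered logical addresses land on consecutive physical blocks.
    for logical in (907, 12, 455, 12):
        physical = remapper.write(logical, f"data for {logical}".encode())
        print(f"logical {logical} -> physical {physical}")
```

The filesystem still sees its usual block addresses, but the device only ever sees writes at the head of the log. The cost is the mapping table plus cleaning up stale blocks later.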
Thanks!! We are looking into some of these, but we found out there is a lot of development work involved, which is why you pay those NetApp prices. We are working on it, though. Then we will list it on our HCL, but give us some time; it will come!
The managed flash software seems very interesting. I think that SSDs are the next evolution for storage. From a physics perspective, I'm not sure how much more can be done to improve mechanical disks. But right now, SSDs are hardly affordable, and because of the limited lifetime of the more affordable MLC variety, I have questions about reliability and MTBF.
I think that RDMA would be awesome, but RDMA is going to be for IB and right now, there are some innovative things that can be done to improve iSCSI performance over native IP/Ethernet.
If you have access to this white paper, it talks about a different caching strategy to improve performance of iSCSI by 58 to 70ish percent. I'd like to see some innovation along these lines...