Posted to users@cloudstack.apache.org by Trevor Francis <tr...@tgrahamcapital.com> on 2012/10/22 17:16:14 UTC

Primary Storage

We are looking at building a Primary Storage solution for an enterprise/carrier class application. However, we want to build it using a FOSS solution and not a commercial solution. Do you have a recommendation on platform?

We are really interested in putting a caching layer (SSD) in front of spinning disks to speed up the VM storage layer. I'd really like to stay with a Linux variant rather than FreeBSD, although ZFS looks really promising. Something distributed would be interesting too, with redundancy existing at the application level. Our VMs that will attach to the primary storage will be very write heavy and not really read heavy (think lots of logging info).

Linux guys, thoughts? Have you had good luck with anything recently?


Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
 
Solutions Provider for the Telecom Industry

 


Re: Primary Storage

Posted by David Nalley <da...@gnsa.us>.
On Mon, Oct 22, 2012 at 11:16 AM, Trevor Francis <
trevor.francis@tgrahamcapital.com> wrote:

> We are looking at building a Primary Storage solution for an
> enterprise/carrier class application. However, we want to build it using a
> FOSS solution and not a commercial solution. Do you have a recommendation
> on platform?
>
> We are really interested at putting in a caching layer (SSD) in front of
> spinning disks to speed up the VM storage layer. I'd really like to stay
> with a linux variant than doing FreeBSD, although ZFS looks really
> promising. Also, something distributed would be interesting too with
> redundancy existing at the application level. Our VMs that will attach to
> the Primary storage will be very write heavy and not really read heavy
> (think about lots of logging info).
>
> Linux guys, thoughts? Had good luck with anything recently.
>
>
>
Depends on the hypervisor.
Lots of folks do well with GFS2 and OCFS2, but they are showing their age
in many ways.

One of the new hot platforms is Ceph, for which Wido wrote RBD support in
CloudStack - so assuming you use KVM you can use RBD with Ceph and have a
nice distributed, replicated, and scalable storage platform that is pretty
performant.

--David

Re: Primary Storage

Posted by br...@v365.com.au.
  

Hi Bryan

what kind of speed are you getting with IPoIB via GFS to ZFS?
QDR or DDR? 2044 MTU or 64000?

What speeds do you get with ZFS,
what sort of ZFS volume do you have? how many discs?

Cheers
Bruce M

On 23.10.2012 14:53, Bryan Whitehead wrote:
>> yes we had the same prob on GFS 3.3 on Ubuntu 10.04 using SRP
>> we could not get SRP to work well, very inconsistent, always drops or
>> fails. Even IPoIB was faulty. we also tried 12.04, same deal.
>
> I currently run glusterfs on IPoIB using CentOS 6.3 using XFS as the
> native storage (as recommended by RedHat).
>
> It is very fault tolerant using the sharedmountpoint option across all
> my nodes in a zone. Migrations are instantaneous and I've not had any
> problems so far.
>
> To be honest my biggest worry is when I'll need to update from
> cloudstack 3.0.2 -> cloudstack 4.x once it drops.
>
> -Bryan

 

Re: Primary Storage

Posted by Bryan Whitehead <dr...@megahappy.net>.
> yes we had the same prob on GFS 3.3 on Ubuntu 10.04 using SRP
> we could not get SRP to work well, very inconstant, always drops or
> fails. Even IPOIB was faulty. we also tried 12.04, same deal.

I currently run glusterfs on IPoIB using CentOS 6.3 using XFS as the
native storage (as recommended by RedHat).

It is very fault tolerant using the sharedmountpoint option across all
my nodes in a zone. Migrations are instantaneous and I've not had any
problems so far.
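
For reference, a minimal sketch of that layout (the volume name, server and
mount point are made up here - adjust to your own setup):

  mount -t glusterfs gluster01:/primary-vol /gluster/primary   # same path on every KVM host

then add /gluster/primary in CloudStack as primary storage of type SharedMountPoint.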

To be honest my biggest worry is when I'll need to update from
cloudstack 3.0.2 -> cloudstack 4.x once it drops.

-Bryan

Re: Primary Storage

Posted by br...@v365.com.au.
  

Hi

yes we had the same prob on GFS 3.3 on Ubuntu 10.04 using SRP
we could not get SRP to work well, very inconsistent, always drops or
fails. Even IPoIB was faulty. we also tried 12.04, same deal.

We didn't get to the point of using ZFS with GFS, as we tried it on OpenIndiana
and it was very slow. only 2-3GB/s on infiniband.

We then tried ZFS on Ubuntu and it was slow there also. So gave up on ZFS until
it matures more and is part of the kernel, and we felt the same for GFS: needs
to mature more, and resolve infiniband RDMA issues.

We'd be interested to work on it long term though.

Cheers
Bruce M

On 23.10.2012 14:18, Bryan Whitehead wrote:
>> I have installed GlusterFS direct on Solaris with a modified code.
>> Want you build bigger systems for more then 50 VMs it is better you
>> split the Solaris and GlusterFS with a separate headnode for GlusterFS
>>
>> That looks like:
>> Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) -->
>> ( SRP Target attached direct without Infiniband switch to GF Node) -->
>> GlusterFS Node the srp target formatted with xfs filesystem, create a
>> GlusterFS Volume --> ( Infiniband over a Mellanox Port Switch) -->
>> Cloudstack Node mount glusterFS Volume over RDMA
>
> What version of glusterfs are you running? I've had difficulty getting
> native rdma working in the 3.2/3.3 series.
>
> -Bryan

 

Re: Primary Storage

Posted by Bryan Whitehead <dr...@megahappy.net>.
> I have installed GlusterFS direct on Solaris with a modified code.
> Want you build bigger systems for more then  50 VMs it is better you split the Solaris and GlusterFS with a separte headnode for GlusterFS
>
> That looks like:
> Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) -->  ( SRP Target  attached direct without Infiniband switch to GF Node)  --> GlusterFS Node the srp target formatted with xfs filesystem, create a GlusterFS Volume --> ( Infiniband over a Mellanox Port Switch)  --> Cloudstack Node mount glusterFS Volume over RDMA

What version of glusterfs are you running? I've had difficulty getting
native rdma working in the 3.2/3.3 series.

-Bryan

RE: Primary Storage

Posted by Fabrice Brazier <fa...@apalia.net>.
Hi Andreas,

Yes, ZeusRAM is a really good product.

I know napp-it, but the last time I used it the interface was just at its
beginning so it was missing a lot of things. I haven't checked it for about
2 years now, I think.

I also started with ZFS on OpenSolaris, version 99 if I remember well. And
until Nexenta 4, the core of Nexenta 3 is still based on OpenSolaris 131b.

About the 15k rpm drives I really don't agree. Today the read/write latency
is absorbed by the ZIL/L2ARC (for sync protocols), so spindle latency doesn't
matter as much anymore. What matters now is the IOPS you can handle on the
spindles. Today a good NL-SAS drive handles about 90 IOPS where a 15K drive
handles about 180 IOPS.
If you check the price difference I think it's better to build a good RAID 10
with more NL-SAS drives than a RAID 50/60 with 15K discs. Sure, the best is a
15K RAID 10, but it's quite expensive.

By the way, since you use RDMA, I want to know if the ZIL is necessary in that
case? I mean the ZIL only matters for sync writes (mostly NFS); for example
with iSCSI or FC it's useless.

Thanks for the exchange, it was really nice
Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de]
Envoyé : mardi 23 octobre 2012 14:56
À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi Fabrice,

I know OpenSolaris/Solaris + Oracle - it's quite a thing.
I have been an open source user for more than 10 years, and I did not like
this constellation with Oracle at the beginning.
But Oracle does its work well, I know that. The cost is 700$ per socket and
you can use as many TB as you want.
And you can use the full premier support from Oracle.
Nexenta develops on the Illumos code, and the licence is TB based.
That is not my favorite. The pool version from Nexenta also lags behind.
Currently Nexenta + Infiniband is not a usable solution.
But everyone can use what they want. Everyone must decide for themselves.

SRP targets or iSER are not difficult to configure. Use SRP for the
storage unit connection. Solaris and GlusterFS build one storage unit.
The GlusterFS server exports the final volume to the clients such as KVM,
VMware, Hyper-V etc.
You can use native GlusterFS, RDMA, NFS or CIFS to export the volume.
SRP has nothing to do with VMware.

When you use a 7200 SAS drive the access time is the same as a SATA drive,
only the quality of the hardware is better. When you need performance you
must use SAS drives with 15000 U/min. But that is not needed when you install
SSDs for ZIL/L2ARC.  ZeusRAM rocks :-)

I use dedup only on secondary storage or on the backup server, not on primary
storage.
When you use SSD + SATA drives then you have a cheap and fast storage.
A 1TB drive costs under 100$. Currently I do not need to save storage volume.

Which application uses atime? I know "find -atime N". atime on a storage that
stores only virtual disks?
I don't need the information of when I last opened the disk :-)

For a Solaris GUI take a look at napp-it http://www.napp-it.org/

greetings from germany
Andreas




----- Ursprüngliche Mail -----

Von: "Fabrice Brazier" <fa...@apalia.net>
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 12:30:50
Betreff: RE: Primary Storage

Hi Andreas,

Hmm, that's pretty cool. I know they still have trouble with Infiniband on
Nexenta, but it's clearly a priority on their roadmap (plus I trust the
Nexenta team more than Oracle to evolve ZFS).

I agree iSCSI over Infiniband increases the latency, but most of the time it's
just simpler to use IPoIB than native IB; for example if you use VMware, the
Infiniband support is provided by Mellanox and not VMware, so if you have an
issue VMware support probably won't help you anymore.

About the RAID, I'm a fan of RAID 10; I'd rather build a RAID 10 with 7200 SAS
drives than multiple raidz/raidz2 vdevs of 15k SAS drives, particularly for a
virtual environment with a random workload and multiple concurrent accesses.

I'm a fan of NFS so agreed about the ZIL, and ZeusRAM is the only real option
for that (with a classical SSD you can hit the write hole).

Agreed on compression too (but only lzjb; the gzip levels use too much CPU).
Disabling atime helps decrease the IOPS load, but I'm not sure it's really a
best practice. About deduplication I don't totally agree. It really depends
on your array, the workload and the VM types on your CloudStack.
Actually when I build a ZFS array I count 1GB RAM per 1TB of disk. With
deduplication I count 2GB RAM per 1TB of disk (plus in Nexenta 4 they will
add the possibility to use SSD for the deduplication metadata, like the L2ARC
for the read cache).

The last point is about your drives: why SATA drives? I mean SATA doesn't keep
consistent latency, and the reliability of a SAS drive is 10x that of SATA.
Plus you can now find many NL-SAS drives at low cost.

But it's really a nice architecture. I've never tried GlusterFS (for the
moment); in that case it's really a good way to have a replacement for a
metro-cluster for free. I've tried one of its competitors (OneFS) and clearly
clustered filesystems are the future.

Cheers,
Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
2012 11:37 À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi Fabrice,

I don't know what other people do, but I have no problems with Infiniband +
GlusterFS + CloudStack. I don't use Nexenta; it's based on Illumos and does
not work well with Infiniband.
I have two different clusters in production environments.

The first: Solaris 11 with built-in GlusterFS 3.3, exporting the Gluster
volume with RDMA. Performance is okay; you can use that for smaller
environments.
The second: a little more complex, with a GlusterFS server in the middle.

ZFS Server: based on Solaris 11
1.) Create a zpool with min. two vdevs and SSD read/write cache
2.) Create a thin provisioned volume ("zfs create -V"), disable atime and
enable compression (do not enable dedup!) and export it as an (iWARP) SRP
target to the GlusterFS server; use a direct connection without an IB switch.
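
A minimal sketch of steps 1 and 2, assuming placeholder device names (c1t0d0
etc.) and a made-up volume tank/gfsvol; the COMSTAR/SRP export details may
differ on your setup:

  zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
    log mirror c2t0d0 c2t1d0 cache c2t2d0       # two data vdevs + SSD write/read cache
  zfs set atime=off tank                        # atime is a filesystem-level property
  zfs create -s -V 2T -o compression=on tank/gfsvol   # sparse (thin) volume, dedup stays off
  sbdadm create-lu /dev/zvol/rdsk/tank/gfsvol   # COMSTAR LU; then stmfadm add-view <lu-guid>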

GlusterFS Server:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the SRP target from the ZFS server and format it as xfs
4.) Create a Gluster volume: "volume create xy transport rdma" (use only rdma)
5.) Connect the second IB port to an IB switch

Cloudstack Hypervisor Node:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the Gluster volume
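
A minimal sketch of steps 3-5 and the client mount, assuming a made-up brick
path /export/brick1, volume name gfsvol and server gfs01 (the SRP LUN device
path is a placeholder):

  # on the GlusterFS server
  mkfs.xfs /dev/disk/by-id/<srp-lun>
  mount /dev/disk/by-id/<srp-lun> /export/brick1
  gluster volume create gfsvol transport rdma gfs01:/export/brick1
  gluster volume start gfsvol

  # on the CloudStack/KVM node
  mount -t glusterfs gfs01:/gfsvol /mnt/primary   # the volume was created rdma-only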

ZFS Thin Volume ---- Infiniband SRP ----> GlusterFS ---- GFS Vol rdma ----> IB
Switch ----> Clients

The ZFS and GlusterFS servers form one storage unit, connected directly with
40Gbit Infiniband point-to-point. You do not feel that there is a cable in
between!

Important: when you have Infiniband, do not use IPoIB with iSCSI! If one
already has Infiniband then you should also use its advantage:
IPoIB has a higher latency than iWARP/SRP!


-- SRP (latency in usec) --
local address:  LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
remote address: LID 0x0a QPN 0x10004a PSN 0x44072e  RKey 0x1c0f115  VAddr 0x000000088e6000
------------------------------------------------------------------
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000         1.29         125.30       1.31
------------------------------------------------------------------

-- IPoIB ---
[root@sv01sfogaa ~]# ping 10.200.0.10
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms

When you put load on IPoIB the latency increases, and that is not good.


That is my recommendation for a simple GlusterFS mirror:

- Supermicro server with Intel hardware and expander backplane
- 1x Crucial M4 SSD read cache
- 2x ZeusIOPS SSD write cache (mirror)
- SATA 24/7 harddrives
- LSI HBA 9207 or 9211
- ConnectX-2 QDR dualport Infiniband adapter (HP refurbished with full
warranty for 100$) Important > flash the newest firmware from Mellanox!
- Mellanox IB switch
- Solaris 11
- GlusterFS 3.3 compiled with ib_verbs
- Gluster volume transport only rdma




Throughput is constant up to 200 MByte/s; more throughput with more storage
servers or more harddrives on JBOD.


Info:

- I have some problems with Infiniband RDMA or SRP on OpenIndiana, Illumos or
Nexenta. Some adapters have a high latency or no stable connection. Use
Solaris, that's the right way!
- OpenIndiana is beta! Infiniband ib_verbs do not work, or do not work well!
- Use Solaris 11; Infiniband ib_verbs are native and stable
- Don't use Ubuntu client/server for Infiniband! Use RedHat, Fedora or
CentOS and install the right drivers from
https://www.openfabrics.org/downloads/OFED/
- You have no SSD cache? Disable sync on the ZFS volume! Important: you lose
security for your data, but some protocols use sync flags in transport. For
example NFS uses fsync by default: the write cache is not active and NFS
writes data directly to the harddrive. For data security and performance,
give the storage server an SSD write cache. ZFS works by default with
sync=standard, which prevents write holes (COW system).
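
A minimal sketch of that last point, assuming the made-up volume tank/gfsvol
from above:

  zfs get sync tank/gfsvol           # default is sync=standard
  zfs set sync=disabled tank/gfsvol  # only without an SSD slog, and only if you accept the risk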

I hope that I could help a little

Greeting from Germany
Andreas



----- Ursprüngliche Mail -----

Von: "Fabrice Brazier" <fa...@apalia.net>
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 09:55:15
Betreff: RE: Primary Storage

Hi Andreas,

I just saw your configuration; it seems quite interesting.
If I understand well, you want to build some ZFS arrays on the backend and
export LUNs (probably iSCSI over Infiniband) to your Linux cluster, and on
the Linux cluster you put GlusterFS.
I can understand the point: with that you can have very good performance and
reliability (ZFS), scalability and redundancy (Gluster) for a very low cost.
So just one question: did you try the global namespace implementation from
Nexenta?
If yes, can you tell me which configuration works best for you?
I mean, the fact that you have a Gluster cluster in the middle must impact the
overall performance, no?

Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
2012 05:40 À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi,

for CloudStack I use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That
gives the best performance and the most scalable storage.
I have tested some different solutions for primary storage, but most are too
expensive and not economic for a CloudStack cluster, or have poor
performance.

My configuration:
Storage node:
Supermicro server (Intel hardware) with Solaris 11, with SSD write and read
cache (read: Crucial M4, write: ZeusIOPS), GlusterFS and a dualport ConnectX
40Gbit/s Infiniband adapter.

I have installed GlusterFS directly on Solaris with modified code.
If you want to build bigger systems for more than 50 VMs it is better to split
Solaris and GlusterFS, with a separate headnode for GlusterFS.

That looks like:
Solaris ZFS backend storage with a dataset volume (thin provisioned) --> (SRP
target attached directly, without an Infiniband switch, to the GF node) -->
GlusterFS node, where the SRP target is formatted with an xfs filesystem and a
GlusterFS volume is created --> (Infiniband over a Mellanox port switch) -->
CloudStack node mounts the GlusterFS volume over RDMA

For the dataset volume on the ZFS storage, disable atime and enable
compression.
(Space reclaim) With compression you can shrink the ZFS volume with
dd if=/dev/zero on Linux, or with sdelete in a Windows VM. That gives you
space back on the primary storage for files deleted inside a VM, or for vhd's
or VMs deleted in CloudStack.
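
A minimal sketch, assuming a made-up dataset tank/primary; the zero-fill file
name is only an example:

  zfs set compression=on tank/primary
  zfs set atime=off tank/primary
  # inside a Linux VM: overwrite free space with zeros so the compressed volume shrinks
  dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile
  # inside a Windows VM: sdelete -z c: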

greeting Andreas




Mit freundlichen Grüßen

Andreas Huser
Geschäftsführer
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel: +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de




----- Ursprüngliche Mail -----

Von: "Outback Dingo" <ou...@gmail.com>
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 02:15:16
Betreff: Re: Primary Storage

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote:

> Solaris 11 ZFS and yes we tried different setups, raids levels number
> of SSD cache, ARC zfs options etc etc etc.
>
> Cheers

VMWare ??

On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo <ou...@gmail.com> wrote:

On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
> We are using ZFS, with jbod, not in production yet exporting NFS to
> cloudstack, I'm not really happy about the performance
> but I think is related to the hardware itself rather than technology,
> we
> are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster
> as
> well, but we decided to move away from that path since gluster nfs is
still
> performing poorly, plus we would like to see cloudstack integrating the
> gluster-fuse module, we haven't decided the final storage setup but at
the
> moment we had better results with ZFS.
>
>

question is whose ZFS, and have you "tweaked" the zfs / nfs config for
performance?

>
> On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com
>wrote:
>
>> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>>
>>> ZFS looks really interesting to me and I am leaning that way. I am
>>> considering using FreeNAS, as people seem to be having good luck with
>>> it. Can anyone weigh in here?
>>>
>>>
>> My personal opinion, I think FreeNAS and OpenFiler have horrible,
horrible
>> User Interfaces - not very intuitive, and they both seem to be file
servers
>> with things like iSCSI targets tacked on as an afterthought.
>>
>> Nik
>>
>>
>>> Trevor Francis
>>> Partner
>>> 46 Labs | The PeerEdge Cloud
>>> http://www.46labs.com <http://www.46labs.com/> |
http://www.peeredge.net
>>> <http://www.peeredge.net/>
>>> 405-362-0046 - Voice | 405-410-4980 - Cell
>>> trevorgfrancis - Skype
>>> trevor@46labs.com <ma...@46labs.com>
>>> Solutions Provider for the Telecom Industry
>>>
>>> <http://www.twitter.com/peeredge> <http://www.facebook.com/PeerEdge>
>>>
>>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>>>
>>> ZFS would be an interesting setup as you can do the cache pools like
you
>>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is
>>>> that
>>>> they
>>>> really don't scale out well if you are looking for something with a
>>>> unified
>>>> name space. I'll say however that ZFS is a battle hardened FS with
tons
>>>> of
>>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things
>>>> these
>>>> smaller start up companies are hocking are just ZFS appliances.
>>>>
>>>> RBD looks interesting but I'm not sure if I would be willing to put
>>>> production data on it, I'm not sure how performant it is IRL. From a
>>>> purely technical perspective, it looks REALLY cool.
>>>>
>>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is
another
>>>> option although historically small/random IO has not been it's
>>>> strong
>>>> point.
>>>>
>>>> If you are ok spending money on software and want a scale out block
>>>> storage
>>>> then you might want to consider HP LeftHand's VSA product. I am
>>>> personally
>>>> partial to NFS plays:) I went the exact opposite approach and
>>>> settled
on
>>>> Isilon for our primary storage for our CS deployment.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin
>>>> <nik.martin@nfinausa.com
>>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
>>>>
>>>> On 10/22/2012 10:16 AM, Trevor Francis wrote:
>>>>>
>>>>> We are looking at building a Primary Storage solution for an
>>>>>> enterprise/carrier class application. However, we want to build it
>>>>>> using
>>>>>> a FOSS solution and not a commercial solution. Do you have a
>>>>>> recommendation on platform?
>>>>>>
>>>>>>
>>>>>> Trevor,
>>>>>
>>>>> I got EXCELLENT results building a SAN from FOSS using:
>>>>> OS: Centos
>>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI
MegaRAID
>>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>>>>> caching
>>>>> (alternately, look at FlashCache from Facebook)
>>>>> intel 10GB dual port nics, one port for crossover, on port for up
link
>>>>> to
>>>>> storage network
>>>>>
>>>>> DRBD for real time block replication to active-active
>>>>> Pacemaker+corosync for HA Resource management
>>>>> tgtd for iSCSI target
>>>>>
>>>>> If you want file backed storage, XFS is a very good filesystem on
Linux
>>>>> now.
>>>>>
>>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but
that
>>>>> setup gave me a VERY high performance SAN. The downside is it is
>>>>> entirely
>>>>> managed by CLI, no UI whatsoever.
>>>>>
>>>>>
>>>>> Trevor Francis
>>>>>> Partner
>>>>>> 46 Labs | The PeerEdge Cloud
>>>>>> http://www.46labs.com <http://www.46labs.com/> |
>>>>>> http://www.peeredge.net
>>>>>> <http://www.peeredge.net/>
>>>>>>
>>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell
>>>>>> trevorgfrancis - Skype
>>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
trevor@46labs.com
>>>>>> >
>>>>>>
>>>>>>
>>>>>> Solutions Provider for the Telecom Industry
>>>>>>
>>>>>> <http://www.twitter.com/peeredge> <http://www.facebook.com/PeerEdge>
>>>>>>
>>>>>>
>>>>>>
>>>>>



Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Andreas Huser <ah...@7five-edv.de>.
Hi Trevor, 

1.) that is not a good configuration. The ZIL (slog) must be a mirror. When
this 60GB SSD drive crashes you can lose up to 60GB of data!

2.) the second bad idea is FreeBSD. FreeBSD is good, but it's not usable for a
storage.

My recommendation: use enough RAM, min. 8GB, better 12GB or more.
Install four harddrives and create two mirrors in the pool,
e.g. zpool create tank mirror disk1 disk2 mirror disk3 disk4
then you have two vdevs, and every vdev is striped.
When you have two vdevs this is like a raid10.
Have you three vdevs, then the data is striped over three vdevs (more than
raid10) etc. etc.
Every vdev brings more throughput, IOPS and performance.
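
A minimal sketch of that layout, including the mirrored slog from point 1
(device names are placeholders):

  zpool create tank mirror disk1 disk2 mirror disk3 disk4 \
    log mirror slog-ssd1 slog-ssd2 \
    cache l2arc-ssd1
  zpool status tank   # shows the two striped data vdevs plus the log and cache devices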

If you would not buy Solaris, then use OpenIndiana, but use version oi_151a3!
That's important!
Newer versions of OI have a new ZFS pool version that is no longer compatible
with standard or other ZFS systems.
Solaris has pool version 33, old OI and FreeBSD 28, and new OI has 1000. You
can upgrade the pool version but not downgrade.

e.g. you buy new hardware with more slots for harddrives and want to use
Solaris with premier support for production environments.
You cannot just put the drives into the new machine and do a zpool import!
That will not work, because the pool version is higher!
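
A quick way to check before moving disks (a sketch; the pool name is a
placeholder):

  zpool get version tank   # pool version on the old box
  zpool upgrade -v         # versions the new host supports; import only works if this is >= the pool's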

I use OI 151a3 for a 40TB mirrored storage with GlusterFS (18TB usable). The
machine runs stable; not one restart has been needed.

Can you tell me your hardware and plan for the storage connection and cloud
servers?
Which environment do you plan?


greetings from germany 
Andreas 



----- Ursprüngliche Mail -----

Von: "Trevor Francis" <tr...@tgrahamcapital.com> 
An: cloudstack-users@incubator.apache.org 
Gesendet: Mittwoch, 24. Oktober 2012 22:24:04 
Betreff: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT 

So it is more and more looking like I am going to be using ZFS over FreeBSD. 


I am space limited on my storage server to 4 drives. I was thinking of putting in 2 x 1TB 10k RPM 64MB-cache SATA drives, plus a 60GB SSD for the ZIL and a 256GB SSD for L2ARC. Most of our VMs will have very small (1k) writes, but they will grow into the thousands per second. Any recommendations for or against this setup? I would put FreeBSD on a bootable flash drive. 


Also, unrelated, will XS 6.1 work over CS4? 










Trevor Francis 
Partner 
46 Labs | The PeerEdge Cloud 
http://www.46labs.com | http://www.peeredge.net 
405-362-0046 - Voice | 405-410-4980 - Cell 
trevorgfrancis - Skype 
trevor@46labs.com 

Solutions Provider for the Telecom Industry 




On Oct 24, 2012, at 1:20 PM, Clayton Weise < cweise@iswest.net > wrote: 





You’re going to see a dip in performance with KVM primarily because of the qcow2 format (KVM people please correct me if I’m wrong because I’m new to KVM). Due to time constraints I wasn’t able to run any tests with writeback cache turned on or off, but I’ve seen people in forums saying both - that having it on improves performance and that having it off does - so who knows. 

What I can tell you is that in my tests with our current setup I was able to achieve about ~4-5k IOPS for random reads/writes using a multitude of different benchmark utilities in both Linux and Windows with KVM’s raw disk format. When I switched over to qcow2 it went down to about 2-2.5k. With the way our test RAID system is constructed our theoretical max is about 5k IOPS so I’m not sure if the drop is something related to the overall potential (e.g. 50% of the total) or simply a hard cap of that number but I haven’t had a chance to really kick things around much. This test was done using an iSCSI LUN mounted from a Compellent SAN on 10gbit Ethernet and OCFS2 on top of it. I know there was a comment earlier saying that OCFS2 was old but what I can say about it is that it was extremely easy to configure and get up and running and seemed to be pretty solid when we tested it about a year ago as well as the other week. 
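
For what it's worth, a minimal sketch of how the cache mode is set on a plain qemu/KVM command line (paths are placeholders; under libvirt the same thing is the cache attribute on the disk's driver element):

  qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 20G
  qemu-kvm -m 2048 -drive file=/var/lib/libvirt/images/test.qcow2,if=virtio,format=qcow2,cache=writeback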

Regarding your ZFS on Linux question. ZFS on Linux just isn’t there yet. It’s too new and the work that’s being done on it right now is primarily around stability and integrating features and not about performance so if you’re going to do ZFS do it on either BSD or Solaris/Illumos. The alternative to that is the hardware equivalent with something like LSI’s cachecade which will do basically the same thing and it will work just fine with Linux. 



From: Trevor Francis [mailto:trevor.francis@ tgrahamcapital.com ] 
Sent: Wednesday, October 24, 2012 8:27 AM 
To: cloudstack-users@incubator.apache.org 
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT 

So, since ZFS is super slow on Centos, would it be better to use ext4 and bcache over the native linux NFS daemon? 



Also, I am reading the KVM is much more disk performant using writeback caching. Does Cloudstack deploy the KVM VM instances with this enabled by default? 








Trevor Francis 

Partner 

46 Labs | The PeerEdge Cloud 

http://www.46labs.com | http://www.peeredge.net 

405-362-0046 - Voice | 405-410-4980 - Cell 

trevorgfrancis - Skype 

trevor@46labs.com 



Solutions Provider for the Telecom Industry 






On Oct 24, 2012, at 8:50 AM, Nik Martin < nik.martin@nfinausa.com > wrote: 



On 10/24/2012 08:29 AM, Trevor Francis wrote: 


Seems my original post has taken a detour. Who knew storage could be so 
controversial! 

We are a CentOS house and it would be nice to have a standard OS across our
platform, but many people have been singing ZFS' praises. To bridge that gap,
evidently ZFS can be deployed on CentOS 6.3 using ZFS on Linux.
Does anyone have experience with this? Would it be better to deploy CentOS
with ZFS and a cache drive, or CentOS 6.3 ext4 with flashcache or bcache? 


I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm being led to believe that it does not perform as well as other filesystems. For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS. 



Also, is the standard linux NFS daemon sufficient to handle a 
significant load or do you recommend a different NFS daemon....or go iSCSI ? 
Our biggest concern is that we have most VMs doing many sequential small 
writes (log files). 

When I first started with CS on Xenserver in January, I thought thin provisioning was important, so I built file system based storage using NFS, with XFS as the underlying filesystem, on LVM. NFS (v3) on CentOS 6 became processor bound VERY quickly, running 300% CPU load on a quad core Xeon. 

I switched to iSCSI with LVM as backing store, and have had zero issues, except I'm currently investigating lower than desired random read/write at < 4k block sizes. Using tests like Iozone in a VM, and running iostat -t 1 on the hypervisor console itself, I can get sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though. 
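
For reference, a minimal sketch of that kind of LVM-backed tgtd target (the
volume group, LV name and IQN are made up here):

  lvcreate -L 500G -n cs_primary vg_san
  tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2012-10.local.san:cs-primary
  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vg_san/cs_primary
  tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL   # or restrict to the storage network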




Thoughts before I make a decision? 



Trevor Francis 
Partner 
46 Labs | The PeerEdge Cloud 
http://www.46labs.com < http://www.46labs.com/ > | http://www.peeredge.net 
< http://www.peeredge.net/ > 
405-362-0046 - Voice | 405-410-4980 - Cell 
trevorgfrancis - Skype 
trevor@46labs.com < mailto:trevor@46labs.com > 
Solutions Provider for the Telecom Industry 

< http://www.twitter.com/peeredge > < http://www.facebook.com/PeerEdge > 

On Oct 24, 2012, at 2:53 AM, Fabrice Brazier < fabrice.brazier@apalia.net 
< mailto:fabrice.brazier@apalia.net >> wrote: 



Hi, 

@bruce, it's pretty funny, because you bought 30 discs but only 2 LSI 9280-8i -
the 8i means 8 internal ports (so with 2 LSI you can have only 16 disks).

After that, a VelociRaptor disk is only a 10k disc, so we can count 120 IOPS
per disk. If we put them in a RAID 5 (by the way, everybody knows that RAID 5
doesn't scale very well - after 12 disks the performance doesn't increase much -
but nevermind, you built your RAID 5 of 30 disks),
you have a maximum of (30-1)*120 = 3480 IOPS for reads and
((30-1)*120)/5 = 700 IOPS for writes (in a 100% random profile). And you have
only 32GB RAM, which is peanuts for caching your read IOPS.

So the only way you have to get performance is your CacheCade.
Now a CacheCade is made of SSD (weird, actually it's the same as the ZIL and
L2ARC in ZFS xD), and the maximum size of a CacheCade is 512GB per controller.
You have 2 controllers, so 1TB of CacheCade max.

So you say your storage doesn't cost anything, but you don't talk about the
size and the price of your CacheCade.
And another thing: the day you have a real workload (not only 50 VMs...) and
this workload is larger than 1TB, come again and explain to us your very large
performance.

Me too, I can build a ZFS server with 4 ZeusRAMs for the ZIL and 5 disks
behind it, run a benchmark on files of 2GB and say: hey, see, I have 500,000
IOPS and 4GB/s of throughput...

Regards 

-----Message d'origine----- 
De : bruce.m@v365.com.au < mailto:bruce.m@v365.com.au > 
[mailto:bruce.m@ v365.com.au < http://v365.com.au >] 
Envoyé : mercredi 24 octobre 2012 06:04 
À : cloudstack-users@incubator.apache.org 
< mailto:cloudstack-users@incubator.apache.org > 
Objet : Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT 



These SANs we have built have saved us a fortune $ compared to FC block I/O
SANs from IBM, HP etc.

I've been a contractor for IBM and HP for 20 years. I know how much they are
$$$$$ and what they can do.

A decent DS4000 you'll be looking at 80K, or StorageWorks $90K. And if you're
lucky you may get 100,000 perhaps 150,000 IOPS.

What I've built is better, cheaper, and frankly, as I support Linux and SCST,
much more reliable.

The WD VR 1TB we get new for $310, so 30 x = $9,300; LSI 9280-8i x 2 $1800
($900 ea); the case $600; the mobo and 1 x Xeon 5650 $1000; 32GB ECC RAM $800;
dual PSU $650 - about $14,000 for a much superior SAN.

As I explained, another $15K for the Infiniband network, which does SRP and IP.

As for your issues with cache misses, the CacheCade doesn't miss much from our
tests, with WriteBack enabled, BBU enabled, and the SANs on our primary 22KVA
UPS. It never fails.

Also it's not doing very much unless a user in the cloud is hitting sequential
blocks or R/W the same data repeatedly for some weird reason. Even then we see
little lag or slowdown when, as you say, it's hitting the disc.
Infiniband with LSI just rocks even more than we had thought!

The R/W pattern from groups of 50 VMs appears to be completely random: web
servers, virtual firewalls, email servers, DNS, FTP, Windows domains,
SharePoint, MSSQL, MySQL working on the same LUN which are not considered
critical (as stated by clients) are each getting over 300-1200MB/s constantly,
even when other users are pulling or pushing a file copy @ 250MB/s between VMs
etc. We've actually had to figure out how to slow down the SRP connections.
If a user is provisioned with 2 LUNs from each SAN, in Windows you can copy
between VM drives @ 1220MB/s!!!!!
Faster than SSD. In Linux it's even faster, and that is in ESXi 4.1 with
15~20 servers per VM host.

We have SNMP monitoring the 2 primary SANs, and they never go above 30%
bandwidth. So the point is the cache has no real bearing on cloud I/O; cloud
I/O is fairly low. We have LSI supplying us with the new PCIe 3 cards for a
trial, so I expect even better performance. So long story short: GET LSI AND
16 x WD VRs and try it yourself! And yes, we compared IOPS on 5 x 500MB/s
SSDs - it's SLOWER! MORE SPINDLES = MORE SPEED!

The main reason we wanted fast I/O is for backups and snapshots, so as not to
interfere with the daily operational workings of the cloud VM guests. Which it
doesn't. SOOO many times with FC I've seen the attitude "yeah that will work",
and some smart ass does a restore or backup in prod hours, only to see the
whole cluster slow down to a crawl. On iSCSI it's even more pathetic.

With SRP on LSI "WE DON'T HAVE THAT PROBLEM!"

I did a snapshot restore for 2 machines yesterday for a client; it took 2 mins
to restore 100GB. And none of the other servers were affected. SNMP latency on
all VMs: no impact. Virtually 0 CPU cycles on the VMware hosts for the I/O
move. I/O latency / storage path for other VMs from SNMP data from the VM
server was 2ms, which is normal; the machine doing the copy crept up to 10ms
but no more.

So my conclusion, ON REAL WORLD OBSERVATIONS, IS.... INFINIBAND SRP IS BETTER
THAN ANYTHING!

Cache misses we don't have a problem with at all. We're more worried about
data rot, which is why we're interested in GFS to ZFS LUNs.
But ZFS is still slow compared to what we have.....

We are going to launch a new script which moves a file full of 16-bit random
I/O, for 100MB and 1TB; it will create a binary checksum for each file, and we
can compare it over time to check for data rot (a quick sketch of that kind of
check is below).

Just so you can appreciate the speed: we ran IOmeter from 3 VMs on 3 separate
VM hosts. The 1st gets max I/O, i.e. up to 400,000 IOPS and pushing 2400MB/s;
we started the 2nd while the 1st was running, same results across the 2 VMs;
then the 3rd concurrently. They all dropped to 2000MB/s and IOPS went down to
220,000 per host, so yes there is a penalty. But the way Infiniband works is
that the switch balances the load across all servers. 4th server online, same
test, and we get 150,000 IOPS on each server, all doing 1700~1800MB/s @ 128K
blocks, 50%/50% random/sequential. Changing the distribution to 100% random
makes little difference. We confirmed all the drives were being accessed - all
were. We even took a look at the camera in our DC to monitor the array LEDs...
all flickering like mad! We took a look at the metrics of the MegaRAID, as we
got it to work on CentOS: no errors, no issues or problems, no dropped reads
or writes.

We set the file on IOmeter as high as possible, 900000000 units. We then fired
up 2 more hosts and did a real world file copy; no real impact, still get
2200MB/s+ read/write easily. So we stopped IOmeter and did 6 concurrent file
copies between the 2 SANs, and each host blew us away at 2000MB/s read and
write across all 6 hosts!! FXXK we were amazed.

Aside from that, LSI have done a super job on the cache modules on their
cards. I don't see how ZFS will ever approach this kind of performance.

Anyway, take from what I've said what you like. We have an awesome SAN... and
it's CHEAP and amazingly fast!

We're in the process of developing a public cloud, so when the web site is
finished we will publish all of this in a whitepaper there.
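
A minimal sketch of that kind of check, assuming a made-up mount point
/mnt/san:

  dd if=/dev/urandom of=/mnt/san/rotcheck-100M.bin bs=1M count=100
  sha256sum /mnt/san/rotcheck-100M.bin > /mnt/san/rotcheck-100M.sha256
  # later, re-verify (e.g. from cron) and alert on mismatch
  sha256sum -c /mnt/san/rotcheck-100M.sha256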

On 24.10.2012 10:57, Jason Davis wrote: 


Passionate are you? Take it easy, I am not knocking on what you are doing.

The term "spinning rust" is a catch-all for hard drives, nothing personal at
all; in fact the setup you have is impressive.

Impressive until you have a cache miss. Cache misses are definitely an issue
even in "capital E" enterprise storage.

Big bad XIV disk arrays or 3par are impressive, fast arrays (crazy expensive
;]) when you are hitting DRAM or SSD, but if you are running cache unfriendly
workloads then ultimately your array is only as fast as the spinning disk
sitting at the lowest tier.

I agree that openstorage is the way to go; from a service provider's
perspective it's hard being competitive with AWS or RackSpace if you are
using expensive traditional SAN.

So anyway, my question wasn't a poke at what you are doing, just curious
about how big of an IOP/latency drop you see once a read or write has to
actually see the spinning disk. :)


On Oct 23, 2012 8:10 PM, wrote: 



oh well if your negative, and make rude comments then that's your problem, i
don't care about you. If you don't believe that RDMA can perform this fast
then again, your problem. Ask anyone with a decent 12+ disc RAID array and
RDMA and they will tell you it will kick you in YOUR ballz! Anyone in storage
knows to get IOPS and performance you need 12, preferably 16 spindles.

The spinning rust, as you call it, are the newest 200MB/s WD VelociRaptors
with 64MB cache, @ 10KRPM 2.5" discs in a 32 bay storage chassis. We've used
them in the past and they are as reliable as the SAS drives we use, but
faster!

RAID 10 IS BULLSHIT! Bloody slow and 50% goes up in smoke for nothing. All you
get is the mirror speed of 2 drives which is barely 400MB/s tops! I spit on
RAID 10.

We also tried RAID 60 and it was very good as well. But no point to use it as
we replicated to the 2nd SAN.

When we tested our new SANs, we failed 1 drive and set the rebuild rate to 30%
on RAID 6, with 40% scan rate. The SAN only had 6TB of data on it, and it
rebuilt in 3 hours. We removed a 2nd disc and replaced it with a blank, and it
was rebuilt in another 4 hours. With no real impact to performance tests.

Separating the underlying 6Gb/s SATA ports x 8 into effectively 16 x 3Gb/s
channels using a SAS expander gives each disc 300MB/s of bandwidth. They can't
physically perform better than 200~220MB/s, so there is enough bandwidth on
the SATA and PCIe bus to cope. LSI rate the 9280-8i @ 2500MB/s but it does
work faster with CacheCade - up to 3200MB/s in that test.

So a real SAN has many spindles of high performance, and the WD VRs are better
than some cheap SSD drives.

RDMA is very fast, low CPU cycles, and reads and writes directly to RAM @
40Gb/s. We created a RAM drive and tested it up to 3200MB/s, which is as fast
as the PCIe 2 bus / LSI could handle.

If you can't afford this relatively cheap equipment (compared to an IBM DS or
HP StorageWorks SAN) then don't whinge at ME. You can't get fast performance.
Any real cloud would have what we have built. None of our customers, ALL 300
of them! have lost a single bit of data; all VMs are running very quickly, and
with no problems now for 3 months.

So If you dont appreciate being shown how to build a kick ASS SAN then PISS
OFF!

On 24.10.2012 08:17, Jason Davis wrote:

How often do you have folks cache missing and hitting the spinning rust below
the cache cade tier?

On Oct 23, 2012 6:25 PM, wrote:
Hi

with our SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid Director 4036
switches + HP Gen2 QDR Chassis Switches.

Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending on
block size, for the IOmeter test. LSI 9280-8i with battery cache +
cachecade 2.0 x 2 in a PCIE 16x slot on a Supermicro motherboard.

1 x 250GB SSD for CacheCade + 16 1TB WD VR drives (16 - hot standby - 2 for
raid 6 parity) = 2600MB/s

most of our tests are from 4K to 128K block size; @ 4K we get 295,000 IOPS @
1590 MB/s, and @ 128K 350,000 IOPS @ 2400~2600MB/s

we have tuned 64K and 128K block size on different LUNs. 64K for database and
128K for general file.

The best thing about Infiniband is low CPU cycles. only 5~6% during these
tests.

latency is as low as 4-6ms average read time. concurrent response times are
from 5-12ms. even under heavy load it's below 20ms

Infiniband latency is below 0.01 us, which is why we chose it.

we run MSSQL on the 64K formatted LUNs, it's massively fast.

If we copy the same data twice the cache cade kicks in and we achieve even
better speeds.

I've compared this to block I/O @ 8Gb/s Fibre Channel, and it barely gets
120,000 IOPS, at much higher latency and lower bandwidth.

So for our money RDMA wins!

Bruce

On 23.10.2012 21:20, Jason Davis wrote:
Out of curiosity, is there any quick performance numbers for these ZFS +
GlusterFS mashups you guys are talking about?

Specifically, IOPs and latency? Sequential read/write performance honestly
isn't a very good benchmark to determine your SANs performance. It's like
comparing CPUs based solely on how many GHz it runs at. Sure you can get
great MB or GB/s with SATA disk but I'd reckon that IOP performance is
abismal. If you are utilizing GlusterFS without the cache pooling magic that
is ZFS then I would imagine that latency can be an issue.

On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:


<blockquote>

<blockquote>

<blockquote>

<blockquote>


</blockquote>


Hi Fabrice, 


<blockquote>


i 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

know 


<blockquote>

<blockquote>

<blockquote>

OpenSolaris/Solaris Oracle it's so 
</blockquote>

</blockquote>

a thing. 


<blockquote>

<blockquote>

<blockquote>

I'm 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

for more then 10 


<blockquote>

<blockquote>

<blockquote>

years a open source user and 
</blockquote>

</blockquote>

that 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

</blockquote>

</blockquote>

with oracle - i did no like at the 


<blockquote>

<blockquote>

<blockquote>

beginning of this 
</blockquote>

</blockquote>


</blockquote>

constallation. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

But Oracle makes his work good i 
</blockquote>

</blockquote>

know 
</blockquote>

</blockquote>

</blockquote>

that. The 


<blockquote>

cost of one socket 


<blockquote>

<blockquote>

<blockquote>

are 700$ and you can use so 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

much 


<blockquote>

<blockquote>

<blockquote>

quantity of 
</blockquote>

</blockquote>

TB as you will. 


<blockquote>

<blockquote>

<blockquote>

And you can use the 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

full premier Support 


<blockquote>

<blockquote>

<blockquote>

from 
</blockquote>

</blockquote>

Oracle. 


<blockquote>

<blockquote>

<blockquote>

Nexenta develop with 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

the Illumos code. And the Licence 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

are TB based. 


<blockquote>

<blockquote>

<blockquote>

That 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

is not my favorite. As well the pool version 


<blockquote>

from 


<blockquote>

Nexenta 
</blockquote>

</blockquote>

comes 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

not after. Current Nexenta Infiniband are 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

not a 
</blockquote>

usable 


<blockquote>

<blockquote>

<blockquote>

solution. 


<blockquote>

But every can use what he will. 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

Everyone 


<blockquote>

must decide for 


<blockquote>

themselves. 


<blockquote>


SRP 
</blockquote>

</blockquote>

</blockquote>

Targets or iser are not 


<blockquote>

difficult to configure. 


<blockquote>

Use the SRP 
</blockquote>

</blockquote>

for 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

the Storage unit 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

connection. Solaris and GlusterFS 


<blockquote>


</blockquote>

</blockquote>

builds one Storage unit. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

The 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

GlusterFS Server export the 
</blockquote>

final Volume 


<blockquote>

<blockquote>

<blockquote>

to the Clients as well 
</blockquote>

</blockquote>

KVM, 


<blockquote>

<blockquote>

<blockquote>

VMWare, 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

Hyper-V etc. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

You can use 
</blockquote>

</blockquote>

native GlusterFS, 
</blockquote>

</blockquote>

RDMA, NFS 
</blockquote>

ore CIFS to export the Volume. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

SRP have 
</blockquote>

</blockquote>

nothing to do 
</blockquote>

</blockquote>


</blockquote>

with VmWare. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>


When you are use a 7200 SAS drive the 
</blockquote>

</blockquote>

</blockquote>

</blockquote>


access 
time are the same as a 


<blockquote>

<blockquote>

<blockquote>

SATA drive only the quality 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

of the 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

hardware are better. When you need 


<blockquote>

<blockquote>

<blockquote>

Performance 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

you must use SAS 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

drives with 15000U/m. But it's not needed 



<blockquote>

when you install SSD 
</blockquote>

for 


<blockquote>

ZIL/L2ARC. ZeusRAM rocks :-) 
</blockquote>




<blockquote>




<blockquote>

I use dedup only at 
</blockquote>

</blockquote>

secondary stroage 


<blockquote>

or on 
</blockquote>

</blockquote>

Backupserver not on Primary 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>

Storage. 
</blockquote>

</blockquote>


When you use 
</blockquote>

</blockquote>

</blockquote>

SSD SATA 


<blockquote>

<blockquote>

<blockquote>

drives then you have an cheap an fast 
</blockquote>

</blockquote>

strorage. 


<blockquote>


</blockquote>

1TB drive cost unter 


<blockquote>

100$. Current i'm not need to 
</blockquote>

save 
</blockquote>

storage volume. 


<blockquote>

<blockquote>

<blockquote>

<blockquote>

<blockquote>


Which 
</blockquote>

</blockquote>

application use atime? I 
</blockquote>

</blockquote>

</blockquote>

know 


<blockquote>

"find -atime N". atime on a Storage 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

that stores 
</blockquote>

</blockquote>

only Virtual 


<blockquote>

Disks? 


<blockquote>

<blockquote>

<blockquote>

I don't need the information when i 
</blockquote>

</blockquote>

</blockquote>


have last opening 
the disk :-) 


<blockquote>

<blockquote>

<blockquote>


For a Solaris GUI 
</blockquote>

</blockquote>

</blockquote>

</blockquote>

take a look at 


<blockquote>

<blockquote>

<blockquote>

napp-it 

http://www.napp-it.org/ 
</blockquote>

</blockquote>




<blockquote>

<blockquote>

<blockquote>

greetings from germany 
</blockquote>

</blockquote>


</blockquote>

Andreas 


<blockquote>

<blockquote>

<blockquote>


</blockquote>

</blockquote>

</blockquote>




----- Ursprüngliche Mail -----
Von: "Fabrice Brazier"
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 12:30:50
Betreff: RE: Primary Storage

Hi Andreas,

Hum that's pretty cool, i know they have still trouble with infiniband on nexenta but it's clearly a priority on their roadmap (plus i trust more the nexenta team to evolve zfs than oracle).

I agree iscsi over infiniband increase the latency but most of the time it's just more simple to use IPOIB than IB, for example if you use vmware the support of infiniband it's assured by mellanox and not vmware so if you have an issue the vmware support probably don't help you anymore.

About the raid i'm a fan of raid 10, i prefer build a raid 10 with 7200 sas drive than multiple vdev raidz/raidz2 15k sas drive. Particularly for virtual environment with a random workload and multiple access.

I'm a fan of nfs so agreed about the zil, and zeus ram are the only one for that (with classical ssd you can hit the write hole).

Agreed for compression too (but only lzjb level gzip use too much cpu), disable atime permit to decrease the iops load but i'm not sure is really a best practice. About the deduplication I don't totally agree. It really depends of your array, the workload and the VM type on your cloudstack. Actually when i build a zfs array i count 1Gb ram for 1Tb disks. With deduplication i count 2Gb ram for 1 Tb disks (plus in nexenta 4 they will add a possibility to use ssd for the deduplication metada, like the l2arc for the read cache).

The last point it's about your drives, why sata drives? I mean sata doesn't respect their latency, and the reliability of a sas drive is 10x the sata. Plus now you can found many NL-sas drive at low cost.

But it's really a nice architecture, i never try glusterfs (for the moment) plus in that case it's really a good way to have a replacement to a metro-cluster for free, i try one of is competitor (onefs) and clearly clustered filesystem are the futur.

Cheers,
Fabrice
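For readers who want to see the difference Fabrice is describing, a rough sketch of a striped-mirror ("raid 10" style) pool next to a raidz2 pool built from the same six hypothetical disks; the 1 GB of RAM per TB (2 GB with dedup) figure is his rule of thumb, not a hard requirement:

    # striped mirrors: each mirror pair stripes with the others (better random IOPS)
    zpool create vmpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0

    # the raidz2 alternative with the same disks (more usable space, fewer IOPS):
    # zpool create vmpool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0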


-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de]
Envoyé : mardi 23 octobre 2012 11:37
À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi Fabrice,

I don't know what other people do but i have no problems with Infiniband + GlusterFS + Cloudstack. I'm not use Nexenta it's based on Illumos and work not fine with Infiniband. I have two different clusters in productiv enviroments.

The first: Solaris 11 with built-in GlusterFS 3.3 export Gluster Vol. with RDMA. Performance is okey you can use that for smaller enviroments.
The Second: Is a little bit complex with a GlusterFS Server in the middle.

ZFS Server: based on Solaris 11
1.) Create a zpool min. two vdevs and SSD read/write cache
2.) Create a thin provisioning Volume "zfs create -V" disable atime and enable compression (do not enable dedup!) and export as (iWarp) SRP Target to the GlusterFS Server use a direct connection without IB Switch.

GlusterFS Server:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use OFED driver from https://www.openfabrics.org
3.) Import the SRP Target from ZFS Server and format as xfs
4.) Create a Glustervolume "volume create xy transport rdma" (use only rdma)
5.) Connect with the second IB Port to a IB Switch

Cloudstack Hypervisor Node:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use OFED driver from https://www.openfabrics.org
3.) Import the Glustervolume

ZFS Thin Volume ---- Infinband SRP ----> GlusterFS ---- GFSVol rdma ----> IB Switch ----> Clients
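A hedged sketch of steps 3-5 on the GlusterFS server and the mount on a hypervisor node, assuming the imported SRP LUN shows up as /dev/sdb and the server is reachable as gfs01; Andreas does not give his real volume or brick names, so these are placeholders:

    # GlusterFS server: format the SRP target, then create an rdma-only volume
    mkfs.xfs /dev/sdb
    mkdir -p /bricks/vol01 && mount /dev/sdb /bricks/vol01
    gluster volume create vol01 transport rdma gfs01:/bricks/vol01
    gluster volume start vol01

    # CloudStack hypervisor node: mount the volume over RDMA
    mount -t glusterfs -o transport=rdma gfs01:/vol01 /mnt/primary

(For a mirrored pair of storage units the volume would instead be created with "replica 2" and one brick per server.)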


The ZFS and GlusterFS Server form a Storage unit connect direct with 40Gbit Infiniband Point-to-Point. You do not feel that is a cable between!

Important: When you have Infiniband use not IPoIB with iSCSI! If one already has Infiniband then you should also use the advantage. IPoIB have a higher Latency as iWarp SRP!

SRP = usec

-- SRP --
local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000
------------------------------------------------------------------
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000         1.29         125.30       1.31
------------------------------------------------------------------

-- IPoIB ---
[root@sv01sfogaa ~]# ping 10.200.0.10
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms

When you put load on IPoIB then increases the latency that is not good.
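The SRP figures above look like output from the OFED perftest tools; a sketch of how such a comparison might be reproduced, with host names as placeholders:

    # RDMA latency (perftest package): start the server side first, then the client
    ib_read_lat                   # on the storage node
    ib_read_lat storage-node      # on the client, prints t_min/t_max/t_typical in usec

    # IPoIB latency for comparison, measured with plain ICMP over the IPoIB interface
    ping -c 100 10.200.0.10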


That is my recommendation for a simple GlusterFS Mirror:
- Supermicro Server with Intel Hardware and Expander Backplane
- 1x Crucal-M4 SSD Read Cache
- 2x ZeusIOPs SSD Write cache (mirror)
- SATA 24/7 Harddrives
- LSI HBA 9207 or 9211
- ConnectX-2 QDR Dualport Infiniband Adaper (HP refurbished with full warrany for 100$) Importent > Flash newest Firmware from Mellanox!
- Mellanox IB Swich
- Solaris 11
- GlusterFS 3.3 compiled with ib_verbs
- Gluster Volume transport only rdma

Throughput constant up to 200 Mbyte/s, more throughput with more storage servers or more harddrives on jbod.

Info:
- I have some problems with infiniband RDMA or SRP with OpenIndiana or Illumos or Nexenta. Some adapters have a high latency or not a stable connection. Use Solaris that's the right way!
- OpenIndiana is Beta! Infiniband ib_verbs works not or not fine!
- Use Solaris 11, Infiniband ib_verbs are native and stable
- Don't use Ubuntu Client Server for Infiniband! Use RedHat, Fedora or CentOS and install the right drivers from https://www.openfabrics.org/downloads/OFED/
- You have not SSD Cache? Disable at the ZFS Volume the sync! Important! You loose Security for your Data but some protocols use sync flags in transport. Example NFS use by default fsync. write cache are not active. NFS writes data direct to the Harddrive. For Data Security and Performance give the Storage Server a SSD write cache. ZFS works at default with sync=standard that prevent write holes. (COW System)

I hope that I could help a little

Greeting from Germany
Andreas
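On the last point in the list above (no SSD cache, disable sync): a minimal sketch, assuming a volume named tank/vm01; as the warning says, this trades crash safety for latency and is best treated as a stopgap until a proper log device is installed:

    zfs get sync tank/vm01            # default is sync=standard
    zfs set sync=disabled tank/vm01   # faster, but unsafe on power loss
    zfs set sync=standard tank/vm01   # revert once an SSD/ZeusRAM log device is added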



----- Ursprüngliche Mail -----
Von: "Fabrice Brazier"
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 09:55:15
Betreff: RE: Primary Storage

Hi Andreas,

Hello i just see your configuration, it seems quite interesting. If i understand well you want to build some zfs array on the backend. Export luns (probably by iscsi over infiniband) to you linux cluster, and on the linux cluster you put glusterFS. I can understand the point, with that you can have very good performance and reliability (zfs), scalability and redundancy (gluster) for very low cost.

So just one question, did you try the global namespace implementation from nexenta? If yes can you tell me what configuration is the best for you? I mean the fact you have a gluster cluster in the middle must impact the overral performance no?

Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de]
Envoyé : mardi 23 octobre 2012 05:40
À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi,

for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That gives the best performance and most scalable Storage. I have tasted some different solutions for primary Storage but the most are to expensive and for a CloudStack Cluster not economic or have a poor performance.

My Configuration:
Storage Node: Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX 40Gbit/s Infiniband adapter.

I have installed GlusterFS direct on Solaris with a modified code. Want you build bigger systems for more then 50 VMs it is better you split the Solaris and GlusterFS with a separte headnode for GlusterFS.

That looks like:
Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) --> (SRP Target attached direct without Infiniband switch to GF Node) --> GlusterFS Node the srp target formatted with xfs filesystem, create a GlusterFS Volume --> (Infiniband over a Mellanox Port Switch) --> Cloudstack Node mount glusterFS Volume over RDMA

For the Dataset Volume at the ZFS Storage, disable atime and enable compression. (Space reclaim) With compression you can shrink the ZFS Volume with command at Linux dd /dev/zero or in a Windows VM with sdelete. That gives you space left on the Primary Storage for deleted Files in a VM or for deleted vhd's or vm's in the cloudstack.

greeting Andreas
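A sketch of the dataset settings and the space-reclaim trick described above; the dataset name is a placeholder, and the dd step simply fills the guest's free space with zeros so the compressed zvol underneath can shrink:

    # on the ZFS storage node: sparse (thin) zvol with compression enabled
    zfs create -s -V 2T tank/gfs-vol01
    zfs set compression=on tank/gfs-vol01
    # (atime is a filesystem property; on a plain dataset use: zfs set atime=off tank/fs01)

    # inside a Linux VM: zero the free space, then delete the fill file
    dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile; sync

    # inside a Windows VM the equivalent is: sdelete -z C: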


Mit freundlichen Grüßen

Andreas Huser
Geschäftsführer
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel: +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de

----- Ursprüngliche Mail -----
Von: "Outback Dingo"
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 02:15:16
Betreff: Re: Primary Storage

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
Solaris 11 ZFS and yes we tried different setups, raids levels, number of SSD cache, ARC zfs options etc etc etc.
Cheers
VMWare ??

--
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability



Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Trevor Francis <tr...@tgrahamcapital.com>.
So it is looking more and more like I am going to be using ZFS on FreeBSD.

I am space limited on my storage server to 4 drives. I was thinking of putting in 2 x 1TB 10k RPM 64 MB cache SATA drives, plus a 60 GB SSD for the ZIL and a 256 GB SSD for L2ARC. Most of our VMs will have very small (1k) writes, but they will grow into the thousands per second. Any recommendations for or against this setup? I would put FreeBSD on a bootable flash drive.
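A rough sketch of that layout with FreeBSD device names assumed (ada0/ada1 for the two 1 TB disks, ada2 for the 60 GB SSD, ada3 for the 256 GB SSD); one possible arrangement, not a recommendation:

    # mirror the two data disks, then add the SSDs as log (ZIL) and cache (L2ARC)
    zpool create tank mirror ada0 ada1
    zpool add tank log ada2
    zpool add tank cache ada3
    zpool status tank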

Also, unrelated, will XS 6.1 work over CS4?



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
 
Solutions Provider for the Telecom Industry

 

On Oct 24, 2012, at 1:20 PM, Clayton Weise <cw...@iswest.net> wrote:

> You’re going to see a dip in performance with KVM primarily because of the qcow2 format (KVM people please correct me if I’m wrong because I’m new to KVM).  Due to time constraints I wasn’t able to run any tests with writeback cache turned on or off, but I’ve seen people in forums arguing both ways, that having it on improves performance and that having it off does, so who knows.
>  
> What I can tell you is that in my tests with our current setup I was able to achieve about ~4-5k IOPS for random reads/writes using a multitude of different benchmark utilities in both Linux and Windows with KVM’s raw disk format.  When I switched over to qcow2 it went down to about 2-2.5k.  With the way our test RAID system is constructed our theoretical max is about 5k IOPS so I’m not sure if the drop is something related to the overall potential (e.g. 50% of the total) or simply a hard cap of that number but I haven’t had a chance to really kick things around much.  This test was done using an iSCSI LUN mounted from a Compellent SAN on 10gbit Ethernet and OCFS2 on top of it.  I know there was a comment earlier saying that OCFS2 was old but what I can say about it is that it was extremely easy to configure and get up and running and seemed to be pretty solid when we tested it about a year ago as well as the other week.
>  
> Regarding your ZFS on Linux question.  ZFS on Linux just isn’t there yet.  It’s too new and the work that’s being done on it right now is primarily around stability and integrating features and not about performance so if you’re going to do ZFS do it on either BSD or Solaris/Illumos.  The alternative to that is the hardware equivalent with something like LSI’s cachecade which will do basically the same thing and it will work just fine with Linux.
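A hedged illustration of the qcow2-versus-raw and cache-mode knobs discussed above; the image paths are made up, and CloudStack generates the libvirt XML itself, so this is only for inspecting or experimenting by hand:

    # create a qcow2 and a raw image of the same size for a side-by-side test
    qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 20G
    qemu-img create -f raw   /var/lib/libvirt/images/test.raw   20G

    # in the libvirt domain XML the cache mode sits on the <driver> element, e.g.
    #   <driver name='qemu' type='qcow2' cache='writeback'/>
    # edit with "virsh edit <domain>" and restart the guest for it to take effect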
>  
> From: Trevor Francis [mailto:trevor.francis@tgrahamcapital.com] 
> Sent: Wednesday, October 24, 2012 8:27 AM
> To: cloudstack-users@incubator.apache.org
> Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
>  
> So, since ZFS is super slow on Centos, would it be better to use ext4 and bcache over the native linux NFS daemon?
>  
> Also, I am reading that KVM is much more disk performant using writeback caching. Does Cloudstack deploy the KVM VM instances with this enabled by default?
>  
>  
>  
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com | http://www.peeredge.net
> 405-362-0046 - Voice  | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com
>  
> Solutions Provider for the Telecom Industry
> 
>  
>  
> On Oct 24, 2012, at 8:50 AM, Nik Martin <ni...@nfinausa.com> wrote:
> 
> 
> On 10/24/2012 08:29 AM, Trevor Francis wrote:
> 
> Seems my original post has taken a detour. Who knew storage could be so
> controversial!
> 
> We are a Centos house and it would be nice to have a standard OS
> across our platform, but many people have been singing ZFS' praises. To
> bridge that gap, evidently ZFS can be deployed on Centos 6.3 using ZFS
> on Linux.
> Does anyone have experience with this? Would it be better to deploy
> Centos with ZFS and a cache drive or Centos 6.3 ext4 with flashcache or
> bcache?
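For what it's worth, a minimal bcache sketch with hypothetical device names (/dev/sdb as the backing disk, /dev/sdc as the SSD); note that bcache needs a newer kernel than stock CentOS 6 ships, so this assumes a suitable kernel is already in place:

    make-bcache -B /dev/sdb                  # backing (spinning) device
    make-bcache -C /dev/sdc                  # caching (SSD) device
    echo /dev/sdb > /sys/fs/bcache/register
    echo /dev/sdc > /sys/fs/bcache/register
    # attach the cache set (UUID from "bcache-super-show /dev/sdc") to the backing device
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach
    mkfs.ext4 /dev/bcache0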
> 
> 
> I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm being led to believe that it does not perform as well as other filesystems.  For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS.
> 
> 
> Also, is the standard linux NFS daemon sufficient to handle a
> significant load or do you recommend a different NFS daemon....or go iSCSI ?
> Our biggest concern is that we have most VMs doing many sequential small
> writes (log files).
> 
> When I first started with CS on Xenserver in January, I thought thin provisioning was important, so I built file system based storage using NFS, with XFS as the underlying filesystem, on LVM.  NFS (v3) on Centos 6 became processor bound VERY quickly, running 300% CPU load on a quad core Xeon.
> 
> I switched to iSCSI with LVM as backing store, and have had zero issues, except I'm currently investigating lower than desired random read/write on < 4k random.  Using tests like Iozone in a VM, and running iostat -t1 on the hypervisor console itself, I can get sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though.
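A generic sketch of the iSCSI-plus-LVM arrangement Nik describes, with a made-up portal address and IQN; in practice the hypervisor and CloudStack tooling handle most of this, so it is only useful for manual testing:

    # discover and log in to the target
    iscsiadm -m discovery -t sendtargets -p 10.0.0.50
    iscsiadm -m node -T iqn.2012-10.com.example:primary01 -p 10.0.0.50 --login

    # put LVM on the new block device (confirm the device name with dmesg or lsblk)
    pvcreate /dev/sdc
    vgcreate vg_primary /dev/sdc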
> 
> 
> 
> Thoughts before I make a decision?
> 
> 
> 
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com | http://www.peeredge.net
> 405-362-0046 - Voice  | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com
> Solutions Provider for the Telecom Industry
> 
> 
> On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fabrice.brazier@apalia.net> wrote:
> 
> 
> Hi,
> 
> @bruce, it's pretty fun because you bought 30 discs but only 2 LSI 9280-8i
> or the 8i mean 8 internals ports (so with 2 lsi you can have only 16
> disks).
> 
> After that a velociraptor disk it's only a 10k discs so we can count 120
> IOPs by disks. If we put them in a raid 5 (by the way everybody know that
> raid 5 don't scale very well after 12 disk the performance doesn't
> increase
> so much but nevermind you built your raid 5 of 30 disks).
> So you have a maximum of (30-1)*120=3480 iops for read and
> ((30-1)*120)/5=700 iops write (in a 100% random profile) And you have only
> 32Gb ram wich is peanuts to caching your read iops.
> 
> So the only way you have to make performance is your cachecade.
> Now a cachecade is made of ssd (weird actually it's the same as zil and
> l2arc in zfs xD) plus the max size of a cachecade and the maximum size
> of a
> cachecade is 512 Gb by controller. You have 2 controller so 1Tb cachecade
> max.
> 
> So you say your storage doesn't cost anything but you don't talk about the
> size and the price of your cachecade.
> And another thing the day you will have a real workload (not only 50
> Vms...)
> and this workload will be larger than 1Tb come again and explain us your
> very large performance.
> 
> Me too I can built a zfs server with 4 zeusram in zil and 5 disks
> behind and
> make a benchmark on files of 2Gb and say hey see I have 500 000 Iops
> and 4Gb
> of throughput...
> 
> Regards
> 
> -----Message d'origine-----
> De : bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
> Envoyé : mercredi 24 octobre 2012 06:04
> À : cloudstack-users@incubator.apache.org
> Objet : Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
> 
> 
> 
> These Sans we have build have saved us a fortune $ compared to FC
> Block I/O
> Sans from IBM, HP etc
> 
> Ive been a contractor for IBM and HP
> for 20 years. I know how much they are $$$$$ and what they can do.
> 
> A
> Decent DS4000 you'll be looking at 80K, or Storageworks $90K. And if your
> lucky you may get 100,000 perhaps 150,000 IOPS
> 
> What ive build is
> better. cheaper, and frankly as i support Linux and SCST, much more
> reliable.
> 
> the WD VR 1TB we get new for $310 so 30 x =$9,300 LSI 92808-8i x 2 $1800
> ($900 ea) The case $600 the mobo and 1 x Xeon 5650 $1000, 32GB ECCRam
> $800,
> Dual PSU $650, about $14,000 for a much superior SAN.
> 
> As i explained another $15K for infiniband network. which does SRP and IP
> 
> As for your issues with cache misses, the CacheCade doesn't Miss much from
> our tests. with WriteBack enabled, BBU enabled, and the SANs are on our
> primary 22KVA UPS It never fails.
> 
> Also Its not
> doing very much unless a user in the cloud is hitting sequential blocks or
> R/W the same data repeatedly for some weird reason. Even then we see
> little
> lagg or slow down, when as you say its hitting the Disc.
> Infiniband with LSI Just rocks even more than we had thought!
> 
> The R/W
> pattern from groups of 50 VM.s appear to be completely random, web
> servers,
> virtual firewalls, email servers, DNS etc, FTP, windows domains,
> sharepoint,
> MSSQL, MYSQL working on the same Lun which are not considered critical (as
> stated by clients) are each VM getting over 300-1200MB/s constantly even
> when other users are pulling or pushing a file copy @ 250Mb/s between VMs
> etc. Weve actually had to figure out how to slow down the SRP connections.
> If a user is provisioned with 2 Luns from each SAN, in windows you can
> copy
> between VM Drives @ 1220MB/s!!!!!
> faster than SSD in Linux its even faster, and that is in ESXi4.1 with
> 15~20 Servers per VM Host.
> 
> we have snmp monitoring the 2 primary sans, and they never go above 30%
> bandwidth. so the point is the Cache has no real bearing on Cloud I/O.
> cloud
> IO is fairly low. We have LSI supplying us with the new PICE3 cards for a
> trial, so i expect even better performance. So long STORY SHORT. GET
> LSI AND
> 16 x WD VR's and try it yourself! And yes we compared IOPS on 5 x 500MB/s
> SSD's its SLOWER! MORE SPINDLES = MORE SPEED!
> 
> the main reason we wanted fast IO is for backups, snapshots, so as not to
> interfere with the daily operational workings of the cloud VM guests.
> Which
> is doesn't. SOOO many TIMES with FC ive seen the attitude yeah that will
> work, and some smart ass does a restore or backup in prod hours, only
> to see
> the whole cluster slow down to a crawl. on iSCSI its even more pathetic.
> 
> With SRP on LSI " WE DONT
> HAVE THAT PROBLEM!"
> 
> I did a snapshot restore for 2 machines yesterday for a client, it took 2
> mins to restore 100GB. And none of the other servers were affected. SNMP
> latency on all VM's no impact. Virtually 0CPU cycles on the VMware
> hosts for
> the IO move. IO latency/storage path for other VM's from snmp data from
> VMserver was 2ms which is normal. the machine doing the copy crept up to
> 10ms but no more.
> 
> So my conclusion
> ON REAL WORLD OBSERVATIONS IS.... INFINIBAND SRP IS BETTER THAN ANYTHING!
> 
> Cache misses we don't have a problem there at all. Were more worried about
> Data Rot, which is why were interested in GFS to ZFS Luns.
> But ZFS is still slow as compared to what we have.....
> 
> we are going to
> launch a new script which moves a file full of 16-bit random io for 100MB,
> and 1TB, it will create a binary checksum for each file, and we can
> compare
> it over time to check for data rot.
> 
> Just so you can
> appreciate the speed. We Ran IO meter from 3 VMs on 3 separate VM Hosts.
> the 1st get max i/o ie up to 400,000 IOPS and pushing 2400MB/s we started
> the 2nd, while the 1st was running, same results across the 2 VM's.
> then the
> 3rd concurrently. they all dropped to 2000MB/s and IOPS went down to
> 220,000
> per host. so yes there is a penalty. But the way infiniband works is that
> the switch balances the load across all servers. 4th server online same
> test. and we get 150,000 IOPS on each server, all doing 1700~1800MB/s
> @ 128K
> blocks 50%50% random seq. change the distribution to 100% random makes
> little difference. We confirmed all the drives were being accessed! all
> were. we even took a look at the camera out DC to monitor the array
> LEDS...
> alll Flickers like mad! we took a look at the metrics of the Megaraid
> as we
> got it to work on CentOS, no errors, no issues or problems, no dropped
> reads
> or writes.
> 
> we set the file on IO meter as high as poss 900000000 units. We then fired
> up 2 more hosts and did real world file copy, no real impact still get
> 2200MB/s+ read write easily. So we stopped IO Meter and did 6 concurrent
> file copies between the 2 sans and each host blew us away at 2000MB/s read
> and write across all 6 hosts.!! FXXK we were amazed
> 
> Aside
> from that LSI have done a super job on the Cache modules on their cards.
> I dont see how ZFS will ever approach this kind of performance.
> 
> Anyway
> take from what ive said what you like. We have an awsum SAN... and its
> CHEAP
> and Amazingly fast!
> 
> were in the process of developing a public Cloud, so when the web site is
> finn we will publish all of this in a whitepaper there.
> 
> On 24.10.2012 10:57, Jason Davis wrote:
> 
> Passionate are you? Take it easy I am not knocking on what you are doing.
> 
> The term "spinning rust" is a catch all for hard drives, nothing personal at all, in fact the setup you have is impressive.
> 
> Impressive until you have a cache miss. Cache misses are definitely an issue even in "capital E" enterprise storage.
> 
> Big bad XIV disk arrays or 3par are impressive, fast arrays (crazy expensive ;]) when you are hitting DRAM or SSD but if you are running cache unfriendly workloads then ultimately your array is only as fast as the spinning disk sitting at the lowest tier.
> 
> I agree that openstorage is the way to go, from a service providers perspective it's hard being competitive with AWS or RackSpace if you are using expensive traditional SAN.
> 
> So anyway, my question wasn't a poke at what you are doing, just curious about how big of an IOP/latency drop you see once a read or write has to actually see the spinning disk. :)
> 
> On Oct 23, 2012 8:10 PM, wrote:
> 
> 
> 
> 
> 
> oh well if your negative, and make rude comments then that's your problem, i don't care about you.
> 
> If you don't believe that RDMA can perform this fast then again, your problem.
> 
> Ask anyone with a decent 12+ Disc Raid Array, and RDMA and they will tell you it will kick you in YOUR ballz! Anyone in storage knows to get iops and performance you need 12 preferably 16 spindles.
> 
> the spinning rust as you call it, are the newest 200MB/s WD Veloci Raprors with 64MB Cache. @ 10KRPM 2.5" dics in 32 Bay Storage Chassis.
> 
> we've used them in the past and they are as reliable as the SAS drives we use, but faster!
> 
> RAID 10 IS BULLSHIT! Bloody slow and 50% goes up in smoke for nothing. All you get is the mirror speed of 2 drives which is barley 400MB/s tops! I Spit on RAID 10
> 
> We also tried Raid 60 and it was very good as well. But no point to use it as we replicated to 2nd SAN
> 
> When we tested our new SANs, we failed 1 drive and set the rebuild rate to 30% on raid 6, with 40% scan rate. The SAN only had 6TB of data on it. and it rebuilt in 3 hours. we removed a 2nd disc, and replaced with blank, and it was rebuild in another 4 hours. With no real impact to performance tests.
> 
> Separating the underlying 6Gb/s Sata ports x 8 into effectively 16 x 3Gb/s Channels using a SAS Expander gives each Disc 300MB/s of bandwidth. They cant physically perform better than 200~220MB/s so there is enough bandwidth on the SATA and PCIE Bus to cope. LSI rate the 9280-8i @ 2500MB/s but it does work faster with Cache Cade. up to 3200MB/s in that test.
> 
> So a real SAN has many spindles of high performance, and the WD VRs are better than some cheap SSD drives.
> 
> RDMA is very fast, low cpu cycles, and reads and write directly to RAM @ 40Gb/s, we created a RAM drive and tested it up to 3200MB/s which is as fast as the PCIE 2 Bus / LSI could Handle.
> 
> If you cant afford this realtivley cheap equipment (compated to IBM DS or HP StorageWorks SAN) then dont wing at ME. You cant get fast performance. Any real Cloud would have what we have built. None of our customers, ALL 300 Of them! have lost a single bit of data, all VMs are running very quickly, and with no problems now for 3 months.
> 
> So If you dont appreciate being shown how to build a kick ASS SAN then PISS OFF!
> 
> On 24.10.2012 08:17, Jason Davis wrote:
> 
> How often do you have folks cache missing and hitting the spinning rust below the cache cade tier?
> 
> On Oct 23, 2012 6:25 PM, wrote:
> 
> Hi
> 
> with our SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid Director 4036 switches + HP Gen2 QDR Chassis Switches.
> 
> Our San achieves 295,000 - 350,000 Iops max per LSI controller, depending on block size for i/o meter test. LSI9280-8i with battery cache. + cachecade2.0 x 2 in PCIE 16x slot on supermicro motherboard.
> 
> 1 x 250GB SSD for CACHE Cade + 16 1tb WD VR drives (16 - hot stand buy - 2 for raid 6 parity) = 2600MB/s
> 
> most of our test are from 4K to 128K block size, @ 4K we get 295,000 Iops @ 1590 MB/s and @ 128K 350,000 iops @ 2400~2600MB/s
> 
> we have tuned 64K and 128K block size on different luns. 64K for database and 128K for general file.
> 
> The best thing about Infiniband is low CPU cycles. only 5~6% during these tests.
> 
> latency is as low as 4-6ms average read time. concurrent response times are from 5-12ms. even under heavy load its below 20ms
> 
> Infiniband latency is below 0.01 us Which is why we chose it.
> 
> we run MSSQL on the 64K formatted Luns, its massively fast.
> 
> If we copy the same data twice the cache cade kicks in and we achieve even better speeds.
> 
> Ive compared this to block io @ 8Gb/s Fiber channel, and it barely gets 120,000 IOPS. @ much higher latency and bandwidth.
> 
> So for our money RDMA wins!
> 
> Bruce
> 
> On 23.10.2012 21:20, Jason Davis wrote:
> 
> Out of curiosity, is there any quick performance numbers for these ZFS + GlusterFS mashups you guys are talking about?
> 
> Specifically, IOPs and latency? Sequential read/write performance honestly isn't a very good benchmark to determine your SANs performance. It's like comparing CPUs based solely on how many GHz it runs at. Sure you can get great MB or GB/s with SATA disk but I'd reckon that IOP performance is abismal. If you are utilizing GlusterFS without the cache pooling magic that is ZFS then I would imagine that latency can be an issue.
> 
> On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:
> 
>  
> 
> Hi Fabrice,
> 
> i know OpenSolaris/Solaris Oracle it's so a thing. I'm for more then 10 years a open source user and that with oracle - i did no like at the beginning of this constallation. But Oracle makes his work good i know that. The cost of one socket are 700$ and you can use so much quantity of TB as you will. And you can use the full premier Support from Oracle.
> 
> Nexenta develop with the Illumos code. And the Licence are TB based. That is not my favorite. As well the pool version from Nexenta comes not after. Current Nexenta Infiniband are not a usable solution. But every can use what he will. Everyone must decide for themselves.
> 
> SRP Targets or iser are not difficult to configure. Use the SRP for the Storage unit connection. Solaris and GlusterFS builds one Storage unit. The GlusterFS Server export the final Volume to the Clients as well KVM, VMWare, Hyper-V etc. You can use native GlusterFS, RDMA, NFS ore CIFS to export the Volume. SRP have nothing to do with VmWare.
> 
> When you are use a 7200 SAS drive the access time are the same as a SATA drive only the quality of the hardware are better. When you need Performance you must use SAS drives with 15000U/m. But it's not needed when you install SSD for ZIL/L2ARC. ZeusRAM rocks :-)
> 
> I use dedup only at secondary stroage or on Backupserver not on Primary Storage. When you use SSD SATA drives then you have an cheap an fast strorage. 1TB drive cost unter 100$. Current i'm not need to save storage volume.
> 
> Which application use atime? I know "find -atime N". atime on a Storage that stores only Virtual Disks? I don't need the information when i have last opening the disk :-)
> 
> For a Solaris GUI take a look at napp-it http://www.napp-it.org/
> 
> greetings from germany
> Andreas
> 
> ----- Ursprüngliche Mail -----
> Von: "Fabrice Brazier"
> An: cloudstack-users@incubator.apache.org
> Gesendet: Dienstag, 23. Oktober 2012 12:30:50
> Betreff: RE: Primary Storage
> 
> Hi Andreas,
> 
> Hum that's pretty cool, i know they have still trouble with infiniband on nexenta but it's clearly a priority on their roadmap (plus i trust more the nexenta team to evolve zfs than oracle).
> 
> I agree iscsi over infiniband increase the latency but most of the time it's just more simple to use IPOIB than IB, for example if you use vmware the support of infiniband it's assured by mellanox and not vmware so if you have an issue the vmware support probably don't help you anymore.
> 
> About the raid i'm a fan of raid 10, i prefer build a raid 10 with 7200 sas drive than multiple vdev raidz/raidz2 15k sas drive. Particularly for virtual environment with a random workload and multiple access.
> 
> I'm a fan of nfs so agreed about the zil, and zeus ram are the only one for that (with classical ssd you can hit the write hole).
> 
> Agreed for compression too (but only lzjb level gzip use too much cpu), disable atime permit to decrease the iops load but i'm not sure is really a best practice. About the deduplication I don't totally agree. It really depends of your array, the workload and the VM type on your cloudstack. Actually when i build a zfs array i count 1Gb ram for 1Tb disks. With deduplication i count 2Gb ram for 1 Tb disks (plus in nexenta 4 they will add a possibility to use ssd for the deduplication metada, like the l2arc for the read cache).
> 
> The last point it's about your drives, why sata drives? I mean sata doesn't respect their latency, and the reliability of a sas drive is 10x the sata. Plus now you can found many NL-sas drive at low cost.
> 
> But it's really a nice architecture, i never try glusterfs (for the moment) plus in that case it's really a good way to have a replacement to a metro-cluster for free, i try one of is competitor (onefs) and clearly clustered filesystem are the futur.
> 
> Cheers,
> Fabrice
> 
> -----Message d'origine-----
> De : Andreas Huser [mailto:ahuser@7five-edv.de]
> Envoyé : mardi 23 octobre 2012 11:37
> À : cloudstack-users@incubator.apache.org
> Objet : Re: Primary Storage
> 
> Hi Fabrice,
> 
> I don't know what other people do but i have no problems with Infiniband + GlusterFS + Cloudstack. I'm not use Nexenta it's based on Illumos and work not fine with Infiniband. I have two different clusters in productiv enviroments.
> 
> The first: Solaris 11 with built-in GlusterFS 3.3 export Gluster Vol. with RDMA. Performance is okey you can use that for smaller enviroments.
> The Second: Is a little bit complex with a GlusterFS Server in the middle.
> 
> ZFS Server: based on Solaris 11
> 1.) Create a zpool min. two vdevs and SSD read/write cache
> 2.) Create a thin provisioning Volume "zfs create -V" disable atime and enable compression (do not enable dedup!) and export as (iWarp) SRP Target to the GlusterFS Server use a direct connection without IB Switch.
> 
> GlusterFS Server:
> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
> 2.) Use OFED driver from https://www.openfabrics.org
> 3.) Import the SRP Target from ZFS Server and format as xfs
> 4.) Create a Glustervolume "volume create xy transport rdma" (use only rdma)
> 5.) Connect with the second IB Port to a IB Switch
> 
> Cloudstack Hypervisor Node:
> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
> 2.) Use OFED driver from https://www.openfabrics.org
> 3.) Import the Glustervolume
> 
> ZFS Thin Volume ---- Infinband SRP ----> GlusterFS ---- GFSVol rdma ----> IB Switch ----> Clients
> 
> The ZFS and GlusterFS Server form a Storage unit connect direct with 40Gbit Infiniband Point-to-Point. You do not feel that is a cable between!
> 
> Important: When you have Infiniband use not IPoIB with iSCSI! If one already has Infiniband then you should also use the advantage. IPoIB have a higher Latency as iWarp SRP!
> 
> SRP = usec
> 
> -- SRP --
> local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
> remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000
> ------------------------------------------------------------------
> #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
> 2       1000         1.29         125.30       1.31
> ------------------------------------------------------------------
> 
> -- IPoIB ---
> [root@sv01sfogaa ~]# ping 10.200.0.10
> PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
> 64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
> 64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
> 
> When you put load on IPoIB then increases the latency that is not good.
> 
> That is my recommendation for a simple GlusterFS Mirror:
> - Supermicro Server with Intel Hardware and Expander Backplane
> - 1x Crucal-M4 SSD Read Cache
> - 2x ZeusIOPs SSD Write cache (mirror)
> - SATA 24/7 Harddrives
> - LSI HBA 9207 or 9211
> - ConnectX-2 QDR Dualport Infiniband Adaper (HP refurbished with full warrany for 100$) Importent > Flash newest Firmware from Mellanox!
> - Mellanox IB Swich
> - Solaris 11
> - GlusterFS 3.3 compiled with ib_verbs
> - Gluster Volume transport only rdma
> 
> Throughput constant up to 200 Mbyte/s, more throughput with more storage servers or more harddrives on jbod.
> 
> Info:
> - I have some problems with infiniband RDMA or SRP with OpenIndiana or Illumos or Nexenta. Some adapters have a high latency or not a stable connection. Use Solaris that's the right way!
> - OpenIndiana is Beta! Infiniband ib_verbs works not or not fine!
> 
> - Use Solaris
> 11
> 
> Infiniband
> ib_verbs are native and stable
> 
> - Don't
> use Ubuntu Client
> 
> Server
> for Infiniband! Use RedHat, Fedora
> or
> 
> CentOS and install the
>  
> right drivers from
> 
>  
> https://www.openfabrics.org/downloads/OFED/
> 
> 
> -
> You have
> not SSD Cache? Disable at the ZFS Volume the sync!
> 
>  
>  
> Important!
> 
> You
> loose Security for your Data but
> some
> 
> protocols use
> 
> sync flags in
> 
> transport.
> 
> Example NFS use by
> 
> default fsync. write
> 
> cache are not active.
> NFS writes
> 
> data
> direct to the Harddrive. For
> 
> Data
> Security and Performance give
> 
> the
> 
> Storage Server a SSD
> write
> 
> cache. ZFS works at default with
> sync=standard
> 
>  
> that prevent write
> 
> holes. (COW System)
> 
>  
> 
> I hope
> that I could help a little
> 
>  
> 
> Greeting from
>  
> Germany
> 
> Andreas
> 
> 
> 
> -----
> Ursprüngliche
> 
> Mail
> 
> -----
> 
> 
> Von: "Fabrice
> Brazier"
> 
> An:
>  
> cloudstack-users@incubator.apache.org
> 
> 
> Gesendet: Dienstag, 23.
> Oktober
> 
> 2012 09:55:15
> 
>  
> Betreff: RE: Primary Storage
> 
>  
> 
> Hi Andreas,
> 
>  
> 
> 
>  
> Hello i just see your configuration, it seems
> quite
> interesting.
> 
> If i
> understand well you want to build
> some
> 
> zfs array on the backend.
> 
>  
> Export luns (probably
> by iscsi over
> 
> infiniband) to you linux cluster,
> 
> and
> 
>  
> on
> 
> the linux
> cluster you put glusterFS.
> 
> I can
> understand
> 
> the point, with
> that you can have very good
> performance
> 
> and
> 
> reliability
> (zfs),
> scalability and redundancy (gluster) for very low
> 
> cost.
> 
> 
> So just one question, did you try the global namespace
> 
>  
> implementation
> 
> from
> 
> nexenta?
> If yes can you tell me
> what
> 
> configuration
> is the best for you?
> 
> I mean the
> fact you have a gluster
> 
> cluster
> in the middle must impact
> the
> 
> overral performance no?
> 
> 
> 
> 
> Fabrice
> 
> 
> -----Message d'origine-----
> De :
> Andreas
> 
> Huser
> 
> [mailto:ahuser@7five-edv.de] Envoyé : mardi 23
> octobre
> 
>  
> 2012 05:40 À :
> 
>  
> cloudstack-users@incubator.apache.org
> 
> Objet :
> Re: Primary
> 
> Storage
> 
> 
> Hi,
> 
> for Cloudstack i
> use
> 
> Solaris 11 ZFS + GlusterFS
> 
> over Infiniband (RDMA).
> 
>  
> That
> 
>  
> gives the best performance and most
> 
> scalable
> Storage.
> 
> I have
> tasted some different solutions for
> primary
> 
> Storage but the most
> are
> 
> to expensive and for
> a CloudStack Cluster
> 
> not economic or
> have a poor
> 
>  
> performance.
> 
> 
> My Configuration:
> 
>  
>  
> Storage Node:
> 
> Supermicro Server (Intel Hardware) with Solaris
> 11
> 
> with
> 
> SSD write and read
> 
> cache (read crucial-m4,
> write ZeusIOPS)
> 
> GlusterFS
> 
> and dualport ConnectX
> 
>  
> 40Gbit/s Infiniband adapter.
> 
> 
> 
> 
> I have
> installed
> GlusterFS direct on Solaris with a modified
> 
> code.
> 
> Want
> you
> 
> build bigger systems for more then 50 VMs it is
> better you
> split
> 
> the
> Solaris and GlusterFS with a separte
>  
> headnode for GlusterFS
> 
> 
> That
> looks like:
> 
>  
> Solaris ZFS
> 
> Backendstorage with a dataset Volume (Thin
> 
>  
> Provision) --> ( SRP
> 
>  
> Target attached direct without
> Infiniband switch
> 
> to GF Node) -->
> GlusterFS
> 
> Node the
> srp target formatted with xfs
> 
> filesystem,
> create a GlusterFS
> 
> 
> Volume --> ( Infiniband over a
> Mellanox Port
> Switch) -->
> Cloudstack Node
> 
> mount glusterFS Volume over
> RDMA
> 
> 
> 
> 
> For the Dataset Volume at the ZFS Storage, disable atime
> and
> 
>  
> enable
> 
> compression.
> (Space reclaim) With
> compression you
> 
> can
> 
> shrink the ZFS Volume with command
> 
>  
> at Linux dd /dev/zero or
> 
> In a
> 
> Windows VM with sdelete That
> gives you space
> 
> left on the
> Primary
> 
> Storage for
> deleted Files in a VM or for deleted vhd's
> 
>  
> or vm's in the
> 
> cloudstack
> 
> 
> greeting Andreas
> 
> 
> 
> 
> 
> 
> 
> 
> Mit freundlichen
> Grüßen
> 
>  
>  
> Andreas Huser
> 
> 
> Geschäftsführer
> 
> System Engineer
> /
> 
> Consultant
> 
> (Cisco
> CSE, SMBAM, LCSE, ASAM)
> 
>  
> 
> 
>  
> ---------------------------------------
> 
>  
> Zellerstraße 28 - 77654
> 
>  
> Offenburg
> 
> Tel: +49(781)
> 12786898
> 
> Mobil: +49(176) 10308549
> 
> 
> 
> ahuser@7five-edv.de
> 
> 
> 
> 
> 
> 
> -----
> 
> Ursprüngliche Mail -----
> 
>  
> 
> Von: "Outback
> Dingo"
> 
> An:
> cloudstack-users@incubator.apache.org
> 
>  
> 
> Gesendet: Dienstag, 23.
> Oktober 2012 02:15:16
> 
> Betreff:
> Re: Primary
> 
> Storage
> 
>  
> 
> On Mon, Oct 22, 2012 at
> 8:09 PM, Ivan
> 
> Rodriguezwrote:
> 
>  
> 
> 
> 
> 
>  
> Solaris 11 ZFS and yes we tried different
> 
> setups, raids
> levels
> number >> of SSD cache, ARC zfs options etc etc
> 
> etc. >> >>
>  
> Cheers >> >> >> VMWare ?? >>
> 
> 
> 
> 
> 
> 
>  
> 
> 
>  
>  
> 
>  
> 
> 
> -- 
> Regards,
> 
> Nik
> 
> Nik Martin
> VP Business Development
> Nfina Technologies, Inc.
> +1.251.243.0043 x1003
> Relentless Reliability
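
[Editor's note: pulled together, the recipe Andreas describes in the quoted
thread above amounts to only a handful of commands. The following is a rough
sketch, not taken from the thread: the pool, device, volume and brick names
are made up, and the exact SRP-target export and RDMA-mount options vary by
Solaris/OFED/Gluster version.]

    # Solaris 11 ZFS node: pool with two vdevs plus SSD log/cache devices,
    # then a sparse (thin) volume with compression on and dedup left off
    zpool create tank mirror c7t0d0 c7t1d0 mirror c7t2d0 c7t3d0 \
        log c8t0d0 cache c8t1d0
    zfs create -s -V 4T -o compression=on tank/gfsvol
    # export tank/gfsvol as an SRP target over the direct IB link (COMSTAR)

    # CentOS GlusterFS node: format the imported SRP LUN and build an
    # rdma-only volume from it
    mkfs.xfs -i size=512 /dev/sdb
    mount /dev/sdb /export/brick1
    gluster volume create gfs01 transport rdma gfs-node1:/export/brick1
    gluster volume start gfs01

    # CloudStack hypervisor node: mount the volume with the native client
    mount -t glusterfs gfs-node1:/gfs01 /mnt/primary   # rdma-only volume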


RE: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Clayton Weise <cw...@iswest.net>.
You're going to see a dip in performance with KVM primarily because of the qcow2 format (KVM people please correct me if I'm wrong because I'm new to KVM).  Due to time constraints I wasn't able to run any tests with writeback cache turned on or off, but I've seen people in forums claim both that having it on improves performance and that having it off does, so who knows.

What I can tell you is that in my tests with our current setup I was able to achieve about ~4-5k IOPS for random reads/writes using a multitude of different benchmark utilities in both Linux and Windows with KVM's raw disk format.  When I switched over to qcow2 it went down to about 2-2.5k.  With the way our test RAID system is constructed our theoretical max is about 5k IOPS so I'm not sure if the drop is something related to the overall potential (e.g. 50% of the total) or simply a hard cap of that number but I haven't had a chance to really kick things around much.  This test was done using an iSCSI LUN mounted from a Compellent SAN on 10gbit Ethernet and OCFS2 on top of it.  I know there was a comment earlier saying that OCFS2 was old but what I can say about it is that it was extremely easy to configure and get up and running and seemed to be pretty solid when we tested it about a year ago as well as the other week.

Regarding your ZFS on Linux question.  ZFS on Linux just isn't there yet.  It's too new and the work that's being done on it right now is primarily around stability and integrating features and not about performance so if you're going to do ZFS do it on either BSD or Solaris/Illumos.  The alternative to that is the hardware equivalent with something like LSI's cachecade which will do basically the same thing and it will work just fine with Linux.
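
[Editor's note: a quick way to separate the image-format effect from the
cache-mode effect is to rerun the same benchmark against a raw copy of the
image and with the cache mode set explicitly. A sketch only; the file names,
the /dev/vdb target and the choice of 'writeback' are illustrative, not from
the thread.]

    # convert a qcow2 test image to raw and benchmark both backing files
    qemu-img convert -f qcow2 -O raw test.qcow2 test.raw

    # in the guest definition (virsh edit <vm>) the cache mode is a per-disk
    # attribute:  <driver name='qemu' type='qcow2' cache='writeback'/>
    # valid values include none, writethrough and writeback

    # run the same random I/O job against each image from inside the guest
    fio --name=randrw --rw=randrw --bs=4k --iodepth=32 --direct=1 \
        --ioengine=libaio --size=4G --runtime=120 --filename=/dev/vdb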

From: Trevor Francis [mailto:trevor.francis@tgrahamcapital.com]
Sent: Wednesday, October 24, 2012 8:27 AM
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

So, since ZFS is super slow on Centos, would it be better to use ext4 and bcache over the native linux NFS daemon?

Also, I am reading that KVM is much more disk performant using writeback caching. Does Cloudstack deploy the KVM VM instances with this enabled by default?



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com

Solutions Provider for the Telecom Industry


On Oct 24, 2012, at 8:50 AM, Nik Martin <ni...@nfinausa.com>> wrote:


On 10/24/2012 08:29 AM, Trevor Francis wrote:

Seems my original post has taken a detour. Who knew storage could be so
controversial!

Since we are a Centos house and it would be nice to have a standard OS
across our platform, but many people have been singing ZFS' praises. To
bridge that gap, evidently ZFS can be deployed on Centos 6.3 using ZFS
for linux.
Does anyone have experience with this? Would it be better to deploy
Centos with ZFS and a cache drive or Centos 6.3 ext4 with flashcache or
bcache?


I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm being led to believe that it does not perform as well as other filesystems.  For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS.


Also, is the standard linux NFS daemon sufficient to handle a
significant load or do you recommend a different NFS daemon....or go iSCSI ?
Our biggest concern is that we have most VMs doing many sequential small
writes (log files).

When I first started with CS on Xenserver in January, I thought thin provisioning was important, so I built file system based storage using NFS, with XFS as the underlying filesystem, on LVM.  NFS (v3) on Centos 6 became processor bound VERY quickly, running 300% CPU load on a quad core Xeon.

I switched to iSCSI with LVM as backing store, and have had zero issues, except I'm currently investigating lower than desired random read/write on < 4k random.  Using tests like Iozone in a VM, and running iostat -t1 on the hypervisor console itself, I can get sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though.
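
[Editor's note: the small-block test Nik describes can be reproduced with fio
inside the VM while watching the hypervisor. A sketch; the device name and
job parameters are examples, not taken from his setup.]

    # inside the test VM: 4k random read/write against the data disk
    fio --name=4k-rand --rw=randrw --bs=4k --iodepth=16 --direct=1 \
        --ioengine=libaio --size=2G --filename=/dev/vdb

    # on the hypervisor console: per-second extended device stats meanwhile
    iostat -xt 1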



Thoughts before I make a decision?



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
<http://www.peeredge.net/>
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com<ma...@46labs.com> <ma...@46labs.com>
Solutions Provider for the Telecom Industry


On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fa...@apalia.net> wrote:


Hi,

@bruce, it's pretty funny because you bought 30 discs but only 2 LSI 9280-8i, and the 8i means 8 internal ports (so with 2 LSI controllers you can have only 16 disks).

After that a velociraptor disk it's only a 10k discs so we can count 120
IOPs by disks. If we put them in a raid 5 (by the way everybody know that
raid 5 don't scale very well after 12 disk the performance doesn't
increase
so much but nevermind you built your raid 5 of 30 disks).
So you have a maximum of (30-1)*120=3480 iops for read and
((30-1)*120)/5=700 iops write (in a 100% random profile). And you have only
32 GB of RAM, which is peanuts for caching your read iops.

So the only way you have to get performance is your CacheCade.
Now a CacheCade is made of SSDs (weird, actually it's the same idea as the ZIL
and L2ARC in ZFS xD), and the maximum size of a CacheCade volume is 512 GB per
controller. You have 2 controllers, so 1 TB of CacheCade max.

So you say your storage doesn't cost anything but you don't talk about the
size and the price of your cachecade.
And another thing the day you will have a real workload (not only 50
Vms...)
and this workload will be larger than 1Tb come again and explain us your
very large performance.

Me too I can built a zfs server with 4 zeusram in zil and 5 disks
behind and
make a benchmark on files of 2Gb and say hey see I have 500 000 Iops
and 4Gb
of throughput...

Regards
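
[Editor's note: the arithmetic above is easy to redo for other spindle
counts. A sketch, assuming ~120 IOPS per 10k spindle and the RAID-5 write
penalty of roughly 5 back-end I/Os per front-end write that Fabrice uses.]

    disks=30; per_disk=120
    echo "random read  IOPS ~ $(( (disks - 1) * per_disk ))"       # ~3480
    echo "random write IOPS ~ $(( (disks - 1) * per_disk / 5 ))"   # ~696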

----- Original Message -----
From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
Sent: Wednesday, 24 October 2012 06:04
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT



These SANs we have built have saved us a fortune compared to FC block-I/O
SANs from IBM, HP etc

Ive been a contractor for IBM and HP
for 20 years. I know how much they are $$$$$ and what they can do.

A
Decent DS4000 you'll be looking at 80K, or Storageworks $90K. And if your
lucky you may get 100,000 perhaps 150,000 IOPS

What ive build is
better. cheaper, and frankly as i support Linux and SCST, much more
reliable.

the WD VR 1TB we get new for $310, so 30 x = $9,300; LSI 9280-8i x 2 $1800
($900 ea); the case $600; the mobo and 1 x Xeon 5650 $1000; 32GB ECC RAM $800;
dual PSU $650 - about $14,000 for a much superior SAN.

As i explained another $15K for infiniband network. which does SRP and IP

As for your issues with cache misses, the CacheCade doesn't Miss much from
our tests. with WriteBack enabled, BBU enabled, and the SANs are on our
primary 22KVA UPS It never fails.

Also Its not
doing very much unless a user in the cloud is hitting sequential blocks or
R/W the same data repeatedly for some weird reason. Even then we see
little
lagg or slow down, when as you say its hitting the Disc.
Infiniband with LSI Just rocks even more than we had thought!

The R/W
pattern from groups of 50 VM.s appear to be completely random, web
servers,
virtual firewalls, email servers, DNS etc, FTP, windows domains,
sharepoint,
MSSQL, MYSQL working on the same Lun which are not considered critical (as
stated by clients) are each VM getting over 300-1200MB/s constantly even
when other users are pulling or pushing a file copy @ 250Mb/s between VMs
etc. Weve actually had to figure out how to slow down the SRP connections.
If a user is provisioned with 2 Luns from each SAN, in windows you can
copy
between VM Drives @ 1220MB/s!!!!!
faster than SSD in Linux its even faster, and that is in ESXi4.1 with
15~20 Servers per VM Host.

we have snmp monitoring the 2 primary sans, and they never go above 30%
bandwidth. so the point is the Cache has no real bearing on Cloud I/O.
cloud
IO is fairly low. We have LSI supplying us with the new PICE3 cards for a
trial, so i expect even better performance. So long STORY SHORT. GET
LSI AND
16 x WD VR's and try it yourself! And yes we compared IOPS on 5 x 500MB/s
SSD's its SLOWER! MORE SPINDLES = MORE SPEED!

the main reason we wanted fast IO is for backups, snapshots, so as not to
interfere with the daily operational workings of the cloud VM guests.
Which
is doesn't. SOOO many TIMES with FC ive seen the attitude yeah that will
work, and some smart ass does a restore or backup in prod hours, only
to see
the whole cluster slow down to a crawl. on iSCSI its even more pathetic.

With SRP on LSI " WE DONT
HAVE THAT PROBLEM!"

I did a snapshot restore for 2 machines yesterday for a client, it took 2
mins to restore 100GB. And none of the other servers were affected. SNMP
latency on all VM's no impact. Virtually 0CPU cycles on the VMware
hosts for
the IO move. IO latency/storage path for other VM's from snmp data from
VMserver was 2ms which is normal. the machine doing the copy crept up to
10ms but no more.

So my conclusion
ON REAL WORLD OBSERVATIONS IS.... INFINIBAND SRP IS BETTER THAN ANYTHING!

Cache misses we don't have a problem there at all. Were more worried about
Data Rot, which is why were interested in GFS to ZFS Luns.
But ZFS is still slow as compared to what we have.....

we are going to
launch a new script which moves a file full of 16-bit random io for 100MB,
and 1TB, it will create a binary checksum for each file, and we can
compare
it over time to check for data rot.

Just so you can
appreciate the speed. We Ran IO meter from 3 VMs on 3 separate VM Hosts.
the 1st get max i/o ie up to 400,000 IOPS and pushing 2400MB/s we started
the 2nd, while the 1st was running, same results across the 2 VM's.
then the
3rd concurrently. they all dropped to 2000MB/s and IOPS went down to
220,000
per host. so yes there is a penalty. But the way infiniband works is that
the switch balances the load across all servers. 4th server online same
test. and we get 150,000 IOPS on each server, all doing 1700~1800MB/s
@ 128K
blocks 50%50% random seq. change the distribution to 100% random makes
little difference. We confirmed all the drives were being accessed! all
were. we even took a look at the camera out DC to monitor the array
LEDS...
alll Flickers like mad! we took a look at the metrics of the Megaraid
as we
got it to work on CentOS, no errors, no issues or problems, no dropped
reads
or writes.

we set the file on IO meter as high as poss 900000000 units. We then fired
up 2 more hosts and did real world file copy, no real impact still get
2200MB/s+ read write easily. So we stopped IO Meter and did 6 concurrent
file copies between the 2 sans and each host blew us away at 2000MB/s read
and write across all 6 hosts.!! FXXK we were amazed

Aside
from that LSI have done a super job on the Cache modules on their cards.
I dont see how ZFS will ever approach this kind of performance.

Anyway
take from what ive said what you like. We have an awsum SAN... and its
CHEAP
and Amazingly fast!

we're in the process of developing a public cloud, so when the web site is
finished we will publish all of this in a whitepaper there.
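
[Editor's note: the data-rot probe described a few paragraphs up needs
nothing more than dd and a checksum tool. A minimal sketch; the file sizes
and paths are illustrative.]

    # write a file of random data and record its checksum
    dd if=/dev/urandom of=/mnt/lun1/rot-probe.bin bs=1M count=100
    sha256sum /mnt/lun1/rot-probe.bin > /mnt/lun1/rot-probe.sha256

    # later (e.g. weekly, from cron): re-read the file and compare
    sha256sum -c /mnt/lun1/rot-probe.sha256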

On 24.10.2012 10:57, Jason Davis wrote:

Passionate are you? Take it easy, I am not knocking what you are doing.

The term "spinning rust" is a catch-all for hard drives, nothing personal at
all; in fact the setup you have is impressive.

Impressive until you have a cache miss. Cache misses are definitely an issue
even in "capital E" enterprise storage.

Big bad XIV disk arrays or 3par are impressive, fast arrays (crazy expensive
;]) when you are hitting DRAM or SSD, but if you are running cache-unfriendly
workloads then ultimately your array is only as fast as the spinning disk
sitting at the lowest tier.

I agree that open storage is the way to go; from a service provider's
perspective it's hard being competitive with AWS or RackSpace if you are
using expensive traditional SAN.

So anyway, my question wasn't a poke at what you are doing, just curious
about how big of an IOP/latency drop you see once a read or write has to
actually see the spinning disk. :)

On Oct 23, 2012 8:10 PM, wrote:

oh well if you're negative, and make rude comments then that's your problem,
i don't care about you.

If you don't believe that RDMA can perform this fast then again, your
problem.

Ask anyone with a decent 12+ disc RAID array, and RDMA, and they will tell
you it will kick you in YOUR ballz! Anyone in storage knows to get iops and
performance you need 12, preferably 16 spindles.

the spinning rust as you call it, are the newest 200MB/s WD VelociRaptors
with 64MB cache, @ 10KRPM 2.5" discs in a 32 bay storage chassis.

we've used them in the past and they are as reliable as the SAS drives we
use, but faster!

RAID 10 IS BULLSHIT! Bloody slow and 50% goes up in smoke for nothing. All
you get is the mirror speed of 2 drives which is barely 400MB/s tops! I spit
on RAID 10.

We also tried RAID 60 and it was very good as well. But no point to use it
as we replicated to the 2nd SAN.

When we tested our new SANs, we failed 1 drive and set the rebuild rate to
30% on raid 6, with 40% scan rate. The SAN only had 6TB of data on it, and it
rebuilt in 3 hours. We removed a 2nd disc, replaced it with a blank, and it
was rebuilt in another 4 hours. With no real impact to performance tests.

Separating the underlying 6Gb/s SATA ports x 8 into effectively 16 x 3Gb/s
channels using a SAS expander gives each disc 300MB/s of bandwidth. They
can't physically perform better than 200~220MB/s so there is enough bandwidth
on the SATA and PCIe bus to cope. LSI rate the 9280-8i @ 2500MB/s but it does
work faster with CacheCade, up to 3200MB/s in that test.

So a real SAN has many spindles of high performance, and the WD VRs are
better than some cheap SSD drives.

RDMA is very fast, low cpu cycles, and reads and writes directly to RAM @
40Gb/s; we created a RAM drive and tested it up to 3200MB/s which is as fast
as the PCIe 2 bus / LSI could handle.

If you can't afford this relatively cheap equipment (compared to an IBM DS or
HP StorageWorks SAN) then don't whinge at ME. You can't get fast performance.
Any real cloud would have what we have built. None of our customers, ALL 300
of them! have lost a single bit of data, all VMs are running very quickly,
and with no problems now for 3 months.

So if you don't appreciate being shown how to build a kick ASS SAN then PISS
OFF!

On 24.10.2012 08:17, Jason Davis wrote:

How often do you have folks cache missing and hitting the spinning rust
below the cache cade tier?

On Oct 23, 2012 6:25 PM, wrote:

Hi

with our SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid Director 4036
switches + HP Gen2 QDR chassis switches.

Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending on
block size for the I/O meter test. LSI 9280-8i with battery cache +
CacheCade 2.0 x 2 in PCIe 16x slots on a Supermicro motherboard.

1 x 250GB SSD for CacheCade + 16 1TB WD VR drives (16 - hot standby - 2 for
raid 6 parity) = 2600MB/s

most of our tests are from 4K to 128K block size; @ 4K we get 295,000 IOPS @
1590 MB/s and @ 128K 350,000 IOPS @ 2400~2600MB/s

we have tuned 64K and 128K block size on different LUNs. 64K for database
and 128K for general file.

The best thing about Infiniband is low CPU cycles, only 5~6% during these
tests.

latency is as low as 4-6ms average read time. Concurrent response times are
from 5-12ms; even under heavy load it's below 20ms.

Infiniband latency is below 0.01 us, which is why we chose it.

we run MSSQL on the 64K formatted LUNs, it's massively fast.

If we copy the same data twice the CacheCade kicks in and we achieve even
better speeds.

I've compared this to block IO @ 8Gb/s Fibre Channel, and it barely gets
120,000 IOPS, @ much higher latency and bandwidth.

So for our money RDMA wins!

Bruce

On 23.10.2012 21:20, Jason Davis wrote:

Out of curiosity, is there any quick performance numbers for these ZFS +
GlusterFS mashups you guys are talking about?

Specifically, IOPs and latency? Sequential read/write performance honestly
isn't a very good benchmark to determine your SAN's performance. It's like
comparing CPUs based solely on how many GHz they run at. Sure you can get
great MB or GB/s with SATA disk but I'd reckon that IOP performance is
abysmal. If you are utilizing GlusterFS without the cache pooling magic that
is ZFS then I would imagine that latency can be an issue.

On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser wrote:

Hi Fabrice,

i know OpenSolaris/Solaris Oracle, it's such a thing. I have been an open
source user for more than 10 years, and that with oracle - i did not like
this constellation at the beginning. But Oracle does its work well, i know
that. The cost of one socket is 700$ and you can use as much TB as you will.
And you can use the full premier support from Oracle.
Nexenta develops with the Illumos code. And the licence is TB based. That is
not my favorite. Also the pool version from Nexenta does not keep up.
Current Nexenta Infiniband is not a usable solution. But everyone can use
what he will. Everyone must decide for themselves.

SRP targets or iSER are not difficult to configure. Use the SRP for the
storage unit connection. Solaris and GlusterFS build one storage unit. The
GlusterFS server exports the final volume to the clients such as KVM,
VMWare, Hyper-V etc. You can use native GlusterFS, RDMA, NFS or CIFS to
export the volume. SRP has nothing to do with VmWare.

When you use a 7200 SAS drive the access time is the same as a SATA drive,
only the quality of the hardware is better. When you need performance you
must use SAS drives with 15000U/m. But it's not needed when you install SSD
for ZIL/L2ARC. ZeusRAM rocks :-)

I use dedup only at secondary storage or on the backup server, not on
primary storage. When you use SSD SATA drives then you have a cheap and fast
storage. A 1TB drive costs under 100$. Currently i don't need to save
storage volume.

Which application uses atime? I know "find -atime N". atime on a storage
that stores only virtual disks? I don't need the information when i have
last opened the disk :-)

For a Solaris GUI take a look at napp-it
http://www.napp-it.org/

greetings from germany
Andreas
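
[Editor's note: if anyone does want dedup on primary storage despite the
advice in the quoted thread, the RAM rule of thumb quoted there (about 1 GB
per TB of disk, double it with dedup) can be checked against the pool before
switching it on. A sketch: 'tank' is a placeholder pool name, and the ~320
bytes per in-core DDT entry figure is the commonly cited rule of thumb, not a
measured value.]

    # simulate dedup on the existing data and print the would-be DDT histogram
    zdb -S tank

    # on a pool that already has dedup enabled, show DDT statistics
    zpool status -D tank
    # rough in-core DDT size: allocated unique blocks x ~320 bytes per entry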


Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Trevor Francis <tr...@tgrahamcapital.com>.
So, since ZFS is super slow on Centos, would it be better to use ext4 and bcache over the native linux NFS daemon?

Also, I am reading that KVM is much more disk performant using writeback caching. Does Cloudstack deploy the KVM VM instances with this enabled by default?
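
[Editor's note: on the ext4 + bcache idea above, bcache needs a kernel that
carries the bcache patches (it was not in the stock CentOS 6.3 kernel), but
the userspace side is small. A hedged sketch with made-up device names:
/dev/sdb as the spinning backing device, /dev/sdc as the SSD.]

    # create backing and cache devices, then attach the cache set
    make-bcache -B /dev/sdb
    make-bcache -C /dev/sdc
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach   # UUID from bcache-super-show /dev/sdc
    echo writeback > /sys/block/bcache0/bcache/cache_mode      # favour a write-heavy workload

    # filesystem on top, exported over the stock NFS daemon
    mkfs.ext4 /dev/bcache0
    mount /dev/bcache0 /export/primary
    echo '/export/primary *(rw,async,no_root_squash)' >> /etc/exports
    exportfs -ra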



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
 
Solutions Provider for the Telecom Industry

 

On Oct 24, 2012, at 8:50 AM, Nik Martin <ni...@nfinausa.com> wrote:

> On 10/24/2012 08:29 AM, Trevor Francis wrote:
>> Seems my original post has taken a detour. Who knew storage could be so
>> controversial!
>> 
>> Since we are a Centos house and it would be nice to have a standard OS
>> across our platform, but many people have been singing ZFS' praises. To
>> bridge that gap, evidently ZFS can be deployed on Centos 6.3 using ZFS
>> for linux.
>> Does anyone have experience with this? Would it be better to deploy
>> Centos with ZFS and a cache drive or Centos 6.3 ext4 with flashcache or
>> bcache?
> 
> 
> I inquired about ZFS on Linux in the #cloudstack IRC channel, and the quote I got was "ZFS is slow on Linux". I then asked "Slow relative to what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response was "yes", so I'm being led to believe that it does not perform as well as other filesystems.  For large file storage on Linux, which Xen VHDs are, I tend to use XFS if the network protocol is NFS.
> 
>> Also, is the standard linux NFS daemon sufficient to handle a
>> significant load or do you recommend a different NFS daemon....or go iSCSI ?
>> Our biggest concern is that we have most VMs doing many sequential small
>> writes (log files).
> 
> When I first started with CS on Xenserver in January, I thought thin provisioning was important, so I built file system based storage using NFS, with XFS as the underlying filesystem, on LVM.  NFS (v3) on Centos 6 became processor bound VERY quickly, running 300% CPU load on a quad core Xeon.
> 
> I switched to iSCSI with LVM as backing store, and have had zero issues, except I'm currently investigating lower than desired random read/write on < 4k random.  Using tests like Iozone in a VM, and running iostat -t1 on the hypervisor console itself, I can get sustained 2500 IOPS on iSCSI, and it drops to 300 with < 4k blocks of data. I expect to resolve this soon though.
> 
>> 
>> Thoughts before I make a decision?
>> 
>> 
>> 
>> Trevor Francis
>> Partner
>> 46 Labs | The PeerEdge Cloud
>> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
>> <http://www.peeredge.net/>
>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> trevorgfrancis - Skype
>> trevor@46labs.com <ma...@46labs.com>
>> Solutions Provider for the Telecom Industry
>> 
>>>>>>> atime
>>> permit to decrease the iops load but
>>>>> i'm not sure is really a
>>>>>>> 
>>>>> 
>>>>>>> best practice. About the deduplication
>>>>> I don't totally
>>> agree. It
>>>>>>> really
>>>>>>>>> depends of your array, the
>>>>> workload
>>> and the VM type on your
>>>>>>> cloudstack.
>>>>>>>>> Actually when i
>>>>> 
>>> build a zfs array i count 1Gb ram for 1Tb
>>>>>>> disks. With
>>>>>>>>> 
>>>>> 
>>> deduplication i count 2Gb ram for 1 Tb disks (plus in
>>>>>>> nexenta 4
>>> they
>>>>> will
>>>>>>>>> add a possibility to use ssd for the
>>>>>>> 
>>> deduplication metada,
>>>>> like the l2arc
>>>>>>>>> for the read cache).
>>>>> 
>>>>>>> 
>>>>>>>>> The
>>>>>>> last point
>>>>> it's about your drives, why sata
>>> drives? I mean sata
>>>>>>> doesn't
>>>>>>>>> 
>>>>> respect their latency, and
>>> the reliability of a sas drive is
>>>>>>> 10x the
>>>>> sata.
>>>>>>>>> Plus
>>> now you can found many NL-sas drive at low
>>>>>>> cost
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> But it's really a nice architecture, i never try glusterfs
>>>>>>> 
>>>>> 
>>> (for the moment)
>>>>>>>>> plus in that case it's really a good way to
>>> have
>>>>> a
>>>>>>> replacement to a
>>>>>>>>> metro-cluster for free, i try
>>> one of is
>>>>> competitor
>>>>>>> (onefs) and clearly
>>>>>>>>> clustered
>>> filesystem are the
>>>>> futur.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> Cheers,
>>>>>>>>> 
>>> Fabrice
>>>>>>>>> 
>>>>>>>>> -----Message
>>>>> d'origine-----
>>>>>>>>> De :
>>> Andreas
>>>>>>> Huser
>>>>> [mailto:ahuser@7five-edv.de]
>>>>>>>>> Envoyé :
>>> mardi 23 octobre 2012
>>>>>>> 
>>>>> 11:37
>>>>>>>>> À :
>>> cloudstack-users@incubator.apache.org
>>>>>>>>> Objet :
>>>>> Re:
>>>>>>> 
>>> Primary Storage
>>>>>>>>> 
>>>>>>>>> Hi Fabrice,
>>>>>>>>> 
>>>>>>>>> I don't
>>> know
>>>>> what other people
>>>>>>> do but i have no problems with
>>> Infiniband +
>>>>>>>>> 
>>>>> GlusterFS + Cloudstack
>>>>>>> I'm not use
>>> Nexenta it's based on Illumos and
>>>>> work
>>>>>>>>> not fine with
>>>>>>> 
>>> Infiniband.
>>>>>>>>> I have two different
>>>>> clusters in productiv
>>>>>>> 
>>> enviroments.
>>>>>>>>> 
>>>>>>>>> The first: Solaris 11
>>>>> with built-in
>>> GlusterFS 3.3
>>>>>>> export Gluster Vol. with
>>>>>>>>> RDMA. >
>>>>> 
>>> performance is okey you can use that
>>>>>>> for smaller enviroments
>>> The
>>>>>>> 
>>>>>>> Second: Is a little bit complex with a GlusterFS
>>> Server in the
>>>>> middle
>>>>>>>>> 
>>>>>>>>> ZFS Server: based on Solaris
>>> 11
>>>>>>>>> 
>>>>>>> 1.) Create a
>>>>> zpool min. two vdevs and SSD
>>> read/write cache
>>>>>>>>> 2.) Create
>>>>>>> a thin
>>>>> provisioning Volume
>>> "zfs create -V" disable atime and
>>>>>>>>> enable
>>>>>>> 
>>>>> compression
>>> (do not enable dedup!) and export as (iWarp) SRP Target
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> to the GlusterFS Server use a direct connection without IB
>>>>> 
>>> Switch.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> GlusterFS Server:
>>>>>>>>> 1.) Use
>>> RedHat, CentOS or
>>>>> Fedora (I use CentOS 5
>>>>>>> and 6)
>>>>>>>>> 2.)
>>> Use OFED driver from
>>>>> https://www.openfabrics.org
>>>>>>>>> 3.)
>>>>>>> 
>>> Import the SRP Target from ZFS
>>>>> Server and format as xfs
>>>>>>>>> 4.)
>>> Create a
>>>>>>> Glustervolume "volume
>>>>> create xy transport rdma " (use
>>> only
>>>>>>>>> rdma)
>>>>>>>>> 
>>>>>>> 5.) Connect
>>>>> with the second IB
>>> Port to a IB Switch
>>>>>>>>> 
>>>>>>>>> Cloudstack
>>>>>>> 
>>>>> Hypervisor
>>> Node:
>>>>>>>>> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5
>>>>> 
>>> and
>>>>>>> 6)
>>>>>>>>> 2.) Use OFED driver from
>>> https://www.openfabrics.org
>>>>>>> 
>>>>>>> 3.) Import
>>>>>>> the
>>> Glustervolume
>>>>>>>>> 
>>>>>>>>> ZFS Thin Volume ----
>>>>> Infinband SRP
>>> ---->
>>>>>>> GlusterFS ---- GFSVol rdma ---->
>>>>>>>>> IB Switch
>>>>> ---->
>>> Clients
>>>>>>>>> 
>>>>>>>>> The
>>>>>>> ZFS and GlusterFS Server form a
>>> Storage
>>>>> unit connect direct with
>>>>>>> 40Gbit
>>>>>>>>> Infiniband
>>> Point-to-Point You
>>>>> do not feel that is a cable
>>>>>>> between!
>>>>>>> 
>>>>> 
>>>>>>>>> Important: When you
>>>>> have Infiniband use not IPoIB with
>>>>> 
>>>>> iSCSI! If one
>>>>>>>>> already
>>>>>>>>> 
>>>>> has Infiniband then you
>>> should also use the
>>>>>>> advantage.
>>>>>>>>> IPoIB
>>>>> have a higher
>>> Latency as iWarp SRP!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> SRP =
>>>>>>> usec
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>>>> -- SRP --
>>>>>>>>> local address: LID 0x01 QPN
>>> 0x44004b PSN
>>>>>>> 
>>>>> 0xf3265b RKey 0x9804237c VAddr
>>>>>>>>> 
>>> 0x00000001dda000 remote address:
>>>>> LID
>>>>>>> 0x0a QPN 0x10004a PSN
>>> 0x44072e RKey
>>>>>>>>> 0x1c0f115 VAddr
>>>>>>> 
>>>>> 0x000000088e6000
>>>>>>> 
>>>>> 
>>>>>>> 
>>>>> 
>>> ------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> #bytes #iterations t_min[usec] t_max[usec]
>>> t_typical[usec]
>>>>>>>>> 2
>>>>> 1000
>>>>>>> 1.29 125.30 1.31
>>>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> ------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>> -- IPoIB ---
>>>>>>>>> [root@sv01sfogaa ~]# ping
>>> 10.200.0.10
>>>>>>> 
>>>>>>> PING
>>>>>>> 10.200.0.10 (10.200.0.10) 56(84)
>>> bytes of data.
>>>>>>>>> 64
>>>>> bytes from
>>>>>>> 10.200.0.10: icmp_seq=1
>>> ttl=255 time=0.147 ms
>>>>>>>>> 64
>>>>> bytes from
>>>>>>> 10.200.0.10:
>>> icmp_seq=2 ttl=255 time=0.116 ms
>>>>>>>>> 
>>>>>>>>> 
>>>>> When you put load
>>> on
>>>>>>> IPoIB then increases the latency that is not
>>>>> good
>>>>>>> 
>>>>> 
>>>>>>>>> 
>>>>>>>>> That is my
>>>>>>> recommendation for a simple
>>>>> 
>>> GlusterFS Mirror:
>>>>>>>>> 
>>>>>>>>> - Supermicro Server
>>>>>>> with Intel
>>> Hardware
>>>>> and Expander Backplane
>>>>>>>>> - 1x Crucal-M4 SSD Read
>>>>> 
>>>>> Cache
>>>>>>>>> - 2x
>>>>> ZeusIOPs SSD Write cache (mirror)
>>>>>>>>> -
>>> SATA 24/7
>>>>>>> Harddrives
>>>>>>>>> 
>>>>> - LSI HBA 9207 or 9211
>>>>>>>>> -
>>> ConnectX-2 QDR Dualport
>>>>>>> Infiniband
>>>>> Adaper (HP refurbished with
>>> full
>>>>>>>>> warrany for 100$)
>>>>>>> Importent >
>>>>> Flash newest
>>> Firmware from Mellanox!
>>>>>>>>> - Mellanox IB
>>>>>>> Swich
>>>>>>>>> 
>>>>> -
>>> Solaris 11
>>>>>>>>> - GlusterFS 3.3 compiled with ib_verbs
>>>>>>>>> -
>>>>> 
>>>>> 
>>>>> Gluster Volume transport only rdma
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Throughput
>>>>>>> constant up to 200
>>> Mbyte/s >> more throughput with
>>>>> more storage servers
>>>>>>> or more
>>> harddrives on jbod >> >> >>
>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> Info:
>>>>> 
>>>>>>> 
>>>>>>>>> - I have some
>>>>>>> problems with infiniband
>>>>> RDMA or
>>> SRP with OpenIndiana or
>>>>>>>>> Illumos or
>>>>>>> Nexenta. Some
>>>>> 
>>> adapters have a high latency or not a stable
>>>>>>>>> 
>>>>>>> connection.
>>> Use
>>>>> Solaris that's the right way!
>>>>>>>>> - OpenIndiana is Beta!
>>>>> 
>>>>> Infiniband
>>>>> ib_verbs works not or not fine!
>>>>>>>>> - Use Solaris
>>> 11
>>>>>>> Infiniband
>>>>> ib_verbs are native and stable
>>>>>>>>> - Don't
>>> use Ubuntu Client
>>>>>>> Server
>>>>> for Infiniband! Use RedHat, Fedora
>>> or
>>>>>>>>> CentOS and install the
>>>>>>> 
>>>>> right drivers from
>>>>>>>>> 
>>> https://www.openfabrics.org/downloads/OFED/
>>>>>>> 
>>>>>>> -
>>>>>>> You have
>>> not SSD Cache? Disable at the ZFS Volume the sync!
>>>>>>> 
>>>>> 
>>> Important!
>>>>>>>>> You
>>>>>>>>> loose Security for your Data but
>>> some
>>>>> protocols use
>>>>>>> sync flags in
>>>>>>>>> transport.
>>>>>>>>> 
>>> Example NFS use by
>>>>> default fsync. write
>>>>>>> cache are not active.
>>> NFS writes
>>>>>>>>> data
>>>>> direct to the Harddrive. For
>>>>>>> Data
>>> Security and Performance give
>>>>> the
>>>>>>>>> Storage Server a SSD
>>> write
>>>>>>> cache. ZFS works at default with
>>>>> sync=standard
>>>>>>>>> 
>>> that prevent write
>>>>>>> holes. (COW System)
>>>>>>>>> 
>>>>>>> 
>>>>>>> I hope
>>> that I could help a little
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> Greeting from
>>>>> 
>>> Germany
>>>>>>>>> Andreas
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -----
>>> Ursprüngliche
>>>>> Mail
>>>>>>> -----
>>>>>>>>> 
>>>>>>>>> Von: "Fabrice
>>> Brazier"
>>>>>>>>> An:
>>>>>>> 
>>>>> cloudstack-users@incubator.apache.org
>>>>> 
>>>>>>> Gesendet: Dienstag, 23.
>>>>> Oktober
>>>>>>> 2012 09:55:15
>>>>>>>>> 
>>> Betreff: RE: Primary Storage
>>>>>>>>> 
>>>>>>>>> 
>>>>> Hi Andreas,
>>>>>>> 
>>>>> 
>>>>>>>>> 
>>>>>>> Hello i just see your configuration, it seems
>>>>> quite
>>> interesting.
>>>>>>>>> If i
>>>>>>> understand well you want to build
>>> some
>>>>> zfs array on the backend.
>>>>>>>>> 
>>>>>>> Export luns (probably
>>> by iscsi over
>>>>> infiniband) to you linux cluster,
>>>>>>> and
>>>>>>>>> 
>>> on
>>>>>>>>> the linux
>>>>> cluster you put glusterFS.
>>>>>>>>> I can
>>> understand
>>>>>>> the point, with
>>>>> that you can have very good
>>> performance
>>>>>>>>> and
>>>>>>>>> 
>>>>>>> reliability
>>>>> (zfs),
>>> scalability and redundancy (gluster) for very low
>>>>>>> cost.
>>>>>>> 
>>>>> 
>>>>> So just one question, did you try the global namespace
>>>>>>> 
>>> implementation
>>>>> from
>>>>>>>>> nexenta?
>>>>>>>>> If yes can you tell me
>>> what
>>>>>>> configuration
>>>>> is the best for you?
>>>>>>>>> I mean the
>>> fact you have a gluster
>>>>>>> cluster
>>>>> in the middle must impact
>>> the
>>>>>>>>> overral performance no?
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>> Fabrice
>>>>>>>>> 
>>>>>>>>> -----Message d'origine-----
>>>>>>>>> De :
>>> Andreas
>>>>> Huser
>>>>>>> [mailto:ahuser@7five-edv.de] Envoyé : mardi 23
>>> octobre
>>>>>>>>> 
>>>>> 2012 05:40 À :
>>>>>>> 
>>> cloudstack-users@incubator.apache.org
>>>>>>>>> Objet :
>>>>> Re: Primary
>>>>> 
>>>>> Storage
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> for Cloudstack i
>>> use
>>>>> Solaris 11 ZFS + GlusterFS
>>>>>>> over Infiniband (RDMA).
>>>>>>>>> 
>>> That
>>>>>>>>> 
>>>>> gives the best performance and most
>>>>>>> scalable
>>> Storage.
>>>>>>>>> I have
>>>>> tasted some different solutions for
>>> primary
>>>>>>> Storage but the most
>>>>> are
>>>>>>>>> to expensive and for
>>> a CloudStack Cluster
>>>>>>> not economic or
>>>>> have a poor
>>>>>>>>> 
>>> performance.
>>>>>>>>> 
>>>>>>>>> My Configuration:
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> Storage Node:
>>>>>>>>> Supermicro Server (Intel Hardware) with Solaris
>>> 11
>>>>> with
>>>>>>> SSD write and read
>>>>>>>>> cache (read crucial-m4,
>>> write ZeusIOPS)
>>>>> GlusterFS
>>>>>>> and dualport ConnectX
>>>>>>>>> 
>>> 40Gbit/s Infiniband adapter.
>>>>>>> 
>>>>>>> 
>>>>>>>>> I have
>>>>>>> installed
>>> GlusterFS direct on Solaris with a modified
>>>>> code.
>>>>>>>>> Want
>>> you
>>>>>>> build bigger systems for more then 50 VMs it is
>>>>> better you
>>> split
>>>>>>>>> the
>>>>>>> Solaris and GlusterFS with a separte
>>>>> 
>>> headnode for GlusterFS
>>>>>>>>> 
>>>>>>>>> That
>>>>>>> looks like:
>>>>>>>>> 
>>> Solaris ZFS
>>>>> Backendstorage with a dataset Volume (Thin
>>>>>>> 
>>> Provision) --> ( SRP
>>>>>>>>> 
>>>>> Target attached direct without
>>> Infiniband switch
>>>>>>> to GF Node) -->
>>>>> GlusterFS
>>>>>>>>> Node the
>>> srp target formatted with xfs
>>>>>>> filesystem,
>>>>> create a GlusterFS
>>>>> 
>>>>>>> Volume --> ( Infiniband over a
>>>>>>> Mellanox Port
>>>>> Switch) -->
>>> Cloudstack Node
>>>>>>>>> mount glusterFS Volume over
>>>>>>> RDMA
>>>>>>> 
>>>>> 
>>>>> 
>>>>>>>>> For the Dataset Volume at the ZFS Storage, disable atime
>>> and
>>>>>>> 
>>>>> enable
>>>>>>>>> compression.
>>>>>>>>> (Space reclaim) With
>>> compression you
>>>>> can
>>>>>>> shrink the ZFS Volume with command
>>>>>>>>> 
>>> at Linux dd /dev/zero or
>>>>> In a
>>>>>>> Windows VM with sdelete That
>>> gives you space
>>>>>>>>> left on the
>>>>> Primary
>>>>>>> Storage for
>>> deleted Files in a VM or for deleted vhd's
>>>>>>>>> 
>>>>> or vm's in the
>>>>> 
>>>>> cloudstack
>>>>>>>>> 
>>>>>>>>> greeting Andreas
>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Mit freundlichen
>>>>>>> Grüßen
>>>>>>>>> 
>>>>>>> 
>>>>> Andreas Huser
>>>>>>> 
>>>>>>> Geschäftsführer
>>>>>>>>> System Engineer
>>> /
>>>>>>> Consultant
>>>>>>>>> (Cisco
>>>>> CSE, SMBAM, LCSE, ASAM)
>>>>>>> 
>>>>> 
>>>>>>> 
>>>>> ---------------------------------------
>>>>>>>>> 
>>> Zellerstraße 28 - 77654
>>>>>>> 
>>>>> Offenburg
>>>>>>>>> Tel: +49(781)
>>> 12786898
>>>>>>>>> Mobil: +49(176) 10308549
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>> ahuser@7five-edv.de
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>> -----
>>>>> Ursprüngliche Mail -----
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> Von: "Outback
>>> Dingo"
>>>>>>>>> An:
>>>>> cloudstack-users@incubator.apache.org
>>>>>>>>> 
>>>>> 
>>>>> Gesendet: Dienstag, 23.
>>>>> Oktober 2012 02:15:16
>>>>>>>>> Betreff:
>>> Re: Primary
>>>>>>> Storage
>>>>>>>>> 
>>>>>>>>> 
>>>>> On Mon, Oct 22, 2012 at
>>> 8:09 PM, Ivan
>>>>>>> Rodriguezwrote:
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>>> 
>>> Solaris 11 ZFS and yes we tried different
>>>>>>> setups, raids
>>>>> levels
>>> number >> of SSD cache, ARC zfs options etc etc
>>>>>>> etc. >> >>
>>>>> 
>>> Cheers >> >> >> VMWare ?? >>
>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>> 
> 
> 
> -- 
> Regards,
> 
> Nik
> 
> Nik Martin
> VP Business Development
> Nfina Technologies, Inc.
> +1.251.243.0043 x1003
> Relentless Reliability


Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Nik Martin <ni...@nfinausa.com>.
On 10/24/2012 08:29 AM, Trevor Francis wrote:
> Seems my original post has taken a detour. Who knew storage could be so
> controversial!
>
> Since we are a Centos house and it would be nice to have a standard OS
> across our platform, but many people have been singing ZFS' praises. To
> bridge that gap, evidently ZFS can be deployed on Centos 6.3 using ZFS
> for linux.
> Does anyone have experience with this? Would it be better to deploy
> Centos with ZFS and a cache drive or Centos 6.3 ext4 with flashcache or
> bcache?


I inquired about ZFS on Linux in the #cloudstack IRC channel, and the 
quote I got was "ZFS is slow on Linux". I then asked "Slow relative to 
what, ZFS on Solaris, ext4 on Linux, ZFS on Linux?", and the response 
was "yes", so I'm being led to believe that it does not perform as well 
as other filesystems.  For large file storage on Linux, which Xen VHDs 
are, I tend to use XFS if the network protocol is NFS.
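
In case it helps, this is roughly the layout I mean -- a minimal sketch 
only, and the volume names, mount options and export subnet below are 
placeholders rather than a tested recommendation:

  # XFS on an LVM logical volume, exported over NFS
  mkfs.xfs -L primary1 /dev/vg_store/lv_primary
  mkdir -p /export/primary
  mount -o noatime,inode64 /dev/vg_store/lv_primary /export/primary

  # /etc/exports -- give the hypervisor subnet read/write access, e.g.
  #   /export/primary 10.0.0.0/24(rw,no_root_squash)
  exportfs -ra
  service nfs restart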

> Also, is the standard linux NFS daemon sufficient to handle a
> significant load or do you recommend a different NFS daemon....or go iSCSI ?
> Our biggest concern is that we have most VMs doing many sequential small
> writes (log files).

When I first started with CS on XenServer in January, I thought thin 
provisioning was important, so I built file system based storage using 
NFS, with XFS as the underlying filesystem, on LVM.  NFS (v3) on CentOS 
6 became processor bound VERY quickly, running at 300% CPU load on a 
quad core Xeon.

I switched to iSCSI with LVM as the backing store, and have had zero 
issues, except that I'm currently investigating lower-than-desired random 
read/write performance at block sizes below 4k.  Using tests like Iozone 
in a VM, and running iostat -t 1 on the hypervisor console itself, I can 
get sustained 2500 IOPS on iSCSI, but it drops to 300 with < 4k blocks of 
data. I expect to resolve this soon though.
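
For reference, the kind of check I mean looks something like this (the 
file size, record size and paths are only examples):

  # inside the guest: random read/write at a small record size, direct I/O
  iozone -I -i 0 -i 2 -r 4k -s 1g -f /mnt/test/iozone.tmp

  # on the hypervisor console at the same time, watch per-device IOPS
  # (r/s + w/s) and await:
  iostat -xt 1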

>
> Thoughts before I make a decision?
>
>
>
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com | http://www.peeredge.net
> 405-362-0046 - Voice  | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com
> Solutions Provider for the Telecom Industry
>
>
> On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fabrice.brazier@apalia.net> wrote:
>
>> Hi,
>>
>> @bruce, it's pretty fun because you bought 30 discs but only 2 LSI 9280-8i
>> or the 8i mean 8 internals ports (so with 2 lsi you can have only 16
>> disks).
>>
>> After that a velociraptor disk it's only a 10k discs so we can count 120
>> IOPs by disks. If we put them in a raid 5 (by the way everybody know that
>> raid 5 don't scale very well after 12 disk the performance doesn't
>> increase
>> so much but nevermind you built your raid 5 of 30 disks).
>> So you have a maximum of (30-1)*120=3480 iops for read and
>> ((30-1)*120)/5=700 iops write (in a 100% random profile) And you have only
>> 32Gb ram wich is peanuts to caching your read iops.
>>
>> So the only way you have to make performance is your cahecade.
>> Now a cachecade is made of ssd (weird actually it's the same as zil and
>> l2arc in zfs xD) plus the max size of a cachecade and the maximum size
>> of a
>> cachecade is 512 Gb by controller. You have 2 controller so 1Tb cachecade
>> max.
>>
>> So you say your storage doesn't cost anything but you don't talk about the
>> size and the price of your cachecade.
>> And another thing the day you will have a real workload (not only 50
>> Vms...)
>> and this workload will be larger than 1Tb come again and explain us your
>> very large performance.
>>
>> Me too I can built a zfs server with 4 zeusram in zil and 5 disks
>> behind and
>> make a benchmark on files of 2Gb and say hey see I have 500 000 Iops
>> and 4Gb
>> of throughput...
>>
>> Regards
>>
>> -----Original Message-----
>> From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
>> Sent: Wednesday, 24 October 2012 06:04
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
>>
>>
>>
>> These Sans we have build have saved us a fortune $ compared to FC
>> Block I/O
>> Sans from IBM, HP etc
>>
>> Ive been a contractor for IBM and HP
>> for 20 years. I know how much they are $$$$$ and what they can do.
>>
>> A
>> Decent DS4000 you'll be looking at 80K, or Storageworks $90K. And if your
>> lucky you may get 100,000 perhaps 150,000 IOPS
>>
>> What ive build is
>> better. cheaper, and frankly as i support Linux and SCST, much more
>> reliable.
>>
>> the WD VR 1TB we get new for $310 so 30 x =$9,300 LSI 92808-8i x 2 $1800
>> ($900 ea) The case $600 the mobo and 1 x Xeon 5650 $1000, 32GB ECCRam
>> $800,
>> Dual PSU $650, about $14,000 for a much superior SAN.
>>
>> As i explained another $15K for infiniband network. which does SRP and IP
>>
>> As for your issues with cache misses, the CacheCade doesn't Miss much from
>> our tests. with WriteBack enabled, BBU enabled, and the SANs are on our
>> primary 22KVA UPS It never fails.
>>
>> Also Its not
>> doing very much unless a user in the cloud is hitting sequential blocks or
>> R/W the same data repeatedly for some weird reason. Even then we see
>> little
>> lagg or slow down, when as you say its hitting the Disc.
>> Infiniband with LSI Just rocks even more than we had thought!
>>
>> The R/W
>> pattern from groups of 50 VM.s appear to be completely random, web
>> servers,
>> virtual firewalls, email servers, DNS etc, FTP, windows domains,
>> sharepoint,
>> MSSQL, MYSQL working on the same Lun which are not considered critical (as
>> stated by clients) are each VM getting over 300-1200MB/s constantly even
>> when other users are pulling or pushing a file copy @ 250Mb/s between VMs
>> etc. Weve actually had to figure out how to slow down the SRP connections.
>> If a user is provisioned with 2 Luns from each SAN, in windows you can
>> copy
>> between VM Drives @ 1220MB/s!!!!!
>> faster than SSD in Linux its even faster, and that is in ESXi4.1 with
>> 15~20 Servers per VM Host.
>>
>> we have snmp monitoring the 2 primary sans, and they never go above 30%
>> bandwidth. so the point is the Cache has no real bearing on Cloud I/O.
>> cloud
>> IO is fairly low. We have LSI supplying us with the new PICE3 cards for a
>> trial, so i expect even better performance. So long STORY SHORT. GET
>> LSI AND
>> 16 x WD VR's and try it yourself! And yes we compared IOPS on 5 x 500MB/s
>> SSD's its SLOWER! MORE SPINDLES = MORE SPEED!
>>
>> the main reason we wanted fast IO is for backups, snapshots, so as not to
>> interfere with the daily operational workings of the cloud VM guests.
>> Which
>> is doesn't. SOOO many TIMES with FC ive seen the attitude yeah that will
>> work, and some smart ass does a restore or backup in prod hours, only
>> to see
>> the whole cluster slow down to a crawl. on iSCSI its even more pathetic.
>>
>> With SRP on LSI " WE DONT
>> HAVE THAT PROBLEM!"
>>
>> I did a snapshot restore for 2 machines yesterday for a client, it took 2
>> mins to restore 100GB. And none of the other servers were affected. SNMP
>> latency on all VM's no impact. Virtually 0CPU cycles on the VMware
>> hosts for
>> the IO move. IO latency/storage path for other VM's from snmp data from
>> VMserver was 2ms which is normal. the machine doing the copy crept up to
>> 10ms but no more.
>>
>> So my conclusion
>> ON REAL WORLD OBSERVATIONS IS.... INFINIBAND SRP IS BETTER THAN ANYTHING!
>>
>> Cache misses we don't have a problem there at all. Were more worried about
>> Data Rot, which is why were interested in GFS to ZFS Luns.
>> But ZFS is still slow as compared to what we have.....
>>
>> we are going to
>> launch a new script which moves a file full of 16-bit random io for 100MB,
>> and 1TB, it will create a binary checksum for each file, and we can
>> compare
>> it over time to check for data rot.
>>
>> Just so you can
>> appreciate the speed. We Ran IO meter from 3 VMs on 3 separate VM Hosts.
>> the 1st get max i/o ie up to 400,000 IOPS and pushing 2400MB/s we started
>> the 2nd, while the 1st was running, same results across the 2 VM's.
>> then the
>> 3rd concurrently. they all dropped to 2000MB/s and IOPS went down to
>> 220,000
>> per host. so yes there is a penalty. But the way infiniband works is that
>> the switch balances the load across all servers. 4th server online same
>> test. and we get 150,000 IOPS on each server, all doing 1700~1800MB/s
>> @ 128K
>> blocks 50%50% random seq. change the distribution to 100% random makes
>> little difference. We confirmed all the drives were being accessed! all
>> were. we even took a look at the camera out DC to monitor the array
>> LEDS...
>> alll Flickers like mad! we took a look at the metrics of the Megaraid
>> as we
>> got it to work on CentOS, no errors, no issues or problems, no dropped
>> reads
>> or writes.
>>
>> we set the file on IO meter as high as poss 900000000 units. We then fired
>> up 2 more hosts and did real world file copy, no real impact still get
>> 2200MB/s+ read write easily. So we stopped IO Meter and did 6 concurrent
>> file copies between the 2 sans and each host blew us away at 2000MB/s read
>> and write across all 6 hosts.!! FXXK we were amazed
>>
>> Aside
>> from that LSI have done a super job on the Cache modules on their cards.
>> I dont see how ZFS will ever approach this kind of performance.
>>
>> Anyway
>> take from what ive said what you like. We have an awsum SAN... and its
>> CHEAP
>> and Amazingly fast!
>>
>> were in the process of developing a public Cloud, so when the web site is
>> finn we will publish all of this in a whitepaper there.
>>


-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Jason Davis <sc...@gmail.com>.
ZFS for Linux is still somewhat in its infancy; it's certainly not as
performant as ZFS running on FreeBSD or Solaris/Illumos.

If you are a CentOS shop, then using LSI's controller cards with CacheCade
would be the route I would go. Phoronix has some decent benchmarks with ZFS
on Linux if you are interested:

http://www.phoronix.com/scan.php?page=article&item=linux_zfs_june2012&num=1
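
If you want to sanity check numbers like those on your own hardware, a 
quick fio run against whichever filesystem you're considering is usually 
enough -- the job parameters here are just an example, not a tuned profile:

  fio --name=randrw --filename=/mnt/test/fio.dat --size=2g \
      --rw=randrw --rwmixread=70 --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=16 --runtime=60 --group_reporting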

The standard NFS daemon seems to do well in my experience, though I must
say I haven't had much actual experience with a thousand-plus clients, or
anything near that, connecting to it.
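
One thing worth checking before blaming the daemon itself is the server 
thread count -- the default of 8 nfsd threads is low for a busy primary 
storage box. Roughly, on RHEL/CentOS (the count of 64 is just an example):

  # see how busy the nfsd threads have been (the "th" line)
  grep th /proc/net/rpc/nfsd

  # raise the thread count in /etc/sysconfig/nfs, e.g.
  #   RPCNFSDCOUNT=64
  service nfs restart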



On Wed, Oct 24, 2012 at 8:29 AM, Trevor Francis <
trevor.francis@tgrahamcapital.com> wrote:

> Seems my original post has taken a detour. Who knew storage could be so
> controversial!
>
> Since we are a Centos house and it would be nice to have a standard OS
> across our platform, but many people have been singing ZFS' praises. To
> bridge that gap, evidently ZFS can be deployed on Centos 6.3 using ZFS for
> linux.
> Does anyone have experience with this? Would it be better to deploy Centos
> with ZFS and a cache drive or Centos 6.3 ext4 with flashcache or bcache?
> Also, is the standard linux NFS daemon sufficient to handle a significant
> load or do you recommend a different NFS daemon....or go iSCSI ?
> Our biggest concern is that we have most VMs doing many sequential small
> writes (log files).
>
> Thoughts before I make a decision?
>
>
>
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com | http://www.peeredge.net
> 405-362-0046 - Voice  | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com
> Solutions Provider for the Telecom Industry
>
>
> On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fa...@apalia.net>
> wrote:
>
> Hi,
>
> @bruce, it's pretty fun because you bought 30 discs but only 2 LSI 9280-8i
> or the 8i mean 8 internals ports (so with 2 lsi you can have only 16
> disks).
>
> After that a velociraptor disk it's only a 10k discs so we can count 120
> IOPs by disks. If we put them in a raid 5 (by the way everybody know that
> raid 5 don't scale very well after 12 disk the performance doesn't increase
> so much but nevermind you built your raid 5 of 30 disks).
> So you have a maximum of (30-1)*120=3480 iops for read and
> ((30-1)*120)/5=700 iops write (in a 100% random profile) And you have only
> 32Gb ram wich is peanuts to caching your read iops.
>
> So the only way you have to make performance is your cahecade.
> Now a cachecade is made of ssd (weird actually it's the same as zil and
> l2arc in zfs xD) plus the max size of a cachecade and the maximum size of a
> cachecade is 512 Gb by controller. You have 2 controller so 1Tb cachecade
> max.
>
> So you say your storage doesn't cost anything but you don't talk about the
> size and the price of your cachecade.
> And another thing the day you will have a real workload (not only 50
> Vms...)
> and this workload will be larger than 1Tb come again and explain us your
> very large performance.
>
> Me too I can built a zfs server with 4 zeusram in zil and 5 disks behind
> and
> make a benchmark on files of 2Gb and say hey see I have 500 000 Iops and
> 4Gb
> of throughput...
>
> Regards
>
> -----Original Message-----
> From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
> Sent: Wednesday, 24 October 2012 06:04
> To: cloudstack-users@incubator.apache.org
> Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
>
>
>
> These Sans we have build have saved us a fortune $ compared to FC Block I/O
> Sans from IBM, HP etc
>
> Ive been a contractor for IBM and HP
> for 20 years. I know how much they are $$$$$ and what they can do.
>
> A
> Decent DS4000 you'll be looking at 80K, or Storageworks $90K. And if your
> lucky you may get 100,000 perhaps 150,000 IOPS
>
> What ive build is
> better. cheaper, and frankly as i support Linux and SCST, much more
> reliable.
>
> the WD VR 1TB we get new for $310 so 30 x =$9,300 LSI 92808-8i x 2 $1800
> ($900 ea) The case $600 the mobo and 1 x Xeon 5650 $1000, 32GB ECCRam $800,
> Dual PSU $650, about $14,000 for a much superior SAN.
>
> As i explained another $15K for infiniband network. which does SRP and IP
>
> As for your issues with cache misses, the CacheCade doesn't Miss much from
> our tests. with WriteBack enabled, BBU enabled, and the SANs are on our
> primary 22KVA UPS It never fails.
>
> Also Its not
> doing very much unless a user in the cloud is hitting sequential blocks or
> R/W the same data repeatedly for some weird reason. Even then we see little
> lagg or slow down, when as you say its hitting the Disc.
> Infiniband with LSI Just rocks even more than we had thought!
>
> The R/W
> pattern from groups of 50 VM.s appear to be completely random, web servers,
> virtual firewalls, email servers, DNS etc, FTP, windows domains,
> sharepoint,
> MSSQL, MYSQL working on the same Lun which are not considered critical (as
> stated by clients) are each VM getting over 300-1200MB/s constantly even
> when other users are pulling or pushing a file copy @ 250Mb/s between VMs
> etc. Weve actually had to figure out how to slow down the SRP connections.
> If a user is provisioned with 2 Luns from each SAN, in windows you can copy
> between VM Drives @ 1220MB/s!!!!!
> faster than SSD in Linux its even faster, and that is in ESXi4.1 with
> 15~20 Servers per VM Host.
>
> we have snmp monitoring the 2 primary sans, and they never go above 30%
> bandwidth. so the point is the Cache has no real bearing on Cloud I/O.
> cloud
> IO is fairly low. We have LSI supplying us with the new PICE3 cards for a
> trial, so i expect even better performance. So long STORY SHORT. GET LSI
> AND
> 16 x WD VR's and try it yourself! And yes we compared IOPS on 5 x 500MB/s
> SSD's its SLOWER! MORE SPINDLES = MORE SPEED!
>
> the main reason we wanted fast IO is for backups, snapshots, so as not to
> interfere with the daily operational workings of the cloud VM guests. Which
> is doesn't. SOOO many TIMES with FC ive seen the attitude yeah that will
> work, and some smart ass does a restore or backup in prod hours, only to
> see
> the whole cluster slow down to a crawl. on iSCSI its even more pathetic.
>
> With SRP on LSI " WE DONT
> HAVE THAT PROBLEM!"
>
> I did a snapshot restore for 2 machines yesterday for a client, it took 2
> mins to restore 100GB. And none of the other servers were affected. SNMP
> latency on all VM's no impact. Virtually 0CPU cycles on the VMware hosts
> for
> the IO move. IO latency/storage path for other VM's from snmp data from
> VMserver was 2ms which is normal. the machine doing the copy crept up to
> 10ms but no more.
>
> So my conclusion
> ON REAL WORLD OBSERVATIONS IS.... INFINIBAND SRP IS BETTER THAN ANYTHING!
>
> Cache misses we don't have a problem there at all. Were more worried about
> Data Rot, which is why were interested in GFS to ZFS Luns.
> But ZFS is still slow as compared to what we have.....
>
> we are going to
> launch a new script which moves a file full of 16-bit random io for 100MB,
> and 1TB, it will create a binary checksum for each file, and we can compare
> it over time to check for data rot.
>
> Just so you can
> appreciate the speed. We Ran IO meter from 3 VMs on 3 separate VM Hosts.
> the 1st get max i/o ie up to 400,000 IOPS and pushing 2400MB/s we started
> the 2nd, while the 1st was running, same results across the 2 VM's. then
> the
> 3rd concurrently. they all dropped to 2000MB/s and IOPS went down to
> 220,000
> per host. so yes there is a penalty. But the way infiniband works is that
> the switch balances the load across all servers. 4th server online same
> test. and we get 150,000 IOPS on each server, all doing 1700~1800MB/s @
> 128K
> blocks 50%50% random seq. change the distribution to 100% random makes
> little difference. We confirmed all the drives were being accessed! all
> were. we even took a look at the camera out DC to monitor the array LEDS...
> alll Flickers like mad! we took a look at the metrics of the Megaraid as we
> got it to work on CentOS, no errors, no issues or problems, no dropped
> reads
> or writes.
>
> we set the file on IO meter as high as poss 900000000 units. We then fired
> up 2 more hosts and did real world file copy, no real impact still get
> 2200MB/s+ read write easily. So we stopped IO Meter and did 6 concurrent
> file copies between the 2 sans and each host blew us away at 2000MB/s read
> and write across all 6 hosts.!! FXXK we were amazed
>
> Aside
> from that LSI have done a super job on the Cache modules on their cards.
> I dont see how ZFS will ever approach this kind of performance.
>
> Anyway
> take from what ive said what you like. We have an awsum SAN... and its
> CHEAP
> and Amazingly fast!
>
> were in the process of developing a public Cloud, so when the web site is
> finn we will publish all of this in a whitepaper there.
>
> On 24.10.2012 10:57, Jason Davis wrote:
> [...]

Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Trevor Francis <tr...@tgrahamcapital.com>.
Seems my original post has taken a detour. Who knew storage could be so controversial!

We are a CentOS house and it would be nice to keep a standard OS across our platform, but many people have been singing ZFS' praises. To bridge that gap, ZFS can evidently be deployed on CentOS 6.3 using ZFS on Linux.
Does anyone have experience with this? Would it be better to deploy CentOS with ZFS and a cache drive, or CentOS 6.3 with ext4 and flashcache or bcache?
Also, is the standard Linux NFS daemon sufficient to handle a significant load, or do you recommend a different NFS daemon... or going iSCSI?
Our biggest concern is that most of our VMs do many small sequential writes (log files).
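
Whatever we shortlist (ZFS on Linux vs. ext4 + flashcache/bcache, NFS vs. an iSCSI-backed mount), a quick like-for-like check for that log-style write pattern is a small fsync'd append test. This is only a minimal sketch; the path, record size and count are placeholders to point at whatever mount is under test:

#!/usr/bin/env python
# Crude small-sequential-write benchmark: appends fsync'd records the way a
# chatty logger would, then reports writes per second.
import os
import time

PATH = "/mnt/candidate/bench.log"   # placeholder: a file on the storage under test
RECORD = b"x" * 256                 # roughly one log line
COUNT = 20000                       # number of appended records

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
start = time.time()
for _ in range(COUNT):
    os.write(fd, RECORD)
    os.fsync(fd)                    # push each record to stable storage
elapsed = time.time() - start
os.close(fd)
os.remove(PATH)

print("%d fsync'd appends in %.2fs -> %.0f writes/s" % (COUNT, elapsed, COUNT / elapsed))

Running the same script on each candidate mount gives comparable numbers for exactly the workload that worries us.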

Thoughts before I make a decision?



Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
 
Solutions Provider for the Telecom Industry

 

On Oct 24, 2012, at 2:53 AM, Fabrice Brazier <fa...@apalia.net> wrote:

> Hi,
> 
> @bruce, it's pretty fun because you bought 30 discs but only 2 LSI 9280-8i
> or the 8i mean 8 internals ports (so with 2 lsi you can have only 16 disks).
> 
> After that a velociraptor disk it's only a 10k discs so we can count 120
> IOPs by disks. If we put them in a raid 5 (by the way everybody know that
> raid 5 don't scale very well after 12 disk the performance doesn't increase
> so much but nevermind you built your raid 5 of 30 disks).
> So you have a maximum of (30-1)*120=3480 iops for read and
> ((30-1)*120)/5=700 iops write (in a 100% random profile) And you have only
> 32Gb ram wich is peanuts to caching your read iops.
> 
> So the only way you have to make performance is your cahecade.
> Now a cachecade is made of ssd (weird actually it's the same as zil and
> l2arc in zfs xD) plus the max size of a cachecade and the maximum size of a
> cachecade is 512 Gb by controller. You have 2 controller so 1Tb cachecade
> max.
> 
> So you say your storage doesn't cost anything but you don't talk about the
> size and the price of your cachecade.
> And another thing the day you will have a real workload (not only 50 Vms...)
> and this workload will be larger than 1Tb come again and explain us your
> very large performance.
> 
> Me too I can built a zfs server with 4 zeusram in zil and 5 disks behind and
> make a benchmark on files of 2Gb and say hey see I have 500 000 Iops and 4Gb
> of throughput...
> 
> Regards
> 
> -----Original Message-----
> From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
> Sent: Wednesday, 24 October 2012 06:04
> To: cloudstack-users@incubator.apache.org
> Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT
> 
> 
> 
> These Sans we have build have saved us a fortune $ compared to FC Block I/O
> Sans from IBM, HP etc
> 
> Ive been a contractor for IBM and HP
> for 20 years. I know how much they are $$$$$ and what they can do.
> 
> A
> Decent DS4000 you'll be looking at 80K, or Storageworks $90K. And if your
> lucky you may get 100,000 perhaps 150,000 IOPS
> 
> What ive build is
> better. cheaper, and frankly as i support Linux and SCST, much more
> reliable.
> 
> the WD VR 1TB we get new for $310 so 30 x =$9,300 LSI 92808-8i x 2 $1800
> ($900 ea) The case $600 the mobo and 1 x Xeon 5650 $1000, 32GB ECCRam $800,
> Dual PSU $650, about $14,000 for a much superior SAN.
> 
> As i explained another $15K for infiniband network. which does SRP and IP
> 
> As for your issues with cache misses, the CacheCade doesn't Miss much from
> our tests. with WriteBack enabled, BBU enabled, and the SANs are on our
> primary 22KVA UPS It never fails.
> 
> Also Its not
> doing very much unless a user in the cloud is hitting sequential blocks or
> R/W the same data repeatedly for some weird reason. Even then we see little
> lagg or slow down, when as you say its hitting the Disc.
> Infiniband with LSI Just rocks even more than we had thought!
> 
> The R/W
> pattern from groups of 50 VM.s appear to be completely random, web servers,
> virtual firewalls, email servers, DNS etc, FTP, windows domains, sharepoint,
> MSSQL, MYSQL working on the same Lun which are not considered critical (as
> stated by clients) are each VM getting over 300-1200MB/s constantly even
> when other users are pulling or pushing a file copy @ 250Mb/s between VMs
> etc. Weve actually had to figure out how to slow down the SRP connections.
> If a user is provisioned with 2 Luns from each SAN, in windows you can copy
> between VM Drives @ 1220MB/s!!!!!
> faster than SSD in Linux its even faster, and that is in ESXi4.1 with
> 15~20 Servers per VM Host.
> 
> we have snmp monitoring the 2 primary sans, and they never go above 30%
> bandwidth. so the point is the Cache has no real bearing on Cloud I/O. cloud
> IO is fairly low. We have LSI supplying us with the new PICE3 cards for a
> trial, so i expect even better performance. So long STORY SHORT. GET LSI AND
> 16 x WD VR's and try it yourself! And yes we compared IOPS on 5 x 500MB/s
> SSD's its SLOWER! MORE SPINDLES = MORE SPEED!
> 
> the main reason we wanted fast IO is for backups, snapshots, so as not to
> interfere with the daily operational workings of the cloud VM guests. Which
> is doesn't. SOOO many TIMES with FC ive seen the attitude yeah that will
> work, and some smart ass does a restore or backup in prod hours, only to see
> the whole cluster slow down to a crawl. on iSCSI its even more pathetic.
> 
> With SRP on LSI " WE DONT
> HAVE THAT PROBLEM!"
> 
> I did a snapshot restore for 2 machines yesterday for a client, it took 2
> mins to restore 100GB. And none of the other servers were affected. SNMP
> latency on all VM's no impact. Virtually 0CPU cycles on the VMware hosts for
> the IO move. IO latency/storage path for other VM's from snmp data from
> VMserver was 2ms which is normal. the machine doing the copy crept up to
> 10ms but no more.
> 
> So my conclusion
> ON REAL WORLD OBSERVATIONS IS.... INFINIBAND SRP IS BETTER THAN ANYTHING!
> 
> Cache misses we don't have a problem there at all. Were more worried about
> Data Rot, which is why were interested in GFS to ZFS Luns.
> But ZFS is still slow as compared to what we have.....
> 
> we are going to
> launch a new script which moves a file full of 16-bit random io for 100MB,
> and 1TB, it will create a binary checksum for each file, and we can compare
> it over time to check for data rot.
> 
> Just so you can
> appreciate the speed. We Ran IO meter from 3 VMs on 3 separate VM Hosts.
> the 1st get max i/o ie up to 400,000 IOPS and pushing 2400MB/s we started
> the 2nd, while the 1st was running, same results across the 2 VM's. then the
> 3rd concurrently. they all dropped to 2000MB/s and IOPS went down to 220,000
> per host. so yes there is a penalty. But the way infiniband works is that
> the switch balances the load across all servers. 4th server online same
> test. and we get 150,000 IOPS on each server, all doing 1700~1800MB/s @ 128K
> blocks 50%50% random seq. change the distribution to 100% random makes
> little difference. We confirmed all the drives were being accessed! all
> were. we even took a look at the camera out DC to monitor the array LEDS...
> alll Flickers like mad! we took a look at the metrics of the Megaraid as we
> got it to work on CentOS, no errors, no issues or problems, no dropped reads
> or writes.
> 
> we set the file on IO meter as high as poss 900000000 units. We then fired
> up 2 more hosts and did real world file copy, no real impact still get
> 2200MB/s+ read write easily. So we stopped IO Meter and did 6 concurrent
> file copies between the 2 sans and each host blew us away at 2000MB/s read
> and write across all 6 hosts.!! FXXK we were amazed
> 
> Aside
> from that LSI have done a super job on the Cache modules on their cards.
> I dont see how ZFS will ever approach this kind of performance.
> 
> Anyway
> take from what ive said what you like. We have an awsum SAN... and its CHEAP
> and Amazingly fast!
> 
> were in the process of developing a public Cloud, so when the web site is
> finn we will publish all of this in a whitepaper there.
> 
> On 24.10.2012 10:57, Jason Davis wrote:
>> [...]


RE: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by Fabrice Brazier <fa...@apalia.net>.
Hi,

@bruce, it's pretty funny that you bought 30 disks but only 2 LSI 9280-8i
controllers: the "8i" means 8 internal ports, so with 2 LSI cards you can
attach only 16 disks.

Also, a VelociRaptor is only a 10k RPM disk, so count roughly 120 IOPS per
disk. If you put them in a RAID 5 (and everybody knows RAID 5 doesn't scale
well past about 12 disks, the performance barely increases, but never mind,
you built your RAID 5 of 30 disks), you get at most (30-1)*120 = 3480 IOPS
for reads and ((30-1)*120)/5 = ~700 IOPS for writes (100% random profile).
And you have only 32GB of RAM, which is peanuts for caching your read IOPS.
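
That back-of-the-envelope math can be written out as a small helper; this is only a sketch, and the 120 IOPS per disk and the write penalty of 5 are the assumptions used above, not measured values:

# Rough RAID IOPS estimate reproducing the arithmetic above.
def raid_iops(disks, disk_iops=120, parity_disks=1, write_penalty=5):
    # disk_iops=120 assumes a 10k RPM drive; write_penalty=5 is the RAID 5
    # read-modify-write factor used above (4 is the other commonly quoted figure).
    data_spindles = disks - parity_disks
    read_iops = data_spindles * disk_iops            # 100% random reads
    write_iops = read_iops / float(write_penalty)    # 100% random writes
    return read_iops, write_iops

reads, writes = raid_iops(30)
print("~%d random read IOPS, ~%d random write IOPS" % (reads, writes))
# -> ~3480 random read IOPS, ~696 random write IOPS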

So the only thing giving you that performance is your CacheCade.
Now, CacheCade is built from SSDs (funny, it's the same idea as ZIL and
L2ARC in ZFS xD), and its maximum size is 512GB per controller. You have 2
controllers, so 1TB of CacheCade at most.

So you say your storage costs next to nothing, but you don't mention the
size and the price of your CacheCade.
And another thing: the day you have a real workload (not just 50 VMs...)
and that workload is larger than 1TB, come back and explain your very large
performance numbers to us.

I too could build a ZFS server with 4 ZeusRAMs for the ZIL and 5 disks
behind them, run a benchmark on 2GB files, and say "hey, look, I have
500,000 IOPS and 4GB/s of throughput"...

Regards

-----Original Message-----
From: bruce.m@v365.com.au [mailto:bruce.m@v365.com.au]
Sent: Wednesday, 24 October 2012 06:04
To: cloudstack-users@incubator.apache.org
Subject: Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT



@ 4K we get
>> >> 295,000 Iops @ 1590 MB/s and @ 128K
>> 350,000 iops @
2400~2600MB/s
>> >>
>> >> we
>> >> have tuned 64K and 128K block
>> size
on different luns. 64K for database
>> >> and 128K for general
>>
file.
>> >>
>> >> The best thing about Infiniband is low CPU
>> >>
cycles. only
>> 5~6% during these tests.
>> >>
>> >> latency is as low
as 4-6ms
>> >> average
>> read time. concurrent response times are from
5-12ms. even under
>> >>
>> heavy load its below 20ms
>> >>
>> >>
Infiniband latency is below 0.01 us
>> Which is
>> >> why we chose
it.
>> >>
>> >> we run MSSQL on the 64K formatted
>> Luns, its
massively
>> >> fast.
>> >>
>> >> If we copy the same data twice the
>>
cache cade kicks in and we
>> >> achieve even better speeds.
>> >>
>> >>
Ive
>> compared this to block io @ 8Gb/s Fiber
>> >> channel, and it
barely gets
>> 120,000 IOPS. @ much higher latency and
>> >>
bandwidth.
>> >>
>> >> So for our
>> money RDMA wins!
>> >>
>> >>
Bruce
>> >>
>> >> On 23.10.2012 21:20,
>> >> Jason Davis
>> wrote:
>> >>
> Out of curiosity, is there any quick performance
>> >> numbers
>> for
these ZFS +
>> >> > GlusterFS mashups you guys are talking
>> >>
about?
>> >>
>> >
>> >> > Specifically, IOPs and latency? Sequential
read/write
>> >>
>> performance honestly
>> >> > isn't a very good
benchmark to determine
>> your
>> >> SANs performance. It's like
>> >> >
comparing CPUs based solely on
>> how many
>> >> GHz it runs at. Sure
you can get
>> >> > great MB or GB/s with
>> SATA disk but
>> >> I'd
reckon that IOP performance is
>> >> > abismal. If you
>> are
utilizing
>> >> GlusterFS without the cache pooling magic
>> >> > that
is
>> ZFS then I would
>> >> imagine that latency can be an issue.
>> >>
>
>> >> >
>> >>
>> >
>> >> > On Tue, Oct 23, 2012 at
>> >> 7:56 AM,
Andreas Huser wrote:
>> >> >
>> >>
>> >> Hi Fabrice,
>> >> >>
>> >> >> i
know
>> >> OpenSolaris/Solaris Oracle it's so
>> a thing.
>> >> >> I'm
for more then 10
>> >> years a open source user and
>> that
>> >> >>
with oracle - i did no like at the
>> >> beginning of this
>>
constallation.
>> >> >> But Oracle makes his work good i
>> >> know
that. The
>> cost of one socket
>> >> >> are 700$ and you can use so
much
>> >> quantity of
>> TB as you will.
>> >> >> And you can use the
full premier Support
>> >> from
>> Oracle.
>> >> >> Nexenta develop with
the Illumos code. And the Licence
>> >>
>> are TB based.
>> >> >> That
is not my favorite. As well the pool version
>> from
>> >> Nexenta
comes
>> >> >> not after. Current Nexenta Infiniband are
>> not a
usable
>> >> solution.
>> >> >> But every can use what he will.
Everyone
>> must decide for
>> >> themselves.
>> >> >>
>> >> >> SRP
Targets or iser are not
>> difficult to configure.
>> >> Use the SRP
for
>> >> >> the Storage unit
>> connection. Solaris and GlusterFS
>> >>
builds one Storage unit.
>> >> >> The
>> GlusterFS Server export the
final Volume
>> >> to the Clients as well
>> KVM,
>> >> >> VMWare,
Hyper-V etc.
>> >> >> You can use
>> >> native GlusterFS,
>> RDMA, NFS
ore CIFS to export the Volume.
>> >> >> SRP have
>> >> nothing to do
>>
with VmWare.
>> >> >>
>> >> >> When you are use a 7200 SAS drive the
>>
>> access
>> time are the same as a
>> >> >> SATA drive only the quality
of the
>> >>
>> hardware are better. When you need
>> >> >> Performance
you must use SAS
>> >>
>> drives with 15000U/m. But it's not needed
>>
>> >> when you install SSD
>> for
>> >> ZIL/L2ARC. ZeusRAM rocks :-)
>>
>> >>
>> >> >> I use dedup only at
>> secondary stroage
>> >> or on
Backupserver not on Primary
>> >> >> Storage.
>> >>
>> >> When you use
SSD SATA
>> >> drives then you have an cheap an fast
>> strorage.
>> >>
>> 1TB drive cost unter
>> >> 100$. Current i'm not need to
>> save
storage volume.
>> >> >>
>> >> >> Which
>> >> application use atime? I
know
>> "find -atime N". atime on a Storage
>> >> >>
>> >> that stores
only Virtual
>> Disks?
>> >> >> I don't need the information when i
>>
>> have last opening
>> the disk :-)
>> >> >>
>> >> >> For a Solaris GUI
take a look at
>> >> napp-it
>> >>
>> >> http://www.napp-it.org/
>> >>
>>
>> >> >> greetings from germany
>> >> >>
>> >>
>> Andreas
>> >> >>
>>
>> >>
>> >> >>
>> >> >>
>> >> >> ----- Ursprüngliche Mail -----
>> >>
>>
>>
>> >> >> Von:
>> >> "Fabrice Brazier">> An:
>>
cloudstack-users@incubator.apache.org
>> >> >>
>> >> Gesendet: Dienstag,
23.
>> Oktober 2012 12:30:50
>> >> >> Betreff: RE: Primary
>> >>
Storage
>> >> >>
>> >> >>
>> Hi Andreas,
>> >> >>
>> >> >> Hum that's
pretty cool, i know they have
>> >>
>> still trouble with infiniband
on
>> >> >> nexenta but it's clearly a
>> priority
>> >> on their
roadmap (plus i trust more the
>> >> >> nexenta team
>> to evolve zfs
>>
>> than oracle).
>> >> >>
>> >> >> I agree iscsi over infiniband
>>
increase the latency
>> >> but most of the time
>> >> >> it's
>> >> >>
just more
>> simple to use IPOIB than IB,
>> >> for example if you use
vmware the
>> >> >>
>> support of infiniband it's assured
>> >> by
mellanox and not vmware so if
>> you
>> >> >> have
>> >> >> an issue the
vmware
>> >> support probably don't help
>> you anymore.
>> >> >>
>> >>
>> About the raid i'm a fan
>> >> of raid 10, i
>> prefer build a raid
10 with 7200 sas
>> >> >> drive than
>> >> multiple vdev
>> raidz/raidz2
15k sas drive. Particularly for
>> >> >> virtual
>> >>
>> environment
with a random workload and multiple access.
>> >> >>
>> >> >> I'm
>> a
fan
>> >> of nfs so agreed about the zil, and zeus ram are the only
one
>> for
>> >> >>
>> >> that (with classical ssd you can hit the write
hole).
>> >>
>> >>
>> >> >> Agreed for
>> >> compression too (but only
lzjb level gzip use too
>> much cpu),
>> >> >> disable
>> >> atime
permit to decrease the iops load but
>> i'm not sure is really a
>> >>
>>
>> >> best practice. About the deduplication
>> I don't totally
agree. It
>> >> really
>> >> >> depends of your array, the
>> workload
and the VM type on your
>> >> cloudstack.
>> >> >> Actually when i
>>
build a zfs array i count 1Gb ram for 1Tb
>> >> disks. With
>> >> >>
>>
deduplication i count 2Gb ram for 1 Tb disks (plus in
>> >> nexenta 4
they
>> will
>> >> >> add a possibility to use ssd for the
>> >>
deduplication metada,
>> like the l2arc
>> >> >> for the read cache).
>>
>> >>
>> >> >> The
>> >> last point
>> it's about your drives, why sata
drives? I mean sata
>> >> doesn't
>> >> >>
>> respect their latency, and
the reliability of a sas drive is
>> >> 10x the
>> sata.
>> >> >> Plus
now you can found many NL-sas drive at low
>> >> cost
>> >>
>> >>
>> >>
>> But it's really a nice architecture, i never try glusterfs
>> >>
>>
(for the moment)
>> >> >> plus in that case it's really a good way to
have
>> a
>> >> replacement to a
>> >> >> metro-cluster for free, i try
one of is
>> competitor
>> >> (onefs) and clearly
>> >> >> clustered
filesystem are the
>> futur.
>> >> >>
>> >> >>
>> >> Cheers,
>> >> >>
Fabrice
>> >> >>
>> >> >> -----Message
>> d'origine-----
>> >> >> De :
Andreas
>> >> Huser
>> [mailto:ahuser@7five-edv.de]
>> >> >> Envoyé :
mardi 23 octobre 2012
>> >>
>> 11:37
>> >> >> À :
cloudstack-users@incubator.apache.org
>> >> >> Objet :
>> Re:
>> >>
Primary Storage
>> >> >>
>> >> >> Hi Fabrice,
>> >> >>
>> >> >> I don't
know
>> what other people
>> >> do but i have no problems with
Infiniband +
>> >> >>
>> GlusterFS + Cloudstack
>> >> I'm not use
Nexenta it's based on Illumos and
>> work
>> >> >> not fine with
>> >>
Infiniband.
>> >> >> I have two different
>> clusters in productiv
>> >>
enviroments.
>> >> >>
>> >> >> The first: Solaris 11
>> with built-in
GlusterFS 3.3
>> >> export Gluster Vol. with
>> >> >> RDMA. >
>>
performance is okey you can use that
>> >> for smaller enviroments
The
>> >>
>> >> Second: Is a little bit complex with a GlusterFS
Server in the
>> middle
>> >> >>
>> >> >> ZFS Server: based on Solaris
11
>> >> >>
>> >> 1.) Create a
>> zpool min. two vdevs and SSD
read/write cache
>> >> >> 2.) Create
>> >> a thin
>> provisioning Volume
"zfs create -V" disable atime and
>> >> >> enable
>> >>
>> compression
(do not enable dedup!) and export as (iWarp) SRP Target
>> >>
>> >>
>>
>> to the GlusterFS Server use a direct connection without IB
>>
Switch.
>> >> >>
>> >> >>
>> >> GlusterFS Server:
>> >> >> 1.) Use
RedHat, CentOS or
>> Fedora (I use CentOS 5
>> >> and 6)
>> >> >> 2.)
Use OFED driver from
>> https://www.openfabrics.org
>> >> >> 3.)
>> >>
Import the SRP Target from ZFS
>> Server and format as xfs
>> >> >> 4.)
Create a
>> >> Glustervolume "volume
>> create xy transport rdma " (use
only
>> >> >> rdma)
>> >> >>
>> >> 5.) Connect
>> with the second IB
Port to a IB Switch
>> >> >>
>> >> >> Cloudstack
>> >>
>> Hypervisor
Node:
>> >> >> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5
>>
and
>> >> 6)
>> >> >> 2.) Use OFED driver from
https://www.openfabrics.org
>> >>
>> >> 3.) Import
>> >> the
Glustervolume
>> >> >>
>> >> >> ZFS Thin Volume ----
>> Infinband SRP
---->
>> >> GlusterFS ---- GFSVol rdma ---->
>> >> >> IB Switch
>> ---->
Clients
>> >> >>
>> >> >> The
>> >> ZFS and GlusterFS Server form a
Storage
>> unit connect direct with
>> >> 40Gbit
>> >> >> Infiniband
Point-to-Point You
>> do not feel that is a cable
>> >> between!
>> >>
>>
>> >> >> Important: When you
>> have Infiniband use not IPoIB with
>>
>> iSCSI! If one
>> >> >> already
>> >> >>
>> has Infiniband then you
should also use the
>> >> advantage.
>> >> >> IPoIB
>> have a higher
Latency as iWarp SRP!
>> >> >>
>> >> >>
>> >> >> SRP =
>> >> usec
>>
>>
>> >>
>> >> >> -- SRP --
>> >> >> local address: LID 0x01 QPN
0x44004b PSN
>> >>
>> 0xf3265b RKey 0x9804237c VAddr
>> >> >>
0x00000001dda000 remote address:
>> LID
>> >> 0x0a QPN 0x10004a PSN
0x44072e RKey
>> >> >> 0x1c0f115 VAddr
>> >>
>> 0x000000088e6000
>> >>
>>
>> >>
>>
------------------------------------------------------------------
>>
>>
>> >>
>> >> #bytes #iterations t_min[usec] t_max[usec]
t_typical[usec]
>> >> >> 2
>> 1000
>> >> 1.29 125.30 1.31
>> >> >>
>>
>>
>>
------------------------------------------------------------------
>>
>>
>> >>
>> >> >>
>> >> -- IPoIB ---
>> >> >> [root@sv01sfogaa ~]# ping
10.200.0.10
>> >>
>> >> PING
>> >> 10.200.0.10 (10.200.0.10) 56(84)
bytes of data.
>> >> >> 64
>> bytes from
>> >> 10.200.0.10: icmp_seq=1
ttl=255 time=0.147 ms
>> >> >> 64
>> bytes from
>> >> 10.200.0.10:
icmp_seq=2 ttl=255 time=0.116 ms
>> >> >>
>> >> >>
>> When you put load
on
>> >> IPoIB then increases the latency that is not
>> good
>> >>
>>
>> >> >>
>> >> >> That is my
>> >> recommendation for a simple
>>
GlusterFS Mirror:
>> >> >>
>> >> >> - Supermicro Server
>> >> with Intel
Hardware
>> and Expander Backplane
>> >> >> - 1x Crucal-M4 SSD Read
>>
>> Cache
>> >> >> - 2x
>> ZeusIOPs SSD Write cache (mirror)
>> >> >> -
SATA 24/7
>> >> Harddrives
>> >> >>
>> - LSI HBA 9207 or 9211
>> >> >> -
ConnectX-2 QDR Dualport
>> >> Infiniband
>> Adaper (HP refurbished with
full
>> >> >> warrany for 100$)
>> >> Importent >
>> Flash newest
Firmware from Mellanox!
>> >> >> - Mellanox IB
>> >> Swich
>> >> >>
>> -
Solaris 11
>> >> >> - GlusterFS 3.3 compiled with ib_verbs
>> >> >> -
>>
>>
>> Gluster Volume transport only rdma
>> >> >>
>> >> >>
>> >> >>
>>
>> >>
>> >> >>
>> >>
>> >>
>> >>> Throughput
>> >> constant up to 200
Mbyte/s >> more throughput with
>> more storage servers
>> >> or more
harddrives on jbod >> >> >>
>> >>
>> >> >>
>> >>
>> >>
>> >> >> Info:
>>
>> >>
>> >> >> - I have some
>> >> problems with infiniband
>> RDMA or
SRP with OpenIndiana or
>> >> >> Illumos or
>> >> Nexenta. Some
>>
adapters have a high latency or not a stable
>> >> >>
>> >> connection.
Use
>> Solaris that's the right way!
>> >> >> - OpenIndiana is Beta!
>>
>> Infiniband
>> ib_verbs works not or not fine!
>> >> >> - Use Solaris
11
>> >> Infiniband
>> ib_verbs are native and stable
>> >> >> - Don't
use Ubuntu Client
>> >> Server
>> for Infiniband! Use RedHat, Fedora
or
>> >> >> CentOS and install the
>> >>
>> right drivers from
>> >> >>
https://www.openfabrics.org/downloads/OFED/
>> >>
>> >> -
>> >> You have
not SSD Cache? Disable at the ZFS Volume the sync!
>> >>
>>
Important!
>> >> >> You
>> >> >> loose Security for your Data but
some
>> protocols use
>> >> sync flags in
>> >> >> transport.
>> >> >>
Example NFS use by
>> default fsync. write
>> >> cache are not active.
NFS writes
>> >> >> data
>> direct to the Harddrive. For
>> >> Data
Security and Performance give
>> the
>> >> >> Storage Server a SSD
write
>> >> cache. ZFS works at default with
>> sync=standard
>> >> >>
that prevent write
>> >> holes. (COW System)
>> >> >>
>> >>
>> >> I hope
that I could help a little
>> >> >>
>> >> >>
>> >> Greeting from
>>
Germany
>> >> >> Andreas
>> >> >>
>> >> >>
>> >> >>
>> >> >> -----
Ursprüngliche
>> Mail
>> >> -----
>> >> >>
>> >> >> Von: "Fabrice
Brazier"
>> >> >> An:
>> >>
>> cloudstack-users@incubator.apache.org
>>
>> >> Gesendet: Dienstag, 23.
>> Oktober
>> >> 2012 09:55:15
>> >> >>
Betreff: RE: Primary Storage
>> >> >>
>> >> >>
>> Hi Andreas,
>> >>
>>
>> >> >>
>> >> Hello i just see your configuration, it seems
>> quite
interesting.
>> >> >> If i
>> >> understand well you want to build
some
>> zfs array on the backend.
>> >> >>
>> >> Export luns (probably
by iscsi over
>> infiniband) to you linux cluster,
>> >> and
>> >> >>
on
>> >> >> the linux
>> cluster you put glusterFS.
>> >> >> I can
understand
>> >> the point, with
>> that you can have very good
performance
>> >> >> and
>> >> >>
>> >> reliability
>> (zfs),
scalability and redundancy (gluster) for very low
>> >> cost.
>> >>
>>
>> So just one question, did you try the global namespace
>> >>
implementation
>> from
>> >> >> nexenta?
>> >> >> If yes can you tell me
what
>> >> configuration
>> is the best for you?
>> >> >> I mean the
fact you have a gluster
>> >> cluster
>> in the middle must impact
the
>> >> >> overral performance no?
>> >> >>
>> >>
>> >>
>> >>
Fabrice
>> >> >>
>> >> >> -----Message d'origine-----
>> >> >> De :
Andreas
>> Huser
>> >> [mailto:ahuser@7five-edv.de] Envoyé : mardi 23
octobre
>> >> >>
>> 2012 05:40 À :
>> >>
cloudstack-users@incubator.apache.org
>> >> >> Objet :
>> Re: Primary
>>
>> Storage
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> for Cloudstack i
use
>> Solaris 11 ZFS + GlusterFS
>> >> over Infiniband (RDMA).
>> >> >>
That
>> >> >>
>> gives the best performance and most
>> >> scalable
Storage.
>> >> >> I have
>> tasted some different solutions for
primary
>> >> Storage but the most
>> are
>> >> >> to expensive and for
a CloudStack Cluster
>> >> not economic or
>> have a poor
>> >> >>
performance.
>> >> >>
>> >> >> My Configuration:
>> >> >>
>> >>
>>
Storage Node:
>> >> >> Supermicro Server (Intel Hardware) with Solaris
11
>> with
>> >> SSD write and read
>> >> >> cache (read crucial-m4,
write ZeusIOPS)
>> GlusterFS
>> >> and dualport ConnectX
>> >> >>
40Gbit/s Infiniband adapter.
>> >>
>> >>
>> >> >> I have
>> >> installed
GlusterFS direct on Solaris with a modified
>> code.
>> >> >> Want
you
>> >> build bigger systems for more then 50 VMs it is
>> better you
split
>> >> >> the
>> >> Solaris and GlusterFS with a separte
>>
headnode for GlusterFS
>> >> >>
>> >> >> That
>> >> looks like:
>> >> >>
Solaris ZFS
>> Backendstorage with a dataset Volume (Thin
>> >>
Provision) --> ( SRP
>> >> >>
>> Target attached direct without
Infiniband switch
>> >> to GF Node) -->
>> GlusterFS
>> >> >> Node the
srp target formatted with xfs
>> >> filesystem,
>> create a GlusterFS
>>
>> >> Volume --> ( Infiniband over a
>> >> Mellanox Port
>> Switch) -->
Cloudstack Node
>> >> >> mount glusterFS Volume over
>> >> RDMA
>> >>
>>
>>
>> >> >> For the Dataset Volume at the ZFS Storage, disable atime
and
>> >>
>> enable
>> >> >> compression.
>> >> >> (Space reclaim) With
compression you
>> can
>> >> shrink the ZFS Volume with command
>> >> >>
at Linux dd /dev/zero or
>> In a
>> >> Windows VM with sdelete That
gives you space
>> >> >> left on the
>> Primary
>> >> Storage for
deleted Files in a VM or for deleted vhd's
>> >> >>
>> or vm's in the
>>
>> cloudstack
>> >> >>
>> >> >> greeting Andreas
>> >> >>
>> >> >>
>>
>>
>> >>
>> >> >>
>> >> >> Mit freundlichen
>> >> Grüßen
>> >> >>
>> >>
>> Andreas Huser
>> >>
>> >> Geschäftsführer
>> >> >> System Engineer
/
>> >> Consultant
>> >> >> (Cisco
>> CSE, SMBAM, LCSE, ASAM)
>> >>
>>
>> >>
>> ---------------------------------------
>> >> >>
Zellerstraße 28 - 77654
>> >>
>> Offenburg
>> >> >> Tel: +49(781)
12786898
>> >> >> Mobil: +49(176) 10308549
>> >>
>> >>
>> >>
ahuser@7five-edv.de
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
-----
>> Ursprüngliche Mail -----
>> >> >>
>> >> >>
>> >> Von: "Outback
Dingo"
>> >> >> An:
>> cloudstack-users@incubator.apache.org
>> >> >>
>>
>> Gesendet: Dienstag, 23.
>> Oktober 2012 02:15:16
>> >> >> Betreff:
Re: Primary
>> >> Storage
>> >> >>
>> >> >>
>> On Mon, Oct 22, 2012 at
8:09 PM, Ivan
>> >> Rodriguezwrote:
>> >> >>
>> >>
>> >>
>> >>
>> >> >>>
Solaris 11 ZFS and yes we tried different
>> >> setups, raids
>> levels
number >> of SSD cache, ARC zfs options etc etc
>> >> etc. >> >>
>>
Cheers >> >> >> VMWare ?? >>
>> >>
>> >> >>
>> >> >>
>> >> >>
>> >>
>>
>> >>
>>
>>

Re: Primary Storage - DATA-ROT IS MORE TO WORRY ABOUT

Posted by br...@v365.com.au.
  

These SANs we have built have saved us a fortune compared to FC Block I/O
SANs from IBM, HP etc.

I've been a contractor for IBM and HP for 20 years. I know how much they
cost and what they can do.

For a decent DS4000 you'll be looking at $80K, or $90K for StorageWorks,
and if you're lucky you may get 100,000, perhaps 150,000 IOPS.

What I've built is better, cheaper and, frankly, since I support Linux
and SCST, much more reliable.

The WD VR 1TB we get new for $310, so 30 x $310 = $9,300.
LSI 9280-8i x 2 = $1,800 ($900 ea).
The case $600, the motherboard and 1 x Xeon 5650 $1,000, 32GB ECC RAM
$800, dual PSU $650 - about $14,000 for a much superior SAN.
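
Adding that up (the prices are simply the figures quoted above):

    # Rough bill of materials for one SAN, using the prices above.
    parts = {
        "30 x WD VelociRaptor 1TB @ $310": 30 * 310,
        "2 x LSI 9280-8i @ $900":          2 * 900,
        "Chassis":                         600,
        "Motherboard + Xeon 5650":         1000,
        "32GB ECC RAM":                    800,
        "Dual PSU":                        650,
    }
    print(f"Total: ${sum(parts.values()):,}")   # -> Total: $14,150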

As I explained, it's another $15K for the InfiniBand network, which does
SRP and IP.

As for your issues with cache misses, the CacheCade doesn't miss much in
our tests, with WriteBack enabled, BBU enabled, and the SANs on our
primary 22kVA UPS. It never fails.

Also, it isn't doing very much unless a user in the cloud is hitting
sequential blocks or reading/writing the same data repeatedly for some
weird reason. Even then we see little lag or slowdown when, as you say,
it is hitting the disk. InfiniBand with LSI just rocks even more than we
had thought!

The R/W pattern from groups of 50 VMs appears to be completely random:
web servers, virtual firewalls, email servers, DNS, FTP, Windows domains,
SharePoint, MSSQL, MySQL all working on the same LUN (not considered
critical, as stated by clients). Each VM gets over 300-1200MB/s
constantly, even when other users are pulling or pushing a file copy at
250Mb/s between VMs. We've actually had to figure out how to slow down
the SRP connections. If a user is provisioned with 2 LUNs from each SAN,
in Windows you can copy between VM drives at 1220MB/s - faster than SSD -
and in Linux it's even faster, and that is on ESXi 4.1 with 15~20 servers
per VM host.

We have SNMP monitoring on the 2 primary SANs, and they never go above
30% bandwidth. So the point is that the cache has no real bearing on
cloud I/O; cloud I/O is fairly low. We have LSI supplying us with the new
PCIe 3 cards for a trial, so I expect even better performance. Long story
short: get LSI and 16 x WD VRs and try it yourself! And yes, we compared
IOPS against 5 x 500MB/s SSDs - it's slower! More spindles = more speed!

The main reason we wanted fast I/O is for backups and snapshots, so as
not to interfere with the daily operational workings of the cloud VM
guests. Which it doesn't. So many times with FC I've seen the attitude
"yeah, that will work", and then some smart ass does a restore or backup
in prod hours, only to see the whole cluster slow to a crawl. On iSCSI
it's even more pathetic.

With SRP on LSI, "WE DON'T HAVE THAT PROBLEM!"

I did a snapshot restore of 2 machines yesterday for a client. It took 2
minutes to restore 100GB, and none of the other servers were affected.
SNMP latency on all VMs showed no impact, and there were virtually no CPU
cycles spent on the VMware hosts for the I/O move. I/O latency on the
storage path for the other VMs, from the SNMP data on the VM server, was
2ms, which is normal; the machine doing the copy crept up to 10ms but no
more.
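
That restore works out to roughly 830MB/s sustained - a trivial check on
the figures above (decimal units):

    restored_gb = 100        # GB restored
    elapsed_s = 2 * 60       # 2 minutes
    print(f"{restored_gb * 1000 / elapsed_s:.0f} MB/s sustained")  # -> 833 MB/s sustained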

So my conclusion, from real-world observations: INFINIBAND SRP IS BETTER
THAN ANYTHING!

Cache misses - we don't have a problem there at all. We're more worried
about data rot, which is why we're interested in GFS on top of ZFS LUNs.
But ZFS is still slow compared to what we have...

We are going to roll out a new script which writes files full of 16-bit
random I/O, one of 100MB and one of 1TB, creates a binary checksum for
each file, and lets us compare the checksums over time to check for data
rot.
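
Roughly something along these lines - only a sketch for now, since the
real script isn't finished, and the file size, paths and the choice of
SHA-256 here are placeholders:

    #!/usr/bin/env python3
    # Minimal data-rot check: write a file of random data once, record its
    # digest, then re-hash it on later runs and compare.
    import hashlib, os, sys

    TEST_FILE = "rotcheck.bin"
    DIGEST_FILE = "rotcheck.sha256"
    SIZE_MB = 100   # the big run would use a 1TB file instead

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    if not os.path.exists(TEST_FILE):
        # First run: fill the file with random data and store its checksum.
        with open(TEST_FILE, "wb") as f:
            for _ in range(SIZE_MB):
                f.write(os.urandom(1 << 20))
        with open(DIGEST_FILE, "w") as f:
            f.write(sha256_of(TEST_FILE))
        print("baseline checksum written")
    else:
        # Later runs: any mismatch means the bits on disk have changed.
        with open(DIGEST_FILE) as f:
            baseline = f.read().strip()
        ok = sha256_of(TEST_FILE) == baseline
        print("OK" if ok else "MISMATCH - possible data rot")
        sys.exit(0 if ok else 1)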

Just so you can appreciate the speed: we ran IOmeter from 3 VMs on 3
separate VM hosts. The 1st got maximum I/O, i.e. up to 400,000 IOPS while
pushing 2400MB/s. We started the 2nd while the 1st was running - same
results across the 2 VMs. Then the 3rd concurrently: they all dropped to
2000MB/s and IOPS went down to 220,000 per host, so yes, there is a
penalty. But the way InfiniBand works, the switch balances the load
across all servers. 4th server online, same test, and we get 150,000 IOPS
on each server, all doing 1700~1800MB/s at 128K blocks, 50/50
random/sequential. Changing the distribution to 100% random makes little
difference. We confirmed all the drives were being accessed - all were.
We even looked at the camera out at the DC to monitor the array LEDs...
all flickering like mad! We looked at the metrics of the MegaRAID once we
got it working on CentOS: no errors, no issues, no dropped reads or
writes.
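
Putting those per-host figures side by side - the per-host number drops,
but the aggregate across the fabric stays far above what a single host
sees (the numbers are just the ones from the runs above):

    # Per-host IOmeter results above -> aggregate IOPS across all hosts.
    runs = {1: 400_000, 3: 220_000, 4: 150_000}   # hosts -> IOPS per host
    for hosts, per_host in runs.items():
        print(f"{hosts} host(s): {per_host:,} each, {hosts * per_host:,} aggregate")
    # 1 host(s): 400,000 each, 400,000 aggregate
    # 3 host(s): 220,000 each, 660,000 aggregate
    # 4 host(s): 150,000 each, 600,000 aggregate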

We set the test file size in IOmeter as high as possible, 900,000,000
units. We then fired up 2 more hosts and did a real-world file copy - no
real impact, still getting 2200MB/s+ read/write easily. So we stopped
IOmeter and did 6 concurrent file copies between the 2 SANs, and each
host blew us away at 2000MB/s read and write across all 6 hosts! We were
amazed.

Aside from that, LSI have done a super job on the cache modules on their
cards. I don't see how ZFS will ever approach this kind of performance.

Anyway, take from what I've said what you like. We have an awesome SAN...
and it's cheap and amazingly fast!

We're in the process of developing a public cloud, so when the web site
is finished we will publish all of this in a whitepaper there.

On 24.10.2012 10:57, Jason Davis wrote:
> Passionate are you? Take it easy I am not knocking on what you are doing.
>
> The term "spinning rust" is a catch all for hard drives, nothing personal
> at all, in fact the setup you have is impressive.
>
> Impressive until you have a cache miss. Cache misses are definitely an
> issue even in "capital E" enterprise storage.
>
> Big bad XIV disk arrays or 3par are impressive, fast arrays (crazy
> expensive ;]) when you are hitting DRAM or SSD but if you are running cache
> unfriendly workloads then ultimately your array is only as fast as the
> spinning disk sitting at the lowest tier.
>
> I agree that openstorage is the way to go, from a service providers
> perspective it's hard being competitive with AWS or RackSpace if you are
> using expensive traditional SAN.
>
> So anyway, my question wasn't a poke at what you are doing, just curious
> about how big of an IOP/latency drop you see once a read or write has to
> actually see the spinning disk. :)
> On Oct 23, 2012 8:10 PM, <br...@v365.com.au> wrote:
> [...]

 

Re: Primary Storage - All those NEGATIVE TROLLS SHUT UP!

Posted by Jason Davis <sc...@gmail.com>.
Passionate, are you? Take it easy, I am not knocking what you are doing.

The term "spinning rust" is a catch-all for hard drives, nothing personal
at all; in fact the setup you have is impressive.

Impressive until you have a cache miss. Cache misses are definitely an
issue even in "capital E" enterprise storage.

Big bad XIV disk arrays or 3PAR are impressive, fast arrays (crazy
expensive ;]) when you are hitting DRAM or SSD, but if you are running
cache-unfriendly workloads then ultimately your array is only as fast as
the spinning disk sitting at the lowest tier.
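
A rough way to put numbers on that (purely illustrative figures - say
0.1ms for a hit in DRAM/SSD and 8ms for a miss that has to go to a 10k
disk):

    # Effective average latency as a function of cache hit ratio.
    def effective_latency_ms(hit_ratio, cache_ms=0.1, disk_ms=8.0):
        return hit_ratio * cache_ms + (1 - hit_ratio) * disk_ms

    for hit in (0.99, 0.90, 0.50):
        print(f"{hit:.0%} hits -> {effective_latency_ms(hit):.2f} ms average")
    # 99% hits -> 0.18 ms average
    # 90% hits -> 0.89 ms average
    # 50% hits -> 4.05 ms average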

I agree that open storage is the way to go; from a service provider's
perspective it's hard to be competitive with AWS or RackSpace if you are
using expensive traditional SAN.

So anyway, my question wasn't a poke at what you are doing, just curious
about how big of an IOPS/latency drop you see once a read or write has to
actually hit the spinning disk. :)
On Oct 23, 2012 8:10 PM, <br...@v365.com.au> wrote:

>
>
> oh well if your negative, and make rude comments then that's your
> problem, i don't care about you.
>
> If you don't believe that RDMA can
> perform this fast then again, your problem.
>
> Ask anyone with a decent
> 12+ Disc Raid Array, and RDMA and they will tell you it will kick you in
> YOUR ballz! Anyone in storage knows to get iops and performance you need
> 12 preferably 16 spindles.
>
> the spinning rust as you call it, are the
> newest 200MB/s WD Veloci Raprors with 64MB Cache. @ 10KRPM 2.5" dics in
> 32 Bay Storage Chassis.
>
> we've used them in the past and they are as
> reliable as the SAS drives we use, but faster!
>
> RAID 10 IS BULLSHIT!
> Bloody slow and 50% goes up in smoke for nothing. All you get is the
> mirror speed of 2 drives which is barley 400MB/s tops! I Spit on RAID 10
>
>
> We also tried Raid 60 and it was very good as well. But no point to
> use it as we replicated to 2nd SAN
>
> When we tested our new SANs, we
> failed 1 drive and set the rebuild rate to 30% on raid 6, with 40% scan
> rate. The SAN only had 6TB of data on it. and it rebuilt in 3 hours. we
> removed a 2nd disc, and replaced with blank, and it was rebuild in
> another 4 hours. With no real impact to performance tests.
>
> Separating
> the underlying 6Gb/s Sata ports x 8 into effectively 16 x 3Gb/s Channels
> using a SAS Expander gives each Disc 300MB/s of bandwidth. They cant
> physically perform better than 200~220MB/s so there is enough bandwidth
> on the SATA and PCIE Bus to cope. LSI rate the 9280-8i @ 2500MB/s but it
> does work faster with Cache Cade. up to 3200MB/s in that test.
>
> So a
> real SAN has many spindles of high performance, and the WD VRs are
> better than some cheap SSD drives.
>
> RDMA is very fast, low cpu cycles,
> and reads and write directly to RAM@ 40Gb/s, we created a RAM drive and
> tested it up to 3200MB/s which is as fast as the PCIE 2 Bus / LSI could
> Handle.
>
> If you cant afford this realtivley cheap equipment (compated to
> IBM DS or HP StorageWorks SAN) then dont wing at ME. You cant get fast
> performance. Any real Cloud would have what we have built. None of our
> customers, ALL 300 Of them! have lost a single bit of data, all VMs are
> running very quickly, and with no problems now for 3 months.
>
> So If you
> dont appreciate being shown how to build a kick ASS SAN then PISS
> OFF!
>
> On 24.10.2012 08:17, Jason Davis wrote:
> > How often do you have
> folks cache missing and hitting the spinning rust
> > below the cache cade
> tier?
> > On Oct 23, 2012 6:25 PM, wrote:
> >
> >>
> >>
> >> Hi
> >>
> >> with our
> SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid
> >> Director 4036
> switches + HP Gen2 QDR Chassis Switches.
> >>
> >> Our San achieves
> >>
> 295,000 - 350,000 Iops max per LSI controller, depending on block
> size
> >> for i/o meter test. LSI9280-8i with battery cache. +
> cachecade2.0 x 2 in
> >> PCIE 16x slot on supermicro motherboard.
> >>
> >> 1
> x 250GB SSD for CACHE Cade +
> >> 16 1tb WD VR drives (16 - hot stand buy
> - 2 for raid 6 parity) =
> >> 2600MB/s
> >>
> >> most of our test are from 4K
> to 128K block size, @ 4K we get
> >> 295,000 Iops @ 1590 MB/s and @ 128K
> 350,000 iops @ 2400~2600MB/s
> >>
> >> we
> >> have tuned 64K and 128K block
> size on different luns. 64K for database
> >> and 128K for general
> file.
> >>
> >> The best thing about Infiniband is low CPU
> >> cycles. only
> 5~6% during these tests.
> >>
> >> latency is as low as 4-6ms
> >> average
> read time. concurrent response times are from 5-12ms. even under
> >>
> heavy load its below 20ms
> >>
> >> Infiniband latency is below 0.01 us
> Which is
> >> why we chose it.
> >>
> >> we run MSSQL on the 64K formatted
> Luns, its massively
> >> fast.
> >>
> >> If we copy the same data twice the
> cache cade kicks in and we
> >> achieve even better speeds.
> >>
> >> Ive
> compared this to block io @ 8Gb/s Fiber
> >> channel, and it barely gets
> 120,000 IOPS. @ much higher latency and
> >> bandwidth.
> >>
> >> So for our
> money RDMA wins!
> >>
> >> Bruce
> >>
> >> On 23.10.2012 21:20,
> >> Jason Davis
> wrote:
> >> > Out of curiosity, is there any quick performance
> >> numbers
> for these ZFS +
> >> > GlusterFS mashups you guys are talking
> >> about?
> >>
> >
> >> > Specifically, IOPs and latency? Sequential read/write
> >>
> performance honestly
> >> > isn't a very good benchmark to determine
> your
> >> SANs performance. It's like
> >> > comparing CPUs based solely on
> how many
> >> GHz it runs at. Sure you can get
> >> > great MB or GB/s with
> SATA disk but
> >> I'd reckon that IOP performance is
> >> > abismal. If you
> are utilizing
> >> GlusterFS without the cache pooling magic
> >> > that is
> ZFS then I would
> >> imagine that latency can be an issue.
> >> >
> >> >
> >>
> >
> >> > On Tue, Oct 23, 2012 at
> >> 7:56 AM, Andreas Huser wrote:
> >> >
> >>
> >> Hi Fabrice,
> >> >>
> >> >> i know
> >> OpenSolaris/Solaris Oracle it's so
> a thing.
> >> >> I'm for more then 10
> >> years a open source user and
> that
> >> >> with oracle - i did no like at the
> >> beginning of this
> constallation.
> >> >> But Oracle makes his work good i
> >> know that. The
> cost of one socket
> >> >> are 700$ and you can use so much
> >> quantity of
> TB as you will.
> >> >> And you can use the full premier Support
> >> from
> Oracle.
> >> >> Nexenta develop with the Illumos code. And the Licence
> >>
> are TB based.
> >> >> That is not my favorite. As well the pool version
> from
> >> Nexenta comes
> >> >> not after. Current Nexenta Infiniband are
> not a usable
> >> solution.
> >> >> But every can use what he will. Everyone
> must decide for
> >> themselves.
> >> >>
> >> >> SRP Targets or iser are not
> difficult to configure.
> >> Use the SRP for
> >> >> the Storage unit
> connection. Solaris and GlusterFS
> >> builds one Storage unit.
> >> >> The
> GlusterFS Server export the final Volume
> >> to the Clients as well
> KVM,
> >> >> VMWare, Hyper-V etc.
> >> >> You can use
> >> native GlusterFS,
> RDMA, NFS ore CIFS to export the Volume.
> >> >> SRP have
> >> nothing to do
> with VmWare.
> >> >>
> >> >> When you are use a 7200 SAS drive the
> >> access
> time are the same as a
> >> >> SATA drive only the quality of the
> >>
> hardware are better. When you need
> >> >> Performance you must use SAS
> >>
> drives with 15000U/m. But it's not needed
> >> >> when you install SSD
> for
> >> ZIL/L2ARC. ZeusRAM rocks :-)
> >> >>
> >> >> I use dedup only at
> secondary stroage
> >> or on Backupserver not on Primary
> >> >> Storage.
> >>
> >> When you use SSD SATA
> >> drives then you have an cheap an fast
> strorage.
> >> >> 1TB drive cost unter
> >> 100$. Current i'm not need to
> save storage volume.
> >> >>
> >> >> Which
> >> application use atime? I know
> "find -atime N". atime on a Storage
> >> >>
> >> that stores only Virtual
> Disks?
> >> >> I don't need the information when i
> >> have last opening
> the disk :-)
> >> >>
> >> >> For a Solaris GUI take a look at
> >> napp-it
> >>
> >> http://www.napp-it.org/
> >> >>
> >> >> greetings from germany
> >> >>
> >>
> Andreas
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> ----- Ursprüngliche Mail -----
> >>
> >>
> >> >> Von:
> >> "Fabrice Brazier">> An:
> cloudstack-users@incubator.apache.org
> >> >>
> >> Gesendet: Dienstag, 23.
> Oktober 2012 12:30:50
> >> >> Betreff: RE: Primary
> >> Storage
> >> >>
> >> >>
> Hi Andreas,
> >> >>
> >> >> Hum that's pretty cool, i know they have
> >>
> still trouble with infiniband on
> >> >> nexenta but it's clearly a
> priority
> >> on their roadmap (plus i trust more the
> >> >> nexenta team
> to evolve zfs
> >> than oracle).
> >> >>
> >> >> I agree iscsi over infiniband
> increase the latency
> >> but most of the time
> >> >> it's
> >> >> just more
> simple to use IPOIB than IB,
> >> for example if you use vmware the
> >> >>
> support of infiniband it's assured
> >> by mellanox and not vmware so if
> you
> >> >> have
> >> >> an issue the vmware
> >> support probably don't help
> you anymore.
> >> >>
> >> >> About the raid i'm a fan
> >> of raid 10, i
> prefer build a raid 10 with 7200 sas
> >> >> drive than
> >> multiple vdev
> raidz/raidz2 15k sas drive. Particularly for
> >> >> virtual
> >>
> environment with a random workload and multiple access.
> >> >>
> >> >> I'm
> a fan
> >> of nfs so agreed about the zil, and zeus ram are the only one
> for
> >> >>
> >> that (with classical ssd you can hit the write hole).
> >>
> >>
> >> >> Agreed for
> >> compression too (but only lzjb level gzip use too
> much cpu),
> >> >> disable
> >> atime permit to decrease the iops load but
> i'm not sure is really a
> >> >>
> >> best practice. About the deduplication
> I don't totally agree. It
> >> really
> >> >> depends of your array, the
> workload and the VM type on your
> >> cloudstack.
> >> >> Actually when i
> build a zfs array i count 1Gb ram for 1Tb
> >> disks. With
> >> >>
> deduplication i count 2Gb ram for 1 Tb disks (plus in
> >> nexenta 4 they
> will
> >> >> add a possibility to use ssd for the
> >> deduplication metada,
> like the l2arc
> >> >> for the read cache).
> >> >>
> >> >> The
> >> last point
> it's about your drives, why sata drives? I mean sata
> >> doesn't
> >> >>
> respect their latency, and the reliability of a sas drive is
> >> 10x the
> sata.
> >> >> Plus now you can found many NL-sas drive at low
> >> cost
> >>
> >>
> >> >> But it's really a nice architecture, i never try glusterfs
> >>
> (for the moment)
> >> >> plus in that case it's really a good way to have
> a
> >> replacement to a
> >> >> metro-cluster for free, i try one of is
> competitor
> >> (onefs) and clearly
> >> >> clustered filesystem are the
> futur.
> >> >>
> >> >>
> >> Cheers,
> >> >> Fabrice
> >> >>
> >> >> -----Message
> d'origine-----
> >> >> De : Andreas
> >> Huser
> [mailto:ahuser@7five-edv.de]
> >> >> Envoyé : mardi 23 octobre 2012
> >>
> 11:37
> >> >> À : cloudstack-users@incubator.apache.org
> >> >> Objet :
> Re:
> >> Primary Storage
> >> >>
> >> >> Hi Fabrice,
> >> >>
> >> >> I don't know
> what other people
> >> do but i have no problems with Infiniband +
> >> >>
> GlusterFS + Cloudstack
> >> I'm not use Nexenta it's based on Illumos and
> work
> >> >> not fine with
> >> Infiniband.
> >> >> I have two different
> clusters in productiv
> >> enviroments.
> >> >>
> >> >> The first: Solaris 11
> with built-in GlusterFS 3.3
> >> export Gluster Vol. with
> >> >> RDMA. >
> performance is okey you can use that
> >> for smaller enviroments The
> >>
> >> Second: Is a little bit complex with a
> >> GlusterFS Server in the
> middle
> >> >>
> >> >> ZFS Server: based on Solaris 11
> >> >>
> >> 1.) Create a
> zpool min. two vdevs and SSD read/write cache
> >> >> 2.) Create
> >> a thin
> provisioning Volume "zfs create -V" disable atime and
> >> >> enable
> >>
> compression (do not enable dedup!) and export as (iWarp) SRP Target
> >>
> >>
> >> to the GlusterFS Server use a direct connection without IB
> Switch.
> >> >>
> >> >>
> >> GlusterFS Server:
> >> >> 1.) Use RedHat, CentOS or
> Fedora (I use CentOS 5
> >> and 6)
> >> >> 2.) Use OFED driver from
> https://www.openfabrics.org
> >> >> 3.)
> >> Import the SRP Target from ZFS
> Server and format as xfs
> >> >> 4.) Create a
> >> Glustervolume "volume
> create xy transport rdma " (use only
> >> >> rdma)
> >> >>
> >> 5.) Connect
> with the second IB Port to a IB Switch
> >> >>
> >> >> Cloudstack
> >>
> Hypervisor Node:
> >> >> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5
> and
> >> 6)
> >> >> 2.) Use OFED driver from https://www.openfabrics.org
> >>
> >> 3.) Import
> >> the Glustervolume
> >> >>
> >> >> ZFS Thin Volume ----
> Infinband SRP ---->
> >> GlusterFS ---- GFSVol rdma ---->
> >> >> IB Switch
> ----> Clients
> >> >>
> >> >> The
> >> ZFS and GlusterFS Server form a Storage
> unit connect direct with
> >> 40Gbit
> >> >> Infiniband Point-to-Point You
> do not feel that is a cable
> >> between!
> >> >>
> >> >> Important: When you
> have Infiniband use not IPoIB with
> >> iSCSI! If one
> >> >> already
> >> >>
> has Infiniband then you should also use the
> >> advantage.
> >> >> IPoIB
> have a higher Latency as iWarp SRP!
> >> >>
> >> >>
> >> >> SRP =
> >> usec
> >>
> >>
> >> >> -- SRP --
> >> >> local address: LID 0x01 QPN 0x44004b PSN
> >>
> 0xf3265b RKey 0x9804237c VAddr
> >> >> 0x00000001dda000 remote address:
> LID
> >> 0x0a QPN 0x10004a PSN 0x44072e RKey
> >> >> 0x1c0f115 VAddr
> >>
> 0x000000088e6000
> >> >>
> >>
> ------------------------------------------------------------------
> >>
> >>
> >> #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
> >> >> 2
> 1000
> >> 1.29 125.30 1.31
> >> >>
> >>
> ------------------------------------------------------------------
> >>
> >>
> >> >>
> >> -- IPoIB ---
> >> >> [root@sv01sfogaa ~]# ping 10.200.0.10
> >>
> >> PING
> >> 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
> >> >> 64
> bytes from
> >> 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
> >> >> 64
> bytes from
> >> 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
> >> >>
> >> >>
> When you put load on
> >> IPoIB then increases the latency that is not
> good
> >> >>
> >> >>
> >> >> That is my
> >> recommendation for a simple
> GlusterFS Mirror:
> >> >>
> >> >> - Supermicro Server
> >> with Intel Hardware
> and Expander Backplane
> >> >> - 1x Crucal-M4 SSD Read
> >> Cache
> >> >> - 2x
> ZeusIOPs SSD Write cache (mirror)
> >> >> - SATA 24/7
> >> Harddrives
> >> >>
> - LSI HBA 9207 or 9211
> >> >> - ConnectX-2 QDR Dualport
> >> Infiniband
> Adaper (HP refurbished with full
> >> >> warrany for 100$)
> >> Importent >
> Flash newest Firmware from Mellanox!
> >> >> - Mellanox IB
> >> Swich
> >> >>
> - Solaris 11
> >> >> - GlusterFS 3.3 compiled with ib_verbs
> >> >> -
> >>
> Gluster Volume transport only rdma
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >>
> >>
> >>> Throughput
> >> constant up to 200 Mbyte/s >> more throughput with
> more storage servers
> >> or more harddrives on jbod >> >> >>
> >>
> >> >>
> >>
> >>
> >> >> Info:
> >> >>
> >> >> - I have some
> >> problems with infiniband
> RDMA or SRP with OpenIndiana or
> >> >> Illumos or
> >> Nexenta. Some
> adapters have a high latency or not a stable
> >> >>
> >> connection. Use
> Solaris that's the right way!
> >> >> - OpenIndiana is Beta!
> >> Infiniband
> ib_verbs works not or not fine!
> >> >> - Use Solaris 11
> >> Infiniband
> ib_verbs are native and stable
> >> >> - Don't use Ubuntu Client
> >> Server
> for Infiniband! Use RedHat, Fedora or
> >> >> CentOS and install the
> >>
> right drivers from
> >> >> https://www.openfabrics.org/downloads/OFED/
> >>
> >> -
> >> You have not SSD Cache? Disable at the ZFS Volume the sync!
> >>
> Important!
> >> >> You
> >> >> loose Security for your Data but some
> protocols use
> >> sync flags in
> >> >> transport.
> >> >> Example NFS use by
> default fsync. write
> >> cache are not active. NFS writes
> >> >> data
> direct to the Harddrive. For
> >> Data Security and Performance give
> the
> >> >> Storage Server a SSD write
> >> cache. ZFS works at default with
> sync=standard
> >> >> that prevent write
> >> holes. (COW System)
> >> >>
> >>
> >> I hope that I could help a little
> >> >>
> >> >>
> >> Greeting from
> Germany
> >> >> Andreas
> >> >>
> >> >>
> >> >>
> >> >> ----- Ursprüngliche
> Mail
> >> -----
> >> >>
> >> >> Von: "Fabrice Brazier"
> >> >> An:
> >>
> cloudstack-users@incubator.apache.org
> >> >> Gesendet: Dienstag, 23.
> Oktober
> >> 2012 09:55:15
> >> >> Betreff: RE: Primary Storage
> >> >>
> >> >>
> Hi Andreas,
> >> >>
> >> >>
> >> Hello i just see your configuration, it seems
> quite interesting.
> >> >> If i
> >> understand well you want to build some
> zfs array on the backend.
> >> >>
> >> Export luns (probably by iscsi over
> infiniband) to you linux cluster,
> >> and
> >> >> on
> >> >> the linux
> cluster you put glusterFS.
> >> >> I can understand
> >> the point, with
> that you can have very good performance
> >> >> and
> >> >>
> >> reliability
> (zfs), scalability and redundancy (gluster) for very low
> >> cost.
> >> >>
> So just one question, did you try the global namespace
> >> implementation
> from
> >> >> nexenta?
> >> >> If yes can you tell me what
> >> configuration
> is the best for you?
> >> >> I mean the fact you have a gluster
> >> cluster
> in the middle must impact the
> >> >> overral performance no?
> >> >>
> >>
> >>
> >> Fabrice
> >> >>
> >> >> -----Message d'origine-----
> >> >> De : Andreas
> Huser
> >> [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
> >> >>
> 2012 05:40 À :
> >> cloudstack-users@incubator.apache.org
> >> >> Objet :
> Re: Primary
> >> Storage
> >> >>
> >> >> Hi,
> >> >>
> >> >> for Cloudstack i use
> Solaris 11 ZFS + GlusterFS
> >> over Infiniband (RDMA).
> >> >> That
> >> >>
> gives the best performance and most
> >> scalable Storage.
> >> >> I have
> tasted some different solutions for primary
> >> Storage but the most
> are
> >> >> to expensive and for a CloudStack Cluster
> >> not economic or
> have a poor
> >> >> performance.
> >> >>
> >> >> My Configuration:
> >> >>
> >>
> Storage Node:
> >> >> Supermicro Server (Intel Hardware) with Solaris 11
> with
> >> SSD write and read
> >> >> cache (read crucial-m4, write ZeusIOPS)
> GlusterFS
> >> and dualport ConnectX
> >> >> 40Gbit/s Infiniband adapter.
> >>
> >>
> >> >> I have
> >> installed GlusterFS direct on Solaris with a modified
> code.
> >> >> Want you
> >> build bigger systems for more then 50 VMs it is
> better you split
> >> >> the
> >> Solaris and GlusterFS with a separte
> headnode for GlusterFS
> >> >>
> >> >> That
> >> looks like:
> >> >> Solaris ZFS
> Backendstorage with a dataset Volume (Thin
> >> Provision) --> ( SRP
> >> >>
> Target attached direct without Infiniband switch
> >> to GF Node) -->
> GlusterFS
> >> >> Node the srp target formatted with xfs
> >> filesystem,
> create a GlusterFS
> >> >> Volume --> ( Infiniband over a
> >> Mellanox Port
> Switch) --> Cloudstack Node
> >> >> mount glusterFS Volume over
> >> RDMA
> >>
> >>
> >> >> For the Dataset Volume at the ZFS Storage, disable atime and
> >>
> enable
> >> >> compression.
> >> >> (Space reclaim) With compression you
> can
> >> shrink the ZFS Volume with command
> >> >> at Linux dd /dev/zero or
> In a
> >> Windows VM with sdelete That gives you space
> >> >> left on the
> Primary
> >> Storage for deleted Files in a VM or for deleted vhd's
> >> >>
> or vm's in the
> >> cloudstack
> >> >>
> >> >> greeting Andreas
> >> >>
> >> >>
> >>
> >>
> >> >>
> >> >> Mit freundlichen
> >> Grüßen
> >> >>
> >> >> Andreas Huser
> >>
> >> Geschäftsführer
> >> >> System Engineer /
> >> Consultant
> >> >> (Cisco
> CSE, SMBAM, LCSE, ASAM)
> >> >>
> >>
> ---------------------------------------
> >> >> Zellerstraße 28 - 77654
> >>
> Offenburg
> >> >> Tel: +49(781) 12786898
> >> >> Mobil: +49(176) 10308549
> >>
> >>
> >> ahuser@7five-edv.de
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> -----
> Ursprüngliche Mail -----
> >> >>
> >> >>
> >> Von: "Outback Dingo"
> >> >> An:
> cloudstack-users@incubator.apache.org
> >> >>
> >> Gesendet: Dienstag, 23.
> Oktober 2012 02:15:16
> >> >> Betreff: Re: Primary
> >> Storage
> >> >>
> >> >>
> On Mon, Oct 22, 2012 at 8:09 PM, Ivan
> >> Rodriguezwrote:
> >> >>
> >>
> >>
> >>
> >> >>> Solaris 11 ZFS and yes we tried different
> >> setups, raids
> levels number >> of SSD cache, ARC zfs options etc etc
> >> etc. >> >>
> Cheers >> >> >> VMWare ?? >>
> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >>
>
>

Re: Primary Storage - All those NEGATIVE TROLLS SHUT UP!

Posted by br...@v365.com.au.
  

Oh well, if you're negative and make rude comments, then that's your
problem; I don't care about you.

If you don't believe that RDMA can perform this fast then, again, that's
your problem.

Ask anyone with a decent 12+ disc RAID array and RDMA and they will tell
you it will kick you in YOUR ballz! Anyone in storage knows that to get
IOPS and performance you need 12, preferably 16, spindles.

The spinning rust, as you call it, is the newest 200MB/s WD VelociRaptors
with 64MB cache: 10,000RPM 2.5" discs in a 32-bay storage chassis.

We've used them in the past and they are as reliable as the SAS drives we
use, but faster!

RAID 10 IS BULLSHIT! Bloody slow, and 50% of the capacity goes up in smoke
for nothing. All you get is the mirror speed of 2 drives, which is barely
400MB/s tops! I spit on RAID 10.


We also tried RAID 60 and it was very good as well, but there was no point
using it as we replicate to a 2nd SAN.

When we tested our new SANs, we failed 1 drive and set the rebuild rate to
30% on RAID 6, with a 40% scan rate. The SAN only had 6TB of data on it,
and it rebuilt in 3 hours. We removed a 2nd disc, replaced it with a blank,
and it was rebuilt in another 4 hours, with no real impact on the
performance tests.
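
For anyone who wants to repeat that test: on an LSI controller the rebuild
rate (and, assuming "scan rate" means patrol read, that too) can be set
from MegaCLI. A rough sketch -- option spellings vary between MegaCLI
versions, and the enclosure:slot address is only an example:

# set the rebuild rate to 30% and the patrol-read ("scan") rate to 40% on adapter 0
MegaCli64 -AdpSetProp RebuildRate 30 -a0
MegaCli64 -AdpSetProp PatrolReadRate 40 -a0

# pull a drive logically to simulate the failure, then watch the rebuild progress
MegaCli64 -PDOffline -PhysDrv [252:3] -a0
MegaCli64 -PDRbld -ShowProg -PhysDrv [252:3] -a0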

Separating the underlying 8 x 6Gb/s SATA ports into effectively 16 x
3Gb/s channels using a SAS expander gives each disc 300MB/s of bandwidth.
They can't physically perform better than 200~220MB/s, so there is enough
bandwidth on the SATA and PCIe bus to cope. LSI rate the 9280-8i at
2500MB/s, but it does work faster with CacheCade: up to 3200MB/s in that
test.
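
As a quick back-of-the-envelope check on those figures (all numbers
approximate):

# 8 controller ports x 6Gb/s              ~ 4800MB/s raw into the expander
# expander fans out to 16 drive slots     = 16 x 3Gb/s ~ 300MB/s per slot
#   (3Gb/s SATA nets roughly 300MB/s after 8b/10b encoding)
# one VelociRaptor sustains ~200-220MB/s, so every slot has headroom
# 16 drives x ~160MB/s sustained in RAID 6 ~ 2500-2600MB/s, matching the test above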

So a real SAN has many high-performance spindles, and the WD VRs are
better than some cheap SSD drives.

RDMA is very fast, with low CPU overhead, and reads and writes directly to
RAM @ 40Gb/s. We created a RAM drive and tested it up to 3200MB/s, which
is as fast as the PCIe 2 bus / LSI could handle.
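
A minimal sketch of that RAM-drive test, if anyone wants to reproduce the
local ceiling before blaming the fabric (size and paths are just examples;
over SRP you would export the RAM-backed device and run the same dd from
the initiator):

# back a scratch area with RAM and push sequential I/O through it
mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/test.bin bs=1M count=4096
dd if=/mnt/ramdisk/test.bin of=/dev/null bs=1M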

If you can't afford this relatively cheap equipment (compared to an IBM DS
or HP StorageWorks SAN) then don't whinge at ME. You can't get fast
performance. Any real cloud would have what we have built. None of our
customers, ALL 300 of them, have lost a single bit of data, and all VMs
have been running very quickly, with no problems, for 3 months now.

So if you don't appreciate being shown how to build a kick-ass SAN then
PISS OFF!

On 24.10.2012 08:17, Jason Davis wrote:
> How often do you have
folks cache missing and hitting the spinning rust
> below the cache cade
tier?
> On Oct 23, 2012 6:25 PM, wrote:
> 
>>
>>
>> Hi
>>
>> with our
SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid
>> Director 4036
switches + HP Gen2 QDR Chassis Switches.
>>
>> Our San achieves
>>
295,000 - 350,000 Iops max per LSI controller, depending on block
size
>> for i/o meter test. LSI9280-8i with battery cache. +
cachecade2.0 x 2 in
>> PCIE 16x slot on supermicro motherboard.
>>
>> 1
x 250GB SSD for CACHE Cade +
>> 16 1tb WD VR drives (16 - hot stand buy
- 2 for raid 6 parity) =
>> 2600MB/s
>>
>> most of our test are from 4K
to 128K block size, @ 4K we get
>> 295,000 Iops @ 1590 MB/s and @ 128K
350,000 iops @ 2400~2600MB/s
>>
>> we
>> have tuned 64K and 128K block
size on different luns. 64K for database
>> and 128K for general
file.
>>
>> The best thing about Infiniband is low CPU
>> cycles. only
5~6% during these tests.
>>
>> latency is as low as 4-6ms
>> average
read time. concurrent response times are from 5-12ms. even under
>>
heavy load its below 20ms
>>
>> Infiniband latency is below 0.01 us
Which is
>> why we chose it.
>>
>> we run MSSQL on the 64K formatted
Luns, its massively
>> fast.
>>
>> If we copy the same data twice the
cache cade kicks in and we
>> achieve even better speeds.
>>
>> Ive
compared this to block io @ 8Gb/s Fiber
>> channel, and it barely gets
120,000 IOPS. @ much higher latency and
>> bandwidth.
>>
>> So for our
money RDMA wins!
>>
>> Bruce
>>
>> On 23.10.2012 21:20,
>> Jason Davis
wrote:
>> > Out of curiosity, is there any quick performance
>> numbers
for these ZFS +
>> > GlusterFS mashups you guys are talking
>> about?
>>
>
>> > Specifically, IOPs and latency? Sequential read/write
>>
performance honestly
>> > isn't a very good benchmark to determine
your
>> SANs performance. It's like
>> > comparing CPUs based solely on
how many
>> GHz it runs at. Sure you can get
>> > great MB or GB/s with
SATA disk but
>> I'd reckon that IOP performance is
>> > abismal. If you
are utilizing
>> GlusterFS without the cache pooling magic
>> > that is
ZFS then I would
>> imagine that latency can be an issue.
>> >
>> >
>>
>
>> > On Tue, Oct 23, 2012 at
>> 7:56 AM, Andreas Huser wrote:
>> >
>>
>> Hi Fabrice,
>> >>
>> >> i know
>> OpenSolaris/Solaris Oracle it's so
a thing.
>> >> I'm for more then 10
>> years a open source user and
that
>> >> with oracle - i did no like at the
>> beginning of this
constallation.
>> >> But Oracle makes his work good i
>> know that. The
cost of one socket
>> >> are 700$ and you can use so much
>> quantity of
TB as you will.
>> >> And you can use the full premier Support
>> from
Oracle.
>> >> Nexenta develop with the Illumos code. And the Licence
>>
are TB based.
>> >> That is not my favorite. As well the pool version
from
>> Nexenta comes
>> >> not after. Current Nexenta Infiniband are
not a usable
>> solution.
>> >> But every can use what he will. Everyone
must decide for
>> themselves.
>> >>
>> >> SRP Targets or iser are not
difficult to configure.
>> Use the SRP for
>> >> the Storage unit
connection. Solaris and GlusterFS
>> builds one Storage unit.
>> >> The
GlusterFS Server export the final Volume
>> to the Clients as well
KVM,
>> >> VMWare, Hyper-V etc.
>> >> You can use
>> native GlusterFS,
RDMA, NFS ore CIFS to export the Volume.
>> >> SRP have
>> nothing to do
with VmWare.
>> >>
>> >> When you are use a 7200 SAS drive the
>> access
time are the same as a
>> >> SATA drive only the quality of the
>>
hardware are better. When you need
>> >> Performance you must use SAS
>>
drives with 15000U/m. But it's not needed
>> >> when you install SSD
for
>> ZIL/L2ARC. ZeusRAM rocks :-)
>> >>
>> >> I use dedup only at
secondary stroage
>> or on Backupserver not on Primary
>> >> Storage.
>>
>> When you use SSD SATA
>> drives then you have an cheap an fast
strorage.
>> >> 1TB drive cost unter
>> 100$. Current i'm not need to
save storage volume.
>> >>
>> >> Which
>> application use atime? I know
"find -atime N". atime on a Storage
>> >>
>> that stores only Virtual
Disks?
>> >> I don't need the information when i
>> have last opening
the disk :-)
>> >>
>> >> For a Solaris GUI take a look at
>> napp-it
>>
>> http://www.napp-it.org/
>> >>
>> >> greetings from germany
>> >>
>>
Andreas
>> >>
>> >>
>> >>
>> >>
>> >> ----- Ursprüngliche Mail -----
>>
>>
>> >> Von:
>> "Fabrice Brazier">> An:
cloudstack-users@incubator.apache.org
>> >>
>> Gesendet: Dienstag, 23.
Oktober 2012 12:30:50
>> >> Betreff: RE: Primary
>> Storage
>> >>
>> >>
Hi Andreas,
>> >>
>> >> Hum that's pretty cool, i know they have
>>
still trouble with infiniband on
>> >> nexenta but it's clearly a
priority
>> on their roadmap (plus i trust more the
>> >> nexenta team
to evolve zfs
>> than oracle).
>> >>
>> >> I agree iscsi over infiniband
increase the latency
>> but most of the time
>> >> it's
>> >> just more
simple to use IPOIB than IB,
>> for example if you use vmware the
>> >>
support of infiniband it's assured
>> by mellanox and not vmware so if
you
>> >> have
>> >> an issue the vmware
>> support probably don't help
you anymore.
>> >>
>> >> About the raid i'm a fan
>> of raid 10, i
prefer build a raid 10 with 7200 sas
>> >> drive than
>> multiple vdev
raidz/raidz2 15k sas drive. Particularly for
>> >> virtual
>>
environment with a random workload and multiple access.
>> >>
>> >> I'm
a fan
>> of nfs so agreed about the zil, and zeus ram are the only one
for
>> >>
>> that (with classical ssd you can hit the write hole).
>>
>>
>> >> Agreed for
>> compression too (but only lzjb level gzip use too
much cpu),
>> >> disable
>> atime permit to decrease the iops load but
i'm not sure is really a
>> >>
>> best practice. About the deduplication
I don't totally agree. It
>> really
>> >> depends of your array, the
workload and the VM type on your
>> cloudstack.
>> >> Actually when i
build a zfs array i count 1Gb ram for 1Tb
>> disks. With
>> >>
deduplication i count 2Gb ram for 1 Tb disks (plus in
>> nexenta 4 they
will
>> >> add a possibility to use ssd for the
>> deduplication metada,
like the l2arc
>> >> for the read cache).
>> >>
>> >> The
>> last point
it's about your drives, why sata drives? I mean sata
>> doesn't
>> >>
respect their latency, and the reliability of a sas drive is
>> 10x the
sata.
>> >> Plus now you can found many NL-sas drive at low
>> cost
>>
>>
>> >> But it's really a nice architecture, i never try glusterfs
>>
(for the moment)
>> >> plus in that case it's really a good way to have
a
>> replacement to a
>> >> metro-cluster for free, i try one of is
competitor
>> (onefs) and clearly
>> >> clustered filesystem are the
futur.
>> >>
>> >>
>> Cheers,
>> >> Fabrice
>> >>
>> >> -----Message
d'origine-----
>> >> De : Andreas
>> Huser
[mailto:ahuser@7five-edv.de]
>> >> Envoyé : mardi 23 octobre 2012
>>
11:37
>> >> À : cloudstack-users@incubator.apache.org
>> >> Objet :
Re:
>> Primary Storage
>> >>
>> >> Hi Fabrice,
>> >>
>> >> I don't know
what other people
>> do but i have no problems with Infiniband +
>> >>
GlusterFS + Cloudstack
>> I'm not use Nexenta it's based on Illumos and
work
>> >> not fine with
>> Infiniband.
>> >> I have two different
clusters in productiv
>> enviroments.
>> >>
>> >> The first: Solaris 11
with built-in GlusterFS 3.3
>> export Gluster Vol. with
>> >> RDMA. >
performance is okey you can use that
>> for smaller enviroments The
>>
>> Second: Is a little bit complex with a
>> GlusterFS Server in the
middle
>> >>
>> >> ZFS Server: based on Solaris 11
>> >>
>> 1.) Create a
zpool min. two vdevs and SSD read/write cache
>> >> 2.) Create
>> a thin
provisioning Volume "zfs create -V" disable atime and
>> >> enable
>>
compression (do not enable dedup!) and export as (iWarp) SRP Target
>>
>>
>> to the GlusterFS Server use a direct connection without IB
Switch.
>> >>
>> >>
>> GlusterFS Server:
>> >> 1.) Use RedHat, CentOS or
Fedora (I use CentOS 5
>> and 6)
>> >> 2.) Use OFED driver from
https://www.openfabrics.org
>> >> 3.)
>> Import the SRP Target from ZFS
Server and format as xfs
>> >> 4.) Create a
>> Glustervolume "volume
create xy transport rdma " (use only
>> >> rdma)
>> >>
>> 5.) Connect
with the second IB Port to a IB Switch
>> >>
>> >> Cloudstack
>>
Hypervisor Node:
>> >> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5
and
>> 6)
>> >> 2.) Use OFED driver from https://www.openfabrics.org
>>
>> 3.) Import
>> the Glustervolume
>> >>
>> >> ZFS Thin Volume ----
Infinband SRP ---->
>> GlusterFS ---- GFSVol rdma ---->
>> >> IB Switch
----> Clients
>> >>
>> >> The
>> ZFS and GlusterFS Server form a Storage
unit connect direct with
>> 40Gbit
>> >> Infiniband Point-to-Point You
do not feel that is a cable
>> between!
>> >>
>> >> Important: When you
have Infiniband use not IPoIB with
>> iSCSI! If one
>> >> already
>> >>
has Infiniband then you should also use the
>> advantage.
>> >> IPoIB
have a higher Latency as iWarp SRP!
>> >>
>> >>
>> >> SRP =
>> usec
>>
>>
>> >> -- SRP --
>> >> local address: LID 0x01 QPN 0x44004b PSN
>>
0xf3265b RKey 0x9804237c VAddr
>> >> 0x00000001dda000 remote address:
LID
>> 0x0a QPN 0x10004a PSN 0x44072e RKey
>> >> 0x1c0f115 VAddr
>>
0x000000088e6000
>> >>
>>
------------------------------------------------------------------
>>
>>
>> #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
>> >> 2
1000
>> 1.29 125.30 1.31
>> >>
>>
------------------------------------------------------------------
>>
>>
>> >>
>> -- IPoIB ---
>> >> [root@sv01sfogaa ~]# ping 10.200.0.10
>>
>> PING
>> 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
>> >> 64
bytes from
>> 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
>> >> 64
bytes from
>> 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
>> >>
>> >>
When you put load on
>> IPoIB then increases the latency that is not
good
>> >>
>> >>
>> >> That is my
>> recommendation for a simple
GlusterFS Mirror:
>> >>
>> >> - Supermicro Server
>> with Intel Hardware
and Expander Backplane
>> >> - 1x Crucal-M4 SSD Read
>> Cache
>> >> - 2x
ZeusIOPs SSD Write cache (mirror)
>> >> - SATA 24/7
>> Harddrives
>> >>
- LSI HBA 9207 or 9211
>> >> - ConnectX-2 QDR Dualport
>> Infiniband
Adaper (HP refurbished with full
>> >> warrany for 100$)
>> Importent >
Flash newest Firmware from Mellanox!
>> >> - Mellanox IB
>> Swich
>> >>
- Solaris 11
>> >> - GlusterFS 3.3 compiled with ib_verbs
>> >> -
>>
Gluster Volume transport only rdma
>> >>
>> >>
>> >>
>> >>
>> >>
>>
>>
>>> Throughput
>> constant up to 200 Mbyte/s >> more throughput with
more storage servers
>> or more harddrives on jbod >> >> >>
>>
>> >>
>>
>>
>> >> Info:
>> >>
>> >> - I have some
>> problems with infiniband
RDMA or SRP with OpenIndiana or
>> >> Illumos or
>> Nexenta. Some
adapters have a high latency or not a stable
>> >>
>> connection. Use
Solaris that's the right way!
>> >> - OpenIndiana is Beta!
>> Infiniband
ib_verbs works not or not fine!
>> >> - Use Solaris 11
>> Infiniband
ib_verbs are native and stable
>> >> - Don't use Ubuntu Client
>> Server
for Infiniband! Use RedHat, Fedora or
>> >> CentOS and install the
>>
right drivers from
>> >> https://www.openfabrics.org/downloads/OFED/
>>
>> -
>> You have not SSD Cache? Disable at the ZFS Volume the sync!
>>
Important!
>> >> You
>> >> loose Security for your Data but some
protocols use
>> sync flags in
>> >> transport.
>> >> Example NFS use by
default fsync. write
>> cache are not active. NFS writes
>> >> data
direct to the Harddrive. For
>> Data Security and Performance give
the
>> >> Storage Server a SSD write
>> cache. ZFS works at default with
sync=standard
>> >> that prevent write
>> holes. (COW System)
>> >>
>>
>> I hope that I could help a little
>> >>
>> >>
>> Greeting from
Germany
>> >> Andreas
>> >>
>> >>
>> >>
>> >> ----- Ursprüngliche
Mail
>> -----
>> >>
>> >> Von: "Fabrice Brazier"
>> >> An:
>>
cloudstack-users@incubator.apache.org
>> >> Gesendet: Dienstag, 23.
Oktober
>> 2012 09:55:15
>> >> Betreff: RE: Primary Storage
>> >>
>> >>
Hi Andreas,
>> >>
>> >>
>> Hello i just see your configuration, it seems
quite interesting.
>> >> If i
>> understand well you want to build some
zfs array on the backend.
>> >>
>> Export luns (probably by iscsi over
infiniband) to you linux cluster,
>> and
>> >> on
>> >> the linux
cluster you put glusterFS.
>> >> I can understand
>> the point, with
that you can have very good performance
>> >> and
>> >>
>> reliability
(zfs), scalability and redundancy (gluster) for very low
>> cost.
>> >>
So just one question, did you try the global namespace
>> implementation
from
>> >> nexenta?
>> >> If yes can you tell me what
>> configuration
is the best for you?
>> >> I mean the fact you have a gluster
>> cluster
in the middle must impact the
>> >> overral performance no?
>> >>
>>
>>
>> Fabrice
>> >>
>> >> -----Message d'origine-----
>> >> De : Andreas
Huser
>> [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
>> >>
2012 05:40 À :
>> cloudstack-users@incubator.apache.org
>> >> Objet :
Re: Primary
>> Storage
>> >>
>> >> Hi,
>> >>
>> >> for Cloudstack i use
Solaris 11 ZFS + GlusterFS
>> over Infiniband (RDMA).
>> >> That
>> >>
gives the best performance and most
>> scalable Storage.
>> >> I have
tasted some different solutions for primary
>> Storage but the most
are
>> >> to expensive and for a CloudStack Cluster
>> not economic or
have a poor
>> >> performance.
>> >>
>> >> My Configuration:
>> >>
>>
Storage Node:
>> >> Supermicro Server (Intel Hardware) with Solaris 11
with
>> SSD write and read
>> >> cache (read crucial-m4, write ZeusIOPS)
GlusterFS
>> and dualport ConnectX
>> >> 40Gbit/s Infiniband adapter.
>>
>>
>> >> I have
>> installed GlusterFS direct on Solaris with a modified
code.
>> >> Want you
>> build bigger systems for more then 50 VMs it is
better you split
>> >> the
>> Solaris and GlusterFS with a separte
headnode for GlusterFS
>> >>
>> >> That
>> looks like:
>> >> Solaris ZFS
Backendstorage with a dataset Volume (Thin
>> Provision) --> ( SRP
>> >>
Target attached direct without Infiniband switch
>> to GF Node) -->
GlusterFS
>> >> Node the srp target formatted with xfs
>> filesystem,
create a GlusterFS
>> >> Volume --> ( Infiniband over a
>> Mellanox Port
Switch) --> Cloudstack Node
>> >> mount glusterFS Volume over
>> RDMA
>>
>>
>> >> For the Dataset Volume at the ZFS Storage, disable atime and
>>
enable
>> >> compression.
>> >> (Space reclaim) With compression you
can
>> shrink the ZFS Volume with command
>> >> at Linux dd /dev/zero or
In a
>> Windows VM with sdelete That gives you space
>> >> left on the
Primary
>> Storage for deleted Files in a VM or for deleted vhd's
>> >>
or vm's in the
>> cloudstack
>> >>
>> >> greeting Andreas
>> >>
>> >>
>>
>>
>> >>
>> >> Mit freundlichen
>> Grüßen
>> >>
>> >> Andreas Huser
>>
>> Geschäftsführer
>> >> System Engineer /
>> Consultant
>> >> (Cisco
CSE, SMBAM, LCSE, ASAM)
>> >>
>>
---------------------------------------
>> >> Zellerstraße 28 - 77654
>>
Offenburg
>> >> Tel: +49(781) 12786898
>> >> Mobil: +49(176) 10308549
>>
>>
>> ahuser@7five-edv.de
>> >>
>> >>
>> >>
>> >>
>> >> -----
Ursprüngliche Mail -----
>> >>
>> >>
>> Von: "Outback Dingo"
>> >> An:
cloudstack-users@incubator.apache.org
>> >>
>> Gesendet: Dienstag, 23.
Oktober 2012 02:15:16
>> >> Betreff: Re: Primary
>> Storage
>> >>
>> >>
On Mon, Oct 22, 2012 at 8:09 PM, Ivan
>> Rodriguezwrote:
>> >>
>>
>>
>>
>> >>> Solaris 11 ZFS and yes we tried different
>> setups, raids
levels number >> of SSD cache, ARC zfs options etc etc
>> etc. >> >>
Cheers >> >> >> VMWare ?? >>
>>
>> >>
>> >>
>> >>
>> >>
>>

 

Re: Primary Storage

Posted by Jason Davis <sc...@gmail.com>.
How often do you see cache misses that hit the spinning rust below the
CacheCade tier?
On Oct 23, 2012 6:25 PM, <br...@v365.com.au> wrote:

>
>
> Hi
>
> with our SCST RDMA we have 40GB/s QDR Gen 2 with Mellanox Grid
> Director 4036 switches + HP Gen2 QDR Chassis Switches.
>
> Our San achieves
> 295,000 - 350,000 Iops max per LSI controller, depending on block size
> for i/o meter test. LSI9280-8i with battery cache. + cachecade2.0 x 2 in
> PCIE 16x slot on supermicro motherboard.
>
> 1 x 250GB SSD for CACHE Cade +
> 16 1tb WD VR drives (16 - hot stand buy - 2 for raid 6 parity) =
> 2600MB/s
>
> most of our test are from 4K to 128K block size, @ 4K we get
> 295,000 Iops @ 1590 MB/s and @ 128K 350,000 iops @ 2400~2600MB/s
>
> we
> have tuned 64K and 128K block size on different luns. 64K for database
> and 128K for general file.
>
> The best thing about Infiniband is low CPU
> cycles. only 5~6% during these tests.
>
> latency is as low as 4-6ms
> average read time. concurrent response times are from 5-12ms. even under
> heavy load its below 20ms
>
> Infiniband latency is below 0.01 us Which is
> why we chose it.
>
> we run MSSQL on the 64K formatted Luns, its massively
> fast.
>
> If we copy the same data twice the cache cade kicks in and we
> achieve even better speeds.
>
> Ive compared this to block io @ 8Gb/s Fiber
> channel, and it barely gets 120,000 IOPS. @ much higher latency and
> bandwidth.
>
> So for our money RDMA wins!
>
> Bruce
>
> On 23.10.2012 21:20,
> Jason Davis wrote:
> > Out of curiosity, is there any quick performance
> numbers for these ZFS +
> > GlusterFS mashups you guys are talking
> about?
> >
> > Specifically, IOPs and latency? Sequential read/write
> performance honestly
> > isn't a very good benchmark to determine your
> SANs performance. It's like
> > comparing CPUs based solely on how many
> GHz it runs at. Sure you can get
> > great MB or GB/s with SATA disk but
> I'd reckon that IOP performance is
> > abismal. If you are utilizing
> GlusterFS without the cache pooling magic
> > that is ZFS then I would
> imagine that latency can be an issue.
> >
> >
> >
> > On Tue, Oct 23, 2012 at
> 7:56 AM, Andreas Huser wrote:
> >
> >> Hi Fabrice,
> >>
> >> i know
> OpenSolaris/Solaris Oracle it's so a thing.
> >> I'm for more then 10
> years a open source user and that
> >> with oracle - i did no like at the
> beginning of this constallation.
> >> But Oracle makes his work good i
> know that. The cost of one socket
> >> are 700$ and you can use so much
> quantity of TB as you will.
> >> And you can use the full premier Support
> from Oracle.
> >> Nexenta develop with the Illumos code. And the Licence
> are TB based.
> >> That is not my favorite. As well the pool version from
> Nexenta comes
> >> not after. Current Nexenta Infiniband are not a usable
> solution.
> >> But every can use what he will. Everyone must decide for
> themselves.
> >>
> >> SRP Targets or iser are not difficult to configure.
> Use the SRP for
> >> the Storage unit connection. Solaris and GlusterFS
> builds one Storage unit.
> >> The GlusterFS Server export the final Volume
> to the Clients as well KVM,
> >> VMWare, Hyper-V etc.
> >> You can use
> native GlusterFS, RDMA, NFS ore CIFS to export the Volume.
> >> SRP have
> nothing to do with VmWare.
> >>
> >> When you are use a 7200 SAS drive the
> access time are the same as a
> >> SATA drive only the quality of the
> hardware are better. When you need
> >> Performance you must use SAS
> drives with 15000U/m. But it's not needed
> >> when you install SSD for
> ZIL/L2ARC. ZeusRAM rocks :-)
> >>
> >> I use dedup only at secondary stroage
> or on Backupserver not on Primary
> >> Storage.
> >> When you use SSD SATA
> drives then you have an cheap an fast strorage.
> >> 1TB drive cost unter
> 100$. Current i'm not need to save storage volume.
> >>
> >> Which
> application use atime? I know "find -atime N". atime on a Storage
> >>
> that stores only Virtual Disks?
> >> I don't need the information when i
> have last opening the disk :-)
> >>
> >> For a Solaris GUI take a look at
> napp-it
> >> http://www.napp-it.org/
> >>
> >> greetings from germany
> >>
> Andreas
> >>
> >>
> >>
> >>
> >> ----- Ursprüngliche Mail -----
> >>
> >> Von:
> "Fabrice Brazier">> An: cloudstack-users@incubator.apache.org
> >>
> Gesendet: Dienstag, 23. Oktober 2012 12:30:50
> >> Betreff: RE: Primary
> Storage
> >>
> >> Hi Andreas,
> >>
> >> Hum that's pretty cool, i know they have
> still trouble with infiniband on
> >> nexenta but it's clearly a priority
> on their roadmap (plus i trust more the
> >> nexenta team to evolve zfs
> than oracle).
> >>
> >> I agree iscsi over infiniband increase the latency
> but most of the time
> >> it's
> >> just more simple to use IPOIB than IB,
> for example if you use vmware the
> >> support of infiniband it's assured
> by mellanox and not vmware so if you
> >> have
> >> an issue the vmware
> support probably don't help you anymore.
> >>
> >> About the raid i'm a fan
> of raid 10, i prefer build a raid 10 with 7200 sas
> >> drive than
> multiple vdev raidz/raidz2 15k sas drive. Particularly for
> >> virtual
> environment with a random workload and multiple access.
> >>
> >> I'm a fan
> of nfs so agreed about the zil, and zeus ram are the only one for
> >>
> that (with classical ssd you can hit the write hole).
> >>
> >> Agreed for
> compression too (but only lzjb level gzip use too much cpu),
> >> disable
> atime permit to decrease the iops load but i'm not sure is really a
> >>
> best practice. About the deduplication I don't totally agree. It
> really
> >> depends of your array, the workload and the VM type on your
> cloudstack.
> >> Actually when i build a zfs array i count 1Gb ram for 1Tb
> disks. With
> >> deduplication i count 2Gb ram for 1 Tb disks (plus in
> nexenta 4 they will
> >> add a possibility to use ssd for the
> deduplication metada, like the l2arc
> >> for the read cache).
> >>
> >> The
> last point it's about your drives, why sata drives? I mean sata
> doesn't
> >> respect their latency, and the reliability of a sas drive is
> 10x the sata.
> >> Plus now you can found many NL-sas drive at low
> cost
> >>
> >> But it's really a nice architecture, i never try glusterfs
> (for the moment)
> >> plus in that case it's really a good way to have a
> replacement to a
> >> metro-cluster for free, i try one of is competitor
> (onefs) and clearly
> >> clustered filesystem are the futur.
> >>
> >>
> Cheers,
> >> Fabrice
> >>
> >> -----Message d'origine-----
> >> De : Andreas
> Huser [mailto:ahuser@7five-edv.de]
> >> Envoyé : mardi 23 octobre 2012
> 11:37
> >> À : cloudstack-users@incubator.apache.org
> >> Objet : Re:
> Primary Storage
> >>
> >> Hi Fabrice,
> >>
> >> I don't know what other people
> do but i have no problems with Infiniband +
> >> GlusterFS + Cloudstack
> I'm not use Nexenta it's based on Illumos and work
> >> not fine with
> Infiniband.
> >> I have two different clusters in productiv
> enviroments.
> >>
> >> The first: Solaris 11 with built-in GlusterFS 3.3
> export Gluster Vol. with
> >> RDMA. > performance is okey you can use that
> for smaller enviroments The
> >> Second: Is a little bit complex with a
> GlusterFS Server in the middle
> >>
> >> ZFS Server: based on Solaris 11
> >>
> 1.) Create a zpool min. two vdevs and SSD read/write cache
> >> 2.) Create
> a thin provisioning Volume "zfs create -V" disable atime and
> >> enable
> compression (do not enable dedup!) and export as (iWarp) SRP Target
> >>
> to the GlusterFS Server use a direct connection without IB Switch.
> >>
> >>
> GlusterFS Server:
> >> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5
> and 6)
> >> 2.) Use OFED driver from https://www.openfabrics.org
> >> 3.)
> Import the SRP Target from ZFS Server and format as xfs
> >> 4.) Create a
> Glustervolume "volume create xy transport rdma " (use only
> >> rdma)
> >>
> 5.) Connect with the second IB Port to a IB Switch
> >>
> >> Cloudstack
> Hypervisor Node:
> >> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and
> 6)
> >> 2.) Use OFED driver from https://www.openfabrics.org
> >> 3.) Import
> the Glustervolume
> >>
> >> ZFS Thin Volume ---- Infinband SRP ---->
> GlusterFS ---- GFSVol rdma ---->
> >> IB Switch ----> Clients
> >>
> >> The
> ZFS and GlusterFS Server form a Storage unit connect direct with
> 40Gbit
> >> Infiniband Point-to-Point You do not feel that is a cable
> between!
> >>
> >> Important: When you have Infiniband use not IPoIB with
> iSCSI! If one
> >> already
> >> has Infiniband then you should also use the
> advantage.
> >> IPoIB have a higher Latency as iWarp SRP!
> >>
> >>
> >> SRP =
> usec
> >>
> >> -- SRP --
> >> local address: LID 0x01 QPN 0x44004b PSN
> 0xf3265b RKey 0x9804237c VAddr
> >> 0x00000001dda000 remote address: LID
> 0x0a QPN 0x10004a PSN 0x44072e RKey
> >> 0x1c0f115 VAddr
> 0x000000088e6000
> >>
> ------------------------------------------------------------------
> >>
> #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
> >> 2 1000
> 1.29 125.30 1.31
> >>
> ------------------------------------------------------------------
> >>
> >>
> -- IPoIB ---
> >> [root@sv01sfogaa ~]# ping 10.200.0.10
> >> PING
> 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
> >> 64 bytes from
> 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
> >> 64 bytes from
> 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
> >>
> >> When you put load on
> IPoIB then increases the latency that is not good
> >>
> >>
> >> That is my
> recommendation for a simple GlusterFS Mirror:
> >>
> >> - Supermicro Server
> with Intel Hardware and Expander Backplane
> >> - 1x Crucal-M4 SSD Read
> Cache
> >> - 2x ZeusIOPs SSD Write cache (mirror)
> >> - SATA 24/7
> Harddrives
> >> - LSI HBA 9207 or 9211
> >> - ConnectX-2 QDR Dualport
> Infiniband Adaper (HP refurbished with full
> >> warrany for 100$)
> Importent > Flash newest Firmware from Mellanox!
> >> - Mellanox IB
> Swich
> >> - Solaris 11
> >> - GlusterFS 3.3 compiled with ib_verbs
> >> -
> Gluster Volume transport only rdma
> >>
> >>
> >>
> >>
> >>
>
> >>> Throughput
> constant up to 200 Mbyte/s >> more throughput with more storage servers
> or more harddrives on jbod >> >> >>
>
> >>
> >>
> >> Info:
> >>
> >> - I have some
> problems with infiniband RDMA or SRP with OpenIndiana or
> >> Illumos or
> Nexenta. Some adapters have a high latency or not a stable
> >>
> connection. Use Solaris that's the right way!
> >> - OpenIndiana is Beta!
> Infiniband ib_verbs works not or not fine!
> >> - Use Solaris 11
> Infiniband ib_verbs are native and stable
> >> - Don't use Ubuntu Client
> Server for Infiniband! Use RedHat, Fedora or
> >> CentOS and install the
> right drivers from
> >> https://www.openfabrics.org/downloads/OFED/
> >> -
> You have not SSD Cache? Disable at the ZFS Volume the sync!
> Important!
> >> You
> >> loose Security for your Data but some protocols use
> sync flags in
> >> transport.
> >> Example NFS use by default fsync. write
> cache are not active. NFS writes
> >> data direct to the Harddrive. For
> Data Security and Performance give the
> >> Storage Server a SSD write
> cache. ZFS works at default with sync=standard
> >> that prevent write
> holes. (COW System)
> >>
> >> I hope that I could help a little
> >>
> >>
> Greeting from Germany
> >> Andreas
> >>
> >>
> >>
> >> ----- Ursprüngliche Mail
> -----
> >>
> >> Von: "Fabrice Brazier"
> >> An:
> cloudstack-users@incubator.apache.org
> >> Gesendet: Dienstag, 23. Oktober
> 2012 09:55:15
> >> Betreff: RE: Primary Storage
> >>
> >> Hi Andreas,
> >>
> >>
> Hello i just see your configuration, it seems quite interesting.
> >> If i
> understand well you want to build some zfs array on the backend.
> >>
> Export luns (probably by iscsi over infiniband) to you linux cluster,
> and
> >> on
> >> the linux cluster you put glusterFS.
> >> I can understand
> the point, with that you can have very good performance
> >> and
> >>
> reliability (zfs), scalability and redundancy (gluster) for very low
> cost.
> >> So just one question, did you try the global namespace
> implementation from
> >> nexenta?
> >> If yes can you tell me what
> configuration is the best for you?
> >> I mean the fact you have a gluster
> cluster in the middle must impact the
> >> overral performance no?
> >>
> >>
> Fabrice
> >>
> >> -----Message d'origine-----
> >> De : Andreas Huser
> [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
> >> 2012 05:40 À :
> cloudstack-users@incubator.apache.org
> >> Objet : Re: Primary
> Storage
> >>
> >> Hi,
> >>
> >> for Cloudstack i use Solaris 11 ZFS + GlusterFS
> over Infiniband (RDMA).
> >> That
> >> gives the best performance and most
> scalable Storage.
> >> I have tasted some different solutions for primary
> Storage but the most are
> >> to expensive and for a CloudStack Cluster
> not economic or have a poor
> >> performance.
> >>
> >> My Configuration:
> >>
> Storage Node:
> >> Supermicro Server (Intel Hardware) with Solaris 11 with
> SSD write and read
> >> cache (read crucial-m4, write ZeusIOPS) GlusterFS
> and dualport ConnectX
> >> 40Gbit/s Infiniband adapter.
> >>
> >> I have
> installed GlusterFS direct on Solaris with a modified code.
> >> Want you
> build bigger systems for more then 50 VMs it is better you split
> >> the
> Solaris and GlusterFS with a separte headnode for GlusterFS
> >>
> >> That
> looks like:
> >> Solaris ZFS Backendstorage with a dataset Volume (Thin
> Provision) --> ( SRP
> >> Target attached direct without Infiniband switch
> to GF Node) --> GlusterFS
> >> Node the srp target formatted with xfs
> filesystem, create a GlusterFS
> >> Volume --> ( Infiniband over a
> Mellanox Port Switch) --> Cloudstack Node
> >> mount glusterFS Volume over
> RDMA
> >>
> >> For the Dataset Volume at the ZFS Storage, disable atime and
> enable
> >> compression.
> >> (Space reclaim) With compression you can
> shrink the ZFS Volume with command
> >> at Linux dd /dev/zero or In a
> Windows VM with sdelete That gives you space
> >> left on the Primary
> Storage for deleted Files in a VM or for deleted vhd's
> >> or vm's in the
> cloudstack
> >>
> >> greeting Andreas
> >>
> >>
> >>
> >>
> >> Mit freundlichen
> Grüßen
> >>
> >> Andreas Huser
> >> Geschäftsführer
> >> System Engineer /
> Consultant
> >> (Cisco CSE, SMBAM, LCSE, ASAM)
> >>
> ---------------------------------------
> >> Zellerstraße 28 - 77654
> Offenburg
> >> Tel: +49(781) 12786898
> >> Mobil: +49(176) 10308549
> >>
> ahuser@7five-edv.de
> >>
> >>
> >>
> >>
> >> ----- Ursprüngliche Mail -----
> >>
> >>
> Von: "Outback Dingo"
> >> An: cloudstack-users@incubator.apache.org
> >>
> Gesendet: Dienstag, 23. Oktober 2012 02:15:16
> >> Betreff: Re: Primary
> Storage
> >>
> >> On Mon, Oct 22, 2012 at 8:09 PM, Ivan
> Rodriguezwrote:
> >>
> >>
>
> >>> Solaris 11 ZFS and yes we tried different
> setups, raids levels number >> of SSD cache, ARC zfs options etc etc
> etc. >> >> Cheers >> >> >> VMWare ?? >>
>
> >>
> >>
> >>
> >>
>

Re: Primary Storage

Posted by br...@v365.com.au.
  

Hi

With our SCST RDMA setup we have 40Gb/s QDR Gen 2, with Mellanox Grid
Director 4036 switches + HP Gen 2 QDR chassis switches.

Our SAN achieves 295,000 - 350,000 IOPS max per LSI controller, depending
on the block size of the IOmeter test: LSI 9280-8i with battery-backed
cache + CacheCade 2.0, x 2, in PCIe x16 slots on a Supermicro
motherboard.

1 x 250GB SSD for CacheCade + 16 x 1TB WD VR drives (16, minus hot
standby, minus 2 for RAID 6 parity) = 2600MB/s.

Most of our tests are from 4K to 128K block size; @ 4K we get 295,000
IOPS @ 1590 MB/s, and @ 128K, 350,000 IOPS @ 2400~2600MB/s.
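
Those figures are from IOmeter on the Windows side; on Linux a roughly
equivalent run against the SRP-attached LUN looks like this with fio (the
device path, queue depth and job count are illustrative, not our exact
profile):

# 4K random read test; swap --bs=4k for --bs=128k to compare with the 128K numbers
fio --name=4k-randread --filename=/dev/sdb --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=64 --numjobs=8 --runtime=60 \
    --time_based --group_reporting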

We have tuned 64K and 128K block sizes on different LUNs: 64K for database
and 128K for general file.
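
That tuning happens when the virtual drive is created -- the 64K/128K
figure here is the controller strip size (it could equally be the NTFS
allocation unit chosen at format time; the MegaCLI spelling and the
enclosure:slot list below are assumptions):

# RAID 6 virtual drive with a 64K strip for the database LUN, write-back cache
MegaCli64 -CfgLdAdd -r6 [252:0,252:1,252:2,252:3,252:4,252:5] WB RA Direct -strpsz64 -a0
# create a second virtual drive with -strpsz128 for the general file LUN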

The best thing about Infiniband is low CPU cycles: only 5~6% during these
tests.

Latency is as low as 4-6ms average read time. Concurrent response times
are from 5-12ms; even under heavy load it's below 20ms.

Infiniband latency is below 0.01 us, which is why we chose it.
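
For anyone who wants to verify that, the perftest tools that ship with
OFED measure the raw link latency; a minimal sketch (the IPoIB address is
just an example taken from earlier in this thread):

# on the target/server node
ib_send_lat

# on the initiator/client node, pointing at the server
ib_send_lat 10.200.0.10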

We run MSSQL on the 64K-formatted LUNs; it's massively fast.

If we copy the same data twice, the CacheCade kicks in and we achieve even
better speeds.

I've compared this to block I/O @ 8Gb/s Fibre Channel, and it barely gets
120,000 IOPS, at much higher latency and with less bandwidth.

So for our money RDMA wins!

Bruce

On 23.10.2012 21:20,
Jason Davis wrote:
> Out of curiosity, is there any quick performance
numbers for these ZFS +
> GlusterFS mashups you guys are talking
about?
> 
> Specifically, IOPs and latency? Sequential read/write
performance honestly
> isn't a very good benchmark to determine your
SANs performance. It's like
> comparing CPUs based solely on how many
GHz it runs at. Sure you can get
> great MB or GB/s with SATA disk but
I'd reckon that IOP performance is
> abismal. If you are utilizing
GlusterFS without the cache pooling magic
> that is ZFS then I would
imagine that latency can be an issue.
> 
> 
> 
> On Tue, Oct 23, 2012 at
7:56 AM, Andreas Huser wrote:
> 
>> Hi Fabrice,
>>
>> i know
OpenSolaris/Solaris Oracle it's so a thing.
>> I'm for more then 10
years a open source user and that
>> with oracle - i did no like at the
beginning of this constallation.
>> But Oracle makes his work good i
know that. The cost of one socket
>> are 700$ and you can use so much
quantity of TB as you will.
>> And you can use the full premier Support
from Oracle.
>> Nexenta develop with the Illumos code. And the Licence
are TB based.
>> That is not my favorite. As well the pool version from
Nexenta comes
>> not after. Current Nexenta Infiniband are not a usable
solution.
>> But every can use what he will. Everyone must decide for
themselves.
>>
>> SRP Targets or iser are not difficult to configure.
Use the SRP for
>> the Storage unit connection. Solaris and GlusterFS
builds one Storage unit.
>> The GlusterFS Server export the final Volume
to the Clients as well KVM,
>> VMWare, Hyper-V etc.
>> You can use
native GlusterFS, RDMA, NFS ore CIFS to export the Volume.
>> SRP have
nothing to do with VmWare.
>>
>> When you are use a 7200 SAS drive the
access time are the same as a
>> SATA drive only the quality of the
hardware are better. When you need
>> Performance you must use SAS
drives with 15000U/m. But it's not needed
>> when you install SSD for
ZIL/L2ARC. ZeusRAM rocks :-)
>>
>> I use dedup only at secondary stroage
or on Backupserver not on Primary
>> Storage.
>> When you use SSD SATA
drives then you have an cheap an fast strorage.
>> 1TB drive cost unter
100$. Current i'm not need to save storage volume.
>>
>> Which
application use atime? I know "find -atime N". atime on a Storage
>>
that stores only Virtual Disks?
>> I don't need the information when i
have last opening the disk :-)
>>
>> For a Solaris GUI take a look at
napp-it
>> http://www.napp-it.org/
>>
>> greetings from germany
>>
Andreas
>>
>>
>>
>>
>> ----- Ursprüngliche Mail -----
>>
>> Von:
"Fabrice Brazier">> An: cloudstack-users@incubator.apache.org
>>
Gesendet: Dienstag, 23. Oktober 2012 12:30:50
>> Betreff: RE: Primary
Storage
>>
>> Hi Andreas,
>>
>> Hum that's pretty cool, i know they have
still trouble with infiniband on
>> nexenta but it's clearly a priority
on their roadmap (plus i trust more the
>> nexenta team to evolve zfs
than oracle).
>>
>> I agree iscsi over infiniband increase the latency
but most of the time
>> it's
>> just more simple to use IPOIB than IB,
for example if you use vmware the
>> support of infiniband it's assured
by mellanox and not vmware so if you
>> have
>> an issue the vmware
support probably don't help you anymore.
>>
>> About the raid i'm a fan
of raid 10, i prefer build a raid 10 with 7200 sas
>> drive than
multiple vdev raidz/raidz2 15k sas drive. Particularly for
>> virtual
environment with a random workload and multiple access.
>>
>> I'm a fan
of nfs so agreed about the zil, and zeus ram are the only one for
>>
that (with classical ssd you can hit the write hole).
>>
>> Agreed for
compression too (but only lzjb level gzip use too much cpu),
>> disable
atime permit to decrease the iops load but i'm not sure is really a
>>
best practice. About the deduplication I don't totally agree. It
really
>> depends of your array, the workload and the VM type on your
cloudstack.
>> Actually when i build a zfs array i count 1Gb ram for 1Tb
disks. With
>> deduplication i count 2Gb ram for 1 Tb disks (plus in
nexenta 4 they will
>> add a possibility to use ssd for the
deduplication metada, like the l2arc
>> for the read cache).
>>
>> The
last point it's about your drives, why sata doesn't
>> respect their latency, and the reliability of a sas drive is 10x the sata.
>> Plus now you can found many NL-sas drive at low cost
>>
>> But it's really a nice architecture, i never try glusterfs (for the moment)
>> plus in that case it's really a good way to have a replacement to a
>> metro-cluster for free, i try one of is competitor (onefs) and clearly
>> clustered filesystem are the futur.
>>
>> Cheers,
>> Fabrice
>>
>> -----Message d'origine-----
>> De : Andreas Huser [mailto:ahuser@7five-edv.de]
>> Envoyé : mardi 23 octobre 2012 11:37
>> À : cloudstack-users@incubator.apache.org
>> Objet : Re: Primary Storage
>>
>> Hi Fabrice,
>>
>> I don't know what other people do but i have no problems with Infiniband +
>> GlusterFS + Cloudstack I'm not use Nexenta it's based on Illumos and work
>> not fine with Infiniband.
>> I have two different clusters in productiv enviroments.
>>
>> The first: Solaris 11 with built-in GlusterFS 3.3 export Gluster Vol. with
>> RDMA. > performance is okey you can use that for smaller enviroments The
>> Second: Is a little bit complex with a GlusterFS Server in the middle
>>
>> ZFS Server: based on Solaris 11
>> 1.) Create a zpool min. two vdevs and SSD read/write cache
>> 2.) Create a thin provisioning Volume "zfs create -V" disable atime and
>> enable compression (do not enable dedup!) and export as (iWarp) SRP Target
>> to the GlusterFS Server use a direct connection without IB Switch.
>>
>> GlusterFS Server:
>> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
>> 2.) Use OFED driver from https://www.openfabrics.org
>> 3.) Import the SRP Target from ZFS Server and format as xfs
>> 4.) Create a Glustervolume "volume create xy transport rdma " (use only
>> rdma)
>> 5.) Connect with the second IB Port to a IB Switch
>>
>> Cloudstack Hypervisor Node:
>> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
>> 2.) Use OFED driver from https://www.openfabrics.org
>> 3.) Import the Glustervolume
>>
>> ZFS Thin Volume ---- Infinband SRP ----> GlusterFS ---- GFSVol rdma ---->
>> IB Switch ----> Clients
>>
>> The ZFS and GlusterFS Server form a Storage unit connect direct with 40Gbit
>> Infiniband Point-to-Point You do not feel that is a cable between!
>>
>> Important: When you have Infiniband use not IPoIB with iSCSI! If one
>> already has Infiniband then you should also use the advantage.
>> IPoIB have a higher Latency as iWarp SRP!
>>
>>
>> SRP = usec
>>
>> -- SRP --
>> local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr
>> 0x00000001dda000 remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey
>> 0x1c0f115 VAddr 0x000000088e6000
>> ------------------------------------------------------------------
>> #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
>> 2 1000 1.29 125.30 1.31
>> ------------------------------------------------------------------
>>
>> -- IPoIB ---
>> [root@sv01sfogaa ~]# ping 10.200.0.10
>> PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
>> 64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
>> 64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
>>
>> When you put load on IPoIB then increases the latency that is not good
>>
>>
>> That is my recommendation for a simple GlusterFS Mirror:
>>
>> - Supermicro Server with Intel Hardware and Expander Backplane
>> - 1x Crucal-M4 SSD Read Cache
>> - 2x ZeusIOPs SSD Write cache (mirror)
>> - SATA 24/7 Harddrives
>> - LSI HBA 9207 or 9211
>> - ConnectX-2 QDR Dualport Infiniband Adaper (HP refurbished with full
>> warrany for 100$) Importent > Flash newest Firmware from Mellanox!
>> - Mellanox IB Swich
>> - Solaris 11
>> - GlusterFS 3.3 compiled with ib_verbs
>> - Gluster Volume transport only rdma
>>
>>> Throughput constant up to 200 Mbyte/s
>>> more throughput with more storage servers or more harddrives on jbod
>>
>> Info:
>>
>> - I have some problems with infiniband RDMA or SRP with OpenIndiana or
>> Illumos or Nexenta. Some adapters have a high latency or not a stable
>> connection. Use Solaris that's the right way!
>> - OpenIndiana is Beta! Infiniband ib_verbs works not or not fine!
>> - Use Solaris 11 Infiniband ib_verbs are native and stable
>> - Don't use Ubuntu Client Server for Infiniband! Use RedHat, Fedora or
>> CentOS and install the right drivers from
>> https://www.openfabrics.org/downloads/OFED/
>> - You have not SSD Cache? Disable at the ZFS Volume the sync! Important!
>> You loose Security for your Data but some protocols use sync flags in
>> transport.
>> Example NFS use by default fsync. write cache are not active. NFS writes
>> data direct to the Harddrive. For Data Security and Performance give the
>> Storage Server a SSD write cache. ZFS works at default with sync=standard
>> that prevent write holes. (COW System)
>>
>> I hope that I could help a little
>>
>> Greeting from Germany
>> Andreas
>>
>>
>>
>> ----- Ursprüngliche Mail -----
>>
>> Von: "Fabrice Brazier"
>> An: cloudstack-users@incubator.apache.org
>> Gesendet: Dienstag, 23. Oktober 2012 09:55:15
>> Betreff: RE: Primary Storage
>>
>> Hi Andreas,
>>
>> Hello i just see your configuration, it seems quite interesting.
>> If i understand well you want to build some zfs array on the backend.
>> Export luns (probably by iscsi over infiniband) to you linux cluster, and
>> on the linux cluster you put glusterFS.
>> I can understand the point, with that you can have very good performance
>> and reliability (zfs), scalability and redundancy (gluster) for very low
>> cost.
>> So just one question, did you try the global namespace implementation from
>> nexenta?
>> If yes can you tell me what configuration is the best for you?
>> I mean the fact you have a gluster cluster in the middle must impact the
>> overral performance no?
>>
>> Fabrice
>>
>> -----Message d'origine-----
>> De : Andreas Huser [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
>> 2012 05:40 À : cloudstack-users@incubator.apache.org
>> Objet : Re: Primary Storage
>>
>> Hi,
>>
>> for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA).
>> That gives the best performance and most scalable Storage.
>> I have tasted some different solutions for primary Storage but the most are
>> to expensive and for a CloudStack Cluster not economic or have a poor
>> performance.
>>
>> My Configuration:
>> Storage Node:
>> Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read
>> cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX
>> 40Gbit/s Infiniband adapter.
>>
>> I have installed GlusterFS direct on Solaris with a modified code.
>> Want you build bigger systems for more then 50 VMs it is better you split
>> the Solaris and GlusterFS with a separte headnode for GlusterFS
>>
>> That looks like:
>> Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) --> ( SRP
>> Target attached direct without Infiniband switch to GF Node) --> GlusterFS
>> Node the srp target formatted with xfs filesystem, create a GlusterFS
>> Volume --> ( Infiniband over a Mellanox Port Switch) --> Cloudstack Node
>> mount glusterFS Volume over RDMA
>>
>> For the Dataset Volume at the ZFS Storage, disable atime and enable
>> compression.
>> (Space reclaim) With compression you can shrink the ZFS Volume with command
>> at Linux dd /dev/zero or In a Windows VM with sdelete That gives you space
>> left on the Primary Storage for deleted Files in a VM or for deleted vhd's
>> or vm's in the cloudstack
>>
>> greeting Andreas
>>
>>
>>
>>
>> Mit freundlichen Grüßen
>>
>> Andreas Huser
>> Geschäftsführer
>> System Engineer / Consultant
>> (Cisco CSE, SMBAM, LCSE, ASAM)
>> ---------------------------------------
>> Zellerstraße 28 - 77654 Offenburg
>> Tel: +49(781) 12786898
>> Mobil: +49(176) 10308549
>> ahuser@7five-edv.de
>>
>>
>>
>>
>> ----- Ursprüngliche Mail -----
>>
>> Von: "Outback Dingo"
>> An: cloudstack-users@incubator.apache.org
>> Gesendet: Dienstag, 23. Oktober 2012 02:15:16
>> Betreff: Re: Primary Storage
>>
>> On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
>>
>>> Solaris 11 ZFS and yes we tried different setups, raids levels number
>>> of SSD cache, ARC zfs options etc etc etc.
>>>
>>> Cheers
>>
>> VMWare ??
>>

Re: Primary Storage

Posted by Jason Davis <sc...@gmail.com>.
Out of curiosity, are there any quick performance numbers for these ZFS +
GlusterFS mashups you guys are talking about?

Specifically, IOPS and latency? Sequential read/write performance honestly
isn't a very good benchmark to determine your SAN's performance. It's like
comparing CPUs based solely on how many GHz they run at. Sure, you can get
great MB or GB/s with SATA disk, but I'd reckon that IOPS performance is
abysmal. If you are utilizing GlusterFS without the cache pooling magic
that is ZFS, then I would imagine that latency can be an issue.



On Tue, Oct 23, 2012 at 7:56 AM, Andreas Huser <ah...@7five-edv.de> wrote:

> Hi Fabrice,
>
> i know OpenSolaris/Solaris + Oracle it's so a thing.
> I'm for more then 10 years a open source user and that
> with oracle - i did no like at the beginning of this constallation.
> But Oracle makes his work good i know that. The cost of one socket
> are 700$ and you can use so much quantity of TB  as you will.
> And you can use the full premier Support from Oracle.
> Nexenta develop with the Illumos code. And the Licence are TB based.
> That is not my favorite. As well the pool version from Nexenta comes
> not after. Current Nexenta + Infiniband are not a usable solution.
> But every can use what he will. Everyone must decide for themselves.
>
> SRP Targets or iser are not difficult to configure. Use the SRP for
> the Storage unit connection. Solaris and GlusterFS builds one Storage unit.
> The GlusterFS Server export the final Volume to the Clients as well KVM,
>  VMWare, Hyper-V etc.
> You can use native GlusterFS, RDMA, NFS ore CIFS to export the Volume.
> SRP have nothing to do with VmWare.
>
> When you are use a 7200 SAS drive the access time are the same as a
> SATA drive only the quality of the hardware are better. When you need
> Performance you must use SAS drives with 15000U/m. But it's not needed
> when you install SSD for ZIL/L2ARC.  ZeusRAM rocks :-)
>
> I use dedup only at secondary stroage or on Backupserver not on Primary
> Storage.
> When you use SSD + SATA drives then you have a cheap and fast storage.
> 1TB drive cost unter 100$. Current i'm not need to save storage volume.
>
> Which application use atime? I know "find -atime N". atime on a Storage
> that stores only Virtual Disks?
> I don't need the information when i have last opening the disk :-)
>
> For a Solaris GUI take a look at napp-it
> http://www.napp-it.org/
>
> greetings from germany
> Andreas
>
>
>
>
> ----- Ursprüngliche Mail -----
>
> Von: "Fabrice Brazier" <fa...@apalia.net>
> An: cloudstack-users@incubator.apache.org
> Gesendet: Dienstag, 23. Oktober 2012 12:30:50
> Betreff: RE: Primary Storage
>
> Hi Andreas,
>
> Hum that's pretty cool, i know they have still trouble with infiniband on
> nexenta but it's clearly a priority on their roadmap (plus i trust more the
> nexenta team to evolve zfs than oracle).
>
> I agree iscsi over infiniband increase the latency but most of the time
> it's
> just more simple to use IPOIB than IB, for example if you use vmware the
> support of infiniband it's assured by mellanox and not vmware so if you
> have
> an issue the vmware support probably don't help you anymore.
>
> About the raid i'm a fan of raid 10, i prefer build a raid 10 with 7200 sas
> drive than multiple vdev raidz/raidz2 15k sas drive. Particularly for
> virtual environment with a random workload and multiple access.
>
> I'm a fan of nfs so agreed about the zil, and zeus ram are the only one for
> that (with classical ssd you can hit the write hole).
>
> Agreed for compression too (but only lzjb level gzip use too much cpu),
> disable atime permit to decrease the iops load but i'm not sure is really a
> best practice. About the deduplication I don't totally agree. It really
> depends of your array, the workload and the VM type on your cloudstack.
> Actually when i build a zfs array i count 1Gb ram for 1Tb disks. With
> deduplication i count 2Gb ram for 1 Tb disks (plus in nexenta 4 they will
> add a possibility to use ssd for the deduplication metada, like the l2arc
> for the read cache).
>
> The last point it's about your drives, why sata drives? I mean sata doesn't
> respect their latency, and the reliability of a sas drive is 10x the sata.
> Plus now you can found many NL-sas drive at low cost
>
> But it's really a nice architecture, i never try glusterfs (for the moment)
> plus in that case it's really a good way to have a replacement to a
> metro-cluster for free, i try one of is competitor (onefs) and clearly
> clustered filesystem are the futur.
>
> Cheers,
> Fabrice
>
> -----Message d'origine-----
> De : Andreas Huser [mailto:ahuser@7five-edv.de]
> Envoyé : mardi 23 octobre 2012 11:37
> À : cloudstack-users@incubator.apache.org
> Objet : Re: Primary Storage
>
> Hi Fabrice,
>
> I don't know what other people do but i have no problems with Infiniband +
> GlusterFS + Cloudstack I'm not use Nexenta it's based on Illumos and work
> not fine with Infiniband.
> I have two different clusters in productiv enviroments.
>
> The first: Solaris 11 with built-in GlusterFS 3.3 export Gluster Vol. with
> RDMA. > performance is okey you can use that for smaller enviroments The
> Second: Is a little bit complex with a GlusterFS Server in the middle
>
> ZFS Server: based on Solaris 11
> 1.) Create a zpool min. two vdevs and SSD read/write cache
> 2.) Create a thin provisioning Volume "zfs create -V" disable atime and
> enable compression (do not enable dedup!) and export as (iWarp) SRP Target
> to the GlusterFS Server use a direct connection without IB Switch.
>
> GlusterFS Server:
> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
> 2.) Use OFED driver from https://www.openfabrics.org
> 3.) Import the SRP Target from ZFS Server and format as xfs
> 4.) Create a Glustervolume "volume create xy transport rdma " (use only
> rdma)
> 5.) Connect with the second IB Port to a IB Switch
>
> Cloudstack Hypervisor Node:
> 1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
> 2.) Use OFED driver from https://www.openfabrics.org
> 3.) Import the Glustervolume
>
> ZFS Thin Volume ---- Infinband SRP ----> GlusterFS ---- GFSVol rdma ---->
> IB Switch ----> Clients
>
> The ZFS and GlusterFS Server form a Storage unit connect direct with 40Gbit
> Infiniband Point-to-Point You do not feel that is a cable between!
>
> Important: When you have Infiniband use not IPoIB with iSCSI! If one
> already
> has Infiniband then you should also use the advantage.
> IPoIB have a higher Latency as iWarp SRP!
>
>
> SRP = usec
>
> -- SRP --
> local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr
> 0x00000001dda000 remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey
> 0x1c0f115 VAddr 0x000000088e6000
> ------------------------------------------------------------------
> #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
> 2 1000 1.29 125.30 1.31
> ------------------------------------------------------------------
>
> -- IPoIB ---
> [root@sv01sfogaa ~]# ping 10.200.0.10
> PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
> 64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
> 64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms
>
> When you put load on IPoIB then increases the latency that is not good
>
>
> That is my recommendation for a simple GlusterFS Mirror:
>
> - Supermicro Server with Intel Hardware and Expander Backplane
> - 1x Crucal-M4 SSD Read Cache
> - 2x ZeusIOPs SSD Write cache (mirror)
> - SATA 24/7 Harddrives
> - LSI HBA 9207 or 9211
> - ConnectX-2 QDR Dualport Infiniband Adaper (HP refurbished with full
> warrany for 100$) Importent > Flash newest Firmware from Mellanox!
> - Mellanox IB Swich
> - Solaris 11
> - GlusterFS 3.3 compiled with ib_verbs
> - Gluster Volume transport only rdma
>
>
>
>
> Throughput constant up to 200 Mbyte/s
> more throughput with more storage servers or more harddrives on jbod
>
>
>
>
> Info:
>
> - I have some problems with infiniband RDMA or SRP with OpenIndiana or
> Illumos or Nexenta. Some adapters have a high latency or not a stable
> connection. Use Solaris that's the right way!
> - OpenIndiana is Beta! Infiniband ib_verbs works not or not fine!
> - Use Solaris 11 Infiniband ib_verbs are native and stable
> - Don't use Ubuntu Client Server for Infiniband! Use RedHat, Fedora or
> CentOS and install the right drivers from
> https://www.openfabrics.org/downloads/OFED/
> - You have not SSD Cache? Disable at the ZFS Volume the sync! Important!
> You
> loose Security for your Data but some protocols use sync flags in
> transport.
> Example NFS use by default fsync. write cache are not active. NFS writes
> data direct to the Harddrive. For Data Security and Performance give the
> Storage Server a SSD write cache. ZFS works at default with sync=standard
> that prevent write holes. (COW System)
>
> I hope that I could help a little
>
> Greeting from Germany
> Andreas
>
>
>
> ----- Ursprüngliche Mail -----
>
> Von: "Fabrice Brazier" <fa...@apalia.net>
> An: cloudstack-users@incubator.apache.org
> Gesendet: Dienstag, 23. Oktober 2012 09:55:15
> Betreff: RE: Primary Storage
>
> Hi Andreas,
>
> Hello i just see your configuration, it seems quite interesting.
> If i understand well you want to build some zfs array on the backend.
> Export luns (probably by iscsi over infiniband) to you linux cluster, and
> on
> the linux cluster you put glusterFS.
> I can understand the point, with that you can have very good performance
> and
> reliability (zfs), scalability and redundancy (gluster) for very low cost.
> So just one question, did you try the global namespace implementation from
> nexenta?
> If yes can you tell me what configuration is the best for you?
> I mean the fact you have a gluster cluster in the middle must impact the
> overral performance no?
>
> Fabrice
>
> -----Message d'origine-----
> De : Andreas Huser [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
> 2012 05:40 À : cloudstack-users@incubator.apache.org
> Objet : Re: Primary Storage
>
> Hi,
>
> for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA).
> That
> gives the best performance and most scalable Storage.
> I have tasted some different solutions for primary Storage but the most are
> to expensive and for a CloudStack Cluster not economic or have a poor
> performance.
>
> My Configuration:
> Storage Node:
> Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read
> cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX
> 40Gbit/s Infiniband adapter.
>
> I have installed GlusterFS direct on Solaris with a modified code.
> Want you build bigger systems for more then 50 VMs it is better you split
> the Solaris and GlusterFS with a separte headnode for GlusterFS
>
> That looks like:
> Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) --> ( SRP
> Target attached direct without Infiniband switch to GF Node) --> GlusterFS
> Node the srp target formatted with xfs filesystem, create a GlusterFS
> Volume --> ( Infiniband over a Mellanox Port Switch) --> Cloudstack Node
> mount glusterFS Volume over RDMA
>
> For the Dataset Volume at the ZFS Storage, disable atime and enable
> compression.
> (Space reclaim) With compression you can shrink the ZFS Volume with command
> at Linux dd /dev/zero or In a Windows VM with sdelete That gives you space
> left on the Primary Storage for deleted Files in a VM or for deleted vhd's
> or vm's in the cloudstack
>
> greeting Andreas
>
>
>
>
> Mit freundlichen Grüßen
>
> Andreas Huser
> Geschäftsführer
> System Engineer / Consultant
> (Cisco CSE, SMBAM, LCSE, ASAM)
> ---------------------------------------
> Zellerstraße 28 - 77654 Offenburg
> Tel: +49(781) 12786898
> Mobil: +49(176) 10308549
> ahuser@7five-edv.de
>
>
>
>
> ----- Ursprüngliche Mail -----
>
> Von: "Outback Dingo" <ou...@gmail.com>
> An: cloudstack-users@incubator.apache.org
> Gesendet: Dienstag, 23. Oktober 2012 02:15:16
> Betreff: Re: Primary Storage
>
> On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
>
> Solaris 11 ZFS and yes we tried different setups, raids levels number
> of SSD cache, ARC zfs options etc etc etc.
>
> Cheers
>
>
> VMWare ??
>
>
>
> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
> <ou...@gmail.com>wrote:
>
>
> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com>
> wrote:
> > We are using ZFS, with jbod, not in production yet exporting NFS to
> > cloudstack, I'm not really happy about the performance
> > but I think is related to the hardware itself rather than technology,
> > we
> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster
> > as
> > well, but we decided to move away from that path since gluster nfs is
> still
> > performing poorly, plus we would like to see cloudstack integrating the
> > gluster-fuse module, we haven't decided the final storage setup but at
> the
> > moment we had better results with ZFS.
> >
> >
>
> question is whos ZFS and have you "tweaked" the zfs / nfs config for
> performance
>
> >
> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com
> >wrote:
> >
> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
> >>
> >>> ZFS looks really interesting to me and I am leaning that way. I am
> >>> considering using FreeNAS, as people seem to be having good luck with
> >>> it. Can anyone weigh in here?
> >>>
> >>>
> >> My personal opinion, I think FreeNAS and OpenFiler have horrible,
> horrible
> >> User Interfaces - not very intuitive, and they both seem to be file
> servers
> >> with things like iSCSI targets tacked on as an afterthought.
> >>
> >> Nik
> >>
> >>
> >>> Trevor Francis
> >>> Partner
> >>> 46 Labs | The PeerEdge Cloud
> >>> http://www.46labs.com <http://www.46labs.com/> |
> http://www.peeredge.net
> >>> <http://www.peeredge.net/>
> >>> 405-362-0046 - Voice | 405-410-4980 - Cell
> >>> trevorgfrancis - Skype
> >>> trevor@46labs.com <ma...@46labs.com>
> >>> Solutions Provider for the Telecom Industry
> >>>
> >>> <http://www.twitter.com/**peeredge
> >>> <http://www.twitter.com/peeredge>><
> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
> >>> http://**www.facebook.com/PeerEdge
> >>> <http://www.facebook.com/PeerEdge>>
> >>>
> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
> >>>
> >>> ZFS would be an interesting setup as you can do the cache pools like
> you
> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is
> >>>> that
> >>>> they
> >>>> really don't scale out well if you are looking for something with a
> >>>> unified
> >>>> name space. I'll say however that ZFS is a battle hardened FS with
> tons
> >>>> of
> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things
> >>>> these
> >>>> smaller start up companies are hocking are just ZFS appliances.
> >>>>
> >>>> RBD looks interesting but I'm not sure if I would be willing to put
> >>>> production data on it, I'm not sure how performant it is IRL. From a
> >>>> purely technical perspective, it looks REALLY cool.
> >>>>
> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is
> another
> >>>> option although historically small/random IO has not been it's
> >>>> strong
> >>>> point.
> >>>>
> >>>> If you are ok spending money on software and want a scale out block
> >>>> storage
> >>>> then you might want to consider HP LeftHand's VSA product. I am
> >>>> personally
> >>>> partial to NFS plays:) I went the exact opposite approach and
> >>>> settled
> on
> >>>> Isilon for our primary storage for our CS deployment.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin
> >>>> <nik.martin@nfinausa.com
> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
> >>>>
> >>>> On 10/22/2012 10:16 AM, Trevor Francis wrote:
> >>>>>
> >>>>> We are looking at building a Primary Storage solution for an
> >>>>>> enterprise/carrier class application. However, we want to build it
> >>>>>> using
> >>>>>> a FOSS solution and not a commercial solution. Do you have a
> >>>>>> recommendation on platform?
> >>>>>>
> >>>>>>
> >>>>>> Trevor,
> >>>>>
> >>>>> I got EXCELLENT results builing a SAN from FOSS using:
> >>>>> OS: Centos
> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI
> MegaRAID
> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
> >>>>> caching
> >>>>> (alternately, look at FlashCache from Facebook)
> >>>>> intel 10GB dual port nics, one port for crossover, on port for up
> link
> >>>>> to
> >>>>> storage network
> >>>>>
> >>>>> DRBD for real time block replication to active-active
> >>>>> Pacemaker+corosync for HA Resource management
> >>>>> tgtd for iSCSI target
> >>>>>
> >>>>> If you want file backed storage, XFS is a very good filesystem on
> Linux
> >>>>> now.
> >>>>>
> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but
> that
> >>>>> setup gave me a VERY high performance SAN. The downside is it is
> >>>>> entirely
> >>>>> managed by CLI, no UI whatsoever.
> >>>>>
> >>>>>
> >>>>> Trevor Francis
> >>>>>> Partner
> >>>>>> 46 Labs | The PeerEdge Cloud
> >>>>>> http://www.46labs.com <http://www.46labs.com/> |
> >>>>>> http://www.peeredge.net
> >>>>>> <http://www.peeredge.net/>
> >>>>>>
> >>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell
> >>>>>> trevorgfrancis - Skype
> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
> trevor@46labs.com
> >>>>>> >
> >>>>>>
> >>>>>>
> >>>>>> Solutions Provider for the Telecom Industry
> >>>>>>
> >>>>>> <http://www.twitter.com/****peeredge<
> http://www.twitter.com/**peeredge><
> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
> >>><
> >>>>>> http://www.twitter.**com/**peeredge <
> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
> >>>>>> >><
> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge>
> >>>>>> <
> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
> >>><
> >>>>>> http://**www.facebook.com/**PeerEdge<
> http://www.facebook.com/PeerEdge><
> >>>>>> http://www.facebook.com/**PeerEdge <
> http://www.facebook.com/PeerEdge>
> >>>>>> >>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
>
>
>
>
>

Re: Primary Storage

Posted by Andreas Huser <ah...@7five-edv.de>.
Hi Fabrice,

i know OpenSolaris/Solaris + Oracle it's so a thing.
I'm for more then 10 years a open source user and that 
with oracle - i did no like at the beginning of this constallation.  
But Oracle makes his work good i know that. The cost of one socket 
are 700$ and you can use so much quantity of TB  as you will. 
And you can use the full premier Support from Oracle. 
Nexenta develop with the Illumos code. And the Licence are TB based. 
That is not my favorite. As well the pool version from Nexenta comes 
not after. Current Nexenta + Infiniband are not a usable solution.
But every can use what he will. Everyone must decide for themselves.

SRP Targets or iser are not difficult to configure. Use the SRP for 
the Storage unit connection. Solaris and GlusterFS builds one Storage unit. 
The GlusterFS Server export the final Volume to the Clients as well KVM,  VMWare, Hyper-V etc. 
You can use native GlusterFS, RDMA, NFS ore CIFS to export the Volume. 
SRP have nothing to do with VmWare.

When you are use a 7200 SAS drive the access time are the same as a 
SATA drive only the quality of the hardware are better. When you need 
Performance you must use SAS drives with 15000U/m. But it's not needed 
when you install SSD for ZIL/L2ARC.  ZeusRAM rocks :-)

I use dedup only at secondary stroage or on Backupserver not on Primary Storage.  
When you use SSD + SATA drives then you have a cheap and fast storage.
1TB drive cost unter 100$. Current i'm not need to save storage volume.

Which application use atime? I know "find -atime N". atime on a Storage 
that stores only Virtual Disks? 
I don't need the information when i have last opening the disk :-)

For a Solaris GUI take a look at napp-it
http://www.napp-it.org/

greetings from germany
Andreas




----- Ursprüngliche Mail ----- 

Von: "Fabrice Brazier" <fa...@apalia.net> 
An: cloudstack-users@incubator.apache.org 
Gesendet: Dienstag, 23. Oktober 2012 12:30:50 
Betreff: RE: Primary Storage 

Hi Andreas, 

Hum that's pretty cool, i know they have still trouble with infiniband on 
nexenta but it's clearly a priority on their roadmap (plus i trust more the 
nexenta team to evolve zfs than oracle). 

I agree iscsi over infiniband increase the latency but most of the time it's 
just more simple to use IPOIB than IB, for example if you use vmware the 
support of infiniband it's assured by mellanox and not vmware so if you have 
an issue the vmware support probably don't help you anymore. 

About the raid i'm a fan of raid 10, i prefer build a raid 10 with 7200 sas 
drive than multiple vdev raidz/raidz2 15k sas drive. Particularly for 
virtual environment with a random workload and multiple access. 

I'm a fan of nfs so agreed about the zil, and zeus ram are the only one for 
that (with classical ssd you can hit the write hole). 

Agreed for compression too (but only lzjb level gzip use too much cpu), 
disable atime permit to decrease the iops load but i'm not sure is really a 
best practice. About the deduplication I don't totally agree. It really 
depends of your array, the workload and the VM type on your cloudstack. 
Actually when i build a zfs array i count 1Gb ram for 1Tb disks. With 
deduplication i count 2Gb ram for 1 Tb disks (plus in nexenta 4 they will 
add a possibility to use ssd for the deduplication metada, like the l2arc 
for the read cache). 

The last point it's about your drives, why sata drives? I mean sata doesn't 
respect their latency, and the reliability of a sas drive is 10x the sata. 
Plus now you can found many NL-sas drive at low cost 

But it's really a nice architecture, i never try glusterfs (for the moment) 
plus in that case it's really a good way to have a replacement to a 
metro-cluster for free, i try one of is competitor (onefs) and clearly 
clustered filesystem are the futur. 

Cheers, 
Fabrice 

-----Message d'origine----- 
De : Andreas Huser [mailto:ahuser@7five-edv.de] 
Envoyé : mardi 23 octobre 2012 11:37 
À : cloudstack-users@incubator.apache.org 
Objet : Re: Primary Storage 

Hi Fabrice, 

I don't know what other people do but i have no problems with Infiniband + 
GlusterFS + Cloudstack I'm not use Nexenta it's based on Illumos and work 
not fine with Infiniband. 
I have two different clusters in productiv enviroments. 

The first: Solaris 11 with built-in GlusterFS 3.3 export Gluster Vol. with 
RDMA. > performance is okey you can use that for smaller enviroments The 
Second: Is a little bit complex with a GlusterFS Server in the middle 

ZFS Server: based on Solaris 11 
1.) Create a zpool min. two vdevs and SSD read/write cache 
2.) Create a thin provisioning Volume "zfs create -V" disable atime and 
enable compression (do not enable dedup!) and export as (iWarp) SRP Target 
to the GlusterFS Server use a direct connection without IB Switch. 

GlusterFS Server: 
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6) 
2.) Use OFED driver from https://www.openfabrics.org 
3.) Import the SRP Target from ZFS Server and format as xfs 
4.) Create a Glustervolume "volume create xy transport rdma " (use only 
rdma) 
5.) Connect with the second IB Port to a IB Switch 

Cloudstack Hypervisor Node: 
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6) 
2.) Use OFED driver from https://www.openfabrics.org 
3.) Import the Glustervolume 

ZFS Thin Volume ---- Infinband SRP ----> GlusterFS ---- GFSVol rdma ----> 
IB Switch ----> Clients 

The ZFS and GlusterFS Server form a Storage unit connect direct with 40Gbit 
Infiniband Point-to-Point You do not feel that is a cable between! 

Important: When you have Infiniband use not IPoIB with iSCSI! If one already 
has Infiniband then you should also use the advantage. 
IPoIB have a higher Latency as iWarp SRP! 


SRP = usec 

-- SRP -- 
local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 
0x00000001dda000 remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 
0x1c0f115 VAddr 0x000000088e6000 
------------------------------------------------------------------ 
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] 
2 1000 1.29 125.30 1.31 
------------------------------------------------------------------ 

-- IPoIB --- 
[root@sv01sfogaa ~]# ping 10.200.0.10 
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data. 
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms 
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms 

When you put load on IPoIB then increases the latency that is not good 


That is my recommendation for a simple GlusterFS Mirror: 

- Supermicro Server with Intel Hardware and Expander Backplane 
- 1x Crucal-M4 SSD Read Cache 
- 2x ZeusIOPs SSD Write cache (mirror) 
- SATA 24/7 Harddrives 
- LSI HBA 9207 or 9211 
- ConnectX-2 QDR Dualport Infiniband Adaper (HP refurbished with full 
warrany for 100$) Importent > Flash newest Firmware from Mellanox! 
- Mellanox IB Swich 
- Solaris 11 
- GlusterFS 3.3 compiled with ib_verbs 
- Gluster Volume transport only rdma 




Throughput constant up to 200 Mbyte/s 
more throughput with more storage servers or more harddrives on jbod 




Info: 

- I have some problems with infiniband RDMA or SRP with OpenIndiana or 
Illumos or Nexenta. Some adapters have a high latency or not a stable 
connection. Use Solaris that's the right way! 
- OpenIndiana is Beta! Infiniband ib_verbs works not or not fine! 
- Use Solaris 11 Infiniband ib_verbs are native and stable 
- Don't use Ubuntu Client Server for Infiniband! Use RedHat, Fedora or 
CentOS and install the right drivers from 
https://www.openfabrics.org/downloads/OFED/ 
- You have not SSD Cache? Disable at the ZFS Volume the sync! Important! You 
loose Security for your Data but some protocols use sync flags in transport. 
Example NFS use by default fsync. write cache are not active. NFS writes 
data direct to the Harddrive. For Data Security and Performance give the 
Storage Server a SSD write cache. ZFS works at default with sync=standard 
that prevent write holes. (COW System) 

I hope that I could help a little 

Greeting from Germany 
Andreas 



----- Ursprüngliche Mail ----- 

Von: "Fabrice Brazier" <fa...@apalia.net> 
An: cloudstack-users@incubator.apache.org 
Gesendet: Dienstag, 23. Oktober 2012 09:55:15 
Betreff: RE: Primary Storage 

Hi Andreas, 

Hello i just see your configuration, it seems quite interesting. 
If i understand well you want to build some zfs array on the backend. 
Export luns (probably by iscsi over infiniband) to you linux cluster, and on 
the linux cluster you put glusterFS. 
I can understand the point, with that you can have very good performance and 
reliability (zfs), scalability and redundancy (gluster) for very low cost. 
So just one question, did you try the global namespace implementation from 
nexenta? 
If yes can you tell me what configuration is the best for you? 
I mean the fact you have a gluster cluster in the middle must impact the 
overral performance no? 

Fabrice 

-----Message d'origine----- 
De : Andreas Huser [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre 
2012 05:40 À : cloudstack-users@incubator.apache.org 
Objet : Re: Primary Storage 

Hi, 

for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That 
gives the best performance and most scalable Storage. 
I have tasted some different solutions for primary Storage but the most are 
to expensive and for a CloudStack Cluster not economic or have a poor 
performance. 

My Configuration: 
Storage Node: 
Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read 
cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX 
40Gbit/s Infiniband adapter. 

I have installed GlusterFS direct on Solaris with a modified code. 
Want you build bigger systems for more then 50 VMs it is better you split 
the Solaris and GlusterFS with a separte headnode for GlusterFS 

That looks like: 
Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) --> ( SRP 
Target attached direct without Infiniband switch to GF Node) --> GlusterFS 
Node the srp target formatted with xfs filesystem, create a GlusterFS 
Volume --> ( Infiniband over a Mellanox Port Switch) --> Cloudstack Node 
mount glusterFS Volume over RDMA 

For the Dataset Volume at the ZFS Storage, disable atime and enable 
compression. 
(Space reclaim) With compression you can shrink the ZFS Volume with command 
at Linux dd /dev/zero or In a Windows VM with sdelete That gives you space 
left on the Primary Storage for deleted Files in a VM or for deleted vhd's 
or vm's in the cloudstack 

greeting Andreas 




Mit freundlichen Grüßen 

Andreas Huser 
Geschäftsführer 
System Engineer / Consultant 
(Cisco CSE, SMBAM, LCSE, ASAM) 
--------------------------------------- 
Zellerstraße 28 - 77654 Offenburg 
Tel: +49(781) 12786898 
Mobil: +49(176) 10308549 
ahuser@7five-edv.de 




----- Ursprüngliche Mail ----- 

Von: "Outback Dingo" <ou...@gmail.com> 
An: cloudstack-users@incubator.apache.org 
Gesendet: Dienstag, 23. Oktober 2012 02:15:16 
Betreff: Re: Primary Storage 

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote: 

Solaris 11 ZFS and yes we tried different setups, raids levels number 
of SSD cache, ARC zfs options etc etc etc. 

Cheers 


VMWare ?? 



On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo 
<ou...@gmail.com>wrote: 


On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> 
wrote: 
> We are using ZFS, with jbod, not in production yet exporting NFS to 
> cloudstack, I'm not really happy about the performance 
> but I think is related to the hardware itself rather than technology, 
> we 
> are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster 
> as 
> well, but we decided to move away from that path since gluster nfs is 
still 
> performing poorly, plus we would like to see cloudstack integrating the 
> gluster-fuse module, we haven't decided the final storage setup but at 
the 
> moment we had better results with ZFS. 
> 
> 

question is whos ZFS and have you "tweaked" the zfs / nfs config for 
performance 

> 
> On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com 
>wrote: 
> 
>> On 10/22/2012 05:49 PM, Trevor Francis wrote: 
>> 
>>> ZFS looks really interesting to me and I am leaning that way. I am 
>>> considering using FreeNAS, as people seem to be having good luck with 
>>> it. Can anyone weigh in here? 
>>> 
>>> 
>> My personal opinion, I think FreeNAS and OpenFiler have horrible, 
horrible 
>> User Interfaces - not very intuitive, and they both seem to be file 
servers 
>> with things like iSCSI targets tacked on as an afterthought. 
>> 
>> Nik 
>> 
>> 
>>> Trevor Francis 
>>> Partner 
>>> 46 Labs | The PeerEdge Cloud 
>>> http://www.46labs.com <http://www.46labs.com/> | 
http://www.peeredge.net 
>>> <http://www.peeredge.net/> 
>>> 405-362-0046 - Voice | 405-410-4980 - Cell 
>>> trevorgfrancis - Skype 
>>> trevor@46labs.com <ma...@46labs.com> 
>>> Solutions Provider for the Telecom Industry 
>>> 
>>> <http://www.twitter.com/**peeredge 
>>> <http://www.twitter.com/peeredge>>< 
>>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>>< 
>>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>>< 
>>> http://**www.facebook.com/PeerEdge 
>>> <http://www.facebook.com/PeerEdge>> 
>>> 
>>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote: 
>>> 
>>> ZFS would be an interesting setup as you can do the cache pools like 
you 
>>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is 
>>>> that 
>>>> they 
>>>> really don't scale out well if you are looking for something with a 
>>>> unified 
>>>> name space. I'll say however that ZFS is a battle hardened FS with 
tons 
>>>> of 
>>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things 
>>>> these 
>>>> smaller start up companies are hocking are just ZFS appliances. 
>>>> 
>>>> RBD looks interesting but I'm not sure if I would be willing to put 
>>>> production data on it, I'm not sure how performant it is IRL. From a 
>>>> purely technical perspective, it looks REALLY cool. 
>>>> 
>>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is 
another 
>>>> option although historically small/random IO has not been it's 
>>>> strong 
>>>> point. 
>>>> 
>>>> If you are ok spending money on software and want a scale out block 
>>>> storage 
>>>> then you might want to consider HP LeftHand's VSA product. I am 
>>>> personally 
>>>> partial to NFS plays:) I went the exact opposite approach and 
>>>> settled 
on 
>>>> Isilon for our primary storage for our CS deployment. 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin 
>>>> <nik.martin@nfinausa.com 
>>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote: 
>>>> 
>>>> On 10/22/2012 10:16 AM, Trevor Francis wrote: 
>>>>> 
>>>>> We are looking at building a Primary Storage solution for an 
>>>>>> enterprise/carrier class application. However, we want to build it 
>>>>>> using 
>>>>>> a FOSS solution and not a commercial solution. Do you have a 
>>>>>> recommendation on platform? 
>>>>>> 
>>>>>> 
>>>>>> Trevor, 
>>>>> 
>>>>> I got EXCELLENT results builing a SAN from FOSS using: 
>>>>> OS: Centos 
>>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI 
MegaRAID 
>>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD 
>>>>> caching 
>>>>> (alternately, look at FlashCache from Facebook) 
>>>>> intel 10GB dual port nics, one port for crossover, on port for up 
link 
>>>>> to 
>>>>> storage network 
>>>>> 
>>>>> DRBD for real time block replication to active-active 
>>>>> Pacemaker+corosync for HA Resource management 
>>>>> tgtd for iSCSI target 
>>>>> 
>>>>> If you want file backed storage, XFS is a very good filesystem on 
Linux 
>>>>> now. 
>>>>> 
>>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but 
that 
>>>>> setup gave me a VERY high performance SAN. The downside is it is 
>>>>> entirely 
>>>>> managed by CLI, no UI whatsoever. 
>>>>> 
>>>>> 
>>>>> Trevor Francis 
>>>>>> Partner 
>>>>>> 46 Labs | The PeerEdge Cloud 
>>>>>> http://www.46labs.com <http://www.46labs.com/> | 
>>>>>> http://www.peeredge.net 
>>>>>> <http://www.peeredge.net/> 
>>>>>> 
>>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell 
>>>>>> trevorgfrancis - Skype 
>>>>>> trevor@46labs.com <ma...@46labs.com> <mailto: 
trevor@46labs.com 
>>>>>> > 
>>>>>> 
>>>>>> 
>>>>>> Solutions Provider for the Telecom Industry 
>>>>>> 
>>>>>> <http://www.twitter.com/****peeredge< 
http://www.twitter.com/**peeredge>< 
>>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge 
>>>< 
>>>>>> http://www.twitter.**com/**peeredge < 
http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge> 
>>>>>> >>< 
>>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> 
>>>>>> < 
>>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge 
>>>< 
>>>>>> http://**www.facebook.com/**PeerEdge< 
http://www.facebook.com/PeerEdge>< 
>>>>>> http://www.facebook.com/**PeerEdge < 
http://www.facebook.com/PeerEdge> 
>>>>>> >> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 





RE: Primary Storage

Posted by Fabrice Brazier <fa...@apalia.net>.
Hi Andreas,

Hum that's pretty cool, i know they have still trouble with infiniband on
nexenta but it's clearly a priority on their roadmap (plus i trust more the
nexenta team to evolve zfs than oracle).

I agree iscsi over infiniband increase the latency but most of the time it's
just more simple to use IPOIB than IB, for example if you use vmware the
support of infiniband it's assured by mellanox and not vmware so if you have
an issue the vmware support probably don't help you anymore.

About the raid i'm a fan of raid 10, i prefer build a raid 10 with 7200 sas
drive than multiple vdev raidz/raidz2 15k sas drive. Particularly for
virtual environment with a random workload and multiple access.

I'm a fan of nfs so agreed about the zil, and zeus ram are the only one for
that (with classical ssd you can hit the write hole).

Agreed for compression too (but only lzjb level gzip use too much cpu),
disable atime permit to decrease the iops load but i'm not sure is really a
best practice. About the deduplication I don't totally agree. It really
depends of your array, the workload and the VM type on your cloudstack.
Actually when i build a zfs array i count 1Gb ram for 1Tb disks. With
deduplication i count 2Gb ram for 1 Tb disks (plus in nexenta 4 they will
add a possibility to use ssd for the deduplication metada, like the l2arc
for the read cache).
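To make that rule of thumb concrete, a quick worked example (the pool size is made up for illustration, it is not a figure from this thread):

  24 TB of disks x 1 GB RAM per TB = ~24 GB RAM without deduplication
  24 TB of disks x 2 GB RAM per TB = ~48 GB RAM with deduplication enabled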

The last point it's about your drives, why sata drives? I mean sata doesn't
respect their latency, and the reliability of a sas drive is 10x the sata.
Plus now you can found many NL-sas drive at low cost

But it's really a nice architecture, i never try glusterfs (for the moment)
plus in that case it's really a good way to have a replacement to a
metro-cluster for free, i try one of is competitor (onefs) and clearly
clustered filesystem are the futur.

Cheers,
Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de]
Envoyé : mardi 23 octobre 2012 11:37
À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi Fabrice,

I don't know  what other people do but i have no problems with Infiniband +
GlusterFS + Cloudstack I'm  not use  Nexenta it's based on Illumos and work
not fine with Infiniband.
I have two different clusters in productiv enviroments.

The first: Solaris 11  with built-in GlusterFS 3.3 export Gluster Vol. with
RDMA. > performance is okey you can use that for smaller enviroments The
Second: Is a little bit complex with a GlusterFS Server in the middle

ZFS Server: based on Solaris 11
1.) Create a zpool min. two vdevs and SSD read/write cache
2.) Create a thin provisioning Volume "zfs create -V" disable atime and
enable compression (do not enable dedup!) and export as (iWarp) SRP Target
to the GlusterFS Server use a direct connection without IB Switch.

GlusterFS Server:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use OFED driver from https://www.openfabrics.org
3.) Import the SRP Target from ZFS Server and format as xfs
4.) Create a Glustervolume "volume create xy transport rdma " (use only
rdma)
5.) Connect with the second IB Port to a IB Switch

Cloudstack Hypervisor Node:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use OFED driver from https://www.openfabrics.org
3.) Import the Glustervolume

ZFS Thin Volume ---- Infinband SRP  ----> GlusterFS ---- GFSVol rdma ---->
IB Switch ----> Clients

The ZFS and GlusterFS Server form a Storage unit connect direct with 40Gbit
Infiniband Point-to-Point You do not feel that is a cable between!

Important: When you have Infiniband use not IPoIB with iSCSI! If one already
has Infiniband then you should also use the advantage.
IPoIB have a higher Latency as iWarp SRP!


SRP = usec

-- SRP --
local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr
0x00000001dda000 remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey
0x1c0f115 VAddr 0x000000088e6000
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000          1.29           125.30       1.31
------------------------------------------------------------------

-- IPoIB ---
[root@sv01sfogaa ~]# ping 10.200.0.10
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms

When you put load on IPoIB then increases the latency that is not good


That is my recommendation for a simple GlusterFS Mirror:

- Supermicro Server with Intel Hardware and Expander Backplane
- 1x Crucal-M4 SSD Read Cache
- 2x ZeusIOPs SSD Write cache (mirror)
- SATA 24/7 Harddrives
- LSI HBA 9207 or 9211
- ConnectX-2 QDR Dualport Infiniband Adaper (HP refurbished with full
warrany for 100$) Importent > Flash newest Firmware from Mellanox!
- Mellanox IB Swich
- Solaris 11
- GlusterFS 3.3 compiled with ib_verbs
- Gluster Volume transport only rdma

>> Throughput constant up to 200 Mbyte/s
more throughput with more storage servers or more harddrives on jbod

Info:

- I have some problems with infiniband RDMA or SRP with OpenIndiana or
Illumos or Nexenta. Some adapters have a high latency or not a stable
connection. Use Solaris that's the right way!
- OpenIndiana is Beta! Infiniband ib_verbs works not or not fine!
- Use Solaris 11 Infiniband ib_verbs are native and stable
- Don't use Ubuntu Client Server for Infiniband! Use RedHat, Fedora or
CentOS and install the right drivers from
https://www.openfabrics.org/downloads/OFED/
- You have not SSD Cache? Disable at the ZFS Volume the sync! Important! You
loose Security for your Data but some protocols use sync flags in transport.
Example NFS use by default fsync. write cache are not active. NFS writes
data direct to the Harddrive. For Data Security and Performance give the
Storage Server a SSD write cache. ZFS works at default with sync=standard
that prevent write holes. (COW System)

I hope that I could help a little

Greeting from Germany
Andreas



----- Ursprüngliche Mail -----

Von: "Fabrice Brazier" <fa...@apalia.net>
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 09:55:15
Betreff: RE: Primary Storage

Hi Andreas,

Hello i just see your configuration, it seems quite interesting.
If i understand well you want to build some zfs array on the backend.
Export luns (probably by iscsi over infiniband) to you linux cluster, and on
the linux cluster you put glusterFS.
I can understand the point, with that you can have very good performance and
reliability (zfs), scalability and redundancy (gluster) for very low cost.
So just one question, did you try the global namespace implementation from
nexenta?
If yes can you tell me what configuration is the best for you?
I mean the fact you have a gluster cluster in the middle must impact the
overral performance no?

Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de] Envoyé : mardi 23 octobre
2012 05:40 À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi,

for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That
gives the best performance and most scalable Storage.
I have tasted some different solutions for primary Storage but the most are
to expensive and for a CloudStack Cluster not economic or have a poor
performance.

My Configuration:
Storage Node:
Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read
cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX
40Gbit/s Infiniband adapter.

I have installed GlusterFS direct on Solaris with a modified code.
Want you build bigger systems for more then 50 VMs it is better you split
the Solaris and GlusterFS with a separte headnode for GlusterFS

That looks like:
Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) --> ( SRP
Target attached direct without Infiniband switch to GF Node) --> GlusterFS
Node the srp target formatted with xfs filesystem, create a GlusterFS
Volume --> ( Infiniband over a Mellanox Port Switch) --> Cloudstack Node
mount glusterFS Volume over RDMA

For the Dataset Volume at the ZFS Storage, disable atime and enable
compression.
(Space reclaim) With compression you can shrink the ZFS Volume with command
at Linux dd /dev/zero or In a Windows VM with sdelete That gives you space
left on the Primary Storage for deleted Files in a VM or for deleted vhd's
or vm's in the cloudstack

greeting Andreas




Mit freundlichen Grüßen

Andreas Huser
Geschäftsführer
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel: +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de




----- Ursprüngliche Mail -----

Von: "Outback Dingo" <ou...@gmail.com>
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 02:15:16
Betreff: Re: Primary Storage

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
> Solaris 11 ZFS and yes we tried different setups, raids levels number
> of SSD cache, ARC zfs options etc etc etc.
>
> Cheers
>

VMWare ??

> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
> <ou...@gmail.com>wrote:
>
>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com>
>> wrote:
>> > We are using ZFS, with jbod, not in production yet exporting NFS to
>> > cloudstack, I'm not really happy about the performance
>> > but I think is related to the hardware itself rather than technology,
>> > we
>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster
>> > as
>> > well, but we decided to move away from that path since gluster nfs is
>> still
>> > performing poorly, plus we would like to see cloudstack integrating the
>> > gluster-fuse module, we haven't decided the final storage setup but at
>> the
>> > moment we had better results with ZFS.
>> >
>> >
>>
>> question is whos ZFS and have you "tweaked" the zfs / nfs config for
>> performance
>>
>> >
>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com
>> >wrote:
>> >
>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>> >>
>> >>> ZFS looks really interesting to me and I am leaning that way. I am
>> >>> considering using FreeNAS, as people seem to be having good luck with
>> >>> it. Can anyone weigh in here?
>> >>>
>> >>>
>> >> My personal opinion, I think FreeNAS and OpenFiler have horrible,
>> horrible
>> >> User Interfaces - not very intuitive, and they both seem to be file
>> servers
>> >> with things like iSCSI targets tacked on as an afterthought.
>> >>
>> >> Nik
>> >>
>> >>
>> >>> Trevor Francis
>> >>> Partner
>> >>> 46 Labs | The PeerEdge Cloud
>> >>> http://www.46labs.com <http://www.46labs.com/> |
>> http://www.peeredge.net
>> >>> <http://www.peeredge.net/>
>> >>> 405-362-0046 - Voice | 405-410-4980 - Cell
>> >>> trevorgfrancis - Skype
>> >>> trevor@46labs.com <ma...@46labs.com>
>> >>> Solutions Provider for the Telecom Industry
>> >>>
>> >>> <http://www.twitter.com/**peeredge
>> >>> <http://www.twitter.com/peeredge>><
>> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>> >>> http://**www.facebook.com/PeerEdge
>> >>> <http://www.facebook.com/PeerEdge>>
>> >>>
>> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>> >>>
>> >>> ZFS would be an interesting setup as you can do the cache pools like
>> you
>> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is
>> >>>> that
>> >>>> they
>> >>>> really don't scale out well if you are looking for something with a
>> >>>> unified
>> >>>> name space. I'll say however that ZFS is a battle hardened FS with
>> tons
>> >>>> of
>> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things
>> >>>> these
>> >>>> smaller start up companies are hocking are just ZFS appliances.
>> >>>>
>> >>>> RBD looks interesting but I'm not sure if I would be willing to put
>> >>>> production data on it, I'm not sure how performant it is IRL. From a
>> >>>> purely technical perspective, it looks REALLY cool.
>> >>>>
>> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is
>> another
>> >>>> option although historically small/random IO has not been it's
>> >>>> strong
>> >>>> point.
>> >>>>
>> >>>> If you are ok spending money on software and want a scale out block
>> >>>> storage
>> >>>> then you might want to consider HP LeftHand's VSA product. I am
>> >>>> personally
>> >>>> partial to NFS plays:) I went the exact opposite approach and
>> >>>> settled
>> on
>> >>>> Isilon for our primary storage for our CS deployment.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin
>> >>>> <nik.martin@nfinausa.com
>> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
>> >>>>
>> >>>> On 10/22/2012 10:16 AM, Trevor Francis wrote:
>> >>>>>
>> >>>>> We are looking at building a Primary Storage solution for an
>> >>>>>> enterprise/carrier class application. However, we want to build it
>> >>>>>> using
>> >>>>>> a FOSS solution and not a commercial solution. Do you have a
>> >>>>>> recommendation on platform?
>> >>>>>>
>> >>>>>>
>> >>>>>> Trevor,
>> >>>>>
>> >>>>> I got EXCELLENT results builing a SAN from FOSS using:
>> >>>>> OS: Centos
>> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI
>> MegaRAID
>> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>> >>>>> caching
>> >>>>> (alternately, look at FlashCache from Facebook)
>> >>>>> intel 10GB dual port nics, one port for crossover, on port for up
>> link
>> >>>>> to
>> >>>>> storage network
>> >>>>>
>> >>>>> DRBD for real time block replication to active-active
>> >>>>> Pacemaker+corosync for HA Resource management
>> >>>>> tgtd for iSCSI target
>> >>>>>
>> >>>>> If you want file backed storage, XFS is a very good filesystem on
>> Linux
>> >>>>> now.
>> >>>>>
>> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but
>> that
>> >>>>> setup gave me a VERY high performance SAN. The downside is it is
>> >>>>> entirely
>> >>>>> managed by CLI, no UI whatsoever.
>> >>>>>
>> >>>>>
>> >>>>> Trevor Francis
>> >>>>>> Partner
>> >>>>>> 46 Labs | The PeerEdge Cloud
>> >>>>>> http://www.46labs.com <http://www.46labs.com/> |
>> >>>>>> http://www.peeredge.net
>> >>>>>> <http://www.peeredge.net/>
>> >>>>>>
>> >>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell
>> >>>>>> trevorgfrancis - Skype
>> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
>> trevor@46labs.com
>> >>>>>> >
>> >>>>>>
>> >>>>>>
>> >>>>>> Solutions Provider for the Telecom Industry
>> >>>>>>
>> >>>>>> <http://www.twitter.com/****peeredge<
>> http://www.twitter.com/**peeredge><
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://www.twitter.**com/**peeredge <
>> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
>> >>>>>> >><
>> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge>
>> >>>>>> <
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://**www.facebook.com/**PeerEdge<
>> http://www.facebook.com/PeerEdge><
>> >>>>>> http://www.facebook.com/**PeerEdge <
>> http://www.facebook.com/PeerEdge>
>> >>>>>> >>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>>

Re: Primary Storage

Posted by Andreas Huser <ah...@7five-edv.de>.
Hi Fabrice,

I don't know  what other people do but i have no problems with Infiniband + GlusterFS + Cloudstack
I'm  not use  Nexenta it's based on Illumos and work not fine with Infiniband.
I have two different clusters in productiv enviroments. 

The first: Solaris 11  with built-in GlusterFS 3.3 export Gluster Vol. with RDMA. > performance is okey you can use that for smaller enviroments
The Second: Is a little bit complex with a GlusterFS Server in the middle

ZFS Server: based on Solaris 11
1.) Create a zpool min. two vdevs and SSD read/write cache
2.) Create a thin provisioning Volume "zfs create -V" disable atime and enable compression (do not enable dedup!) and export as (iWarp) SRP Target to the GlusterFS Server use a direct connection without IB Switch.
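A rough sketch of what steps 1.) and 2.) can look like on the Solaris side (the pool layout, device names and zvol name are only examples, and the commented COMSTAR/SRP part is only a hint that needs to be checked against your own HBA and driver setup):

  # pool with two mirrored vdevs plus SSD read (L2ARC) and write (ZIL) cache
  zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
        cache c4t0d0 log mirror c5t0d0 c5t1d0

  # thin provisioned (sparse) zvol with lzjb compression, dedup off
  zfs create -s -V 4T tank/gfsvol1
  zfs set compression=lzjb tank/gfsvol1
  zfs set dedup=off tank/gfsvol1

  # export the zvol over SRP via COMSTAR (sketch only, verify service and LU names)
  # svcadm enable stmf
  # svcadm enable -r ibsrp/target
  # stmfadm create-lu /dev/zvol/rdsk/tank/gfsvol1
  # stmfadm add-view <LU-GUID>

Because the zvol is compressed, the space-reclaim trick mentioned elsewhere in this thread also applies: zero-filling free space inside a guest (dd from /dev/zero on Linux, sdelete on Windows) shrinks what the zvol actually occupies.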

GlusterFS Server:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use the OFED driver from https://www.openfabrics.org
3.) Import the SRP target from the ZFS server and format it as XFS
4.) Create a Gluster volume with "volume create xy transport rdma" (use only rdma), as sketched below
5.) Connect the second IB port to an IB switch 
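On the GlusterFS head that comes down to something like this (disk, brick path and volume name are placeholders):

  # the SRP LUN appears as a normal block device once the OFED srp module logs in to the target
  mkfs.xfs -i size=512 /dev/sdb
  mkdir -p /bricks/brick1
  mount -o noatime /dev/sdb /bricks/brick1
  # rdma-only Gluster volume on top of the XFS brick
  gluster volume create primarystore transport rdma gfs01:/bricks/brick1
  gluster volume start primarystore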

Cloudstack Hypervisor Node:
1.) Use RedHat, CentOS or Fedora (I use CentOS 5 and 6)
2.) Use the OFED driver from https://www.openfabrics.org
3.) Mount the Gluster volume (see the mount line below)
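The mount on the hypervisor is then simply (same placeholder volume name as above):

  mount -t glusterfs gfs01:/primarystore /mnt/primary

Because the volume was created with transport rdma only, the native client talks RDMA for the data path.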

ZFS Thin Volume ---- Infinband SRP  ----> GlusterFS ---- GFSVol rdma ----> IB Switch ----> Clients

The ZFS and GlusterFS servers form one storage unit, connected directly with 40Gbit Infiniband point-to-point.
You don't even notice there is a cable in between!

Important: when you have Infiniband, do not use iSCSI over IPoIB! If you already have Infiniband, you should use its real advantage:
IPoIB has a higher latency than SRP over RDMA!


SRP latency is in the microsecond range:

-- SRP -- 
local address: LID 0x01 QPN 0x44004b PSN 0xf3265b RKey 0x9804237c VAddr 0x00000001dda000
remote address: LID 0x0a QPN 0x10004a PSN 0x44072e RKey 0x1c0f115 VAddr 0x000000088e6000
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000          1.29           125.30       1.31   
------------------------------------------------------------------

-- IPoIB ---
[root@sv01sfogaa ~]# ping 10.200.0.10
PING 10.200.0.10 (10.200.0.10) 56(84) bytes of data.
64 bytes from 10.200.0.10: icmp_seq=1 ttl=255 time=0.147 ms
64 bytes from 10.200.0.10: icmp_seq=2 ttl=255 time=0.116 ms

When you put load on IPoIB, the latency increases; that is not good.
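(The SRP numbers above are in the output format of the perftest latency tools; something like ib_read_lat or ib_send_lat run between the two hosts gives comparable output, and plain ping covers the IPoIB side.)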


That is my recommendation for a simple GlusterFS Mirror:

- Supermicro server with Intel hardware and an expander backplane
- 1x Crucial m4 SSD as read cache
- 2x ZeusIOPS SSDs as write cache (mirror)
- SATA 24/7 hard drives
- LSI HBA 9207 or 9211
- ConnectX-2 QDR dual-port Infiniband adapter (HP refurbished with full warranty for 100$). Important: flash the newest firmware from Mellanox!
- Mellanox IB switch
- Solaris 11
- GlusterFS 3.3 compiled with ib_verbs
- Gluster volume with transport rdma only

>> Throughput is constant at up to 200 MByte/s
more throughput with more storage servers or more hard drives in the JBOD

Info:

- I had some problems with Infiniband RDMA or SRP on OpenIndiana, Illumos or Nexenta. Some adapters have high latency or an unstable connection. Use Solaris, that's the right way!
- OpenIndiana is beta! Infiniband ib_verbs does not work, or does not work well! 
- Use Solaris 11: Infiniband ib_verbs is native and stable there
- Don't use Ubuntu (client or server) for Infiniband! Use RedHat, Fedora or CentOS and install the right drivers from https://www.openfabrics.org/downloads/OFED/
- You have no SSD cache? Then disable sync on the ZFS volume (see the commands below)! Important: you lose some data security, but some protocols set sync flags in transport. For example, NFS uses fsync by default, so the write cache is not active and NFS writes data straight to the hard drive. For data security and performance, give the storage server an SSD write cache. ZFS works by default with sync=standard, which prevents write holes (COW system).
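For the sync setting that means something like this (dataset name is a placeholder):

  zfs set sync=disabled tank/gfsbrick   # only without an SSD write cache, and only if you accept losing the last few seconds of writes on power failure
  zfs set sync=standard tank/gfsbrick   # the safe default, once a write-cache SSD / SLOG is in place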

I hope this helps a little.

Greetings from Germany
Andreas



----- Ursprüngliche Mail ----- 

Von: "Fabrice Brazier" <fa...@apalia.net> 
An: cloudstack-users@incubator.apache.org 
Gesendet: Dienstag, 23. Oktober 2012 09:55:15 
Betreff: RE: Primary Storage 

Hi Andreas, 

Hello i just see your configuration, it seems quite interesting. 
If i understand well you want to build some zfs array on the backend. 
Export luns (probably by iscsi over infiniband) to you linux cluster, 
and on the linux cluster you put glusterFS. 
I can understand the point, with that you can have very good performance and 
reliability (zfs), 
scalability and redundancy (gluster) for very low cost. 
So just one question, did you try the global namespace implementation from 
nexenta? 
If yes can you tell me what configuration is the best for you? 
I mean the fact you have a gluster cluster in the middle must impact the 
overral 
performance no? 

Fabrice 

-----Message d'origine----- 
De : Andreas Huser [mailto:ahuser@7five-edv.de] 
Envoyé : mardi 23 octobre 2012 05:40 
À : cloudstack-users@incubator.apache.org 
Objet : Re: Primary Storage 

Hi, 

for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That 
gives the best performance and most scalable Storage. 
I have tasted some different solutions for primary Storage but the most are 
to expensive and for a CloudStack Cluster not economic or have a poor 
performance. 

My Configuration: 
Storage Node: 
Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read 
cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX 
40Gbit/s Infiniband adapter. 

I have installed GlusterFS direct on Solaris with a modified code. 
Want you build bigger systems for more then 50 VMs it is better you split 
the Solaris and GlusterFS with a separte headnode for GlusterFS 

That looks like: 
Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) --> ( SRP 
Target attached direct without Infiniband switch to GF Node) --> GlusterFS 
Node the srp target formatted with xfs filesystem, create a GlusterFS 
Volume --> ( Infiniband over a Mellanox Port Switch) --> Cloudstack Node 
mount glusterFS Volume over RDMA 

For the Dataset Volume at the ZFS Storage, disable atime and enable 
compression. 
(Space reclaim) With compression you can shrink the ZFS Volume with command 
at Linux dd /dev/zero or In a Windows VM with sdelete That gives you space 
left on the Primary Storage for deleted Files in a VM or for deleted vhd's 
or vm's in the cloudstack 

greeting Andreas 




Mit freundlichen Grüßen 

Andreas Huser 
Geschäftsführer 
System Engineer / Consultant 
(Cisco CSE, SMBAM, LCSE, ASAM) 
--------------------------------------- 
Zellerstraße 28 - 77654 Offenburg 
Tel: +49(781) 12786898 
Mobil: +49(176) 10308549 
ahuser@7five-edv.de 




----- Ursprüngliche Mail ----- 

Von: "Outback Dingo" <ou...@gmail.com> 
An: cloudstack-users@incubator.apache.org 
Gesendet: Dienstag, 23. Oktober 2012 02:15:16 
Betreff: Re: Primary Storage 

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote: 
> Solaris 11 ZFS and yes we tried different setups, raids levels number 
> of SSD cache, ARC zfs options etc etc etc. 
> 
> Cheers 
> 

VMWare ?? 

> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo 
> <ou...@gmail.com>wrote: 
> 
>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> 
>> wrote: 
>> > We are using ZFS, with jbod, not in production yet exporting NFS to 
>> > cloudstack, I'm not really happy about the performance 
>> > but I think is related to the hardware itself rather than technology, 
>> > we 
>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster 
>> > as 
>> > well, but we decided to move away from that path since gluster nfs is 
>> still 
>> > performing poorly, plus we would like to see cloudstack integrating the 
>> > gluster-fuse module, we haven't decided the final storage setup but at 
>> the 
>> > moment we had better results with ZFS. 
>> > 
>> > 
>> 
>> question is whos ZFS and have you "tweaked" the zfs / nfs config for 
>> performance 
>> 
>> > 
>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com 
>> >wrote: 
>> > 
>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote: 
>> >> 
>> >>> ZFS looks really interesting to me and I am leaning that way. I am 
>> >>> considering using FreeNAS, as people seem to be having good luck with 
>> >>> it. Can anyone weigh in here? 
>> >>> 
>> >>> 
>> >> My personal opinion, I think FreeNAS and OpenFiler have horrible, 
>> horrible 
>> >> User Interfaces - not very intuitive, and they both seem to be file 
>> servers 
>> >> with things like iSCSI targets tacked on as an afterthought. 
>> >> 
>> >> Nik 
>> >> 
>> >> 
>> >>> Trevor Francis 
>> >>> Partner 
>> >>> 46 Labs | The PeerEdge Cloud 
>> >>> http://www.46labs.com <http://www.46labs.com/> | 
>> http://www.peeredge.net 
>> >>> <http://www.peeredge.net/> 
>> >>> 405-362-0046 - Voice | 405-410-4980 - Cell 
>> >>> trevorgfrancis - Skype 
>> >>> trevor@46labs.com <ma...@46labs.com> 
>> >>> Solutions Provider for the Telecom Industry 
>> >>> 
>> >>> <http://www.twitter.com/**peeredge 
>> >>> <http://www.twitter.com/peeredge>>< 
>> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>>< 
>> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>>< 
>> >>> http://**www.facebook.com/PeerEdge 
>> >>> <http://www.facebook.com/PeerEdge>> 
>> >>> 
>> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote: 
>> >>> 
>> >>> ZFS would be an interesting setup as you can do the cache pools like 
>> you 
>> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is 
>> >>>> that 
>> >>>> they 
>> >>>> really don't scale out well if you are looking for something with a 
>> >>>> unified 
>> >>>> name space. I'll say however that ZFS is a battle hardened FS with 
>> tons 
>> >>>> of 
>> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things 
>> >>>> these 
>> >>>> smaller start up companies are hocking are just ZFS appliances. 
>> >>>> 
>> >>>> RBD looks interesting but I'm not sure if I would be willing to put 
>> >>>> production data on it, I'm not sure how performant it is IRL. From a 
>> >>>> purely technical perspective, it looks REALLY cool. 
>> >>>> 
>> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is 
>> another 
>> >>>> option although historically small/random IO has not been it's 
>> >>>> strong 
>> >>>> point. 
>> >>>> 
>> >>>> If you are ok spending money on software and want a scale out block 
>> >>>> storage 
>> >>>> then you might want to consider HP LeftHand's VSA product. I am 
>> >>>> personally 
>> >>>> partial to NFS plays:) I went the exact opposite approach and 
>> >>>> settled 
>> on 
>> >>>> Isilon for our primary storage for our CS deployment. 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin 
>> >>>> <nik.martin@nfinausa.com 
>> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote: 
>> >>>> 
>> >>>> On 10/22/2012 10:16 AM, Trevor Francis wrote: 
>> >>>>> 
>> >>>>> We are looking at building a Primary Storage solution for an 
>> >>>>>> enterprise/carrier class application. However, we want to build it 
>> >>>>>> using 
>> >>>>>> a FOSS solution and not a commercial solution. Do you have a 
>> >>>>>> recommendation on platform? 
>> >>>>>> 
>> >>>>>> 
>> >>>>>> Trevor, 
>> >>>>> 
>> >>>>> I got EXCELLENT results builing a SAN from FOSS using: 
>> >>>>> OS: Centos 
>> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI 
>> MegaRAID 
>> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD 
>> >>>>> caching 
>> >>>>> (alternately, look at FlashCache from Facebook) 
>> >>>>> intel 10GB dual port nics, one port for crossover, on port for up 
>> link 
>> >>>>> to 
>> >>>>> storage network 
>> >>>>> 
>> >>>>> DRBD for real time block replication to active-active 
>> >>>>> Pacemaker+corosync for HA Resource management 
>> >>>>> tgtd for iSCSI target 
>> >>>>> 
>> >>>>> If you want file backed storage, XFS is a very good filesystem on 
>> Linux 
>> >>>>> now. 
>> >>>>> 
>> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but 
>> that 
>> >>>>> setup gave me a VERY high performance SAN. The downside is it is 
>> >>>>> entirely 
>> >>>>> managed by CLI, no UI whatsoever. 
>> >>>>> 
>> >>>>> 
>> >>>>> Trevor Francis 
>> >>>>>> Partner 
>> >>>>>> 46 Labs | The PeerEdge Cloud 
>> >>>>>> http://www.46labs.com <http://www.46labs.com/> | 
>> >>>>>> http://www.peeredge.net 
>> >>>>>> <http://www.peeredge.net/> 
>> >>>>>> 
>> >>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell 
>> >>>>>> trevorgfrancis - Skype 
>> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto: 
>> trevor@46labs.com 
>> >>>>>> > 
>> >>>>>> 
>> >>>>>> 
>> >>>>>> Solutions Provider for the Telecom Industry 
>> >>>>>> 
>> >>>>>> <http://www.twitter.com/****peeredge< 
>> http://www.twitter.com/**peeredge>< 
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge 
>> >>>< 
>> >>>>>> http://www.twitter.**com/**peeredge < 
>> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge> 
>> >>>>>> >>< 
>> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> 
>> >>>>>> < 
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge 
>> >>>< 
>> >>>>>> http://**www.facebook.com/**PeerEdge< 
>> http://www.facebook.com/PeerEdge>< 
>> >>>>>> http://www.facebook.com/**PeerEdge < 
>> http://www.facebook.com/PeerEdge> 
>> >>>>>> >> 
>> >>>>>> 
>> >>>>>> 
>> >>>>>> 
>> >>>>> 
>> 

RE: Primary Storage

Posted by Fabrice Brazier <fa...@apalia.net>.
Hi Andreas,

Hello, I just saw your configuration; it seems quite interesting.
If I understand well, you want to build some ZFS arrays on the backend,
export LUNs (probably by iSCSI over Infiniband) to your Linux cluster,
and on the Linux cluster you put GlusterFS.
I can see the point: with that you can have very good performance and
reliability (ZFS),
plus scalability and redundancy (Gluster), for very low cost.
So just one question: did you try the global namespace implementation from
Nexenta?
If yes, can you tell me which configuration is the best for you?
I mean, the fact that you have a Gluster cluster in the middle must impact the
overall
performance, no?

Fabrice

-----Message d'origine-----
De : Andreas Huser [mailto:ahuser@7five-edv.de]
Envoyé : mardi 23 octobre 2012 05:40
À : cloudstack-users@incubator.apache.org
Objet : Re: Primary Storage

Hi,

for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That
gives the best performance and most scalable Storage.
I have tasted some different  solutions for primary Storage but the most are
to expensive  and for a CloudStack Cluster not economic or have a poor
performance.

My Configuration:
Storage Node:
Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read
cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX
40Gbit/s Infiniband adapter.

I have installed GlusterFS direct on Solaris with a modified code.
Want you build bigger systems for more then  50 VMs it is better you split
the Solaris and GlusterFS with a separte headnode for GlusterFS

That looks like:
Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) -->  ( SRP
Target  attached direct without Infiniband switch to GF Node)  --> GlusterFS
Node the srp target formatted with xfs filesystem, create a GlusterFS
Volume --> ( Infiniband over a Mellanox Port Switch)  --> Cloudstack Node
mount glusterFS Volume over RDMA

For the Dataset Volume at the ZFS Storage, disable atime and enable
compression.
(Space reclaim) With compression you can shrink the ZFS Volume with command
at Linux dd /dev/zero or  In a Windows VM with sdelete That gives you space
left on the Primary Storage for deleted Files in a VM or for deleted vhd's
or vm's in the cloudstack

greeting Andreas




Mit freundlichen Grüßen

Andreas Huser
Geschäftsführer
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel:     +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de




----- Ursprüngliche Mail -----

Von: "Outback Dingo" <ou...@gmail.com>
An: cloudstack-users@incubator.apache.org
Gesendet: Dienstag, 23. Oktober 2012 02:15:16
Betreff: Re: Primary Storage

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
> Solaris 11 ZFS and yes we tried different setups, raids levels number
> of SSD cache, ARC zfs options etc etc etc.
>
> Cheers
>

VMWare ??

> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
> <ou...@gmail.com>wrote:
>
>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com>
>> wrote:
>> > We are using ZFS, with jbod, not in production yet exporting NFS to
>> > cloudstack, I'm not really happy about the performance
>> > but I think is related to the hardware itself rather than technology,
>> > we
>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster
>> > as
>> > well, but we decided to move away from that path since gluster nfs is
>> still
>> > performing poorly, plus we would like to see cloudstack integrating the
>> > gluster-fuse module, we haven't decided the final storage setup but at
>> the
>> > moment we had better results with ZFS.
>> >
>> >
>>
>> question is whos ZFS and have you "tweaked" the zfs / nfs config for
>> performance
>>
>> >
>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com
>> >wrote:
>> >
>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>> >>
>> >>> ZFS looks really interesting to me and I am leaning that way. I am
>> >>> considering using FreeNAS, as people seem to be having good luck with
>> >>> it. Can anyone weigh in here?
>> >>>
>> >>>
>> >> My personal opinion, I think FreeNAS and OpenFiler have horrible,
>> horrible
>> >> User Interfaces - not very intuitive, and they both seem to be file
>> servers
>> >> with things like iSCSI targets tacked on as an afterthought.
>> >>
>> >> Nik
>> >>
>> >>
>> >>> Trevor Francis
>> >>> Partner
>> >>> 46 Labs | The PeerEdge Cloud
>> >>> http://www.46labs.com <http://www.46labs.com/> |
>> http://www.peeredge.net
>> >>> <http://www.peeredge.net/>
>> >>> 405-362-0046 - Voice | 405-410-4980 - Cell
>> >>> trevorgfrancis - Skype
>> >>> trevor@46labs.com <ma...@46labs.com>
>> >>> Solutions Provider for the Telecom Industry
>> >>>
>> >>> <http://www.twitter.com/**peeredge
>> >>> <http://www.twitter.com/peeredge>><
>> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>> >>> http://**www.facebook.com/PeerEdge
>> >>> <http://www.facebook.com/PeerEdge>>
>> >>>
>> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>> >>>
>> >>> ZFS would be an interesting setup as you can do the cache pools like
>> you
>> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is
>> >>>> that
>> >>>> they
>> >>>> really don't scale out well if you are looking for something with a
>> >>>> unified
>> >>>> name space. I'll say however that ZFS is a battle hardened FS with
>> tons
>> >>>> of
>> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things
>> >>>> these
>> >>>> smaller start up companies are hocking are just ZFS appliances.
>> >>>>
>> >>>> RBD looks interesting but I'm not sure if I would be willing to put
>> >>>> production data on it, I'm not sure how performant it is IRL. From a
>> >>>> purely technical perspective, it looks REALLY cool.
>> >>>>
>> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is
>> another
>> >>>> option although historically small/random IO has not been it's
>> >>>> strong
>> >>>> point.
>> >>>>
>> >>>> If you are ok spending money on software and want a scale out block
>> >>>> storage
>> >>>> then you might want to consider HP LeftHand's VSA product. I am
>> >>>> personally
>> >>>> partial to NFS plays:) I went the exact opposite approach and
>> >>>> settled
>> on
>> >>>> Isilon for our primary storage for our CS deployment.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin
>> >>>> <nik.martin@nfinausa.com
>> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
>> >>>>
>> >>>> On 10/22/2012 10:16 AM, Trevor Francis wrote:
>> >>>>>
>> >>>>> We are looking at building a Primary Storage solution for an
>> >>>>>> enterprise/carrier class application. However, we want to build it
>> >>>>>> using
>> >>>>>> a FOSS solution and not a commercial solution. Do you have a
>> >>>>>> recommendation on platform?
>> >>>>>>
>> >>>>>>
>> >>>>>> Trevor,
>> >>>>>
>> >>>>> I got EXCELLENT results builing a SAN from FOSS using:
>> >>>>> OS: Centos
>> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI
>> MegaRAID
>> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>> >>>>> caching
>> >>>>> (alternately, look at FlashCache from Facebook)
>> >>>>> intel 10GB dual port nics, one port for crossover, on port for up
>> link
>> >>>>> to
>> >>>>> storage network
>> >>>>>
>> >>>>> DRBD for real time block replication to active-active
>> >>>>> Pacemaker+corosync for HA Resource management
>> >>>>> tgtd for iSCSI target
>> >>>>>
>> >>>>> If you want file backed storage, XFS is a very good filesystem on
>> Linux
>> >>>>> now.
>> >>>>>
>> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but
>> that
>> >>>>> setup gave me a VERY high performance SAN. The downside is it is
>> >>>>> entirely
>> >>>>> managed by CLI, no UI whatsoever.
>> >>>>>
>> >>>>>
>> >>>>> Trevor Francis
>> >>>>>> Partner
>> >>>>>> 46 Labs | The PeerEdge Cloud
>> >>>>>> http://www.46labs.com <http://www.46labs.com/> |
>> >>>>>> http://www.peeredge.net
>> >>>>>> <http://www.peeredge.net/>
>> >>>>>>
>> >>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell
>> >>>>>> trevorgfrancis - Skype
>> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
>> trevor@46labs.com
>> >>>>>> >
>> >>>>>>
>> >>>>>>
>> >>>>>> Solutions Provider for the Telecom Industry
>> >>>>>>
>> >>>>>> <http://www.twitter.com/****peeredge<
>> http://www.twitter.com/**peeredge><
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://www.twitter.**com/**peeredge <
>> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
>> >>>>>> >><
>> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge>
>> >>>>>> <
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://**www.facebook.com/**PeerEdge<
>> http://www.facebook.com/PeerEdge><
>> >>>>>> http://www.facebook.com/**PeerEdge <
>> http://www.facebook.com/PeerEdge>
>> >>>>>> >>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>>

Re: Primary Storage

Posted by Outback Dingo <ou...@gmail.com>.
On Tue, Oct 23, 2012 at 12:04 AM,  <br...@v365.com.au> wrote:
>
>
> Id suggest every one have a look at www.osnexus.com
> supports fiber,
> 10gb, infiniband using the SCST iSCSI code from
> http://scst.sourceforge.net/
> has NFS and all the good stuff including a
> pretty good GUI, replication (lumbering is not there yet) runs on
> Ubuntu
>

hahahahaha......... tiered pricing model, hardware not included....FAIL!

License: Licensed Capacity 128TB, Support Tier, $18,695.00 USD

> On 23.10.2012 11:40, Andreas Huser wrote:
>> Hi,
>>
>> for
> Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband
>> (RDMA).
> That gives the best performance and most scalable Storage.
>> I have
> tasted some different solutions for primary Storage but the
>> most are
> to expensive and for a CloudStack Cluster not economic or
>> have a poor
> performance.
>>
>> My Configuration:
>> Storage Node:
>> Supermicro Server
> (Intel Hardware) with Solaris 11 with SSD write and
>> read cache (read
> crucial-m4, write ZeusIOPS) GlusterFS and dualport
>> ConnectX 40Gbit/s
> Infiniband adapter.
>>
>> I have installed GlusterFS direct on Solaris
> with a modified code.
>> Want you build bigger systems for more then 50
> VMs it is better you
>> split the Solaris and GlusterFS with a separte
> headnode for GlusterFS
>>
>> That looks like:
>> Solaris ZFS
> Backendstorage with a dataset Volume (Thin Provision) -->
>> ( SRP Target
> attached direct without Infiniband switch to GF Node)
>> --> GlusterFS
> Node the srp target formatted with xfs filesystem,
>> create a GlusterFS
> Volume --> ( Infiniband over a Mellanox Port
>> Switch) --> Cloudstack
> Node mount glusterFS Volume over RDMA
>>
>> For the Dataset Volume at the
> ZFS Storage, disable atime and enable
>> compression.
>> (Space reclaim)
> With compression you can shrink the ZFS Volume with
>> command at Linux
> dd /dev/zero or In a Windows VM with sdelete
>> That gives you space left
> on the Primary Storage for deleted Files in
>> a VM or for deleted vhd's
> or vm's in the cloudstack
>>
>> greeting Andreas
>>
>>
>>
>>
>> Mit
> freundlichen Grüßen
>>
>> Andreas Huser
>> Geschäftsführer
>> System
> Engineer / Consultant
>> (Cisco CSE, SMBAM, LCSE, ASAM)
>>
> ---------------------------------------
>> Zellerstraße 28 - 77654
> Offenburg
>> Tel: +49(781) 12786898
>> Mobil: +49(176) 10308549
>>
> ahuser@7five-edv.de
>>
>>
>>
>>
>> ----- Ursprüngliche Mail -----
>>
>>
> Von: "Outback Dingo"
>> An: cloudstack-users@incubator.apache.org
>>
> Gesendet: Dienstag, 23. Oktober 2012 02:15:16
>> Betreff: Re: Primary
> Storage
>>
>> On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
>>>
> Solaris 11 ZFS and yes we tried different setups, raids levels number
> of
>>> SSD cache, ARC zfs options etc etc etc.
>>>
>>> Cheers
>>>
>>
>>
> VMWare ??
>>
>>> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
> wrote:
>>>
>>>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez wrote:
>>>>
>> We are using ZFS, with jbod, not in production yet exporting NFS
> to
>>>> > cloudstack, I'm not really happy about the performance
>>>> >
> but I think is related to the hardware itself rather than technology,
> we
>>>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating
> gluster as
>>>> > well, but we decided to move away from that path since
> gluster nfs is
>>>> still
>>>> > performing poorly, plus we would like to
> see cloudstack integrating the
>>>> > gluster-fuse module, we haven't
> decided the final storage setup but at
>>>> the
>>>> > moment we had
> better results with ZFS.
>>>> >
>>>> >
>>>>
>>>> question is whos ZFS and
> have you "tweaked" the zfs / nfs config for
>>>> performance
>>>>
>>>>
>>
>>>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin >> >wrote:
>>>>
>>
>>>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>>>> >>
>>>> >>>
> ZFS looks really interesting to me and I am leaning that way. I am
>>>>
>>>> considering using FreeNAS, as people seem to be having good luck
> with
>>>> >>> it. Can anyone weigh in here?
>>>> >>>
>>>> >>>
>>>> >> My
> personal opinion, I think FreeNAS and OpenFiler have horrible,
>>>>
> horrible
>>>> >> User Interfaces - not very intuitive, and they both seem
> to be file
>>>> servers
>>>> >> with things like iSCSI targets tacked on
> as an afterthought.
>>>> >>
>>>> >> Nik
>>>> >>
>>>> >>
>>>> >>> Trevor
> Francis
>>>> >>> Partner
>>>> >>> 46 Labs | The PeerEdge Cloud
>>>> >>>
> http://www.46labs.com |
>>>> http://www.peeredge.net
>>>> >>>
>>>> >>>
> 405-362-0046 - Voice | 405-410-4980 - Cell
>>>> >>> trevorgfrancis -
> Skype
>>>> >>> trevor@46labs.com
>>>> >>> Solutions Provider for the
> Telecom Industry
>>>> >>>
>>>> >>> ><
>>>> >>>
> http://www.twitter.**com/peeredge ><
>>>> >>>
> http://www.**twitter.com/peeredge ><
>>>> >>>
> http://**www.facebook.com/PeerEdge >
>>>> >>>
>>>> >>> On Oct 22, 2012, at
> 2:30 PM, Jason Davis wrote:
>>>> >>>
>>>> >>> ZFS would be an interesting
> setup as you can do the cache pools like
>>>> you
>>>> >>>> would do in
> CacheCade. The problem with ZFS or CacheCade+DRBD is that
>>>> >>>>
> they
>>>> >>>> really don't scale out well if you are looking for
> something with a
>>>> >>>> unified
>>>> >>>> name space. I'll say however
> that ZFS is a battle hardened FS with
>>>> tons
>>>> >>>> of
>>>> >>>>
> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things
> these
>>>> >>>> smaller start up companies are hocking are just ZFS
> appliances.
>>>> >>>>
>>>> >>>> RBD looks interesting but I'm not sure if
> I would be willing to put
>>>> >>>> production data on it, I'm not sure
> how performant it is IRL. From a
>>>> >>>> purely technical perspective,
> it looks REALLY cool.
>>>> >>>>
>>>> >>>> I suppose anything is fast if
> you put SSDs in it :) GlusterFS is
>>>> another
>>>> >>>> option although
> historically small/random IO has not been it's strong
>>>> >>>>
> point.
>>>> >>>>
>>>> >>>> If you are ok spending money on software and
> want a scale out block
>>>> >>>> storage
>>>> >>>> then you might want to
> consider HP LeftHand's VSA product. I am
>>>> >>>> personally
>>>> >>>>
> partial to NFS plays:) I went the exact opposite approach and
> settled
>>>> on
>>>> >>>> Isilon for our primary storage for our CS
> deployment.
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Mon, Oct 22,
> 2012 at 10:24 AM, Nik Martin >> >>>> >>wrote:
>>>> >>>>
>>>> >>>> On
> 10/22/2012 10:16 AM, Trevor Francis wrote:
>>>> >>>>>
>>>> >>>>> We are
> looking at building a Primary Storage solution for an
>>>> >>>>>>
> enterprise/carrier class application. However, we want to build it
>>>>
>>>>>>> using
>>>> >>>>>> a FOSS solution and not a commercial solution.
> Do you have a
>>>> >>>>>> recommendation on platform?
>>>> >>>>>>
>>>>
>>>>>>>
>>>> >>>>>> Trevor,
>>>> >>>>>
>>>> >>>>> I got EXCELLENT results
> builing a SAN from FOSS using:
>>>> >>>>> OS: Centos
>>>> >>>>> Hardware:
> 2X storage servers, with 12x2TB 3.5 SATA drives. LSI
>>>> MegaRAID
>>>>
>>>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do
> SSD
>>>> >>>>> caching
>>>> >>>>> (alternately, look at FlashCache from
> Facebook)
>>>> >>>>> intel 10GB dual port nics, one port for crossover,
> on port for up
>>>> link
>>>> >>>>> to
>>>> >>>>> storage network
>>>>
>>>>>>
>>>> >>>>> DRBD for real time block replication to
> active-active
>>>> >>>>> Pacemaker+corosync for HA Resource
> management
>>>> >>>>> tgtd for iSCSI target
>>>> >>>>>
>>>> >>>>> If you
> want file backed storage, XFS is a very good filesystem on
>>>> Linux
>>>>
>>>>>> now.
>>>> >>>>>
>>>> >>>>> Pacemaker+Corosync can be difficult to
> grok at the beginning, but
>>>> that
>>>> >>>>> setup gave me a VERY high
> performance SAN. The downside is it is
>>>> >>>>> entirely
>>>> >>>>>
> managed by CLI, no UI whatsoever.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> Trevor
> Francis
>>>> >>>>>> Partner
>>>> >>>>>> 46 Labs | The PeerEdge Cloud
>>>>
>>>>>>> http://www.46labs.com |
>>>> >>>>>> http://www.peeredge.net
>>>>
>>>>>>>
>>>> >>>>>>
>>>> >>>>>> 405-362-0046 - Voice | 405-410-4980 -
> Cell
>>>> >>>>>> trevorgfrancis - Skype
>>>> >>>>>> trevor@46labs.com >>
> trevor@46labs.com
>>>> >>>>>> >
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
> Solutions Provider for the Telecom Industry
>>>> >>>>>>
>>>> >>>>>> >>
> http://www.twitter.com/**peeredge><
>>>> >>>>>>
> http://www.twitter.com/**peeredge >> >>><
>>>> >>>>>>
> http://www.twitter.**com/**peeredge <
>>>>
> http://www.twitter.com/**peeredge
>>>> >>>>>> >><
>>>> >>>>>>
> http://www.**twitter.com/**peeredge <
>>>> >>>>>>
> http://www.twitter.com/**peeredge >> >>><
>>>> >>>>>>
> http://**www.facebook.com/**PeerEdge<
>>>>
> http://www.facebook.com/PeerEdge><
>>>> >>>>>>
> http://www.facebook.com/**PeerEdge <
>>>>
> http://www.facebook.com/PeerEdge>
>>>> >>>>>> >>
>>>> >>>>>>
>>>>
>>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>>
>

Re: Primary Storage

Posted by br...@v365.com.au.
  

hi Julien

Sadly there is no simple answer; it depends on how much
time and $ you want to spend, how much of your customers' data you
want to retain, and at what cost to the customer it is important to
back up. I.e. if you suffer VM loss because of SAN failure, then expect
many complaints, and customers will leave. Our customers are all
businesses, so we know we would lose them as customers if we lost their
data.

Here's our experience, for everyone to criticize... LOL.

Footnote: we saved $65K, which we pass on to our customers, and now have an
80Gb/s replicated SAN which is very reliable.

We had the same issue.
We have been working with fiber for years as contractors, and were faced with
an 8Gb FC upgrade plus a 10GbE upgrade; together they would cost over $40K to upgrade
our FC and Ethernet switches, plus $15K for mezzanine adapters, all at 2nd-hand
pricing. New would cost about $80K+. Having worked with all the main
storage vendors for years while contracting, I knew it would cost a
bundle. Which is why cloud costs so much... well, partly. 

So we tried
Infiniband, as there is enough support for ESX, CentOS and Ubuntu, which
we use as hypervisors.

2 x 2nd-hand QDR 40Gb/s 36-port switches, $3600 each
4 x HP QDR 40Gb/s C7000 chassis switches, $1500 each
36 x HP QDR Infiniband HCAs, $129 ea, $4.5K total

total outlay $15K ish

We already had 2
x C7000 chassis spare, plus BL460 servers, so we also grabbed some HP BL2x220c
G6 servers, as they have 2 x motherboards giving us much denser compute
power: 36 servers (dual X5670s per server, 4 CPUs per blade) per C7000
chassis, with dual 1Gb NICs + dual 40Gb/s IB. We use 10:1Gb Virtual Connect
on the chassis, no need to go up to 10GbE.

So each server (2 in each
blade) has 2 x 1Gb NICs and 2 x Infiniband connections; the Infiniband carries 4 x
10Gb virtual NICs + 2 x SRP paths @ 40Gb/s per blade motherboard.

Zoning (IB VLAN) on IB is limited, so we team the 2 x 1Gb Ethernet NICs,
which carry the production VLANs for Internet only. That's plenty of bandwidth
for most virtual web servers, of which we don't go over 20 VMs per
server, so each server has a dedicated 100Mb/s bursting to 2Gb/s on the
Internet VLANs, i.e. routed public IP or load-balanced VIP. Not all servers
on each blade are Internet-facing, so there is plenty of bandwidth. Also we
load balance to VIP clusters which sit on different blades to avoid I/O
bottlenecks for HA web farms, of which we have a couple; the biggest is
spread across 8 blades and can handle 8Gb/s of Internet load, as we
have 8 x 1Gb providers in an HA load-balanced cluster.

Plus we have 2 x 40Gb/s SRP
paths which connect to the primary SAN, and 4 x 10Gb IPoIB for the production
networks and iSCSI.

To build our clustered Infiniband storage array we
learnt Infiniband, then Ubuntu as it's fairly easy to implement, and
OSNexus for NFS/IP-based iSCSI. OSNexus is OK on Infiniband: we get
6-8Gb/s, or about 600-800MB/s read/write, with OSNexus using IPoIB and
iSCSI. But you need IB for 3-6 months to play with to figure it all
out.

We also have Open-E DSS licenses, but we found the same hardware
with Open-E (via Infiniband) only did about 350MB/s and we had no
control over the OS at all. So we ditched Open-E; 2 x 4TB licenses for
sale if anyone wants them. They cost about $900 for 4TB licenses. 

The
OSNexus licenses cost only $695 x 2 for 8TB x 2 (16TB in total), so
better value for us. We use OSNexus NFS and CIFS for secondary storage,
which is all carried on the production Infiniband IP networks.
Using OSNexus meant we didn't need to CLI all the time; the GUI is very easy
to use and very stable.

This also gives us superb security, as the
Infiniband IP network is not routed to the Internet; it's only for
iSCSI/NFS/CIFS backup data. SANs 3 and 4 are also connected using
dual-port 40Gb/s Infiniband using IPoIB. 

Each of the 2 OSNexus servers
runs as a primary/secondary SAN, so SAN 2 is a replica, which is constantly
replicating from SAN 1. The price of the SAN hardware was fairly cheap: 2RU
server (inc. 12-bay SAS/SATA backplane) + mobo + dual PSU + RAM + LSI 9260
RAID + X5650 CPU + 7 x 2TB WD RE SATA III drives, in RAID 6, with 1
hot spare, came to about $4500.

We have now migrated to CentOS 6.3 using
RedHat storage manager + SCST SRP + iSCSI. Wow, it's very powerful.
So the
primary SAN is 2 commodity servers: hex-core Intel x 2, 32GB ECC RAM,
LSI 9280-8i x 2, 32-port 6Gb/s expander chassis, dual power, 2 x 250GB
SSD for CacheCade, and 30 x 1TB WD VelociRaptor drives (200MB/s R/W each) in
RAID 6 arrays with 1 spare drive. So 16 drives per LSI array - 1 hot spare
- 2 for RAID 6 parity data = 13 drives @ 200MB/s R/W = 2600MB/s, and it holds
about 12TB raw. 2 arrays in the same server give us 24TB of replicated
SAN, which is presented as file I/O using SCST.

We format using XFS and
present the LUNs with SCST as file I/O (a rough config sketch is below). Volumes are clustered and replicated using RHCM.
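A minimal sketch of such an SCST fileio export in /etc/scst.conf (device, file and target names here are placeholders, not our production values):

  HANDLER vdisk_fileio {
          DEVICE lun0 {
                  filename /san/xfs01/lun0.img
                  nv_cache 1
          }
  }

  TARGET_DRIVER ib_srpt {
          TARGET ib_srpt_target_0 {
                  enabled 1
                  LUN 0 lun0
          }
  }

  TARGET_DRIVER iscsi {
          enabled 1
          TARGET iqn.2012-10.au.com.example:san1.lun0 {
                  enabled 1
                  LUN 0 lun0
          }
  }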


Our backup policy is a daily VMware snapshot, a daily archive using Veeam
backup to the SRP SAN, and a daily DR backup to archive.

As our hypervisors
are VMware, we can snap into the primary storage pool.

We decided to let
VMware manage the storage for each VM, with multiple 2TB LUNs presented
(6 x 2TB), so we have 6 LUNs which can be used. We use a simple formula:
VM drive size x 2, multiplied by the customer's VMs. Approximate drive size is
20-40GB for Linux and Win2k VMs, so we use the larger: 40GB x 2 = 80GB,
so a 2TB LUN fits about 50 host VMs (rough numbers below). Of course snapshots never use the
full size of the LUN, and we thin provision, so the LUN never actually
gets close to 100% full. If it did, SNMP would warn us, as would vCenter.
Also vDirector will not provision to LUNs with less than 20% storage
left.
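(Worked numbers for that formula, assuming the 40GB figure: 40GB x 2 = 80GB reserved per VM, and 2048GB / 80GB is roughly 25 fully allocated VMs per LUN; the ~50 figure holds because the LUNs are thin provisioned and the snapshot overhead rarely reaches the full 2x reservation.)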

The snapshots happen on SRP very quickly; a 40GB LUN usually only
takes 60-90 seconds. When VMware begins its backup at 2am there are as
many as 20~30 snapshots running and archives being updated. Because of
the IB network and the fast primary, secondary, 3rd and 4th SANs using IB,
over 300 VMs are backed up 3 TIMES in 4-5 hours: 1 snapshot, then Veeam
backup, then vCenter backup via CIFS.

We use vCenter backup and
data recovery to make a VMFS copy of each VM, which compresses the data. It
keeps 8 weeks of backups on CIFS/NFS storage (the 16TB above) using CIFS
over Infiniband to the Quantastor SANs. This is a FREE SERVICE; we don't
charge for it, as we felt it 100% essential that we can retrieve a 2TB
LUN instantly via the replicated SAN, or 1 day's snapshot, and all else
failing, 8 weeks of daily snaps from the 3rd/4th SAN. Our customers are all
businesses, so we know we would lose them as customers if we lost their
data.

So yes, storage is a major investment and has to be managed; it takes
time to see what works. But as we use Infiniband, the speed of it has
been a godsend, as I'm sure it would be much slower over 10GbE or fiber.
It's fast and very reliable.

Cheers

Bruce M

On 23.10.2012 12:55, Caleb
Call wrote:
> If I'm using fiber (which is what we do) I'm going
directly to the
> node not through another device that then shares it
out over much
> slower iSCSI or NFS. I realize using fiber is more
expensive for
> most, but for us it's the cheaper option because our
fiber
> infrastructure is already built out, our iSCSI not so much.
> 
>

> On Oct 22, 2012, at 10:04 PM, bruce.m@v365.com.au wrote:
> 
>>
>>
>>
Id suggest every one have a look at www.osnexus.com
>> supports
fiber,
>> 10gb, infiniband using the SCST iSCSI code from
>>
http://scst.sourceforge.net/
>> has NFS and all the good stuff including
a
>> pretty good GUI, replication (lumbering is not there yet) runs
on
>> Ubuntu
>>
>> On 23.10.2012 11:40, Andreas Huser wrote:
>>>
Hi,
>>>
>>> for
>> Cloudstack i use Solaris 11 ZFS + GlusterFS over
Infiniband
>>> (RDMA).
>> That gives the best performance and most
scalable Storage.
>>> I have
>> tasted some different solutions for
primary Storage but the
>>> most are
>> to expensive and for a
CloudStack Cluster not economic or
>>> have a poor
>>
performance.
>>>
>>> My Configuration:
>>> Storage Node:
>>> Supermicro
Server
>> (Intel Hardware) with Solaris 11 with SSD write and
>>> read
cache (read
>> crucial-m4, write ZeusIOPS) GlusterFS and dualport
>>>
ConnectX 40Gbit/s
>> Infiniband adapter.
>>>
>>> I have installed
GlusterFS direct on Solaris
>> with a modified code.
>>> Want you build
bigger systems for more then 50
>> VMs it is better you
>>> split the
Solaris and GlusterFS with a separte
>> headnode for GlusterFS
>>>
>>>
That looks like:
>>> Solaris ZFS
>> Backendstorage with a dataset Volume
(Thin Provision) -->
>>> ( SRP Target
>> attached direct without
Infiniband switch to GF Node)
>>> --> GlusterFS
>> Node the srp target
formatted with xfs filesystem,
>>> create a GlusterFS
>> Volume --> (
Infiniband over a Mellanox Port
>>> Switch) --> Cloudstack
>> Node mount
glusterFS Volume over RDMA
>>>
>>> For the Dataset Volume at the
>> ZFS
Storage, disable atime and enable
>>> compression.
>>> (Space
reclaim)
>> With compression you can shrink the ZFS Volume with
>>>
command at Linux
>> dd /dev/zero or In a Windows VM with sdelete
>>>
That gives you space left
>> on the Primary Storage for deleted Files
in
>>> a VM or for deleted vhd's
>> or vm's in the cloudstack
>>>
>>>
greeting Andreas
>>>
>>>
>>>
>>>
>>> Mit
>> freundlichen Grüßen
>>>
>>>
Andreas Huser
>>> Geschäftsführer
>>> System
>> Engineer /
Consultant
>>> (Cisco CSE, SMBAM, LCSE, ASAM)
>>>
>>
---------------------------------------
>>> Zellerstraße 28 - 77654
>>
Offenburg
>>> Tel: +49(781) 12786898
>>> Mobil: +49(176) 10308549
>>>
>>
ahuser@7five-edv.de
>>>
>>>
>>>
>>>
>>> ----- Ursprüngliche Mail
-----
>>>
>>>
>> Von: "Outback Dingo"
>>> An:
cloudstack-users@incubator.apache.org
>>>
>> Gesendet: Dienstag, 23.
Oktober 2012 02:15:16
>>> Betreff: Re: Primary
>> Storage
>>>
>>> On
Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote:
>>>>
>> Solaris 11
ZFS and yes we tried different setups, raids levels number
>> of
>>>>
SSD cache, ARC zfs options etc etc etc.
>>>>
>>>> Cheers
>>>>
>>>
>>>
>>
VMWare ??
>>>
>>>> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
>>
wrote:
>>>>
>>>>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez
wrote:
>>>>>
>>> We are using ZFS, with jbod, not in production yet
exporting NFS
>> to
>>>>>> cloudstack, I'm not really happy about the
performance
>>>>>>
>> but I think is related to the hardware itself
rather than technology,
>> we
>>>>>> are using intel SR2625UR and Intel
320 SSD, we were evaluating
>> gluster as
>>>>>> well, but we decided to
move away from that path since
>> gluster nfs is
>>>>> still
>>>>>>
performing poorly, plus we would like to
>> see cloudstack integrating
the
>>>>>> gluster-fuse module, we haven't
>> decided the final storage
setup but at
>>>>> the
>>>>>> moment we had
>> better results with
ZFS.
>>>>>>
>>>>>>
>>>>>
>>>>> question is whos ZFS and
>> have you
"tweaked" the zfs / nfs config for
>>>>>
performance
>>>>>
>>>>>
>>>
>>>>>> On Tue, Oct 23, 2012 at 10:44 AM, Nik
Martin >> >wrote:
>>>>>
>>>
>>>>>>> On 10/22/2012 05:49 PM, Trevor
Francis wrote:
>>>>>>>
>>>>>>>>
>> ZFS looks really interesting to me
and I am leaning that way. I am
>>>>>
>>>>> considering using FreeNAS,
as people seem to be having good luck
>> with
>>>>>>>> it. Can anyone
weigh in here?
>>>>>>>>
>>>>>>>>
>>>>>>> My
>> personal opinion, I think
FreeNAS and OpenFiler have horrible,
>>>>>
>> horrible
>>>>>>> User
Interfaces - not very intuitive, and they both seem
>> to be file
>>>>>
servers
>>>>>>> with things like iSCSI targets tacked on
>> as an
afterthought.
>>>>>>>
>>>>>>> Nik
>>>>>>>
>>>>>>>
>>>>>>>> Trevor
>>
Francis
>>>>>>>> Partner
>>>>>>>> 46 Labs | The PeerEdge
Cloud
>>>>>>>>
>> http://www.46labs.com |
>>>>>
http://www.peeredge.net
>>>>>>>>
>>>>>>>>
>> 405-362-0046 - Voice |
405-410-4980 - Cell
>>>>>>>> trevorgfrancis -
>> Skype
>>>>>>>>
trevor@46labs.com
>>>>>>>> Solutions Provider for the
>> Telecom
Industry
>>>>>>>>
>>>>>>>>> <
>>>>>>>>
>>
http://www.twitter.**com/peeredge ><
>>>>>>>>
>>
http://www.**twitter.com/peeredge ><
>>>>>>>>
>>
http://**www.facebook.com/PeerEdge >
>>>>>>>>
>>>>>>>> On Oct 22, 2012,
at
>> 2:30 PM, Jason Davis wrote:
>>>>>>>>
>>>>>>>> ZFS would be an
interesting
>> setup as you can do the cache pools like
>>>>>
you
>>>>>>>>> would do in
>> CacheCade. The problem with ZFS or
CacheCade+DRBD is that
>>>>>>>>>
>> they
>>>>>>>>> really don't scale
out well if you are looking for
>> something with a
>>>>>>>>>
unified
>>>>>>>>> name space. I'll say however
>> that ZFS is a battle
hardened FS with
>>>>> tons
>>>>>>>>> of
>>>>>>>>>
>> shops using it. A
lot of the whiz-bang SSD+SATA disk SAN things
>> these
>>>>>>>>> smaller
start up companies are hocking are just ZFS
>>
appliances.
>>>>>>>>>
>>>>>>>>> RBD looks interesting but I'm not sure
if
>> I would be willing to put
>>>>>>>>> production data on it, I'm not
sure
>> how performant it is IRL. From a
>>>>>>>>> purely technical
perspective,
>> it looks REALLY cool.
>>>>>>>>>
>>>>>>>>> I suppose
anything is fast if
>> you put SSDs in it :) GlusterFS is
>>>>>
another
>>>>>>>>> option although
>> historically small/random IO has
not been it's strong
>>>>>>>>>
>> point.
>>>>>>>>>
>>>>>>>>> If you are
ok spending money on software and
>> want a scale out block
>>>>>>>>>
storage
>>>>>>>>> then you might want to
>> consider HP LeftHand's VSA
product. I am
>>>>>>>>> personally
>>>>>>>>>
>> partial to NFS plays:) I
went the exact opposite approach and
>> settled
>>>>> on
>>>>>>>>>
Isilon for our primary storage for our CS
>>
deployment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon,
Oct 22,
>> 2012 at 10:24 AM, Nik Martin >> >>>>
>>wrote:
>>>>>>>>>
>>>>>>>>> On
>> 10/22/2012 10:16 AM, Trevor Francis
wrote:
>>>>>>>>>>
>>>>>>>>>> We are
>> looking at building a Primary
Storage solution for an
>>>>>>>>>>>
>> enterprise/carrier class
application. However, we want to build it
>>>>>
>>>>>>>>
using
>>>>>>>>>>> a FOSS solution and not a commercial solution.
>> Do
you have a
>>>>>>>>>>> recommendation on
platform?
>>>>>>>>>>>
>>>>>
>>>>>>>>
>>>>>>>>>>>
Trevor,
>>>>>>>>>>
>>>>>>>>>> I got EXCELLENT results
>> builing a SAN
from FOSS using:
>>>>>>>>>> OS: Centos
>>>>>>>>>> Hardware:
>> 2X
storage servers, with 12x2TB 3.5 SATA drives. LSI
>>>>>
MegaRAID
>>>>>
>>>>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs
configured to do
>> SSD
>>>>>>>>>> caching
>>>>>>>>>> (alternately, look
at FlashCache from
>> Facebook)
>>>>>>>>>> intel 10GB dual port nics,
one port for crossover,
>> on port for up
>>>>> link
>>>>>>>>>>
to
>>>>>>>>>> storage network
>>>>>
>>>>>>>
>>>>>>>>>> DRBD for real
time block replication to
>> active-active
>>>>>>>>>> Pacemaker+corosync
for HA Resource
>> management
>>>>>>>>>> tgtd for iSCSI
target
>>>>>>>>>>
>>>>>>>>>> If you
>> want file backed storage, XFS is
a very good filesystem on
>>>>> Linux
>>>>>
>>>>>>>
now.
>>>>>>>>>>
>>>>>>>>>> Pacemaker+Corosync can be difficult to
>>
grok at the beginning, but
>>>>> that
>>>>>>>>>> setup gave me a VERY
high
>> performance SAN. The downside is it is
>>>>>>>>>>
entirely
>>>>>>>>>>
>> managed by CLI, no UI
whatsoever.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Trevor
>>
Francis
>>>>>>>>>>> Partner
>>>>>>>>>>> 46 Labs | The PeerEdge
Cloud
>>>>>
>>>>>>>> http://www.46labs.com |
>>>>>>>>>>>
http://www.peeredge.net
>>>>>
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
405-362-0046 - Voice | 405-410-4980 -
>> Cell
>>>>>>>>>>> trevorgfrancis
- Skype
>>>>>>>>>>> trevor@46labs.com >>
>>
trevor@46labs.com
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>
Solutions Provider for the Telecom Industry
>>>>>>>>>>>
>>>>>>>>>>>>>
>>
http://www.twitter.com/**peeredge><
>>>>>>>>>>>
>>
http://www.twitter.com/**peeredge >> >>><
>>>>>>>>>>>
>>
http://www.twitter.**com/**peeredge <
>>>>>
>>
http://www.twitter.com/**peeredge
>>>>>>>>>>>>> <
>>>>>>>>>>>
>>
http://www.**twitter.com/**peeredge <
>>>>>>>>>>>
>>
http://www.twitter.com/**peeredge >> >>><
>>>>>>>>>>>
>>
http://**www.facebook.com/**PeerEdge<
>>>>>
>>
http://www.facebook.com/PeerEdge><
>>>>>>>>>>>
>>
http://www.facebook.com/**PeerEdge <
>>>>>
>>
http://www.facebook.com/PeerEdge>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>



Re: Primary Storage

Posted by Caleb Call <ca...@me.com>.
If I'm using fiber (which is what we do), I'm going directly to the node, not through another device that then shares it out over much slower iSCSI or NFS.  I realize using fiber is more expensive for most, but for us it's the cheaper option because our fiber infrastructure is already built out; our iSCSI, not so much.


On Oct 22, 2012, at 10:04 PM, bruce.m@v365.com.au wrote:

> 
> 
> Id suggest every one have a look at www.osnexus.com
> supports fiber,
> 10gb, infiniband using the SCST iSCSI code from
> http://scst.sourceforge.net/
> has NFS and all the good stuff including a
> pretty good GUI, replication (lumbering is not there yet) runs on
> Ubuntu
> 
> On 23.10.2012 11:40, Andreas Huser wrote:
>> Hi,
>> 
>> for
> Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband
>> (RDMA).
> That gives the best performance and most scalable Storage.
>> I have
> tasted some different solutions for primary Storage but the
>> most are
> to expensive and for a CloudStack Cluster not economic or
>> have a poor
> performance.
>> 
>> My Configuration:
>> Storage Node: 
>> Supermicro Server
> (Intel Hardware) with Solaris 11 with SSD write and
>> read cache (read
> crucial-m4, write ZeusIOPS) GlusterFS and dualport
>> ConnectX 40Gbit/s
> Infiniband adapter.
>> 
>> I have installed GlusterFS direct on Solaris
> with a modified code.
>> Want you build bigger systems for more then 50
> VMs it is better you
>> split the Solaris and GlusterFS with a separte
> headnode for GlusterFS
>> 
>> That looks like:
>> Solaris ZFS
> Backendstorage with a dataset Volume (Thin Provision) -->
>> ( SRP Target
> attached direct without Infiniband switch to GF Node) 
>> --> GlusterFS
> Node the srp target formatted with xfs filesystem,
>> create a GlusterFS
> Volume --> ( Infiniband over a Mellanox Port
>> Switch) --> Cloudstack
> Node mount glusterFS Volume over RDMA
>> 
>> For the Dataset Volume at the
> ZFS Storage, disable atime and enable
>> compression.
>> (Space reclaim)
> With compression you can shrink the ZFS Volume with
>> command at Linux
> dd /dev/zero or In a Windows VM with sdelete
>> That gives you space left
> on the Primary Storage for deleted Files in
>> a VM or for deleted vhd's
> or vm's in the cloudstack
>> 
>> greeting Andreas
>> 
>> 
>> 
>> 
>> Mit
> freundlichen Grüßen
>> 
>> Andreas Huser
>> Geschäftsführer
>> System
> Engineer / Consultant
>> (Cisco CSE, SMBAM, LCSE, ASAM)
>> 
> ---------------------------------------
>> Zellerstraße 28 - 77654
> Offenburg
>> Tel: +49(781) 12786898
>> Mobil: +49(176) 10308549
>> 
> ahuser@7five-edv.de
>> 
>> 
>> 
>> 
>> ----- Ursprüngliche Mail ----- 
>> 
>> 
> Von: "Outback Dingo" 
>> An: cloudstack-users@incubator.apache.org 
>> 
> Gesendet: Dienstag, 23. Oktober 2012 02:15:16 
>> Betreff: Re: Primary
> Storage 
>> 
>> On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez wrote: 
>>> 
> Solaris 11 ZFS and yes we tried different setups, raids levels number
> of
>>> SSD cache, ARC zfs options etc etc etc.
>>> 
>>> Cheers
>>> 
>> 
>> 
> VMWare ?? 
>> 
>>> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
> wrote:
>>> 
>>>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez wrote:
>>>> 
>> We are using ZFS, with jbod, not in production yet exporting NFS
> to
>>>>> cloudstack, I'm not really happy about the performance
>>>>> 
> but I think is related to the hardware itself rather than technology,
> we
>>>>> are using intel SR2625UR and Intel 320 SSD, we were evaluating
> gluster as
>>>>> well, but we decided to move away from that path since
> gluster nfs is
>>>> still
>>>>> performing poorly, plus we would like to
> see cloudstack integrating the
>>>>> gluster-fuse module, we haven't
> decided the final storage setup but at
>>>> the
>>>>> moment we had
> better results with ZFS.
>>>>> 
>>>>> 
>>>> 
>>>> question is whos ZFS and
> have you "tweaked" the zfs / nfs config for
>>>> performance
>>>> 
>>>> 
>> 
>>>>> On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin >> >wrote:
>>>> 
>> 
>>>>>> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>>>>>> 
>>>>>>> 
> ZFS looks really interesting to me and I am leaning that way. I am
>>>> 
>>>> considering using FreeNAS, as people seem to be having good luck
> with
>>>>>>> it. Can anyone weigh in here?
>>>>>>> 
>>>>>>> 
>>>>>> My
> personal opinion, I think FreeNAS and OpenFiler have horrible,
>>>> 
> horrible
>>>>>> User Interfaces - not very intuitive, and they both seem
> to be file
>>>> servers
>>>>>> with things like iSCSI targets tacked on
> as an afterthought.
>>>>>> 
>>>>>> Nik
>>>>>> 
>>>>>> 
>>>>>>> Trevor
> Francis
>>>>>>> Partner
>>>>>>> 46 Labs | The PeerEdge Cloud
>>>>>>> 
> http://www.46labs.com |
>>>> http://www.peeredge.net
>>>>>>> 
>>>>>>> 
> 405-362-0046 - Voice | 405-410-4980 - Cell
>>>>>>> trevorgfrancis -
> Skype
>>>>>>> trevor@46labs.com 
>>>>>>> Solutions Provider for the
> Telecom Industry
>>>>>>> 
>>>>>>>> <
>>>>>>> 
> http://www.twitter.**com/peeredge ><
>>>>>>> 
> http://www.**twitter.com/peeredge ><
>>>>>>> 
> http://**www.facebook.com/PeerEdge >
>>>>>>> 
>>>>>>> On Oct 22, 2012, at
> 2:30 PM, Jason Davis wrote:
>>>>>>> 
>>>>>>> ZFS would be an interesting
> setup as you can do the cache pools like
>>>> you
>>>>>>>> would do in
> CacheCade. The problem with ZFS or CacheCade+DRBD is that
>>>>>>>> 
> they
>>>>>>>> really don't scale out well if you are looking for
> something with a
>>>>>>>> unified
>>>>>>>> name space. I'll say however
> that ZFS is a battle hardened FS with
>>>> tons
>>>>>>>> of
>>>>>>>> 
> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things
> these
>>>>>>>> smaller start up companies are hocking are just ZFS
> appliances.
>>>>>>>> 
>>>>>>>> RBD looks interesting but I'm not sure if
> I would be willing to put
>>>>>>>> production data on it, I'm not sure
> how performant it is IRL. From a
>>>>>>>> purely technical perspective,
> it looks REALLY cool.
>>>>>>>> 
>>>>>>>> I suppose anything is fast if
> you put SSDs in it :) GlusterFS is
>>>> another
>>>>>>>> option although
> historically small/random IO has not been it's strong
>>>>>>>> 
> point.
>>>>>>>> 
>>>>>>>> If you are ok spending money on software and
> want a scale out block
>>>>>>>> storage
>>>>>>>> then you might want to
> consider HP LeftHand's VSA product. I am
>>>>>>>> personally
>>>>>>>> 
> partial to NFS plays:) I went the exact opposite approach and
> settled
>>>> on
>>>>>>>> Isilon for our primary storage for our CS
> deployment.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Oct 22,
> 2012 at 10:24 AM, Nik Martin >> >>>> >>wrote:
>>>>>>>> 
>>>>>>>> On
> 10/22/2012 10:16 AM, Trevor Francis wrote:
>>>>>>>>> 
>>>>>>>>> We are
> looking at building a Primary Storage solution for an
>>>>>>>>>> 
> enterprise/carrier class application. However, we want to build it
>>>> 
>>>>>>> using
>>>>>>>>>> a FOSS solution and not a commercial solution.
> Do you have a
>>>>>>>>>> recommendation on platform?
>>>>>>>>>> 
>>>> 
>>>>>>> 
>>>>>>>>>> Trevor,
>>>>>>>>> 
>>>>>>>>> I got EXCELLENT results
> builing a SAN from FOSS using:
>>>>>>>>> OS: Centos
>>>>>>>>> Hardware:
> 2X storage servers, with 12x2TB 3.5 SATA drives. LSI
>>>> MegaRAID
>>>> 
>>>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do
> SSD
>>>>>>>>> caching
>>>>>>>>> (alternately, look at FlashCache from
> Facebook)
>>>>>>>>> intel 10GB dual port nics, one port for crossover,
> on port for up
>>>> link
>>>>>>>>> to
>>>>>>>>> storage network
>>>> 
>>>>>> 
>>>>>>>>> DRBD for real time block replication to
> active-active
>>>>>>>>> Pacemaker+corosync for HA Resource
> management
>>>>>>>>> tgtd for iSCSI target
>>>>>>>>> 
>>>>>>>>> If you
> want file backed storage, XFS is a very good filesystem on
>>>> Linux
>>>> 
>>>>>> now.
>>>>>>>>> 
>>>>>>>>> Pacemaker+Corosync can be difficult to
> grok at the beginning, but
>>>> that
>>>>>>>>> setup gave me a VERY high
> performance SAN. The downside is it is
>>>>>>>>> entirely
>>>>>>>>> 
> managed by CLI, no UI whatsoever.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Trevor
> Francis
>>>>>>>>>> Partner
>>>>>>>>>> 46 Labs | The PeerEdge Cloud
>>>> 
>>>>>>> http://www.46labs.com |
>>>>>>>>>> http://www.peeredge.net
>>>> 
>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 405-362-0046 - Voice | 405-410-4980 -
> Cell
>>>>>>>>>> trevorgfrancis - Skype
>>>>>>>>>> trevor@46labs.com >>
> trevor@46labs.com
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
> Solutions Provider for the Telecom Industry
>>>>>>>>>> 
>>>>>>>>>>>> 
> http://www.twitter.com/**peeredge><
>>>>>>>>>> 
> http://www.twitter.com/**peeredge >> >>><
>>>>>>>>>> 
> http://www.twitter.**com/**peeredge <
>>>> 
> http://www.twitter.com/**peeredge
>>>>>>>>>>>> <
>>>>>>>>>> 
> http://www.**twitter.com/**peeredge <
>>>>>>>>>> 
> http://www.twitter.com/**peeredge >> >>><
>>>>>>>>>> 
> http://**www.facebook.com/**PeerEdge<
>>>> 
> http://www.facebook.com/PeerEdge><
>>>>>>>>>> 
> http://www.facebook.com/**PeerEdge <
>>>> 
> http://www.facebook.com/PeerEdge>
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>> 
>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>> 


Re: Primary Storage

Posted by br...@v365.com.au.
  

I'd suggest everyone have a look at www.osnexus.com
It supports fiber,
10Gb and Infiniband using the SCST iSCSI code from
http://scst.sourceforge.net/
It has NFS and all the good stuff including a
pretty good GUI and replication (lumbering is not there yet), and runs on
Ubuntu

On 23.10.2012 11:40, Andreas Huser wrote:
> Hi,
>
> for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA).
> That gives the best performance and most scalable Storage.
> I have tested some different solutions for primary Storage, but most are
> too expensive and not economic for a CloudStack Cluster, or have poor
> performance.
>
> My Configuration:
> Storage Node:
> Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and
> read cache (read crucial-m4, write ZeusIOPS), GlusterFS and dualport
> ConnectX 40Gbit/s Infiniband adapter.
>
> I have installed GlusterFS direct on Solaris with a modified code.
> If you want to build bigger systems for more than 50 VMs, it is better to
> split the Solaris and GlusterFS parts, with a separate headnode for GlusterFS.
>
> That looks like:
> Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) -->
> ( SRP Target attached direct without Infiniband switch to GF Node) -->
> GlusterFS Node, the srp target formatted with xfs filesystem,
> create a GlusterFS Volume --> ( Infiniband over a Mellanox Port
> Switch) --> Cloudstack Node mount glusterFS Volume over RDMA
>
> For the Dataset Volume at the ZFS Storage, disable atime and enable
> compression.
> (Space reclaim) With compression you can shrink the ZFS Volume with the
> dd /dev/zero command on Linux, or with sdelete in a Windows VM.
> That gives you space back on the Primary Storage for files deleted inside
> a VM, or for vhd's or VMs deleted in the cloudstack.
>
> greeting Andreas
>
> Kind regards
>
> Andreas Huser
> Managing Director
> System Engineer / Consultant
> (Cisco CSE, SMBAM, LCSE, ASAM)
> ---------------------------------------
> Zellerstraße 28 - 77654 Offenburg
> Tel: +49(781) 12786898
> Mobil: +49(176) 10308549
> ahuser@7five-edv.de
>
> ----- Original Message -----
>
> From: "Outback Dingo"
> To: cloudstack-users@incubator.apache.org
> Sent: Tuesday, 23 October 2012 02:15:16
> Subject: Re: Primary Storage
>

Re: Primary Storage

Posted by Andreas Huser <ah...@7five-edv.de>.
Hi,

for Cloudstack i use Solaris 11 ZFS + GlusterFS over Infiniband (RDMA). That gives the best performance and most scalable Storage. 
I have tested some different solutions for primary Storage, but most are too expensive and not economic for a CloudStack Cluster, or have poor performance.

My Configuration:
Storage Node: 
Supermicro Server (Intel Hardware) with Solaris 11 with SSD write and read cache (read crucial-m4, write ZeusIOPS) GlusterFS and dualport ConnectX 40Gbit/s Infiniband adapter.

I have installed GlusterFS direct on Solaris with a modified code.
If you want to build bigger systems for more than 50 VMs, it is better to split the Solaris and GlusterFS parts, with a separate headnode for GlusterFS.

That looks like:
Solaris ZFS Backendstorage with a dataset Volume (Thin Provision) -->  ( SRP Target  attached direct without Infiniband switch to GF Node)  --> GlusterFS Node the srp target formatted with xfs filesystem, create a GlusterFS Volume --> ( Infiniband over a Mellanox Port Switch)  --> Cloudstack Node mount glusterFS Volume over RDMA
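Roughly, that chain could look like the sketch below on the GlusterFS node and a CloudStack/KVM node. Hostnames, device names, brick paths and mount points are placeholders, not taken from this thread, and the exact RDMA mount option differs between GlusterFS releases:

  # on the GlusterFS node: the SRP-attached ZFS zvol appears as a plain block device
  mkfs.xfs /dev/sdb                        # assumed device name for the SRP LUN
  mkdir -p /bricks/primary
  mount /dev/sdb /bricks/primary

  # create and start a single-brick volume that offers the RDMA transport
  gluster volume create cs-primary transport rdma gf-node1:/bricks/primary
  gluster volume start cs-primary

  # on each CloudStack/KVM node: mount the volume over RDMA
  mount -t glusterfs -o transport=rdma gf-node1:/cs-primary /mnt/cs-primary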

For the Dataset Volume at the ZFS Storage, disable atime and enable compression.
(Space reclaim) With compression you can shrink the ZFS Volume with the dd /dev/zero command on Linux, or with sdelete in a Windows VM. 
That gives you space back on the Primary Storage for files deleted inside a VM, or for vhd's or VMs deleted in the cloudstack.
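As a rough sketch (pool and dataset names are placeholders, not from this thread), those settings and the reclaim trick look like this:

  # on the ZFS storage node
  zfs create -o compression=on -o atime=off tank/cs-primary
  # thin-provisioned zvol variant; note atime only applies to filesystem datasets
  zfs create -s -V 4T -o compression=on tank/cs-primary-vol

  # inside a Linux guest: fill free space with zeros, then delete the file
  dd if=/dev/zero of=/zerofill bs=1M; sync; rm -f /zerofill
  # inside a Windows guest the equivalent is: sdelete -z C: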

greeting Andreas


 

Kind regards

Andreas Huser
Managing Director
System Engineer / Consultant
(Cisco CSE, SMBAM, LCSE, ASAM)
---------------------------------------
Zellerstraße 28 - 77654 Offenburg
Tel:     +49(781) 12786898
Mobil: +49(176) 10308549
ahuser@7five-edv.de




----- Original Message ----- 

From: "Outback Dingo" <ou...@gmail.com> 
To: cloudstack-users@incubator.apache.org 
Sent: Tuesday, 23 October 2012 02:15:16 
Subject: Re: Primary Storage 

On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote: 
> Solaris 11 ZFS and yes we tried different setups, raids levels number of 
> SSD cache, ARC zfs options etc etc etc. 
> 
> Cheers 
> 

VMWare ?? 

> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo <ou...@gmail.com>wrote: 
> 
>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> wrote: 
>> > We are using ZFS, with jbod, not in production yet exporting NFS to 
>> > cloudstack, I'm not really happy about the performance 
>> > but I think is related to the hardware itself rather than technology, we 
>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster as 
>> > well, but we decided to move away from that path since gluster nfs is 
>> still 
>> > performing poorly, plus we would like to see cloudstack integrating the 
>> > gluster-fuse module, we haven't decided the final storage setup but at 
>> the 
>> > moment we had better results with ZFS. 
>> > 
>> > 
>> 
>> question is whos ZFS and have you "tweaked" the zfs / nfs config for 
>> performance 
>> 
>> > 
>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com 
>> >wrote: 
>> > 
>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote: 
>> >> 
>> >>> ZFS looks really interesting to me and I am leaning that way. I am 
>> >>> considering using FreeNAS, as people seem to be having good luck with 
>> >>> it. Can anyone weigh in here? 
>> >>> 
>> >>> 
>> >> My personal opinion, I think FreeNAS and OpenFiler have horrible, 
>> horrible 
>> >> User Interfaces - not very intuitive, and they both seem to be file 
>> servers 
>> >> with things like iSCSI targets tacked on as an afterthought. 
>> >> 
>> >> Nik 
>> >> 
>> >> 
>> >>> Trevor Francis 
>> >>> Partner 
>> >>> 46 Labs | The PeerEdge Cloud 
>> >>> http://www.46labs.com <http://www.46labs.com/> | 
>> http://www.peeredge.net 
>> >>> <http://www.peeredge.net/> 
>> >>> 405-362-0046 - Voice | 405-410-4980 - Cell 
>> >>> trevorgfrancis - Skype 
>> >>> trevor@46labs.com <ma...@46labs.com> 
>> >>> Solutions Provider for the Telecom Industry 
>> >>> 
>> >>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>>< 
>> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>>< 
>> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>>< 
>> >>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>> 
>> >>> 
>> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote: 
>> >>> 
>> >>> ZFS would be an interesting setup as you can do the cache pools like 
>> you 
>> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that 
>> >>>> they 
>> >>>> really don't scale out well if you are looking for something with a 
>> >>>> unified 
>> >>>> name space. I'll say however that ZFS is a battle hardened FS with 
>> tons 
>> >>>> of 
>> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these 
>> >>>> smaller start up companies are hocking are just ZFS appliances. 
>> >>>> 
>> >>>> RBD looks interesting but I'm not sure if I would be willing to put 
>> >>>> production data on it, I'm not sure how performant it is IRL. From a 
>> >>>> purely technical perspective, it looks REALLY cool. 
>> >>>> 
>> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is 
>> another 
>> >>>> option although historically small/random IO has not been it's strong 
>> >>>> point. 
>> >>>> 
>> >>>> If you are ok spending money on software and want a scale out block 
>> >>>> storage 
>> >>>> then you might want to consider HP LeftHand's VSA product. I am 
>> >>>> personally 
>> >>>> partial to NFS plays:) I went the exact opposite approach and settled 
>> on 
>> >>>> Isilon for our primary storage for our CS deployment. 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <nik.martin@nfinausa.com 
>> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote: 
>> >>>> 
>> >>>> On 10/22/2012 10:16 AM, Trevor Francis wrote: 
>> >>>>> 
>> >>>>> We are looking at building a Primary Storage solution for an 
>> >>>>>> enterprise/carrier class application. However, we want to build it 
>> >>>>>> using 
>> >>>>>> a FOSS solution and not a commercial solution. Do you have a 
>> >>>>>> recommendation on platform? 
>> >>>>>> 
>> >>>>>> 
>> >>>>>> Trevor, 
>> >>>>> 
>> >>>>> I got EXCELLENT results builing a SAN from FOSS using: 
>> >>>>> OS: Centos 
>> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives. LSI 
>> MegaRAID 
>> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD 
>> >>>>> caching 
>> >>>>> (alternately, look at FlashCache from Facebook) 
>> >>>>> intel 10GB dual port nics, one port for crossover, on port for up 
>> link 
>> >>>>> to 
>> >>>>> storage network 
>> >>>>> 
>> >>>>> DRBD for real time block replication to active-active 
>> >>>>> Pacemaker+corosync for HA Resource management 
>> >>>>> tgtd for iSCSI target 
>> >>>>> 
>> >>>>> If you want file backed storage, XFS is a very good filesystem on 
>> Linux 
>> >>>>> now. 
>> >>>>> 
>> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but 
>> that 
>> >>>>> setup gave me a VERY high performance SAN. The downside is it is 
>> >>>>> entirely 
>> >>>>> managed by CLI, no UI whatsoever. 
>> >>>>> 
>> >>>>> 
>> >>>>> Trevor Francis 
>> >>>>>> Partner 
>> >>>>>> 46 Labs | The PeerEdge Cloud 
>> >>>>>> http://www.46labs.com <http://www.46labs.com/> | 
>> >>>>>> http://www.peeredge.net 
>> >>>>>> <http://www.peeredge.net/> 
>> >>>>>> 
>> >>>>>> 405-362-0046 - Voice | 405-410-4980 - Cell 
>> >>>>>> trevorgfrancis - Skype 
>> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto: 
>> trevor@46labs.com 
>> >>>>>> > 
>> >>>>>> 
>> >>>>>> 
>> >>>>>> Solutions Provider for the Telecom Industry 
>> >>>>>> 
>> >>>>>> <http://www.twitter.com/****peeredge< 
>> http://www.twitter.com/**peeredge>< 
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge 
>> >>>< 
>> >>>>>> http://www.twitter.**com/**peeredge < 
>> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge> 
>> >>>>>> >>< 
>> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> < 
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge 
>> >>>< 
>> >>>>>> http://**www.facebook.com/**PeerEdge< 
>> http://www.facebook.com/PeerEdge>< 
>> >>>>>> http://www.facebook.com/**PeerEdge < 
>> http://www.facebook.com/PeerEdge> 
>> >>>>>> >> 
>> >>>>>> 
>> >>>>>> 
>> >>>>>> 
>> >>>>> 
>> 

Re: Primary Storage

Posted by Outback Dingo <ou...@gmail.com>.
On Mon, Oct 22, 2012 at 8:09 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
> Solaris 11 ZFS and yes we tried different setups, raids levels number of
> SSD cache, ARC zfs options etc etc etc.
>
> Cheers
>

VMWare ??

> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo <ou...@gmail.com>wrote:
>
>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
>> > We are using ZFS, with jbod, not in production yet exporting NFS to
>> > cloudstack, I'm not really happy about the performance
>> > but I think is related to the hardware itself rather than technology, we
>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster as
>> > well, but we decided to move away from that path since gluster nfs is
>> still
>> > performing poorly,  plus we would like to see cloudstack integrating the
>> > gluster-fuse module, we haven't decided the final storage setup but at
>> the
>> > moment we had better results with ZFS.
>> >
>> >
>>
>> question is whos ZFS and have you "tweaked" the zfs / nfs config for
>> performance
>>
>> >
>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com
>> >wrote:
>> >
>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>> >>
>> >>> ZFS looks really interesting to me and I am leaning that way. I am
>> >>> considering using FreeNAS, as people seem to be having good luck with
>> >>> it. Can anyone weigh in here?
>> >>>
>> >>>
>> >> My personal opinion, I think FreeNAS and OpenFiler have horrible,
>> horrible
>> >> User Interfaces - not very intuitive, and they both seem to be file
>> servers
>> >> with things like iSCSI targets tacked on as an afterthought.
>> >>
>> >> Nik
>> >>
>> >>
>> >>> Trevor Francis
>> >>> Partner
>> >>> 46 Labs | The PeerEdge Cloud
>> >>> http://www.46labs.com <http://www.46labs.com/> |
>> http://www.peeredge.net
>> >>> <http://www.peeredge.net/>
>> >>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> >>> trevorgfrancis - Skype
>> >>> trevor@46labs.com <ma...@46labs.com>
>> >>> Solutions Provider for the Telecom Industry
>> >>>
>> >>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
>> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>> >>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
>> >>>
>> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>> >>>
>> >>>  ZFS would be an interesting setup as you can do the cache pools like
>> you
>> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that
>> >>>> they
>> >>>> really don't scale out well if you are looking for something with a
>> >>>> unified
>> >>>> name space. I'll say however that ZFS is a battle hardened FS with
>> tons
>> >>>> of
>> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
>> >>>> smaller start up companies are hocking are just ZFS appliances.
>> >>>>
>> >>>> RBD looks interesting but I'm not sure if I would be willing to put
>> >>>> production data on it, I'm not sure how performant it is IRL. From a
>> >>>> purely technical perspective, it looks REALLY cool.
>> >>>>
>> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is
>> another
>> >>>> option although historically small/random IO has not been it's strong
>> >>>> point.
>> >>>>
>> >>>> If you are ok spending money on software and want a scale out block
>> >>>> storage
>> >>>> then you might want to consider HP LeftHand's VSA product. I am
>> >>>> personally
>> >>>> partial to NFS plays:) I went the exact opposite approach and settled
>> on
>> >>>> Isilon for our primary storage for our CS deployment.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <nik.martin@nfinausa.com
>> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
>> >>>>
>> >>>>  On 10/22/2012 10:16 AM, Trevor Francis wrote:
>> >>>>>
>> >>>>>  We are looking at building a Primary Storage solution for an
>> >>>>>> enterprise/carrier class application. However, we want to build it
>> >>>>>> using
>> >>>>>> a FOSS solution and not a commercial solution. Do you have a
>> >>>>>> recommendation on platform?
>> >>>>>>
>> >>>>>>
>> >>>>>>  Trevor,
>> >>>>>
>> >>>>> I got EXCELLENT results builing a SAN from FOSS using:
>> >>>>> OS: Centos
>> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI
>> MegaRAID
>> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>> >>>>> caching
>> >>>>> (alternately, look at FlashCache from Facebook)
>> >>>>> intel 10GB dual port nics, one port for crossover, on port for up
>> link
>> >>>>> to
>> >>>>> storage network
>> >>>>>
>> >>>>> DRBD for real time block replication to active-active
>> >>>>> Pacemaker+corosync for HA Resource management
>> >>>>> tgtd for iSCSI target
>> >>>>>
>> >>>>> If you want file backed storage, XFS is a very good filesystem on
>> Linux
>> >>>>> now.
>> >>>>>
>> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but
>> that
>> >>>>> setup gave me a VERY high performance SAN.  The downside is it is
>> >>>>> entirely
>> >>>>> managed by CLI, no UI whatsoever.
>> >>>>>
>> >>>>>
>> >>>>>  Trevor Francis
>> >>>>>> Partner
>> >>>>>> 46 Labs | The PeerEdge Cloud
>> >>>>>> http://www.46labs.com <http://www.46labs.com/> |
>> >>>>>> http://www.peeredge.net
>> >>>>>> <http://www.peeredge.net/>
>> >>>>>>
>> >>>>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> >>>>>> trevorgfrancis - Skype
>> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
>> trevor@46labs.com
>> >>>>>> >
>> >>>>>>
>> >>>>>>
>> >>>>>> Solutions Provider for the Telecom Industry
>> >>>>>>
>> >>>>>> <http://www.twitter.com/****peeredge<
>> http://www.twitter.com/**peeredge><
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://www.twitter.**com/**peeredge <
>> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
>> >>>>>> >><
>> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> <
>> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://**www.facebook.com/**PeerEdge<
>> http://www.facebook.com/PeerEdge><
>> >>>>>> http://www.facebook.com/**PeerEdge <
>> http://www.facebook.com/PeerEdge>
>> >>>>>> >>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>>

Re: Primary Storage

Posted by br...@v365.com.au.
Also forgot to mention that http://www.osnexus.com/ has a very mature API 
which could easily be integrated into CloudStack.


On 23.10.2012 08:09, Ivan Rodriguez wrote:
> Solaris 11 ZFS and yes we tried different setups, raids levels number 
> of
> SSD cache, ARC zfs options etc etc etc.
>
> Cheers
>
> On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo
> <ou...@gmail.com>wrote:
>
>> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> 
>> wrote:
>> > We are using ZFS, with jbod, not in production yet exporting NFS 
>> to
>> > cloudstack, I'm not really happy about the performance
>> > but I think is related to the hardware itself rather than 
>> technology, we
>> > are using intel SR2625UR and Intel 320 SSD, we were evaluating 
>> gluster as
>> > well, but we decided to move away from that path since gluster nfs 
>> is
>> still
>> > performing poorly,  plus we would like to see cloudstack 
>> integrating the
>> > gluster-fuse module, we haven't decided the final storage setup 
>> but at
>> the
>> > moment we had better results with ZFS.
>> >
>> >
>>
>> question is whos ZFS and have you "tweaked" the zfs / nfs config for
>> performance
>>
>> >
>> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin 
>> <nik.martin@nfinausa.com
>> >wrote:
>> >
>> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>> >>
>> >>> ZFS looks really interesting to me and I am leaning that way. I 
>> am
>> >>> considering using FreeNAS, as people seem to be having good luck 
>> with
>> >>> it. Can anyone weigh in here?
>> >>>
>> >>>
>> >> My personal opinion, I think FreeNAS and OpenFiler have horrible,
>> horrible
>> >> User Interfaces - not very intuitive, and they both seem to be 
>> file
>> servers
>> >> with things like iSCSI targets tacked on as an afterthought.
>> >>
>> >> Nik
>> >>
>> >>
>> >>> Trevor Francis
>> >>> Partner
>> >>> 46 Labs | The PeerEdge Cloud
>> >>> http://www.46labs.com <http://www.46labs.com/> |
>> http://www.peeredge.net
>> >>> <http://www.peeredge.net/>
>> >>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> >>> trevorgfrancis - Skype
>> >>> trevor@46labs.com <ma...@46labs.com>
>> >>> Solutions Provider for the Telecom Industry
>> >>>
>> >>> <http://www.twitter.com/**peeredge 
>> <http://www.twitter.com/peeredge>><
>> >>> http://www.twitter.**com/peeredge 
>> <http://www.twitter.com/peeredge>><
>> >>> http://www.**twitter.com/peeredge 
>> <http://www.twitter.com/peeredge>><
>> >>> http://**www.facebook.com/PeerEdge 
>> <http://www.facebook.com/PeerEdge>>
>> >>>
>> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>> >>>
>> >>>  ZFS would be an interesting setup as you can do the cache pools 
>> like
>> you
>> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD 
>> is that
>> >>>> they
>> >>>> really don't scale out well if you are looking for something 
>> with a
>> >>>> unified
>> >>>> name space. I'll say however that ZFS is a battle hardened FS 
>> with
>> tons
>> >>>> of
>> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things 
>> these
>> >>>> smaller start up companies are hocking are just ZFS appliances.
>> >>>>
>> >>>> RBD looks interesting but I'm not sure if I would be willing to 
>> put
>> >>>> production data on it, I'm not sure how performant it is IRL. 
>> From a
>> >>>> purely technical perspective, it looks REALLY cool.
>> >>>>
>> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS 
>> is
>> another
>> >>>> option although historically small/random IO has not been it's 
>> strong
>> >>>> point.
>> >>>>
>> >>>> If you are ok spending money on software and want a scale out 
>> block
>> >>>> storage
>> >>>> then you might want to consider HP LeftHand's VSA product. I am
>> >>>> personally
>> >>>> partial to NFS plays:) I went the exact opposite approach and 
>> settled
>> on
>> >>>> Isilon for our primary storage for our CS deployment.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin 
>> <nik.martin@nfinausa.com
>> >>>> <mailto:nik.martin@nfinausa.**com 
>> <ni...@nfinausa.com>>>wrote:
>> >>>>
>> >>>>  On 10/22/2012 10:16 AM, Trevor Francis wrote:
>> >>>>>
>> >>>>>  We are looking at building a Primary Storage solution for an
>> >>>>>> enterprise/carrier class application. However, we want to 
>> build it
>> >>>>>> using
>> >>>>>> a FOSS solution and not a commercial solution. Do you have a
>> >>>>>> recommendation on platform?
>> >>>>>>
>> >>>>>>
>> >>>>>>  Trevor,
>> >>>>>
>> >>>>> I got EXCELLENT results builing a SAN from FOSS using:
>> >>>>> OS: Centos
>> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  
>> LSI
>> MegaRAID
>> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to 
>> do SSD
>> >>>>> caching
>> >>>>> (alternately, look at FlashCache from Facebook)
>> >>>>> intel 10GB dual port nics, one port for crossover, on port for 
>> up
>> link
>> >>>>> to
>> >>>>> storage network
>> >>>>>
>> >>>>> DRBD for real time block replication to active-active
>> >>>>> Pacemaker+corosync for HA Resource management
>> >>>>> tgtd for iSCSI target
>> >>>>>
>> >>>>> If you want file backed storage, XFS is a very good filesystem 
>> on
>> Linux
>> >>>>> now.
>> >>>>>
>> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, 
>> but
>> that
>> >>>>> setup gave me a VERY high performance SAN.  The downside is it 
>> is
>> >>>>> entirely
>> >>>>> managed by CLI, no UI whatsoever.
>> >>>>>
>> >>>>>
>> >>>>>  Trevor Francis
>> >>>>>> Partner
>> >>>>>> 46 Labs | The PeerEdge Cloud
>> >>>>>> http://www.46labs.com <http://www.46labs.com/> |
>> >>>>>> http://www.peeredge.net
>> >>>>>> <http://www.peeredge.net/>
>> >>>>>>
>> >>>>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> >>>>>> trevorgfrancis - Skype
>> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
>> trevor@46labs.com
>> >>>>>> >
>> >>>>>>
>> >>>>>>
>> >>>>>> Solutions Provider for the Telecom Industry
>> >>>>>>
>> >>>>>> <http://www.twitter.com/****peeredge<
>> http://www.twitter.com/**peeredge><
>> >>>>>> http://www.twitter.com/**peeredge 
>> <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://www.twitter.**com/**peeredge <
>> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
>> >>>>>> >><
>> >>>>>> http://www.**twitter.com/**peeredge 
>> <http://twitter.com/peeredge> <
>> >>>>>> http://www.twitter.com/**peeredge 
>> <http://www.twitter.com/peeredge
>> >>><
>> >>>>>> http://**www.facebook.com/**PeerEdge<
>> http://www.facebook.com/PeerEdge><
>> >>>>>> http://www.facebook.com/**PeerEdge <
>> http://www.facebook.com/PeerEdge>
>> >>>>>> >>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>>

Re: Primary Storage

Posted by Ivan Rodriguez <iv...@gmail.com>.
Solaris 11 ZFS, and yes, we tried different setups: RAID levels, number of
SSD cache devices, ARC ZFS options, etc.

Cheers

On Tue, Oct 23, 2012 at 11:05 AM, Outback Dingo <ou...@gmail.com>wrote:

> On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
> > We are using ZFS, with jbod, not in production yet exporting NFS to
> > cloudstack, I'm not really happy about the performance
> > but I think is related to the hardware itself rather than technology, we
> > are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster as
> > well, but we decided to move away from that path since gluster nfs is
> still
> > performing poorly,  plus we would like to see cloudstack integrating the
> > gluster-fuse module, we haven't decided the final storage setup but at
> the
> > moment we had better results with ZFS.
> >
> >
>
> question is whos ZFS and have you "tweaked" the zfs / nfs config for
> performance
>
> >
> > On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <nik.martin@nfinausa.com
> >wrote:
> >
> >> On 10/22/2012 05:49 PM, Trevor Francis wrote:
> >>
> >>> ZFS looks really interesting to me and I am leaning that way. I am
> >>> considering using FreeNAS, as people seem to be having good luck with
> >>> it. Can anyone weigh in here?
> >>>
> >>>
> >> My personal opinion, I think FreeNAS and OpenFiler have horrible,
> horrible
> >> User Interfaces - not very intuitive, and they both seem to be file
> servers
> >> with things like iSCSI targets tacked on as an afterthought.
> >>
> >> Nik
> >>
> >>
> >>> Trevor Francis
> >>> Partner
> >>> 46 Labs | The PeerEdge Cloud
> >>> http://www.46labs.com <http://www.46labs.com/> |
> http://www.peeredge.net
> >>> <http://www.peeredge.net/>
> >>> 405-362-0046 - Voice  | 405-410-4980 - Cell
> >>> trevorgfrancis - Skype
> >>> trevor@46labs.com <ma...@46labs.com>
> >>> Solutions Provider for the Telecom Industry
> >>>
> >>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
> >>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
> >>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
> >>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
> >>>
> >>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
> >>>
> >>>  ZFS would be an interesting setup as you can do the cache pools like
> you
> >>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that
> >>>> they
> >>>> really don't scale out well if you are looking for something with a
> >>>> unified
> >>>> name space. I'll say however that ZFS is a battle hardened FS with
> tons
> >>>> of
> >>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
> >>>> smaller start up companies are hocking are just ZFS appliances.
> >>>>
> >>>> RBD looks interesting but I'm not sure if I would be willing to put
> >>>> production data on it, I'm not sure how performant it is IRL. From a
> >>>> purely technical perspective, it looks REALLY cool.
> >>>>
> >>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is
> another
> >>>> option although historically small/random IO has not been it's strong
> >>>> point.
> >>>>
> >>>> If you are ok spending money on software and want a scale out block
> >>>> storage
> >>>> then you might want to consider HP LeftHand's VSA product. I am
> >>>> personally
> >>>> partial to NFS plays:) I went the exact opposite approach and settled
> on
> >>>> Isilon for our primary storage for our CS deployment.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <nik.martin@nfinausa.com
> >>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
> >>>>
> >>>>  On 10/22/2012 10:16 AM, Trevor Francis wrote:
> >>>>>
> >>>>>  We are looking at building a Primary Storage solution for an
> >>>>>> enterprise/carrier class application. However, we want to build it
> >>>>>> using
> >>>>>> a FOSS solution and not a commercial solution. Do you have a
> >>>>>> recommendation on platform?
> >>>>>>
> >>>>>>
> >>>>>>  Trevor,
> >>>>>
> >>>>> I got EXCELLENT results builing a SAN from FOSS using:
> >>>>> OS: Centos
> >>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI
> MegaRAID
> >>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
> >>>>> caching
> >>>>> (alternately, look at FlashCache from Facebook)
> >>>>> intel 10GB dual port nics, one port for crossover, on port for up
> link
> >>>>> to
> >>>>> storage network
> >>>>>
> >>>>> DRBD for real time block replication to active-active
> >>>>> Pacemaker+corosync for HA Resource management
> >>>>> tgtd for iSCSI target
> >>>>>
> >>>>> If you want file backed storage, XFS is a very good filesystem on
> Linux
> >>>>> now.
> >>>>>
> >>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but
> that
> >>>>> setup gave me a VERY high performance SAN.  The downside is it is
> >>>>> entirely
> >>>>> managed by CLI, no UI whatsoever.
> >>>>>
> >>>>>
> >>>>>  Trevor Francis
> >>>>>> Partner
> >>>>>> 46 Labs | The PeerEdge Cloud
> >>>>>> http://www.46labs.com <http://www.46labs.com/> |
> >>>>>> http://www.peeredge.net
> >>>>>> <http://www.peeredge.net/>
> >>>>>>
> >>>>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
> >>>>>> trevorgfrancis - Skype
> >>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:
> trevor@46labs.com
> >>>>>> >
> >>>>>>
> >>>>>>
> >>>>>> Solutions Provider for the Telecom Industry
> >>>>>>
> >>>>>> <http://www.twitter.com/****peeredge<
> http://www.twitter.com/**peeredge><
> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
> >>><
> >>>>>> http://www.twitter.**com/**peeredge <
> http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
> >>>>>> >><
> >>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> <
> >>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge
> >>><
> >>>>>> http://**www.facebook.com/**PeerEdge<
> http://www.facebook.com/PeerEdge><
> >>>>>> http://www.facebook.com/**PeerEdge <
> http://www.facebook.com/PeerEdge>
> >>>>>> >>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
>

Re: Primary Storage

Posted by Outback Dingo <ou...@gmail.com>.
On Mon, Oct 22, 2012 at 8:03 PM, Ivan Rodriguez <iv...@gmail.com> wrote:
> We are using ZFS, with jbod, not in production yet exporting NFS to
> cloudstack, I'm not really happy about the performance
> but I think is related to the hardware itself rather than technology, we
> are using intel SR2625UR and Intel 320 SSD, we were evaluating gluster as
> well, but we decided to move away from that path since gluster nfs is still
> performing poorly,  plus we would like to see cloudstack integrating the
> gluster-fuse module, we haven't decided the final storage setup but at the
> moment we had better results with ZFS.
>
>

The question is whose ZFS, and have you "tweaked" the ZFS / NFS config for performance?

>
> On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <ni...@nfinausa.com>wrote:
>
>> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>>
>>> ZFS looks really interesting to me and I am leaning that way. I am
>>> considering using FreeNAS, as people seem to be having good luck with
>>> it. Can anyone weigh in here?
>>>
>>>
>> My personal opinion, I think FreeNAS and OpenFiler have horrible, horrible
>> User Interfaces - not very intuitive, and they both seem to be file servers
>> with things like iSCSI targets tacked on as an afterthought.
>>
>> Nik
>>
>>
>>> Trevor Francis
>>> Partner
>>> 46 Labs | The PeerEdge Cloud
>>> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
>>> <http://www.peeredge.net/>
>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>>> trevorgfrancis - Skype
>>> trevor@46labs.com <ma...@46labs.com>
>>> Solutions Provider for the Telecom Industry
>>>
>>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
>>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
>>>
>>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>>>
>>>  ZFS would be an interesting setup as you can do the cache pools like you
>>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that
>>>> they
>>>> really don't scale out well if you are looking for something with a
>>>> unified
>>>> name space. I'll say however that ZFS is a battle hardened FS with tons
>>>> of
>>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
>>>> smaller start up companies are hocking are just ZFS appliances.
>>>>
>>>> RBD looks interesting but I'm not sure if I would be willing to put
>>>> production data on it, I'm not sure how performant it is IRL. From a
>>>> purely technical perspective, it looks REALLY cool.
>>>>
>>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is another
>>>> option although historically small/random IO has not been it's strong
>>>> point.
>>>>
>>>> If you are ok spending money on software and want a scale out block
>>>> storage
>>>> then you might want to consider HP LeftHand's VSA product. I am
>>>> personally
>>>> partial to NFS plays:) I went the exact opposite approach and settled on
>>>> Isilon for our primary storage for our CS deployment.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <nik.martin@nfinausa.com
>>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
>>>>
>>>>  On 10/22/2012 10:16 AM, Trevor Francis wrote:
>>>>>
>>>>>  We are looking at building a Primary Storage solution for an
>>>>>> enterprise/carrier class application. However, we want to build it
>>>>>> using
>>>>>> a FOSS solution and not a commercial solution. Do you have a
>>>>>> recommendation on platform?
>>>>>>
>>>>>>
>>>>>>  Trevor,
>>>>>
>>>>> I got EXCELLENT results builing a SAN from FOSS using:
>>>>> OS: Centos
>>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI MegaRAID
>>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>>>>> caching
>>>>> (alternately, look at FlashCache from Facebook)
>>>>> intel 10GB dual port nics, one port for crossover, on port for up link
>>>>> to
>>>>> storage network
>>>>>
>>>>> DRBD for real time block replication to active-active
>>>>> Pacemaker+corosync for HA Resource management
>>>>> tgtd for iSCSI target
>>>>>
>>>>> If you want file backed storage, XFS is a very good filesystem on Linux
>>>>> now.
>>>>>
>>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but that
>>>>> setup gave me a VERY high performance SAN.  The downside is it is
>>>>> entirely
>>>>> managed by CLI, no UI whatsoever.
>>>>>
>>>>>
>>>>>  Trevor Francis
>>>>>> Partner
>>>>>> 46 Labs | The PeerEdge Cloud
>>>>>> http://www.46labs.com <http://www.46labs.com/> |
>>>>>> http://www.peeredge.net
>>>>>> <http://www.peeredge.net/>
>>>>>>
>>>>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>>>>>> trevorgfrancis - Skype
>>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:trevor@46labs.com
>>>>>> >
>>>>>>
>>>>>>
>>>>>> Solutions Provider for the Telecom Industry
>>>>>>
>>>>>> <http://www.twitter.com/****peeredge<http://www.twitter.com/**peeredge><
>>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>>><
>>>>>> http://www.twitter.**com/**peeredge <http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
>>>>>> >><
>>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> <
>>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>>><
>>>>>> http://**www.facebook.com/**PeerEdge<http://www.facebook.com/PeerEdge><
>>>>>> http://www.facebook.com/**PeerEdge <http://www.facebook.com/PeerEdge>
>>>>>> >>
>>>>>>
>>>>>>
>>>>>>
>>>>>

Re: Primary Storage

Posted by Ivan Rodriguez <iv...@gmail.com>.
We are using ZFS with JBOD, not in production yet, exporting NFS to
cloudstack. I'm not really happy about the performance, but I think it is
related to the hardware itself rather than the technology; we are using Intel
SR2625UR servers and Intel 320 SSDs. We were evaluating gluster as well, but
we decided to move away from that path since gluster nfs is still performing
poorly, plus we would like to see cloudstack integrating the gluster-fuse
module. We haven't decided the final storage setup, but at the moment we had
better results with ZFS.
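For reference, a minimal sketch of that kind of layout; device names and share options below are assumptions, and the exact NFS sharing syntax varies between Solaris releases:

  # striped "jbod" pool with an SSD as L2ARC read cache (placeholder device names)
  zpool create tank c0t1d0 c0t2d0 c0t3d0 c0t4d0
  zpool add tank cache c0t5d0
  zfs create -o compression=on tank/primary
  zfs set sharenfs=on tank/primary
  # CloudStack then points primary storage at nfs://<storage-host>/tank/primary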



On Tue, Oct 23, 2012 at 10:44 AM, Nik Martin <ni...@nfinausa.com>wrote:

> On 10/22/2012 05:49 PM, Trevor Francis wrote:
>
>> ZFS looks really interesting to me and I am leaning that way. I am
>> considering using FreeNAS, as people seem to be having good luck with
>> it. Can anyone weigh in here?
>>
>>
> My personal opinion, I think FreeNAS and OpenFiler have horrible, horrible
> User Interfaces - not very intuitive, and they both seem to be file servers
> with things like iSCSI targets tacked on as an afterthought.
>
> Nik
>
>
>> Trevor Francis
>> Partner
>> 46 Labs | The PeerEdge Cloud
>> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
>> <http://www.peeredge.net/>
>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> trevorgfrancis - Skype
>> trevor@46labs.com <ma...@46labs.com>
>> Solutions Provider for the Telecom Industry
>>
>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
>>
>> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>>
>>  ZFS would be an interesting setup as you can do the cache pools like you
>>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that
>>> they
>>> really don't scale out well if you are looking for something with a
>>> unified
>>> name space. I'll say however that ZFS is a battle hardened FS with tons
>>> of
>>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
>>> smaller start up companies are hocking are just ZFS appliances.
>>>
>>> RBD looks interesting but I'm not sure if I would be willing to put
>>> production data on it, I'm not sure how performant it is IRL. From a
>>> purely technical perspective, it looks REALLY cool.
>>>
>>> I suppose anything is fast if you put SSDs in it :) GlusterFS is another
>>> option although historically small/random IO has not been it's strong
>>> point.
>>>
>>> If you are ok spending money on software and want a scale out block
>>> storage
>>> then you might want to consider HP LeftHand's VSA product. I am
>>> personally
>>> partial to NFS plays:) I went the exact opposite approach and settled on
>>> Isilon for our primary storage for our CS deployment.
>>>
>>>
>>>
>>>
>>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <nik.martin@nfinausa.com
>>> <mailto:nik.martin@nfinausa.**com <ni...@nfinausa.com>>>wrote:
>>>
>>>  On 10/22/2012 10:16 AM, Trevor Francis wrote:
>>>>
>>>>  We are looking at building a Primary Storage solution for an
>>>>> enterprise/carrier class application. However, we want to build it
>>>>> using
>>>>> a FOSS solution and not a commercial solution. Do you have a
>>>>> recommendation on platform?
>>>>>
>>>>>
>>>>>  Trevor,
>>>>
>>>> I got EXCELLENT results builing a SAN from FOSS using:
>>>> OS: Centos
>>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI MegaRAID
>>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>>>> caching
>>>> (alternately, look at FlashCache from Facebook)
>>>> intel 10GB dual port nics, one port for crossover, on port for up link
>>>> to
>>>> storage network
>>>>
>>>> DRBD for real time block replication to active-active
>>>> Pacemaker+corosync for HA Resource management
>>>> tgtd for iSCSI target
>>>>
>>>> If you want file backed storage, XFS is a very good filesystem on Linux
>>>> now.
>>>>
>>>> Pacemaker+Corosync can be difficult to grok at the beginning, but that
>>>> setup gave me a VERY high performance SAN.  The downside is it is
>>>> entirely
>>>> managed by CLI, no UI whatsoever.
>>>>
>>>>
>>>>  Trevor Francis
>>>>> Partner
>>>>> 46 Labs | The PeerEdge Cloud
>>>>> http://www.46labs.com <http://www.46labs.com/> |
>>>>> http://www.peeredge.net
>>>>> <http://www.peeredge.net/>
>>>>>
>>>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>>>>> trevorgfrancis - Skype
>>>>> trevor@46labs.com <ma...@46labs.com> <mailto:trevor@46labs.com
>>>>> >
>>>>>
>>>>>
>>>>> Solutions Provider for the Telecom Industry
>>>>>
>>>>> <http://www.twitter.com/****peeredge<http://www.twitter.com/**peeredge><
>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>>><
>>>>> http://www.twitter.**com/**peeredge <http://www.twitter.com/**peeredge<http://www.twitter.com/peeredge>
>>>>> >><
>>>>> http://www.**twitter.com/**peeredge <http://twitter.com/peeredge> <
>>>>> http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>>><
>>>>> http://**www.facebook.com/**PeerEdge<http://www.facebook.com/PeerEdge><
>>>>> http://www.facebook.com/**PeerEdge <http://www.facebook.com/PeerEdge>
>>>>> >>
>>>>>
>>>>>
>>>>>
>>>>

Re: Primary Storage

Posted by Nik Martin <ni...@nfinausa.com>.
On 10/22/2012 05:49 PM, Trevor Francis wrote:
> ZFS looks really interesting to me and I am leaning that way. I am
> considering using FreeNAS, as people seem to be having good luck with
> it. Can anyone weigh in here?
>

My personal opinion, I think FreeNAS and OpenFiler have horrible, 
horrible User Interfaces - not very intuitive, and they both seem to be 
file servers with things like iSCSI targets tacked on as an afterthought.

Nik

>
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
> <http://www.peeredge.net/>
> 405-362-0046 - Voice  | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com <ma...@46labs.com>
> Solutions Provider for the Telecom Industry
>
> <http://www.twitter.com/peeredge><http://www.twitter.com/peeredge><http://www.twitter.com/peeredge><http://www.facebook.com/PeerEdge>
>
> On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:
>
>> ZFS would be an interesting setup as you can do the cache pools like you
>> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that they
>> really don't scale out well if you are looking for something with a
>> unified
>> name space. I'll say however that ZFS is a battle hardened FS with tons of
>> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
>> smaller start up companies are hocking are just ZFS appliances.
>>
>> RBD looks interesting but I'm not sure if I would be willing to put
>> production data on it, I'm not sure how performant it is IRL. From a
>> purely technical perspective, it looks REALLY cool.
>>
>> I suppose anything is fast if you put SSDs in it :) GlusterFS is another
>> option although historically small/random IO has not been it's strong
>> point.
>>
>> If you are ok spending money on software and want a scale out block
>> storage
>> then you might want to consider HP LeftHand's VSA product. I am personally
>> partial to NFS plays:) I went the exact opposite approach and settled on
>> Isilon for our primary storage for our CS deployment.
>>
>>
>>
>>
>> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <nik.martin@nfinausa.com
>> <ma...@nfinausa.com>>wrote:
>>
>>> On 10/22/2012 10:16 AM, Trevor Francis wrote:
>>>
>>>> We are looking at building a Primary Storage solution for an
>>>> enterprise/carrier class application. However, we want to build it using
>>>> a FOSS solution and not a commercial solution. Do you have a
>>>> recommendation on platform?
>>>>
>>>>
>>> Trevor,
>>>
>>> I got EXCELLENT results builing a SAN from FOSS using:
>>> OS: Centos
>>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI MegaRAID
>>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD
>>> caching
>>> (alternately, look at FlashCache from Facebook)
>>> intel 10GB dual port nics, one port for crossover, on port for up link to
>>> storage network
>>>
>>> DRBD for real time block replication to active-active
>>> Pacemaker+corosync for HA Resource management
>>> tgtd for iSCSI target
>>>
>>> If you want file backed storage, XFS is a very good filesystem on Linux
>>> now.
>>>
>>> Pacemaker+Corosync can be difficult to grok at the beginning, but that
>>> setup gave me a VERY high performance SAN.  The downside is it is
>>> entirely
>>> managed by CLI, no UI whatsoever.
>>>
>>>
>>>> Trevor Francis
>>>> Partner
>>>> 46 Labs | The PeerEdge Cloud
>>>> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
>>>> <http://www.peeredge.net/>
>>>>
>>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>>>> trevorgfrancis - Skype
>>>> trevor@46labs.com <ma...@46labs.com> <ma...@46labs.com>
>>>>
>>>> Solutions Provider for the Telecom Industry
>>>>
>>>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
>>>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>>>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>>>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
>>>>
>>>>
>>>

Re: Primary Storage

Posted by Trevor Francis <tr...@tgrahamcapital.com>.
ZFS looks really interesting to me and I am leaning that way. I am considering using FreeNAS, as people seem to be having good luck with it. Can anyone weigh in here?


Trevor Francis
Partner
46 Labs | The PeerEdge Cloud
http://www.46labs.com | http://www.peeredge.net
405-362-0046 - Voice  | 405-410-4980 - Cell
trevorgfrancis - Skype
trevor@46labs.com
 
Solutions Provider for the Telecom Industry

 

On Oct 22, 2012, at 2:30 PM, Jason Davis wrote:

> ZFS would be an interesting setup as you can do the cache pools like you
> would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that they
> really don't scale out well if you are looking for something with a unified
> name space. I'll say however that ZFS is a battle hardened FS with tons of
> shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
> smaller start up companies are hocking are just ZFS appliances.
> 
> RBD looks interesting but I'm not sure if I would be willing to put
> production data on it, I'm not sure how performant it is IRL. From a
> purely technical perspective, it looks REALLY cool.
> 
> I suppose anything is fast if you put SSDs in it :) GlusterFS is another
> option although historically small/random IO has not been it's strong point.
> 
> If you are ok spending money on software and want a scale out block storage
> then you might want to consider HP LeftHand's VSA product. I am personally
> partial to NFS plays:) I went the exact opposite approach and settled on
> Isilon for our primary storage for our CS deployment.
> 
> 
> 
> 
> On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <ni...@nfinausa.com>wrote:
> 
>> On 10/22/2012 10:16 AM, Trevor Francis wrote:
>> 
>>> We are looking at building a Primary Storage solution for an
>>> enterprise/carrier class application. However, we want to build it using
>>> a FOSS solution and not a commercial solution. Do you have a
>>> recommendation on platform?
>>> 
>>> 
>> Trevor,
>> 
>> I got EXCELLENT results builing a SAN from FOSS using:
>> OS: Centos
>> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI MegaRAID
>> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD caching
>> (alternately, look at FlashCache from Facebook)
>> intel 10GB dual port nics, one port for crossover, on port for up link to
>> storage network
>> 
>> DRBD for real time block replication to active-active
>> Pacemaker+corosync for HA Resource management
>> tgtd for iSCSI target
>> 
>> If you want file backed storage, XFS is a very good filesystem on Linux
>> now.
>> 
>> Pacemaker+Corosync can be difficult to grok at the beginning, but that
>> setup gave me a VERY high performance SAN.  The downside is it is entirely
>> managed by CLI, no UI whatsoever.
>> 
>> 
>>> Trevor Francis
>>> Partner
>>> 46 Labs | The PeerEdge Cloud
>>> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
>>> <http://www.peeredge.net/>
>>> 
>>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>>> trevorgfrancis - Skype
>>> trevor@46labs.com <ma...@46labs.com>
>>> 
>>> Solutions Provider for the Telecom Industry
>>> 
>>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
>>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
>>> 
>>> 
>> 
>> --
>> Regards,
>> 
>> Nik
>> 
>> Nik Martin
>> VP Business Development
>> Nfina Technologies, Inc.
>> +1.251.243.0043 x1003
>> Relentless Reliability
>> 


Re: Primary Storage

Posted by Jason Davis <sc...@gmail.com>.
ZFS would be an interesting setup as you can do the cache pools like you
would do in CacheCade. The problem with ZFS or CacheCade+DRBD is that they
really don't scale out well if you are looking for something with a unified
name space. I'll say however that ZFS is a battle hardened FS with tons of
shops using it. A lot of the whiz-bang SSD+SATA disk SAN things these
smaller start-up companies are hawking are just ZFS appliances.
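For what it's worth, the ZFS cache-pool idea boils down to something like this (pool and device names are placeholders):

  # SSDs as read cache (L2ARC) and a mirrored write log (SLOG) on an existing pool
  zpool add tank cache /dev/disk/by-id/ssd-cache0
  zpool add tank log mirror /dev/disk/by-id/ssd-log0 /dev/disk/by-id/ssd-log1
  zpool status tank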

RBD looks interesting but I'm not sure if I would be willing to put
production data on it, I'm not sure how performant it is IRL. From a
purely technical perspective, it looks REALLY cool.

I suppose anything is fast if you put SSDs in it :) GlusterFS is another
option, although historically small/random IO has not been its strong point.

If you are ok spending money on software and want a scale out block storage
then you might want to consider HP LeftHand's VSA product. I am personally
partial to NFS plays:) I went the exact opposite approach and settled on
Isilon for our primary storage for our CS deployment.




On Mon, Oct 22, 2012 at 10:24 AM, Nik Martin <ni...@nfinausa.com>wrote:

> On 10/22/2012 10:16 AM, Trevor Francis wrote:
>
>> We are looking at building a Primary Storage solution for an
>> enterprise/carrier class application. However, we want to build it using
>> a FOSS solution and not a commercial solution. Do you have a
>> recommendation on platform?
>>
>>
> Trevor,
>
> I got EXCELLENT results builing a SAN from FOSS using:
> OS: Centos
> Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI MegaRAID
> with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD caching
> (alternately, look at FlashCache from Facebook)
> intel 10GB dual port nics, one port for crossover, on port for up link to
> storage network
>
> DRBD for real time block replication to active-active
> Pacemaker+corosync for HA Resource management
> tgtd for iSCSI target
>
> If you want file backed storage, XFS is a very good filesystem on Linux
> now.
>
> Pacemaker+Corosync can be difficult to grok at the beginning, but that
> setup gave me a VERY high performance SAN.  The downside is it is entirely
> managed by CLI, no UI whatsoever.
>
>
>> Trevor Francis
>> Partner
>> 46 Labs | The PeerEdge Cloud
>> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
>> <http://www.peeredge.net/>
>>
>> 405-362-0046 - Voice  | 405-410-4980 - Cell
>> trevorgfrancis - Skype
>> trevor@46labs.com <ma...@46labs.com>
>>
>> Solutions Provider for the Telecom Industry
>>
>> <http://www.twitter.com/**peeredge <http://www.twitter.com/peeredge>><
>> http://www.twitter.**com/peeredge <http://www.twitter.com/peeredge>><
>> http://www.**twitter.com/peeredge <http://www.twitter.com/peeredge>><
>> http://**www.facebook.com/PeerEdge <http://www.facebook.com/PeerEdge>>
>>
>>
>
> --
> Regards,
>
> Nik
>
> Nik Martin
> VP Business Development
> Nfina Technologies, Inc.
> +1.251.243.0043 x1003
> Relentless Reliability
>

Re: Primary Storage

Posted by Nik Martin <ni...@nfinausa.com>.
On 10/22/2012 10:16 AM, Trevor Francis wrote:
> We are looking at building a Primary Storage solution for an
> enterprise/carrier class application. However, we want to build it using
> a FOSS solution and not a commercial solution. Do you have a
> recommendation on platform?
>

Trevor,

I got EXCELLENT results building a SAN from FOSS using:
OS: CentOS
Hardware: 2X storage servers, with 12x2TB 3.5 SATA drives.  LSI MegaRAID 
with CacheCade Pro, with 240 GB Intel 520 SSDs configured to do SSD 
caching (alternately, look at FlashCache from Facebook)
Intel 10GB dual port NICs, one port for crossover, one port for uplink 
to storage network

DRBD for real time block replication to active-active
Pacemaker+corosync for HA Resource management
tgtd for iSCSI target

If you want file backed storage, XFS is a very good filesystem on Linux now.

Pacemaker+Corosync can be difficult to grok at the beginning, but that 
setup gave me a VERY high performance SAN.  The downside is it is 
entirely managed by CLI, no UI whatsoever.
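A rough sketch of how those pieces fit together; resource names, IPs and the IQN are placeholders, and the exact DRBD syntax depends on whether you run 8.3 or 8.4:

  # /etc/drbd.d/r0.res -- replicate the RAID volume between the two storage servers
  resource r0 {
    protocol C;
    device    /dev/drbd0;
    disk      /dev/sdb;                 # the CacheCade-backed RAID volume
    meta-disk internal;
    net { allow-two-primaries; }        # required for active-active
    on san-a { address 10.10.10.1:7789; }
    on san-b { address 10.10.10.2:7789; }
  }

  # /etc/tgt/targets.conf -- export the replicated device as an iSCSI LUN via tgtd
  <target iqn.2012-10.com.example:san.lun0>
      backing-store /dev/drbd0
  </target>

  # pacemaker (crm shell): let the cluster manage DRBD promotion (one primitive shown)
  crm configure primitive p_drbd_r0 ocf:linbit:drbd params drbd_resource=r0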

>
> Trevor Francis
> Partner
> 46 Labs | The PeerEdge Cloud
> http://www.46labs.com <http://www.46labs.com/> | http://www.peeredge.net
> <http://www.peeredge.net/>
> 405-362-0046 - Voice  | 405-410-4980 - Cell
> trevorgfrancis - Skype
> trevor@46labs.com <ma...@46labs.com>
> Solutions Provider for the Telecom Industry
>
> <http://www.twitter.com/peeredge><http://www.twitter.com/peeredge><http://www.twitter.com/peeredge><http://www.facebook.com/PeerEdge>
>


-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

Re: Primary Storage

Posted by Bryan Whitehead <dr...@megahappy.net>.
> We are looking at building a Primary Storage solution for an
> enterprise/carrier class application. However, we want to build it using a
> FOSS solution and not a commercial solution. Do you have a recommendation
> on platform?
>
>
I'm using GlusterFS as primary storage on my production cloudstack
installs. I'm using IPoIB to get some pretty fast speeds. I
can consistently get about 120MB/sec from within a VM using qcow2 files in
kvm over glusterfs fuse mounts. I'm hoping cloudstack will be able to
support the native gluster stuff being done in qemu in the future.
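A bare-bones sketch of that arrangement on a KVM host; the volume name, server and mount point are placeholders, and the mount point is what gets registered in CloudStack as SharedMountPoint primary storage:

  # same fuse mount on every KVM host in the cluster, over the IPoIB interface
  mount -t glusterfs gluster1-ib:/cs-primary /gluster/primary
  echo 'gluster1-ib:/cs-primary /gluster/primary glusterfs defaults,_netdev 0 0' >> /etc/fstab

  # VM disks then live as ordinary qcow2 files on the mount, e.g.
  qemu-img info /gluster/primary/<volume-uuid>.qcow2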