Posted to user@pig.apache.org by Earl Cahill <ca...@yahoo.com> on 2009/04/26 08:15:23 UTC

pig on ssd

A few months ago, I was benchmarking some Pig jobs, and the entire process took five or ten minutes while the CPU time was under ten seconds.  It hit me that wow, disk IO is really the bottleneck here: ten CPU seconds against five or ten minutes of wall time means the CPU was busy maybe 2-3% of the run.  About that time I was learning about Fusion-io cards, and though they're expensive, I pondered buying one, but have just been waiting.  Recently I have been thinking about buying a cheaper SSD to play around with, and I was wondering whether anyone out there has benchmarked Pig on SSD.  It looks like there are some brand-new drives just out (OCZ Vertex EX) with < 1 ms seek time and read/write speeds of 260 MB/s / 210 MB/s.  I'm thinking that with a bit of RAM, a nice eight-core machine with an SSD or two in RAID could get some pretty awesome throughput.
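A minimal sketch of the measurement I mean, assuming a Unix box; the local-mode invocation and script name are just placeholders.  If wall time is minutes while the child's CPU time is seconds, the job is IO-bound:

#!/usr/bin/env python
# Sketch: compare wall-clock time to CPU time for a child process
# (e.g. a local-mode Pig run) to confirm the job is IO-bound.
import resource
import subprocess
import time

start = time.time()
subprocess.call(["pig", "-x", "local", "myscript.pig"])  # placeholder script
wall = time.time() - start

usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu = usage.ru_utime + usage.ru_stime  # user + system CPU seconds

print("wall %.1fs, cpu %.1fs, cpu busy %.1f%%" % (wall, cpu, 100 * cpu / wall))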

Thanks,
Earl

http://blog.spack.net
http://holaservers.com

Re: pig on ssd

Posted by jason hadoop <ja...@gmail.com>.
At present it is hard to beat the IO ops per second of the Fusion-io cards.
It is not so much the latency of the device as the latency of issuing
commands to the device that is the ultimate bottleneck.

I have heard - but do not know - that a conventional SATA chain peaks at
about 300 transactions per second per device. I see about 200 as the maximum
on my non-SSD Solaris machines.
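A rough way to sanity-check that per-device figure is a small random-read
loop like the sketch below. The path is a placeholder, and the file needs to
be much larger than RAM (or the page cache flushed first) for the reads to
actually hit the disk; a spinning disk typically lands in the low hundreds
of reads per second:

#!/usr/bin/env python
# Sketch: measure random-read operations per second against one device.
import os
import random
import time

PATH = "/data/bigfile"   # placeholder: a file much larger than RAM
BLOCK = 4096
COUNT = 2000

size = os.path.getsize(PATH)
fd = os.open(PATH, os.O_RDONLY)

start = time.time()
for _ in range(COUNT):
    offset = random.randrange(0, size - BLOCK)
    os.lseek(fd, (offset // BLOCK) * BLOCK, os.SEEK_SET)  # block-aligned seek
    os.read(fd, BLOCK)
elapsed = time.time() - start
os.close(fd)

print("%.0f random reads/sec" % (COUNT / elapsed))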

I have an order for a Fusion-io card fighting its way through the signers of
such things, and I am really looking forward to seeing what it can do in
our app.

My one experience with a Fusion-io card was unpacking gzipped file system
images so I could edit them and then repack them; the card sustained about
400 MB/sec for that. It was nice: I could do one unpack, edit, and repack
cycle in about 5 minutes instead of a couple of hours.

On Sat, Apr 25, 2009 at 11:15 PM, Earl Cahill <ca...@yahoo.com> wrote:

> A few months ago, I was benchmarking some Pig jobs, and the entire
> process took five or ten minutes while the CPU time was under ten
> seconds.  It hit me that wow, disk IO is really the bottleneck here:
> ten CPU seconds against five or ten minutes of wall time means the
> CPU was busy maybe 2-3% of the run.  About that time I was learning
> about Fusion-io cards, and though they're expensive, I pondered buying
> one, but have just been waiting.  Recently I have been thinking about
> buying a cheaper SSD to play around with, and I was wondering whether
> anyone out there has benchmarked Pig on SSD.  It looks like there are
> some brand-new drives just out (OCZ Vertex EX) with < 1 ms seek time
> and read/write speeds of 260 MB/s / 210 MB/s.  I'm thinking that with
> a bit of RAM, a nice eight-core machine with an SSD or two in RAID
> could get some pretty awesome throughput.
>
> Thanks,
> Earl
>
>  http://blog.spack.net
> http://holaservers.com
>
>
>
>




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: pig on ssd

Posted by Laurent Laborde <ke...@gmail.com>.
On Sun, Apr 26, 2009 at 8:15 AM, Earl Cahill <ca...@yahoo.com> wrote:
> A few months ago, I was benchmarking some Pig jobs, and the entire process took five or ten minutes while the CPU time was under ten seconds.  It hit me that wow, disk IO is really the bottleneck here: ten CPU seconds against five or ten minutes of wall time means the CPU was busy maybe 2-3% of the run.  About that time I was learning about Fusion-io cards, and though they're expensive, I pondered buying one, but have just been waiting.  Recently I have been thinking about buying a cheaper SSD to play around with, and I was wondering whether anyone out there has benchmarked Pig on SSD.  It looks like there are some brand-new drives just out (OCZ Vertex EX) with < 1 ms seek time and read/write speeds of 260 MB/s / 210 MB/s.  I'm thinking that with a bit of RAM, a nice eight-core machine with an SSD or two in RAID could get some pretty awesome throughput.

It is certainly a really super shiny toy :)
Pig/Hadoop/HDFS/MapReduce/... is built to be scalable: when you need
more IO or CPU, add more servers.
And considering the price of a Fusion-io card, you could add a lot of servers :)

Be careful with SSDs: performance can drop very badly when performing
concurrent read and write operations.
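One way to see the effect is to time a random-read loop while a sequential
writer hammers the same drive, then compare against the read-only number.
A rough sketch, with placeholder paths; run the reader alone first to get
the baseline:

#!/usr/bin/env python
# Sketch: random reads while a sequential writer runs, to expose the
# mixed-workload slowdown on SSDs.
import os
import random
import threading
import time

READ_PATH = "/ssd/bigfile"       # placeholder: existing file larger than RAM
WRITE_PATH = "/ssd/scratch.bin"  # placeholder: scratch file to create

def writer(mb=256):
    # Stream 1 MB sequential writes, forcing each one to the device.
    block = b"\0" * (1024 * 1024)
    with open(WRITE_PATH, "wb") as f:
        for _ in range(mb):
            f.write(block)
            f.flush()
            os.fsync(f.fileno())

def reader(results, n=2000):
    # Time random 4 KB reads issued while the writer is running.
    size = os.path.getsize(READ_PATH)
    fd = os.open(READ_PATH, os.O_RDONLY)
    start = time.time()
    for _ in range(n):
        os.lseek(fd, random.randrange(0, size - 4096), os.SEEK_SET)
        os.read(fd, 4096)
    results.append(n / (time.time() - start))
    os.close(fd)

results = []
threads = [threading.Thread(target=writer),
           threading.Thread(target=reader, args=(results,))]
for t in threads: t.start()
for t in threads: t.join()
print("reads/sec under concurrent writes: %.0f" % results[0])

A large drop relative to the read-only baseline confirms the problem.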

At work, we're using PostgreSQL.
Like any RDBMS, it scales poorly compared to MapReduce-style systems.
We're IO-bound and chose to add more servers instead of buying more
expensive hardware.
And we're still using traditional arrays of SAS HDDs.

If you're IO-bound (and you are):
- add more disks
- add more servers

Or wait for better, bigger, cheaper SSDs.

-- 
F4FQM
Kerunix Flan
Laurent Laborde