Posted to user@mahout.apache.org by Steven Bourke <st...@ucd.ie> on 2010/05/01 20:23:30 UTC

How do you store data...

I'm working with large datasets and have limited hardware resources (like everyone else!).

I was wondering what people would recommend for storing my data when using Mahout. I have roughly 100 GB of data right now, and that will grow and shrink over time. If I distribute the storage, the maximum number of nodes I would have access to is three.

I guess this is really a 'how long is a piece of string' question, but I would still appreciate people's experiences!

My main requirement is speed!

Steve


Re: How do you store data...

Posted by Ted Dunning <te...@gmail.com>.
At DeepDyve, we replicate everything 3x but downgrade some intermediate data
to 2x.  For very small tests 1x is fine, but I really appreciate the speed
boost of 3x replication.

On Sat, May 1, 2010 at 3:45 PM, Sean Owen <sr...@gmail.com> wrote:

> I can tell you I replicate 1x for testing and debugging, and replicate
> 3x in production as a rule. This was the norm at Google FWIW; some key
> data was distributed more but 3x was the default.
>

Re: How do you store data...

Posted by Sean Owen <sr...@gmail.com>.
If you're using Hadoop-based jobs in Mahout, it certainly makes sense
to keep your data on the HDFS cluster that serves the Hadoop cluster;
the jobs need it to be available there.

So are you asking about how much to distribute the data? Replication
obviously costs more storage, but buys not only redundancy but also
perhaps performance: if the data copies are closer to the workers,
it's faster. It sounds like you have a small / local cluster, so this
may not be a factor.

I can tell you I replicate 1x for testing and debugging, and replicate
3x in production as a rule. This was the norm at Google FWIW; some key
data was distributed more but 3x was the default.
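
To put rough numbers on the storage cost for the setup described in this thread (about 100 GB on at most three nodes), here is a back-of-the-envelope sketch. The function names are made up for illustration, and it assumes replicas are spread evenly across nodes and ignores intermediate job output and filesystem overhead, so treat the figures as estimates only.

```python
def raw_storage_gb(dataset_gb, replication):
    """Total disk consumed across the cluster at a given replication factor."""
    return dataset_gb * replication


def per_node_gb(dataset_gb, replication, nodes):
    """Average disk used per node, assuming replicas are spread evenly."""
    # HDFS keeps at most one replica of a block per datanode, so a factor
    # larger than the node count cannot actually be satisfied.
    if replication > nodes:
        raise ValueError("replication factor exceeds the number of nodes")
    return raw_storage_gb(dataset_gb, replication) / nodes


if __name__ == "__main__":
    dataset_gb = 100  # size quoted in the original question
    nodes = 3         # maximum node count quoted in the original question
    for replication in (1, 2, 3):
        total = raw_storage_gb(dataset_gb, replication)
        each = per_node_gb(dataset_gb, replication, nodes)
        print(f"{replication}x: {total} GB total, about {each:.1f} GB per node")
```

Under these assumptions, 3x replication on three nodes means each node ends up holding a full copy of the data, so the cost is roughly 100 GB of disk per node.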

On Sat, May 1, 2010 at 7:23 PM, Steven Bourke <st...@ucd.ie> wrote:
> I'm working with large datasets and have limited hardware resources (Like everyone else!)
>
> I was wondering what would people recommend for storing my data in when using mahout. I've roughly 100gb of data right now, that will grow and shrink over time. If I distribute the storage the maximum number of nodes I would have access to is three.
>
> I guess this is really a 'how long is a piece of string' question, but would still appreciate peoples experiences!
>
> My requirements would be speed!
>
> Steve
>
>