Posted to user@mahout.apache.org by Ioan Eugen Stan <st...@gmail.com> on 2011/11/17 03:39:13 UTC

clustering hardware requirements

Hello,

I have to figure out how much hardware is required to do clustering
for my company on about 10+ million user accounts, each with 100-5000
documents. The documents will be indexed, so vector creation will be
done at indexing time.
Is there any formula to approximate the size of the vectors based on
the index size? I'm looking for rough estimates (how much extra disk
space should I budget?).

Which are the most time-consuming tasks? From my experience with
clustering, the index/vector creation part is the most time-consuming,
with clustering second. Does anyone have some data on how long a
clustering job takes?

Thanks,

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/

Re: clustering hardware requirements

Posted by Grant Ingersoll <gs...@apache.org>.
Here are some numbers using http://aws.amazon.com/datasets/7791434387204566, running locally:

Raw content size:
9.2 GB, 48K "items" -- note, most of the files are GZipped

It took 15 minutes to convert all of these to sequence files on a single i7 CPU (4 cores with hyper-threading, 3.4 GHz) with 16 GB of RAM.

After converting to sequence files:
40 GB, 659 items.
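
(If you want to drive that conversion from code instead of bin/mahout: SequenceFilesFromDirectory is the class behind seqdirectory. A rough sketch; treat the exact flags as assumptions, since they vary between versions.)

import org.apache.mahout.text.SequenceFilesFromDirectory;

public class ToSeqFiles {
  public static void main(String[] args) throws Exception {
    // Equivalent of: bin/mahout seqdirectory -i asf-mail -o asf-mail-seq -c UTF-8
    // The paths are placeholders for your own input/output directories.
    SequenceFilesFromDirectory.main(new String[] {
        "-i", "asf-mail",       // directory tree of extracted messages
        "-o", "asf-mail-seq",   // where the SequenceFiles get written
        "-c", "UTF-8"           // character encoding of the input files
    });
  }
}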

Encoded vectors (see build-asf-email.sh), cardinality = 5000: 11 GB for 1,300 items. This took 83 minutes to convert.
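
(Encoded here means Mahout's hashed feature encoders: terms hash into a vector of fixed cardinality, so no dictionary is needed. A minimal sketch, assuming the 0.5-era org.apache.mahout.vectorizer.encoders API:)

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodeMail {
  public static void main(String[] args) {
    // Hash each token into a vector of fixed cardinality; no dictionary
    // pass is needed and every document vector has the same bounded size.
    FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
    Vector v = new RandomAccessSparseVector(5000); // cardinality = 5000, as above
    for (String token : "replace with your analyzed email tokens".split(" ")) {
      encoder.addToVector(token, v);
    }
    System.out.println(v.getNumNondefaultElements() + " non-zero entries");
  }
}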

Splitting into test and train sets took 9 minutes for SGD. I had to kill the SGD job due to CPU temperature issues on my machine (SGD really cranks the CPU, and something there is messed up) that I need to track down.

For clustering, converting to sequence files took about the same time.

The job to convert to vectors took a while (it scrolled out of my window). The resulting tfidf-vectors were 7.8 GB.
Dictionary:
82865442 2011-11-21 17:46 dictionary.file-0*
83269191 2011-11-21 17:46 dictionary.file-1*
10963133 2011-11-21 17:46 dictionary.file-2*

Freq files:

 37160153 2011-11-21 22:35 frequency.file-0*
 37160173 2011-11-21 22:35 frequency.file-1*
 37160173 2011-11-21 22:35 frequency.file-2*
 31407713 2011-11-21 22:35 frequency.file-3*

Total dir size for seq2sparse:  du -s seq2sparse/
30923564	seq2sparse/
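
(All of the above comes out of one seq2sparse run. Driving it from code looks roughly like the following; -i/-o/-wt/-ow are the options I use, but double-check them against your version.)

import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class ToTfidfVectors {
  public static void main(String[] args) throws Exception {
    // Equivalent of: bin/mahout seq2sparse -i asf-mail-seq -o seq2sparse -wt tfidf -ow
    // Produces tfidf-vectors plus the dictionary.file-* and frequency.file-* above.
    SparseVectorsFromSequenceFiles.main(new String[] {
        "-i", "asf-mail-seq",   // SequenceFiles from the previous step
        "-o", "seq2sparse",     // output directory
        "-wt", "tfidf",         // TF-IDF term weighting
        "-ow"                   // overwrite any existing output
    });
  }
}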

More as they become available.

HTH,
Grant

On Nov 21, 2011, at 3:57 AM, Ioan Eugen Stan wrote:

>> I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you can go run them yourself!
> 
> I think posting some reference data for the jobs will be great. I will have something to compare to when I have something done. In the meantime I will try to get a quick and dirty implementation working, see how things move, and post my findings. This could take a while as I depend on some modifications.
> 
>> Otherwise, I don't know that we have any formula just yet.  I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less.  Then, it is just a question of how many vectors you have and how sparse they are.  This could probably be guessed at by looking at the average number of words in your email collection.  Naturally, attachments may skew this if you are including them.
> 
> I also suspect that things will level off asymptotically after a certain number of documents; it remains to be seen where that threshold is.
> 
>> That has been my experience, too.  Seq2Sparse is often the long part.  I suspect one could get it done a lot faster in Lucene.  SequenceFilesFromDirectory is also slow, but that is inherently sequential.
> 
> I will be able to use a map-reduce job to create vectors, or just create them as an indexing step, so I hope this step will not count toward the effective clustering time.
> 
>> I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.
>> 
>> -Grant
> 
> I don't know about encoded vectors yet; I hope to get more info on them from Mahout in Action. If they do what I think they do, I will definitely try them, and probably complain on the list (to Ted) if I can't interpret them right :).
> 
> Thanks for the reply,
> 
> --
> Ioan Eugen Stan

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: clustering hardware requirements

Posted by Ioan Eugen Stan <st...@gmail.com>.
On 22.11.2011 04:32, Lance Norskog wrote:
> Ioan- when you understand them, please explain them here:
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats
>

Ok, I'll remember that.

Re: clustering hardware requirements

Posted by Lance Norskog <go...@gmail.com>.
Ioan- when you understand them, please explain them here:

https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats

On Mon, Nov 21, 2011 at 12:57 AM, Ioan Eugen Stan <st...@gmail.com> wrote:

>> I'll try in the next few days to track down the numbers from running the
>> stuff in my recent IBM article:
>> http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.
>> Or, you can go run them yourself!
>>
>
> I think posting some reference data for the jobs will be great. I will
> have something to compare to when I have something done. In the meantime
> I will try to get a quick and dirty implementation working, see how
> things move, and post my findings. This could take a while as I depend
> on some modifications.
>
>
>> Otherwise, I don't know that we have any formula just yet.  I suspect
>> that once you reach a certain number of documents, your dictionary will
>> stop growing, more or less.  Then, it is just a question of how many
>> vectors you have and how sparse they are.  This could probably be guessed
>> at by looking at the average number of words in your email collection.
>> Naturally, attachments may skew this if you are including them.
>>
>
> I also suspect that things will level off asymptotically after a certain
> number of documents; it remains to be seen where that threshold is.
>
>
>> That has been my experience, too.  Seq2Sparse is often the long part.  I
>> suspect one could get it done a lot faster in Lucene.
>> SequenceFilesFromDirectory is also slow, but that is inherently sequential.
>>
>
> I will be able to use a map-reduce job to create vectors, or just create
> them as an indexing step, so I hope this step will not count toward the
> effective clustering time.
>
>
>> I haven't explored yet what it would mean to use Encoded vectors in
>> Clustering, but perhaps I can call Ted to the front of the class and see if
>> he has thoughts on whether that even makes sense, as that would give you a
>> fixed size Vector.
>>
>> -Grant
>>
>
> I don't know about encoded vectors yet; I hope to get more info on
> them from Mahout in Action. If they do what I think they do, I will
> definitely try them, and probably complain on the list (to Ted) if I
> can't interpret them right :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan
>



-- 
Lance Norskog
goksron@gmail.com

Re: clustering hardware requirements

Posted by Ioan Eugen Stan <st...@gmail.com>.
> I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you can go run them yourself!

I think posting some reference data for the jobs will be great. I will
have something to compare to when I have something done. In the meantime
I will try to get a quick and dirty implementation working, see how
things move, and post my findings. This could take a while as I depend
on some modifications.

> Otherwise, I don't know that we have any formula just yet.  I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less.  Then, it is just a question of how many vectors you have and how sparse they are.  This could probably be guessed at by looking at the average number of words in your email collection.  Naturally, attachments may skew this if you are including them.

I also suspect that things will level off asymptotically after a certain
number of documents; it remains to be seen where that threshold is.

> That has been my experience, too.  Seq2Sparse is often the long part.  I suspect one could get it done a lot faster in Lucene.  SequenceFilesFromDirectory is also slow, but that is inherently sequential.

I will be able to use a map-reduce job to create vectors, or just create
them as an indexing step, so I hope this step will not count toward the
effective clustering time.

> I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.
>
> -Grant

I don't know about encoded vectors yet; I hope to get more info on
them from Mahout in Action. If they do what I think they do, I will
definitely try them, and probably complain on the list (to Ted) if I
can't interpret them right :).

Thanks for the reply,

--
Ioan Eugen Stan

Re: clustering hardware requirements

Posted by Ted Dunning <te...@gmail.com>.
It is a great idea except that the centroids become harder to interpret.
Not much harder.  Just a bit harder.
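
A sketch of one way to keep them interpretable, assuming the encoder API circa 0.5 (setTraceDictionary records which indexes each word hashes to; hash collisions make the reverse mapping approximate):

import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TraceEncoding {
  public static void main(String[] args) {
    // Record which vector indexes each word hashes to while encoding, so a
    // heavy centroid weight can be traced back to its candidate words.
    Map<String, Set<Integer>> trace = new TreeMap<String, Set<Integer>>();
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
    encoder.setTraceDictionary(trace);
    // ... encode documents as usual, then trace.get("hadoop") lists the
    // indexes to inspect in each centroid.
  }
}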

On Fri, Nov 18, 2011 at 9:44 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I haven't explored yet what it would mean to use Encoded vectors in
> Clustering, but perhaps I can call Ted to the front of the class and see if
> he has thoughts on whether that even makes sense, as that would give you a
> fixed size Vector.
>

Re: clustering hardware requirements

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote:

> Hello,
> 
> I have to figure out how much hardware is required to do clustering
> for my company on about 10+ million user accounts, each with 100-5000
> documents. The documents will be indexed, so vector creation will be
> done at indexing time.
> Is there any formula to approximate the size of the vectors based on
> the index size? I'm looking for rough estimates (how much extra disk
> space should I budget?).

I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you can go run them yourself!  

Otherwise, I don't know that we have any formula just yet.  I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less.  Then, it is just a question of how many vectors you have and how sparse they are.  This could probably be guessed at by looking at the average number of words in your email collection.  Naturally, attachments may skew this if you are including them.
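
As a crude sketch of that guess (every number below is a placeholder you would replace with measurements from a sample of your own mail; 12 bytes approximates one sparse index/weight pair and ignores SequenceFile overhead and compression):

public class VectorSizeEstimate {
  public static void main(String[] args) {
    long docs = 10000000L * 500;  // 10M accounts x ~500 docs each (assumed average)
    long avgUniqueTerms = 150;    // assumed unique terms per doc after analysis
    long bytesPerEntry = 4 + 8;   // int index plus double weight per sparse entry
    long totalBytes = docs * avgUniqueTerms * bytesPerEntry;
    System.out.printf("~%.1f TB of sparse TF-IDF vectors%n", totalBytes / 1e12);
  }
}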

> 
> Which are the most time-consuming tasks? From my experience with
> clustering, the index/vector creation part is the most time-consuming,
> with clustering second. Does anyone have some data on how long a
> clustering job takes?

That has been my experience, too.  Seq2Sparse is often the long part.  I suspect one could get it done a lot faster in Lucene.  SequenceFilesFromDirectory is also slow, but that is inherently sequential.
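
If your content is already in a Lucene index with term vectors stored, the lucene.vector utility can pull vectors straight out of it. A sketch; the option names are from memory, so verify them against your version:

import org.apache.mahout.utils.vectors.lucene.Driver;

public class FromLuceneIndex {
  public static void main(String[] args) throws Exception {
    // Equivalent of: bin/mahout lucene.vector --dir <index> --field body ...
    // Paths and field names are placeholders; the index must store term vectors.
    Driver.main(new String[] {
        "--dir", "/path/to/lucene/index", // existing Lucene index
        "--field", "body",                // field to vectorize
        "--idField", "id",                // field holding the document id
        "--output", "lucene-vectors",     // SequenceFile of vectors
        "--dictOut", "lucene-dictionary", // term-to-index dictionary
        "--norm", "2"                     // L2-normalize the vectors
    });
  }
}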

I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.

-Grant