Posted to user@mahout.apache.org by Ioan Eugen Stan <st...@gmail.com> on 2011/11/17 03:39:13 UTC
clustering hardware requirements
Hello,
I have to figure out how much hardware is required to do clustering
for my company on about 10+ million user accounts, each with 100-5000
documents. The documents will be indexed, so vector creation will be
done at indexing time.
Is there any formula to approximate the size of the vectors based on
the index size? I'm looking for rough estimates (how much extra disk
space should I consider?).
Which are the most time-consuming tasks? From my experience with
clustering, the index/vector creation part is the most time-consuming,
with clustering itself second. Does anyone have data on how long a
clustering job takes?
Thanks,
--
Ioan Eugen Stan
http://ieugen.blogspot.com/
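A rough back-of-envelope for the disk question (an editorial sketch, not a Mahout formula): a sparse tf-idf vector costs roughly one (index, weight) pair per distinct term in the document, so sizing reduces to arithmetic. All concrete numbers below are illustrative assumptions.

```python
# Back-of-envelope sparse-vector sizing. Assumes each sparse entry is
# an int index (4 bytes) plus a double weight (8 bytes); the per-vector
# overhead is a guess.
BYTES_PER_ENTRY = 4 + 8

def vector_bytes(distinct_terms, overhead=16):
    """Approximate on-disk size of one sparse tf-idf vector."""
    return distinct_terms * BYTES_PER_ENTRY + overhead

def corpus_bytes(num_docs, avg_distinct_terms):
    """Total across the corpus, before any sequence-file compression."""
    return num_docs * vector_bytes(avg_distinct_terms)

# Illustration: 10 million accounts x 1000 docs each, ~150 distinct
# terms per document (all assumed values).
total = corpus_bytes(10_000_000 * 1000, 150)
print(f"~{total / 2**40:.1f} TiB")
```

Sequence-file compression and shorter documents can pull this down substantially; treat it as an upper-bound sketch.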
Re: clustering hardware requirements
Posted by Grant Ingersoll <gs...@apache.org>.
Here are some numbers from running http://aws.amazon.com/datasets/7791434387204566 locally:
Raw content size:
9.2 GB, 48K "items" -- note, most of the files are GZipped
It took 15 minutes to convert all of these to sequence files on a single i7 CPU (4 cores with hyper-threading, 3.4 GHz) with 16 GB of RAM.
After converting to sequence files:
40 GB, 659 items.
Encoded vectors (see build-asf-email.sh), cardinality = 5000: 11 GB for 1,300 items. The conversion took 83 minutes.
Splitting into test and train took 9 minutes for SGD. I had to kill the SGD job due to CPU-temperature issues on my machine that I need to track down (SGD really cranks on the CPU, and something is messed up on my machine).
For clustering, converting to sequence files took about the same time.
The job to convert to vectors took a while (it scrolled out of my window). The resulting tfidf-vecs were 7.8 GB.
Dictionary:
82865442 2011-11-21 17:46 dictionary.file-0*
83269191 2011-11-21 17:46 dictionary.file-1*
10963133 2011-11-21 17:46 dictionary.file-2*
Freq files:
37160153 2011-11-21 22:35 frequency.file-0*
37160173 2011-11-21 22:35 frequency.file-1*
37160173 2011-11-21 22:35 frequency.file-2*
31407713 2011-11-21 22:35 frequency.file-3*
Total dir size for seq2sparse: du -s seq2sparse/
30923564 seq2sparse/
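As a crude extrapolation from the measurements above (an editorial sketch that assumes disk use scales linearly with document count, which it won't exactly):

```python
# Linear extrapolation from the run above: ~48K items produced ~7.8 GB
# of tfidf-vectors. Real growth is not perfectly linear -- dictionary
# growth, document length, and compression all matter -- and the ASF
# emails measured here may differ in size from the target documents.
measured_docs = 48_000
measured_tfidf_gb = 7.8

def estimate_tfidf_gb(target_docs):
    return measured_tfidf_gb * target_docs / measured_docs

# Low end of Ioan's scenario: 10 million accounts x 100 docs each.
print(f"~{estimate_tfidf_gb(10_000_000 * 100) / 1024:.0f} TiB")
```

Useful only for order-of-magnitude planning; rerun the pipeline on a sample of the real corpus before buying disks.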
More as they become available.
HTH,
Grant
On Nov 21, 2011, at 3:57 AM, Ioan Eugen Stan wrote:
>> I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/. Or, you can go run them yourself!
>
> I think posting some reference data for the jobs will be great. I will have something to compare to when I have something done. In the meantime I will try to get a quick and dirty implementation working and see how things move and post my findings. This could take a while as I depend on some modifications.
>
>> Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less. Then, it is just a question of how many vectors you have and the sparseness. This probably could be guessed at by looking at what the average number of words are in your email collection. Naturally, attachments may skew this if you are including them.
>
> I also suspect that things will level off asymptotically after a certain number of documents; it remains to be seen where that threshold is.
>
>> That has been my experience, too. Seq2Sparse is often the long part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.
>
> I will be able to use a map reduce job to create vectors or just create them as an indexing step so I hope this step will not count when considering the effective clustering time.
>
>> I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.
>>
>> -Grant
>
> I don't know about encoded vectors yet, I hope to get some more info on them from Mahout in Action. If they do what I think they do, I will definitely try them, and probably complain on the list (Ted) if I can't interpret them right :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Re: clustering hardware requirements
Posted by Ioan Eugen Stan <st...@gmail.com>.
On 22.11.2011 04:32, Lance Norskog wrote:
> Ioan- when you understand them, please explain them here:
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats
>
Ok, I'll remember that.
Re: clustering hardware requirements
Posted by Lance Norskog <go...@gmail.com>.
Ioan- when you understand them, please explain them here:
https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats
On Mon, Nov 21, 2011 at 12:57 AM, Ioan Eugen Stan <st...@gmail.com>wrote:
> I'll try in the next few days to track down the numbers from running the
>> stuff in my recent IBM article:
>> http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.
>> Or, you can go run them yourself!
>>
>
> I think posting some reference data for the jobs will be great. I will
> have something to compare to when I have something done. In the meantime I
> will try to get a quick and dirty implementation working and see how things
> move and post my findings. This could take a while as I depend on some
> modifications.
>
>
> Otherwise, I don't know that we have any formula just yet. I suspect
>> that once you reach a certain number of documents, your dictionary will
>> stop growing, more or less. Then, it is just a question of how many
>> vectors you have and the sparseness. This probably could be guessed at by
>> looking at what the average number of words are in your email collection.
>> Naturally, attachments may skew this if you are including them.
>>
>
> I also suspect that things will level off asymptotically after a certain
> number of documents; it remains to be seen where that threshold is.
>
>
> That has been my experience, too. Seq2Sparse is often the long part. I
>> suspect one could get it done a lot faster in Lucene.
>> SequenceFilesFromDirectory is also slow, but that is inherently sequential.
>>
>
> I will be able to use a map reduce job to create vectors or just create
> them as an indexing step so I hope this step will not count when
> considering the effective clustering time.
>
>
> I haven't explored yet what it would mean to use Encoded vectors in
>> Clustering, but perhaps I can call Ted to the front of the class and see if
>> he has thoughts on whether that even makes sense, as that would give you a
>> fixed size Vector.
>>
>> -Grant
>>
>
> I don't know about encoded vectors yet, I hope to get some more info on
> them from Mahout in Action. If they do what I think they do, I will
> definitely try them, and probably complain on the list (Ted) if I can't
> interpret them right :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan
>
--
Lance Norskog
goksron@gmail.com
Re: clustering hardware requirements
Posted by Ioan Eugen Stan <st...@gmail.com>.
> I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/. Or, you can go run them yourself!
I think posting some reference data for the jobs will be great. I will
have something to compare to when I have something done. In the meantime
I will try to get a quick and dirty implementation working and see
how things move and post my findings. This could take a while as I
depend on some modifications.
> Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less. Then, it is just a question of how many vectors you have and the sparseness. This probably could be guessed at by looking at what the average number of words are in your email collection. Naturally, attachments may skew this if you are including them.
I also suspect that things will level off asymptotically after a certain
number of documents; it remains to be seen where that threshold is.
> That has been my experience, too. Seq2Sparse is often the long part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.
I will be able to use a map reduce job to create vectors or just create
them as an indexing step so I hope this step will not count when
considering the effective clustering time.
> I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.
>
> -Grant
I don't know about encoded vectors yet, I hope to get some more info on
them from Mahout in Action. If they do what I think they do, I will
definitely try them, and probably complain on the list (Ted) if I can't
interpret them right :).
Thanks for the reply,
--
Ioan Eugen Stan
Re: clustering hardware requirements
Posted by Ted Dunning <te...@gmail.com>.
It is a great idea except that the centroids become harder to interpret.
Not much harder. Just a bit harder.
On Fri, Nov 18, 2011 at 9:44 AM, Grant Ingersoll <gs...@apache.org>wrote:
> I haven't explored yet what it would mean to use Encoded vectors in
> Clustering, but perhaps I can call Ted to the front of the class and see if
> he has thoughts on whether that even makes sense, as that would give you a
> fixed size Vector.
>
Re: clustering hardware requirements
Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote:
> Hello,
>
> I have to figure out how much hardware is required to do clustering
> for my company on about 10+ million user accounts, each with 100-5000
> documents. The documents will be indexed so vector creation will be
> done at indexing.
> Is there any formula to approximate the size of the vectors based on
> the index size? I'm looking for rough estimates (how much extra disk
> space should I consider?).
I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/. Or, you can go run them yourself!
Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less. Then, it is just a question of how many vectors you have and the sparseness. This probably could be guessed at by looking at what the average number of words are in your email collection. Naturally, attachments may skew this if you are including them.
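Grant's point that the dictionary "will stop growing, more or less" can be sketched with Heaps' law, under which vocabulary grows sublinearly in the number of tokens seen. This is an editorial illustration, not something Mahout computes, and the k/beta values are typical English-text ballparks, not fitted ones:

```python
# Heaps' law: roughly V(n) = k * n**beta distinct terms after n tokens.
# k=44, beta=0.49 are common ballpark values for English text; fit them
# on a sample of your own corpus before trusting the numbers.
def heaps_vocab(total_tokens, k=44.0, beta=0.49):
    return int(k * total_tokens ** beta)

for tokens in (10**6, 10**8, 10**10):
    print(f"{tokens:>14,} tokens -> ~{heaps_vocab(tokens):,} distinct terms")
```

The vocabulary never truly plateaus, but doubling the corpus adds ever fewer new terms, which is the "more or less" described above.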
>
> Which are the most time consuming tasks? From my experience with
> clustering, the index/vector creation part is the most time consuming,
> with clustering second. Does anyone have some data on how
> much time a clustering job takes?
That has been my experience, too. Seq2Sparse is often the long part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.
I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.
-Grant
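To make the "fixed size Vector" point concrete: Mahout's encoded vectors hash features into a vector of fixed cardinality. Below is a minimal editorial sketch of the underlying hashing trick, not Mahout's actual encoder (which uses multiple probes and its own hash functions):

```python
import hashlib

CARDINALITY = 5000  # fixed vector size, matching the cardinality=5000 run above

def encode(tokens, cardinality=CARDINALITY):
    """Hash each token into a fixed-size vector (simplified hashing trick)."""
    vec = [0.0] * cardinality
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % cardinality] += 1.0
    return vec

v = encode("hello mahout clustering hello".split())
assert len(v) == CARDINALITY  # size is fixed regardless of vocabulary
assert sum(v) == 4.0          # four tokens hashed in
```

Hash collisions mean a centroid coordinate can mix several terms, which is presumably why Ted says the centroids become "a bit harder" to interpret.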