Posted to user@mahout.apache.org by David Starina <da...@gmail.com> on 2016/03/10 11:43:40 UTC

LDA - help me understand

Hi,

I realize MapReduce algorithms are not the "hot new stuff" anymore, but I
am playing around with LDA. I am running into memory problems - can you
suggest how to set the parameters to make this work?

I am running on a virtual cluster on my laptop - two nodes with 3 GB of
memory each - just to prepare before I try this on a physical cluster with a
much larger data set. I am using a data set of 500 documents, averaging
around 120 kB each, with roughly 60,000 terms. Running this with 20 topics
works fine - but with 100 topics I run out of memory (on the mappers). Can
you suggest how to set the parameters so that it runs more mappers that each
consume less memory?
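
To show what I mean by "more mappers": my understanding is that the number of
map tasks is driven by the input split size, so capping the maximum split size
should yield more, smaller map tasks. A minimal sketch, using the standard
MapReduce v2 property name and a placeholder 32 MB value - I am not sure how
much this actually helps for CVB0 if most of a mapper's memory is the cached
topic model rather than the input split:

import org.apache.hadoop.conf.Configuration;

// Sketch only: ask for more, smaller map tasks by capping the input split size.
// The 32 MB value is a placeholder to experiment with.
Configuration conf = new Configuration();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024);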

The error I get:
Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
Container [pid=26283,containerID=container_1457214584155_0074_01_000003] is
running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing
container.

These are the parameters I set for CVB0Driver:

static int numTopics = 100;                  // number of latent topics
static double doc_topic_smoothening = 0.5;   // alpha: document-topic smoothing prior
static double term_topic_smoothening = 0.5;  // eta: topic-term smoothing prior

static int maxIter = 3;                      // maximum number of global (MapReduce) iterations
static int iteration_block_size = 10;        // iterations between perplexity checks
static double convergenceDelta = 0;          // 0 = no early stopping on convergence
static float testFraction = 0.0f;            // fraction of documents held out for perplexity
static int numTrainThreads = 4;              // training threads per mapper
static int numUpdateThreads = 1;             // model-update threads per mapper
static int maxItersPerDoc = 3;               // inference iterations per document per pass
static int numReduceTasks = 10;              // number of reduce tasks
static boolean backfillPerplexity = false;   // don't back-fill missing perplexity values

Any suggestions? Should I enlarge the container size on Hadoop, or can
I fix this with the LDA parameters?
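
If enlarging the container is the way to go, this is roughly what I would try
(standard YARN / MapReduce v2 property names; the 2048 MB container and the
-Xmx1638m heap are placeholder values to tune, keeping the heap at about 80%
of the container size):

import org.apache.hadoop.conf.Configuration;

// Sketch: give each map task a bigger YARN container and a matching JVM heap.
Configuration conf = new Configuration();
conf.set("mapreduce.map.memory.mb", "2048");       // container size for map tasks, in MB
conf.set("mapreduce.map.java.opts", "-Xmx1638m");  // JVM heap inside that container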

Cheers,
David

Re: LDA - help me understand

Posted by David Starina <da...@gmail.com>.
About the last question: it probably has something to do with setting the
maximum number of iterations and the maximum iterations per document to the
same value ... What does the "number of iterations per document" parameter
really do?

--David

Re: LDA - help me understand

Posted by David Starina <da...@gmail.com>.
There is one more weird thing I cannot understand ...

When running only one iteration of LDA, the iteration took 88 seconds. When
running 20 iterations with exactly the same code, on the same documents and
with the same parameters, it took 8683 seconds - which is about 434 seconds
per iteration. Is there something I don't understand about this algorithm?
Why would a single iteration take that much longer just because more
iterations are run?

--David

Re: LDA - help me understand

Posted by David Starina <da...@gmail.com>.
How does the memory requirement grow with the number of topics? A little
experimentation shows me that the number of documents doesn't matter as much
as the number of topics ... Does the memory requirement grow exponentially
with the number of topics?
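
A back-of-the-envelope check, assuming each mapper holds a dense
numTerms x numTopics matrix of doubles (which is my understanding of what the
CVB0 mappers cache), suggests the growth should be linear in the number of
topics, not exponential:

// Rough estimate under the dense-matrix assumption above.
long numTerms = 60000;
long numTopics = 100;
long bytes = numTerms * numTopics * 8;  // 8-byte doubles
System.out.printf("~%d MB per model copy%n", bytes / (1024 * 1024));  // roughly 46 MB
// The trainer may keep more than one copy of the model, plus per-thread
// working buffers, so the real per-mapper footprint is a small multiple of this.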

--David
