Posted to user@mahout.apache.org by "Bae, Jae Hyeon" <me...@gmail.com> on 2011/03/10 07:28:09 UTC

Question about LDA parameter estimation

Hi

I am studying the LDA algorithm for my statistics project. The goal is to
fully understand the LDA algorithm and the statistical concepts behind it,
and to analyze an implementation. I've chosen the Mahout LDA implementation
because it's scalable and well-documented.

According to the original paper by Blei, Ng, and Jordan, the parameters
(alpha, beta) should be estimated with a variational EM method, but I can't
find any numerical method in the code that optimizes those parameters. In the
Mahout implementation, alpha is a topic-smoothing value supplied by the user,
and beta is just P(word|topic), not estimated.

I think this implementation makes a simplifying assumption. I would like to
know whether there was a specific reason to implement it this way, without
parameter estimation.

Thank you

Best, Jay

Re: Question about LDA parameter estimation

Posted by Vasil Vasilev <va...@gmail.com>.
Hi Ted,

I spent some time playing around with LDA.
At first I thought an approach similar to the Dirichlet one could be applied,
but I couldn't find a way to do it.
Then I decided to implement the alpha estimation technique described in the
paper by Blei, Ng and Jordan. Here are the results on the Reuters data set
(20 iterations):

1. Execution of the old LDA algorithm (currently implemented in Mahout)

Initial LL: -9562019.417431084
LL on 19-th iteration: -7308477.774407879
LL on 20-th iteration: -7304552.275676554
Rel Change: 5.3711577875640971379064282317523e-4
Execution time: 1h 3m

2. Execution of the new LDA algorithm (with alpha estimation) with fixed
initial alpha:

Initial LL: -9563341.06579829
LL on 19-th iteration: -7259696.726658782
LL on 20-th iteration: -7206975.427212661
Rel Change: 0.0072621903408884631309388965350663
Execution time: 57m

3. Execution of the new LDA algorithm (with alpha estimation) with
randomized initial alpha:

Initial LL: -9591528.727727547
LL on 19-th iteration: -6982928.298954172
LL on 20-th iteration: -6975124.693072233
Rel Change: 0.0011175262794990663549897755026221
Execution time: 1h 13m
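
For clarity, the "Rel Change" figures above match the relative change of the
log-likelihood between the last two iterations. A minimal sketch of that
computation (the class and method names are made up for illustration):

public final class LLStats {

  private LLStats() { }

  /** Relative change of the log-likelihood between two consecutive iterations. */
  public static double relativeChange(double previousLL, double currentLL) {
    // e.g. for run 1: |(-7304552.2757) - (-7308477.7744)| / |-7308477.7744| ~= 5.37e-4
    return Math.abs(currentLL - previousLL) / Math.abs(previousLL);
  }
}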

The difference between 2 and 3 is that in 3 the alpha for each topic is drawn
as a random number between 0 and the topic-smoothing parameter.
What stands out is that 2 and 3 reach better (less negative) LL values than 1
and, given their larger relative change at the end (compared with 1), they
still have potential to improve further.

Selecting randomized alphas gives better results. What could be the reason
for that? One thought that comes to mind is that randomized alphas account
for the fact that not all topics generate the same number of words in the
corpus.

Also, the following was observed about the alphas:
- If alpha is chosen small (50/numOfTopics), it decreases during execution.
- If alpha is chosen high (100.0, for example), it increases during
  execution. The algorithm converges faster but at a much higher LL, and the
  resulting topics are not very meaningful.
- For randomly selected alphas, even though the initial values may differ a
  lot, after the 20-th iteration they are very close (ranging between 0.04
  and 1.1). This last observation means that introducing alpha estimation
  does not turn this into a non-parametric algorithm.
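
For reference, here is a minimal sketch of the kind of Newton-Raphson alpha
update described in the appendix of the Blei, Ng and Jordan paper, restricted
to a single symmetric alpha (the per-topic variant needs the full vector
update). This is only an illustration, not Mahout code: the class name and
the way the sufficient statistic is passed in are made up, and it assumes the
digamma/trigamma functions from Apache Commons Math are on the classpath.

import org.apache.commons.math.special.Gamma;

/**
 * Illustrative only: Newton-Raphson update for a symmetric Dirichlet alpha.
 * ss is the sufficient statistic collected in the E-step:
 *   sum over documents d and topics k of
 *   (digamma(gamma_dk) - digamma(sum_j gamma_dj)).
 */
public final class AlphaEstimator {

  private static final int MAX_ITER = 1000;
  private static final double NEWTON_THRESH = 1e-5;

  private AlphaEstimator() { }

  // first derivative of the alpha-dependent part of the variational bound
  private static double dL(double a, double ss, int numDocs, int numTopics) {
    return numDocs * numTopics * (Gamma.digamma(numTopics * a) - Gamma.digamma(a)) + ss;
  }

  // second derivative of the alpha-dependent part of the variational bound
  private static double d2L(double a, int numDocs, int numTopics) {
    return numDocs * numTopics * (numTopics * Gamma.trigamma(numTopics * a) - Gamma.trigamma(a));
  }

  /** Optimizes alpha in log space so the Newton steps keep it positive. */
  public static double optimizeAlpha(double ss, int numDocs, int numTopics) {
    double initA = 100.0;
    double logA = Math.log(initA);
    for (int iter = 0; iter < MAX_ITER; iter++) {
      double a = Math.exp(logA);
      if (Double.isNaN(a)) {        // restart from a larger value if a step blew up
        initA *= 10.0;
        a = initA;
        logA = Math.log(a);
      }
      double df = dL(a, ss, numDocs, numTopics);
      if (Math.abs(df) < NEWTON_THRESH) {
        break;
      }
      double d2f = d2L(a, numDocs, numTopics);
      logA -= df / (d2f * a + df);  // Newton step taken on log(alpha)
    }
    return Math.exp(logA);
  }
}

The asymmetric (per-topic) case replaces this scalar update with the
linear-time Newton step over the full alpha vector given in the paper's
appendix.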

Regards, Vasil

Re: Question about LDA parameter estimation

Posted by Ted Dunning <te...@gmail.com>.
Yes. It would be possible to build a non-parametric implementation. Probably
simpler is to over-provision the number of topics and apply some forceful
regularization. This is much like what we do with our Dirichlet Process
clustering.

Re: Question about LDA parameter estimation

Posted by Vasil Vasilev <va...@gmail.com>.
Hi David,

Doesn't the fact that the alpha parameters are not learned make the algorithm
very dependent on the number of topics provided by the user (-k <numTopics>)?
That is, it can learn how words are distributed across topics, but it cannot
learn the correct number of topics. Maybe an approach similar to the one
implemented in the Dirichlet algorithm could be used, which starts with an
initial prior alpha and then refines the number of "meaningful" topics
depending on how many words each topic has collected (i.e. the fewer words a
topic has attracted, the less probable that topic becomes as a whole).
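
To make the idea concrete, here is a hypothetical sketch (not an existing
Mahout API; the names are invented) of re-weighting the per-topic prior by
the number of words each topic has attracted, so topics that collect few
words become improbable:

public final class TopicPriorReweighting {

  private TopicPriorReweighting() { }

  /**
   * @param topicWordMass expected number of words assigned to each topic in
   *                      the last iteration
   * @param alpha0        small base weight that keeps empty topics alive
   * @return per-topic prior weights, normalized to sum to 1
   */
  public static double[] reweight(double[] topicWordMass, double alpha0) {
    double[] prior = new double[topicWordMass.length];
    double total = 0.0;
    for (int k = 0; k < prior.length; k++) {
      prior[k] = alpha0 + topicWordMass[k];  // topics that attracted few words keep only the base weight
      total += prior[k];
    }
    for (int k = 0; k < prior.length; k++) {
      prior[k] /= total;                     // normalize into a distribution over topics
    }
    return prior;
  }
}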

Regards, Vasil

Re: Question about LDA parameter estimation

Posted by David Hall <dl...@cs.berkeley.edu>.
err, Jae, sorry.

-- David

Re: Question about LDA parameter estimation

Posted by David Hall <dl...@cs.berkeley.edu>.
Hi Bae,

We only try to obtain MLE's of p(word|topic) (beta), and we treat
alpha and eta as fixed. As you say, those could be learned, and it
might improve performance, but it's just not implemented.

There's no particular reason they're not implemented, but they're not
critical to getting basic LDA working, especially MAP estimation of
\beta.
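
As a rough illustration of that M-step (not Mahout's actual classes; the
names and the shape of the expected-count matrix are assumed), the MAP
estimate of beta with a fixed smoothing eta looks like this:

public final class BetaMStep {

  private BetaMStep() { }

  /**
   * @param expectedCounts expected topic-word counts accumulated during the
   *                       E-step, indexed as [topic][word]
   * @param eta            fixed smoothing parameter (acts as a Dirichlet prior on beta)
   * @return beta[topic][word] = p(word | topic)
   */
  public static double[][] estimateBeta(double[][] expectedCounts, double eta) {
    double[][] beta = new double[expectedCounts.length][];
    for (int k = 0; k < expectedCounts.length; k++) {
      double[] counts = expectedCounts[k];
      beta[k] = new double[counts.length];
      double norm = 0.0;
      for (int w = 0; w < counts.length; w++) {
        beta[k][w] = counts[w] + eta;   // smoothing keeps nonzero mass on every word
        norm += beta[k][w];
      }
      for (int w = 0; w < counts.length; w++) {
        beta[k][w] /= norm;             // each topic becomes a distribution over words
      }
    }
    return beta;
  }
}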

-- David
