Posted to user@mahout.apache.org by Vasil Vasilev <va...@gmail.com> on 2011/04/21 06:08:47 UTC

LDA related enhancements

Hi Mahouters,

I was experimenting with the LDA clustering algorithm on the Reuters data
set and made several enhancements which, if you find them interesting, I
could contribute to the project:

1. Created a term-frequency vector pruner: LDA uses the tf vectors produced
by seq2sparse, not the tf-idf ones, so words like "and", "where", etc. also
end up in the resulting topics. To prevent that, I run seq2sparse with the
full tf-idf sequence and then run the "pruner". It first calculates the
standard deviation of the words' document frequencies and then prunes every
entry in the tf vectors whose document frequency is bigger than 3 times the
calculated standard deviation. This keeps most of the word population while
still pruning the unnecessary words.
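
To make the criterion concrete, here is a minimal sketch of the pruning
rule (plain Java; the class and method names are made up for this mail and
are not the ones used in the patch):

import java.util.HashMap;
import java.util.Map;

// Keep a term only if its document frequency is at most
// 3 * stddev of all document frequencies in the dictionary.
public class DfStdDevPruner {
  public static Map<String, Integer> prune(Map<String, Integer> docFreqs) {
    double mean = 0.0;
    for (int df : docFreqs.values()) {
      mean += df;
    }
    mean /= docFreqs.size();
    double variance = 0.0;
    for (int df : docFreqs.values()) {
      variance += (df - mean) * (df - mean);
    }
    double cutoff = 3.0 * Math.sqrt(variance / docFreqs.size());
    Map<String, Integer> kept = new HashMap<String, Integer>();
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      if (e.getValue() <= cutoff) {  // prune entries above the cutoff
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}

The actual pruner then drops the pruned terms' entries from the tf vectors.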

2. Implemented the alpha-estimation part of the LDA algorithm as described
in the Blei, Ng and Jordan paper. This maximizes the log-likelihood better
for the same number of iterations. Just one example: for 20 iterations on
the Reuters data set the enhanced algorithm reaches a value of
-6975124.693072233, compared to -7304552.275676554 with the original
implementation.
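
The core of it is the Newton-Raphson update for a symmetric alpha from
Appendix A of the paper. Roughly, and assuming Commons Math for the
digamma/trigamma functions (all names below are illustrative, not the
patch's API):

import org.apache.commons.math.special.Gamma;

// Newton-Raphson update for a symmetric Dirichlet alpha, following
// Blei, Ng and Jordan (2003), Appendix A. suffStat is
// sum_d sum_k (digamma(gamma_dk) - digamma(sum_j gamma_dj)),
// collected during the inference pass over the corpus.
public final class AlphaEstimator {
  public static double estimate(double initAlpha, double suffStat,
                                int numDocs, int numTopics) {
    double logAlpha = Math.log(initAlpha);
    for (int i = 0; i < 100; i++) {
      double alpha = Math.exp(logAlpha);
      // first and second derivatives of the likelihood bound w.r.t. alpha
      double df = numDocs * numTopics
          * (Gamma.digamma(numTopics * alpha) - Gamma.digamma(alpha))
          + suffStat;
      double d2f = numDocs * numTopics
          * (numTopics * Gamma.trigamma(numTopics * alpha)
              - Gamma.trigamma(alpha));
      // Newton step in log space keeps alpha positive
      logAlpha -= df / (d2f * alpha + df);
      if (Math.abs(df) < 1e-5) {
        break;
      }
    }
    return Math.exp(logAlpha);
  }
}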

3. Created an LDA Vectorizer. It executes only the inference part of the
LDA algorithm, based on the last LDA state and the input document vectors,
and for each input vector produces a vector of the gammas that result from
the inference. The idea is that vectors produced this way can be used for
clustering with any of the existing algorithms (canopy, k-means, etc.).
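
To make the idea concrete, the per-document inference loop looks roughly
like this (a sketch only - the real code reads the persisted LDA state and
works on Mahout vectors, and all names below are made up):

import java.util.Arrays;
import org.apache.commons.math.special.Gamma;

// Inference-only step: with the topic-word log probabilities
// (logBeta, numTopics x numTerms) fixed from the last LDA state,
// iterate the variational updates for one document and return gamma.
public final class LdaVectorizerSketch {
  public static double[] inferGamma(int[] termIds, double[] termCounts,
                                    double[][] logBeta, double alpha) {
    int numTopics = logBeta.length;
    double docLength = 0.0;
    for (double c : termCounts) {
      docLength += c;
    }
    double[] gamma = new double[numTopics];
    Arrays.fill(gamma, alpha + docLength / numTopics);  // as in the paper
    for (int iter = 0; iter < 50; iter++) {  // fixed count for simplicity
      double[] next = new double[numTopics];
      Arrays.fill(next, alpha);
      for (int n = 0; n < termIds.length; n++) {
        // phi_nk is proportional to exp(digamma(gamma_k)) * beta_{k,w_n}
        double[] logPhi = new double[numTopics];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < numTopics; k++) {
          logPhi[k] = Gamma.digamma(gamma[k]) + logBeta[k][termIds[n]];
          max = Math.max(max, logPhi[k]);
        }
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          norm += Math.exp(logPhi[k] - max);
        }
        // gamma_k = alpha + sum_n count_n * phi_nk
        for (int k = 0; k < numTopics; k++) {
          next[k] += termCounts[n] * Math.exp(logPhi[k] - max) / norm;
        }
      }
      gamma = next;
    }
    return gamma;  // one such vector per document goes to canopy/k-means
  }
}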

Regards, Vasil

Re: LDA related enhancements

Posted by Jake Mannix <ja...@gmail.com>.
Vasil,

  This sounds great!

On Wed, Apr 20, 2011 at 9:08 PM, Vasil Vasilev <va...@gmail.com> wrote:

> Hi Mahouters,
>
> 1. Created a term-frequency vector pruner: LDA uses the tf vectors
> produced by seq2sparse, not the tf-idf ones, so words like "and", "where",
> etc. also end up in the resulting topics. To prevent that, I run
> seq2sparse with the full tf-idf sequence and then run the "pruner". It
> first calculates the standard deviation of the words' document frequencies
> and then prunes every entry in the tf vectors whose document frequency is
> bigger than 3 times the calculated standard deviation. This keeps most of
> the word population while still pruning the unnecessary words.
>

If you could optionally add this to the seq2sparse functionality in
general, it would be better than the minDf / maxDf way we currently do
this.
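
I'm imagining something like an optional sigma-based cutoff flag on
seq2sparse, say --maxDFSigma 3.0 (flag name invented here, just to sketch
the idea), living alongside the current minDf / maxDf options.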


> 2. Implemented the alpha-estimation part of the LDA algorithm as
> described in the Blei, Ng and Jordan paper. This maximizes the
> log-likelihood better for the same number of iterations. Just one example:
> for 20 iterations on the Reuters data set the enhanced algorithm reaches a
> value of -6975124.693072233, compared to -7304552.275676554 with the
> original implementation.
>

Awesome.


> 3. Created an LDA Vectorizer. It executes only the inference part of the
> LDA algorithm, based on the last LDA state and the input document vectors,
> and for each input vector produces a vector of the gammas that result from
> the inference. The idea is that vectors produced this way can be used for
> clustering with any of the existing algorithms (canopy, k-means, etc.).
>

Yeah, I've got code which does this too, and I keep meaning to clean it up
for submission, but if yours is ready to go, file a JIRA and submit a patch! :)

The gamma vector is totally helpful; it lets you do LSI-style search as
well.

  -jake

Re: LDA related enhancements

Posted by Vasil Vasilev <va...@gmail.com>.
The patch for pruning words with high document frequencies is ready:
https://issues.apache.org/jira/browse/MAHOUT-688

On Thu, Apr 28, 2011 at 5:08 PM, Vasil Vasilev <va...@gmail.com> wrote:

> Also the topic regularization patch is ready:
> https://issues.apache.org/jira/browse/MAHOUT-684
>
>
> On Thu, Apr 28, 2011 at 10:53 AM, Vasil Vasilev <va...@gmail.com> wrote:
>
>> Hi all,
>>
>> The LDA Vectorization patch is ready. You can take a look at:
>> https://issues.apache.org/jira/browse/MAHOUT-683
>>
>> Regards, Vasil
>>
>> On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <va...@gmail.com> wrote:
>>
>>> Ok. I am going to try out 1) suggested by Jake, then write a couple of
>>> tests and then I will file the JIRAs.

Re: LDA related enhancements

Posted by Vasil Vasilev <va...@gmail.com>.
Also the topic regularization patch is ready:
https://issues.apache.org/jira/browse/MAHOUT-684

On Thu, Apr 28, 2011 at 10:53 AM, Vasil Vasilev <va...@gmail.com> wrote:

> Hi all,
>
> The LDA Vectorization patch is ready. You can take a look at:
> https://issues.apache.org/jira/browse/MAHOUT-683
>
> Regards, Vasil
>
> On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <va...@gmail.com> wrote:
>
>> Ok. I am going to try out 1) suggested by Jake, then write a couple of
>> tests and then I will file the JIRAs.

Re: LDA related enhancements

Posted by Vasil Vasilev <va...@gmail.com>.
Hi all,

The LDA Vectorization patch is ready. You can take a look at:
https://issues.apache.org/jira/browse/MAHOUT-683

Regards, Vasil
On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <va...@gmail.com> wrote:

> Ok. I am going to try out 1) suggested by Jake, then write a couple of
> tests and then I will file the JIRAs.

Re: LDA related enhancements

Posted by Vasil Vasilev <va...@gmail.com>.
Ok. I am going to try out 1) suggested by Jake, then write a couple of
tests and then I will file the JIRAs.

On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> As Jake says, this all sounds great.  Please see:
> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>
>

Re: LDA related enhancements

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:

> Hi Mahouters,
> 
> I was experimenting with the LDA clustering algorithm on the Reuters data
> set and made several enhancements which, if you find them interesting, I
> could contribute to the project:
>
> 1. Created a term-frequency vector pruner: LDA uses the tf vectors
> produced by seq2sparse, not the tf-idf ones, so words like "and", "where",
> etc. also end up in the resulting topics. To prevent that, I run
> seq2sparse with the full tf-idf sequence and then run the "pruner". It
> first calculates the standard deviation of the words' document frequencies
> and then prunes every entry in the tf vectors whose document frequency is
> bigger than 3 times the calculated standard deviation. This keeps most of
> the word population while still pruning the unnecessary words.
>
> 2. Implemented the alpha-estimation part of the LDA algorithm as
> described in the Blei, Ng and Jordan paper. This maximizes the
> log-likelihood better for the same number of iterations. Just one example:
> for 20 iterations on the Reuters data set the enhanced algorithm reaches a
> value of -6975124.693072233, compared to -7304552.275676554 with the
> original implementation.
>
> 3. Created an LDA Vectorizer. It executes only the inference part of the
> LDA algorithm, based on the last LDA state and the input document vectors,
> and for each input vector produces a vector of the gammas that result from
> the inference. The idea is that vectors produced this way can be used for
> clustering with any of the existing algorithms (canopy, k-means, etc.).
> 

As Jake says, this all sounds great.  Please see: https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute