Posted to user@mahout.apache.org by Chris McConnell <c....@gmail.com> on 2011/05/03 00:04:42 UTC

LDA from Lucene Indexes

Hello all,

We are looking at utilizing LDA for some topic trending off some
pre-built Lucene indexes. I've put the command(s) and output below.
From searching around, it seems a lot of people are unable to get this to
work properly. Most answers tell the user to review the example
"build-reuters.sh" but that doesn't utilize a Lucene index for the
input.

The dictionary is created (on local disk) and an attempt at vector
creation is made on HDFS; however, no vectors are written out. I'm
interested to know if anyone has actually gotten this to work on
Mahout 0.4. I have then (just for testing purposes) tried to run the
actual LDA on the created directories, though I wouldn't expect it to
work since no vectors were created.

Thanks,
Chris

bin/mahout lucene.vector --dir /home/index_for_mahout/ --output
/user/vectored_lucene_index --dictOut
/home/vectored_lucene_index/dict.out --weight TF --field content
11/05/02 17:23:57 INFO lucene.Driver: Output File: /user/vectored_lucene_index
11/05/02 17:23:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/05/02 17:23:57 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
11/05/02 17:23:57 INFO compress.CodecPool: Got brand-new compressor
11/05/02 17:23:58 INFO lucene.Driver: Wrote: 0 vectors
11/05/02 17:23:58 INFO lucene.Driver: Dictionary Output file:
/home/vectored_lucene_index/dict.out
11/05/02 17:23:58 INFO driver.MahoutDriver: Program took 578 ms

Re: LDA from Lucene Indexes

Posted by Chris McConnell <c....@gmail.com>.
Very interesting, thanks for the pointer Vasil! We've been playing
around a bit, even removing some of the stop words during the Lucene
index creation, which is helping a bit as well.

Thanks again for this link.

Best,
Chris

On Wed, May 11, 2011 at 11:10 AM, Vasil Vasilev <va...@gmail.com> wrote:
> Hi Chris,
>
> I had a similar problem to what you describe. It turned out that many of the
> words I wanted to "stop" are also words with high document frequency.
> In order to avoid these words one option is to use maxDFPercent, but there
> are two issues with this:
> 1. You should know exactly what percentage to select
> 2. It works only on the tfidf vectors and not on the tf ones (LDA uses the
> latter)
>
> You can take a look at
> https://issues.apache.org/jira/browse/MAHOUT-688 which provides one
> possible solution.
>
> On Thu, May 5, 2011 at 4:27 PM, Chris McConnell
> <c....@gmail.com> wrote:
>
>> Hi guys,
>>
>> I'm jumping back as the later emails jump into expansions (all of
>> which sound great), but I wanted to give this a better link back to
>> the original question.
>>
>> This adjustment allowed me to get the vectors created, create the lda
>> input and grab the topics out of the final results.
>>
>> I'm curious if anyone has done testing with the parameters at all.
>> Obviously different data will lead to different parameter needs
>> (number of topics, smoothing, iterations, etc.) but I'm wondering
>> particularly about "stop words." I believe I ran across some older
>> questions in the mailing list about this, where users were curious if
>> they could be specified in Mahout, or if we should be doing so within
>> the Lucene index creation, others?
>>
>> Another thought I had, we have the dictionary output, if we were to
>> modify the dictionary to remove those stop words, would that have a
>> similar effect, or does the algorithm (haven't had a chance to dig
>> into it yet, so I apologize if this is obvious) require every word
>> within the vector to exist in the dictionary?
>>
>> Thanks for all the help, I'm excited this chain has gathered some
>> steam within the community to improve the algorithm(s) surrounding
>> LDA, as we (GE) feel this library has great potential.
>>
>> Best,
>> Chris
>>
>> bin/mahout lda -i /user/TopicTrending/ -o
>> /user/TopicTrending/lda_output/ -k 5 -v 50000
>>
>> On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <ja...@gmail.com>
>> wrote:
>> > Hi Chris,
>> >
>> >  That's what I thought.  This line needs to make sure you store
>> termvectors
>> > (see this article <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/> for more details):
>> >
>> > On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
>> > <c....@gmail.com> wrote:
>> >>
>> >> if (elementName.equals("doc")) {
>> >>     if (title && content) {
>> >>         doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>> >>         doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
>> >
>> >
>> > You want this to be:
>> >
>> > new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
>> > Field.TermVector.YES);
>> >
>> > Although technically, we could add the capability to take a Store.YES
>> field
>> > and re-tokenize and
>> > build vectors from this as well.
>> >
>> >  -jake
>> >
>>
>

Re: LDA from Lucene Indexes

Posted by Vasil Vasilev <va...@gmail.com>.
Hi Chris,

I had a similar problem to what you describe. It turned out that many of the
words I wanted to "stop" are also words with high document frequency.
In order to avoid these words one option is to use maxDFPercent, but there
are two issues with this:
1. You should know exactly what percentage to select
2. It works only on the tfidf vectors and not on the tf ones (LDA uses the
latter)

You can take a look at
https://issues.apache.org/jira/browse/MAHOUT-688 which provides one
possible solution.
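
For anyone who wants to experiment with maxDFPercent anyway, it is exposed by seq2sparse. A hypothetical invocation is sketched below; the paths are placeholders, and per the caveat above the pruning applies on the tf-idf side rather than the tf vectors that LDA consumes:

```shell
# Placeholder paths; -x is --maxDFPercent, which prunes terms that
# appear in more than the given percentage of documents.
bin/mahout seq2sparse \
  -i /user/sequence_text \
  -o /user/sparse_vectors \
  -wt tfidf \
  -x 80
```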

On Thu, May 5, 2011 at 4:27 PM, Chris McConnell
<c....@gmail.com> wrote:

> Hi guys,
>
> I'm jumping back as the later emails jump into expansions (all of
> which sound great), but I wanted to give this a better link back to
> the original question.
>
> This adjustment allowed me to get the vectors created, create the lda
> input and grab the topics out of the final results.
>
> I'm curious if anyone has done testing with the parameters at all.
> Obviously different data will lead to different parameter needs
> (number of topics, smoothing, iterations, etc.) but I'm wondering
> particularly about "stop words." I believe I ran across some older
> questions in the mailing list about this, where users were curious if
> they could be specified in Mahout, or if we should be doing so within
> the Lucene index creation, others?
>
> Another thought I had, we have the dictionary output, if we were to
> modify the dictionary to remove those stop words, would that have a
> similar effect, or does the algorithm (haven't had a chance to dig
> into it yet, so I apologize if this is obvious) require every word
> within the vector to exist in the dictionary?
>
> Thanks for all the help, I'm excited this chain has gathered some
> steam within the community to improve the algorithm(s) surrounding
> LDA, as we (GE) feel this library has great potential.
>
> Best,
> Chris
>
> bin/mahout lda -i /user/TopicTrending/ -o
> /user/TopicTrending/lda_output/ -k 5 -v 50000
>
> On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > Hi Chris,
> >
> >  That's what I thought.  This line needs to make sure you store
> termvectors
> > (see this article <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/> for more details):
> >
> > On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
> > <c....@gmail.com> wrote:
> >>
> >> if (elementName.equals("doc")) {
> >>     if (title && content) {
> >>         doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
> >>         doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
> >
> >
> > You want this to be:
> >
> > new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
> > Field.TermVector.YES);
> >
> > Although technically, we could add the capability to take a Store.YES
> field
> > and re-tokenize and
> > build vectors from this as well.
> >
> >  -jake
> >
>

Re: LDA from Lucene Indexes

Posted by Chris McConnell <c....@gmail.com>.
Hi guys,

I'm jumping back in, as the later emails move on to expansions (all of
which sound great), but I wanted to tie this back to
the original question.

This adjustment allowed me to get the vectors created, build the LDA
input, and grab the topics out of the final results.

I'm curious whether anyone has done testing with the parameters at all.
Obviously different data will lead to different parameter needs
(number of topics, smoothing, iterations, etc.), but I'm wondering
particularly about "stop words." I believe I ran across some older
questions on the mailing list about this, where users were curious
whether stop words could be specified in Mahout, or whether they should
be handled during Lucene index creation, or somewhere else entirely?

Another thought I had: we have the dictionary output. If we were to
modify the dictionary to remove those stop words, would that have a
similar effect, or does the algorithm (I haven't had a chance to dig
into it yet, so I apologize if this is obvious) require every word
within the vector to exist in the dictionary?

Thanks for all the help. I'm excited this chain has gathered some
steam within the community to improve the algorithm(s) surrounding
LDA, as we (GE) feel this library has great potential.

Best,
Chris

bin/mahout lda -i /user/TopicTrending/ -o
/user/TopicTrending/lda_output/ -k 5 -v 50000

On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <ja...@gmail.com> wrote:
> Hi Chris,
>
>  That's what I thought.  This line needs to make sure you store termvectors
> (see this article <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/> for
> more details):
>
> On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
> <c....@gmail.com> wrote:
>>
>> if (elementName.equals("doc")) {
>>     if (title && content) {
>>         doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>>         doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
>
>
> You want this to be:
>
> new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
> Field.TermVector.YES);
>
> Although technically, we could add the capability to take a Store.YES field
> and re-tokenize and
> build vectors from this as well.
>
>  -jake
>

Re: LDA from Lucene Indexes

Posted by Grant Ingersoll <gs...@apache.org>.
On May 4, 2011, at 2:31 PM, Jake Mannix wrote:

> On Wed, May 4, 2011 at 10:46 AM, Ted Dunning <te...@gmail.com> wrote:
> 
>> Pipelining is good for abstraction and really bad for performance (in the
>> map-reduce world).
>> 
>> My thought is that we could have a multipurpose tool.  Input would be a
>> lucene index and the program would read term vectors or original text as
>> available.  Output would be either sequence file full of text or sequence
>> file full of vectors.
>> 
> 
> Ok, sure, then this is modifying the lucene.vectors code, not the
> seq2sparse code, right?

Easiest is to dump to text and then use seq2sparse which has all of the functionality for tokenizing, etc.   As Jake said, it's about 5 lines of code plus boilerplate.  I think I even have some lying around somewhere.

If we go the route suggested here by Ted, we likely
should refactor both lucene.vector and seq2sparse to share a piece for doing the analysis. After all, it's entirely feasible that one would want to postprocess what comes out of the term vector too (for instance, if it wasn't stemmed before, or if you wanted more aggressive stopword removal).

-Grant


Re: LDA from Lucene Indexes

Posted by Ted Dunning <te...@gmail.com>.
Good point.

On Wed, May 4, 2011 at 11:31 AM, Jake Mannix <ja...@gmail.com> wrote:

> On Wed, May 4, 2011 at 10:46 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Pipelining is good for abstraction and really bad for performance (in the
> > map-reduce world).
> >
> > My thought is that we could have a multipurpose tool.  Input would be a
> > lucene index and the program would read term vectors or original text as
> > available.  Output would be either sequence file full of text or sequence
> > file full of vectors.
> >
>
> Ok, sure, then this is modifying the lucene.vectors code, not the
> seq2sparse code, right?
>
>  -jake
>

Re: LDA from Lucene Indexes

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, May 4, 2011 at 10:46 AM, Ted Dunning <te...@gmail.com> wrote:

> Pipelining is good for abstraction and really bad for performance (in the
> map-reduce world).
>
> My thought is that we could have a multipurpose tool.  Input would be a
> lucene index and the program would read term vectors or original text as
> available.  Output would be either sequence file full of text or sequence
> file full of vectors.
>

Ok, sure, then this is modifying the lucene.vectors code, not the
seq2sparse code, right?

  -jake

Re: LDA from Lucene Indexes

Posted by Ted Dunning <te...@gmail.com>.
Pipelining is good for abstraction and really bad for performance (in the
map-reduce world).

My thought is that we could have a multipurpose tool.  Input would be a
lucene index and the program would read term vectors or original text as
available.  Output would be either sequence file full of text or sequence
file full of vectors.

This would allow pipelining if interesting, but would also allow the common
case of generating vectors to proceed in one step.

On Wed, May 4, 2011 at 10:41 AM, Jake Mannix <ja...@gmail.com> wrote:

> On Wed, May 4, 2011 at 10:33 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > It might be that the right thing is to just tweak the current seq2sparse
> > process.
> >
> > Jake,
> >
> > is that what you were thinking?
> >
>
> Well seq2sparse is really for grabbing sequence files, and lucene.vector
> grabs
> lucene indexes... I was just imagining another script that takes lucene
> indexes
> and produces text files (or sequence files of text), so you can just
> pipeline it.
>
> I haven't thought about it too carefully, however.
>

Re: LDA from Lucene Indexes

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, May 4, 2011 at 10:33 AM, Ted Dunning <te...@gmail.com> wrote:

> It might be that the right thing is to just tweak the current seq2sparse
> process.
>
> Jake,
>
> is that what you were thinking?
>

Well seq2sparse is really for grabbing sequence files, and lucene.vector
grabs
lucene indexes... I was just imagining another script that takes lucene
indexes
and produces text files (or sequence files of text), so you can just
pipeline it.

I haven't thought about it too carefully, however.

Re: LDA from Lucene Indexes

Posted by Ted Dunning <te...@gmail.com>.
It might be that the right thing is to just tweak the current seq2sparse
process.

Jake,

is that what you were thinking?

On Wed, May 4, 2011 at 10:22 AM, Julian Limon <ju...@tukipa.com> wrote:

> Thanks, Jake!
>
> I also need certain files that are generated in the seq2sparse process
> (tf),
> so lucene.vector might not be the best choice. I'll take a look at dumping
> stored fields, then.
>
> Thanks
>
> 2011/5/4 Jake Mannix <ja...@gmail.com>
>
> > On Wed, May 4, 2011 at 8:53 AM, Julian Limon <julian.limon@tukipa.com> wrote:
> >
> > > This sounds really interesting. Is there a way to dump certain fields
> > from
> > > a
> > > Lucene index to text files?
> > >
> > > If so, I could use Lucene to do the parsing, and then seqdirectory and
> > > seq2sparse to generate Mahout vectors out of these files.
> > >
> >
> > You need to either have the fields Store.YES, or TermVector.YES for this
> > to work.  If you have the latter, then you don't need them in text files,
> > you
> > can use the usual lucene.vector script to produce mahout vectors.
> >
> > To dump stored fields, we don't currently have a script to do that, but
> it
> > should be another 5 lines of code to write one (ok, 25 lines, including
> > boilerplate, damn java).  File a ticket, there are lots of people around
> > here
> > who could write that code.
> >
> >  -jake
> >
> >
> > > Thanks,
> > >
> > > Julian
> > >
> > > 2011/5/3 Jake Mannix <ja...@gmail.com>
> > >
> > > > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> > > >
> > > > >
> > > > > > Although technically, we could add the capability to take a
> > Store.YES
> > > > > field
> > > > > > and re-tokenize and
> > > > > > build vectors from this as well.
> > > > >
> > > > > True, or we could just dump stored fields out to text and use the
> > > > existing
> > > > > text converter
> > > >
> > > >
> > > > That would probably be the right way to do that, actually.
> > > >
> > > >  -jake
> > > >
> > >
> >
>

Re: LDA from Lucene Indexes

Posted by Julian Limon <ju...@tukipa.com>.
Thanks, Jake!

I also need certain files that are generated in the seq2sparse process (tf),
so lucene.vector might not be the best choice. I'll take a look at dumping
stored fields, then.

Thanks

2011/5/4 Jake Mannix <ja...@gmail.com>

> On Wed, May 4, 2011 at 8:53 AM, Julian Limon <julian.limon@tukipa.com> wrote:
>
> > This sounds really interesting. Is there a way to dump certain fields
> from
> > a
> > Lucene index to text files?
> >
> > If so, I could use Lucene to do the parsing, and then seqdirectory and
> > seq2sparse to generate Mahout vectors out of these files.
> >
>
> You need to either have the fields Store.YES, or TermVector.YES for this
> to work.  If you have the latter, then you don't need them in text files,
> you
> can use the usual lucene.vector script to produce mahout vectors.
>
> To dump stored fields, we don't currently have a script to do that, but it
> should be another 5 lines of code to write one (ok, 25 lines, including
> boilerplate, damn java).  File a ticket, there are lots of people around
> here
> who could write that code.
>
>  -jake
>
>
> > Thanks,
> >
> > Julian
> >
> > 2011/5/3 Jake Mannix <ja...@gmail.com>
> >
> > > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gs...@apache.org>
> > > wrote:
> > >
> > > >
> > > > > Although technically, we could add the capability to take a
> Store.YES
> > > > field
> > > > > and re-tokenize and
> > > > > build vectors from this as well.
> > > >
> > > > True, or we could just dump stored fields out to text and use the
> > > existing
> > > > text converter
> > >
> > >
> > > That would probably be the right way to do that, actually.
> > >
> > >  -jake
> > >
> >
>

Re: LDA from Lucene Indexes

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, May 4, 2011 at 8:53 AM, Julian Limon <ju...@tukipa.com> wrote:

> This sounds really interesting. Is there a way to dump certain fields from
> a
> Lucene index to text files?
>
> If so, I could use Lucene to do the parsing, and then seqdirectory and
> seq2sparse to generate Mahout vectors out of these files.
>

You need to either have the fields Store.YES, or TermVector.YES for this
to work.  If you have the latter, then you don't need them in text files,
you
can use the usual lucene.vector script to produce mahout vectors.

To dump stored fields, we don't currently have a script to do that, but it
should be another 5 lines of code to write one (ok, 25 lines, including
boilerplate, damn java).  File a ticket, there are lots of people around
here
who could write that code.

  -jake
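
The "5 lines of code plus boilerplate" Jake describes might look roughly like this. It is a hypothetical sketch against the Lucene 3.x IndexReader API of the era; the class name is illustrative, and the index path and field name come in as arguments:

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Hypothetical dumper: print one stored field from every live document
// in an index. Only works for fields indexed with Field.Store.YES.
public class DumpStoredField {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;             // skip deleted docs
                String text = reader.document(i).get(args[1]); // null unless Store.YES
                if (text != null) System.out.println(text);
            }
        } finally {
            reader.close();
        }
    }
}
```

Output like this could then be fed through seqdirectory and seq2sparse as discussed above.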


> Thanks,
>
> Julian
>
> 2011/5/3 Jake Mannix <ja...@gmail.com>
>
> > On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gs...@apache.org>
> > wrote:
> >
> > >
> > > > Although technically, we could add the capability to take a Store.YES
> > > field
> > > > and re-tokenize and
> > > > build vectors from this as well.
> > >
> > > True, or we could just dump stored fields out to text and use the
> > existing
> > > text converter
> >
> >
> > That would probably be the right way to do that, actually.
> >
> >  -jake
> >
>

Re: LDA from Lucene Indexes

Posted by Julian Limon <ju...@tukipa.com>.
This sounds really interesting. Is there a way to dump certain fields from a
Lucene index to text files?

If so, I could use Lucene to do the parsing, and then seqdirectory and
seq2sparse to generate Mahout vectors out of these files.

Thanks,

Julian

2011/5/3 Jake Mannix <ja...@gmail.com>

> On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
>
> >
> > > Although technically, we could add the capability to take a Store.YES
> > field
> > > and re-tokenize and
> > > build vectors from this as well.
> >
> > True, or we could just dump stored fields out to text and use the
> existing
> > text converter
>
>
> That would probably be the right way to do that, actually.
>
>  -jake
>

Re: LDA from Lucene Indexes

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, May 3, 2011 at 6:17 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> > Although technically, we could add the capability to take a Store.YES
> field
> > and re-tokenize and
> > build vectors from this as well.
>
> True, or we could just dump stored fields out to text and use the existing
> text converter


That would probably be the right way to do that, actually.

  -jake

Re: LDA from Lucene Indexes

Posted by Grant Ingersoll <gs...@apache.org>.
On May 3, 2011, at 12:22 PM, Jake Mannix wrote:

> Hi Chris,
> 
>  That's what I thought.  This line needs to make sure you store termvectors
> (see this article <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/> for
> more details):
> 
> On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
> <c....@gmail.com> wrote:
>> 
> 
> Although technically, we could add the capability to take a Store.YES field
> and re-tokenize and
> build vectors from this as well.

True, or we could just dump stored fields out to text and use the existing text converter

Re: LDA from Lucene Indexes

Posted by Jake Mannix <ja...@gmail.com>.
Hi Chris,

  That's what I thought.  This line needs to make sure you store termvectors
(see this article <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/> for
more details):

On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
<c....@gmail.com> wrote:
>
> if (elementName.equals("doc")) {
>     if (title && content) {
>         doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>         doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));


You want this to be:

new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
Field.TermVector.YES);

Although technically, we could add the capability to take a Store.YES field
and re-tokenize and
build vectors from this as well.

  -jake
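
Put back into the snippet above, the fix might look like this. This is a sketch against the Lucene 3.x Field API in use at the time; the wrapping class and method are illustrative, not part of the original code:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Illustrative wrapper; the key change is Field.TermVector.YES, which
// makes Lucene store per-document term vectors that Mahout's
// lucene.vector driver can read back out.
public class DocBuilder {
    static Document build(String titleStr, String contentStr) {
        Document doc = new Document();
        doc.add(new Field("title", titleStr,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        doc.add(new Field("content", contentStr,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        return doc;
    }
}
```

The index remains searchable either way; the missing term vectors only surface when lucene.vector reports "Wrote: 0 vectors", as in the log at the top of the thread.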

Re: LDA from Lucene Indexes

Posted by Chris McConnell <c....@gmail.com>.
Here is the segment I was given for generating the index. If anyone has
thoughts, let me know!

Thanks,
Chris

if (elementName.equals("doc")) {
    if (title && content) {
        doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
    }

    return;
}


On Mon, May 2, 2011 at 6:36 PM, Jake Mannix <ja...@gmail.com> wrote:
> Were your lucene indexes created with term vectors enabled?
>
> On May 2, 2011 3:05 PM, "Chris McConnell" <c....@gmail.com>
> wrote:
>
> Hello all,
>
> We are looking at utilizing LDA for some topic trending off some
> pre-built Lucene indexes. I've put the command(s) and output below.
> While searching, it seems a lot of people are unable to get this to
> work properly. Most answers tell the user to review the example
> "build-reuters.sh" but that doesn't utilize a Lucene index for the
> input.
>
> The dictionary is created (on local disk) and an attempt at vector
> creation is done on HDFS, however no vectors are written out. I'm
> interested to know if anyone has actually gotten this to work on
> Mahout 0.4. I have (just for testing purposes) then tried to run the
> actual LDA on the created directories, however I wouldn't expect it to
> work since there are no vectors created.
>
> Thanks,
> Chris
>
> bin/mahout lucene.vector --dir /home/index_for_mahout/ --output
> /user/vectored_lucene_index --dictOut
> /home/vectored_lucene_index/dict.out --weight TF --field content
> 11/05/02 17:23:57 INFO lucene.Driver: Output File:
> /user/vectored_lucene_index
> 11/05/02 17:23:57 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 11/05/02 17:23:57 INFO zlib.ZlibFactory: Successfully loaded &
> initialized native-zlib library
> 11/05/02 17:23:57 INFO compress.CodecPool: Got brand-new compressor
> 11/05/02 17:23:58 INFO lucene.Driver: Wrote: 0 vectors
> 11/05/02 17:23:58 INFO lucene.Driver: Dictionary Output file:
> /home/vectored_lucene_index/dict.out
> 11/05/02 17:23:58 INFO driver.MahoutDriver: Program took 578 ms
>

Re: LDA from Lucene Indexes

Posted by Chris McConnell <c....@gmail.com>.
Great question, let me check on that. Sadly I don't have fast control
over the indexing process, but I'll post an update in the AM.

Thanks for the tip.

Chris

On Mon, May 2, 2011 at 6:36 PM, Jake Mannix <ja...@gmail.com> wrote:
> Were your lucene indexes created with term vectors enabled?
>
> On May 2, 2011 3:05 PM, "Chris McConnell" <c....@gmail.com>
> wrote:
>
> Hello all,
>
> We are looking at utilizing LDA for some topic trending off some
> pre-built Lucene indexes. I've put the command(s) and output below.
> While searching, it seems a lot of people are unable to get this to
> work properly. Most answers tell the user to review the example
> "build-reuters.sh" but that doesn't utilize a Lucene index for the
> input.
>
> The dictionary is created (on local disk) and an attempt at vector
> creation is done on HDFS, however no vectors are written out. I'm
> interested to know if anyone has actually gotten this to work on
> Mahout 0.4. I have (just for testing purposes) then tried to run the
> actual LDA on the created directories, however I wouldn't expect it to
> work since there are no vectors created.
>
> Thanks,
> Chris
>
> bin/mahout lucene.vector --dir /home/index_for_mahout/ --output
> /user/vectored_lucene_index --dictOut
> /home/vectored_lucene_index/dict.out --weight TF --field content
> 11/05/02 17:23:57 INFO lucene.Driver: Output File:
> /user/vectored_lucene_index
> 11/05/02 17:23:57 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 11/05/02 17:23:57 INFO zlib.ZlibFactory: Successfully loaded &
> initialized native-zlib library
> 11/05/02 17:23:57 INFO compress.CodecPool: Got brand-new compressor
> 11/05/02 17:23:58 INFO lucene.Driver: Wrote: 0 vectors
> 11/05/02 17:23:58 INFO lucene.Driver: Dictionary Output file:
> /home/vectored_lucene_index/dict.out
> 11/05/02 17:23:58 INFO driver.MahoutDriver: Program took 578 ms
>

Re: LDA from Lucene Indexes

Posted by Jake Mannix <ja...@gmail.com>.
Were your lucene indexes created with term vectors enabled?

On May 2, 2011 3:05 PM, "Chris McConnell" <c....@gmail.com>
wrote:

Hello all,

We are looking at utilizing LDA for some topic trending off some
pre-built Lucene indexes. I've put the command(s) and output below.
While searching, it seems a lot of people are unable to get this to
work properly. Most answers tell the user to review the example
"build-reuters.sh" but that doesn't utilize a Lucene index for the
input.

The dictionary is created (on local disk) and an attempt at vector
creation is done on HDFS, however no vectors are written out. I'm
interested to know if anyone has actually gotten this to work on
Mahout 0.4. I have (just for testing purposes) then tried to run the
actual LDA on the created directories, however I wouldn't expect it to
work since there are no vectors created.

Thanks,
Chris

bin/mahout lucene.vector --dir /home/index_for_mahout/ --output
/user/vectored_lucene_index --dictOut
/home/vectored_lucene_index/dict.out --weight TF --field content
11/05/02 17:23:57 INFO lucene.Driver: Output File:
/user/vectored_lucene_index
11/05/02 17:23:57 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
11/05/02 17:23:57 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
11/05/02 17:23:57 INFO compress.CodecPool: Got brand-new compressor
11/05/02 17:23:58 INFO lucene.Driver: Wrote: 0 vectors
11/05/02 17:23:58 INFO lucene.Driver: Dictionary Output file:
/home/vectored_lucene_index/dict.out
11/05/02 17:23:58 INFO driver.MahoutDriver: Program took 578 ms