Posted to dev@mahout.apache.org by Philippe Adjiman <ad...@gmail.com> on 2011/09/18 16:48:24 UTC

issue while running lucene.vector driver in mahout 0.5

Hi,

I was trying to generate vectors from a Lucene index using the lucene.vector
driver. It worked fine with Mahout 0.4, but with Mahout 0.5 I get the
following exception:

SEVERE: There are too many documents that do not have a term vector for description
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for description
    at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:114)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
    at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:43)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:206)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

My lucene index was created using:


doc.add(new Field("documentId", documentId,
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content,
                  Field.Store.YES, Field.Index.ANALYZED, TermVector.YES));


If it is a known issue, sorry for the duplicate; otherwise, let me know if I
can help reproduce it.


-Philippe


-- 
Philippe Adjiman | twitter: padjiman | linkedin:
il.linkedin.com/in/philippeadjiman | blog: http://philippeadjiman.com/blog

Re: issue while running lucene.vector driver in mahout 0.5

Posted by Grant Ingersoll <gs...@apache.org>.

The LuceneIterator has a built-in circuit breaker that trips if it hits too many errors. If you are using lucene.vector, you can pass in --maxPercentErrorDocs X, where X is the percentage of docs in which you are willing to allow errors. The default is no errors.
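
As a rough illustration of the circuit breaker described above, here is a minimal Java sketch. It is not Mahout's actual LuceneIterator code: the class ErrorBudgetSketch, the method collectVectors, and the use of null strings to stand in for documents without a term vector are all hypothetical, and treating the threshold as a fraction between 0 and 1 is an assumption of this sketch. Only the exception type and message are taken from the stack trace in the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of an error budget; not Mahout's real implementation. */
final class ErrorBudgetSketch {

  /**
   * Keeps every non-null "term vector" and skips null ones, but fails once
   * the number of skipped documents exceeds maxErrorFraction * total docs.
   * A null entry stands in for a document with no term vector (assumption).
   */
  static List<String> collectVectors(List<String> termVectors,
                                     double maxErrorFraction,
                                     String field) {
    // With the default of 0.0 the budget is zero, so the first bad doc fails.
    double maxErrors = maxErrorFraction * termVectors.size();
    int errors = 0;
    List<String> kept = new ArrayList<>();
    for (String termVector : termVectors) {
      if (termVector == null) {
        errors++;
        if (errors > maxErrors) {
          throw new IllegalStateException(
              "There are too many documents that do not have a term vector for "
                  + field);
        }
        continue; // still within budget: skip this document
      }
      kept.add(termVector);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("a b", null, "c d", "e");
    // One bad document out of four fits a budget of 0.25.
    System.out.println(collectVectors(docs, 0.25, "content"));
    try {
      // A budget of 0.0 tolerates no bad documents at all.
      collectVectors(docs, 0.0, "content");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In this sketch a zero budget reproduces the failure mode from the original post: a single document without a term vector for the requested field is enough to abort the run, while a non-zero budget lets the run skip such documents up to the limit.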


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com