Posted to user@mahout.apache.org by Rob Ennals <ro...@gmail.com> on 2010/01/15 06:41:45 UTC

CorruptIndexException or NullPointerException when creating vectors from Lucene

Hi Guys,

I'm totally new to Mahout so I'm running into what I expect are newbie issues.

To get started with clustering, I tried importing some indexes from Lucene.

Following the Lucene tutorial, I created a really simple index of the
Lucene source code:
http://lucene.apache.org/java/3_0_0/demo.html

I then tried to convert this to a Mahout Vector, following the instructions at
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

This gives me a CorruptIndexException:

rob@rob:~/svn/mahout$ java
org.apache.mahout.utils.vectors.lucene.Driver --dir
/home/rob/Reference/Installers/lucene-3.0.0/index --output
/home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
contents
Exception in thread "main"
org.apache.lucene.index.CorruptIndexException: Incompatible format
version: 2 expected 1 or lower
        at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:117)
        at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:277)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:599)
        at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:104)
        at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:314)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:140)


I also tried running the driver on the actual Lucene index that I want
to apply it to, and this time got a NullPointerException:

rob@rob:~/svn/mahout$ java
org.apache.mahout.utils.vectors.lucene.Driver --dir
/home/rob/git/thinklink/scala/bin/index/ --output
/home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
contents
Jan 14, 2010 9:40:40 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
        at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)


In both cases, the indexes should have the "contents" field.


I assume I'm doing something stupid here. If someone can tell me what
that is, then that would be great.


Thanks

-Rob

Re: CorruptIndexException or NullPointerException when creating vectors from Lucene

Posted by Rob Ennals <ro...@gmail.com>.
Thanks for the help.

I hadn't realized that Java was picking up the Lucene class from the
target/dependency/ directory, rather than from my Lucene installation.
I fixed this by replacing the Lucene jar in the dependency directory
with the one from Lucene 3.0.0, and now I get the
NullPointerException for the Lucene demo index as well:

rob@rob:~/svn/mahout$ java
org.apache.mahout.utils.vectors.lucene.Driver --dir
/home/rob/Reference/Installers/lucene-3.0.0/index --output
/home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
contents
Jan 18, 2010 2:06:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
        at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
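
For anyone who runs into the same classpath surprise, here is a minimal
sketch, using only JDK calls, that lists every classpath location providing
a given class, so a stray copy such as the one under target/dependency/
shows up next to the intended jar. The class name ClasspathCopies is made up
for illustration, and org.apache.lucene.index.IndexReader is only a default
taken from the stack trace above. It has to be run with the same classpath
that is used for the Driver.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Mahout or Lucene.
final class ClasspathCopies {

    /** Returns every location visible to the system class loader that provides the class file. */
    static List<URL> find(String className) {
        // A class is stored as a resource named like org/apache/.../IndexReader.class
        String resource = className.replace('.', '/') + ".class";
        List<URL> found = new ArrayList<>();
        try {
            Enumeration<URL> urls = ClassLoader.getSystemClassLoader().getResources(resource);
            while (urls.hasMoreElements()) {
                found.add(urls.nextElement());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Default class name is an assumption taken from the stack trace in this thread.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.index.IndexReader";
        List<URL> copies = find(name);
        if (copies.isEmpty()) {
            System.out.println(name + " not found on the classpath");
        }
        for (URL u : copies) {
            System.out.println(u);
        }
    }
}
```

If more than one location is printed, the JVM will normally load the class
from the first matching entry on the classpath.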


I then tried downgrading Lucene to 2.9.1 to see if this fixed the
NullPointerException, but I get the same problem:

java org.apache.mahout.utils.vectors.lucene.Driver --dir
/home/rob/Reference/Installers/lucene-2.9.1/index --output
/home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
contents
Jan 18, 2010 2:18:37 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
        at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)


Any idea what's going on here?


Thanks

-Rob


On Thu, Jan 14, 2010 at 10:01 PM, Shashikant Kore <sh...@gmail.com> wrote:
> The first problem seems to be index version incompatibility.
>
> Since you created the index with Lucene 3.0, you will need the same
> version to read it. It seems that, while creating the vectors, the
> version of Lucene in use is lower than that.  Can you check whether you
> are using the same Lucene jar when creating the vectors?
>
> Not sure what the second problem is.
>
> --shashi

Re: CorruptIndexException or NullPointerException when creating vectors from Lucene

Posted by Isabel Drost <is...@apache.org>.
On Fri Grant Ingersoll <gs...@apache.org> wrote:

> Right, Mahout is currently on Lucene 2.9.  We should upgrade.

Apart from the issues to be fixed in MAHOUT-246 - is there anything else
that would block upgrading?

Isabel

Re: CorruptIndexException or NullPointerException when creating vectors from Lucene

Posted by Grant Ingersoll <gs...@apache.org>.
Right, Mahout is currently on Lucene 2.9.  We should upgrade.

On Jan 15, 2010, at 1:01 AM, Shashikant Kore wrote:

> The first problem seems to be index version incompatibility.
> 
> Since you created the index with Lucene 3.0, you will need the same
> version to read it. It seems that, while creating the vectors, the
> version of Lucene in use is lower than that.  Can you check whether you
> are using the same Lucene jar when creating the vectors?
> 
> Not sure what the second problem is.
> 
> --shashi

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/


Re: CorruptIndexException or NullPointerException when creating vectors from Lucene

Posted by Shashikant Kore <sh...@gmail.com>.
The first problem seems to be index version incompatibility.

Since you created the index with Lucene 3.0, you will need the same
version to read it. It seems that, while creating the vectors, the
version of Lucene in use is lower than that.  Can you check whether you
are using the same Lucene jar when creating the vectors?
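
One way to check, as a sketch: ask the JVM where it actually loaded a Lucene
class from. The class name ClassOrigin is hypothetical, and
org.apache.lucene.index.IndexReader is simply the class that appears in the
stack trace of the original post. Run it with the same classpath you use
for the Driver.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Mahout or Lucene.
final class ClassOrigin {

    /** Describes where a loaded class came from, or notes that the JVM reports no location. */
    static String describe(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            // Classes loaded by the bootstrap loader (the JDK itself) have no code source.
            return c.getName() + " <- (no code source reported)";
        }
        return c.getName() + " <- " + src.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default class name is an assumption taken from the stack trace in this thread.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.index.IndexReader";
        System.out.println(describe(Class.forName(name)));
    }
}
```

The printed location should name the Lucene jar whose version matches the
one that wrote the index. If it points somewhere else, that jar is the one
being used to read the index.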

Not sure what the second problem is.

--shashi

Re: CorruptIndexException or NullPointerException when creating vectors from Lucene

Posted by rqualis <rq...@macroteck.com>.
Strange, but I did the following to get it working.
In Eclipse I linked my project to the lucene...core jar that came with the
Mahout I downloaded.  I then rebuilt my project and executed it to create the
index.  Then I returned to Mahout and executed the mahout lucene.vector ...
command, and it worked.

Hope this helps