Posted to user@mahout.apache.org by Peter Andrews <pw...@gmail.com> on 2011/06/10 23:03:11 UTC

term collocation from lucene index

Hi,

I just started using Mahout a week or two ago and so far it's been
pretty good. I'm working on some term collocation, and while I have
been working from a directory of files, I want to switch to using
Lucene indexes, as that is the format the files are already in. I am
trying to use lucene.vector to turn the indexes into vectors and then
use org.apache.mahout.vectorizer.collocations.llr.CollocDriver to
generate the collocations and LLRs. I keep getting this error when I
run CollocDriver; any ideas?

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:40)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)


-- 
Peter Andrews

Re: term collocation from lucene index

Posted by Drew Farris <dr...@apache.org>.
Hi Peter,

Apologies for the delay in following up on this.

The error you're seeing is a result of the output of the lucene.vector
task being incompatible with the input expected by CollocDriver: the
two write and read sequence files containing different types of data.

The sequence files produced by the lucene.vector task have a key type
of LongWritable and a value type of VectorWritable. The vectors
themselves do not retain references to word positions in the original
document; they merely encode each word as an id from the dictionary
generated as part of the process, along with a weight based on the
occurrences of the term in the index and the parameters you specified
when you called lucene.vector. Because positional information is not
retained, you cannot use these vectors as input to the CollocDriver
code, which relies on word proximity to form collocations.
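You can verify the type mismatch yourself by inspecting the sequence
file that lucene.vector produced. A minimal sketch, assuming the
output lives at /tmp/vectors/part-00000 on the default filesystem (the
path is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class InspectSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // hypothetical path; point this at the lucene.vector output
    Path path = new Path("/tmp/vectors/part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // For lucene.vector output this prints LongWritable and
      // VectorWritable, not the types CollocDriver expects.
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}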

The CollocDriver expects the sequence files it uses as input to have a
key type of Text and a value type of StringTuple. Sequence files with
a key of Text and a value of Text are also acceptable if the
preprocess option is specified. In either case the key is the document
id, while the value is the text of a document, either tokenized in
StringTuple form or untokenized in Text form.
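For reference, here's a minimal sketch of writing a single document in
the tokenized (Text/StringTuple) form; the output path, document id,
and tokens are all placeholder values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class WriteTokenizedDoc {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // placeholder output location
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/tokenized-docs/part-00000"),
        Text.class, StringTuple.class);
    try {
      // one record per document: key = document id, value = its tokens
      StringTuple tokens = new StringTuple();
      tokens.add("term");
      tokens.add("collocation");
      tokens.add("example");
      writer.append(new Text("doc-1"), tokens);
    } finally {
      writer.close();
    }
  }
}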

To produce sequence files suitable for generating collocations from a
Lucene index, you'll need to write some code to pull the text from a
stored field or reconstruct the text from a term vector with
positional information. You can then write this to a sequence file
that will work with CollocDriver, as in the sketch below. The
org.apache.mahout.utils.vectors.lucene.Driver class is a good starting
point for learning how to extract data from a Lucene index and write
data to sequence files.
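Untested, but something along these lines should work. It assumes a
Lucene 3.x index with a stored field named "text" (the field name and
both paths are just placeholders) and writes the untokenized Text/Text
form, so you'd run CollocDriver with the preprocess option afterwards:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class LuceneToSeqFile {
  public static void main(String[] args) throws Exception {
    // hypothetical locations; adjust to your index and desired output
    IndexReader reader =
        IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/colloc-input/part-00000"),
        Text.class, Text.class);
    try {
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) {
          continue;
        }
        Document doc = reader.document(i);
        // "text" is an assumed field name; the field must actually be
        // stored in your index for doc.get() to return anything
        String text = doc.get("text");
        if (text != null) {
          // key = document id, value = untokenized document text
          writer.append(new Text(String.valueOf(i)), new Text(text));
        }
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}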

Drew
