Posted to user@mahout.apache.org by Julian Limon <ju...@tukipa.com> on 2011/04/17 09:09:03 UTC

Create vector using existing dictionary and IDF values

Hello all,

Sorry to bother again, but I've been banging my head against the wall for
the last day and I can't seem to find the answer.

I'm trying to create a new TF-IDF vector (or, more likely, many vectors) from a
new directory of documents using something like seq2sparse. However, I want to
build these vectors against the dictionary and IDF values from a previous run.
Let's say I created my vectors from the whole corpus, and now I want to
calculate new TF-IDF vectors for a few documents (or, more precisely, a few
queries) that should live in the same term space as the previous corpus.

I know that seq2sparse stores a dictionary and term-frequency counts in
intermediate folders. My first attempt was to modify DictionaryVectorizer and
TFIDFConverter so that they use a dictionary and a df-count from a different
directory. It seems I had some luck with both, but now I'm getting an "index
out of bounds" exception. My guess is that some other class or job determines
the size of some array from the new document source rather than the old one.
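
For concreteness, this is the kind of loading I have in mind (a sketch, not my
actual patch; the paths are illustrative, and I'm assuming the dictionary
chunks are <Text, IntWritable> and the frequency files are
<IntWritable, LongWritable> -- please correct me if that's off):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class OldRunLoader {

  // term -> index, read back from the old run's dictionary.file-0
  public static Map<String, Integer> loadDictionary(FileSystem fs, Configuration conf)
      throws Exception {
    Map<String, Integer> dict = new HashMap<String, Integer>();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("old-vectors/dictionary.file-0"), conf);
    Text term = new Text();
    IntWritable index = new IntWritable();
    while (reader.next(term, index)) {
      dict.put(term.toString(), index.get());
    }
    reader.close();
    return dict;
  }

  // index -> document frequency, read back from the old run's frequency.file-0
  public static Map<Integer, Long> loadDocFreqs(FileSystem fs, Configuration conf)
      throws Exception {
    Map<Integer, Long> df = new HashMap<Integer, Long>();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("old-vectors/frequency.file-0"), conf);
    IntWritable index = new IntWritable();
    LongWritable count = new LongWritable();
    while (reader.next(index, count)) {
      df.put(index.get(), count.get());
    }
    reader.close();
    return df;
  }
}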

Do you have any ideas about what might be wrong? Or, even better, do you know
of a better way to generate a vector (e.g., a query vector) using the values
of a previous run (i.e., the existing index)?
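
Put differently, a hypothetical helper like the one below is what I'm after:
size the vector by the old dictionary, skip unknown terms, and weight with the
stored df counts. (I'm using plain log IDF here; I believe Mahout's
TFIDFConverter goes through Lucene's DefaultSimilarity, so the exact weights
would differ slightly.)

import java.util.HashMap;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class QueryVectorizer {

  public static Vector vectorize(Iterable<String> queryTokens,
                                 Map<String, Integer> dictionary,
                                 Map<Integer, Long> docFreqs,
                                 long numDocsInOldCorpus) {
    // Count term frequencies, keeping only terms the old dictionary knows.
    Map<Integer, Integer> tf = new HashMap<Integer, Integer>();
    for (String token : queryTokens) {
      Integer index = dictionary.get(token);
      if (index == null) {
        continue; // unseen term: it has no slot in the old index space
      }
      Integer prev = tf.get(index);
      tf.put(index, prev == null ? 1 : prev + 1);
    }
    // Size the vector by the *old* dictionary, never by the new input.
    Vector v = new RandomAccessSparseVector(dictionary.size());
    for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
      Long df = docFreqs.get(e.getKey());
      if (df == null || df == 0L) {
        continue;
      }
      // Plain log IDF; swap in Mahout's weighting if exact parity is needed.
      double idf = Math.log(numDocsInOldCorpus / (double) df);
      v.set(e.getKey(), e.getValue() * idf);
    }
    return new SequentialAccessSparseVector(v);
  }
}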

Thanks a lot,

Julian

P.S. The error I'm getting looks like this:

Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
org.apache.mahout.math.IndexException: Index 517 is outside allowable range of [0,0)
at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:392)
at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:95)
at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:50)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO:  map 100% reduce 0%

Re: Create vector using existing dictionary and IDF values

Posted by Julian Limon <ju...@tukipa.com>.
Thanks, Daniel! I hadn't realized that, but it makes perfect sense now. I'll
take a look at the code to account for those cases.

Julian


Re: Create vector using existing dictionary and IDF values

Posted by Daniel McEnnis <dm...@gmail.com>.
Julian,

You're using a dictionary that contains only the terms seen in the
training set. Once you run against a different set of documents, you may hit
terms that are present in the new set but not in the old one. Unless you
handle this case explicitly, they will generate IndexOutOfBounds or
NullPointer errors, depending on how the dictionary is implemented.
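
Concretely, the guard looks something like the sketch below (a hypothetical
helper, not a patch against your modified reducer; it assumes you already have
the old dictionary loaded as a Map). Note that the vector has to be sized by
the old dictionary as well -- the [0,0) range in your IndexException means the
vector was created with cardinality 0, which suggests the old run's dimension
never reached that job.

import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class DictionaryGuard {

  // Hypothetical helper: term frequencies for new tokens against the old dictionary.
  public static Vector termFrequencies(Iterable<String> tokens,
                                       Map<String, Integer> dictionary) {
    // Size by the old dictionary so every known index is in range.
    Vector v = new RandomAccessSparseVector(dictionary.size());
    for (String token : tokens) {
      Integer index = dictionary.get(token); // null for terms never seen in training
      if (index == null) {
        continue; // drop it (or log it) instead of setting an out-of-range index
      }
      v.set(index, v.get(index) + 1.0);
    }
    return v;
  }
}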

Daniel
