Posted to user@mahout.apache.org by Kris Jack <mr...@gmail.com> on 2010/07/01 16:23:54 UTC

Re: Reading Vectors Created from a Lucene Index

Hi Grant,

I applied the patch but still no luck.  In debugging, I found that in
LuceneIterable, line 129:

<<
  result = result.normalize(normPower);
>>

seems to turn result, which was previously a NamedVector, back into a plain
Vector and causes the name to be lost.  If I change the code to keep the
name by replacing the line with:

<<
  result = new NamedVector(result.normalize(normPower), name);
>>

then the name is included and the result remains a NamedVector, but the
VectorDumper code still prints plain Vectors rather than NamedVectors.
Perhaps I am going about this wrong, but shouldn't there be a check in the
VectorDumper to find out the type of vector being dumped?
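
To illustrate the behaviour described above, here is a minimal,
self-contained sketch.  SimpleVector and SimpleNamedVector are hypothetical
stand-ins written only for this example; they are not Mahout's Vector and
NamedVector classes, and the "doc-42" name is made up.  The sketch assumes
that normalize() builds a new plain vector, which is what the debugging
above suggests.

```java
import java.util.Arrays;

public class NamedVectorSketch {

    /** Hypothetical stand-in for a plain vector type; not Mahout's API. */
    static class SimpleVector {
        final double[] values;

        SimpleVector(double... values) {
            this.values = values.clone();
        }

        /** Returns a NEW plain vector divided by its p-norm. */
        SimpleVector normalize(double power) {
            double sum = 0.0;
            for (double v : values) {
                sum += Math.pow(Math.abs(v), power);
            }
            double norm = Math.pow(sum, 1.0 / power);
            double[] scaled = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                scaled[i] = norm == 0.0 ? 0.0 : values[i] / norm;
            }
            return new SimpleVector(scaled);
        }
    }

    /** Hypothetical stand-in for a vector that also carries a name. */
    static class SimpleNamedVector extends SimpleVector {
        final String name;

        SimpleNamedVector(SimpleVector delegate, String name) {
            super(delegate.values);
            this.name = name;
        }
    }

    public static void main(String[] args) {
        String name = "doc-42"; // made-up id for illustration
        SimpleVector result = new SimpleNamedVector(new SimpleVector(3.0, 4.0), name);

        // normalize() builds a plain SimpleVector, so the name is dropped here.
        SimpleVector plain = result.normalize(2.0);
        System.out.println("named after normalize: " + (plain instanceof SimpleNamedVector));

        // Re-wrapping the normalized vector keeps the name attached.
        SimpleVector rewrapped = new SimpleNamedVector(result.normalize(2.0), name);
        System.out.println("named after re-wrap:   " + (rewrapped instanceof SimpleNamedVector));
        System.out.println(((SimpleNamedVector) rewrapped).name + " "
                + Arrays.toString(rewrapped.values));
    }
}
```

The second half of main mirrors the one-line replacement quoted above:
normalize first, then wrap the result together with the original name.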

Thanks,
Kris



2010/6/30 Grant Ingersoll <gs...@apache.org>

> Kris,
>
> Can you try the patch at
> https://issues.apache.org/jira/secure/attachment/12448396/MAHOUT-379-lucene.patch
>
> Thanks,
> Grant
>
> On Jun 30, 2010, at 8:53 AM, Grant Ingersoll wrote:
>
> >
> > On Jun 30, 2010, at 8:39 AM, Grant Ingersoll wrote:
> >
> >>
> >> On Jun 29, 2010, at 1:54 PM, Kris Jack wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> I have been using mahout to generate vectors from a lucene index using:
> >>>
> >>> $MAHOUT_HOME/bin/mahout lucene.vector
> >>>
> >>> In doing so, mahout creates an output file that has new ids for my
> >>> documents, which are completely unlike my original --idField, which is a
> >>> string.  How can I relate the new ids to my original ids?  Is there a
> >>> method that allows me to output the vectors with the original --idField
> >>> values that appear in the lucene index rather than the new doc ids?
> >>
> >>
> >> Hmm, it seems the --idField stuff has been commented out, likely with
> >> the change of labels.
> >>
> >
> > I've brought the issue up over on dev@, as it is a bug.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/

Re: Reading Vectors Created from a Lucene Index

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 2, 2010, at 5:42 AM, Kris Jack wrote:

> Hi Drew,
> 
> That indeed causes the name to be emitted now.  With the change that I
> suggested and your patch,

https://issues.apache.org/jira/browse/MAHOUT-434 handles the normalization issue.

> I'm now getting the names of vectors, as provided
> by the --idField, being output with the vectors themselves.
> 
> Thanks again,
> Kris
> 
> 
> 
> 2010/7/2 Drew Farris <dr...@gmail.com>
> 
>> Hi Kris,
>> 
>> Could you try the code in the patch at:
>> https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch
>> 
>> This should cause VectorDumper to emit the names found in NamedVectors.
>> 
>> Thanks,
>> Drew
>> 
> 
> 
> 
> -- 
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/


Re: Reading Vectors Created from a Lucene Index

Posted by Kris Jack <mr...@gmail.com>.
Hi Drew,

That indeed causes the name to be emitted now.  With the change that I
suggested and your patch, I'm now getting the names of vectors, as provided
by the --idField, being output with the vectors themselves.

Thanks again,
Kris



2010/7/2 Drew Farris <dr...@gmail.com>

> Hi Kris,
>
> Could you try the code in the patch at:
> https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch
>
> This should cause VectorDumper to emit the names found in NamedVectors.
>
> Thanks,
> Drew
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/

Re: Reading Vectors Created from a Lucene Index

Posted by Drew Farris <dr...@gmail.com>.
Hi Kris,

Could you try the code in the patch at:
https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch

This should cause VectorDumper to emit the names found in NamedVectors.
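
The contents of the patch are not reproduced in this thread, so the following
is only a sketch of the kind of type check being discussed, not the actual
VectorDumper change.  SimpleVector, SimpleNamedVector and describe() are
hypothetical names invented for this example, and the output format is
likewise an assumption.

```java
import java.util.Arrays;

public class DumpNameSketch {

    /** Hypothetical stand-in for a plain vector type; not Mahout's API. */
    static class SimpleVector {
        final double[] values;

        SimpleVector(double... values) {
            this.values = values.clone();
        }
    }

    /** Hypothetical stand-in for a vector that also carries a name. */
    static class SimpleNamedVector extends SimpleVector {
        final String name;

        SimpleNamedVector(String name, double... values) {
            super(values);
            this.name = name;
        }
    }

    /** Formats a vector, prefixing the name only when the vector has one. */
    static String describe(SimpleVector vector) {
        StringBuilder out = new StringBuilder();
        if (vector instanceof SimpleNamedVector) {
            out.append(((SimpleNamedVector) vector).name).append(": ");
        }
        out.append(Arrays.toString(vector.values));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new SimpleVector(1.0, 2.0)));
        System.out.println(describe(new SimpleNamedVector("doc-42", 1.0, 2.0)));
    }
}
```

With a check like this, unnamed vectors print exactly as before, and only
vectors that carry a name get the extra prefix.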

Thanks,
Drew
