Posted to user@mahout.apache.org by Kris Jack <mr...@gmail.com> on 2010/07/01 16:23:54 UTC
Re: Reading Vectors Created from a Lucene Index
Hi Grant,
I applied the patch but still no luck. In debugging, I found that in
LuceneIterable, line 129:
<<
result = result.normalize(normPower);
>>
seems to turn result, which was previously a NamedVector, back into a plain
Vector, so the name is lost. If I change the code to keep the name by
replacing the line with:
<<
result = new NamedVector(result.normalize(normPower), name);
>>
then the name is included and the result remains a NamedVector, but the
VectorDumper code still prints plain Vectors rather than NamedVectors.
Perhaps I am going about this wrong, but shouldn't VectorDumper check the
type of vector being dumped?
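The behaviour described above can be reproduced outside Mahout with a small sketch. It assumes, as the debugging suggests, that normalize() on a named wrapper returns an unnamed vector. Vec, PlainVec, NamedVec and normalizeKeepingName are hypothetical stand-ins written for this illustration, not Mahout's Vector and NamedVector classes.

```java
public class NormalizeKeepsName {

    /** Hypothetical stand-in for a vector type; not Mahout's Vector interface. */
    interface Vec {
        double[] values();
        Vec normalize(double power);
    }

    /** Plain vector: normalize() returns a new plain vector scaled by the p-norm. */
    static final class PlainVec implements Vec {
        private final double[] values;
        PlainVec(double... values) { this.values = values.clone(); }
        public double[] values() { return values.clone(); }
        public Vec normalize(double power) {
            double sum = 0.0;
            for (double v : values) sum += Math.pow(Math.abs(v), power);
            double norm = Math.pow(sum, 1.0 / power);
            if (norm == 0.0) return new PlainVec(values); // all-zero vector: nothing to scale
            double[] out = new double[values.length];
            for (int i = 0; i < values.length; i++) out[i] = values[i] / norm;
            return new PlainVec(out);
        }
    }

    /** Named wrapper (assumed behaviour): normalize() delegates, so the result has no name. */
    static final class NamedVec implements Vec {
        private final Vec delegate;
        private final String name;
        NamedVec(Vec delegate, String name) { this.delegate = delegate; this.name = name; }
        String name() { return name; }
        public double[] values() { return delegate.values(); }
        public Vec normalize(double power) { return delegate.normalize(power); }
    }

    /** Normalizes and re-wraps the result so the name survives, mirroring the replacement line. */
    static Vec normalizeKeepingName(Vec v, double power) {
        Vec normalized = v.normalize(power);
        if (v instanceof NamedVec) {
            return new NamedVec(normalized, ((NamedVec) v).name());
        }
        return normalized;
    }

    public static void main(String[] args) {
        Vec doc = new NamedVec(new PlainVec(3.0, 4.0), "doc-42");
        // Calling normalize() directly drops the wrapper in this sketch.
        System.out.println(doc.normalize(2.0) instanceof NamedVec);
        // Re-wrapping after normalization keeps the name.
        System.out.println(normalizeKeepingName(doc, 2.0) instanceof NamedVec);
    }
}
```

Re-wrapping at the call site keeps the change local to the one line; whether the wrapper itself should preserve the name on normalize() is a separate design question.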
Thanks,
Kris
2010/6/30 Grant Ingersoll <gs...@apache.org>
> Kris,
>
> Can you try the patch at
> https://issues.apache.org/jira/secure/attachment/12448396/MAHOUT-379-lucene.patch
>
> Thanks,
> Grant
>
> On Jun 30, 2010, at 8:53 AM, Grant Ingersoll wrote:
>
> >
> > On Jun 30, 2010, at 8:39 AM, Grant Ingersoll wrote:
> >
> >>
> >> On Jun 29, 2010, at 1:54 PM, Kris Jack wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> I have been using mahout to generate vectors from a lucene index using:
> >>>
> >>> $MAHOUT_HOME/bin/mahout lucene.vector
> >>>
> >>> In doing so, mahout creates an output file that has new ids for my
> >>> documents that are completely unlike my original --idField, which is a
> >>> string. How can I relate the new ids to my original ids? Is there a
> >>> method that allows me to output the vectors with the original --idField
> >>> values that appear in the lucene index rather than the new doc ids?
> >>
> >>
> >> Hmm, it seems the --idField stuff has been commented out, likely with
> >> the change of labels.
> >>
> >
> > I've brought the issue up over on dev@, as it is a bug.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
Re: Reading Vectors Created from a Lucene Index
Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 2, 2010, at 5:42 AM, Kris Jack wrote:
> Hi Drew,
>
> That indeed causes the name to be emitted now. With the change that I
> suggested and your patch, I'm now getting the names of vectors, as provided
> by the --idField, being output with the vectors themselves.
https://issues.apache.org/jira/browse/MAHOUT-434 handles the normalization issue.
>
> Thanks again,
> Kris
>
>
>
> 2010/7/2 Drew Farris <dr...@gmail.com>
>
>> Hi Kris,
>>
>> Could you try the code in the patch at:
>> https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch
>>
>> This should cause VectorDumper to emit the names found in NamedVectors.
>>
>> Thanks,
>> Drew
>
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
Re: Reading Vectors Created from a Lucene Index
Posted by Kris Jack <mr...@gmail.com>.
Hi Drew,
That indeed causes the name to be emitted now. With the change that I
suggested and your patch, I'm now getting the names of vectors, as provided
by the --idField, being output with the vectors themselves.
Thanks again,
Kris
2010/7/2 Drew Farris <dr...@gmail.com>
> Hi Kris,
>
> Could you try the code in the patch at:
> https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch
>
> This should cause VectorDumper to emit the names found in NamedVectors.
>
> Thanks,
> Drew
--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
Re: Reading Vectors Created from a Lucene Index
Posted by Drew Farris <dr...@gmail.com>.
Hi Kris,
Could you try the code in the patch at:
https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch
This should cause VectorDumper to emit the names found in NamedVectors.
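For illustration, the kind of type check asked about earlier in the thread can be sketched as follows. This is not the content of the MAHOUT-402 patch: Vec, PlainVec, NamedVec and dump are hypothetical stand-ins, and the "name:values" output format is an assumption chosen for the example.

```java
import java.util.Arrays;

public class DumpWithNames {

    /** Hypothetical stand-in for a vector type; not Mahout's Vector interface. */
    interface Vec {
        double[] values();
    }

    /** Plain, unnamed vector. */
    static final class PlainVec implements Vec {
        private final double[] values;
        PlainVec(double... values) { this.values = values.clone(); }
        public double[] values() { return values.clone(); }
    }

    /** Vector that carries a name alongside its values. */
    static final class NamedVec implements Vec {
        private final Vec delegate;
        private final String name;
        NamedVec(Vec delegate, String name) { this.delegate = delegate; this.name = name; }
        String name() { return name; }
        public double[] values() { return delegate.values(); }
    }

    /** Formats a vector for printing, prefixing the name when the vector has one. */
    static String dump(Vec v) {
        StringBuilder sb = new StringBuilder();
        if (v instanceof NamedVec) {
            sb.append(((NamedVec) v).name()).append(':');
        }
        sb.append(Arrays.toString(v.values()));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump(new PlainVec(0.6, 0.8)));
        System.out.println(dump(new NamedVec(new PlainVec(0.6, 0.8), "doc-42")));
    }
}
```

With a check like this, unnamed vectors print exactly as before, so existing output is unaffected.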
Thanks,
Drew