Posted to user@mahout.apache.org by Paolo Ruscitti <pr...@gmail.com> on 2012/01/15 18:26:02 UTC
confusion with lucene.vector normalization
Hi,
I'm a Mahout newbie and I'm confused about lucene.vector normalization.
I'm vectorizing from a Lucene/Solr index by running:
./bin/mahout lucene.vector \
--dir ${SOLRINDEX} \
--output ${WORK_DIR}/vec \
--field fullContent \
--norm 2
...
In my .../solr/schema.xml file
fullContent field is set with omitNorms="false"
...
<field name="fullContent" type="text_shingle" indexed="true"
stored="true" multiValued="false" termVectors="true" omitNorms="false"/>
// "text_shingle" is my own text type to have bigrams
// ... <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
outputUnigrams="true"/> ...
The issue is:
if I use both omitNorms="false" in the Solr schema and '--norm 2' on the
mahout lucene.vector command, do I normalize twice?
Are the two settings inconsistent with each other?
Should I use just one of them?
If so, which one?
Also, for normalized input vectors, which distance measure class should
I use?
CosineDistanceMeasure
or
TanimotoDistanceMeasure?
If I have understood correctly, TanimotoDistanceMeasure applies
normalization itself.
Thanks a lot.
Paolo
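For reference, here is a minimal Python sketch of how the two distances behave on raw versus L2-normalized vectors. It assumes the textbook definitions (cosine distance = 1 - a.b / (|a| |b|), Tanimoto distance = 1 - a.b / (|a|^2 + |b|^2 - a.b)), which are assumed, not verified here, to match Mahout's CosineDistanceMeasure and TanimotoDistanceMeasure. The function names and the sample vectors are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_distance(a, b):
    """1 - cos(a, b); unaffected by rescaling either vector."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    """1 - a.b / (|a|^2 + |b|^2 - a.b); depends on vector lengths."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

# Hypothetical term-weight vectors, made up for illustration.
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_raw = cosine_distance(a, b)             # 4/15
cos_unit = cosine_distance(a_unit, b_unit)  # 4/15 again
tan_raw = tanimoto_distance(a, b)           # 12/23
tan_unit = tanimoto_distance(a_unit, b_unit)  # 8/19
```

Under these definitions the cosine distance is the same before and after normalization, while the Tanimoto distance changes, so the choice of '--norm' matters for the latter but not for the former.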
Re: confusion with lucene.vector normalization
Posted by Lance Norskog <go...@gmail.com>.
Hi-
Is there advice anywhere about using an externally created Solr/Lucene index?
Norms in Solr/Lucene are not relevant here. They are index-time factors
used when scoring Lucene searches. What is important is to include term
vectors with termVectors="true".
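To make the vector-side normalization concrete, here is a small Python sketch of p-norm normalization. It assumes that '--norm 2' means dividing each document vector by its 2-norm; the helper names and the sample weights are hypothetical, and this is not Mahout's code.

```python
import math

def p_norm(v, p):
    """p-norm of a dense vector; p may be math.inf for the max norm."""
    if p == math.inf:
        return max((abs(x) for x in v), default=0.0)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def normalize(v, p=2.0):
    """Divide v by its p-norm (a zero vector is returned unchanged)."""
    n = p_norm(v, p)
    return [x / n for x in v] if n > 0 else list(v)

# Hypothetical term weights for one document.
weights = [2.0, 1.0, 0.0, 2.0]

once = normalize(weights, 2)   # 2-norm of weights is 3, so [2/3, 1/3, 0, 2/3]
twice = normalize(once, 2)     # already unit length, so effectively unchanged
```

Applying the same p-norm normalization a second time leaves a unit-length vector as it is, up to floating-point rounding.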
--
Lance Norskog
goksron@gmail.com