Posted to user@mahout.apache.org by Paolo Ruscitti <pr...@gmail.com> on 2012/01/15 18:26:02 UTC

confusion with lucene.vector normalization

Hi,
I'm a Mahout newbie and I'm confused about lucene.vector normalization.

I'm vectorizing from a Lucene/Solr index by running:

./bin/mahout lucene.vector \
    --dir ${SOLRINDEX} \
    --output ${WORK_DIR}/vec \
    --field fullContent \
    --norm 2 \
    ...

In my .../solr/schema.xml file, the fullContent field is set with
omitNorms="false":
...
<field name="fullContent" type="text_shingle" indexed="true" stored="true"
       multiValued="false" termVectors="true" omitNorms="false"/>

// "text_shingle" is my own text type, used to produce bigrams
// ... <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/> ...

My question is:
if I use both omitNorms="false" in the Solr schema and '--norm 2' on the
mahout lucene.vector command, do I normalize twice?
Are the two settings inconsistent with each other?
Should I use just one of them?
If so, which one?

Also, for normalized input vectors, which distance measure class should
I use:
CosineDistanceMeasure
or
TanimotoDistanceMeasure?
If I have understood correctly, TanimotoDistanceMeasure applies
normalization itself.

Thanks a lot.

Paolo

Re: confusion with lucene.vector normalization

Posted by Lance Norskog <go...@gmail.com>.

Hi-

Is there advice anywhere about using an externally created Solr/Lucene index?

Norms in Solr/Lucene are not relevant here; they are used for scoring in
Lucene searches. What matters is that the field stores term vectors,
i.e. termVectors="true".
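
On the distance-measure part of the question, a small sketch may help show how L2 normalization interacts with the two measures. It assumes that '--norm 2' scales each document vector to unit Euclidean length, and it uses the textbook formulas for cosine and Tanimoto distance rather than anything checked against Mahout's CosineDistanceMeasure or TanimotoDistanceMeasure classes. The vectors and function names are made up for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean length (assumed meaning of '--norm 2')."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine_distance(a, b):
    """Textbook cosine distance: 1 - a.b / (|a| * |b|)."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    """Textbook Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

# Hypothetical term-weight vectors for two documents.
doc_a = [3.0, 0.0, 4.0]
doc_b = [1.0, 2.0, 2.0]
unit_a, unit_b = l2_normalize(doc_a), l2_normalize(doc_b)

cos_raw = cosine_distance(doc_a, doc_b)
cos_unit = cosine_distance(unit_a, unit_b)
tan_raw = tanimoto_distance(doc_a, doc_b)
tan_unit = tanimoto_distance(unit_a, unit_b)

print("cosine   raw vs normalized:", cos_raw, cos_unit)
print("tanimoto raw vs normalized:", tan_raw, tan_unit)
```

Under these definitions the cosine distance is the same before and after normalization, because it already divides by both vector lengths. The Tanimoto distance changes when the inputs are rescaled, so with this formula it does not normalize on its own.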

-- 
Lance Norskog
goksron@gmail.com