Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/11/08 17:13:41 UTC

Skewed IDF in multi lingual index

Hi,

We're testing a large multilingual index with _LANG fields for each language and using dismax to query them all. Users provide language preferences, explicitly or implicitly, that we use for either additive or multiplicative boosting on the language of the document. However, additive boosting is not adequate because it cannot overcome the extremely high IDF values for the same word in another language, so foreign documents are returned regardless of the preference. Multiplicative boosting solves this problem but has another downside: because the multiplicative boost is so strong, the standard qf=field^boost no longer lets us rank documents in another language above those in the preferred language. We do use the def function (boost=def(query($qq),.3)) to prevent a single boost query from returning 0 and thus making the product of all boost queries 0, but it doesn't help that much.

This all comes down to IDF differences between the languages; even common words such as country names like `india` show large differences in IDF. Is there anyone with hints or experiences to share about skewed IDF in such an index?
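
The trade-off between the two boosting styles can be sketched with made-up numbers. Everything below is hypothetical: the scores, the boost values, and the helper name product_of_boosts are illustrations only, and the floor of 0.3 merely mirrors the intent of boost=def(query($qq),.3) as described in the message. It is not how Solr evaluates these functions internally.

```python
# All numbers are hypothetical; only the shape of the problem matters.
preferred_score = 2.0  # match in the user's preferred language
foreign_score = 9.0    # same word in another language, inflated by IDF

# Additive boosting: a fixed bonus for the preferred language.
additive_preferred = preferred_score + 3.0
additive_foreign = foreign_score + 0.0

# Multiplicative boosting: the preferred language is scaled up instead.
mult_preferred = preferred_score * 10.0
mult_foreign = foreign_score * 1.0


def product_of_boosts(boosts, floor=0.3):
    """Multiply boosts, replacing any 0 with a floor so that one
    non-matching boost query does not zero out the whole product."""
    result = 1.0
    for boost in boosts:
        result *= boost if boost > 0 else floor
    return result
```

With these numbers the additive bonus still leaves the foreign document on top, while the multiplicative boost reverses the order. The floor keeps a product such as product_of_boosts([2.0, 0.0, 1.5]) above zero.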

Thanks,
Markus

Re: Skewed IDF in multi lingual index

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Markus,

No answers, but I am very interested in what you find out.  We currently
index all languages in one index, which presents different IDF issues, but
are interested in exploring alternatives such as the one you describe.

Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

Re: Skewed IDF in multi lingual index

Posted by Robert Muir <rc...@gmail.com>.
Hi again Markus. Sorry for the slow reply here.

I'm confused: are you saying the score goes negative? Are you sure there are
no 3.x segments? Can you check that docCount is not -1? Do you happen to
have a test, or can you share your modified similarity or give more details?

I just want to make sure there isn't a bug in Lucene here (we verify this
statistic currently in CheckIndex and other places, but there is always the
possibility).

RE: Skewed IDF in multi lingual index

Posted by Markus Jelsma <ma...@openindex.io>.
I'd like to add that multiplicative boosting on very scarce properties, e.g. a boolean value that is set on only very few documents, causes a problem in scoring when using docCount instead of maxDoc. If docCount is one, IDF will be ~0.3, and with the fieldWeight you'll end up with a score below 0. Because of this the product of all multiplicative boosts will be lower than the product of similar boosts, lowering the document in rank instead of boosting it.
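
The ~0.3 figure can be reproduced with the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). The sketch below is plain Python arithmetic, not Lucene code. The maxDoc of one million and the other boost values are hypothetical, and using the IDF directly as the boost factor is a simplification that ignores fieldWeight and the other scoring factors.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic TF-IDF inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# A scarce property that is present in exactly one document.
idf_with_doc_count = classic_idf(doc_freq=1, num_docs=1)        # docCount == 1
idf_with_max_doc = classic_idf(doc_freq=1, num_docs=1_000_000)  # hypothetical maxDoc

# Hypothetical multiplicative boosts from other boost queries.
other_boosts = [1.5, 2.0]

# Any factor below 1 shrinks the product instead of raising it.
product_without_scarce = math.prod(other_boosts)
product_with_scarce = math.prod(other_boosts + [idf_with_doc_count])
```

Here idf_with_doc_count is 1 + ln(1/2), roughly 0.31, so multiplying it in lowers the product, whereas the maxDoc-based value is well above 1.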

RE: Skewed IDF in multi lingual index

Posted by Markus Jelsma <ma...@openindex.io>.
Robert, Tom,

That's it indeed! Using maxDoc as the numerator, as opposed to docCount, yields very skewed results for an unevenly distributed multilingual index. We have one language dominating the other twenty, so the dominating language contains no rare terms compared to the others.

We're now checking results using docCount and it seems alright. I do have to get used to the fact that document scores are now roughly 1000 times higher than before, but I'm already very happy with CollectionStatistics and will see if all works well.

Any other tips to share?

Thanks,
Markus

Re: Skewed IDF in multi lingual index

Posted by Robert Muir <rc...@gmail.com>.
Hi Markus: how are the languages distributed across documents?

Imagine I have a text_en field and a text_fr field. Let's say I have
100 documents, 95 are English and only 5 are French.
So the text_en field is populated 95% of the time, and the text_fr 5%
of the time.

But the default IDF computation doesn't look at things this way: it
always uses '100' as maxDoc. So in such a situation, any terms against
text_fr are "rare" :)

The first thing I would look at is treating this situation as merging
results from an English index with 95 docs and a French index with 5
docs.
So I would consider overriding the two idfExplain methods (term and
phrase) to use CollectionStatistics.docCount() instead of
CollectionStatistics.maxDoc().
The former would be 95 for the English field (instead of 100), and 5
for the French field (instead of 100).

I don't think this will solve all your problems, but it might help.

Note: you must ensure your index is fully upgraded to 4.0 to try this
statistic, otherwise it will return -1 if you have any 3.x segments in
your index.
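
The effect on the 95/5 example can be worked through numerically. This is a plain Python sketch of the arithmetic only, not a Lucene Similarity implementation. It assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)), and the document frequencies (a word found in 40% of each language's documents) are hypothetical.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic TF-IDF inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


MAX_DOC = 100                              # every document in the index
DOC_COUNT = {"text_en": 95, "text_fr": 5}  # documents that populate each field

# Hypothetical: an equally common word, found in 40% of each language's docs.
DOC_FREQ = {"text_en": 38, "text_fr": 2}

idf = {}
for field, doc_freq in DOC_FREQ.items():
    idf[field] = {
        "maxDoc": classic_idf(doc_freq, MAX_DOC),
        "docCount": classic_idf(doc_freq, DOC_COUNT[field]),
    }

for field, values in idf.items():
    print(f"{field}: maxDoc={values['maxDoc']:.2f} docCount={values['docCount']:.2f}")
```

With maxDoc as the numerator the French term looks far rarer than the English one even though both are equally common within their own language. With docCount the two values end up close together.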
