Posted to user@lucenenet.apache.org by Artem Chereisky <a....@gmail.com> on 2009/12/10 01:53:10 UTC

idf on per-field basis

Hi,

I came across a situation where my scores are adversely affected by the IDF
component. Let me explain.

My index documents contain a number of fields. For some, TF and IDF are
important and need to be taken into account; for others, neither TF nor IDF
should apply. I dealt with TF by omitting norms during indexing, but I can't
find a way to calculate IDF for certain fields only.

The formula for IDF is defined in Similarity. I have my own implementation
of Similarity where I can set it to 1 or use the default implementation.
mySearcher.SetSimilarity is where I tell Lucene which similarity instance to
use, but that's global, so it applies to all fields in the index.

So, here's my question. Is there a way to calculate IDF on a per-field basis?

Regards,
Art
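
To make the problem above concrete, here is a minimal sketch in plain Python. It is not Lucene.Net code. It assumes the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)), and every name and number in it is hypothetical. It shows that a single searcher-wide similarity either applies IDF to every field or to none.

```python
import math

def default_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF (assumed formula): rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def constant_idf(doc_freq: int, num_docs: int) -> float:
    """Override that switches IDF off by always returning 1."""
    return 1.0

def term_weights(idf, doc_freqs, num_docs):
    """Apply ONE idf function to every (field, term) pair,
    the way a single searcher-wide similarity would."""
    return {key: idf(df, num_docs) for key, df in doc_freqs.items()}

# Hypothetical document frequencies in a 1000-document index.
DOC_FREQS = {("body", "lucene"): 9, ("category", "books"): 499}

with_idf = term_weights(default_idf, DOC_FREQS, 1000)
without_idf = term_weights(constant_idf, DOC_FREQS, 1000)
```

With the default function the rare body term is weighted far above the common category term. With the constant function both weights collapse to 1, including the one for the field where IDF was wanted.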

RE: idf on per-field basis

Posted by Michael Garski <mg...@myspace-inc.com>.
Artem,

That's similar to how we manage our custom builds here - we grab a
tagged version and drop the whole thing into TFS.  With 2.9.1 we'll be
doing something similar along with providing patches to the tag with our
customizations should anyone else have use of them.

Michael

-----Original Message-----
From: Artem Chereisky [mailto:a.chereisky@gmail.com] 
Sent: Thursday, December 10, 2009 2:04 PM
To: lucene-net-user@incubator.apache.org
Subject: Re: idf on per-field basis

Michael,

QueryFilter is certainly the way to go for fields that don't require
scoring. Thanks for that.

Everyone,

Regarding making modifications to Lucene core and/or extending Lucene's
classes, what's the best practice for managing the changes?
I keep a Lucene repository under TortoiseSVN pointing to
https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_4_0.
Then every time I make a core change or extend a Lucene class, I copy the
files involved into a separate folder structure which is part of another SVN
repository. That way I can source control my changes. The process is a bit
cumbersome. Is there a better way?

Regards,
Artem



On Fri, Dec 11, 2009 at 5:09 AM, Michael Garski <mg...@myspace-inc.com> wrote:

> Artem,
>
> I've modified the internals of Lucene.Net to change scoring,
> specifically to be able to manually specify the length norm for a
> field, which allowed me to retain positional information while
> injecting multi-term synonyms, so I wouldn't worry too much about
> making a special build for yourself with a few changes.
>
> Would using a QueryFilter in conjunction with a query work?  The
> QueryFilter would be used on fields for which scoring information is
> not necessary, while the other fields would be queried with the
> specific query you need.
>
> Michael
>
> -----Original Message-----
> From: Artem Chereisky [mailto:a.chereisky@gmail.com]
> Sent: Thursday, December 10, 2009 1:40 AM
> To: lucene-net-user@incubator.apache.org
> Cc: <lu...@incubator.apache.org>
> Subject: Re: idf on per-field basis
>
> Michael, thank you.
>
> QueryFilter only solves half of my problem. Unfortunately I do need
> to have a proper score for some fields.
>
> I ended up extending the Term class (I removed the sealed modifier,
> which is a bad thing). The new myTerm class has one boolean member,
> omitIdf. Then, when I compile my queries, I use myTerm with omitIdf
> set to true for some fields. Then I extended the Similarity class, and
> I cast the Term passed into the Idf method to myTerm and only
> calculate IDF if omitIdf is false. Seems to work.
>
> I don't like the solution but that's the best I could do today.
>
> Any thoughts?
>
> Regards,
> Artem
>
>
> On 10/12/2009, at 15:51, Michael Garski <mg...@myspace-inc.com> wrote:
>
> > Artem,
> >
> > Do you need any scoring information at all on that field?  How about
> > using a QueryFilter for those fields?
> >
> > Michael
> >
> >
> > -----Original Message-----
> > From: Artem Chereisky [mailto:a.chereisky@gmail.com]
> > Sent: Wed 12/9/2009 4:53 PM
> > To: lucene-net-user@incubator.apache.org; lucene-net-developer@incubator.apache.org
> > Subject: idf on per-field basis
> >
> > Hi,
> >
> > I came across a situation where my scores are adversely affected by
> > the IDF component. Let me explain.
> >
> > My index documents contain a number of fields. For some, TF and IDF
> > are important and need to be taken into account; for others, neither
> > TF nor IDF should apply. I dealt with TF by omitting norms during
> > indexing, but I can't find a way to calculate IDF for certain fields
> > only.
> >
> > The formula for IDF is defined in Similarity. I have my own
> > implementation of Similarity where I can set it to 1 or use the
> > default implementation. mySearcher.SetSimilarity is where I tell
> > Lucene which similarity instance to use, but that's global, so it
> > applies to all fields in the index.
> >
> > So, here's my question. Is there a way to calculate IDF on a
> > per-field basis?
> >
> > Regards,
> > Art
> >
> >
>
>


Re: idf on per-field basis

Posted by Artem Chereisky <a....@gmail.com>.
Michael,

QueryFilter is certainly the way to go for fields that don't require
scoring. Thanks for that.

Everyone,

Regarding making modifications to Lucene core and/or extending Lucene's
classes, what's the best practice for managing the changes?
I keep a Lucene repository under TortoiseSVN pointing to
https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_4_0.
Then every time I make a core change or extend a Lucene class, I copy the
files involved into a separate folder structure which is part of another SVN
repository. That way I can source control my changes. The process is a bit
cumbersome. Is there a better way?

Regards,
Artem



RE: idf on per-field basis

Posted by Moray McConnachie <mm...@oxford-analytica.com>.
Michael wrote:

"so I wouldn't worry too much about making a special build for yourself
with a few changes."

We did this to fix a couple of bugs and add some functionality around
sorting a few versions back. It was absolutely fine, but it can be a bit
of a pain for maintainability, depending on how much time you have to
spend on Lucene and how much that area of the Lucene code base changes
in subsequent releases.

Yours,
Moray

------------------------------------- 
Moray McConnachie
Director of IT    +44 1865 261 600
Oxford Analytica  http://www.oxan.com


RE: idf on per-field basis

Posted by Michael Garski <mg...@myspace-inc.com>.
Artem,

I've modified the internals of Lucene.Net to change scoring, specifically to be able to manually specify the length norm for a field, which allowed me to retain positional information while injecting multi-term synonyms, so I wouldn't worry too much about making a special build for yourself with a few changes.

Would using a QueryFilter in conjunction with a query work?  The QueryFilter would be used on fields for which scoring information is not necessary, while the other fields would be queried with the specific query you need.

Michael
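
As a rough illustration of the filter-plus-query idea, here is a small plain-Python model. It does not use the Lucene.Net API. The toy index, the function names, and the simplified tf * idf score are all assumptions made for the sketch. The filter clause only decides whether a document is a candidate; only the query clause produces a score.

```python
import math

# Hypothetical toy index: each document maps field name -> list of terms.
DOCS = [
    {"category": ["books"], "body": ["lucene", "search", "lucene"]},
    {"category": ["books"], "body": ["cooking"]},
    {"category": ["music"], "body": ["lucene"]},
]

def passes_filter(doc, field, term):
    """Filter step: a yes/no test that never contributes to the score."""
    return term in doc.get(field, [])

def score(doc, field, term, docs):
    """Scoring step: simplified sqrt(tf) * idf on the scored field only."""
    freq = doc.get(field, []).count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for d in docs if term in d.get(field, []))
    idf = 1.0 + math.log(len(docs) / (doc_freq + 1))
    return math.sqrt(freq) * idf

def search(docs, filter_clause, query_clause):
    """Return (doc index, score) pairs for documents that pass the
    filter and match the query, best score first."""
    hits = []
    for i, doc in enumerate(docs):
        if not passes_filter(doc, *filter_clause):
            continue
        s = score(doc, *query_clause, docs)
        if s > 0:
            hits.append((i, s))
    return sorted(hits, key=lambda h: h[1], reverse=True)

hits = search(DOCS, ("category", "books"), ("body", "lucene"))
```

Here the category term restricts the candidates but its frequency and rarity never enter the score, which is the behaviour wanted for fields that should carry neither TF nor IDF.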


Re: idf on per-field basis

Posted by Artem Chereisky <a....@gmail.com>.
Michael, thank you.

QueryFilter only solves half of my problem. Unfortunately I do need
to have a proper score for some fields.

I ended up extending the Term class (I removed the sealed modifier, which
is a bad thing). The new myTerm class has one boolean member, omitIdf.
Then, when I compile my queries, I use myTerm with omitIdf set to
true for some fields. Then I extended the Similarity class, and I cast
the Term passed into the Idf method to myTerm and only calculate IDF if
omitIdf is false. Seems to work.

I don't like the solution but that's the best I could do today.

Any thoughts?

Regards,
Artem
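
The flagged-term workaround can be modelled roughly as below, in plain Python rather than against the real Lucene.Net classes. FlaggedTerm is a hypothetical stand-in for the myTerm subclass, and the IDF formula is the classic 1 + ln(numDocs / (docFreq + 1)), assumed here for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FlaggedTerm:
    """Hypothetical stand-in for a term that carries an omit-IDF flag."""
    field: str
    text: str
    omit_idf: bool = False

def idf(term: FlaggedTerm, doc_freq: int, num_docs: int) -> float:
    """Return 1 for flagged terms; otherwise the classic log-based IDF."""
    if term.omit_idf:
        return 1.0
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Terms built at query-compilation time; only some fields are flagged.
scored = FlaggedTerm("body", "lucene")
unscored = FlaggedTerm("category", "books", omit_idf=True)
```

The decision travels with each query term, so one similarity instance can serve every field. The cost, as noted in the message, is that the term type has to be made extensible first.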



RE: idf on per-field basis

Posted by Michael Garski <mg...@myspace-inc.com>.
Artem,

Do you need any scoring information at all on that field?  How about using a QueryFilter for those fields?

Michael

