Posted to java-user@lucene.apache.org by Rune Stilling <su...@rdfined.dk> on 2014/02/13 08:40:33 UTC

Adding custom weights to individual terms

Hi list

I’m trying to figure out how customizable scoring and weighting are in the Lucene API. I have read about the APIs but still can’t figure out whether the following is possible.

I would like to do normal document text indexing, but I would like to control the weights added to tokens myself. I would also like to control the weighting of query tokens and how things are added together.

When indexing a word I would like to attach my own weight to the word, and use these weights when querying for documents. For example:

Doc 1
Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99) API(0.3)

Doc 2
Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)

The floats in parentheses are values I would like to supply during the indexing process, not something derived from Lucene's tf-idf, for example.

When querying I would like to do the same: create the weights for each term myself and control how the final document score is calculated.

I have read that it’s possible to attach your own custom attributes to tokens. Is this the way to go? I.e., should I add my custom weights as attributes to tokens, and then access these attributes when calculating the document score in the search process (described here https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html under “adding a custom attribute”)?

The reason why I’m asking is that I can’t find any examples of this being done anywhere. But I found someone stating “With Lucene, it is impossible to increase or decrease the weight of individual terms in a document”.

With regards
Rune 

Re: Adding custom weights to individual terms

Posted by Rune Stilling <su...@rdfined.dk>.
Hi Lukai

That was a great help. Thank you.

I’ll continue by reading about payloads:

http://searchhub.org/2009/08/05/getting-started-with-payloads/

Didn’t know that concept at all.

Regards,
Rune

On 13/02/2014 at 23:12, lukai <lu...@gmail.com> wrote:


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Adding custom weights to individual terms

Posted by lukai <lu...@gmail.com>.
Hi Rune,

  Per your requirement, you can generate a separate field for each document
before sending the document to Lucene. Let's say its name is score_field.
The content of this field would look like this:

 Doc 1#score_field:
  Lucene:0.7 is:0 ...
 Doc 2#score_field:
  Lucene:0.5 is:0 ...

 Mark this field as "indexed" and the other fields as "stored". Store the
weight value as a payload on each term: wrap your analyzer so that it
consumes the weight value. Basically, you can combine
DelimitedPayloadTokenFilter and WhitespaceTokenizer into a basic analyzer
that accepts this input format. Make sure each term occurs only once per
document in score_field (according to your description this is already
fulfilled). You can also disable indexing of position information for this
field, because you don't need it.
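
As a rough illustration of the input format described above, the following plain-Java sketch splits "term:weight" tokens on whitespace and encodes each weight as a 4-byte float, which is roughly the work a WhitespaceTokenizer plus DelimitedPayloadTokenFilter chain would do at analysis time. It deliberately does not call Lucene. The names ScoreFieldParser, WeightedToken, parse, encodeFloat and decodeFloat, and the choice of ':' as delimiter, are assumptions made for this example. One caveat, as far as I know: stock Lucene stores payloads together with positions, so a field that carries payloads would probably need to keep positions enabled.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses "term:weight" tokens such as "Lucene:0.7 is:0". */
final class ScoreFieldParser {

    /** A term plus its weight encoded as a 4-byte payload. */
    static final class WeightedToken {
        final String term;
        final byte[] payload;

        WeightedToken(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Encodes a float as 4 big-endian bytes (ByteBuffer's default order). */
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Decodes a payload produced by encodeFloat. */
    static float decodeFloat(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /** Splits on whitespace, then on the last ':' of each token. */
    static List<WeightedToken> parse(String fieldContent) {
        List<WeightedToken> tokens = new ArrayList<>();
        for (String raw : fieldContent.trim().split("\\s+")) {
            int sep = raw.lastIndexOf(':');
            if (sep <= 0) {
                continue; // no delimiter or empty term: skipped in this sketch
            }
            String term = raw.substring(0, sep);
            float weight = Float.parseFloat(raw.substring(sep + 1));
            tokens.add(new WeightedToken(term, encodeFloat(weight)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (WeightedToken t : parse("Lucene:0.7 is:0 powerful:0.9")) {
            System.out.println(t.term + " -> " + decodeFloat(t.payload));
        }
    }
}
```

Keeping the weight as a fixed 4-byte value means the search side can decode it without knowing anything about the original text format.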

Then, at query time:
1. If you want a score like cosine similarity between the query and the
document, implement a query parser that parses the weights you assigned
to the individual terms in the query.
2. Create a new query type, customize your scoring function, and tell Lucene
to use your scorer.
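
The two steps above can be sketched without Lucene as follows. This is only an illustration under assumptions: the query syntax "term:weight" (a bare term counts as weight 1.0) and the names WeightedCosine, parseWeights and cosine are invented for the example, and a real implementation would live in a custom query and scorer rather than in plain maps.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: cosine similarity between weighted query terms and document terms. */
final class WeightedCosine {

    /** Parses "term:weight" pairs; a bare term gets weight 1.0. */
    static Map<String, Float> parseWeights(String text) {
        Map<String, Float> weights = new HashMap<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            int sep = raw.lastIndexOf(':');
            if (sep > 0) {
                weights.put(raw.substring(0, sep), Float.parseFloat(raw.substring(sep + 1)));
            } else {
                weights.put(raw, 1.0f);
            }
        }
        return weights;
    }

    /** Dot product over shared terms divided by the product of the vector lengths. */
    static double cosine(Map<String, Float> query, Map<String, Float> doc) {
        double dot = 0.0;
        for (Map.Entry<String, Float> e : query.entrySet()) {
            Float d = doc.get(e.getKey());
            if (d != null) {
                dot += e.getValue().doubleValue() * d;
            }
        }
        double norms = norm(query) * norm(doc);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Float> vector) {
        double sum = 0.0;
        for (float w : vector.values()) {
            sum += (double) w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Float> query = parseWeights("Lucene:1.0 search:0.5");
        Map<String, Float> doc = parseWeights("Lucene:0.7 powerful:0.9 search:0.99");
        System.out.println(cosine(query, doc));
    }
}
```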

  Here is a small snippet from a query type I created before. Basically,
you can follow this logic to manipulate your score value:

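    // Excerpt only: fields, fieldName, wandTerm, acceptDocs, totalDocNum, tts
    // and newWandPosting come from the surrounding query class (not shown).
    final Terms terms = fields.terms(fieldName);
    if (terms != null) {
      final TermsEnum termsEnum = terms.iterator(null);
      final BytesRef bytes = new BytesRef(wandTerm.queryTerm);

      // Position the enum on the query term; do nothing if it is absent.
      if (termsEnum.seekExact(bytes)) {
        // maxFeatureValue() is not a method of stock Lucene's TermsEnum;
        // presumably a custom extension returning an upper bound on the
        // weight stored for this term.
        float ub = termsEnum.maxFeatureValue();
        int docFreq = termsEnum.docFreq();
        // logger.warn("term:" + wandTerm.queryTerm + "   :" + ub);

        DocsAndPositionsEnum docsPositionEnum =
            termsEnum.docsAndPositions(acceptDocs, null);

        // The last argument is an IDF-like factor: (N + 1) / docFreq.
        tts.add(newWandPosting(fieldName, bytes, docsPositionEnum, ub,
            wandTerm.featureValue, (totalDocNum + 1) * 1.0f / docFreq));
      }
    }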



On Thu, Feb 13, 2014 at 10:49 AM, Rune Stilling <su...@rdfined.dk> wrote:

> I'm not sure how I would do that, given that Lucene is meant to use my
> custom weights to calculate document weights when executing a search query.
>
> Doc 1
> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
> API(0.3)
>
> Doc 2
> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>
> Query
> Lucene
>
> 0.7 and 0.5 are my custom weights and should be used to return Doc 1 with
> weight 0.7 and Doc 2 with weight 0.5 as the answer to my query.
>
> /Rune

Re: Adding custom weights to individual terms

Posted by Rune Stilling <su...@rdfined.dk>.
I’m not sure how I would do that, given that Lucene is meant to use my custom weights to calculate document weights when executing a search query.

Doc 1
Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99) API(0.3)

Doc 2
Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)

Query
Lucene

0.7 and 0.5 are my custom weights and should be used to return Doc 1 with weight 0.7 and Doc 2 with weight 0.5 as the answer to my query.
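
The behaviour described here can be pictured with a tiny in-memory inverted index, sketched below in plain Java. It is not Lucene code. The class CustomWeightIndex and its methods are hypothetical, and summing the stored weights for multi-term queries is an assumption, since the example above only covers a single-term query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory index: term -> (document -> custom weight). */
final class CustomWeightIndex {
    private final Map<String, Map<String, Float>> postings = new HashMap<>();

    /** Records the custom weight of a term within one document. */
    void add(String docId, String term, float weight) {
        postings.computeIfAbsent(term, t -> new HashMap<>()).put(docId, weight);
    }

    /**
     * Scores each document as the sum of the stored weights of the matching
     * query terms and returns the hits ordered by descending score.
     */
    List<Map.Entry<String, Float>> search(String... queryTerms) {
        Map<String, Float> scores = new HashMap<>();
        for (String term : queryTerms) {
            Map<String, Float> docs = postings.get(term);
            if (docs == null) {
                continue; // term not indexed
            }
            for (Map.Entry<String, Float> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Float::sum);
            }
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return ranked;
    }

    public static void main(String[] args) {
        CustomWeightIndex index = new CustomWeightIndex();
        index.add("Doc 1", "Lucene", 0.7f);
        index.add("Doc 1", "powerful", 0.9f);
        index.add("Doc 2", "Lucene", 0.5f);
        for (Map.Entry<String, Float> hit : index.search("Lucene")) {
            System.out.println(hit.getKey() + " " + hit.getValue());
        }
    }
}
```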

/Rune

On 13/02/2014 at 13:27, Shai Erera <se...@gmail.com> wrote:

> I often prefer to manage such weights outside the index. Managing them
> inside the index usually leads to problems later, e.g. when the weights
> change. If they are encoded in the index, a change means re-indexing. Also,
> if the weights change, some segments will hold different weights than
> others. I think that managing the weights in, e.g., a simple FST (which
> is very compact) will give you the best flexibility, and it's very easy
> to use.
> 
> Shai

Re: Adding custom weights to individual terms

Posted by Shai Erera <se...@gmail.com>.
I often prefer to manage such weights outside the index. Managing them
inside the index usually leads to problems later, e.g. when the weights
change. If they are encoded in the index, a change means re-indexing. Also,
if the weights change, some segments will hold different weights than
others. I think that managing the weights in, e.g., a simple FST (which
is very compact) will give you the best flexibility, and it's very easy
to use.
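
To make the idea of keeping weights outside the index more concrete, here is a minimal plain-Java sketch. A TreeMap stands in for the FST, purely for illustration, and none of this is Lucene's FST API. The class ExternalWeights, its methods and the composite key layout (term, a NUL separator, document id) are assumptions made for the example. The point it shows is that a weight can be changed without touching the index.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical weight store kept outside the index. */
final class ExternalWeights {
    // Sorted map standing in for a compact sorted structure such as an FST.
    private final TreeMap<String, Float> weights = new TreeMap<>();

    private static String key(String term, String docId) {
        return term + '\u0000' + docId;
    }

    /** Sets or replaces a weight; no re-indexing is involved. */
    void put(String term, String docId, float weight) {
        weights.put(key(term, docId), weight);
    }

    /** Returns the stored weight, or 0 if none was recorded. */
    float get(String term, String docId) {
        return weights.getOrDefault(key(term, docId), 0.0f);
    }

    /** Returns docId -> weight for one term via a range scan over the sorted keys. */
    Map<String, Float> docsFor(String term) {
        Map<String, Float> result = new TreeMap<>();
        String from = term + '\u0000';
        String to = term + '\u0001';
        for (Map.Entry<String, Float> e : weights.subMap(from, to).entrySet()) {
            result.put(e.getKey().substring(from.length()), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        ExternalWeights store = new ExternalWeights();
        store.put("Lucene", "Doc 1", 0.7f);
        store.put("Lucene", "Doc 2", 0.5f);
        store.put("Lucene", "Doc 1", 0.8f); // weight changed later
        System.out.println(store.docsFor("Lucene"));
    }
}
```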

Shai



Re: Adding custom weights to individual terms

Posted by Rune Stilling <ru...@rdfined.dk>.
On 13/02/2014 at 12:36, Michael McCandless <lu...@mikemccandless.com> wrote:

> You could stuff your custom weights into a payload, and index that,
> but this is per term per document per position, while it sounds like
> you just want one float for each term regardless of which
> documents/positions where that term occurred?

No, I want to store a weight per term per document. The point is that my custom term weight depends semantically on the document context, in exactly the same way as the other standard term weights do.

It doesn’t make sense to also have a separate weight per position.

> Doing your own custom attribute would be a challenge: not only must
> you create & set this attribute during indexing, but you then must
> change the indexing process (custom chain, custom codec) to get the
> new attribute into the index, and then make a custom query that can
> pull this attribute at search time.

Hmm, well, but would it solve my problem then?

> What are these term weights?  Are you sure you can't compute these
> weights at search time with a custom similarity using the stats that
> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?

Yes, I’m sure. I’m doing a semantic analysis of the documents before they are indexed, and it’s the result of this analysis that I want to store as a custom weight on a per-term, per-document basis. The docFreq etc. reflect a quite simple approach to term weighting (i.e. tf-idf), which just isn’t precise enough in my case.

So it seems I might as well build my own term lists and code the indexing and searching process manually?

With regards,
Rune

> Mike McCandless
> 
> http://blog.mikemccandless.com

Re: Adding custom weights to individual terms

Posted by Michael McCandless <lu...@mikemccandless.com>.
You could stuff your custom weights into a payload, and index that,
but this is per term per document per position, while it sounds like
you just want one float for each term, regardless of the
documents/positions where that term occurred?

Doing your own custom attribute would be a challenge: not only must
you create & set this attribute during indexing, but you then must
change the indexing process (custom chain, custom codec) to get the
new attribute into the index, and then make a custom query that can
pull this attribute at search time.

What are these term weights?  Are you sure you can't compute these
weights at search time with a custom similarity using the stats that
are already stored (docFreq, totalTermFreq, maxDoc, etc.)?
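
For readers unfamiliar with what "computing weights from the stored stats" might look like, here is a small plain-Java sketch of a TF-IDF-style weight built only from a term frequency, docFreq and maxDoc. The formulas (square root of the frequency, and one plus the natural log of maxDoc over docFreq plus one) follow the classic TF-IDF pattern and are an assumption for illustration. The class StatsWeight is hypothetical and is not a Lucene Similarity implementation.

```java
/** Hypothetical sketch: a TF-IDF-style weight from index statistics only. */
final class StatsWeight {

    /** Inverse document frequency: rarer terms get a larger factor. */
    static double idf(long docFreq, long maxDoc) {
        return 1.0 + Math.log((double) maxDoc / (docFreq + 1));
    }

    /** Dampened term frequency within one document. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Weight of a term in a document, computed at search time. */
    static double weight(int freq, long docFreq, long maxDoc) {
        return tf(freq) * idf(docFreq, maxDoc);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document and in 9 of 100 documents.
        System.out.println(weight(4, 9, 100));
    }
}
```

A weight of this kind depends only on counts, which is why it cannot express a value produced by a separate semantic analysis of each document.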

Mike McCandless

http://blog.mikemccandless.com

