Posted to java-user@lucene.apache.org by shyama <sh...@yahoo.com> on 2012/02/02 17:57:00 UTC

PayloadNearQuery and AveragePayloadFunction

Hi List,
Apologies for such a long message. I have tried to include everything that
you might need to know to answer my question.

I am having difficulty understanding what AveragePayloadFunction is doing.
Here is my example:

Title:Human|9 pineal|5 luteinizing hormone receptors.
Text:The presence of luteinizing hormone receptors in human|9 pineal|5
glands from five females and three males, ranging in age from 61-89 yr, was
examined by in situ hybridization and immunocytochemistry. The results
demonstrated the presence of these receptors at the mRNA|7 and protein
levels in all the pineal|5 glands examined. Pineal|5 gland luteinizing
hormone receptors could potentially be involved in the regulation of
melatonin|7 synthesis.

3 is for class A
5 is for class B
7 is for class C
9 is for class D
These are the payloads stored in the index. At search time I use these
values only to identify a term's class, and then return 3 for terms in the
selected class.

I am using WhitespaceTokenizer and LowerCaseFilter. In my Similarity class
(PayloadSearchSimilarity, shown below), I map the payload so that, if I am
interested in class A, scorePayload returns 3 only for terms in class A. I
determine a term's class by checking its stored payload value.

Now, I query for "luteinizing hormone" using PayloadNearQuery with a slop of
5. First I try with interest in class B and next with interest in class A.

*Result of Class A interest:*

Explain: 10.97332 = (MATCH) sum of:
  2.5589073 = (MATCH) weight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true) in 5362133), product of:
    0.68000716 = queryWeight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true)), product of:
      14.045828 = idf(AbstractText:  luteinizing=15481 hormone=164637)
      0.048413463 = queryNorm
    3.7630591 = (MATCH) fieldWeight(AbstractText:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      2.4494898 = PayloadNearQuery, product of:
        0.8164966 = tf(phraseFreq=0.6666667)
        *3.0 = AveragePayloadFunction(...)*
      14.045828 = idf(AbstractText:  luteinizing=15481 hormone=164637)
      0.109375 = fieldNorm(field=AbstractText, doc=5362133)
  8.4144125 = (MATCH) weight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true) in 5362133), product of:
    0.7332054 = queryWeight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true)), product of:
      15.144659 = idf(ArticleTitle:  hormone=86980 luteinizing=9765)
      0.048413463 = queryNorm
    11.476201 = (MATCH) fieldWeight(ArticleTitle:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      1.7320508 = PayloadNearQuery, product of:
        0.57735026 = tf(phraseFreq=0.33333334)
        *3.0 = AveragePayloadFunction(...)*
      15.144659 = idf(ArticleTitle:  hormone=86980 luteinizing=9765)
      0.4375 = fieldNorm(field=ArticleTitle, doc=5362133)
---------------------------------------------------------------------

*Result of Class B Interest:*

Explain: 3.657773 = (MATCH) sum of:
  0.85296905 = (MATCH) weight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true) in 5362133), product of:
    0.68000716 = queryWeight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true)), product of:
      14.045828 = idf(AbstractText:  luteinizing=15481 hormone=164637)
      0.048413463 = queryNorm
    1.254353 = (MATCH) fieldWeight(AbstractText:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      0.8164966 = PayloadNearQuery, product of:
        0.8164966 = tf(phraseFreq=0.6666667)
        *1.0 = AveragePayloadFunction(...)*
      14.045828 = idf(AbstractText:  luteinizing=15481 hormone=164637)
      0.109375 = fieldNorm(field=AbstractText, doc=5362133)
  2.804804 = (MATCH) weight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true) in 5362133), product of:
    0.7332054 = queryWeight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true)), product of:
      15.144659 = idf(ArticleTitle:  hormone=86980 luteinizing=9765)
      0.048413463 = queryNorm
    3.8254004 = (MATCH) fieldWeight(ArticleTitle:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      0.57735026 = PayloadNearQuery, product of:
        0.57735026 = tf(phraseFreq=0.33333334)
        *1.0 = AveragePayloadFunction(...)*
      15.144659 = idf(ArticleTitle:  hormone=86980 luteinizing=9765)
      0.4375 = fieldNorm(field=ArticleTitle, doc=5362133)
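
The numbers in the explain output can be reproduced by hand, which makes it
easier to see that the only factor that differs between the two runs is the
AveragePayloadFunction value. The sketch below is plain Java arithmetic, not
Lucene code; the class and method names are made up for illustration. It
assumes tf is the square root of phraseFreq, which matches the tf lines in
the output.

```java
public class ExplainArithmetic {

    // "PayloadNearQuery, product of": tf(phraseFreq) * average payload.
    // Assumes tf = sqrt(phraseFreq), as the tf lines in the output suggest.
    public static double payloadNearScore(double phraseFreq, double avgPayload) {
        return Math.sqrt(phraseFreq) * avgPayload;
    }

    // "fieldWeight, product of": PayloadNearQuery score * idf * fieldNorm.
    public static double fieldWeight(double phraseFreq, double avgPayload,
                                     double idf, double fieldNorm) {
        return payloadNearScore(phraseFreq, avgPayload) * idf * fieldNorm;
    }

    // "weight, product of": queryWeight (idf * queryNorm) * fieldWeight.
    public static double weight(double idf, double queryNorm, double fieldWeight) {
        return idf * queryNorm * fieldWeight;
    }

    public static void main(String[] args) {
        double queryNorm = 0.048413463;

        // Class A run, AbstractText field (average payload 3.0).
        double abstractFw = fieldWeight(0.6666667, 3.0, 14.045828, 0.109375);
        double abstractW = weight(14.045828, queryNorm, abstractFw);

        // Class A run, ArticleTitle field (average payload 3.0).
        double titleFw = fieldWeight(0.33333334, 3.0, 15.144659, 0.4375);
        double titleW = weight(15.144659, queryNorm, titleFw);

        // These should agree with the explain output up to float rounding.
        System.out.println("AbstractText fieldWeight: " + abstractFw);
        System.out.println("AbstractText weight:      " + abstractW);
        System.out.println("ArticleTitle fieldWeight: " + titleFw);
        System.out.println("ArticleTitle weight:      " + titleW);
        System.out.println("Document score:           " + (abstractW + titleW));
    }
}
```

Passing 1.0 instead of 3.0 as the average payload gives the class B figures,
so the threefold difference between the two document scores comes entirely
from the payload factor.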

As I understand it, I should get 1 from AveragePayloadFunction when I am
interested in class A, because there is no class A term in the text, so every
term scores 1. When I am interested in class B, there is one class B term
(pineal|5) in the "Title" field, so AveragePayloadFunction should return 3.
The explain output shows the opposite.

I do not understand what is going on. Maybe I am misunderstanding what
AveragePayloadFunction does exactly.

My similarity class is as follows:

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSearchSimilarity extends DefaultSimilarity {

	private static final long serialVersionUID = 1L;

	// Class of interest ("A", "B", "C" or "D"); set before each search.
	public static String semantic;

	@Override
	public float scorePayload(int docId, String fieldName, int start, int end,
			byte[] bytes, int offset, int length) {
		if (bytes == null) {
			// No payload stored at this position.
			return 1;
		}
		float payload = PayloadHelper.decodeFloat(bytes, offset);
		// I am now returning the same payload for all semantic types so that
		// we can compare the scores. It was changed after we showed it to
		// Dietrich.
		if (("A".equals(semantic) && payload == 3)
				|| ("B".equals(semantic) && payload == 5)
				|| ("C".equals(semantic) && payload == 7)
				|| ("D".equals(semantic) && payload == 9)) {
			return 3;
		}
		// The term's class does not match the selected semantic class.
		return 1;
	}
}

I am really puzzled. It would be really helpful if someone could help.

I look forward to hearing from you.
Many Thanks
Shyama

--
View this message in context: http://lucene.472066.n3.nabble.com/PayloadNearQuery-and-AveragePayloadFunction-tp3710454p3710454.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: PayloadNearQuery and AveragePayloadFunction

Posted by Peter Keegan <pe...@gmail.com>.
All term queries, including payload queries, deal only with words from the
query that exist in a document. They don't know what other terms are in a
matching document, due to the inverted nature of the index.
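
A toy inverted index makes this point concrete. The sketch below is not
Lucene code; the class and method names are invented for illustration, and
payloads are left out. It only shows that a lookup is keyed by term, so a
query for "hormone" reads the postings for "hormone" and nothing else in the
document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // term -> (docId -> positions of that term in the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Lowercases the text, splits it on whitespace and records each position.
    public void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Returns the postings of one term only; other terms are never visited.
    public Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "human pineal luteinizing hormone receptors");
        index.add(1, "melatonin synthesis");

        System.out.println(index.lookup("luteinizing")); // prints {0=[2]}
        System.out.println(index.lookup("hormone"));     // prints {0=[3]}
    }
}
```

Nothing in either lookup reveals that document 0 also contains "pineal"; to
learn that, the "pineal" postings would have to be read separately.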

Peter

On Fri, Feb 3, 2012 at 11:50 AM, shyama <sh...@yahoo.com> wrote:

> I still wonder whether AveragePayloadFunction considers only query terms
> for "payloads seen so far", or all terms in the current field of the
> current document. I will check this. In my previous testing I found that it
> only considers query terms.

Re: PayloadNearQuery and AveragePayloadFunction

Posted by shyama <sh...@yahoo.com>.
Hi Peter
Thanks for your reply.
I guess I found the problem. 

The scorePayload function is only called for query terms. The problem was
that, when I was retrieving payloads for each token in the token stream, I
got misleading payloads because I did not skip the TermPositions entries that
do not belong to the current document.

I still wonder whether AveragePayloadFunction considers only query terms for
"payloads seen so far", or all terms in the current field of the current
document. I will check this. In my previous testing I found that it only
considers query terms.

Thanks again.
Shyama

--
View this message in context: http://lucene.472066.n3.nabble.com/PayloadNearQuery-and-AveragePayloadFunction-tp3710454p3713653.html


Re: PayloadNearQuery and AveragePayloadFunction

Posted by Peter Keegan <pe...@gmail.com>.
AveragePayloadFunction is just what it sounds like:
return numPayloadsSeen > 0 ? (payloadScore / numPayloadsSeen) : 1;
What values are you seeing returned from PayloadHelper.decodeFloat?
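
To see that expression in isolation, here is a small standalone model. It is
a sketch and not the Lucene class: only the final return expression is taken
from the line quoted above, while the class name, the method and the way the
scores are accumulated in a loop are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AveragePayloadSketch {

    // payloadScores: one scorePayload() result per payload that was seen.
    public static float averagePayload(List<Float> payloadScores) {
        float payloadScore = 0f;  // running sum of the payload scores
        int numPayloadsSeen = 0;  // how many payloads contributed to the sum
        for (float score : payloadScores) {
            payloadScore += score;
            numPayloadsSeen++;
        }
        // Same expression as quoted above: average, or 1 if nothing was seen.
        return numPayloadsSeen > 0 ? (payloadScore / numPayloadsSeen) : 1;
    }

    public static void main(String[] args) {
        // Both payloads scored 3 by scorePayload.
        System.out.println(averagePayload(Arrays.asList(3f, 3f)));            // prints 3.0
        // One payload scored 3, the other 1.
        System.out.println(averagePayload(Arrays.asList(3f, 1f)));            // prints 2.0
        // No payloads seen at all.
        System.out.println(averagePayload(Collections.<Float>emptyList())); // prints 1.0
    }
}
```

A result of exactly 1.0 is therefore ambiguous on its own: it can mean that
every payload scored 1, or that no payload was seen at all.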

Peter

On Fri, Feb 3, 2012 at 4:13 AM, shyama <sh...@yahoo.com> wrote:

> I hope I have now made it clear why I do not understand the score returned
> from AveragePayloadFunction.

Re: PayloadNearQuery and AveragePayloadFunction

Posted by shyama <sh...@yahoo.com>.
Hi Peter
I have checked the payloads associated with the terms, and they are fine in
the index. I believe I was not clear enough. When I say I am interested in
class A, my scorePayload function returns 3 only for class A terms. Likewise,
when I say I am interested in class B, my scorePayload function returns 3
only for class B terms. These searches are done separately. I mean they run
on the same index, but each time I search, I set the semantic in my
Similarity class.

I am actually trying to do semantic ranking of documents, so that Lucene
ranks highly those documents that contain the query terms and also have more
terms from that semantic class.

I hope I have now made it clear why I do not understand the score returned
from AveragePayloadFunction.

I hope to hear some more explanation.

Many Thanks
Shyama

--
View this message in context: http://lucene.472066.n3.nabble.com/PayloadNearQuery-and-AveragePayloadFunction-tp3710454p3712509.html


Re: PayloadNearQuery and AveragePayloadFunction

Posted by Peter Keegan <pe...@gmail.com>.
I don't quite follow what you're doing, but is it possible that your
payloads are not on the desired terms when you indexed them? The first
explanation shows that the matching document contained "luteinizing
hormone" in both fields 'AbstractText' and 'ArticleTitle'. The average
payload value was '3.0', so either both terms had payloads that averaged
3.0 or only one had a payload of 3.0. In the 2nd query, the phrase was
found in both fields again, but no payloads were found (thus the 1.0).
According to your 'scorePayload' method, the first match would return 3
only if semantic=A. But the Similarity class is associated with an
IndexReader, so the same 'semantic' would be used for all queries.
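
One way to avoid a 'semantic' value that is shared by every query is to make
it per-instance state rather than a static field. The sketch below is plain
Java with invented names (SemanticBoostSketch, score); it mirrors the
class-to-payload mapping from the original message but does not extend any
Lucene class. Whether a separate instance can be supplied per search depends
on how the Similarity is wired up in the application, which is not shown
here.

```java
import java.util.HashMap;
import java.util.Map;

public class SemanticBoostSketch {

    // Stored payload value for each class, as listed in the original message.
    private static final Map<String, Float> CLASS_PAYLOADS = new HashMap<>();
    static {
        CLASS_PAYLOADS.put("A", 3f);
        CLASS_PAYLOADS.put("B", 5f);
        CLASS_PAYLOADS.put("C", 7f);
        CLASS_PAYLOADS.put("D", 9f);
    }

    // Fixed at construction, so two instances cannot affect each other.
    private final String semantic;

    public SemanticBoostSketch(String semantic) {
        if (!CLASS_PAYLOADS.containsKey(semantic)) {
            throw new IllegalArgumentException("Unknown class: " + semantic);
        }
        this.semantic = semantic;
    }

    // Returns 3 if the payload marks a term of the selected class, else 1.
    // A null payload stands for "no payload stored at this position".
    public float score(Float payload) {
        if (payload == null) {
            return 1f;
        }
        float wanted = CLASS_PAYLOADS.get(semantic);
        return payload.floatValue() == wanted ? 3f : 1f;
    }

    public static void main(String[] args) {
        SemanticBoostSketch classA = new SemanticBoostSketch("A");
        SemanticBoostSketch classB = new SemanticBoostSketch("B");

        // "pineal" carries payload 5 (class B) in the example document.
        System.out.println(classA.score(5f));   // prints 1.0
        System.out.println(classB.score(5f));   // prints 3.0
        System.out.println(classB.score(null)); // prints 1.0
    }
}
```

With this shape, the class A and class B searches each hold their own
scorer, so setting up one search cannot change the other's result.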

Peter
