Posted to solr-dev@lucene.apache.org by Tricia Williams <wi...@gmail.com> on 2008/04/04 00:58:15 UTC

tweak to analysis.jsp for payload

Hi,

    I think that displaying the payload (if one exists) of each token in
analysis.jsp would be beneficial.  My simple solution was to add a
row to the existing table, convert the Payload byte array to a String,
and simply print the results.  I opened SOLR-522 to this effect.

    There is a PayloadHelper class in Lucene that has decode/encode
float and int methods.  Any ideas on how Payloads might be uniformly
decoded into something readable/debuggable from the GUI?  I think bytes
to String will give enough of a clue to be helpful.
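
To illustrate why a uniform decoding is hard, here is a minimal, self-contained sketch of round-tripping a float or an int through a 4-byte payload. It does not call PayloadHelper; the class name PayloadNumbers and the big-endian byte order are assumptions made for this example, and a real payload may be laid out differently.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class PayloadNumbers {
    // Encode a float as 4 bytes (ByteBuffer is big-endian by default).
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    // Decode the first 4 bytes as a big-endian float.
    static float decodeFloat(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getFloat();
    }

    // Encode an int as 4 big-endian bytes.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Decode the first 4 bytes as a big-endian int.
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }
}
```

The same four bytes give a different value depending on which decode method is called, so display code cannot pick the right one without knowing what the application stored.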

Tricia


Re: tweak to analysis.jsp for payload

Posted by Grant Ingersoll <gs...@apache.org>.
As the guy who wrote PayloadHelper, what I really imagined was using
Lucene's VInt encoding and the like, but that would have taken a bit
more refactoring.  It can be handy for some payloads, but it is still
on the app developer to know what was put in the payload.  What this
means in terms of Solr is still up in the air.  No one has worked
through what adding payloads means yet.


On Apr 4, 2008, at 8:48 PM, Chris Hostetter wrote:

>
> :    There is a PayloadHelper class in Lucene that has decode/encode
> : float and int methods.  Any ideas on how Payloads might be uniformly
> : decoded into something readable/debuggable from the GUI?  I think
> : bytes to String will give enough of a clue to be helpful.
>
> I've never really looked at PayloadHelper, but if I were tasked with
> trying to find a way to display in HTML an arbitrary byte[] that may
> or may not be a String, I would start by attempting a String
> conversion.  If that succeeds *and* all chars in the resulting String
> are "printable" (i.e., Character.isDefined(c) &&
> !Character.isISOControl(c)), then display the first N chars (where N
> is some reasonable max size to display) ... if not, then just display
> the first N characters of the hex string representing the byte[].
>
> It might be overkill, but the other possibility would be to add a
> <payloadInspector> config option to <fieldType> ... it could be a
> class used solely for debugging purposes, and could be declared at
> arbitrary points in the <tokenfilter> chain (indicating that from this
> point on, this is how to display the payload) or completely outside of
> the <analyzer> when using standalone Analyzers (or when the payload
> structure is identical for the entire <tokenfilter> chain).
>
>
> -Hoss
>


Re: tweak to analysis.jsp for payload

Posted by Chris Hostetter <ho...@fucit.org>.
:    There is a PayloadHelper class in Lucene that has decode/encode float and
: int methods.  Any ideas on how Payloads might be uniformly decoded into
: something readable/debuggable from the GUI?  I think bytes to String will give
: enough of a clue to be helpful.

I've never really looked at PayloadHelper, but if I were tasked with
trying to find a way to display in HTML an arbitrary byte[] that may or
may not be a String, I would start by attempting a String conversion.  If
that succeeds *and* all chars in the resulting String are "printable"
(i.e., Character.isDefined(c) && !Character.isISOControl(c)), then display
the first N chars (where N is some reasonable max size to display) ... if
not, then just display the first N characters of the hex string
representing the byte[].
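
A minimal sketch of that fallback logic follows, under some assumptions the description leaves open: the class name PayloadPreview is hypothetical, UTF-8 is assumed as the charset for the String attempt, and "conversion succeeds" is taken to mean a strict decode that rejects malformed bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name, UTF-8, and maxChars are assumptions.
final class PayloadPreview {
    // Returns the text if it decodes and is printable, otherwise hex;
    // either way at most maxChars characters.
    static String preview(byte[] payload, int maxChars) {
        String text = tryDecode(payload);
        if (text != null && isPrintable(text)) {
            return truncate(text, maxChars);
        }
        return truncate(toHex(payload), maxChars);
    }

    // Strict UTF-8 decode: returns null on malformed input instead of
    // silently substituting replacement characters.
    private static String tryDecode(byte[] payload) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(payload))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // The "printable" test from the description above.
    private static boolean isPrintable(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isDefined(c) || Character.isISOControl(c)) {
                return false;
            }
        }
        return true;
    }

    // Two lowercase hex digits per byte.
    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    private static String truncate(String s, int max) {
        return s.length() <= max ? s : s.substring(0, max);
    }
}
```

Truncation here counts Java chars, so a cut can land in the middle of a surrogate pair; a production version would probably want to truncate by code point.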

It might be overkill, but the other possibility would be to add a
<payloadInspector> config option to <fieldType> ... it could be a class
used solely for debugging purposes, and could be declared at arbitrary
points in the <tokenfilter> chain (indicating that from this point on,
this is how to display the payload) or completely outside of the
<analyzer> when using standalone Analyzers (or when the payload structure
is identical for the entire <tokenfilter> chain).


-Hoss


Re: tweak to analysis.jsp for payload

Posted by Yonik Seeley <yo...@apache.org>.
As a useful first step for debugging purposes, it seems like the full
hex of the raw bytes should always be output.  If the bytes look like
ASCII, the text could be added in parens.
Example: 636f6f6c (cool)
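
A minimal sketch of that output format, assuming "seems to be ASCII" means every byte falls in the printable 7-bit range; the class name PayloadHex is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name and the printable-range test are assumptions.
final class PayloadHex {
    // Always emits the full hex; appends the text in parens only when
    // every byte is printable ASCII.
    static String format(byte[] payload) {
        StringBuilder hex = new StringBuilder(payload.length * 2);
        boolean ascii = payload.length > 0;
        for (byte b : payload) {
            int v = b & 0xff;
            hex.append(Character.forDigit(v >> 4, 16));
            hex.append(Character.forDigit(v & 0x0f, 16));
            // Printable 7-bit ASCII runs from 0x20 (space) to 0x7e (~).
            if (v < 0x20 || v > 0x7e) {
                ascii = false;
            }
        }
        if (!ascii) {
            return hex.toString();
        }
        String text = new String(payload, StandardCharsets.US_ASCII);
        return hex + " (" + text + ")";
    }
}
```

With this sketch, the bytes of "cool" come out as 636f6f6c (cool), while a payload containing any non-printable byte is shown as hex only.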

This can be changed later as payloads gain the ability to be
introspected more fully by Solr.

-Yonik

Re: tweak to analysis.jsp for payload

Posted by Grant Ingersoll <gs...@apache.org>.
I don't know just yet that the AnalysisRequestHandler (ARH) is going to
replace analysis.jsp.  The JSP page does things that the ARH doesn't,
specifically, handling the output after every token filter.  In my
mind, the ARH is useful as a Token server for things like machine
learning (e.g., Mahout :-)  ) and/or other applications that just
need the final tokens of a document.  I think the response would
get pretty ugly looking if it were to try to serve up the intermediate
tokens.  In other words, I have no intention of working on it, but if
someone else comes up w/ a useful way of doing it, then I wouldn't try
to stop it, either.

It might be useful to define a mechanism whereby one can plug a
Payload decoder into Solr that could be used by analysis.jsp.  This
would give applications a means to make sense of payloads and have
them attached to tokens.
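
One possible shape for such a mechanism, as a rough sketch only: Solr has no such interface, and PayloadDecoder, PayloadDecoders, and the idea of keying decoders by field type name are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Everything here is hypothetical; Solr has no such interface or registry.
interface PayloadDecoder {
    // Turn the raw payload bytes into something a human can read.
    String decode(byte[] payload, int offset, int length);
}

final class PayloadDecoders {
    private final Map<String, PayloadDecoder> byFieldType = new HashMap<>();
    private final PayloadDecoder fallback;

    // The fallback is used for field types with no registered decoder.
    PayloadDecoders(PayloadDecoder fallback) {
        this.fallback = fallback;
    }

    void register(String fieldType, PayloadDecoder decoder) {
        byFieldType.put(fieldType, decoder);
    }

    // What a display page could call for each token's payload.
    String display(String fieldType, byte[] payload) {
        PayloadDecoder decoder = byFieldType.getOrDefault(fieldType, fallback);
        return decoder.decode(payload, 0, payload.length);
    }
}
```

An application that stores a float weight could register a decoder that prints the float, while everything else falls back to a generic rendering such as hex.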

-Grant

On Apr 6, 2008, at 1:59 AM, Tricia Williams wrote:

> Replies to several comments in this thread inline:
>
> Grant Ingersoll wrote:
>> Yes, that is definitely the case, but I think Tricia was more
>> getting at how to use them for display, i.e., deserializing them into
>> a String or whatever.  I still have on my plate that I want to
>> figure out how to incorporate payloads with SpanQuery as that is
>> the logical means of getting at them query-wise.
>>
>> -Grant
>>
>
> Grant is right that my intention is to visualize the Payloads in the
> same way that analysis.jsp allows users to visualize what
> TokenFilters are doing to the position, term text, token type, and
> start and end offsets.  This would be a crude way to debug or demo
> what your payload-savvy TokenFilter/Tokenizer does to a given
> TokenStream.
>
> I went through the JIRA issues trying to figure out what was being  
> done with Payloads to see if this would help clarify my display  
> problem.  I came across Grant's AnalysisRequestHandler which looks  
> like its intent is to replace analysis.jsp at some point.  It looks  
> like two short months ago the call on including Payloads was to  
> punt, "since Solr doesn't currently support payloads, not much point  
> in outputting them just yet."  I guess that is what he was trying to  
> tell me in this thread too.
>
> Grant Ingersoll wrote:
>> As the guy who wrote PayloadHelper, what I really imagined was
>> using Lucene's VInt encoding and the like, but that would have
>> taken a bit more refactoring.  It can be handy for some payloads,
>> but it is still on the app developer to know what was put in the
>> payload.  What this means in terms of Solr is still up in the air.
>> No one has worked through what adding payloads means yet.
>
> Would it be completely ignorant of me to suggest that an abstraction  
> of Payload contain a public decode() method with an Object as a  
> return type?  Or maybe Payload's toString should be overridden to  
> provide a string representation for display -- possibly doing  
> something like Hoss described?
>
> Chris Hostetter wrote:
>> I've never really looked at PayloadHelper, but if I were tasked
>> with trying to find a way to display in HTML an arbitrary byte[]
>> that may or may not be a String, I would start by attempting a
>> String conversion.  If that succeeds *and* all chars in the resulting
>> String are "printable" (i.e., Character.isDefined(c) &&
>> !Character.isISOControl(c)), then display the first N chars (where N
>> is some reasonable max size to display) ... if not, then just
>> display the first N characters of the hex string representing the
>> byte[].
>
> Thanks for the feedback.  It is always appreciated!
>
> Tricia


Re: tweak to analysis.jsp for payload

Posted by Tricia Williams <wi...@gmail.com>.
Replies to several comments in this thread inline:

Grant Ingersoll wrote:
> Yes, that is definitely the case, but I think Tricia was more getting
> at how to use them for display, i.e., deserializing them into a String
> or whatever.  I still have on my plate that I want to figure out how
> to incorporate payloads with SpanQuery as that is the logical means of
> getting at them query-wise.
>
> -Grant
>

Grant is right that my intention is to visualize the Payloads in the
same way that analysis.jsp allows users to visualize what TokenFilters
are doing to the position, term text, token type, and start and end
offsets.  This would be a crude way to debug or demo what your
payload-savvy TokenFilter/Tokenizer does to a given TokenStream.

I went through the JIRA issues trying to figure out what was being done 
with Payloads to see if this would help clarify my display problem.  I 
came across Grant's AnalysisRequestHandler which looks like its intent 
is to replace analysis.jsp at some point.  It looks like two short 
months ago the call on including Payloads was to punt, "since Solr 
doesn't currently support payloads, not much point in outputting them 
just yet."  I guess that is what he was trying to tell me in this thread 
too.

Grant Ingersoll wrote:
> As the guy who wrote PayloadHelper, what I really imagined was using
> Lucene's VInt encoding and the like, but that would have taken a bit
> more refactoring.  It can be handy for some payloads, but it is still
> on the app developer to know what was put in the payload.  What this
> means in terms of Solr is still up in the air.  No one has worked
> through what adding payloads means yet.

Would it be completely ignorant of me to suggest that an abstraction of 
Payload contain a public decode() method with an Object as a return 
type?  Or maybe Payload's toString should be overridden to provide a 
string representation for display -- possibly doing something like Hoss 
described?

Chris Hostetter wrote:
> I've never really looked at PayloadHelper, but if I were tasked with
> trying to find a way to display in HTML an arbitrary byte[] that may or
> may not be a String, I would start by attempting a String conversion.  If
> that succeeds *and* all chars in the resulting String are "printable"
> (i.e., Character.isDefined(c) && !Character.isISOControl(c)), then display
> the first N chars (where N is some reasonable max size to display) ... if
> not, then just display the first N characters of the hex string
> representing the byte[].

Thanks for the feedback.  It is always appreciated!

Tricia

Re: tweak to analysis.jsp for payload

Posted by Grant Ingersoll <gs...@apache.org>.
Yes, that is definitely the case, but I think Tricia was more getting
at how to use them for display, i.e., deserializing them into a String
or whatever.  I still have on my plate that I want to figure out how
to incorporate payloads with SpanQuery as that is the logical means of
getting at them query-wise.

-Grant

On Apr 5, 2008, at 4:51 AM, Mike Klaas wrote:

> On 3-Apr-08, at 3:58 PM, Tricia Williams wrote:
>> Hi,
>>
>>  I think that displaying the payload (if one exists) of each token
>> in analysis.jsp would be beneficial.  My simple solution was to
>> add a row to the existing table, convert the Payload byte array to
>> a String, and simply print the results.  I opened SOLR-522 to this
>> effect.
>>  There is a PayloadHelper class in Lucene that has decode/encode
>> float and int methods.  Any ideas on how Payloads might be
>> uniformly decoded into something readable/debuggable from the GUI?
>> I think bytes to String will give enough of a clue to be helpful.
>
> Similarity.scorePayload(), if defined, should be the commonly-used  
> method (at least, that's what I do):
>
>  public float scorePayload(byte [] payload, int offset, int length) {
>    assert length == 4;
>    int accum = ((payload[0+offset]&0xff)) |
>                ((payload[1+offset]&0xff)<<8) |
>                ((payload[2+offset]&0xff)<<16)  |
>                ((payload[3+offset]&0xff)<<24);
>
>    return Float.intBitsToFloat(accum);
> }
>
> -Mike


Re: tweak to analysis.jsp for payload

Posted by Mike Klaas <mi...@gmail.com>.
On 3-Apr-08, at 3:58 PM, Tricia Williams wrote:
> Hi,
>
>   I think that displaying the payload (if one exists) of each token
> in analysis.jsp would be beneficial.  My simple solution was to
> add a row to the existing table, convert the Payload byte array to a
> String, and simply print the results.  I opened SOLR-522 to this
> effect.
>   There is a PayloadHelper class in Lucene that has decode/encode
> float and int methods.  Any ideas on how Payloads might be uniformly
> decoded into something readable/debuggable from the GUI?  I think
> bytes to String will give enough of a clue to be helpful.

Similarity.scorePayload(), if defined, should be the commonly-used  
method (at least, that's what I do):

   public float scorePayload(byte [] payload, int offset, int length) {
     // The payload is expected to hold exactly one 4-byte float.
     assert length == 4;
     // Reassemble the int bits, least-significant byte first.
     int accum = ((payload[0+offset]&0xff)) |
                 ((payload[1+offset]&0xff)<<8) |
                 ((payload[2+offset]&0xff)<<16)  |
                 ((payload[3+offset]&0xff)<<24);

     return Float.intBitsToFloat(accum);
   }

-Mike
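
The scorePayload method above reads the four bytes least-significant byte first, so whatever writes the payload has to use the same order. A minimal sketch of a matching encoder and a round trip follows; the class name LittleEndianFloatPayload is hypothetical, and the decode logic is restated here as a static method without the Similarity base class so the example stands alone.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical standalone class; not part of Lucene or Solr.
final class LittleEndianFloatPayload {
    // Write the float's bits least-significant byte first.
    static byte[] encode(float value) {
        return ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putFloat(value)
                .array();
    }

    // Same bit reassembly as the scorePayload method above.
    static float decode(byte[] payload, int offset, int length) {
        if (length != 4) {
            throw new IllegalArgumentException("expected 4 bytes, got " + length);
        }
        int accum = (payload[offset] & 0xff)
                | ((payload[offset + 1] & 0xff) << 8)
                | ((payload[offset + 2] & 0xff) << 16)
                | ((payload[offset + 3] & 0xff) << 24);
        return Float.intBitsToFloat(accum);
    }
}
```

A payload written in the opposite byte order would decode to a different, unrelated float without any error, which is one more reason a display tool cannot guess the format on its own.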