Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2008/08/14 05:14:00 UTC

Payloads and tokenizers

I started playing with payloads and have been trying to work out how to get the
data into the payload.

I have a field where I want to add the following untokenized values:

A1
A2
A3

With these values, I would like to add the payloads:

B1
B2
B3

Firstly, it looks like you cannot add payloads to untokenized fields.  Is this 
correct?  In my usage, A and B are simply external Ids so must not be tokenized 
and there is always a 1-->1 relationship between them.

Secondly, what is the way to provide the payload data to the tokenizer?  It
looks like I have to add a List/Map of payload data to a custom Tokenizer and
Analyzer, which is then consumed on each next(Token) call.  However, it would
be nice if, in my use case, I could use some kind of construct like:

Document doc = new Document();
Field f = new Field("myField", "A1", Field.Store.NO, Field.Index.UNTOKENIZED);
f.setPayload("B1");
doc.add(f);

and avoid the whole unnecessary Tokenizer/Analyzer overhead and give support for 
payloads in untokenized fields.

It looks like it would be trivial to implement in DocumentsWriter.invertField().
Or would this corrupt the Fieldable interface in an undesirable way?
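
To illustrate the Tokenizer/Analyzer route I'm trying to avoid: a rough sketch
of the Map-driven approach (untested, and the class and names are mine):

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // Looks up each token's payload bytes in a Map supplied up front,
    // e.g. {"A1" -> bytes of "B1", "A2" -> bytes of "B2", ...}.
    public class MapPayloadFilter extends TokenFilter {
        private final Map payloads;   // token text -> byte[]

        public MapPayloadFilter(TokenStream in, Map payloads) {
            super(in);
            this.payloads = payloads;
        }

        public Token next(Token t) throws IOException {
            t = input.next(t);
            if (t != null) {
                byte[] b = (byte[]) payloads.get(t.termText());
                if (b != null) {
                    t.setPayload(new Payload(b));
                }
            }
            return t;
        }
    }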

Antony






Re: Fields with the same name?? - Was Re: Payloads and tokenizers

Posted by Doron Cohen <cd...@gmail.com>.
On Tue, Aug 19, 2008 at 2:15 AM, Antony Bowesman <ad...@teamware.com> wrote:

>
> Thanks for your time, and I appreciate your valuable insight, Doron.
> Antony
>

I'm glad I could help!
Doron

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

Posted by Antony Bowesman <ad...@teamware.com>.
Doron Cohen wrote:
> The API definitely doesn't promise this.
> AFAIK, implementation-wise it happens to be like this, but I could be wrong,
> and it might change in the future. It would make me nervous to rely on
> this.


I ran some tests and it 'seems' to work, but I agree, it also makes me nervous
to rely on empirical evidence for the design rather than a clearly documented API!


> Anyhow, for your need I can think of two options:
> 
> Option 1: just index the ownerID, do not store it, and do not index or store
> the accessID (unless you wish to search by it; in that case just index it).
> In addition, store a dedicated mapping field that maps from ownerID to
> accessID, e.g. a serialized HashMap or something thinner. At runtime,
> retrieve this map from the document; it has all that information.


Hey, that's an interesting idea!  I'd not considered storing the mapping, only
re-creating it from fields at runtime.  I'll explore this.
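
Something along these lines, I imagine (an untested fragment; the field name
is mine):

    // Indexing: serialize the ownerID -> accessID map into a stored,
    // non-indexed binary field.
    HashMap map = new HashMap();
    map.put(ownerId, accessId);            // one entry per owner/access pair
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(buf);
    out.writeObject(map);
    out.close();
    doc.add(new Field("ownerAccessMap", buf.toByteArray(), Field.Store.YES));

    // Search time: deserialize it back from the hit's Document.
    byte[] data = hitDoc.getField("ownerAccessMap").binaryValue();
    HashMap restored = (HashMap) new ObjectInputStream(
        new ByteArrayInputStream(data)).readObject();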


> Option 2: as you describe above, just index the ownerID with accessID as a
> payload, and then for the matching docid of interest use termPositions to
> get the payload, i.e. something like:
>     TermPositions tp = reader.termPositions();
>     tp.seek(new Term("ownerID",oid));
>     tp.skipTo(docid);
>     tp.nextPosition();
>     if (tp.isPayloadAvailable()) {
>       byte [] accessIDBytes = tp.getPayload(...);
>       ...

Yes, I was playing with this technique yesterday.  It's not easy to determine 
the performance implications of this method.  I will be using caches, but my 
volumes are potentially so large that I may never be able to cache everything 
(perhaps 500M Docs), so this has to be very quick.

I'll play with both approaches and see which works best.

Thanks for your time, and I appreciate your valuable insight, Doron.
Antony





Re: Fields with the same name?? - Was Re: Payloads and tokenizers

Posted by Doron Cohen <cd...@gmail.com>.
>> How about adding this field in two parts, one part for indexing with the
>> payload and the other part for storing, i.e. something like this:
>>
>>    Token token = new Token(...);
>>    token.setPayload(...);
>>    SingleTokenTokenStream ts = new SingleTokenTokenStream(token);
>>
>>    Field f1 = new Field("f","some-stored-content",Store.YES,Index.NO);
>>    Field f2 = new Field("f", ts);
>>
>
> Now that got me thinking and I have exposed a rather large misconception in
> my understanding of the Lucene internals when considering fields of the same
> name.
>
> Your idea above looked like a good one.  However, I realise I am probably
> trying to use payloads wrongly.  I have the following information to store
> for a single Document
>
> contentId - 1 instance
> ownerId 1..n instances
> accessId 1..n instances
>
> One ownerId has a corresponding accessId for the contentId.
>
> My search criteria are ownerId:XXX + user criteria.  When there is a hit, I
> need the contentId and the corresponding accessId (for the owner) back.  So,
> I wanted to store the accessId as a payload to the ownerId.
>
> This is where I came unstuck.  For 'n=3' above, I used the
> SingleTokenTokenStream as you suggested with the accessId as the payload for
> ownerId.  However, at the Document level, I cannot get the payloads from the
> field so, in trying to understand fields with the same name, I discovered
> that there is a big difference between
>
> (a)
> Field f = new Field("ownerId", "OID1", Store.YES, Index.NO_NORMS);
> doc.add(f);
> f = new Field("ownerId", "OID2", Store.YES, Index.NO_NORMS);
> doc.add(f);
> f = new Field("ownerId", "OID3", Store.YES, Index.NO_NORMS);
> doc.add(f);
>
> and (b)
> Field f = new Field("ownerId", "OID1 OID2 OID3", Store.YES,
> Index.NO_NORMS);
> doc.add(f);
>
> as Document.getFields("ownerId").length for (a) will be 3 and for (b) it
> will be 1.
>
> My question then is, if I do
>
> for (int i = 0; i < owners; i++)
> {
>    f = new Field("ownerId", oid[i], Store.YES, Index.NO_NORMS);
>    doc.add(f);
>    f = new Field("accessId", aid[i], Store.YES, Index.NO_NORMS);
>    doc.add(f);
> }
>
> then will the array elements for the corresponding Field arrays returned by
>
> Document.getFields("ownerId")
> Document.getFields("accessId")
>
> **guarantee** that the array element order is the same as the order they
> were added?
>


The API definitely doesn't promise this.
AFAIK, implementation-wise it happens to be like this, but I could be wrong,
and it might change in the future. It would make me nervous to rely on
this.

The difficulty stems from the fact that any specific information about the
actual matching token is digested during scoring and, in effect, never reaches
the hit collector. It is somewhat reminiscent of the situation with
highlighting, where positions might have been considered for scoring, yet for
a certain matching doc of interest that is being displayed with highlighting,
the positions (and offsets) need to be found all over again.

Anyhow, for your need I can think of two options:

Option 1: just index the ownerID, do not store it, and do not index or store
the accessID (unless you wish to search by it; in that case just index it). In
addition, store a dedicated mapping field that maps from ownerID to accessID,
e.g. a serialized HashMap or something thinner. At runtime, retrieve this map
from the document; it has all that information.

Option 2: as you describe above, just index the ownerID with accessID as a
payload, and then for the matching docid of interest use termPositions to get
the payload, i.e. something like:
    TermPositions tp = reader.termPositions();
    tp.seek(new Term("ownerID",oid));
    tp.skipTo(docid);
    tp.nextPosition();
    if (tp.isPayloadAvailable()) {
      byte [] accessIDBytes = tp.getPayload(...);
      ...

Each has its overhead but I think both should work...
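
Spelled out a little more fully, option 2 might look like this (a sketch,
assuming a single ownerID position per doc; getPayload's two-argument form
fills a caller-supplied buffer):

    // After a hit: fetch the payload stored on the ownerID term for docid.
    TermPositions tp = reader.termPositions(new Term("ownerID", oid));
    try {
        // skipTo() lands on the first doc >= docid, so verify the match.
        if (tp.skipTo(docid) && tp.doc() == docid) {
            tp.nextPosition();
            if (tp.isPayloadAvailable()) {
                byte[] accessIDBytes =
                    tp.getPayload(new byte[tp.getPayloadLength()], 0);
                // decode accessIDBytes back into the accessID here
            }
        }
    } finally {
        tp.close();
    }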

Doron

Fields with the same name?? - Was Re: Payloads and tokenizers

Posted by Antony Bowesman <ad...@teamware.com>.
> I assume you already know this but just to make sure what I meant was clear
> - no tokenization but still indexing just means that the entire field's text
> becomes a single unchanged token. I believe this is exactly what
> SingleTokenTokenStream can buy you: a single token for which you can pre-set
> a payload.

Yes, I was with you :)


> It is. Field maintains its value, and it is either string/stream/etc. Once
> you set it to a TokenStream the string value is lost and there's no way to
> store it.

Thanks for that - I delved a little further into FieldsWriter and see what you 
mean.


> How about adding this field in two parts, one part for indexing with the
> payload and the other part for storing, i.e. something like this:
> 
>     Token token = new Token(...);
>     token.setPayload(...);
>     SingleTokenTokenStream ts = new SingleTokenTokenStream(token);
> 
>     Field f1 = new Field("f","some-stored-content",Store.YES,Index.NO);
>     Field f2 = new Field("f", ts);

Now that got me thinking and I have exposed a rather large misconception in my
understanding of the Lucene internals when considering fields of the same name.

Your idea above looked like a good one.  However, I realise I am probably trying 
to use payloads wrongly.  I have the following information to store for a single 
Document

contentId - 1 instance
ownerId 1..n instances
accessId 1..n instances

One ownerId has a corresponding accessId for the contentId.

My search criteria are ownerId:XXX + user criteria.  When there is a hit, I need 
the contentId and the corresponding accessId (for the owner) back.  So, I wanted 
to store the accessId as a payload to the ownerId.

This is where I came unstuck.  For 'n=3' above, I used the 
SingleTokenTokenStream as you suggested with the accessId as the payload for 
ownerId.  However, at the Document level, I cannot get the payloads from the 
field so, in trying to understand fields with the same name, I discovered that 
there is a big difference between

(a)
Field f = new Field("ownerId", "OID1", Store.YES, Index.NO_NORMS);
doc.add(f);
f = new Field("ownerId", "OID2", Store.YES, Index.NO_NORMS);
doc.add(f);
f = new Field("ownerId", "OID3", Store.YES, Index.NO_NORMS);
doc.add(f);

and (b)
Field f = new Field("ownerId", "OID1 OID2 OID3", Store.YES, Index.NO_NORMS);
doc.add(f);

as Document.getFields("ownerId").length for (a) will be 3 and for (b) it will be 1.

My question then is, if I do

for (int i = 0; i < owners; i++)
{
     f = new Field("ownerId", oid[i], Store.YES, Index.NO_NORMS);
     doc.add(f);
     f = new Field("accessId", aid[i], Store.YES, Index.NO_NORMS);
     doc.add(f);
}

then will the array elements for the corresponding Field arrays returned by

Document.getFields("ownerId")
Document.getFields("accessId")

**guarantee** that the array element order is the same as the order they were added?

Antony





Re: Payloads and tokenizers

Posted by Doron Cohen <cd...@gmail.com>.
>
> Implementing payloads via Tokens explicitly prevents the use of payloads
> for untokenized fields, as they only support field.stringValue().  There
> seems to be no way to override this.


I assume you already know this but just to make sure what I meant was clear
- no tokenization but still indexing just means that the entire field's text
becomes a single unchanged token. I believe this is exactly what
SingleTokenTokenStream can buy you: a single token for which you can pre-set
a payload.

> My field is currently stored, so the tokenStream approach you suggested
> (LUCENE-580) will not work as it's theoretically only for non-stored fields.
>


This is new input :) - the original code snippet said - new Field("myField",
"A1", Field.Store.NO, Field.Index.UNTOKENIZED) - so I thought the token
stream approach would work.


> In practice, I expect I can create a stored/indexed Field with a dummy
> string value, then use setValue(TokenStream).  At least I can have stored
> fields with Payloads using the analyzer/tokenStream route.  Is this illegal?


It is. Field maintains its value, and it is either string/stream/etc. Once
you set it to a TokenStream the string value is lost and there's no way to
store it.

> What if the Fieldable had a tokenValue(), in addition to the existing
> stream/string/binary/reader values, which could be used for untokenized
> fields and used in invertField()?


With this too, at least in the current design, the stored string is gone once
the value is set to the suggested token.

> I'd rather stick with core Lucene than start making proprietary changes,
> but it seems I can't quite get to where I want to be without some quite
> kludgy code for a very simple use case :(
>

There is LUCENE-1231, which will allow payloads per field, but I didn't follow
it closely enough to tell whether it would solve your need to both store the
value and have a payload. It is interesting that you need the two together.

How about adding this field in two parts, one part for indexing with the
payload and the other part for storing, i.e. something like this:

    Token token = new Token(...);
    token.setPayload(...);
    SingleTokenTokenStream ts = new SingleTokenTokenStream(token);

    Field f1 = new Field("f","some-stored-content",Store.YES,Index.NO);
    Field f2 = new Field("f", ts);

    doc.add(f1);
    doc.add(f2);
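
At search time the stored half then comes back as usual; roughly (a sketch,
assuming an open searcher and a matching docid):

    // f1's stored value is what is returned; f2 contributed only the
    // indexed token and its payload.
    Document hitDoc = searcher.doc(docid);
    String stored = hitDoc.get("f");   // "some-stored-content"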

Doron

Re: Payloads and tokenizers

Posted by Antony Bowesman <ad...@teamware.com>.
Thanks for your comments, Doron.  I found the earlier discussions on the dev
list (21/12/06), where this issue is discussed - my use case is similar to
Nadav Har'El's.

Implementing payloads via Tokens explicitly prevents the use of payloads for
untokenized fields, as they only support field.stringValue().  There seems to
be no way to override this.

My field is currently stored, so the tokenStream approach you suggested
(LUCENE-580) will not work as it's theoretically only for non-stored fields.  In
practice, I expect I can create a stored/indexed Field with a dummy string 
value, then use setValue(TokenStream).  At least I can have stored fields with 
Payloads using the analyzer/tokenStream route.  Is this illegal?

What if the Fieldable had a tokenValue(), in addition to the existing 
stream/string/binary/reader values, which could be used for untokenized fields 
and used in invertField()?

I'd rather stick with core Lucene than start making proprietary changes, but it
seems I can't quite get to where I want to be without some quite kludgy code for
a very simple use case :(

Antony



Doron Cohen wrote:
> IIRC, the first versions of the patches that added payload support had this
> notion of a payload per field rather than per token, but it was later
> modified to be per token only.
>
> I have seen two code patterns to add payloads to tokens.
>
> The first one created the field text with a reserved separator/delimiter,
> which was later identified by the analyzer; the analyzer separated the
> payload part from the token part, created the token, and set the payload.
>
> The other pattern was to create a field with a TokenStream. This can be done
> only for non-stored fields. Here you create the token in advance, and you
> have SingleTokenTokenStream to wrap it in case it is a single token. Since
> the token is created in advance, there's no analysis going on, and you can
> set the payload of that token on the spot. I prefer this pattern - more
> efficient and elegant.
> 
> Doron




Re: Payloads and tokenizers

Posted by Doron Cohen <cd...@gmail.com>.
IIRC, the first versions of the patches that added payload support had this
notion of a payload per field rather than per token, but it was later modified
to be per token only.

I have seen two code patterns to add payloads to tokens.

The first one created the field text with a reserved separator/delimiter,
which was later identified by the analyzer; the analyzer separated the payload
part from the token part, created the token, and set the payload.
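
A sketch of what that analyzer's filter might look like (the '|' delimiter and
all the names are mine, not an existing Lucene class):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // Splits each token of the form "term|payload" at the reserved '|'
    // delimiter, keeps the term part, and attaches the payload part.
    public class DelimitedPayloadFilter extends TokenFilter {
        public DelimitedPayloadFilter(TokenStream in) {
            super(in);
        }

        public Token next(Token t) throws IOException {
            t = input.next(t);
            if (t == null) return null;
            String text = t.termText();
            int sep = text.indexOf('|');
            if (sep >= 0) {
                t.setPayload(new Payload(text.substring(sep + 1).getBytes("UTF-8")));
                t.setTermText(text.substring(0, sep));
            }
            return t;
        }
    }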

The other pattern was to create a field with a TokenStream. This can be done
only for non-stored fields. Here you create the token in advance, and you have
SingleTokenTokenStream to wrap it in case it is a single token. Since the
token is created in advance, there's no analysis going on, and you can set the
payload of that token on the spot. I prefer this pattern - more efficient and
elegant.

Doron

On Thu, Aug 14, 2008 at 6:14 AM, Antony Bowesman <ad...@teamware.com> wrote:

> I started playing with payloads and have been trying to work out how to get
> the data into the payload.
>
> I have a field where I want to add the following untokenized values:
>
> A1
> A2
> A3
>
> With these values, I would like to add the payloads:
>
> B1
> B2
> B3
>
> Firstly, it looks like you cannot add payloads to untokenized fields.  Is
> this correct?  In my usage, A and B are simply external Ids so must not be
> tokenized and there is always a 1-->1 relationship between them.
>
> Secondly, what is the way to provide the payload data to the tokenizer?  It
> looks like I have to add a List/Map of payload data to a custom Tokenizer
> and Analyzer, which is then consumed on each next(Token) call.  However, it
> would be nice if, in my use case, I could use some kind of construct like:
>
> Document doc = new Document();
> Field f = new Field("myField", "A1", Field.Store.NO,
> Field.Index.UNTOKENIZED);
> f.setPayload("B1");
> doc.add(f);
>
> and avoid the whole unnecessary Tokenizer/Analyzer overhead and give
> support for payloads in untokenized fields.
>
> It looks like it would be trivial to implement in
> DocumentsWriter.invertField().  Or would this corrupt the Fieldable
> interface in an undesirable way?
>
> Antony