Posted to java-user@lucene.apache.org by Carsten Schnober <sc...@ids-mannheim.de> on 2013/04/23 13:03:49 UTC

Reading Payloads

Hi,
I'm trying to extract payloads from an index for specific tokens the
following way (inserting sample document number and term):

Terms terms = reader.getTermVector(16504, "term");
TokenStream tokenstream = TokenSources.getTokenStream(terms);
while (tokenstream.incrementToken()) {
  OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
  int start = offset.startOffset();
  int end = offset.endOffset();
  String token = tokenstream.getAttribute(CharTermAttribute.class).toString();

  PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
  BytesRef payloadBytes = payloadAttr.getPayload();

  ...
}

This works fine for the OffsetAttribute and the CharTermAttribute, but
payloadAttr.getPayload() always returns null for all documents and all
tokens, unfortunately. However, I know that the payloads are stored in
the index as I can retrieve them through a SpanQuery with
Spans.getPayload(). I actually expect every token to carry a payload, as
my custom tokenizer implementation contains the following lines:

public class KoraTokenizer extends Tokenizer {
  ...
  private PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
  ...
  public boolean incrementToken() {
    ...
    payloadAttr.setPayload(new BytesRef(payloadString));
    ...
  }
  ...
}

I've asserted that the payloadString variable is never an empty String
and, as I said above, I can retrieve the payloads with
Spans.getPayload(). So what am I doing wrong in my
tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I had used
tokenstream.getAttribute(), as for the other attributes, but that
threw an IllegalArgumentException, so I followed the recommendation
given in the documentation and replaced it with addAttribute().

Thanks!
Carsten




-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform



Re: Reading Payloads

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 23.04.2013 16:17, Alan Woodward wrote:

> It doesn't sound as though an inverted index is really what you want to be querying here, if I'm reading you right.  You want to get the payloads for spans at a specific position, but you don't particularly care about the actual term at that position?  You might find that BinaryDocValues are a better fit here, but it's difficult to tell without knowing what your actual use case is.

Hi Alan,
you are right that this specific aspect is not really suitable for an
inverted index. I've still been hoping that I could misuse it for some
cases. Let me sketch my use case:
A user performs a query that is parsed and executed in the form of a
SpanQuery. The offsets of the match(es) are extracted and returned. From
that point on, the user uses these offsets to retrieve certain segments
of a document from an external database.
However, I also store additional information (linguistic annotations) in
the token payloads because they are also used for more complex queries
that filter matches depending on these payloads. As they are stored in
the index anyway, I thought I could as well extract them upon request. I
am aware that such a request wouldn't perform very well, but apart from
that, I think it would be very handy if I were able to extract the
payloads for a given span.
However, I can't find a way to do that other than via
TokenSources.getTokenStream, and that apparently doesn't work.
I'm now thinking about keeping the resulting Spans in memory so that I
could extract the payloads upon user request. However, that still
wouldn't allow me to extract the payloads of any other token, which would
be a typical use case, e.g. when a user wants to retrieve annotations for
adjacent tokens.
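To make the idea concrete, what I have in mind is roughly the following
(an untested sketch against the Lucene 4.x Spans API; the class and method
names are placeholders, and building the SpanQuery is omitted):

import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class PayloadCache {
  /** Collects payloads per (top-level docId, span start position) for later lookup. */
  public static Map<Integer, Map<Integer, Collection<byte[]>>> collect(
      IndexReader reader, SpanQuery query) throws IOException {
    Map<Integer, Map<Integer, Collection<byte[]>>> cache =
        new HashMap<Integer, Map<Integer, Collection<byte[]>>>();
    // terms are looked up per segment when missing from this map
    Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
    for (AtomicReaderContext leaf : reader.leaves()) {
      Spans spans = query.getSpans(leaf, leaf.reader().getLiveDocs(), termContexts);
      while (spans.next()) {
        if (!spans.isPayloadAvailable()) {
          continue;
        }
        int doc = leaf.docBase + spans.doc(); // map back to a top-level document id
        Map<Integer, Collection<byte[]>> perDoc = cache.get(doc);
        if (perDoc == null) {
          perDoc = new HashMap<Integer, Collection<byte[]>>();
          cache.put(doc, perDoc);
        }
        perDoc.put(spans.start(), spans.getPayload());
      }
    }
    return cache;
  }
}

A later lookup by document id and start position could then serve the
payloads without touching the index again.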
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform



Re: Reading Payloads

Posted by Alan Woodward <al...@flax.co.uk>.
Hi Carsten,

It doesn't sound as though an inverted index is really what you want to be querying here, if I'm reading you right.  You want to get the payloads for spans at a specific position, but you don't particularly care about the actual term at that position?  You might find that BinaryDocValues are a better fit here, but it's difficult to tell without knowing what your actual use case is.
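For illustration, a rough (untested) sketch of that route against the 4.2+
doc values API; the "annotations" field name and the helper class are made
up, and how the per-document annotations get serialized into the blob is up
to you:

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;

public class AnnotationLookup {
  /** Fetches the per-document annotation blob for a top-level document id. */
  public static BytesRef getAnnotations(IndexReader reader, int docId) throws IOException {
    for (AtomicReaderContext leaf : reader.leaves()) {
      int localId = docId - leaf.docBase;
      if (localId < 0 || localId >= leaf.reader().maxDoc()) {
        continue; // docId belongs to another segment
      }
      BinaryDocValues values = leaf.reader().getBinaryDocValues("annotations");
      if (values == null) {
        return null; // no doc values for this field in this segment
      }
      BytesRef result = new BytesRef();
      values.get(localId, result);
      return result;
    }
    return null;
  }
}

At index time the blob would be added once per document, e.g. with
new BinaryDocValuesField("annotations", new BytesRef(annotationBytes)), so it
is cheap to fetch by document id; the per-token structure (e.g. keyed by
offset) has to be encoded into the bytes yourself.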

Alan Woodward
www.flax.co.uk


On 23 Apr 2013, at 15:06, Carsten Schnober wrote:

> On 23.04.2013 15:27, Alan Woodward wrote:
>> There's the SpanPositionCheckQuery family - SpanRangeQuery, SpanFirstQuery, etc.  Is that the sort of thing you're looking for?
> 
> Hi Alan,
> thanks for the pointer, this is the right direction indeed. However,
> these queries are based on a SpanQuery which depends on a specific
> expression to search for. In my use case, I need to retrieve Spans
> specified by their offsets only, and then get their payloads and process
> them further. Alternatively, I could query for the occurrence of certain
> string patterns in the payloads and check the offsets subsequently, but
> either way I'm no longer interested in the actual term at that point.
> I don't see a way to do this with these query types, or is there?
> Carsten
> 
> 
> -- 
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
> 
> 


Re: Reading Payloads

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 23.04.2013 15:27, Alan Woodward wrote:
> There's the SpanPositionCheckQuery family - SpanRangeQuery, SpanFirstQuery, etc.  Is that the sort of thing you're looking for?

Hi Alan,
thanks for the pointer, this is the right direction indeed. However,
these queries are based on a SpanQuery which depends on a specific
expression to search for. In my use case, I need to retrieve Spans
specified by their offsets only, and then get their payloads and process
them further. Alternatively, I could query for the occurrence of certain
string patterns in the payloads and check the offsets subsequently, but
either way I'm no longer interested in the actual term at that point.
I don't see a way to do this with these query types, or is there?
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform



Re: Reading Payloads

Posted by Alan Woodward <al...@flax.co.uk>.
There's the SpanPositionCheckQuery family - SpanRangeQuery, SpanFirstQuery, etc.  Is that the sort of thing you're looking for?
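For example (a fragment only, against the 4.x API; SpanPositionRangeQuery is
the position-restricting member of that family, and the field/term here are
placeholders):

SpanQuery sample = new SpanTermQuery(new Term("term", "sample"));
// only matches occurrences at token positions 3..9
SpanQuery ranged = new SpanPositionRangeQuery(sample, 3, 10);
// only matches occurrences within the first 5 token positions
SpanQuery first = new SpanFirstQuery(sample, 5);

Both run like any other SpanQuery, and their Spans still expose getPayload().
Note that they restrict by token position rather than character offset, so
they only help if you can map your offsets to positions.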

Alan Woodward
www.flax.co.uk


On 23 Apr 2013, at 13:36, Carsten Schnober wrote:

> On 23.04.2013 13:47, Carsten Schnober wrote:
>> I'm trying to figure out a way to use a query as Uwe suggested. My
>> scenario is to perform a query and then retrieve some of the payloads
> upon user request, so there's no obvious way to wrap this into a query as
>> I can't know what (terms) to query for.
> 
> I wonder: is there a way to perform a (Span)Query restricting the search
> to tokens within certain offsets in a document, e.g. by a Filter?
> Thanks!
> Carsten
> 
> -- 
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
> 
> 


Re: Reading Payloads

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 23.04.2013 13:47, Carsten Schnober wrote:
> I'm trying to figure out a way to use a query as Uwe suggested. My
> scenario is to perform a query and then retrieve some of the payloads
> upon user request, so there's no obvious way to wrap this into a query as
> I can't know what (terms) to query for.

I wonder: is there a way to perform a (Span)Query restricting the search
to tokens within certain offsets in a document, e.g. by a Filter?
Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform



Re: Reading Payloads

Posted by Carsten Schnober <sc...@ids-mannheim.de>.
On 23.04.2013 13:21, Michael McCandless wrote:
> Actually, term vectors can store payloads now (LUCENE-1888), so if that
> field was indexed with FieldType.setStoreTermVectorPayloads they should be
> there.
> 
> But I suspect the TokenSources.getTokenStream API (which I think un-inverts
> the term vectors to recreate the token stream = very slow?) wasn't fixed to
> also carry the payloads through?

I use the following FieldType:

private final static FieldType textFieldWithTermVector = new FieldType(TextField.TYPE_STORED);
static {
  textFieldWithTermVector.setStoreTermVectors(true);
  textFieldWithTermVector.setStoreTermVectorPositions(true);
  textFieldWithTermVector.setStoreTermVectorOffsets(true);
  textFieldWithTermVector.setStoreTermVectorPayloads(true);
}

So I suppose your assumption is right that the
TokenSources.getTokenStream API is not ready to make use of this.

I'm trying to figure out a way to use a query as Uwe suggested. My
scenario is to perform a query and then retrieve some of the payloads
upon user request, so there's no obvious way to wrap this into a query as
I can't know what (terms) to query for.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform



Re: Reading Payloads

Posted by Michael McCandless <lu...@mikemccandless.com>.
Actually, term vectors can store payloads now (LUCENE-1888), so if that
field was indexed with FieldType.setStoreTermVectorPayloads they should be
there.

But I suspect the TokenSources.getTokenStream API (which I think un-inverts
the term vectors to recreate the token stream = very slow?) wasn't fixed to
also carry the payloads through?
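
If they did come through, reading them straight off the term vector's
postings (rather than re-creating a TokenStream) would look roughly like
this -- an untested sketch against the 4.x API, reusing the document and
field from the original example:

Terms vector = reader.getTermVector(16504, "term"); // null if no term vector was stored
TermsEnum termsEnum = vector.iterator(null);
DocsAndPositionsEnum postings = null;
BytesRef term;
while ((term = termsEnum.next()) != null) {
  postings = termsEnum.docsAndPositions(null, postings);
  if (postings == null) continue; // positions not stored for this term
  postings.nextDoc();             // a term vector holds exactly one document
  for (int i = 0; i < postings.freq(); i++) {
    int position = postings.nextPosition();
    int startOffset = postings.startOffset(); // -1 if offsets were not stored
    int endOffset = postings.endOffset();
    BytesRef payload = postings.getPayload(); // may still be null
    // term, position, offsets, payload
  }
}

That would at least tell you whether the payloads made it into the stored
term vector at all.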

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 23, 2013 at 7:10 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> TermVectors are per-document and do not contain payloads. You are reading
> the per-document TermVectors which is a "small index" *stored* for each
> document as a binary blob. This blob only contains the terms of this
> document with its positions/offsets, but no payloads (offsets are used e.g.
> for highlighting).
>
> To retrieve payloads, you have to use the main TermsEnum and main posting
> lists, but this does *not* work per document. In general you would execute
> a query and then retrieve the payload for each hit while iterating the
> scorer (e.g. function queries can do this).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de

RE: Reading Payloads

Posted by Uwe Schindler <uw...@thetaphi.de>.
TermVectors are per-document and do not contain payloads. You are reading the per-document TermVectors, which are a "small index" *stored* for each document as a binary blob. This blob only contains the terms of this document with their positions/offsets, but no payloads (offsets are used e.g. for highlighting).

To retrieve payloads, you have to use the main TermsEnum and main posting lists, but this does *not* work per document. In general you would execute a query and then retrieve the payload for each hit while iterating the scorer (e.g. function queries can do this).
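
For a single known term and document, that could look roughly like the
following -- an untested sketch against the 4.x API; the field, term and
helper class are placeholders, and note that this is per term, not per
document:

import java.io.IOException;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;

public class PayloadReader {
  /** Reads the payloads of one term's positions in one document from the main postings. */
  public static void read(IndexReader reader, int docId) throws IOException {
    DocsAndPositionsEnum postings = MultiFields.getTermPositionsEnum(
        reader, MultiFields.getLiveDocs(reader), "term", new BytesRef("sample"));
    if (postings == null) {
      return; // term not present, or positions not indexed
    }
    if (postings.advance(docId) != docId) {
      return; // the term does not occur in this document
    }
    for (int i = 0; i < postings.freq(); i++) {
      postings.nextPosition();
      BytesRef payload = postings.getPayload(); // null if no payload at this position
      // decode payload ...
    }
  }
}

MultiFields gives the slow top-level view here; inside a query you would do
the same per segment while iterating the scorer.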

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



