Posted to java-user@lucene.apache.org by Christopher Condit <co...@sdsc.edu> on 2010/02/26 21:41:51 UTC

recovering payload from fields

I'm trying to store semantic information in payloads at index time. I believe this part is successful - but I'm having trouble getting access to the payload locations after the index is created. I'd like to know the offset in the original text for the token with the payload - and get this information for all payloads that are set in a Field even if they don't relate to the query. I tried (from the highlighting filter):
TokenStream tokens = TokenSources.getTokenStream(reader, 0, "body");
while (tokens.incrementToken()) {
  TermAttribute term = tokens.getAttribute(TermAttribute.class);
  if (tokens.hasAttribute(PayloadAttribute.class)) {
    PayloadAttribute payload = tokens.getAttribute(PayloadAttribute.class);
    OffsetAttribute offset = tokens.getAttribute(OffsetAttribute.class);
  }
}
But the OffsetAttribute never seems to contain any information.
In my token filter do I need to do more than:
offsetAtt = addAttribute(OffsetAttribute.class);
during construction in order to store Offset information?

Thanks,
-Chris
	



Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

Posted by Michael McCandless <lu...@mikemccandless.com>.
If you look at the javadocs for IndexWriter it explains how to do it.
You just provide a class that implements the warm method, and inside
that method you do whatever app-specific things you need to do to the
provided IndexReader to warm it.

Note that the SearcherManager class from LIA2 handles setting up the
MergedSegmentWarmer, if you implement the warm method.
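
For example, a minimal sketch of installing such a warmer (this assumes the
Lucene 2.9/3.x API; "writer" is an existing IndexWriter, and the field name
"1" is only taken from the code quoted later in this thread):

// Hypothetical warmer: pre-loads the custom FieldCache entry for each newly
// merged segment before it is handed to searchers.
IndexWriter.IndexReaderWarmer warmer = new IndexWriter.IndexReaderWarmer() {
  public void warm(IndexReader segmentReader) throws IOException {
    // Whatever app-specific warming you need; here, populate the field
    // cache for the sort field so the first query does not pay the cost.
    FieldCache.DEFAULT.getLongs(segmentReader, "1");
  }
};
writer.setMergedSegmentWarmer(warmer);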

Mike



Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

Posted by luocanrao <lu...@sohu.com>.
PS: In our environment a document has more than ten fields, and in a short
time there may be many updates.
For installing a mergedSegmentWarmer on the writer, can you give me a small example?
Thanks very much!

-----Original Message-----
From: luocanrao [mailto:luocan19826164@sohu.com]
Sent: February 27, 2010 19:09
To: java-user@lucene.apache.org
Subject: Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

I set the merge factor to 4, and every five minutes I reopen the reader.
Yes, most of the time it is very fast, but sometimes it is very slow.
For example, when the program starts, the first query consumes 10s!
And when a newly created segment is generated, the query consumes more than 1s.
Performance is a key point for us.
Sorry, my English is not good!
I hope I can preload the custom FieldCache in another thread, not the query thread,
so I will not have a performance issue.
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: February 27, 2010 18:37
To: java-user@lucene.apache.org
Subject: Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

How are you opening a new reader?

If it's a near-real-time reader (IndexWriter.getReader), or you use
IndexReader.reopen, it should only be the newly created segments that
have to generate the field cache entry, which most of the time should
be fast.

If you are already using those APIs and it's still not fast enough,
then you should just warm the reader before using it (additionally,
for a near-real-time reader you should warm newly merged segments by
installing a mergedSegmentWarmer on the writer).

Mike

On Sat, Feb 27, 2010 at 3:35 AM, luocanrao <lu...@sohu.com> wrote:
> custom FieldCache costs too much time.
> So every time a new reader is first reopened, it interferes with search
> performance.
> I hope someone can tell me how I can preload the custom FieldCache when a
> new segment exists!
> Thanks again!
>
> here is the source, in FieldComparator.setNextReader:
> ((C2CFieldManager) fieldManager).lCommID =
>     FieldCache.DEFAULT.getLongs(reader, "1", new LongParser() {
>         public long parseLong(String documentIDStr) {
>             documentIDStr = documentIDStr.substring(16);
>             long documentID = Long.parseLong(documentIDStr, 16);
>             return documentID;
>         }
>     });
>
>



Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

Posted by luocanrao <lu...@sohu.com>.
I set the merge factor to 4, and every five minutes I reopen the reader.
Yes, most of the time it is very fast, but sometimes it is very slow.
For example, when the program starts, the first query consumes 10s!
And when a newly created segment is generated, the query consumes more than 1s.
Performance is a key point for us.
Sorry, my English is not good!
I hope I can preload the custom FieldCache in another thread, not the query thread,
so I will not have a performance issue.
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: February 27, 2010 18:37
To: java-user@lucene.apache.org
Subject: Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

How are you opening a new reader?

If it's a near-real-time reader (IndexWriter.getReader), or you use
IndexReader.reopen, it should only be the newly created segments that
have to generate the field cache entry, which most of the time should
be fast.

If you are already using those APIs and it's still not fast enough,
then you should just warm the reader before using it (additionally,
for a near-real-time reader you should warm newly merged segments by
installing a mergedSegmentWarmer on the writer).

Mike

On Sat, Feb 27, 2010 at 3:35 AM, luocanrao <lu...@sohu.com> wrote:
> custom FieldCache costs too much time.
> So every time a new reader is first reopened, it interferes with search
> performance.
> I hope someone can tell me how I can preload the custom FieldCache when a
> new segment exists!
> Thanks again!
>
> here is the source, in FieldComparator.setNextReader:
> ((C2CFieldManager) fieldManager).lCommID =
>     FieldCache.DEFAULT.getLongs(reader, "1", new LongParser() {
>         public long parseLong(String documentIDStr) {
>             documentIDStr = documentIDStr.substring(16);
>             long documentID = Long.parseLong(documentIDStr, 16);
>             return documentID;
>         }
>     });
>
>



Re: custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

Posted by Michael McCandless <lu...@mikemccandless.com>.
How are you opening a new reader?

If it's a near-real-time reader (IndexWriter.getReader), or you use
IndexReader.reopen, it should only be the newly created segments that
have to generate the field cache entry, which most of the time should
be fast.

If you are already using those APIs and it's still not fast enough,
then you should just warm the reader before using it (additionally,
for a near-real-time reader you should warm newly merged segments by
installing a mergedSegmentWarmer on the writer).
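
A minimal sketch of warming in a background thread before publishing the new
reader (assuming Lucene 2.9/3.x APIs; oldReader, myLongParser, and field "1"
stand in for the code quoted below, and the swap/close logic is left out):

// Reopen; only newly created/merged segments are actually re-read.
IndexReader newReader = oldReader.reopen();
if (newReader != oldReader) {
  // Touch the custom FieldCache entry per segment so the first query
  // after the swap does not have to build it.
  for (IndexReader segment : newReader.getSequentialSubReaders()) {
    FieldCache.DEFAULT.getLongs(segment, "1", myLongParser);
  }
  // ... then atomically publish newReader to the query threads
  // and close oldReader once no search is using it.
}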

Mike

On Sat, Feb 27, 2010 at 3:35 AM, luocanrao <lu...@sohu.com> wrote:
> custom FieldCache costs too much time.
> So every time a new reader is first reopened, it interferes with search
> performance.
> I hope someone can tell me how I can preload the custom FieldCache when a
> new segment exists!
> Thanks again!
>
> here is the source, in FieldComparator.setNextReader:
> ((C2CFieldManager) fieldManager).lCommID =
>     FieldCache.DEFAULT.getLongs(reader, "1", new LongParser() {
>         public long parseLong(String documentIDStr) {
>             documentIDStr = documentIDStr.substring(16);
>             long documentID = Long.parseLong(documentIDStr, 16);
>             return documentID;
>         }
>     });
>
>



custom FieldCache costs too much time. How can I preload the custom FieldCache when a new segment exists?

Posted by luocanrao <lu...@sohu.com>.
custom FieldCache costs too much time.
So every time a new reader is first reopened, it interferes with search
performance.
I hope someone can tell me how I can preload the custom FieldCache when a
new segment exists!
Thanks again!

Here is the source, in FieldComparator.setNextReader:

((C2CFieldManager) fieldManager).lCommID =
    FieldCache.DEFAULT.getLongs(reader, "1", new LongParser() {
        public long parseLong(String documentIDStr) {
            documentIDStr = documentIDStr.substring(16);
            long documentID = Long.parseLong(documentIDStr, 16);
            return documentID;
        }
    });




Re: recovering payload from fields

Posted by Christopher Tignor <ct...@thinkmap.com>.
What I'd ideally like to do is take a SpanQuery, loop over the PayloadSpans
returned from SpanQuery.getPayloadSpans(), and store all PayloadSpans for a
given document in a Map by their doc id.

Then later, after deciding in memory which documents I need, load the payload
data for just those PayloadSpans pulled out of my Map.

But it seems I can't do this, as loading payload data is only done through
the PayloadSpans iterator, so I must iterate through the entire collection to
get to my PayloadSpan.  Is there not a way to just save a PayloadSpan and
load its payload data later as needed?
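
One workaround, sketched under the assumption of the Lucene 2.9-era Spans
payload methods (isPayloadAvailable/getPayload) rather than a saved-PayloadSpan
API: copy the payload bytes out per doc id during the single pass, then consult
the map later. spanQuery and reader are assumed to already exist.

// Eagerly copy payload bytes per document while iterating once.
Map<Integer, List<byte[]>> payloadsByDoc = new HashMap<Integer, List<byte[]>>();
Spans spans = spanQuery.getSpans(reader);
while (spans.next()) {
  if (spans.isPayloadAvailable()) {
    List<byte[]> forDoc = payloadsByDoc.get(spans.doc());
    if (forDoc == null) {
      forDoc = new ArrayList<byte[]>();
      payloadsByDoc.put(spans.doc(), forDoc);
    }
    // getPayload() returns the payloads of the current match as byte[]s.
    forDoc.addAll(spans.getPayload());
  }
}
// Later: payloadsByDoc.get(docId) for only the documents kept in memory.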

thanks,

C>T>

On Sat, Feb 27, 2010 at 5:42 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> You can also access payloads through the TermPositions enum, but, this
> is by term and then by doc.
>
> It sounds like you need to iterate through all terms sequentially in a
> given field in the doc, accessing offset & payload?  In which case
> reanalyzing at search time may be the best way to go.
>
> You can store term vectors in the index, which will store offsets (if
> you ask it to), but, payloads are not currently stored with term
> vectors.
>
> Mike
>
> On Fri, Feb 26, 2010 at 7:42 PM, Christopher Condit <co...@sdsc.edu>
> wrote:
> >> Payload data is accessed through PayloadSpans, so using SpanQueries is the
> >> entry point, it seems.  There are tools like PayloadSpanUtil that convert
> >> other queries into SpanQueries for this purpose if needed, but the bottom
> >> line is that the API for payloads goes through Spans.
> >
> > So there's no way to iterate through all the payloads for a given field?
> > I can't use the SpanQuery mechanism because in this case the entire field
> > will be displayed - and I can't search for "*". Is there some trick I'm not
> > thinking of?
> >
> >> this is the tip of the iceberg; a big dangerous iceberg...
> >
> > Yes - I'm beginning to see that...
> >
> > Thanks,
> > -Chris
> >


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999

Re: recovering payload from fields

Posted by Grant Ingersoll <gs...@apache.org>.
It's not implemented, but http://issues.apache.org/jira/browse/LUCENE-1888 is how I would solve it.  It probably isn't that hard to implement, actually.  A patch would be great.  Happy to review one.


On Feb 27, 2010, at 5:29 PM, Christopher Condit wrote:

>> It sounds like you need to iterate through all terms sequentially in a given
>> field in the doc, accessing offset & payload?  In which case reanalyzing at
>> search time may be the best way to go.
> 
> If it matters it doesn't need to be sequential. I just need access to all the payloads for a given doc in the index. If reanalyzing is the best option I suppose I'll do that. Or perhaps build some auxiliary structure to cache the information.
> 
> Thanks for the clarification,
> -Chris
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search




RE: recovering payload from fields

Posted by Christopher Condit <co...@sdsc.edu>.
> It sounds like you need to iterate through all terms sequentially in a given
> field in the doc, accessing offset & payload?  In which case reanalyzing at
> search time may be the best way to go.

If it matters it doesn't need to be sequential. I just need access to all the payloads for a given doc in the index. If reanalyzing is the best option I suppose I'll do that. Or perhaps build some auxiliary structure to cache the information.

Thanks for the clarification,
-Chris



Re: recovering payload from fields

Posted by Michael McCandless <lu...@mikemccandless.com>.
You can also access payloads through the TermPositions enum, but, this
is by term and then by doc.

It sounds like you need to iterate through all terms sequentially in a
given field in the doc, accessing offset & payload?  In which case
reanalyzing at search time may be the best way to go.

You can store term vectors in the index, which will store offsets (if
you ask it to), but, payloads are not currently stored with term
vectors.
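
For reference, a sketch of both routes (assuming the Lucene 2.9/3.x API; the
field "body", the term "someterm", and docId are placeholders):

// 1) Payloads via TermPositions: by term, then by doc.
TermPositions tp = reader.termPositions(new Term("body", "someterm"));
while (tp.next()) {
  int doc = tp.doc();
  for (int i = 0; i < tp.freq(); i++) {
    int position = tp.nextPosition();
    if (tp.isPayloadAvailable()) {
      byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
      // use doc, position, payload ...
    }
  }
}
tp.close();

// 2) Offsets (but not payloads) via term vectors, if the field was indexed
// with Field.TermVector.WITH_POSITIONS_OFFSETS.
TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, "body");
TermVectorOffsetInfo[] offsets = tpv.getOffsets(tpv.indexOf("someterm"));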

Mike

On Fri, Feb 26, 2010 at 7:42 PM, Christopher Condit <co...@sdsc.edu> wrote:
>> Payload data is accessed through PayloadSpans, so using SpanQueries is the
>> entry point, it seems.  There are tools like PayloadSpanUtil that convert other
>> queries into SpanQueries for this purpose if needed, but the bottom line is
>> that the API for payloads goes through Spans.
>
> So there's no way to iterate through all the payloads for a given field? I can't use the SpanQuery mechanism because in this case the entire field will be displayed - and I can't search for "*". Is there some trick I'm not thinking of?
>
>> this is the tip of the iceberg; a big dangerous iceberg...
>
> Yes - I'm beginning to see that...
>
> Thanks,
> -Chris
>



RE: recovering payload from fields

Posted by Christopher Condit <co...@sdsc.edu>.
> Payload data is accessed through PayloadSpans, so using SpanQueries is the
> entry point, it seems.  There are tools like PayloadSpanUtil that convert other
> queries into SpanQueries for this purpose if needed, but the bottom line is
> that the API for payloads goes through Spans.

So there's no way to iterate through all the payloads for a given field? I can't use the SpanQuery mechanism because in this case the entire field will be displayed - and I can't search for "*". Is there some trick I'm not thinking of?

> this is the tip of the iceberg; a big dangerous iceberg...

Yes - I'm beginning to see that...

Thanks,
-Chris



RE: recovering payload from fields

Posted by Christopher Condit <co...@sdsc.edu>.
Hi Chris-
> To my knowledge, the character position of the tokens is not preserved by
> Lucene - only the ordinal position of tokens within a document / field is
> preserved.  Thus you need to store this character offset information
> separately, say, as Payload data.

Thanks for the information. So adding the OffsetAttribute at index time doesn't embed the offset information in the index - it just makes it available to the TokenFilter? I'll try adding the offset from the attribute to the payload.

In terms of getting access to the payloads, is the best way to reconstruct the token stream (as the Highlighter does)? Or is there an easier way to just get access to the payloads?

Thanks,
-Chris




Re: recovering payload from fields

Posted by Christopher Tignor <ct...@thinkmap.com>.
Hello,

To my knowledge, the character position of the tokens is not preserved by
Lucene - only the ordinal position of tokens within a document / field is
preserved.  Thus you need to store this character offset information
separately, say, as Payload data.
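
A minimal sketch of such a filter, assuming the Lucene 2.9/3.0 attribute API
(the class name and the 8-byte start/end encoding are just one possible scheme):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Copies each token's character offsets into its payload so they survive in
// the index (start/end offsets are not otherwise stored with postings).
public final class OffsetPayloadFilter extends TokenFilter {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public OffsetPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    byte[] bytes = new byte[8];
    writeInt(bytes, 0, offsetAtt.startOffset());
    writeInt(bytes, 4, offsetAtt.endOffset());
    payloadAtt.setPayload(new Payload(bytes));
    return true;
  }

  // Big-endian int; decode in the same order at search time.
  private static void writeInt(byte[] b, int pos, int v) {
    b[pos]     = (byte) (v >>> 24);
    b[pos + 1] = (byte) (v >>> 16);
    b[pos + 2] = (byte) (v >>> 8);
    b[pos + 3] = (byte) v;
  }
}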

best,

C>T>

On Fri, Feb 26, 2010 at 3:41 PM, Christopher Condit <co...@sdsc.edu> wrote:

> I'm trying to store semantic information in payloads at index time. I
> believe this part is successful - but I'm having trouble getting access to
> the payload locations after the index is created. I'd like to know the
> offset in the original text for the token with the payload - and get this
> information for all payloads that are set in a Field even if they don't
> relate to the query. I tried (from the highlighting filter):
> TokenStream tokens = TokenSources.getTokenStream(reader, 0, "body");
> while (tokens.incrementToken()) {
>   TermAttribute term = tokens.getAttribute(TermAttribute.class);
>   if (tokens.hasAttribute(PayloadAttribute.class)) {
>     PayloadAttribute payload = tokens.getAttribute(PayloadAttribute.class);
>     OffsetAttribute offset = tokens.getAttribute(OffsetAttribute.class);
>   }
> }
> But the OffsetAttribute never seems to contain any information.
> In my token filter do I need to do more than:
> offsetAtt = addAttribute(OffsetAttribute.class);
> during construction in order to store Offset information?
>
> Thanks,
> -Chris
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999