Posted to java-user@lucene.apache.org by Michael Stoppelman <st...@gmail.com> on 2009/02/03 08:24:06 UTC

Poor QPS with highlighting

Hi all,

My search backends are only able to eke out 13-15 qps even with the entire
index in memory (this makes it very expensive to scale). According to my
YourKit profiler, 80% of the program's time is spent in highlighting. With
highlighting disabled my backend gets about 45-50 qps (cheaper scaling)!
We're using Mark's TokenSources contrib to make reconstructing the
document quicker. I was contemplating patching the index to store offsets
for every term (instead of just the ordinal positions) so that I could make
the highlighting faster, since you would already know where each hit falls
in the document after the search pass. I saw this thread from 2004:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04743.html -
which asks about adding offsets to the index; it was decided against
because it would make the index too large. I can totally understand this,
but as machines get beefier it would probably be nice to make this
optional, since 15 qps vs 50 qps is quite a trade-off right now. Are
other folks seeing this? My documents are quite big, sometimes up to 300k
tokens. Also, my document fields are compressed, which is another time sink
for the CPU.
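To illustrate the idea (a minimal sketch, not Lucene's API; the class and method names here are hypothetical): if the index already stored character offsets for each matched term, the highlighter could splice markup into the stored text directly, with no re-analysis pass at all:

```java
import java.util.List;

public class OffsetHighlighter {
    // Given pre-stored character offsets for the matched terms, wrap each
    // hit in <b> tags without re-tokenizing the document text. Each int[]
    // is a {start, end} span; spans are assumed sorted and non-overlapping.
    public static String highlight(String text, List<int[]> offsets) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] span : offsets) {
            out.append(text, last, span[0]);                       // untouched prefix
            out.append("<b>").append(text, span[0], span[1]).append("</b>");
            last = span[1];
        }
        out.append(text.substring(last));                          // untouched suffix
        return out.toString();
    }
}
```

The expensive part of highlighting large documents is re-running the analyzer to recover offsets; with offsets in the index, the work above is a single linear splice over the stored field.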

Please let me know if you need more details, happy to share.

Sincerely,
M

Re: Poor QPS with highlighting

Posted by Michael Stoppelman <st...@gmail.com>.
Thanks Mark for the explanation. I think your solution would definitely
change the tf-idf scoring for documents, since the field is now split up
over multiple docs. One option to get around the scoring change would be
to run a completely separate index for highlighting (with the overlapping
docs you described). It still seems like storing the offsets would be the
most efficient solution, since I wouldn't need a new service to do the
highlighting.

M


Re: Poor QPS with highlighting

Posted by markharw00d <ma...@yahoo.co.uk>.
> Can you describe this in a little more detail; I'm not exactly sure what you
> mean.
>   

Break your large text documents into multiple Lucene documents. Rather 
than dividing them up into entirely discrete chunks of text, consider 
storing/indexing *overlapping* sections of text, with an overlap as big 
as the largest "slop" factor you use on Phrase/Span queries, so that you 
don't cut any potential phrases in half and fail to match. For example:

This non-overlapping indexing scheme will not match a search for "George 
Bush":

    Doc 1 = "....  outgoing president George "
    Doc 2 = "Bush stated that ..."

While this overlapping scheme will match:
    Doc 1 = "....  outgoing president George "
    Doc 2 = "president George Bush stated that ..."

This fragmenting approach helps avoid the performance cost of 
highlighting very large documents.
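The split above can be sketched in a few lines of plain Java (hypothetical names; a real version would feed each chunk into its own Lucene Document):

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapChunker {
    // Split a token list into fixed-size chunks whose tail tokens repeat
    // at the head of the next chunk, so a phrase spanning a chunk boundary
    // still matches. 'overlap' should be at least the largest slop used
    // in your Phrase/Span queries.
    public static List<List<String>> chunk(List<String> tokens, int chunkSize, int overlap) {
        if (chunkSize <= overlap) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<List<String>> chunks = new ArrayList<>();
        int step = chunkSize - overlap;               // advance by chunk minus overlap
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
            if (end == tokens.size()) break;          // final chunk reached
        }
        return chunks;
    }
}
```

With a chunk size of 4 and an overlap of 2, "outgoing president George Bush stated that" yields a second chunk beginning "George Bush ...", so the phrase survives the boundary.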

The remaining issue is to remove duplicates in your search results when 
you match multiple chunks. For example, Lucene Docs #1 and #2 both refer 
to Input Doc #1 and will both match a search for "president". You will 
need to store a field for the "original document number" and remove any 
duplicates (or merge them in the display if that is what is required).
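The duplicate-removal step might look like this (a sketch with hypothetical names; it assumes each chunk hit carries the stored "original document number" and a score, and that hits arrive in rank order):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkDeduper {
    // Collapse chunk-level hits back to source documents, keeping only the
    // best-scoring chunk per original document. Each hit is {origDoc, score};
    // a LinkedHashMap preserves the incoming rank order of first appearance.
    public static List<int[]> dedupe(List<int[]> hits) {
        Map<Integer, int[]> best = new LinkedHashMap<>();
        for (int[] hit : hits) {
            best.merge(hit[0], hit, (a, b) -> b[1] > a[1] ? b : a);
        }
        return new ArrayList<>(best.values());
    }
}
```

Merging instead of dropping duplicates would just mean accumulating the chunks per key rather than keeping one winner.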

Cheers,
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Poor QPS with highlighting

Posted by Michael Stoppelman <st...@gmail.com>.
On Tue, Feb 3, 2009 at 1:14 AM, mark harwood <ma...@yahoo.co.uk>wrote:

> >>My documents are quite big sometimes up to 300ktokens.
>
> You could look at indexing them as separate documents using overlapping
> sections of text. Erik used this for one of his projects.
>

Can you describe this in a little more detail; I'm not exactly sure what you
mean.



Re: Poor QPS with highlighting

Posted by mark harwood <ma...@yahoo.co.uk>.
>>My documents are quite big sometimes up to 300k tokens.

You could look at indexing them as separate documents using overlapping sections of text. Erik used this for one of his projects.

Cheers
Mark





Re: Poor QPS with highlighting

Posted by Jason Rutherglen <ja...@gmail.com>.
http://en.wikipedia.org/wiki/Google_platform

Document server summarization


Re: Poor QPS with highlighting

Posted by Michael Stoppelman <st...@gmail.com>.
On Thu, Feb 5, 2009 at 12:47 PM, Michael Stoppelman <st...@gmail.com>wrote:

>
>
> On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> Google uses dedicated highlighting servers.  Maybe this architecture would
>> work for you.
>>
>
> What's your reference? I used to work at Google.
>

I think creating a separate index/service would be reasonable, and it's what
I proposed in a previous email on this thread:
"One option to get around the changing scoring would be to run a
completely separate index for highlighting (with the overlapping docs you
described)."

Do Lucene developers still think storing the offsets is a bad idea, from an
index-size perspective or for some other reason?

M



Re: Poor QPS with highlighting

Posted by Michael Stoppelman <st...@gmail.com>.
On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen <jason.rutherglen@gmail.com
> wrote:

> Google uses dedicated highlighting servers.  Maybe this architecture would
> work for you.
>

What's your reference? I used to work at Google.



Re: Poor QPS with highlighting

Posted by Jason Rutherglen <ja...@gmail.com>.
Google uses dedicated highlighting servers.  Maybe this architecture would
work for you.
