Posted to java-user@lucene.apache.org by Greg Shackles <gs...@gmail.com> on 2008/11/12 16:47:47 UTC

Lucene implementation/performance question

I hope this isn't a dumb question or anything, I'm fairly new to Lucene so
I've been picking it up as I go pretty much.  Without going into too much
detail, I need to store pages of text, and for each word on each page, store
detailed information about it.  To do this, I have 2 indexes:

1) pages: this stores the full text of the page, and identifying information
about it
2) words: this stores a single word, along with the page it was on and is
stored in the order they appear on the page

When doing a search, not only do I need to return the page it was found on,
but also the details of the matching words.  Since I couldn't think of a
better way to do it, I first search the pages index and find any matching
pages.  Then I iterate the words on those pages to find where the match
occurred.  Obviously this is costly as far as execution time goes, but at
least it only has to get done for matching pages rather than every page.
Searches still take way longer than I'd like though, and the bottleneck is
almost entirely in the code to find the matches on the page.
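
Sketched in plain Java (the class and method names here are made up, and Lucene itself is left out), the per-page scan that dominates the search time is essentially this:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the scan step described above: after the pages index
// identifies matching pages, each page's word list is walked linearly to
// find where the phrase starts. Names are hypothetical.
public class PageScan {
    // Returns the word offsets at which `phrase` begins on the page.
    public static List<Integer> findPhrase(List<String> pageWords, List<String> phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + phrase.size() <= pageWords.size(); i++) {
            boolean match = true;
            for (int j = 0; j < phrase.size(); j++) {
                if (!pageWords.get(i + j).equalsIgnoreCase(phrase.get(j))) {
                    match = false;
                    break;
                }
            }
            if (match) hits.add(i);
        }
        return hits;
    }
}
```

With many words per page, this linear pass over every matching page is where the time goes.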

One simple optimization I can think of is to store the pages in smaller blocks
so that the scope of the iteration is made smaller.  This is not really
ideal, since I also need the ability to narrow down results based on other
words that can/can't appear on the same page, which would mean storing 3 full
copies of every word on every page (one in each of the 3 resulting indexes).

I know this isn't a Java performance forum so I'll try to keep this
Lucene-related, but has anyone done anything similar to this, or have any
comments/ideas on how to improve it?  I'm in the process of trying to speed
things up since I need to perform many searches often over very large sets
of pages.  Thanks!

- Greg

Re: Latest stable release?

Posted by Ian Lea <ia...@gmail.com>.
Ummm ...

I know it because there was an email sent to this list on 11-Oct
saying "Release 2.4.0 of Lucene is now available!".  Doesn't explicitly
say stable, but that is the implication.

I'm not sure about any convention but it seems a fair bet.


--
Ian.


On Fri, Nov 28, 2008 at 11:32 AM, Chris Bamford
<ch...@scalix.com> wrote:
> Thanks Ian.
>
> Is that the convention - the top of the list on
> http://lucene.apache.org/java/docs/index.html is always the latest stable
> release - or do you know that by some other means?
>
> Cheers,
>
> - Chris
>
>
> Ian Lea wrote:
>>
>> 2.4.0.
>>
>>
>> --
>> Ian.
>>
>>
>> On Fri, Nov 28, 2008 at 11:16 AM, Chris Bamford
>> <ch...@scalix.com> wrote:
>>
>>>
>>> Hi
>>>
>>> Can anyone tell me what the latest stable release is?
>>>  http://lucene.apache.org/java/docs/index.html doesn't say.
>>>
>>> Thanks,
>>>
>>> - Chris
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> --
> Chris Bamford
> Senior Development Engineer
> *Scalix*
> chris.bamford@scalix.com
> Tel: +44 (0)1344 381814
> www.scalix.com
>
>
>



Re: Latest stable release?

Posted by Chris Bamford <ch...@scalix.com>.
Thanks Ian.

Is that the convention - the top of the list on 
http://lucene.apache.org/java/docs/index.html is always the latest 
stable release - or do you know that by some other means?

Cheers,

- Chris


Ian Lea wrote:
> 2.4.0.
>
>
> --
> Ian.
>
>
> On Fri, Nov 28, 2008 at 11:16 AM, Chris Bamford
> <ch...@scalix.com> wrote:
>   
>> Hi
>>
>> Can anyone tell me what the latest stable release is?
>>  http://lucene.apache.org/java/docs/index.html doesn't say.
>>
>> Thanks,
>>
>> - Chris
>>     
>
>
>   


-- 
Chris Bamford
Senior Development Engineer
*Scalix*
chris.bamford@scalix.com
Tel: +44 (0)1344 381814
www.scalix.com



Re: Latest stable release?

Posted by Ian Lea <ia...@gmail.com>.
2.4.0.


--
Ian.


On Fri, Nov 28, 2008 at 11:16 AM, Chris Bamford
<ch...@scalix.com> wrote:
> Hi
>
> Can anyone tell me what the latest stable release is?
>  http://lucene.apache.org/java/docs/index.html doesn't say.
>
> Thanks,
>
> - Chris



Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
The queries I'm doing really aren't anything clever...just searching for
phrases on pages of text, sometimes narrowing results by other words that
must appear on the page, or words that cannot appear on the same page.  I
don't have experience with those span queries so I can't say much about
them.  However, I will say that at present there seems to be no way to make
the PayloadSpanUtil act on just a subset of an index.  I tried just taking
matching documents and putting them into a RAM backed index, but that
doesn't transfer over the payloads so it was pretty much useless for me.  I
hope this is something they can work out in the future.  In the payloads I
store a lot of metadata, including the word as it actually appeared on the
page, with capitalization, punctuation, etc.
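
For illustration only, packing that kind of per-word metadata into the byte[] a payload carries could look like the following (the field layout below is invented; the thread never spells out the real format):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Illustrative packing of per-word metadata into a payload's byte array.
// The layout (position, one style flag, original text) is a made-up example.
public class WordPayload {
    public static byte[] encode(String asSeen, int indexOnPage, boolean bold) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(indexOnPage);          // the word's position on the page
            out.writeBoolean(bold);             // one example style attribute
            byte[] text = asSeen.getBytes(StandardCharsets.UTF_8);
            out.writeInt(text.length);          // length-prefixed original text,
            out.write(text);                    // capitalization/punctuation intact
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // in-memory streams shouldn't fail
        }
    }

    public static Object[] decode(byte[] payload) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
            int index = in.readInt();
            boolean bold = in.readBoolean();
            byte[] text = new byte[in.readInt()];
            in.readFully(text);
            return new Object[] { index, bold, new String(text, StandardCharsets.UTF_8) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```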

I don't think it's really feasible to search on the payload since that isn't
indexed.  If you have things in there that you would want indexed, I would
suggest designing your indexes differently to accommodate that.

I'm using Lucene 2.4, plus the patch that Mark put out to fix the payload
issues I ran into.  I wouldn't suggest using anything less since it wasn't
very useful before the patch.  It would probably be worth your time just to
upgrade the version of Lucene you are using anyway, for a variety of
reasons.

- Greg

Re: Lucene implementation/performance question

Posted by Eran Sevi <er...@gmail.com>.
Hi Greg,
Thanks for quick and detailed answer.

What kind of queries do you run? Is it going to work for
SpanNearQueries/SpanNotQueries  as well?
Do you also get the word itself at each position?

It would be great if I could search on the content of each payload as well,
but since the payload content is quite complicated and not a simple value I
guess it's too much to ask for.

What version of Lucene are you using? I'm not sure I'll be able to use the
latest fixes.

Thanks again,
Eran.
On Wed, Nov 26, 2008 at 4:47 PM, Greg Shackles <gs...@gmail.com> wrote:

> Sure, I'm happy to give some insight into this.  My index itself has a few
> fields - one that uniquely identifies the page, one that stores all the
> text
> on the page, and then some others to store characteristics.  At indexing
> time, the text field for each document is manually created by concatenating
> each word together, separated by spaces.  Then the IndexWriter runs the
> document through a custom filter that attaches payloads to each token.  The
> payloads here include all the attributes I need regarding that word, and
> most importantly, the index of that word on the page.  The tricky part here
> was that one of my "words" could map to more than one Lucene token, so I
> first create a quick map from my words to which token they should
> correspond
> to, by running each word through an Analyzer (StandardAnalyzer in my case).
> This makes it easy to only attach the payload to the first token for each
> of
> my words.
>
> For searching, I pass the search query to a PayloadSpanUtil which gets the
> payloads for every match throughout the entire index.  I take these results
> and put them into a Collection of custom objects, and then sort them first
> by page identifier, and then by index on the page.  Once I have this list,
> I
> can quickly iterate through it to find the groupings of payloads that match
> the search term (this also helps weed out the occasional bad result that
> comes back).  I wasn't sure initially if this would be a performance hit
> but
> it is very quick.  Basically what I do is tokenize the search string, then
> concatenate all tokens together without spaces into one string.  Then when
> iterating through I see if the word matches the start of the tokenized
> string - if so, chop it off and keep going until the whole string is found.
> Then repeat, and so on.  It's certainly not the most elegant solution but I
> didn't see a better way since PSU doesn't group or sort on its own.
>
> One other solution I might try if I have time is to take each document from
> the original search, put them one at a time into a MemoryIndex and then let
> PSU act on that.  I'm not sure if this would help/hurt performance but
> might
> be worth trying.  I will also say to make sure you apply Mark's latest
> patch
> (see the case here: https://issues.apache.org/jira/browse/LUCENE-1465)
> since
> it fixed some important bugs I had come across.
>
> I hope this made sense, I haven't finished my morning coffee yet so I can't
> be too sure : )  Let me know if you have any more questions.
>
> - Greg
>
>
>
> On Wed, Nov 26, 2008 at 3:19 AM, Eran Sevi <er...@gmail.com> wrote:
>
> > Hi,
> > Can you please shed some light on what your final architecture looks like?
> > Do you manually use the PayloadSpanUtil for each document separately?
> > How did you solve the problem with phrase results?
> > Thanks in advance for your time,
> > Eran.
> > On Tue, Nov 25, 2008 at 10:30 PM, Greg Shackles <gs...@gmail.com>
> > wrote:
> >
> > > Just wanted to post a little follow-up here now that I've gotten
> through
> > > implementing the system using payloads.  Execution times are
> phenomenal!
> > > Things that took over a minute to run in my old system take fractions
> of
> > a
> > > second to run now.  I would also like to thank Mark for being very
> > > responsive in fixing/patching some bugs I encountered along the way.
> > >
> > > - Greg
> > >
> >
>

Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
Sure, I'm happy to give some insight into this.  My index itself has a few
fields - one that uniquely identifies the page, one that stores all the text
on the page, and then some others to store characteristics.  At indexing
time, the text field for each document is manually created by concatenating
each word together, separated by spaces.  Then the IndexWriter runs the
document through a custom filter that attaches payloads to each token.  The
payloads here include all the attributes I need regarding that word, and
most importantly, the index of that word on the page.  The tricky part here
was that one of my "words" could map to more than one Lucene token, so I
first create a quick map from my words to which token they should correspond
to, by running each word through an Analyzer (StandardAnalyzer in my case).
This makes it easy to only attach the payload to the first token for each of
my words.

For searching, I pass the search query to a PayloadSpanUtil which gets the
payloads for every match throughout the entire index.  I take these results
and put them into a Collection of custom objects, and then sort them first
by page identifier, and then by index on the page.  Once I have this list, I
can quickly iterate through it to find the groupings of payloads that match
the search term (this also helps weed out the occasional bad result that
comes back).  I wasn't sure initially if this would be a performance hit but
it is very quick.  Basically what I do is tokenize the search string, then
concatenate all tokens together without spaces into one string.  Then when
iterating through I see if the word matches the start of the tokenized
string - if so, chop it off and keep going until the whole string is found.
Then repeat, and so on.  It's certainly not the most elegant solution but I
didn't see a better way since PSU doesn't group or sort on its own.
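
That sort-then-chop pass can be sketched in plain Java like so (the record and method names are invented, and each payload is reduced to a page/index/token triple):

```java
import java.util.*;

// Sketch of the grouping pass described above: payload records are sorted by
// page then by position, and runs whose concatenated tokens spell out the
// query are collected as one match. Names are invented.
public class PayloadGrouper {
    public record Hit(String page, int index, String token) {}

    public static List<Hit> sortHits(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing(Hit::page).thenComparingInt(Hit::index));
        return sorted;
    }

    // `queryTokens` is the analyzed search string; returns the page index at
    // which each complete occurrence of the query starts.
    public static List<Integer> findRuns(List<Hit> sorted, List<String> queryTokens) {
        String target = String.join("", queryTokens);   // all tokens, no spaces
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            String remaining = target;
            int j = i;
            // chop matching tokens off the front until the whole string is consumed
            while (j < sorted.size() && remaining.startsWith(sorted.get(j).token())) {
                remaining = remaining.substring(sorted.get(j).token().length());
                j++;
                if (remaining.isEmpty()) {
                    starts.add(sorted.get(i).index());
                    break;
                }
            }
        }
        return starts;
    }
}
```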

One other solution I might try if I have time is to take each document from
the original search, put them one at a time into a MemoryIndex and then let
PSU act on that.  I'm not sure if this would help/hurt performance but might
be worth trying.  I will also say to make sure you apply Mark's latest patch
(see the case here: https://issues.apache.org/jira/browse/LUCENE-1465) since
it fixed some important bugs I had come across.

I hope this made sense, I haven't finished my morning coffee yet so I can't
be too sure : )  Let me know if you have any more questions.

- Greg



On Wed, Nov 26, 2008 at 3:19 AM, Eran Sevi <er...@gmail.com> wrote:

> Hi,
> Can you please shed some light on what your final architecture looks like?
> Do you manually use the PayloadSpanUtil for each document separately?
> How did you solve the problem with phrase results?
> Thanks in advance for your time,
> Eran.
> On Tue, Nov 25, 2008 at 10:30 PM, Greg Shackles <gs...@gmail.com>
> wrote:
>
> > Just wanted to post a little follow-up here now that I've gotten through
> > implementing the system using payloads.  Execution times are phenomenal!
> > Things that took over a minute to run in my old system take fractions of
> a
> > second to run now.  I would also like to thank Mark for being very
> > responsive in fixing/patching some bugs I encountered along the way.
> >
> > - Greg
> >
>

Re: Lucene implementation/performance question

Posted by Eran Sevi <er...@gmail.com>.
Hi,
Can you please shed some light on what your final architecture looks like?
Do you manually use the PayloadSpanUtil for each document separately?
How did you solve the problem with phrase results?
Thanks in advance for your time,
Eran.
On Tue, Nov 25, 2008 at 10:30 PM, Greg Shackles <gs...@gmail.com> wrote:

> Just wanted to post a little follow-up here now that I've gotten through
> implementing the system using payloads.  Execution times are phenomenal!
> Things that took over a minute to run in my old system take fractions of a
> second to run now.  I would also like to thank Mark for being very
> responsive in fixing/patching some bugs I encountered along the way.
>
> - Greg
>

Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
Just wanted to post a little follow-up here now that I've gotten through
implementing the system using payloads.  Execution times are phenomenal!
Things that took over a minute to run in my old system take fractions of a
second to run now.  I would also like to thank Mark for being very
responsive in fixing/patching some bugs I encountered along the way.

- Greg

Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
Thanks for the update, Mark.  I guess that means I'll have to do the sorting
myself - that shouldn't be too hard, but the annoying part would just be
knowing where one result ends and the next begins since there's no guarantee
that they'll always be the same. Let me know if you find any information on
the discussion about it.

- Greg


On Thu, Nov 20, 2008 at 11:43 AM, Mark Miller <ma...@gmail.com> wrote:

> Yeah, discussion came up on order and I believe we punted - it's up to you
> to track order and sort at the moment. I think that was to prevent those
> that didn't need it from paying the sort cost, but I have to go find that
> discussion again (maybe it's in the issue?). I'll look at the whole idea again
> though.
>
>

Re: Lucene implementation/performance question

Posted by Mark Miller <ma...@gmail.com>.
Yeah, discussion came up on order and I believe we punted - it's up to 
you to track order and sort at the moment. I think that was to prevent 
those that didn't need it from paying the sort cost, but I have to go 
find that discussion again (maybe it's in the issue?). I'll look at the 
whole idea again though.

Greg Shackles wrote:
> On Wed, Nov 19, 2008 at 12:33 PM, Greg Shackles <gs...@gmail.com> wrote:
>
>   
>> In the searching phase, I would run the search across all page documents,
>> and then for each of those pages, do a search with
>> PayloadSpanUtil.getPayloadsForQuery that made it so it only got payloads for
>> each page at a time.  The function returns a Collection of Payloads as far
>> as I can tell, so is there any way of knowing which payloads go together?
>> That is to say, if you were to do a search for "lucene rocks" on the page
>> and it appeared 3 times, you would get back 6 payloads in total.  Is there a
>> quick way of knowing how to group them in the collection?
>>
>>     
>
> Just a follow-up on my post now that I was able to see what the real data
> looks like when it comes back from PayloadSpanUtil.  The order of payload
> terms in the collection doesn't seem useful, as I suspect it is somehow
> related to the order they are stored in the index itself.  Because of that,
> grouping them is going to be difficult as I suspected, but this seems like
> something Lucene should be able to do for me.  Is that not correct?  I'd
> like to keep as much of the logic as possible out of my own implementation
> for the sake of performance so if there is some way to do this, I would love
> to know.  Thanks!
>
> By the way, the Payloads feature is really cool! Definitely way better than
> how I was doing things originally.  : )
>
> - Greg
>
>   




Latest stable release?

Posted by Chris Bamford <ch...@scalix.com>.
Hi

Can anyone tell me what the latest stable release is?  
http://lucene.apache.org/java/docs/index.html doesn't say.

Thanks,

- Chris



Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
On Wed, Nov 19, 2008 at 12:33 PM, Greg Shackles <gs...@gmail.com> wrote:

> In the searching phase, I would run the search across all page documents,
> and then for each of those pages, do a search with
> PayloadSpanUtil.getPayloadsForQuery that made it so it only got payloads for
> each page at a time.  The function returns a Collection of Payloads as far
> as I can tell, so is there any way of knowing which payloads go together?
> That is to say, if you were to do a search for "lucene rocks" on the page
> and it appeared 3 times, you would get back 6 payloads in total.  Is there a
> quick way of knowing how to group them in the collection?
>

Just a follow-up on my post now that I was able to see what the real data
looks like when it comes back from PayloadSpanUtil.  The order of payload
terms in the collection doesn't seem useful, as I suspect it is somehow
related to the order they are stored in the index itself.  Because of that,
grouping them is going to be difficult as I suspected, but this seems like
something Lucene should be able to do for me.  Is that not correct?  I'd
like to keep as much of the logic as possible out of my own implementation
for the sake of performance so if there is some way to do this, I would love
to know.  Thanks!

By the way, the Payloads feature is really cool! Definitely way better than
how I was doing things originally.  : )

- Greg

Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
I have a couple quick questions...it might just be because I haven't looked
at this in a week now (got pulled away onto some other stuff that had to
take priority).

In the searching phase, I would run the search across all page documents,
and then for each of those pages, do a search with
PayloadSpanUtil.getPayloadsForQuery that made it so it only got payloads for
each page at a time.  The function returns a Collection of Payloads as far
as I can tell, so is there any way of knowing which payloads go together?
That is to say, if you were to do a search for "lucene rocks" on the page
and it appeared 3 times, you would get back 6 payloads in total.  Is there a
quick way of knowing how to group them in the collection?

Also, I need a way of seeing the words that came before or after a match on
the page.  The quick answer would be to store the next and previous word in
the meta-data but this isn't scalable and would mean reindexing everything
if I wanted to change the number of words stored.  My thought was to have
another index of words that stores the text, page information, and the index
of that word on the page.  Then that index can be used in the meta-data in
the first index, so if we know we got back word 5, we can run queries to get
words 4 and 6 from the page.  Does this make sense, or is there a way that
would be better for performance?
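
Modeled with a plain map standing in for the second Lucene index (the (page, index) key scheme follows the idea above; the names are invented), the neighbor lookup would behave like:

```java
import java.util.*;

// Sketch of the neighbor-word lookup idea: given a hit at word n on a page,
// words n-1 and n+1 are fetched by key rather than stored redundantly in
// each word's metadata. A HashMap stands in for the words index.
public class NeighborLookup {
    private final Map<String, String> wordsByKey = new HashMap<>();

    private static String key(String page, int index) { return page + "#" + index; }

    public void add(String page, int index, String word) {
        wordsByKey.put(key(page, index), word);
    }

    // Returns up to `window` words on either side of the hit, in page order.
    public List<String> context(String page, int hitIndex, int window) {
        List<String> out = new ArrayList<>();
        for (int i = hitIndex - window; i <= hitIndex + window; i++) {
            String w = wordsByKey.get(key(page, i));
            if (w != null) out.add(w);
        }
        return out;
    }
}
```

Widening the context window then needs no reindexing, which is the point of keying by position instead of storing neighbors inline.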

Thanks,
Greg

Re: Lucene implementation/performance question

Posted by Eran Sevi <er...@gmail.com>.
Hi,
I have the same need - to obtain "attributes" for terms stored in some
field. I also need all the results and can't take just the first few docs.

I'm using an older version of lucene and the method i'm using right now is
this:
1. Store the words as usual in some field.
2. Store the attributes of each word in another field, aligned with the words.
If you have several types of information, create a field for each type.
3. In order to retrieve the information, run a span query of the words
field.
4. use the positions of the spans to "jump" directly to the stored metadata
field.

This seems cumbersome but it works quite well and quite fast. You don't have
to do several passes - just iterate on the spans. You can easily add as many
attributes to the word as you want, as long as they are aligned to the words
fields. The problems are that you are restricted to span queries, and that
you have to get all the attributes fields for each document and split the
fields before being able to jump to the right position.
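
As a tiny sketch of that alignment trick (the field contents below are invented): a span position from the words field indexes straight into the split attributes field.

```java
// Sketch of the aligned-fields scheme: the words and their attributes are
// stored as two space-separated strings whose tokens line up one-for-one,
// so a span position from the words field indexes directly into the
// attributes field. Field contents are made up for illustration.
public class AlignedFields {
    // e.g. words = "this is a page", attrs = "b0 b0 b1 b0" (one token per word)
    public static String attributeAt(String attrsField, int position) {
        String[] attrs = attrsField.split(" ");
        return attrs[position];   // "jump" straight to the aligned token
    }
}
```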

However, Mark's suggestion to use PayloadSpanUtil might greatly improve my
solution if for each span we could have the payload information without
running another search. I saw some discussions about extending the Spans
class to include the combined payloads of the terms in each span, but I
don't think it was implemented.

Eran.

On Wed, Nov 12, 2008 at 9:53 PM, Greg Shackles <gs...@gmail.com> wrote:

> >
> > Right, sounds like you have it spot on. That second * from 3 looks like a
> > possible tricky part.
>
>
> I agree that it will be the tricky part but I think as long as I'm careful
> with counting as I iterate through it should be ok (I probably just doomed
> myself by saying that...)
>
> Right...you'd do it essentially how Highlighting works...you do the search
> > to get the docs of interest, and then redo the search somewhat to get the
> > highlights/payloads for an individual doc at a time. You are redoing some
> > work, but if you think about it, getting that info for every match (there
> > could be tons) doesn't make much sense when someone might just look at the
> > top couple results, or say 10 at a time. Depends on your use case if it's
> > feasible or not though. Most find it efficient enough to do highlighting
> > with, so I'm assuming it should be good enough here.
>
>
> In my case I actually need all of the results, no matter how many there
> are.  I imagine my use case is somewhat different than what Lucene is
> typically used for, but if performance is good for payloads then I would
> suspect this method would prove to be pretty quick overall (especially
> compared to the old way I was doing it).  I think this is the route I'm
> going to go, so I will report back with progress as it goes.
>
> - Greg
>

Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
>
> Right, sounds like you have it spot on. That second * from 3 looks like a
> possible tricky part.


I agree that it will be the tricky part but I think as long as I'm careful
with counting as I iterate through it should be ok (I probably just doomed
myself by saying that...)

Right...you'd do it essentially how Highlighting works...you do the search
> to get the docs of interest, and then redo the search somewhat to get the
> highlights/payloads for an individual doc at a time. You are redoing some
> work, but if you think about it, getting that info for every match (there could
> be tons) doesn't make much sense when someone might just look at the top
> couple results, or say 10 at a time. Depends on your use case if it's feasible
> or not though. Most find it efficient enough to do highlighting with, so I'm
> assuming it should be good enough here.


In my case I actually need all of the results, no matter how many there
are.  I imagine my use case is somewhat different than what Lucene is
typically used for, but if performance is good for payloads then I would
suspect this method would prove to be pretty quick overall (especially
compared to the old way I was doing it).  I think this is the route I'm
going to go, so I will report back with progress as it goes.

- Greg

Re: Lucene implementation/performance question

Posted by Mark Miller <ma...@gmail.com>.
Greg Shackles wrote:
> Thanks!  This all actually sounds promising, I just want to make sure I'm
> thinking about this correctly.  Does this make sense?
>
> Indexing process:
>
> 1) Get list of all words for a page and their attributes, stored in some
> sort of data structure
> 2) Concatenate the text from those words (space separated) into a string
> that represents the entire page
> 3) When adding the page document to the index, run it through a custom
> analyzer that attaches the payloads to the tokens
>   * this would have to follow along in the word list from #1 to get the
> payload information for each token
>   * would also have to tokenize the word we are storing to see how many
> Lucene tokens it would translate to (to make sure the right payloads go with
> the right tokens)
>   
Right, sounds like you have it spot on. That second * from 3 looks like 
a possible tricky part.
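
The bookkeeping in that second sub-step can be sketched like so (a trivial splitter stands in for running each word through StandardAnalyzer; all names are invented):

```java
import java.util.*;

// Sketch of the word-to-token mapping: each source word is tokenized on its
// own, and the payload is attached only to the first of the tokens it
// produces. A crude regex split stands in for the real analyzer.
public class WordTokenMap {
    // Stand-in for the analyzer: splits a word like "well-known" into sub-tokens.
    static List<String> analyze(String word) {
        List<String> out = new ArrayList<>();
        for (String t : word.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // For each token position in the page's token stream, gives the index of
    // the source word whose payload belongs there, or -1 for continuation
    // tokens that should carry no payload.
    public static List<Integer> payloadOwners(List<String> words) {
        List<Integer> owners = new ArrayList<>();
        for (int w = 0; w < words.size(); w++) {
            List<String> toks = analyze(words.get(w));
            for (int t = 0; t < toks.size(); t++) {
                owners.add(t == 0 ? w : -1);   // payload only on the first token
            }
        }
        return owners;
    }
}
```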
> I haven't totally analyzed the searching process yet since I want to get my
> head around the storage part first, but I imagine that would be the easier
> part anyway.  Does this approach sound reasonable?
>   
Sounds good.
> My other concern is your comment about isolating results.  If I'm reading it
> correctly, it means that I'd have to do the search in multiple passes, one
> to get the individual docs containing the matches, and then one query for
> each of those to get the payloads within them?
>   
Right...you'd do it essentially how Highlighting works...you do the 
search to get the docs of interest, and then redo the search somewhat to 
get the highlights/payloads for an individual doc at a time. You are 
redoing some work, but if you think about it, getting that info for every 
match (there could be tons) doesn't make much sense when someone might 
just look at the top couple results, or say 10 at a time. Depends on 
your use case if it's feasible or not though. Most find it efficient 
enough to do highlighting with, so I'm assuming it should be good enough 
here.
> Thanks again for your help on this one.
>
> - Greg
>
>
> On Wed, Nov 12, 2008 at 12:52 PM, Mark Miller <ma...@gmail.com> wrote:
>
>   
>> Here is a great PowerPoint on payloads from Michael Busch:
>> www.us.apachecon.com/us2007/downloads/AdvancedIndexingLucene.ppt.
>> Essentially, you can store metadata at each term position, so it's an
>> excellent place to store attributes of the term - they are very fast to
>> load, efficient, etc.
>>
>> You can check out the spans test classes for a small example using the
>> PayloadSpanUtil...it's actually fairly simple and short, and the main reason
>> I consider it experimental is that it hasn't really been used too much to my
>> knowledge (who knows though). If you have a problem, you'll know quickly and
>> I'll fix quickly. It should work fine though. Overall, the approach wouldn't
>> take that much code, so I don't think you'd be out a lot of time.
>>
>> The PayloadSpanUtil takes an IndexReader and a query and returns the
>> payloads for the terms in the IndexReader that match the query. If you end
>> up with multiple docs in the IndexReader, be sure to isolate the query down
>> to the exact doc you want the payloads from (the Span scoring mode of the
>> highlighter actually puts the doc in a fast MemoryIndex which only holds one
>> doc, and uses an IndexReader from the MemoryIndex).
>>
>>
>> Greg Shackles wrote:
>>
>>     
>>> Hey Mark,
>>>
>>> This sounds very interesting.  Is there any documentation or examples I
>>> could see?  I did a quick search but didn't really find much.  It might
>>> just
>>> be that I don't know how payloads work in Lucene, but I'm not sure how I
>>> would see this actually doing what I need.  My reasoning is this...you'd
>>> have an index that stores all the text for a particular page.  Would you
>>> be
>>> able to attach payload information to individual words on that page?  In
>>> my
>>> head it seems like that would be the job of a second index, which is
>>> exactly
>>> why I added the word index.
>>>
>>> Any details you can give would be great as I need to keep moving on this
>>> project quickly.  I will also say that I'm somewhat wary of using an
>>> experimental class since this is a really important project that really
>>> won't be able to wait on a lot of development cycles to get the class
>>> fully
>>> working.  That said, if it can give me serious speed improvements it's
>>> definitely worth considering.
>>>
>>> - Greg
>>>
>>>
>>> On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>       
>>>> If your new to Lucene, this might be a little much (and maybe I am not
>>>> fully understand the problem), but you might try:
>>>>
>>>> Add the attributes to the words in a payload with a PayloadAnalyzer. Do
>>>> searching as normal. Use the new PayloadSpanUtil class to get the
>>>> payloads
>>>> for the matching words. (Think of the PayloadSpanUtil as a highlighter -
>>>> you
>>>> give it a query, it gives you the payloads to the terms that match). The
>>>> PayloadSpanUtil class is a bit experimental, but I'll fix anything you
>>>> run
>>>> into with it.
>>>>
>>>> - Mark
>>>>
>>>>
>>>> Greg Shackles wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> Hi Erick,
>>>>>
>>>>> Thanks for the response, sorry that I was somewhat vague in the
>>>>> reasoning
>>>>> for my implementation in the first post.  I should have mentioned that
>>>>> the
>>>>> word details are not details of the Lucene document, but are attributes
>>>>> about the word that I am storing.  Some examples are position on the
>>>>> actual
>>>>> page, color, size, bold/italic/underlined, and most importantly, the
>>>>> text
>>>>> as
>>>>> it appeared on the page.  The reason the last one matters is that things
>>>>> like punctuation, spacing and capitalization can vary between the result
>>>>> and
>>>>> the search term, and can affect how I need to process the results
>>>>> afterwards.  I am certainly open to the idea of a new approach if it
>>>>> would
>>>>> improve on things, I admit I am new to Lucene so if there are options
>>>>> I'm
>>>>> unaware of I'd love to learn about them.
>>>>>
>>>>> Just to sum it up with an example, let's say we have a page of text that
>>>>> stores "This is a page of text."  We want to search for the text "of
>>>>> text",
>>>>> which would span multiple words in the word index.  The final result
>>>>> would
>>>>> need to contain "of" and "text", along with the details about each as
>>>>> described before.  I hope this is more helpful!
>>>>>
>>>>> - Greg
>>>>>
>>>>> On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <
>>>>> erickerickson@gmail.com
>>>>>
>>>>>
>>>>>           
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>           
>>>>>> If I may suggest, could you expand upon what you're trying to
>>>>>> accomplish? Why do you care about the detailed information
>>>>>> about each word? The reason I'm suggesting this is "the XY
>>>>>> problem". That is, people often ask for details about a specific
>>>>>> approach when what they really need is a different approach
>>>>>>
>>>>>> There are TermFrequencies, TermPositions,
>>>>>> TermVectorOffsetInfo and a bunch of other stuff that I don't
>>>>>> know the details of that may work for you if we had
>>>>>> a better idea of what it is you're trying to accomplish...
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gs...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I hope this isn't a dumb question or anything, I'm fairly new to
>>>>>>> Lucene
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> so
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I've been picking it up as I go pretty much.  Without going into too
>>>>>>> much
>>>>>>> detail, I need to store pages of text, and for each word on each page,
>>>>>>> store
>>>>>>> detailed information about it.  To do this, I have 2 indexes:
>>>>>>>
>>>>>>> 1) pages: this stores the full text of the page, and identifying
>>>>>>> information
>>>>>>> about it
>>>>>>> 2) words: this stores a single word, along with the page it was on and
>>>>>>> is
>>>>>>> stored in the order they appear on the page
>>>>>>>
>>>>>>> When doing a search, not only do I need to return the page it was
>>>>>>> found
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> on,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> but also the details of the matching words.  Since I couldn't think of
>>>>>>> a
>>>>>>> better way to do it, I first search the pages index and find any
>>>>>>> matching
>>>>>>> pages.  Then I iterate the words on those pages to find where the
>>>>>>> match
>>>>>>> occurred.  Obviously this is costly as far as execution time goes, but
>>>>>>> at
>>>>>>> least it only has to get done for matching pages rather than every
>>>>>>> page.
>>>>>>> Searches still take way longer than I'd like though, and the
>>>>>>> bottleneck
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> is
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> almost entirely in the code to find the matches on the page.
>>>>>>>
>>>>>>> One simple optimization I can think of is store the pages in smaller
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> blocks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> so that the scope of the iteration is made smaller.  This is not
>>>>>>> really
>>>>>>> ideal, since I also need the ability to narrow down results based on
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> other
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> words that can/can't appear on the same page which would mean storing
>>>>>>> 3
>>>>>>> full
>>>>>>> copies of every word on every page (one in each of the 3 resulting
>>>>>>> indexes).
>>>>>>>
>>>>>>> I know this isn't a Java performance forum so I'll try to keep this
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> Lucene
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> related, but has anyone done anything similar to this, or have any
>>>>>>> comments/ideas on how to improve it?  I'm in the process of trying to
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> speed
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> things up since I need to perform many searches often over very large
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> sets
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> of pages.  Thanks!
>>>>>>>
>>>>>>> - Greg
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>
>>>       
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
Thanks!  This all actually sounds promising, I just want to make sure I'm
thinking about this correctly.  Does this make sense?

Indexing process:

1) Get list of all words for a page and their attributes, stored in some
sort of data structure
2) Concatenate the text from those words (space separated) into a string
that represents the entire page
3) When adding the page document to the index, run it through a custom
analyzer that attaches the payloads to the tokens
  * this would have to follow along in the word list from #1 to get the
payload information for each token
  * would also have to tokenize the word we are storing to see how many
Lucene tokens it would translate to (to make sure the right payloads go with
the right tokens)

I haven't totally analyzed the searching process yet since I want to get my
head around the storage part first, but I imagine that would be the easier
part anyway.  Does this approach sound reasonable?
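To make step 3 concrete, the attribute bundle for each word could be packed into a plain byte array, since a payload is just bytes. A minimal sketch of one possible layout (the class and field names here are made up for illustration, nothing in it is a Lucene API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative codec for the per-word attributes: pack position, style
// flags and the original text into the byte[] that becomes the payload
// attached to the token at that position.
public class WordPayloadCodec {

    public static byte[] encode(int x, int y, boolean bold, boolean italic,
                                String originalText) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(x);                                  // position on the page
            out.writeInt(y);
            out.writeByte((bold ? 1 : 0) | (italic ? 2 : 0)); // style bit flags
            out.writeUTF(originalText);                       // text as it appeared,
            out.flush();                                      // punctuation/case intact
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for in-memory streams
        }
    }

    public static String decodeOriginalText(byte[] payload) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
            in.readInt();  // skip x
            in.readInt();  // skip y
            in.readByte(); // skip style flags
            return in.readUTF();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The same codec would then decode the payloads handed back for the matching terms at search time.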

My other concern is your comment about isolating results.  If I'm reading it
correctly, it means that I'd have to do the search in multiple passes, one
to get the individual docs containing the matches, and then one query for
each of those to get the payloads within them?

Thanks again for your help on this one.

- Greg


On Wed, Nov 12, 2008 at 12:52 PM, Mark Miller <ma...@gmail.com> wrote:

> Here is a great power point on payloads from Michael Busch:
> www.us.apachecon.com/us2007/downloads/AdvancedIndexing*Lucene*.ppt.
> Essentially, you can store metadata at each term position, so its an
> excellent place to store attributes of the term - they are very fast to
> load, efficient, etc.
>
> You can check out the spans test classes for a small example using the
> PayloadSpanUtil...its actually fairly simple and short, and the main reason
> I consider it experimental is that it hasn't really been used too much to my
> knowledge (who knows though). If you have a problem, you'll know quickly and
> I'll fix quickly. It should work fine though. Overall, the approach wouldn't
> take that much code, so I don't think youd be out a lot of time.
>
> The PayloadSpanUtil takes an IndexReader and a query and returns the
> payloads for the terms in the IndexReader that match the query. If you end
> up with multiple docs in the IndexReader, be sure to isolate the query down
> to the exact doc you want the payloads from (the Span scoring mode of the
> highlighter actually puts the doc in a fast MemoryIndex which only holds one
> doc, and uses an IndexReader from the MemoryIndex).
>
>
> [...]

Re: Lucene implementation/performance question

Posted by Mark Miller <ma...@gmail.com>.
Here is a great power point on payloads from Michael Busch: 
www.us.apachecon.com/us2007/downloads/AdvancedIndexingLucene.ppt. 
Essentially, you can store metadata at each term position, so it's an 
excellent place to store attributes of the term - they are very fast to 
load, efficient, etc.

You can check out the spans test classes for a small example using the 
PayloadSpanUtil...it's actually fairly simple and short, and the main 
reason I consider it experimental is that it hasn't really been used too 
much to my knowledge (who knows though). If you have a problem, you'll 
know quickly and I'll fix quickly. It should work fine though. Overall, 
the approach wouldn't take that much code, so I don't think you'd be out 
a lot of time.

The PayloadSpanUtil takes an IndexReader and a query and returns the 
payloads for the terms in the IndexReader that match the query. If you 
end up with multiple docs in the IndexReader, be sure to isolate the 
query down to the exact doc you want the payloads from (the Span scoring 
mode of the highlighter actually puts the doc in a fast MemoryIndex 
which only holds one doc, and uses an IndexReader from the MemoryIndex).
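In pseudocode-level form, the isolation step might look like the sketch below. This is written from memory against the 2.4-era API, so treat every class and method name as an assumption to check against your Lucene version:

```java
// Pseudocode sketch, unverified: re-analyze the single matching page into
// a one-doc MemoryIndex, then ask PayloadSpanUtil for the payloads of the
// terms the query actually matches in that one doc.
MemoryIndex oneDoc = new MemoryIndex();
oneDoc.addField("contents", pageText, payloadAnalyzer);
IndexReader reader = oneDoc.createSearcher().getIndexReader();

PayloadSpanUtil util = new PayloadSpanUtil(reader);
Collection payloads = util.getPayloadsForQuery(query); // Collection of byte[]
for (Iterator it = payloads.iterator(); it.hasNext();) {
    byte[] payload = (byte[]) it.next();
    // decode the word attributes that were stored at indexing time
}
```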

Greg Shackles wrote:
> Hey Mark,
>
> This sounds very interesting.  Is there any documentation or examples I
> could see?  I did a quick search but didn't really find much.  It might just
> be that I don't know how payloads work in Lucene, but I'm not sure how I
> would see this actually doing what I need.  My reasoning is this...you'd
> have an index that stores all the text for a particular page.  Would you be
> able to attach payload information to individual words on that page?  In my
> head it seems like that would be the job of a second index, which is exactly
> why I added the word index.
>
> Any details you can give would be great as I need to keep moving on this
> project quickly.  I will also say that I'm somewhat wary of using an
> experimental class since this is a really important project that really
> won't be able to wait on a lot of development cycles to get the class fully
> working.  That said, if it can give me serious speed improvements it's
> definitely worth considering.
>
> - Greg
>
> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
Hey Mark,

This sounds very interesting.  Is there any documentation or examples I
could see?  I did a quick search but didn't really find much.  It might just
be that I don't know how payloads work in Lucene, but I'm not sure how I
would see this actually doing what I need.  My reasoning is this...you'd
have an index that stores all the text for a particular page.  Would you be
able to attach payload information to individual words on that page?  In my
head it seems like that would be the job of a second index, which is exactly
why I added the word index.

Any details you can give would be great as I need to keep moving on this
project quickly.  I will also say that I'm somewhat wary of using an
experimental class since this is a really important project that really
won't be able to wait on a lot of development cycles to get the class fully
working.  That said, if it can give me serious speed improvements it's
definitely worth considering.

- Greg


On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <ma...@gmail.com> wrote:

> If your new to Lucene, this might be a little much (and maybe I am not
> fully understand the problem), but you might try:
>
> Add the attributes to the words in a payload with a PayloadAnalyzer. Do
> searching as normal. Use the new PayloadSpanUtil class to get the payloads
> for the matching words. (Think of the PayloadSpanUtil as a highlighter - you
> give it a query, it gives you the payloads to the terms that match). The
> PayloadSpanUtil class is a bit experimental, but I'll fix anything you run
> into with it.
>
> - Mark
>
>
> [...]

Re: Lucene implementation/performance question

Posted by Mark Miller <ma...@gmail.com>.
If you're new to Lucene, this might be a little much (and maybe I don't 
fully understand the problem), but you might try:

Add the attributes to the words in a payload with a PayloadAnalyzer. Do 
searching as normal. Use the new PayloadSpanUtil class to get the 
payloads for the matching words. (Think of the PayloadSpanUtil as a 
highlighter - you give it a query, it gives you the payloads to the 
terms that match). The PayloadSpanUtil class is a bit experimental, but 
I'll fix anything you run into with it.
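As a pseudocode-level sketch of the indexing side (2.4-era Token/TokenFilter API written from memory, so the signatures are assumptions), the analyzer's filter could walk the extracted word-attribute list in step with the token stream:

```java
// Pseudocode sketch, unverified: a TokenFilter that attaches each word's
// attribute bytes to its token as a payload. WordAttributes and toBytes()
// are hypothetical application classes, not part of Lucene.
class WordAttributeFilter extends TokenFilter {
    private final Iterator attrs; // one WordAttributes per source word, in page order

    WordAttributeFilter(TokenStream in, Iterator attrs) {
        super(in);
        this.attrs = attrs;
    }

    public Token next(Token reusableToken) throws IOException {
        Token token = input.next(reusableToken);
        if (token != null && attrs.hasNext()) {
            WordAttributes a = (WordAttributes) attrs.next();
            token.setPayload(new Payload(a.toBytes())); // encoded attributes
        }
        return token;
    }
}
```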

- Mark

Greg Shackles wrote:
> Hi Erick,
>
> Thanks for the response, sorry that I was somewhat vague in the reasoning
> for my implementation in the first post.  I should have mentioned that the
> word details are not details of the Lucene document, but are attributes
> about the word that I am storing.  Some examples are position on the actual
> page, color, size, bold/italic/underlined, and most importantly, the text as
> it appeared on the page.  The reason the last one matters is that things
> like punctuation, spacing and capitalization can vary between the result and
> the search term, and can affect how I need to process the results
> afterwords.  I am certainly open to the idea of a new approach if it would
> improve on things, I admit I am new to Lucene so if there are options I'm
> unaware of I'd love to learn about them.
>
> Just to sum it up with an example, let's say we have a page of text that
> stores "This is a page of text."  We want to search for the text "of text",
> which would span multiple words in the word index.  The final result would
> need to contain "of" and "text", along with the details about each as
> described before.  I hope this is more helpful!
>
> - Greg
>
> On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <er...@gmail.com>wrote:
>
>   
>> If I may suggest, could you expand upon what you're trying to
>> accomplish? Why do you care about the detailed information
>> about each word? The reason I'm suggesting this is "the XY
>> problem". That is, people often ask for details about a specific
>> approach when what they really need is a different approach
>>
>> There are TermFrequencies, TermPositions,
>> TermVectorOffsetInfo and a bunch of other stuff that I don't
>> know the details of that may work for you if we had
>> a better idea of what it is you're trying to accomplish...
>>
>> Best
>> Erick
>>
>> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene implementation/performance question

Posted by Greg Shackles <gs...@gmail.com>.
Hi Erick,

Thanks for the response, sorry that I was somewhat vague in the reasoning
for my implementation in the first post.  I should have mentioned that the
word details are not details of the Lucene document, but are attributes
about the word that I am storing.  Some examples are position on the actual
page, color, size, bold/italic/underlined, and most importantly, the text as
it appeared on the page.  The reason the last one matters is that things
like punctuation, spacing and capitalization can vary between the result and
the search term, and can affect how I need to process the results
afterwards.  I am certainly open to the idea of a new approach if it would
improve on things; I admit I am new to Lucene, so if there are options I'm
unaware of I'd love to learn about them.

Just to sum it up with an example, let's say we have a page of text that
stores "This is a page of text."  We want to search for the text "of text",
which would span multiple words in the word index.  The final result would
need to contain "of" and "text", along with the details about each as
described before.  I hope this is more helpful!
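To make that concrete in code, here is a simplified, self-contained sketch of
the scan I'm doing now (the WordInfo class is just a stand-in for the
attributes I store, not my actual classes).  The normalization step is what
lets a stored "text." match a query for "text":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical per-word record: the attributes stored in the "words" index.
class WordInfo {
    final String text;  // text exactly as it appeared on the page
    final int x, y;     // position on the page
    final boolean bold;

    WordInfo(String text, int x, int y, boolean bold) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.bold = bold;
    }

    // Strip punctuation and lowercase, so the stored "text." matches "text".
    String normalized() {
        return text.toLowerCase().replaceAll("\\p{Punct}", "");
    }
}

public class PhraseScan {
    // Return the word records covering the first occurrence of the phrase,
    // or an empty list if the page does not contain it.
    static List<WordInfo> findPhrase(List<WordInfo> pageWords, String[] phrase) {
        for (int start = 0; start + phrase.length <= pageWords.size(); start++) {
            boolean match = true;
            for (int i = 0; i < phrase.length; i++) {
                if (!pageWords.get(start + i).normalized().equals(phrase[i])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return new ArrayList<>(pageWords.subList(start, start + phrase.length));
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        List<WordInfo> page = Arrays.asList(
                new WordInfo("This", 0, 0, false),
                new WordInfo("is", 40, 0, false),
                new WordInfo("a", 60, 0, false),
                new WordInfo("page", 70, 0, false),
                new WordInfo("of", 110, 0, false),
                new WordInfo("text.", 130, 0, true));
        for (WordInfo w : findPhrase(page, new String[]{"of", "text"})) {
            System.out.println(w.text + " @ x=" + w.x);
        }
        // prints:
        // of @ x=110
        // text. @ x=130
    }
}
```

This linear scan runs once per matching page, and it is exactly where my time
is going.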

- Greg

On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <er...@gmail.com> wrote:

> If I may suggest, could you expand upon what you're trying to
> accomplish? Why do you care about the detailed information
> about each word? The reason I'm suggesting this is "the XY
> problem". That is, people often ask for details about a specific
> approach when what they really need is a different approach.
>
> There are TermFrequencies, TermPositions,
> TermVectorOffsetInfo and a bunch of other stuff that I don't
> know the details of that may work for you if we had
> a better idea of what it is you're trying to accomplish...
>
> Best
> Erick
>
> On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gs...@gmail.com>
> wrote:
>
> > I hope this isn't a dumb question or anything, I'm fairly new to Lucene so
> > I've been picking it up as I go pretty much.  Without going into too much
> > detail, I need to store pages of text, and for each word on each page,
> > store detailed information about it.  To do this, I have 2 indexes:
> >
> > 1) pages: this stores the full text of the page, and identifying
> > information about it
> > 2) words: this stores a single word, along with the page it was on and is
> > stored in the order they appear on the page
> >
> > When doing a search, not only do I need to return the page it was found on,
> > but also the details of the matching words.  Since I couldn't think of a
> > better way to do it, I first search the pages index and find any matching
> > pages.  Then I iterate the words on those pages to find where the match
> > occurred.  Obviously this is costly as far as execution time goes, but at
> > least it only has to get done for matching pages rather than every page.
> > Searches still take way longer than I'd like though, and the bottleneck is
> > almost entirely in the code to find the matches on the page.
> >
> > One simple optimization I can think of is to store the pages in smaller
> > blocks so that the scope of the iteration is made smaller.  This is not
> > really ideal, since I also need the ability to narrow down results based
> > on other words that can/can't appear on the same page, which would mean
> > storing 3 full copies of every word on every page (one in each of the 3
> > resulting indexes).
> >
> > I know this isn't a Java performance forum so I'll try to keep this Lucene
> > related, but has anyone done anything similar to this, or have any
> > comments/ideas on how to improve it?  I'm in the process of trying to
> > speed things up since I need to perform many searches often over very
> > large sets of pages.  Thanks!
> >
> > - Greg
> >
>

Re: Lucene implementation/performance question

Posted by Erick Erickson <er...@gmail.com>.
If I may suggest, could you expand upon what you're trying to
accomplish? Why do you care about the detailed information
about each word? The reason I'm suggesting this is "the XY
problem". That is, people often ask for details about a specific
approach when what they really need is a different approach.

There are TermFrequencies, TermPositions,
TermVectorOffsetInfo and a bunch of other stuff that I don't
know the details of that may work for you if we had
a better idea of what it is you're trying to accomplish...
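As a rough illustration of the general idea (plain Java, not the actual
Lucene API): if each term's positions on a page are recorded once up front,
a phrase can be located by intersecting position lists instead of scanning
every word on the page:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionsSketch {
    // Build a term -> positions map for one page (done once, at index time).
    // Tokens are assumed to be already normalized (lowercased, no punctuation).
    static Map<String, List<Integer>> index(String[] tokens) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // Find start positions of the phrase: for each position of the first
    // term, check that each later term appears at the expected offset.
    static List<Integer> phrasePositions(Map<String, List<Integer>> postings,
                                         String[] phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int start : postings.getOrDefault(phrase[0], new ArrayList<>())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length; i++) {
                List<Integer> ps = postings.get(phrase[i]);
                if (ps == null || !ps.contains(start + i)) {
                    ok = false;
                    break;
                }
            }
            if (ok) hits.add(start);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] page = {"this", "is", "a", "page", "of", "text"};
        Map<String, List<Integer>> postings = index(page);
        System.out.println(phrasePositions(postings, new String[]{"of", "text"}));
        // prints: [4]  (the phrase starts at token position 4)
    }
}
```

Lucene already maintains positional information like this for indexed terms,
which is why those classes may save you the manual per-page scan.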

Best
Erick

On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <gs...@gmail.com> wrote:

> I hope this isn't a dumb question or anything, I'm fairly new to Lucene so
> I've been picking it up as I go pretty much.  Without going into too much
> detail, I need to store pages of text, and for each word on each page,
> store
> detailed information about it.  To do this, I have 2 indexes:
>
> 1) pages: this stores the full text of the page, and identifying
> information
> about it
> 2) words: this stores a single word, along with the page it was on and is
> stored in the order they appear on the page
>
> When doing a search, not only do I need to return the page it was found on,
> but also the details of the matching words.  Since I couldn't think of a
> better way to do it, I first search the pages index and find any matching
> pages.  Then I iterate the words on those pages to find where the match
> occurred.  Obviously this is costly as far as execution time goes, but at
> least it only has to get done for matching pages rather than every page.
> Searches still take way longer than I'd like though, and the bottleneck is
> almost entirely in the code to find the matches on the page.
>
> One simple optimization I can think of is to store the pages in smaller blocks
> so that the scope of the iteration is made smaller.  This is not really
> ideal, since I also need the ability to narrow down results based on other
> words that can/can't appear on the same page which would mean storing 3
> full
> copies of every word on every page (one in each of the 3 resulting
> indexes).
>
> I know this isn't a Java performance forum so I'll try to keep this Lucene
> related, but has anyone done anything similar to this, or have any
> comments/ideas on how to improve it?  I'm in the process of trying to speed
> things up since I need to perform many searches often over very large sets
> of pages.  Thanks!
>
> - Greg
>