Posted to general@lucene.apache.org by Johannes Lerch <le...@googlemail.com> on 2010/09/09 10:01:30 UTC

Performance problems on retrieving fields

Hi,

I am working on a search for stacktraces. To do this I implemented my own
Query, Weight and Scorer. I store the exception, method and frames as fields
in the index and can pick relevant documents by matching those fields
against my query stacktrace (using IndexReader.termDocs()). I implemented my
own scoring, which is calculated pairwise for stacktraces (the query trace
and each relevant document). For this scoring I calculate a similarity
between the two traces by comparing the frames that exist in both and also
checking their order, similar to how diff works on text or source code.
My problem is that I need all frames contained in both stacktraces, so I
have to retrieve all frame fields of the stored stacktraces. For now I do
this with:
Document document = reader.document(doc, new FieldSelector() {
    @Override
    public FieldSelectorResult accept(String fieldName) {
        if (Indexer.FIELD_FRAMES.equals(fieldName))
            return FieldSelectorResult.LAZY_LOAD;
        else
            return FieldSelectorResult.NO_LOAD;
    }
});
Fieldable[] fieldables = document.getFieldables(Indexer.FIELD_FRAMES);

But this call slows the query down to a level that is not acceptable for me
(more than 10 times slower with 100000 stacktraces in the index). So my
question is: are there other ways to get stored fields, or do you have
ideas for workarounds? Would it be better to store all stacktraces in a
database and retrieve them from there? If so, how do I get the docId of the
stacktraces I wrote to the index?
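
For readers following along, the pairwise, order-aware comparison described above can be sketched roughly as follows. The post does not show the actual scoring code, so this is only an assumption: it uses the longest common subsequence of frames (the same idea diff is built on) and a made-up class name, TraceSimilarity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene and not the poster's code:
// order-aware similarity of two stack traces, each given as a list of
// "class.method" frame strings.
class TraceSimilarity {

    // Length of the longest common subsequence of frames (the core of diff).
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Normalised to [0, 1]: 1.0 for identical traces, 0.0 if no frame is shared.
    static double similarity(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        return 2.0 * lcsLength(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("A.run", "B.call", "C.read");
        List<String> stored = Arrays.asList("A.run", "X.wrap", "C.read");
        System.out.println(similarity(query, stored));
    }
}
```

Because the comparison needs every frame of both traces in order, it is the step that forces the stored frame fields to be loaded for each candidate document.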

Regards,
Johannes

Re: Performance problems on retrieving fields

Posted by Ted Dunning <te...@gmail.com>.
Can you define an approximate score that will give you a small
candidate set that you can then score in detail?
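
A minimal sketch of that two-stage idea, under stated assumptions: cheapScore is a hypothetical stand-in for any score that can be computed without loading stored fields (for example, a count of shared frames gathered from postings), and TwoStageScoring is an illustrative name, not a Lucene class. Only the k best candidates would then go through the detailed, field-loading comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntToDoubleFunction;

// Hypothetical helper: keep only the k documents with the best cheap score.
class TwoStageScoring {

    static List<Integer> topCandidates(int[] docIds, IntToDoubleFunction cheapScore, int k) {
        final double[] scores = new double[docIds.length];
        for (int i = 0; i < docIds.length; i++) {
            scores[i] = cheapScore.applyAsDouble(docIds[i]);
        }
        // Min-heap over positions in docIds, ordered by cheap score, so the
        // weakest candidate is evicted whenever the heap grows beyond k.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> scores[i]));
        for (int i = 0; i < docIds.length; i++) {
            heap.add(i);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int i : heap) {
            result.add(docIds[i]);
        }
        Collections.sort(result); // ascending doc IDs, for a stable order
        return result;
    }

    public static void main(String[] args) {
        int[] docIds = {10, 20, 30, 40};
        // Toy cheap score: pretend docs 20 and 40 share the most frames.
        List<Integer> best = topCandidates(docIds,
                id -> id == 20 ? 0.9 : (id == 40 ? 0.8 : 0.1), 2);
        System.out.println(best);
    }
}
```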

Likewise, can you restate your scoring algorithm in terms of stack frame
pairs? N-grams are often a very good surrogate for edit distance
scores such as the one you are trying to build.
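
One way to read the frame-pair suggestion, as a sketch only: treat each pair of adjacent frames as a bigram and compare traces by how many bigrams they share. The class name FramePairs, the " -> " separator and the choice of the Dice coefficient are all assumptions made for illustration; the thread does not specify them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: approximate an order-aware trace comparison by the
// overlap of adjacent frame pairs (bigrams).
class FramePairs {

    // Adjacent frame pairs of a trace, e.g. [a, b, c] -> {"a -> b", "b -> c"}.
    static Set<String> pairs(List<String> frames) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < frames.size(); i++) {
            result.add(frames.get(i) + " -> " + frames.get(i + 1));
        }
        return result;
    }

    // Dice coefficient of two pair sets: 1.0 if identical, 0.0 if disjoint.
    static double dice(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("A.run", "B.call", "C.read");
        List<String> stored = Arrays.asList("A.run", "B.call", "D.write");
        System.out.println(dice(pairs(query), pairs(stored)));
    }
}
```

If each pair were indexed as its own term (an assumption about the index layout, not something the thread confirms), the overlap could be counted from postings alone, without loading stored fields.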


Re: Performance problems on retrieving fields

Posted by Johannes Lerch <le...@googlemail.com>.
As my tests show, about 1/4 of the documents are relevant for scoring per
query. So for my example with 100000 stacktraces in the index I need to score
25000 documents. I have a native implementation of the scoring algorithm which
scores all 100000; that needs about 20ms. The Lucene implementation needs
more than 100ms for the same query, which is far too slow. Without retrieving
fields it needs about 6ms, which is also what my target should be.

I tried without LAZY_LOAD, but there is no real difference. How can I sort
by docIds first?

FieldCache.DEFAULT.getStrings is not an option because of the memory
problem.
This is how I store frames:
for (StacktraceFrame frame : stacktrace.getFrames()) {
    doc.add(new Field(FIELD_FRAMES,
            frame.getClassName() + "." + frame.getMethod(),
            Store.YES, Index.NOT_ANALYZED));
}




Re: Performance problems on retrieving fields

Posted by Michael McCandless <lu...@mikemccandless.com>.
What a neat search engine!  (Searching stack traces).

Unfortunately, loading stored fields is slowish -- it entails 2 disk
seeks under the hood.  Really you should retrieve at most a page's worth
of docs in the serial path of a query.  How many are you retrieving
per query?

That said, you shouldn't use LAZY_LOAD if you know you will need the
value.  Also, it's possible that sorting the docIDs (ascending) first
may get you better performance since your load is then a single scan
of the 2 files in the index.
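
A rough sketch of the sort-first idea, assuming the matching doc IDs have already been collected into an int[] during the query. SortedDocLoad and the visitor callback are illustrative names; in real code the callback would be the stored-field lookup, e.g. the reader.document(...) call from the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical helper: visit matching documents in ascending doc ID order so
// that the stored-field reads move forward through the files.
class SortedDocLoad {

    static void forEachInDocIdOrder(int[] matchingDocIds, IntConsumer visitor) {
        int[] sorted = matchingDocIds.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        for (int docId : sorted) {
            visitor.accept(docId); // stand-in for loading the stored frames
        }
    }

    public static void main(String[] args) {
        final List<Integer> visited = new ArrayList<>();
        forEachInDocIdOrder(new int[] {42, 3, 17}, id -> visited.add(id));
        System.out.println(visited);
    }
}
```

Whether this helps in practice depends on how much of the index is already in the OS cache, so it is worth measuring rather than assuming.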

You may want to use FieldCache.DEFAULT.getStrings instead -- this
gives you a very fast String[], but may suck up tons of memory
depending on how many unique frames there are (how do you index each
frame?).

Mike
