Posted to java-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/03/18 18:16:14 UTC

contrib/benchmark questions

I'm using contrib/benchmark to do some tests for my ApacheCon talk  
and have some questions.

1. In looking at micro-standard.alg, it seems like not all braces are  
closed.  Is a line ending a separator too?
2. Is there any way to dump out what params are supported by the  
various tasks?  I am esp. uncertain on the Search related tasks.
3. Is there any way to dump out the stats as a CSV file or something?   
Would I implement a Task for this?  Ultimately, I want to be able to  
create a graph in Excel that shows tradeoffs between speed and memory.
4. Is there a way to set how many tabs occur between columns in the  
final report?  The merge and buffer factors get hard to read for  
larger values.
5. Below is my "alg" file, any tips?  What I am trying to do is show  
the tradeoffs of merge factor and max buffered and how it relates to  
memory and indexing time.  I want to process all the documents in the  
Reuters benchmark collection, not the 2000 in the micro-standard.  I  
don't want any pauses and for now I am happy doing things in serial.   
I think it is doing what I want, but am not 100% certain.

-----------  alg file --------

#last value is more than all the docs in reuters
merge.factor=mrg:10:100:1000:5000:10:10:10:10:100:1000
max.buffered=buf:10:10:10:10:100:1000:10000:21580:21580:21580
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# tasks at this depth or less print when they start
task.max.depth.log=2

log.queries=true
#  
------------------------------------------------------------------------ 
-------------

{ "Rounds"

     ResetSystemErase

     { "Populate"
         CreateIndex
         { "MAddDocs" AddDoc > : 22000
         Optimize
         CloseIndex
     }

     OpenReader
     { "SearchSameRdr" Search > : 5000
     CloseReader

     { "WarmNewRdr" Warm > : 50

     { "SrchNewRdr" Search > : 500

     { "SrchTrvNewRdr" SearchTrav > : 300

     { "SrchTrvRetNewRdr" SearchTravRet > : 100

     NewRound

} : 10

RepSumByName
RepSumByPrefRound MAddDocs


Thanks,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: contrib/benchmark questions

Posted by Doron Cohen <DO...@il.ibm.com>.
Hi Grant, I think you resolved the question already, but just to
make sure...

Grant Ingersoll <gs...@apache.org> wrote on 22/03/2007 20:41:27:

>
> On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:
>
> > I think I see in the ReadTask that it is the res var that is being
> > incremented and would have to be altered.  I guess I can go by
> > elapsed time, but even that seems slightly askew.  I think this is
> > due to the withRetrieve() function overhead inside the for loop.  I
> > have moved it out and will submit that change, too.
> >
>
> Moving it out of the loop made little diff. so I guess it is mostly
> just due to it being late and me being tired and not thinking
> clearly.  B/c if I were, I would just realize that those operations
> are also retrieving documents...

Seems the cause for confusion is that #recs means different things for
different tasks. For all tasks, it means (at least) the number of times
that task executed. For warm, it adds one for each document retrieved.
For traverse, it adds one for each doc id traversed, and for
traverseAndRetrieve, it also adds one for each doc being retrieved.
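Under that rule, the larger recsPerRun values in Grant's report are
consistent. A quick check (the total-hits figure is my inference from
the reported numbers, roughly 84.5 hits per query):

```python
# Sanity-check of the #recs rule against the numbers in Grant's report.
# Assumption (mine): the 5000 queries traverse 422,500 doc ids in total.
runs = 5000          # every task execution counts one rec
traversed = 422500   # traverse adds one rec per doc id visited

# SearchTrav: executions + traversed doc ids
assert runs + traversed == 427500              # SrchTrvSameRdr_5000

# SearchTravRet: as above, plus one rec per retrieved document
retrieved = traversed
assert runs + traversed + retrieved == 850000  # SrchTrvRetLoadAllSameRdr_5000
```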

I'll update the javadocs with this clarification.

Moving the call out of the loop is the right thing of course; it changed
only the time, not the #recs, right?

Regards,
Doron




Re: contrib/benchmark questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:

> I think I see in the ReadTask that it is the res var that is being  
> incremented and would have to be altered.  I guess I can go by  
> elapsed time, but even that seems slightly askew.  I think this is  
> due to the withRetrieve() function overhead inside the for loop.  I  
> have moved it out and will submit that change, too.
>

Moving it out of the loop made little difference, so I guess it is mostly  
just due to it being late and me being tired and not thinking  
clearly.  Because if I were, I would realize that those operations  
are also retrieving documents...

-Grant



Re: contrib/benchmark questions

Posted by Grant Ingersoll <gr...@gmail.com>.
OK, Doron (and other benchmarkers!), on to search:

Here's my alg file:

#Indexing declaration up here

OpenReader
     { "SrchSameRdr" Search > : 5000

     { "SrchTrvSameRdr" SearchTrav > : 5000
     { "SrchTrvSameRdrTopTen" SearchTrav(10) > : 5000
     { "SrchTrvRetLoadAllSameRdr" SearchTravRet > : 5000

#Skip bytes and body
     { "SrchTrvRetLoadSomeSameRdr" SearchTravRetLoadFieldSelector(docid,docname,docdate,doctitle) > : 5000
     CloseReader


Never mind the last task; I will be submitting a patch shortly that  
will make sense out of it.  Essentially, it specifies which fields to  
load for each retrieved document.

Here are the results:

    Operation                        round  merge  max.buffered  runCnt  recsPerRun     rec/s  elapsedSec  avgUsedMem  avgTotalMem
    OpenReader                           0     10            10       1           1     125.0        0.01   5,385,600    9,965,568
    SrchSameRdr_5000                     0     10            10       1        5000   1,184.3        4.22   5,805,120    9,965,568
    SrchTrvSameRdr_5000                  0     10            10       1      427500  71,776.4        5.96   5,806,144    9,965,568
    SrchTrvSameRdrTopTen_5000            0     10            10       1      427500  62,001.4        6.89   5,766,584    9,965,568
    SrchTrvRetLoadAllSameRdr_5000        0     10            10       1      850000   7,226.4      117.62   6,161,728    9,965,568
    SrchTrvRetLoadSomeSameRdr_5000       0     10            10       1      850000  10,334.0       82.25   6,162,752    9,965,568
    CloseReader                          0     10            10       1           1   1,000.0        0.00   5,921,856    9,965,568

The column I'm a bit confused by is recsPerRun.
For the tasks that are doing the traversal and the retrieval, why so  
many recsPerRun?  Is it counting the hits, the traversals and the  
retrievals each as one record?

What I am trying to do is compare:
Search
Search plus traversal of all hits
Search plus traversal of top ten
Search plus traversal and retrieval of all documents and all fields  
on the document
Search plus traversal and retrieval of all documents and some fields  
on the document

I think I see in the ReadTask that it is the res var that is being  
incremented and would have to be altered.  I guess I can go by  
elapsed time, but even that seems slightly askew.  I think this is  
due to the withRetrieve() function overhead inside the for loop.  I  
have moved it out and will submit that change, too.

Am I interpreting this correctly?

-Grant

On Mar 19, 2007, at 5:11 PM, Doron Cohen wrote:

> Grant Ingersoll <gs...@apache.org> wrote on 19/03/2007 13:10:16:
>
>> So, if I am understanding correctly:
>>
>>>> "SearchSameRdr" Search > : 5000
>>
>> means don't collect indiv. stats fur SearchSameRdr, but do whatever
>> that task does 5000 times, right?
>
> Almost...
>
> It should be btw
>    { "SearchSameRdr" Search > : 5000
> and it means: run Search 5000 times sequentially, assign the
> name "SearchSameRdr" to that sequence of 5000, and do not collect
> individual stats for the individual tasks making up that sequence.
>
> If it was just
>   { Search > : 5000
> it would still mean the same, just that a name was assigned to this  
> for
> you, something like: "Seq_Search_5000".
>
> If it was:
>    { "SearchSameRdr" Search } : 5000
> it would be the same as your example, just that stats would be  
> collected not
> only for the entire elapsed sequence, but also breaking it down for  
> each of
> the 5000 calls to "Search".
>
> Similar logic with
>   [ .. ]
> and
>   [ .. >
> just that the tasks making the (parallel) sequence are executed in
> parallel, each in a separate thread.
>
>>
>>>
>>>> 3. Is there anyway to dump out the stats as a CSV file or  
>>>> something?
>>>> Would I implement a Task for this?  Ultimately, I want to be  
>>>> able to
>>>> create a graph in Excel that shows tradeoffs between speed and
>>>> memory.
>>>
>>> Yes, implementing a report task would be the way.
>>> ... but when I look at how I implemented these reports, all the
>>> work is
>>> done in the class Points. Seems it should be modified a little with
>>> more
>>> thought of making it easiert to extend reports.
>>
>> I may take a crack at it, but deadline for the talk is looming
>
> I'll take a look too, let you know if I have anything.
>
>>> - Being intetested in memory stats - the thing that all the rounds
>>> run in a
>>> single program, same JVM run, usually means what you see is very  
>>> much
>>> dependent in the GC behavior of the specific VM you are using. If
>>> it does
>>> not release memory (most likely) to the OS you would not be able to
>>> notice
>>> that round i+1 used less memory than round i. It would probably
>>> better for
>>> something like this to put the "round" logic in an ant script,
>>> invoking
>>> each round in a separate new exec. But then things get more
>>> complicated for
>>> having a final stats report containing all rounds. What do you
>>> think about
>>> this?
>>
>> Good to know.  Perhaps a GarbageCollectionTask is needed?
>
> ResetSystemSoft and ResetSystemErase both call GC;
> Is this sufficient, task wise?
> The concern is that this is not enough gc/mem wise, because the JVM  
> already
> has some memory, that the OS is not going to reclaim.
>
>> So, I should wrap those task in an OpenReader/CloseReader?
>
> Yes, if you want the same reader object to be used by all these.
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





Re: contrib/benchmark questions

Posted by Doron Cohen <DO...@il.ibm.com>.
Grant Ingersoll <gs...@apache.org> wrote on 19/03/2007 13:10:16:

> So, if I am understanding correctly:
>
> >> "SearchSameRdr" Search > : 5000
>
> means don't collect indiv. stats fur SearchSameRdr, but do whatever
> that task does 5000 times, right?

Almost...

It should be, btw,
   { "SearchSameRdr" Search > : 5000
and it means: run Search 5000 times sequentially, assign the
name "SearchSameRdr" to that sequence of 5000, and do not collect
individual stats for the individual tasks making up that sequence.

If it was just
  { Search > : 5000
it would still mean the same, just that a name was assigned to this for
you, something like: "Seq_Search_5000".

If it was:
   { "SearchSameRdr" Search } : 5000
it would be the same as your example, just that stats would be collected not
only for the entire elapsed sequence, but also broken down for each of
the 5000 calls to "Search".

Similar logic with
  [ .. ]
and
  [ .. >
just that the tasks making the (parallel) sequence are executed in
parallel, each in a separate thread.
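
For reference, here are the four open/close combinations side by side, as
I read the rules above (the annotations to the right are mine, for reading
only - they are not valid alg syntax):

```
{ "SrchSeq" Search } : 5000     serial; stats for the sequence and each Search
{ "SrchSeq" Search > : 5000     serial; stats only for the sequence as a whole
[ "SrchPar" Search ] : 4        parallel, one thread per task; per-task stats kept
[ "SrchPar" Search > : 4        parallel; only the combined stats reported
```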

>
> >
> >> 3. Is there anyway to dump out the stats as a CSV file or something?
> >> Would I implement a Task for this?  Ultimately, I want to be able to
> >> create a graph in Excel that shows tradeoffs between speed and
> >> memory.
> >
> > Yes, implementing a report task would be the way.
> > ... but when I look at how I implemented these reports, all the
> > work is
> > done in the class Points. Seems it should be modified a little with
> > more
> > thought of making it easiert to extend reports.
>
> I may take a crack at it, but deadline for the talk is looming

I'll take a look too, let you know if I have anything.

> > - Being intetested in memory stats - the thing that all the rounds
> > run in a
> > single program, same JVM run, usually means what you see is very much
> > dependent in the GC behavior of the specific VM you are using. If
> > it does
> > not release memory (most likely) to the OS you would not be able to
> > notice
> > that round i+1 used less memory than round i. It would probably
> > better for
> > something like this to put the "round" logic in an ant script,
> > invoking
> > each round in a separate new exec. But then things get more
> > complicated for
> > having a final stats report containing all rounds. What do you
> > think about
> > this?
>
> Good to know.  Perhaps a GarbageCollectionTask is needed?

ResetSystemSoft and ResetSystemErase both call GC;
is this sufficient, task-wise?
The concern is that this is not enough gc/mem-wise, because the JVM already
holds some memory that the OS is not going to reclaim.

> So, I should wrap those task in an OpenReader/CloseReader?

Yes, if you want the same reader object to be used by all these.
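
For example, the last four tasks of the alg from the first message could
share one reader like this (a sketch; the "SameRdr" names are mine, since
the tasks no longer open their own readers):

```
OpenReader
    { "WarmSameRdr" Warm > : 50
    { "SrchSameRdr" Search > : 500
    { "SrchTrvSameRdr" SearchTrav > : 300
    { "SrchTrvRetSameRdr" SearchTravRet > : 100
CloseReader
```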





Re: contrib/benchmark questions

Posted by Grant Ingersoll <gs...@apache.org>.
Thanks for the reply, Doron.  I knew this email was targeted for you,  
but thought it would be good to add to the user record.

On Mar 19, 2007, at 2:30 PM, Doron Cohen wrote:

> Grant Ingersoll <gs...@apache.org> wrote on 18/03/2007 10:16:14:
>
>> I'm using contrib/benchmark to do some tests for my ApacheCon talk
>> and have some questions.
>>
>> 1. In looking at micro-standard.alg, it seems like not all braces are
>> closed.  Is a line ending a separator too?
>
> '>' can replace as a closing character (alternatively) either '}'  
> or ']'
> with the semantics: "do not collect/report separate statistics for the
> contained tasks. See "Statistic recording elimination" in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/ 
> byTask/package-summary.html

So, if I am understanding correctly:

>> "SearchSameRdr" Search > : 5000

means don't collect indiv. stats for SearchSameRdr, but do whatever  
that task does 5000 times, right?


>
>> 2. Is there anyway to dump out what params are supported by the
>> various tasks?  I am esp. uncertain on the Search related tasks.
>
> Search related tasks do not take args. Perhaps the task should  
> throw an
> exception if a params is set but not supported. I think I'll add that.
> Currently only AdDoc, DeleteDoc and SetProp take args. The section  
> "Command
> parameter" in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/ 
> byTask/package-summary.html
>  which describes this is incomplete - I will fix it to reflect that.
>
> Which query arguments do you have in mind?

Never mind, I was confused by the : XXXX parameters after the >

>
>> 3. Is there anyway to dump out the stats as a CSV file or something?
>> Would I implement a Task for this?  Ultimately, I want to be able to
>> create a graph in Excel that shows tradeoffs between speed and  
>> memory.
>
> Yes, implementing a report task would be the way.
> ... but when I look at how I implemented these reports, all the  
> work is
> done in the class Points. Seems it should be modified a little with  
> more
> thought of making it easiert to extend reports.

I may take a crack at it, but deadline for the talk is looming


>
>> 4. Is there a way to set how many tabs occur between columns in the
>> final report?  They merge and buffer factors get hard to read for
>> larger values.
>
> There's no general tabbing control, can be added if required, - but  
> for the
> automatically added columns this is not requireed - just modify the  
> name of
> the column and it would fit, e.g. use "merge:10:100" to get a 5  
> charactres
> column, or "merging:10:100" for 7, etc. (Also see "Index work  
> parameters"
> under "Benchmark properties" in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/ 
> byTask/package-summary.html
>
>> 5. Below is my "alg" file, any tips?  What I am trying to do is show
>> the tradeoffs of merge factor and max buffered and how it relates to
>> memory and indexing time.  I want to process all the documents in the
>> Reuters benchmark collection, not the 2000 in the micro-standard.  I
>> don't want any pauses and for now I am happy doing things in serial.
>> I think it is doing what I want, but am not 100% certain.
>>
>
> Yes, it seems correct to me. What I usually do to verify a new alg  
> is to
> run it first with very small numbers - e.g. 10 instead of 22000,  
> etc., and
> examine the log. Few comments:
> - you can specify a larger number than 22000 and the Docmaker will  
> iterate
> and created new docs from same input again.
> - Being intetested in memory stats - the thing that all the rounds  
> run in a
> single program, same JVM run, usually means what you see is very much
> dependent in the GC behavior of the specific VM you are using. If  
> it does
> not release memory (most likely) to the OS you would not be able to  
> notice
> that round i+1 used less memory than round i. It would probably  
> better for
> something like this to put the "round" logic in an ant script,  
> invoking
> each round in a separate new exec. But then things get more  
> complicated for
> having a final stats report containing all rounds. What do you  
> think about
> this?

Good to know.  Perhaps a GarbageCollectionTask is needed?


> - Seems you are only inrerested in the indexing performance, so you  
> can
> remove (or comment out) the search part.
> - If you are intrerested also in the search part, note that as  
> written, the
> four last search related tasks always use a new reader (opening/ 
> closing 950
> readers in this test).

OK, search is the second part, just focused on indexing first.   
Trying to address common questions/issues people have with  
performance in these two areas.

So, I should wrap those task in an OpenReader/CloseReader?

We may also want to consider making this an XML-based  
configuration...

Thanks for your help.  I will probably have a few more questions over  
the next few days.

-Grant



Re: contrib/benchmark questions

Posted by Doron Cohen <DO...@il.ibm.com>.
Grant Ingersoll <gs...@apache.org> wrote on 18/03/2007 10:16:14:

> I'm using contrib/benchmark to do some tests for my ApacheCon talk
> and have some questions.
>
> 1. In looking at micro-standard.alg, it seems like not all braces are
> closed.  Is a line ending a separator too?

'>' can replace (alternatively) either '}' or ']' as the closing character,
with the semantics: "do not collect/report separate statistics for the
contained tasks." See "Statistic recording elimination" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html

> 2. Is there anyway to dump out what params are supported by the
> various tasks?  I am esp. uncertain on the Search related tasks.

Search related tasks do not take args. Perhaps the task should throw an
exception if a param is set but not supported. I think I'll add that.
Currently only AddDoc, DeleteDoc and SetProp take args. The section "Command
parameter" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html
which describes this is incomplete - I will fix it to reflect that.

Which query arguments do you have in mind?

> 3. Is there anyway to dump out the stats as a CSV file or something?
> Would I implement a Task for this?  Ultimately, I want to be able to
> create a graph in Excel that shows tradeoffs between speed and memory.

Yes, implementing a report task would be the way.
... but when I look at how I implemented these reports, all the work is
done in the class Points. Seems it should be modified a little with more
thought to making it easier to extend reports.

> 4. Is there a way to set how many tabs occur between columns in the
> final report?  They merge and buffer factors get hard to read for
> larger values.

There's no general tabbing control - it can be added if required - but for the
automatically added columns it is not required: just modify the name of
the column and it would fit, e.g. use "merge:10:100" to get a 5 character
column, or "merging:10:100" for 7, etc. (Also see "Index work parameters"
under "Benchmark properties" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html)

> 5. Below is my "alg" file, any tips?  What I am trying to do is show
> the tradeoffs of merge factor and max buffered and how it relates to
> memory and indexing time.  I want to process all the documents in the
> Reuters benchmark collection, not the 2000 in the micro-standard.  I
> don't want any pauses and for now I am happy doing things in serial.
> I think it is doing what I want, but am not 100% certain.
>

Yes, it seems correct to me. What I usually do to verify a new alg is to
run it first with very small numbers - e.g. 10 instead of 22000, etc. - and
examine the log. A few comments:
- you can specify a larger number than 22000 and the DocMaker will iterate
and create new docs from the same input again.
- Being interested in memory stats - the fact that all the rounds run in a
single program, in the same JVM run, usually means what you see is very much
dependent on the GC behavior of the specific VM you are using. If it does
not release memory to the OS (most likely) you would not be able to notice
that round i+1 used less memory than round i. It would probably be better for
something like this to put the "round" logic in an ant script, invoking
each round in a separate new exec. But then things get more complicated for
having a final stats report containing all rounds. What do you think about
this?
- Seems you are only interested in the indexing performance, so you can
remove (or comment out) the search part.
- If you are interested also in the search part, note that as written, the
four last search related tasks always use a new reader (opening/closing 950
readers in this test).
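
The ant idea could be sketched roughly like this (the target layout and the
per-round alg file names are assumptions of mine; one forked JVM per round
so each round's memory measurements start from a fresh heap):

```
<!-- sketch: run each benchmark round in its own forked JVM -->
<target name="bench-rounds">
  <java classname="org.apache.lucene.benchmark.byTask.Benchmark" fork="true">
    <classpath refid="benchmark.classpath"/>
    <arg value="round1.alg"/>
  </java>
  <java classname="org.apache.lucene.benchmark.byTask.Benchmark" fork="true">
    <classpath refid="benchmark.classpath"/>
    <arg value="round2.alg"/>
  </java>
  <!-- ... one <java> per round; aggregate the per-round reports afterwards -->
</target>
```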


> -----------  alg file --------
>
> #last value is more than all the docs in reuters
> merge.factor=mrg:10:100:1000:5000:10:10:10:10:100:1000
> max.buffered=buf:10:10:10:10:100:1000:10000:21580:21580:21580
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
> #directory=RamDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> doc.add.log.step=1000
>
> docs.dir=reuters-out
> #docs.dir=reuters-111
>
> #doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>
> #query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=true
> #
> ------------------------------------------------------------------------
> -------------
>
> { "Rounds"
>
>      ResetSystemErase
>
>      { "Populate"
>          CreateIndex
>          { "MAddDocs" AddDoc > : 22000
>          Optimize
>          CloseIndex
>      }
>
>      OpenReader
>      { "SearchSameRdr" Search > : 5000
>      CloseReader
>
>      { "WarmNewRdr" Warm > : 50
>
>      { "SrchNewRdr" Search > : 500
>
>      { "SrchTrvNewRdr" SearchTrav > : 300
>
>      { "SrchTrvRetNewRdr" SearchTravRet > : 100
>
>      NewRound
>
> } : 10
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
>
> Thanks,
> Grant
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

