Posted to dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2009/02/10 19:49:00 UTC

[jira] Created: (LUCENE-1539) Improve Benchmark

Improve Benchmark
-----------------

                 Key: LUCENE-1539
                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/benchmark
    Affects Versions: 2.4
            Reporter: Jason Rutherglen
            Priority: Minor
             Fix For: 2.9


Benchmark can be improved by incorporating recent suggestions posted
on java-dev. M. McCandless' Python scripts that execute multiple
rounds of tests can either be incorporated into the codebase or
converted to Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721705#action_12721705 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Where are we assuming/requiring the path be relative?



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697117#action_12697117 ] 

Shai Erera commented on LUCENE-1539:
------------------------------------

bq. Can you open a new issue?

Will do.



[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

Fixed the above-mentioned problems.  When LUCENE-1516 is in, should we add the near-realtime benchmarks here?



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718578#action_12718578 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Right, I think deleteDocsByPercent should 1) determine how many docs to delete (deletePct * reader.numDocs()), and then 2) randomly select docs to delete, counting how many were actually deleted, and stopping when it reaches the target.  To avoid this taking excessively long when too many deletions are requested, you should probably invert the selection if the percentage is > 50?  I.e., choose instead the docs NOT to delete, and then make a linear sweep to delete any docs not chosen?
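
A minimal Java sketch of the selection scheme described above, assuming doc IDs run from 0 to maxDoc - 1 and that no docs are deleted beforehand. It works on a plain java.util.BitSet rather than an IndexReader, and the names PercentDeleteSketch and chooseDocsToDelete are hypothetical, not taken from the patch:

```java
import java.util.BitSet;
import java.util.Random;

/** Sketch only: pick which doc IDs to delete so that a target fraction is reached. */
final class PercentDeleteSketch {

  /**
   * Returns a BitSet with one set bit per doc ID chosen for deletion.
   * maxDoc and deletePct (a fraction in [0, 1]) are illustrative parameters.
   */
  static BitSet chooseDocsToDelete(int maxDoc, double deletePct, Random random) {
    if (maxDoc < 0 || deletePct < 0.0 || deletePct > 1.0) {
      throw new IllegalArgumentException("maxDoc must be >= 0 and deletePct in [0, 1]");
    }
    int target = (int) Math.round(deletePct * maxDoc);
    boolean invert = deletePct > 0.5;
    // When inverting, randomly pick the docs to KEEP instead of the docs to delete,
    // so the random loop never has to pick more than about half of the docs.
    int toPick = invert ? maxDoc - target : target;

    BitSet picked = new BitSet(maxDoc);
    int count = 0;
    while (count < toPick) {
      int doc = random.nextInt(maxDoc);
      if (!picked.get(doc)) { // count a doc only the first time it is chosen
        picked.set(doc);
        count++;
      }
    }
    if (invert) {
      // Linear sweep: every doc that was not picked to keep is marked for deletion.
      picked.flip(0, maxDoc);
    }
    return picked;
  }

  public static void main(String[] args) {
    BitSet toDelete = chooseDocsToDelete(1000, 0.8, new Random(42));
    System.out.println("docs marked for deletion: " + toDelete.cardinality());
  }
}
```

A real task would presumably apply the chosen IDs through the reader and would have to account for docs that are already deleted; the sketch leaves both out.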



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722738#action_12722738 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

Took a look at Lucene in Action at Borders and learned the -Dproperty passed in overrides what's in the build.xml.  



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721586#action_12721586 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

I think it would be convenient to allow passing in the data files' absolute path, instead of assuming they're in a relative path.  
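
A small Java sketch of the behaviour being asked for, assuming a configured file name and a working directory. The thread does not establish how contrib/benchmark currently resolves paths, so DataPathSketch and resolveDataFile are hypothetical names used only for illustration:

```java
import java.io.File;

/** Sketch only: accept either an absolute or a relative data file path. */
final class DataPathSketch {

  /**
   * Returns the configured path unchanged when it is absolute,
   * otherwise resolves it against workDir (an assumed base directory).
   */
  static File resolveDataFile(File workDir, String configured) {
    File file = new File(configured);
    return file.isAbsolute() ? file : new File(workDir, configured);
  }

  public static void main(String[] args) {
    File workDir = new File("work");
    System.out.println(resolveDataFile(workDir, "enwiki.txt"));
    System.out.println(resolveDataFile(workDir, new File("enwiki.txt").getAbsolutePath()));
  }
}
```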



[jira] Issue Comment Edited: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674075#action_12674075 ] 

jasonrutherglen edited comment on LUCENE-1539 at 2/16/09 2:36 PM:
-------------------------------------------------------------------

bq. couldn't we create a .alg that makes multiple copies of a Wikipedia index w/ different pctg deletes, instead of a static main java tool? 

We'll need a new DeletesTask that deletes based on a percentage?  

bq. use multiple commits in the same index

This sounds good.  

      was (Author: jasonrutherglen):
    bq. couldn't we create a .alg that makes multiple copies of a Wikipedia index w/ different pctg deletes, instead of a static main java tool? 

We'll need a new DeletesTask that deletes based a on a percentage?  

bq. use multiple commits in the same index

This sounds good.  
  


[jira] Assigned: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1539:
------------------------------------------

    Assignee: Michael McCandless



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674075#action_12674075 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

bq. couldn't we create a .alg that makes multiple copies of a Wikipedia index w/ different pctg deletes, instead of a static main java tool? 

We'll need a new DeletesTask that deletes based on a percentage?  

bq. use multiple commits in the same index

This sounds good.  



[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

Implemented the changes.  Wasn't sure how to floor it.  



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718575#action_12718575 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

It would be good to get this done. We need the deletes to be random, or maybe to delete only docs that aren't already deleted?  (I.e., the loop tries to delete at a position; if that doc is already deleted, it tries the next spot, etc.)
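
The "try the next spot" idea can be sketched in Java as follows, assuming deletions are tracked in a java.util.BitSet with no bits set at or beyond maxDoc. ProbeDeleteSketch and deleteOneLiveDoc are hypothetical names, and a real task would go through the reader instead:

```java
import java.util.BitSet;
import java.util.Random;

/** Sketch only: delete one live doc, probing forward from a random position. */
final class ProbeDeleteSketch {

  /**
   * Marks one not-yet-deleted doc as deleted and returns its ID,
   * or returns -1 when every doc in [0, maxDoc) is already deleted.
   */
  static int deleteOneLiveDoc(BitSet deleted, int maxDoc, Random random) {
    if (maxDoc <= 0 || deleted.cardinality() >= maxDoc) {
      return -1; // nothing left to delete
    }
    int start = random.nextInt(maxDoc);
    // First live doc at or after the random start position.
    int doc = deleted.nextClearBit(start);
    if (doc >= maxDoc) {
      // Ran off the end: wrap around and probe from the beginning.
      doc = deleted.nextClearBit(0);
    }
    deleted.set(doc);
    return doc;
  }

  public static void main(String[] args) {
    BitSet deleted = new BitSet(10);
    deleted.set(0, 10);
    deleted.clear(3); // only doc 3 is still live
    System.out.println(deleteOneLiveDoc(deleted, 10, new Random()));
  }
}
```

Probing forward is not uniformly random: a live doc that follows a long run of deleted docs is more likely to be hit than one that does not.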



[jira] Issue Comment Edited: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722738#action_12722738 ] 

Jason Rutherglen edited comment on LUCENE-1539 at 6/22/09 11:17 AM:
--------------------------------------------------------------------

Took a look at ANT in Action at Borders and learned the -Dproperty passed in overrides what's in the build.xml.  

      was (Author: jasonrutherglen):
    Took a look at Lucene in Action at Borders and learned the -Dproperty passed in overrides what's in the build.xml.  
  


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: sortCollate2.py
                sortBench2.py

Python scripts attached.  



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718957#action_12718957 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

The only small thing that came to mind is if the user decides to
subsequently (in the .alg) delete a smaller percentage of docs
than what is already deleted in the reader. Does that mean we should
undelete docs?



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675336#action_12675336 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

bq. In looking over the code, to do the multiple commits using IR we'll need to add an IR.flush(String userData) method?

Yes, we should.  Can you open a new issue + patch?

We also have to fix contrib/benchmark to allow specification of a Deletion Policy, and then allow the openReader task to take a string (userData) to specify which commit to open.

But: it'd be best if, within a single alg, we could specify a series of commits to open, so that we can iterate over the different commit points.  I don't think a param to the task allows this?  (But I'm not sure).  If we made it a config option then I believe we could specify a sequence which each round would advance through.
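
A rough Java sketch of the "sequence which each round would advance through" idea, assuming the sequence is written as a single colon-separated string. The class name RoundValuesSketch, the separator, and the example commit labels are all hypothetical; this is not the benchmark's Config class:

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: one configured value per round, cycling when rounds outnumber values. */
final class RoundValuesSketch {

  private final List<String> values;

  /** spec is a colon-separated list, e.g. "commitA:commitB:commitC" (made-up labels). */
  RoundValuesSketch(String spec) {
    this.values = Arrays.asList(spec.split(":"));
  }

  /** Returns the value for the given zero-based round, wrapping around at the end. */
  String valueForRound(int round) {
    if (round < 0) {
      throw new IllegalArgumentException("round must be >= 0");
    }
    return values.get(round % values.size());
  }

  public static void main(String[] args) {
    RoundValuesSketch commits = new RoundValuesSketch("commitA:commitB:commitC");
    for (int round = 0; round < 4; round++) {
      System.out.println("round " + round + " opens " + commits.valueForRound(round));
    }
  }
}
```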



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718907#action_12718907 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

bq. When existing deletes are over 50%, we loop through termdocs instead.

OK good, except it's deleting too aggressively when > 50% deletions are already present (using nextBoolean()).  Can you change that to target a certain deletion rate?  Ie if you need to delete 20%, then do random.nextDouble() < 0.20 to do the delete?  But then I guess put a floor on that rate so that it doesn't get too slow on the "tail"?  It won't be perfectly random when it hits that tail but I think that's OK.
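
One possible reading of the rate-plus-floor suggestion, sketched in Java under stated assumptions: deletions are tracked in a java.util.BitSet with no bits set at or beyond maxDoc, and the floor value MIN_RATE is made up, since the thread does not name one. RateDeleteSketch and deleteToTarget are hypothetical names:

```java
import java.util.BitSet;
import java.util.Random;

/** Sketch only: rate-targeted random deletion with a floor on the rate. */
final class RateDeleteSketch {

  /** Hypothetical floor; the thread does not specify a value. */
  static final double MIN_RATE = 0.05;

  /**
   * Sets bits in 'deleted' until 'targetDeleted' bits are set in total, sweeping
   * over the live docs and deleting each one with probability 'rate'.
   * Returns the number of sweeps that were needed.
   */
  static int deleteToTarget(BitSet deleted, int maxDoc, int targetDeleted, Random random) {
    if (targetDeleted < 0 || targetDeleted > maxDoc) {
      throw new IllegalArgumentException("targetDeleted must be in [0, maxDoc]");
    }
    int deletedCount = deleted.cardinality();
    int sweeps = 0;
    while (deletedCount < targetDeleted) {
      int live = maxDoc - deletedCount;
      int needed = targetDeleted - deletedCount;
      // Aim for the fraction still missing, but never drop below the floor,
      // so the last few deletions do not take many sweeps.
      double rate = Math.max(MIN_RATE, (double) needed / live);
      for (int doc = deleted.nextClearBit(0);
           doc < maxDoc && deletedCount < targetDeleted;
           doc = deleted.nextClearBit(doc + 1)) {
        if (random.nextDouble() < rate) {
          deleted.set(doc);
          deletedCount++;
        }
      }
      sweeps++;
    }
    return sweeps;
  }

  public static void main(String[] args) {
    BitSet deleted = new BitSet(1000);
    deleted.set(0, 600); // 60% already deleted
    int sweeps = deleteToTarget(deleted, 1000, 800, new Random(7));
    System.out.println(deleted.cardinality() + " deleted after " + sweeps + " sweep(s)");
  }
}
```

Because each sweep stops as soon as the target is reached, docs with low IDs are slightly favoured on the final sweep, which matches the remark that the tail will not be perfectly random.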




[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697057#action_12697057 ] 

Shai Erera commented on LUCENE-1539:
------------------------------------

Is it also interesting to add extensions to EnwikiDocMaker, WriteLineDoc and LineDocMaker which can read/write the content in a bzip format?
I downloaded the latest Enwiki dump, 4.5GB in bzip format. Extracted XML size is 17GB. I thought to myself that I don't have a real reason to extract it - I can read the content directly from the bzip-type file.

So I looked around and found out that in ant.jar there are two classes which can read/write that format. Just to compare, I gzipped the XML file and the result was a 5.1GB file (~13% larger). General measurements on the web also show that bzip compresses better than gzip, although it probably runs a bit slower.

I then ran the WriteLineDoc task, to produce the one-line-per-document text file, and stopped when it reached 228MB. Again, I zipped, gzipped and bzipped the file, and the bzip format was smaller by ~20%.

So I was wondering - besides the speed of reading from a compressed archive, which is slower than reading from a plain XML or TXT file, is there a reason why we don't use bzip/gzip when reading content? It will save a lot of space and I'm not sure that part of the indexing is what's most important.
However, I'm aware that some people might find it better to read from plain files, so I suggest we just have extensions which can read/write the compressed format.
The question is, assuming you agree, should we use bzip (which requires an external library) or gzip, which is in the JDK and does not compress as well as bzip but might have better performance? (I can take some measurements if needed, but my main question is whether we want to introduce a dependency on another library.)

If this belongs in a separate issue, let me know.
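
For the gzip option, a minimal Java sketch of writing and reading a one-line-per-document file through the JDK's java.util.zip streams. GzipLineFileSketch and its methods are hypothetical names and not part of contrib/benchmark; bzip is left out because it would need the external classes mentioned above:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch only: one-line-per-document text stored in a gzip file. */
final class GzipLineFileSketch {

  /** Writes each string as one line into a gzip-compressed UTF-8 file. */
  static void writeLines(File file, Iterable<String> lines) throws IOException {
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new GZIPOutputStream(new FileOutputStream(file)), StandardCharsets.UTF_8))) {
      for (String line : lines) {
        out.write(line);
        out.newLine();
      }
    }
  }

  /** Reads the lines back, decompressing on the fly instead of extracting first. */
  static List<String> readLines(File file) throws IOException {
    List<String> result = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(file)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        result.add(line);
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("linedocs", ".txt.gz");
    file.deleteOnExit();
    writeLines(file, Arrays.asList("title one\tbody one", "title two\tbody two"));
    System.out.println(readLines(file));
  }
}
```

A doc maker would normally stream documents one at a time rather than collect them in a list; the list here only keeps the example short.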



[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697088#action_12697088 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Enabling bzip compression sounds like a win; the added dependency to contrib/benchmark seems fine (it already has several external dependencies).

Can you open a new issue?

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696971#action_12696971 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

I think DeleteByPercentTask.java is missing?

Also: I think you're missing the ability to set the deletion policy for the reader or writer?  Without that, only the last commit is retained.
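
To make the retention point concrete, here is a small simulation of "keep only the last commit" versus "keep every commit". It is a sketch built on stand-in types (Commit, Policy) that are only loosely modeled on the idea of a deletion policy; none of them are Lucene classes, and the commit names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; these are NOT Lucene classes.
final class DeletionPolicySketch {

  static final class Commit {
    final String userData;
    boolean deleted;
    Commit(String userData) { this.userData = userData; }
  }

  interface Policy {
    // Called after each commit with all live commits, oldest first.
    void onCommit(List<Commit> commits);
  }

  // Only the newest commit survives, as described in the comment above.
  static final Policy KEEP_LAST = commits -> {
    for (int i = 0; i < commits.size() - 1; i++) {
      commits.get(i).deleted = true;
    }
  };

  // Retains every commit so that earlier ones can be reopened later.
  static final Policy KEEP_ALL = commits -> { };

  // Simulates a sequence of named commits and returns the names that survive.
  static List<String> run(Policy policy, String... names) {
    List<Commit> live = new ArrayList<>();
    for (String name : names) {
      live.add(new Commit(name));
      policy.onCommit(live);
      live.removeIf(c -> c.deleted);
    }
    List<String> survivors = new ArrayList<>();
    for (Commit c : live) {
      survivors.add(c.userData);
    }
    return survivors;
  }

  public static void main(String[] args) {
    System.out.println(run(KEEP_LAST, "del0", "del5", "del10")); // [del10]
    System.out.println(run(KEEP_ALL, "del0", "del5", "del10"));  // [del0, del5, del10]
  }
}
```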

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680276#action_12680276 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

For performing simultaneous indexing and searching, how should we
best represent this in the .alg file? We have an example
indexing-multithreaded.alg so I suppose we can simply spawn another
set of threads after the "[{ "MAddDocs" AddDoc } : 5000] : 4" line
that performs searches? Just gathering opinions as I don't feel
completely familiar with the benchmark suite yet.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1539:
---------------------------------------

    Attachment: LUCENE-1539.patch

Added an undeleteAll if you try to delete to an absolute pct less than the current deletions.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

* Added deletepercent.alg as an example of these tasks
* CommitIndexTask commits an IndexWriter using a commit name
* OpenReaderTask opens a specific commit point by name
* FlushReaderTask flushes a reader using a commit name
* DeleteByPercentTask deletes a percentage of reader documents


> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718440#action_12718440 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Jason this patch seems close... are you gonna have time/itch to finish this soonish?

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Resolved: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1539.
----------------------------------------

    Resolution: Fixed

Thanks Jason!

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718899#action_12718899 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Thanks Jason, getting close:

  * Can you add contrib/benchmark/CHANGES entry?

  * The new source files need a copyright header

  * Can you remove the undeleteAll?  I don't think the
    DeleteByPercentTask should do that.

  * Can you make its param a real percent, ie so DeleteByPercent(25)
    deletes 25% of the remaining docs.

  * The random-pick is going to be too slow once too many docs are
    deleted (I mentioned this above, too).  How to fix?


> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675196#action_12675196 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

Looking over the code, to do the multiple commits using IR we'll need to add an IR.flush(String userData) method?

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718961#action_12718961 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Or... maybe we should just do undeleteAll in that case?  I'll take that approach instead.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697468#action_12697468 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

This patch still has some noise, eg the unused *Property additions to PerfRunData, the nocommit "first" logic in ReadTask.

On DeleteByPercentTask: should it delete a pctg of the undeleted (numDocs()) docs or of the total (maxDoc()) doc space?  Right now its implementation is dangerous, eg, if I delete 5% of the index and then 10%, that 10% delete will do nothing, since the docs it deletes will fall onto the exact docs that the 5% had deleted.

{quote}
It seems a bit awkward that DeleteByPercentTask needs to call
IR.undeleteAll before executing the deletes.
{quote}

Oh, I see.  I don't think it should do that?  I think it should mean "delete XXX% of the remaining undeleted docs"?
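
The two readings of the percentage give different counts once some docs are already deleted. A small worked example, with made-up numbers and hypothetical method names (this is arithmetic only, not the task's code):

```java
// Illustrative arithmetic only; not part of contrib/benchmark.
final class DeletePercentMath {

  // Docs to delete when the percentage applies to the live (undeleted) docs, i.e. numDocs().
  static int ofRemaining(int numDocs, double pct) {
    return (int) Math.round(numDocs * pct / 100.0);
  }

  // Docs to delete when the percentage applies to the whole doc-id space, i.e. maxDoc().
  static int ofTotal(int maxDoc, double pct) {
    return (int) Math.round(maxDoc * pct / 100.0);
  }

  public static void main(String[] args) {
    int maxDoc = 1000;
    int numDocs = maxDoc - ofRemaining(maxDoc, 5); // 950 live docs after a 5% delete
    System.out.println(ofRemaining(numDocs, 10));  // 95
    System.out.println(ofTotal(maxDoc, 10));       // 100
  }
}
```

With the "remaining docs" reading, a later 10% delete always removes live documents, which is the behaviour proposed above.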

{quote}
Also that
subsequent delete by percent calls in deletepercent.alg need to
open the latest version of the index rather than the original
(which does not have deletes)
{quote}

This seems correct?  Ie the purpose of this task is "open the latest commit on the index, delete XXX% of its undeleted docs".

{quote}
This is due to
DirectoryIndexReader.acquireWriteLock checking to ensure the
latest version of the index is locked. Perhaps we can relax
this? I would rather be able to open a commit point and delete
from the reader, then flush as the latest version.
{quote}
I don't think we can relax that.  This (single transaction (writer) open at once) is a core assumption in Lucene.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1539:
---------------------------------------

    Attachment: LUCENE-1539.patch

Attached new patch; fixed a bunch of silly issues (eg we had broken
parsing of the readOnly option to OpenReaderTask; the
deletepercent.alg was opening readOnly readers to do the deletes; the
readOnly option was ignored if you specified userData; etc.).

I also switched the default for autoCommit to false, when creating an
IndexWriter.

I think it's ready to commit... I'll commit soon.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

The patch adds CreateWikiIndex which creates enwiki indexes with
multiple percentages of deletes. It probably needs to be made into a
task or multiple tasks along with an alg file. One goal is to evolve
this patch to enable concurrent indexing and searching. 

I can see the elegance of using Python scripts because they're easy to
edit, and the pickling is nice. Equivalent Java code could be fairly
lengthy. However, since this is a Java project and we have a framework
with the .alg files for defining some level of external operations,
it seems we may want to figure out a way to put the Python script
functionality into tasks defined by .alg files.


> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

Keeps previous deletes (doesn't call undeleteall).  When existing deletes are over 50%, we loop through termdocs instead.
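
A rough sketch of that two-mode strategy follows. It assumes a java.util.BitSet as a stand-in for the reader's deleted docs and replaces the termdocs loop with a plain walk over the clear bits; the class and method names are hypothetical and this is not the code in the patch.

```java
import java.util.BitSet;
import java.util.Random;

// Stand-in for a reader's deleted-docs bitmap; not Lucene code.
final class DeleteByPercentSketch {

  // Marks pct percent of the currently live docs as deleted; returns how many were marked.
  static int deletePercent(BitSet deleted, int maxDoc, double pct, Random random) {
    int numDocs = maxDoc - deleted.cardinality();
    int toDelete = (int) Math.round(numDocs * pct / 100.0);
    int done = 0;
    if (deleted.cardinality() * 2 <= maxDoc) {
      // At most half deleted: a random probe hits a live doc at least half the time.
      while (done < toDelete) {
        int doc = random.nextInt(maxDoc);
        if (!deleted.get(doc)) {
          deleted.set(doc);
          done++;
        }
      }
    } else {
      // Mostly deleted: walk the live docs in order instead of probing at random.
      for (int doc = deleted.nextClearBit(0);
           doc < maxDoc && done < toDelete;
           doc = deleted.nextClearBit(doc + 1)) {
        deleted.set(doc);
        done++;
      }
    }
    return done;
  }

  public static void main(String[] args) {
    BitSet deleted = new BitSet();
    Random random = new Random(42);
    deletePercent(deleted, 1000, 60, random); // random path: 600 docs deleted
    deletePercent(deleted, 1000, 25, random); // scan path: 25% of the 400 live docs
    System.out.println(deleted.cardinality()); // 700
  }
}
```

The sequential branch trades randomness for bounded work, which is the reason to switch once random probes would mostly land on already deleted docs.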

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

Changed the deletes to be random, cleaned up the code.

Multiple passes of deletePercent.alg fail; I may have time to figure out why. As is, though, the patch works.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718960#action_12718960 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

I'd say we don't allow that now.  EG one can easily save & open a past commit point, with fewer deletions?

But maybe we should throw an exception if you attempt this, so you don't falsely think it worked.  I'll make that change.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1539:
-------------------------------------

    Attachment: LUCENE-1539.patch

Above mentioned issues fixed.

It seems a bit awkward that DeleteByPercentTask needs to call
IR.undeleteAll before executing the deletes. Also that
subsequent delete by percent calls in deletepercent.alg need to
open the latest version of the index rather than the original
(which does not have deletes). This is due to
DirectoryIndexReader.acquireWriteLock checking to ensure the
latest version of the index is locked. Perhaps we can relax
this? I would rather be able to open a commit point and delete
from the reader, then flush as the latest version.

Perhaps in flexible indexing we can have more customizability
with the versioning? 

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673596#action_12673596 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

Jason, couldn't we create a .alg that makes multiple copies of a Wikipedia index w/ different pctg deletes, instead of a static main java tool?

In fact, one cleaner way to achieve this would be to use multiple commits in the same index.  So instead of making a full copy of the wiki index for each pctg of deletes, make a new commit.  You end up w/ a single index that has N commits, one for each pctg you need to test.  Then we'd just need a way to tell an alg which commit to open.  Since a commit can contain an optional string commitUserData, we could use that to tell the alg which commit to open.
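
To make the idea concrete, here is a minimal sketch of picking a commit by its user-data label. It does not use the Lucene API: the Commit class, the findByUserData method and the "deletes=N%" label format are all hypothetical stand-ins, assumed only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CommitSelector {

    /** Hypothetical stand-in for a commit point: a generation plus an optional label. */
    static final class Commit {
        final long generation;
        final String userData; // may be null, since the label is optional

        Commit(long generation, String userData) {
            this.generation = generation;
            this.userData = userData;
        }
    }

    /** Returns the first commit whose label equals wanted, or null if none matches. */
    static Commit findByUserData(List<Commit> commits, String wanted) {
        for (Commit commit : commits) {
            if (wanted.equals(commit.userData)) {
                return commit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // One index, several commits, one per delete percentage under test.
        List<Commit> commits = new ArrayList<>();
        commits.add(new Commit(1L, "deletes=0%"));
        commits.add(new Commit(2L, "deletes=5%"));
        commits.add(new Commit(3L, "deletes=20%"));

        Commit chosen = findByUserData(commits, "deletes=5%");
        System.out.println("would open generation " + chosen.generation);
    }
}
```

An alg parameter would then only need to carry the label string, and an unlabeled commit is simply never selected.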

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702436#action_12702436 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------


{quote}
Yeah? Ok. So the deleteDocsByPercent method needs to somehow
take into account whether it has deleted before, by adjusting the
doc nums it's deleting?
{quote}

How about randomly choosing docs to delete instead of every N?  Then
you don't need to keep track?
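
A rough sketch of what random selection could look like, assuming "delete
XXX% of the remaining undeleted docs" semantics. This is not the patch's
code: deleted docs are modeled with a plain BitSet instead of an
IndexReader, and RandomPercentDeleter and deleteByPercent are hypothetical
names.

```java
import java.util.BitSet;
import java.util.Random;

class RandomPercentDeleter {

    /**
     * Marks about percent% of the currently undeleted docs as deleted,
     * picking doc numbers at random. Assumes only bits below maxDoc are set.
     * Returns the number of docs newly deleted.
     */
    static int deleteByPercent(BitSet deleted, int maxDoc, double percent, Random random) {
        int live = maxDoc - deleted.cardinality();
        // Never ask for more deletes than there are live docs.
        int toDelete = Math.min(live, (int) Math.round(live * percent / 100.0));
        int done = 0;
        while (done < toDelete) {
            int doc = random.nextInt(maxDoc);
            if (!deleted.get(doc)) { // skip docs deleted in an earlier round
                deleted.set(doc);
                done++;
            }
        }
        return done;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet(1000);
        Random random = new Random(17);
        int first = deleteByPercent(deleted, 1000, 10.0, random);
        int second = deleteByPercent(deleted, 1000, 10.0, random);
        // 10% of 1000 live docs, then 10% of the 900 that remain.
        System.out.println(first + " then " + second + " docs deleted");
    }
}
```

Retrying on already-deleted docs keeps the method stateless across rounds,
at the cost of more retries as the fraction of deleted docs grows.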

{quote}
> I don't think we can relax that. This (single transaction
> (writer) open at once) is a core assumption in Lucene.

True, however that doesn't mean we have to stick with it, especially
internally. Hopefully we can move to a more componentized model
where someone could change this if they wanted. Perhaps in the
flexible indexing revamp?
{quote}

We'd need to figure out how to get multiple writers to properly
"cooperate".  Actually Marvin is working on something like this (for
KS/Lucy), where one "lightweight" writer can do adds/deletes/small
merges, and a separate "heavyweight" writer does large merges.


> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701768#action_12701768 ] 

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

{quote}
I think it should mean "delete XXX% of the remaining
undeleted docs"?
{quote}

Yeah? Ok. So the deleteDocsByPercent method needs to somehow
take into account whether it has deleted before, by adjusting the
doc nums it's deleting?

{quote}
I don't think we can relax that. This (single transaction
(writer) open at once) is a core assumption in Lucene.
{quote}

True, however that doesn't mean we have to stick with it, especially
internally. Hopefully we can move to a more componentized model
where someone could change this if they wanted. Perhaps in the
flexible indexing revamp?





> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Commented: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695462#action_12695462 ] 

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

This patch looks good -- some questions:

  * Is CreateWikiIndex intended to be committed?  I thought not?  I.e., I
    thought the goal w/ this issue is to add the necessary tasks so that
    CreateWikiIndex would be done as an alg.

  * I think we shouldn't bump to Java 1.5 -- it's only CreateWikiIndex
    that needs it anyway (in only 2 places).

  * PrintReaderTask never closes the reader.

  * Not sure why you needed to relax private -> protected in AddDocTask?
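
On the unclosed reader: one common shape for the fix is a try/finally
around the print. The sketch below only illustrates that pattern and is not
the patch's code; FakeReader and printReader are hypothetical stand-ins for
the real reader and task.

```java
import java.io.Closeable;

class PrintThenClose {

    /** Hypothetical stand-in for an index reader; it only tracks whether close() ran. */
    static final class FakeReader implements Closeable {
        boolean closed = false;

        String describe() {
            return "FakeReader(closed=" + closed + ")";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Prints the reader's description and always closes the reader afterwards. */
    static String printReader(FakeReader reader) {
        try {
            String text = reader.describe();
            System.out.println(text);
            return text;
        } finally {
            reader.close(); // runs even if describe() or println throws
        }
    }

    public static void main(String[] args) {
        FakeReader reader = new FakeReader();
        printReader(reader);
        System.out.println("closed afterwards: " + reader.closed);
    }
}
```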


> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.


[jira] Updated: (LUCENE-1539) Improve Benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1539:
---------------------------------------

    Attachment: LUCENE-1539.patch

Updated patch:

  * Switched to TermDocs to pick the deletes; I think this is sufficient (no floor is needed)

  * Beefed up CHANGES

  * Added a few more copyrights

I think it's ready to commit!  I'll wait a day or two...

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.
