Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/07/02 15:42:04 UTC

[jira] Created: (LUCENE-947) Some improvements to contrib/benchmark

Some improvements to contrib/benchmark
--------------------------------------

                 Key: LUCENE-947
                 URL: https://issues.apache.org/jira/browse/LUCENE-947
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/benchmark
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor


I've made some small improvements to the contrib/benchmark, mostly
merging in the ad-hoc benchmarking code I've been using in LUCENE-843:

  - Fixed thread safety of DirDocMaker's usage of SimpleDateFormat

  - Print the props in sorted order

  - Added new config "autocommit=true|false" to CreateIndexTask

  - Added new config "ram.flush.mb=int" to AddDocTask

  - Added new configs "doc.term.vector.positions=true|false" and
    "doc.term.vector.offsets=true|false" to BasicDocMaker

  - Added WriteLineDocTask.java, so you can make an alg that uses this
    to build up a single file containing one document per line.  EG
    this alg converts the reuters-out tree into a single file that
    has ~1000 bytes per body field, saved to
    work/reuters.1000.txt:

      docs.dir=reuters-out
      doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
      line.file.out=work/reuters.1000.txt
      doc.maker.forever=false
      {WriteLineDoc(1000)}: *

    Each line has tab-separated TITLE, DATE, BODY fields.

  - Created feeds/LineDocMaker.java that creates documents read from
    the file created by WriteLineDocTask.java.  EG this alg indexes
    all documents created above:

      analyzer=org.apache.lucene.analysis.SimpleAnalyzer
      directory=FSDirectory
      doc.add.log.step=500

      docs.file=work/reuters.1000.txt
      doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
      doc.tokenized=true
      doc.maker.forever=false

      ResetSystemErase
      CreateIndex
      {AddDoc}: *
      CloseIndex

      RepSumByPref AddDoc
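
To make the one-document-per-line layout concrete, here is a minimal sketch of writing and reading such a line.  The class and method names are hypothetical and are not taken from the patch, and replacing tabs and newlines inside a field with spaces is an assumption made so that each document stays on one line:

```java
import java.util.Arrays;

/** Hypothetical helper illustrating a tab-separated TITLE, DATE, BODY line. */
final class LineDocFormat {
    private static final char SEP = '\t';

    /** Joins the three fields with tabs after cleaning each field. */
    static String toLine(String title, String date, String body) {
        return clean(title) + SEP + clean(date) + SEP + clean(body);
    }

    /** Splits a line back into {title, date, body}. */
    static String[] fromLine(String line) {
        // Limit 3 keeps a trailing empty body instead of dropping it.
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
        }
        return parts;
    }

    /** Assumption: separators inside a field are replaced by spaces. */
    private static String clean(String field) {
        return field.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    public static void main(String[] args) {
        String line = toLine("Some title", "26-FEB-1987", "first line\nsecond\tline");
        System.out.println(Arrays.toString(fromLine(line)));
    }
}
```

Splitting with a limit of 3 means an empty body still yields three fields.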

I'll attach initial patch shortly.
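
For readers unfamiliar with the SimpleDateFormat issue mentioned in the first item: a SimpleDateFormat instance keeps mutable state and is not safe to share between threads.  One common way around this (not necessarily the one used in the patch) is to give each thread its own instance.  The sketch below assumes a hypothetical class name and date pattern:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Hypothetical helper: one SimpleDateFormat per thread instead of a shared one. */
final class PerThreadDateFormat {
    // Assumed pattern, chosen only for this sketch.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Each thread lazily gets its own formatter, so no state is shared.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat(PATTERN, Locale.US));

    static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    static String format(Date date) {
        return FORMAT.get().format(date);
    }

    public static void main(String[] args) throws InterruptedException {
        // Two threads format and parse concurrently without interfering.
        Runnable work = () -> {
            String text = "2007-07-02 15:42:04";
            for (int i = 0; i < 1000; i++) {
                if (!format(parse(text)).equals(text)) {
                    throw new IllegalStateException("round trip failed");
                }
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("done");
    }
}
```

Synchronizing on a single shared instance would also work, at the cost of contention between indexing threads.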


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514936 ] 

Michael McCandless commented on LUCENE-947:
-------------------------------------------

This looks great Doron, thanks.

I have one more mod, which is to prefix the log prints from AddDocTask with the net elapsed time since startup (I like to see net elapsed time while the alg is running, to get a sense of the performance difference before the full task finishes...).  I will attach a new patch.
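
As a rough illustration of the idea (not the code in the patch), an elapsed-time prefix could look like the sketch below.  The class name, the output format and the use of System.nanoTime() are all assumptions:

```java
import java.util.Locale;

/** Hypothetical logger that prefixes each message with seconds since start. */
final class ElapsedLog {
    private final long startNanos;

    ElapsedLog(long startNanos) {
        this.startNanos = startNanos;
    }

    static ElapsedLog startNow() {
        return new ElapsedLog(System.nanoTime());
    }

    /** Builds the prefixed message for a given clock reading. */
    String prefix(long nowNanos, String message) {
        double seconds = (nowNanos - startNanos) / 1e9;
        return String.format(Locale.ROOT, "%.2f sec --> %s", seconds, message);
    }

    void log(String message) {
        System.out.println(prefix(System.nanoTime(), message));
    }

    public static void main(String[] args) throws InterruptedException {
        ElapsedLog log = ElapsedLog.startNow();
        Thread.sleep(50);
        log.log("processed 500 docs");
    }
}
```

Passing the clock reading into prefix() keeps the formatting testable without sleeping.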




[jira] Commented: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514494 ] 

Doron Cohen commented on LUCENE-947:
------------------------------------

I missed that one, thanks for the reminder - just a few comments:

1. TestPerfTasksParse - why do you prevent the testing of parsing of WriteLineDoc? 
     I disabled the special handling of this and the test works as expected.

2. Documentation of new properties is missing:
     - In CreateIndexTask: ram.flush.mb [0],  autocommit [true]
     - In byTask.package.html (same 2 props).

3. ram.flush.mb & autocommit should be added & used & documented also in OpenIndexTask (currently only used in CreateIndexTask).

4. AddDocTask:  flushAtRAMUsage - unused?

5. build.xml - 1024m as default for running a benchmark seems too much?
    I mean, one of the nice things about Lucene is that it can run for you even if you only have a few MB of RAM to spare. For someone with a low-end machine, say 512M only, the JVM might fail to even start, right?

6. I like your change of factoring some of the field names into consts. We should probably do the same for the rest.

7. I didn't try the new WriteLineDocTask and LineDocMaker feed. Partly because there was no ready-to-use alg for that under conf/, and also no test for that. Do you think we should add at least one of these two (preferably both)?  - I can help with this.





[jira] Updated: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-947:
--------------------------------------

    Attachment: LUCENE-947.take3.patch

Next iteration of the patch (with Doron's suggested fixes).




[jira] Updated: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-947:
--------------------------------------

    Attachment: LUCENE-947.take5.patch

New rev, I think ready to commit!  I added elapsed time print to AddDocTask and also fixed the new LineDoc unit test to work properly when run ("ant test-contrib") from main Lucene directory.   I also added the comment about re-using Document & Field instances in LineDocMaker.




[jira] Updated: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-947:
-------------------------------

    Attachment: LUCENE-947.take4.patch

take4 has the last comments fixed...
(one thing I can't seem to get rid of is that my svn client started assigning "executable" to new files...)




[jira] Commented: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515099 ] 

Michael McCandless commented on LUCENE-947:
-------------------------------------------

> Mmm one test issue in Windows - the reading of lineFile in
> testLineDocFile treats the '\' as control char, so "\temp" reads as
> "TAB est" and the test fails. If you tested this in Linux you had
> '/' as separator and wouldn't see this failure.

Wooooops!  Thanks for catching this.

> To fix this you can change both occurrences in testLineDocFile() from
>      "line.file.out=" + lineFile,
> to
>       "line.file.out=" + lineFile.getAbsolutePath().replace('\\','/'),
>
> This works for me in Windows, should also work in Linux and such - but I don't have one to check on... 

OK I'll make that change & test cross platform and then commit!





[jira] Updated: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-947:
--------------------------------------

    Attachment: LUCENE-947.patch

First cut patch.




[jira] Commented: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515087 ] 

Doron Cohen commented on LUCENE-947:
------------------------------------

It looks good Michael! (I like the time printing.)

Mmm one test issue in Windows - the reading of lineFile in testLineDocFile treats the '\' as control char, so "\temp" reads as "TAB est" and the test fails. If you tested this in Linux you had '/' as separator and wouldn't see this failure. 

To fix this you can change both occurrences in testLineDocFile() from
     "line.file.out=" + lineFile,
to
      "line.file.out=" + lineFile.getAbsolutePath().replace('\\','/'),

This works for me in Windows, should also work in Linux and such - but I don't have one to check on...
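
The failure mode can be reproduced outside the benchmark with plain java.util.Properties, assuming (as the symptom suggests) that the alg's property lines are parsed with Properties.load, where a backslash starts an escape sequence.  The class name and the path below are made up for this sketch:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical demo of backslashes being eaten by Properties.load. */
final class BackslashInProps {

    /** Writes the path into a property line and reads it back. */
    static String readBack(String path) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("line.file.out=" + path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("line.file.out");
    }

    public static void main(String[] args) {
        String windowsPath = "C:\\temp\\test.txt";

        // Each "\t" is read as a TAB, so the path does not survive.
        System.out.println(readBack(windowsPath).equals(windowsPath));

        // With forward slashes there is nothing to escape.
        String portable = windowsPath.replace('\\', '/');
        System.out.println(readBack(portable).equals(portable));
    }
}
```

On Linux the replace call is a no-op for ordinary paths, which is why the same change should be safe on both platforms.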





[jira] Commented: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514600 ] 

Michael McCandless commented on LUCENE-947:
-------------------------------------------

Thanks for the review Doron!

> 1. TestPerfTasksParse - why do you prevent the testing of parsing of WriteLineDoc? 
>     I disabled the special handling of this and the test works as expected.

Hmmm ... I was seeing a failure if I didn't do that because
WriteLineDoc requires "line.file.out" Config to be set and that test
didn't know to do so.  I'll put it back into the test but add
"line.file.out" for this task.

> 2. Documentation of new properties is missing:
>      - In CreateIndexTask: ram.flush.mb [0],  autocommit [true]
>      - In byTask.package.html (same 2 props).

OK, I'll add this and also for "doc.term.vector.{offsets,positions}"
to BasicDocMaker.

> 3. ram.flush.mb & autocommit should be added & used & documented also in OpenIndexTask (currently only used in CreateIndexTask).

OK, I'll add this.

> 4. AddDocTask:  flushAtRAMUsage - unused?

Yup, this was leftover from pre LUCENE-843 where you had to check RAM
usage after each doc and then flush.  I'll remove it and actually just revert
to current AddDocTask.java (I don't need any mods here).

> 5. build.xml - 1024m as default for running a benchmark seems too much?
>     I mean, one of the nice things about Lucene is that it can run for you even if you only have a few MB of RAM to spare. For someone with a low-end machine, say 512M only, the JVM might fail to even start, right?

Woops... I didn't mean to put this change in.  I'll leave it where it
was (140 MB) and remove the "-server" jvmarg as well.  I was hitting
OOM on some Wikipedia algs.

> 6. I like your change of factoring some of the field names into consts. We should probably do the same for the rest.

OK I'll pull out the remaining ones...

> 7. I didn't try the new WriteLineDocTask and LineDocMaker feed. Partly because there was no ready-to-use alg for that under conf/, and also no test for that. Do you think we should add at least one of these two (preferably both)?  - I can help with this.

OK I'll do both of these.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514776 ] 

Doron Cohen commented on LUCENE-947:
------------------------------------

Thanks for fixing this, Michael, and as usual so fast!

I was able to run the new alg files and the new tests.

A few more comments:

WriteLineDocTask does all of its work in setup(). This seems a bit wrong(?): usually only preparation is done in setup(), and the real work (the things we measure) should be in doLogic().  Mmm... it would probably make more sense to move the file handling code from the constructor to setup(), and the doc creation code (except for the docMaker extraction) from setup() to doLogic(). This should also prevent the error in TestPerfTasksParse (I think no changes would then be required in this test).
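
A rough sketch of the split suggested above, to make the idea concrete. It is not the actual patch: SketchTask and WriteLineSketchTask are hypothetical stand-ins for the benchmark task classes, and the Supplier plays the role of the doc maker.

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.function.Supplier;

/** Hypothetical stand-in for the benchmark task base class; NOT the real API. */
abstract class SketchTask {
  /** Preparation only; nothing here should be part of the measured work. */
  void setup() {}
  /** The measured work; returns the number of items processed. */
  abstract int doLogic();
  /** Cleanup after the measured work. */
  void close() {}
}

class WriteLineSketchTask extends SketchTask {
  private final Writer sink;
  private final Supplier<String[]> docSource; // yields {title, date, body}
  private PrintWriter out;

  WriteLineSketchTask(Writer sink, Supplier<String[]> docSource) {
    // The constructor only records configuration; no file handling here.
    this.sink = sink;
    this.docSource = docSource;
  }

  @Override
  void setup() {
    // File handling lives in setup(), so it is not counted as measured work.
    out = new PrintWriter(sink);
  }

  @Override
  int doLogic() {
    // Doc creation and the write itself are the measured work.
    // Assumes setup() has already been called.
    String[] fields = docSource.get();
    out.print(String.join("\t", fields));
    out.print('\n');
    return 1;
  }

  @Override
  void close() {
    out.flush();
  }
}
```

With this shape, constructing the task has no side effects, which is presumably what a parse-only test needs.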

There are unused imports and an unused dateFormat in:
 - LineDocMaker
 - WriteLineDocTask

For LineDocMaker, I was puzzled why you chose not to implement getNextDocData() and rely on BasicDocMaker to create the next doc for you. I now understand this is so that the Document and Field objects can be reused, which BasicDocMaker does not support. I would add a comment on that.

The new consts in BasicDocMaker can now be used in a few more places...
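
To illustrate the reuse being discussed, here is a minimal sketch assuming the tab-separated TITLE, DATE, BODY line format from the issue description. SketchField, SketchDoc, and LineDocSketch are hypothetical classes, not Lucene's Document and Field or the real LineDocMaker, and the field names are made up.

```java
/** Hypothetical mutable field; stands in for a Field whose value can be reset. */
class SketchField {
  final String name;
  String value;

  SketchField(String name) { this.name = name; }
}

/** Hypothetical document holding one field each for title, date, and body. */
class SketchDoc {
  final SketchField title = new SketchField("title");
  final SketchField date = new SketchField("date");
  final SketchField body = new SketchField("body");
}

class LineDocSketch {
  // Allocated once and reused for every line, so no per-document garbage.
  private final SketchDoc doc = new SketchDoc();

  /** Parses one TITLE \t DATE \t BODY line into the shared document. */
  SketchDoc makeDocument(String line) {
    // Limit 3 keeps any further tabs inside the body field.
    String[] parts = line.split("\t", 3);
    if (parts.length != 3) {
      throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
    }
    doc.title.value = parts[0];
    doc.date.value = parts[1];
    doc.body.value = parts[2];
    return doc;
  }
}
```

The trade-off in this sketch is that the returned object is overwritten by the next call, so the caller must finish with it first, and a single instance cannot be shared across threads.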


> Some improvements to contrib/benchmark
> --------------------------------------
>
>                 Key: LUCENE-947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-947
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-947.patch, LUCENE-947.take2.patch, LUCENE-947.take3.patch
>


[jira] Resolved: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-947.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> Some improvements to contrib/benchmark
> --------------------------------------
>
>                 Key: LUCENE-947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-947
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-947.patch, LUCENE-947.take2.patch, LUCENE-947.take3.patch, LUCENE-947.take4.patch, LUCENE-947.take5.patch
>


[jira] Updated: (LUCENE-947) Some improvements to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-947:
--------------------------------------

    Attachment: LUCENE-947.take2.patch

Another rev of this patch: small changes based on fixes for LUCENE-843
and LUCENE-963.  I plan to commit in a day or two...

> Some improvements to contrib/benchmark
> --------------------------------------
>
>                 Key: LUCENE-947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-947
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-947.patch, LUCENE-947.take2.patch
>