Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/07/26 20:10:03 UTC

[jira] Created: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Add "tokenize documents only" task to contrib/benchmark
-------------------------------------------------------

                 Key: LUCENE-967
                 URL: https://issues.apache.org/jira/browse/LUCENE-967
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/benchmark
    Affects Versions: 2.3
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.3
         Attachments: LUCENE-967.patch

I've been looking at performance improvements to tokenization by
re-using Tokens, and to help benchmark my changes I've added a new
task called ReadTokens that just steps through all fields in a
document, gets a TokenStream, and reads all the tokens out of it.

E.g., this alg just reads all Tokens for all docs in the Reuters collection:

  doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
  doc.maker.forever=false
  {ReadTokens > : *
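
(Here doc.maker.forever=false plus the trailing ": *" means the
ReadTokens sequence repeats until the doc maker has produced every
document once.)

As a rough sketch, such a task could look something like the
following. This is a hypothetical simplification, not the attached
patch; the PerfRunData/DocMaker plumbing follows contrib/benchmark
conventions but details may differ:

  import java.io.StringReader;
  import java.util.Iterator;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.benchmark.byTask.PerfRunData;
  import org.apache.lucene.benchmark.byTask.tasks.PerfTask;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Fieldable;

  // Hypothetical sketch: walk every field of the next document and
  // exhaust its TokenStream, counting tokens but indexing nothing.
  public class ReadTokensTask extends PerfTask {

    public ReadTokensTask(PerfRunData runData) {
      super(runData);
    }

    public int doLogic() throws Exception {
      Document doc = getRunData().getDocMaker().makeDocument();
      Analyzer analyzer = getRunData().getAnalyzer();
      int tokenCount = 0;
      for (Iterator it = doc.getFields().iterator(); it.hasNext();) {
        Fieldable field = (Fieldable) it.next();  // assumes string fields
        TokenStream stream = analyzer.tokenStream(
            field.name(), new StringReader(field.stringValue()));
        // Token re-use (the next(Token) API) is exactly what this
        // task is meant to help benchmark.
        while (stream.next() != null) {
          tokenCount++;
        }
        stream.close();
      }
      return tokenCount;
    }
  }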


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516747 ] 

Michael McCandless commented on LUCENE-967:
-------------------------------------------

I plan to commit this soon...

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516806 ] 

Doron Cohen commented on LUCENE-967:
------------------------------------

I'm reviewing it now...

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516945 ] 

Michael McCandless commented on LUCENE-967:
-------------------------------------------

> Also, I think the added printing of elapsed time is redundant, 
> because you get it anyhow as the elapsed time reported for the 
> outermost task sequence. (?)

Duh, right :)  I will remove that.

>  1) in ReadTokensTask change doLogic() to return the number of tokens 
>       processed in that specific call to doLogic() (differs from tokensCount 
>       which aggregates all calls).

Ahh good idea!

>  2) in TestPerfTaskLogic the comment in testReadTokens seems 
>      copy/pasted from testLineDocFile and should be changed. 

Whoops, will fix.

>      - Also (I am not sure if it is worth your time, but) to really test it, you 
>      could open a reader against the created index and verify the number 
>      of docs, and also compare the index's sum-of-DF to the total token 
>      counts in ReadTokensTask. 

OK, I added this too.  Will submit a new patch shortly.
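
For the record, a minimal sketch of that verification with the
IndexReader API of the day (hypothetical; the committed test may
differ, and the variable names are made up). Note that summing
per-document term frequencies, not just document frequencies, is what
should exactly match the token count once a term can repeat within a
document:

  // Hypothetical sketch: compare index-wide term statistics against
  // the token count accumulated by ReadTokensTask.
  IndexReader reader = IndexReader.open(dir);
  assertEquals(expectedNumDocs, reader.numDocs());

  int totalTokenCount = 0;
  TermEnum terms = reader.terms();
  TermDocs termDocs = reader.termDocs();
  while (terms.next()) {
    termDocs.seek(terms);
    while (termDocs.next()) {
      totalTokenCount += termDocs.freq();  // occurrences of this term in this doc
    }
  }
  reader.close();
  assertEquals(tokensCountFromReadTokensTask, totalTokenCount);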

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516809 ] 

Doron Cohen commented on LUCENE-967:
------------------------------------

Applies cleanly and all tests pass (running from contrib/benchmark).

I like the efficiency changes.

A few suggestions:

  1) in ReadTokensTask change doLogic() to return the number of tokens 
     processed in that specific call to doLogic() (differs from tokensCount, 
     which aggregates all calls); a sketch follows this list.

  2) in TestPerfTaskLogic the comment in testReadTokens seems 
     copy/pasted from testLineDocFile and should be changed. 

     - Also (I am not sure if it is worth your time, but) to really test it, 
       you could open a reader against the created index and verify the 
       number of docs, and also compare the index's sum-of-DF to the total 
       token counts in ReadTokensTask. 
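
A minimal sketch of suggestion (1), assuming tokensCount is the
aggregating counter inside ReadTokensTask (names hypothetical):

  // Report only the tokens read by this call; tokensCount keeps
  // the running total across all calls.
  public int doLogic() throws Exception {
    int before = tokensCount;
    readTokensForNextDoc();  // hypothetical helper doing the actual work
    return tokensCount - before;
  }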

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517031 ] 

Doron Cohen commented on LUCENE-967:
------------------------------------

Thanks for fixing this, Michael; it looks perfect to me now.

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516837 ] 

Doron Cohen commented on LUCENE-967:
------------------------------------

Also, I think the added printing of elapsed time is redundant, 
because you get it anyhow as the elapsed time reported for the 
outermost task sequence. (?)

For instance, if you add to tokenize.alg this line:
     RepSumByName
You get this output:
     Operation   round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
     Seq_Exhaust     0        1        21578        638.2       33.81    15,694,368     20,447,232
     Net elapsed time: 33.809 sec
So the total elapsed time is actually printed twice now - do we need this?


> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-967:
--------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-967.
---------------------------------------

       Resolution: Fixed
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-967:
--------------------------------------

    Attachment: LUCENE-967.take3.patch

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517035 ] 

Michael McCandless commented on LUCENE-967:
-------------------------------------------

Thank you for reviewing!  I will commit shortly.

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-967:
--------------------------------------

    Attachment: LUCENE-967.patch

Attached a patch that adds ReadTokensTask.java.  I also added a change to print the net elapsed time of the algorithm.

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-967:
--------------------------------------

    Attachment: LUCENE-967.take2.patch

New rev of this patch.  The only real change is to reduce the
benchmark framework's overhead slightly by pre-building an array of
PerfTasks instead of creating a new iterator for each document.
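
Roughly, the change amounts to something like this in the sequence's
per-document loop (a hypothetical before/after; the actual field and
method names in TaskSequence may differ):

  // Before: a new Iterator is allocated each time the sequence runs,
  // i.e. once per document in an exhaust-the-collection run.
  for (Iterator it = tasks.iterator(); it.hasNext();) {
    count += ((PerfTask) it.next()).runAndMaybeStats(letChildReport);
  }

  // After: flatten the task list into an array once, up front, so the
  // per-document loop allocates nothing.
  PerfTask[] tasksArray = (PerfTask[]) tasks.toArray(new PerfTask[0]);
  for (int i = 0; i < tasksArray.length; i++) {
    count += tasksArray[i].runAndMaybeStats(letChildReport);
  }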


> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>
>                 Key: LUCENE-967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-967
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org