Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/07/26 20:10:03 UTC
[jira] Created: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Add "tokenize documents only" task to contrib/benchmark
-------------------------------------------------------
Key: LUCENE-967
URL: https://issues.apache.org/jira/browse/LUCENE-967
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.3
Attachments: LUCENE-967.patch
I've been looking at performance improvements to tokenization by
re-using Tokens, and to help benchmark my changes I've added a new
task called ReadTokens that just steps through all fields in a
document, gets a TokenStream, and reads all the tokens out of it.
E.g., this alg just reads all Tokens for all docs in the Reuters collection:
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
doc.maker.forever=false
{ReadTokens > : *
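Not part of the patch itself, but the per-document work the task does can be sketched roughly as follows. This is a self-contained illustration only: the SimpleTokenStream class below is a stand-in for Lucene's TokenStream, and all class and method names here are illustrative, not the actual benchmark API.

```java
import java.util.StringTokenizer;

// Illustrative sketch only -- not the actual ReadTokensTask from the patch.
public class ReadTokensSketch {
    // Minimal stand-in for org.apache.lucene.analysis.TokenStream:
    // next() returns the next token, or null when the stream is exhausted.
    static class SimpleTokenStream {
        private final StringTokenizer tok;
        SimpleTokenStream(String text) { this.tok = new StringTokenizer(text); }
        String next() { return tok.hasMoreTokens() ? tok.nextToken() : null; }
    }

    // Steps through all fields of a "document" and drains each token
    // stream, mirroring what the ReadTokens task does per document.
    static int readAllTokens(String[] fieldValues) {
        int count = 0;
        for (String value : fieldValues) {
            SimpleTokenStream ts = new SimpleTokenStream(value);
            while (ts.next() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] doc = { "hello world", "lucene benchmark task" };
        System.out.println(readAllTokens(doc)); // prints 5
    }
}
```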
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516747 ]
Michael McCandless commented on LUCENE-967:
-------------------------------------------
I plan to commit this soon...
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516806 ]
Doron Cohen commented on LUCENE-967:
------------------------------------
I'm reviewing it now...
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516945 ]
Michael McCandless commented on LUCENE-967:
-------------------------------------------
> Also, I think the addition of printing of elapsed time is redundant,
> because you get it anyhow as the elapsed time reported for the
> outermost task sequence. (?)
Duh, right :) I will remove that.
> 1) in ReadTokensTask change doLogic() to return the number of tokens
> processed in that specific call to doLogic() (differs from tokensCount
> which aggregates all calls).
Ahh good idea!
> 2) in TestPerfTaskLogic the comment in testReadTokens seems
> copy/pasted from testLineDocFile and should be changed.
Whoops, will fix.
> - Also (I am not sure if it is worth your time, but) to really test it, you
> could open a reader against the created index and verify the number
> of docs, and also compare the index's sum-of-DF to the total token
> counts in ReadTokensTask.
OK I added this too. Will submit new patch shortly.
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516809 ]
Doron Cohen commented on LUCENE-967:
------------------------------------
Applies cleanly and all tests pass (running from contrib/benchmark).
I like the efficiency changes.
A few suggestions:
1) in ReadTokensTask change doLogic() to return the number of tokens
processed in that specific call to doLogic() (differs from tokensCount
which aggregates all calls).
2) in TestPerfTaskLogic the comment in testReadTokens seems
copy/pasted from testLineDocFile and should be changed.
3) Also (I am not sure if it is worth your time, but) to really test it, you
could open a reader against the created index and verify the number
of docs, and also compare the index's sum-of-DF to the total token
counts in ReadTokensTask.
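Suggestion (1) above amounts to keeping a per-call count alongside the running aggregate. A rough self-contained sketch of that shape (class, method, and field names are illustrative, not the actual benchmark API; a crude whitespace split stands in for a real TokenStream):

```java
// Illustrative sketch of suggestion (1): doLogic() returns the number of
// tokens processed in that specific call, while tokensCount aggregates
// across all calls. Names are illustrative, not the real API.
public class DoLogicSketch {
    private int tokensCount = 0; // aggregate over all doLogic() calls

    // Returns the tokens processed in *this* call only.
    public int doLogic(String[] fieldValues) {
        int thisCall = 0;
        for (String value : fieldValues) {
            // crude whitespace split stands in for a real TokenStream
            if (!value.isEmpty()) {
                thisCall += value.split("\\s+").length;
            }
        }
        tokensCount += thisCall;
        return thisCall;
    }

    public int getTokensCount() { return tokensCount; }

    public static void main(String[] args) {
        DoLogicSketch t = new DoLogicSketch();
        System.out.println(t.doLogic(new String[] { "a b c" })); // prints 3
        System.out.println(t.doLogic(new String[] { "d e" }));   // prints 2
        System.out.println(t.getTokensCount());                  // prints 5
    }
}
```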
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517031 ]
Doron Cohen commented on LUCENE-967:
------------------------------------
Thanks for fixing this Michael, looks perfect to me now.
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516837 ]
Doron Cohen commented on LUCENE-967:
------------------------------------
Also, I think the addition of printing of elapsed time is redundant,
because you get it anyhow as the elapsed time reported for the
outermost task sequence. (?)
For instance, if you add to tokenize.alg this line:
RepSumByName
You get this output:
Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
Seq_Exhaust 0 1 21578 638.2 33.81 15,694,368 20,447,232
Net elapsed time: 33.809 sec
So the total elapsed time is actually printed twice now -- do we need this?
[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-967:
--------------------------------------
Lucene Fields: [New, Patch Available] (was: [New])
[jira] Resolved: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-967.
---------------------------------------
Resolution: Fixed
Lucene Fields: [New, Patch Available] (was: [Patch Available, New])
[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-967:
--------------------------------------
Attachment: LUCENE-967.take3.patch
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517035 ]
Michael McCandless commented on LUCENE-967:
-------------------------------------------
Thank you for reviewing! I will commit shortly.
[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-967:
--------------------------------------
Attachment: LUCENE-967.patch
Attached a patch that adds ReadTokensTask.java. I also added a change to print the net elapsed time of the algorithm.
[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-967:
--------------------------------------
Attachment: LUCENE-967.take2.patch
New rev of this patch. The only real change is to reduce the benchmark
framework's overhead (slightly) by pre-building an array of PerfTasks
instead of creating a new iterator for each document.
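The optimization described here -- snapshotting the tasks into an array once, rather than allocating a fresh iterator on every document -- follows a common hot-loop pattern. A generic self-contained sketch (the PerfTask interface and method names below are illustrative, not the actual benchmark API):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the take2 optimization: hoist per-document
// iterator creation out of the hot loop by building the task array once.
// Names are illustrative, not the real benchmark API.
public class PrebuiltTasksSketch {
    interface PerfTask { int run(String doc); }

    // Before: a new Iterator is allocated for every document.
    static int runWithIterators(List<PerfTask> tasks, List<String> docs) {
        int total = 0;
        for (String doc : docs) {
            for (PerfTask task : tasks) { // fresh iterator per document
                total += task.run(doc);
            }
        }
        return total;
    }

    // After: tasks snapshotted into an array once; indexed loop per document.
    static int runWithArray(List<PerfTask> tasks, List<String> docs) {
        PerfTask[] arr = tasks.toArray(new PerfTask[0]); // built once
        int total = 0;
        for (String doc : docs) {
            for (int i = 0; i < arr.length; i++) {
                total += arr[i].run(doc);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<PerfTask> tasks = Arrays.asList(doc -> doc.length(), doc -> 1);
        List<String> docs = Arrays.asList("ab", "cde");
        // Both variants produce the same result; only allocation differs.
        System.out.println(runWithIterators(tasks, docs)); // prints 7
        System.out.println(runWithArray(tasks, docs));     // prints 7
    }
}
```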