You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "DM Smith (JIRA)" <ji...@apache.org> on 2008/07/11 19:20:31 UTC

[jira] Created: (LUCENE-1333) Token implementation needs improvements

Token implementation needs improvements
---------------------------------------

Key: LUCENE-1333
URL: https://issues.apache.org/jira/browse/LUCENE-1333
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.3.1
Environment: All
Reporter: DM Smith
Priority: Minor
Fix For: 2.4

This was discussed in the thread (not sure which place is best to reference so here are two):
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
or to see it all at once:
http://www.gossamer-threads.com/lists/lucene/java-dev/62851

Issues:
1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
5. Some internal optimizations can be done with regard to char[] allocation.
6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622171#action_12622171 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

But how do you cope with reset()?

Consider this:
* step 1: tokens added to the sink. They are cloned, so sink "owns" them.
* step 2: sink is used, until exhausted (until its next(Token) returns null).
* step 3: sink.reset() is invoked.
* step 4: sink is used again, until exhaustion.

Point is that if step 2 did not return clones, the consumer that invoked step 2 might have overridden the tokens it got from step 2. Now in step 4 sink "thinks" it returns clones of original tokens, unaware that the consumer of step 2 modified them.


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1333:
---------------------------------------

    Attachment: LUCENE-1333.patch

Patch & changes look good, thanks DM.

I attached new patch with tiny changes:

  * Fixed javadoc warnings

  * Removed extra unnecessary comparison in ArrayUtil.getShrinkSize

I plan to commit in a day or two!

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621245#action_12621245 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. Do you see this too?

Hmm, indeed I do see this too -- but this is because Token has never overridden "equals" right? 

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621101#action_12621101 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

OK since you pulled it all together under this issue, I think we should commit this one instead of LUCENE-1350.  I'll review the [massive] patch -- thanks DM!

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621410#action_12621410 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------


I forgot to say (in last patch): I also added Token.reinit() methods.

bq. I think that LUCENE-1350's failing test can be included here? 

OK I'll update this test case.

{quote}
I find the comment in TokenFilter about implementing next() vs. next(Token) confusing.
Perhaps use here the same comment as in Tokenizer.
{quote}

OK I'll update comment in TokenFilter.

bq. I also made the corresponding changes in their generate java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.

I ended up doing the same.

bq. Perhaps it would also be good to finish the canonical implementation and provide hashcode?

OK I'll add.

bq. Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception.

Agreed, I'll add.


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621255#action_12621255 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

I couldn't leave this without seeing that test passing... so here is the code for Token.equals() in case you didn't write that part yet -
{code}
  @Override
  public boolean equals(Object obj) {
    if (this == obj)
      return true;
   
    Token other = (Token) obj;
    
    return
      termLength == other.termLength &&
      startOffset == other.startOffset &&  
      endOffset == other.endOffset &&
      flags == other.flags &&
      positionIncrement == other.positionIncrement && 
      subEqual(termBuffer, other.termBuffer) &&
      subEqual(type, other.type) &&
      subEqual(payload, other.payload);
  }

  private boolean subEqual(Object o1, Object o2) {
    if (o1==null)
      return o2==null; 
    return o1.equals(o2);
  }
{code}

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621242#action_12621242 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. (I think producer tokens probably should check and create a token if it is so. But I don't think that is what they do now.)

Actually I think one should never pass null to next(Token) API -- ie a source token stream need not check for null.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621224#action_12621224 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

Seems start/end offset were lost in ChineseTokenizer.
Adding this in flush(Token) will fix it.
{code}
token.setStartOffset(start);
token.setEndOffset(start+length);
{code}
Possibly true for other places where new Token() that takes start/end offsets was replaced by clear() and setTermBuffer().

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621241#action_12621241 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. Seems start/end offset were lost in ChineseTokenizer.

OK, I'll fold this into my changes to the patch.  Thanks Doron!

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621263#action_12621263 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. here is the code for Token.equals() in case you didn't write that part yet

Thanks Doron; I'll merge into my version :)  Both your version and my version had bugs, so it's great you posted yours!  And I hope the merged result has no bugs :)

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-wordnet.patch

for contrib wordnet

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621384#action_12621384 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

All tests pass here. 
131 files were modified - I reviewed core and part of contrib, with minor comments only:

* CompoundWordTokenFilterBase - createToken() calls clone() which deep 
   copies the char array, and then calls also setTermBuffer() which iterates the chars again.
   This can come to play in some analyzers. Usually I would ignore things like this, but this 
   issue is so much about reusing and avoiding unneeded reallocation that I decided to 
   bring this up. Actually there is no reallocation, just re-copying, so maybe it is not too bad?
** Similarly in NgramTokenFilter - seems termBuffer would be copied twice for each "secondary" token.

* Cloning: behavior regarding cloning is modified in few places. 
**   In SingleTokenTokenStream - adding cloning - I think it is correct.
**   In TokenTypeSinkTokenizer.add(Token) cloning was removed because it is
      taken care of in super.add(). I first thought it a bug but no, patch is correct.

* I find the comment in TokenFilter about implementing next() vs. next(Token) confusing.
  Perhaps use here the same comment as in Tokenizer.


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621248#action_12621248 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. just need to clarify this in TokenStream.nextToken(Token)'s javadocs.

I agree; I'll do this in next patch iteration...

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-miscellaneous.patch

for contrib misc

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1333:
---------------------------------------

    Attachment: LUCENE-1333.patch

Woops, you're right -- new patch attached.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1333:
---------------------------------------

    Attachment: LUCENE-1333.patch

Attached new patch w/ above changes.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333.patch

* Added reinit(Token ...) methods to initialize one token from another.
* Improved hashCode.
* Made Token next(Token) have a final argument everywhere it is implemented to clarify the "best practice" for reuse and did the same for helper methods in various classes. As part of this, I renamed the token retrieved by next(Token) to be nextToken.

The typical pattern I used is for TokenFilters:
{code}
public Token next(final Token reusableToken) {
  assert reusableToken != null;
  Token nextToken = input.next(reusableToken);
   if (nextToken != null)
       return null;
   .... Do something with nextToken ....
   return nextToken;
}
{code}
and for other TokenStreams:
{code}
public Token next(final Token reusableToken) {
  assert reusableToken != null;
   .... Do something with reusableToken ....
   return reusableToken;
}
{code}

and for looping over a TokenStream:
{code}
final Token reusableToken = new Token();
for (Token nextToken = stream.next(reusableToken); nextToken != null; nextToken = stream.next(reusableToken)) {
    .... Do something with nextToken ....
}
{code}
* Improved Payload.clone() to avoid new if possible.
* Removed last remaining calls to termText() in a *.jj file.



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621130#action_12621130 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

DM, one pattern that makes me nervous is this, from QueryTermVector.java:

{code}
          for (Token next = stream.next(new Token()); next != null; next = stream.next(next)) {
{code}

I don't think you should be "recycling" that next and passing it back in the next time you call stream.next, because a TokenStream is not *required* to use the Token you had passed in and so suddenly you are potentially asking it to re-use a token it had previously returned, which it may not expect.  Likely it won't matter but I think this is still safer:

{code}
          final Token result = new Token();
          for (Token next = stream.next(result); next != null; next = stream.next(result)) {
{code}

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622168#action_12622168 ] 

Grant Ingersoll commented on LUCENE-1333:
-----------------------------------------

My point is it is already cloned when it is added to the list, so now it is being cloned twice, and I think the second clone is extraneous.  In other words, it already is returning a clone from next.  The only way that Token could be changed is by doing it via the getTokens() method, which would have to be done outside of the Analysis process, which I would argue does not violate the reusability process.

Cloning Tokens is not cheap, as I recall.  In fact, I seem to recall testing that it was cheaper to do a new.  Now, maybe that is fixed in this issue, but I am not sure.  

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621231#action_12621231 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

{quote}
But still would like to clarify on what can the TokenStream assume. I think
TokenStream cannot assume anything about the token it gets as input, and,
once it returned a token, it cannot assume anything about how that token
is used. So why should it not expect being passed the token it just returned?
{quote}

{quote}
The upshot of all of this, Producers don't care which token they reuse.
{quote}

I agree -- technically speaking, whenever a Token is returned from a source/filter's next(Token) method, *anything* is allowed to happen do it (including any & all changes, and subsequent reuse in future calls to next(Token)) and so the current pattern will run correctly if all sources & filters are implemented correctly. This is the contract in the reuse API.

It's just that it looks spooky, when you are consuming tokens, not to create & re-use your own reusable token.  I think it's also possible (but not sure) that the JRE can compile/run the "single reusable token" pattern more efficienctly, since you are making many method calls with a constant (for the life time of the for loop) single argument, but this is pure speculation on my part...

I think from a code-smell standpoint I'd still like to use the "single re-use" pattern when applicable.  DM I'll make this change & post a new patch.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621823#action_12621823 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

Wait... 
"ant  clean  test-comtrib" was a bad idea, because there's a missing dependency of contrib-test in test-compile.
So "ant clean compile-test  test-contrib"  doesn't have the compile errors above.
The errors that got me started on this were in contrib/analyzers but with the new patch and after clean, all contrib tests passed!



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621715#action_12621715 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

{quote}
I also made the corresponding changes in their generate java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.
{quote}

{quote}
I ended up doing the same.
{quote}

There's a patch for this in LUCENE-1353 ...

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1333:
------------------------------------------

    Assignee: Michael McCandless

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621181#action_12621181 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

This 'final' pattern is indeed more clear about reuse.

But still would like to clarify on what can the TokenStream assume. I think 
TokenStream cannot assume anything about the token it gets as input, and, 
once it returned a token, it cannot assume anything about how that token 
is used.  So why should it not expect being passed the token it just returned?


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-queries.patch

for contrib queries

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-snowball.patch

for contrib snowball

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-snowball.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621969#action_12621969 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

{quote}
The SinkTokenizer changes don't seem right to me. The token is already cloned in the add() method, no need to clone again in the next() call.

At least, that's my understanding based on trying to piece together the changes being proposed here.
{quote}

It is because SinkTokenizer implements reset(). SinkTokenizer needs to ensure that the subsequent times it returns the same token that the token is actually the same.

A token is not immutable, so SinkTokenizer needs to ensure that the tokens in its list are immutable.

The token that is added to the list can be a reusable Token. If it is added without cloning, some other user of the token might change it and thus the list changes.

When the token is removed from the list and returned from next, it is presumed to be a reusable token. (This is true whether next() or next(Token) is called.) If it does not return a clone then some other user of the token might change it and thus the internal list changes. This is only a problem when reset is implemented. This is because a single token can be returned more than once from the TokenStream.

If SinkTokenizer did not implement reset, then next() would not need to create a clone.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622238#action_12622238 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

Regarding the implementation of hashCode:
You are using the following:
{code}
  private static int hashCode(int i) {
    return new Integer(i).hashCode();
  }
{code}

This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal. And may cause unnecessary collisions.

It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be):
  public int hashCode() {
    initTermBuffer();
    int code = termLength;
   code = code * 31 + startOffset;
   code = code * 31 + endOffset;
   code = code * 31 + flags;
   code = code * 31 + positionIncrement;
   code = code * 31 + type.hashCode();
   code = (payload == null ? code : code * 31 + payload.hashCode());
   code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength);
   return code;
  }

Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.)
I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619695#action_12619695 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

All the code has been migrated to use the reuse interface. I have run all the tests and they pass. (There is a weird dependency. One test depends upon demo. I hacked that to work.) I did not test to see if the code were any slower or faster. I'm not sure how one would do that. I think it should be faster since it doesn't bounce back and forth with termText and termBuffer.

I did not improve the code to use char[] instead of String. The places that can be improved call Token.setTermBuffer(String). I don't think these are necessary to this issue being resolved.

A couple of other minor opportunities for char[] manipulation via o.a.l.util.ArrayUtil.

Here and there, there is a need to remove leading and trailing whitespace from input before (sub-)tokenizing it. Currently this is done with String.trim(), even when working with char[]. It is sub-optimal as marshalling it into a String involves allocation and copying. Likewise, so does getting it out. It would be better to have int trim(char[]) which shifts leading spaces off the front and returns the length of the string without trailing spaces.

There is a "randomize" routine that shuffles an array. While this is only used in one place, it appears to be general purpose array manipulation.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622204#action_12622204 ] 

Grant Ingersoll commented on LUCENE-1333:
-----------------------------------------

I think the main performance issue with clone in the old Token was that it had to do the initTermBuffer before, even though clone already knows what size the buffer should be, etc.  It looks like this is fixed now, so it may be a non-issue.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621132#action_12621132 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

I'll give my analysis here. Feel free to make the change or kick it back to me to make it, if you think your pattern is best. (If I do it, it will be after this weekend.)

I've tried to be consistent here. The prior pattern was inconsistent and often was:
{code}
 Token token = null;
  while ((token = input.next()) != null) {
{code}

There were other variants including "forever loops". As you noticed, I replace
There are two basic implementations of Token next(Token):
1) Producer: These create tokens from input. Their pattern is to take their argument and call clear on it and then set startOffset, endOffset and type appropriately. Their assumption is that they have to start with a pristine token and that other than space, there is nothing about the token that is passed in that can be reused.

2) Consumer: These "filter" their argument. Their only assumption is that in the call chain that there was a producer that created the token that they need to reuse. In this case, they typically will preserve startOffset and endOffset because those are to represent the position of the token in the input. They may refine type, flags and payload, but otherwise have to preserve them. Most typically, they will set the termBuffer. There are a few types of consumers. Here are some:
a) Transformational Filters: They take their argument and transform it's termBuffer.
b) Splitting Filters: They take their argument and split the token into several. Sometimes they will return the original; other times just the parts. When creating these tokens, calling clone() on the prototype will preserve flags, payloads, start and end offsets and type. These clones are sometimes stored in a buffer, but sometimes are incrementally computed with each call to next(Token). With the latter, they will typically cache a clone of the passed in token. I think that, when possible, incremental computation is preferable, but at the cost of a less obvious implementation.
c) Caching Filter: If their buffer is empty, they repeatedly call result = input.next(token), clone and buffer cache their result in some collection. Once full, they will return their buffer's content. If, the caching filter is resettable, they must return clones of their content. Otherwise, down stream consumers may change their arguments, disastrously.

Callers of Token next(Token) have the responsibility of never calling with a null token. (I think producer tokens probably should check and create a token if it is so. But I don't think that is what they do now.)

The upshot of all of this, Producers don't care which token they reuse. If it was from the original loop, or from the result of the last call to token = stream.next(token), both are equally good. The token pre-existed and needs to be fully reset. Consumers presume that the token was produced (or at least appropriately re-initialized and filled in) by a producer.

Your form of the loop is very advisable in a few places. Most typically with a loop within a loop, with the inner looping over all the tokens in a stream. In this case, the final Token would be created outside the outer loop. Using your pattern, there would encourage maximal reuse. Using mine, the programmer would have to figure out when it was appropriate to do one or the other.

The other value to your pattern is that next(Token) is always called with a non-null Token.

I think that calling the token "result" is not the best. It is a bit confusing as it is not the result of calling next(Token). Perhaps, to make reuse acutely obvious:
{code}
 final Token reusableToken = new Token();
 for (Token token = stream.next(reusableToken); token != null; token = stream.next(reusableToken)) {
{code}



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621244#action_12621244 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

Thanks Mike!

One more...

While looking into the failure of TestSingleTokenTokenFilter I saw that if these lines are executed:
{code}
Token t1 = new Token();
Token t2 = t1.clone();
boolean isEqual = t1.equals(t2)
{code}
Surprisingly,   *isEqual = false*

Do you see this too?



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621262#action_12621262 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

I'm unable to get javacc to run cleanly on PrecedenceQueryParser.jj.  It produces this:

{code}
bash-3.2$ javacc ./src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file ./src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj . . .
org.javacc.parser.ParseException: Encountered "<<" at line 658, column 3.
Was expecting one of:
    <STRING_LITERAL> ...
    "<" ...
    
Detected 1 errors and 0 warnings.
{code}

Does anyone else hit this?

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621762#action_12621762 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. I'm unable to get javacc to run cleanly on PrecedenceQueryParser.jj.

I figured this out: you have to use javacc 3.2 (not 4.0 or beyond) for contrib/misc, then "ant javacc" runs fine (with the patch from LUCENE-1353).

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-lucli.patch

for contrib lucli

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621800#action_12621800 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

I diffed current patch and previous one and all seems correct to me. 

One tiny thing in Token - in two locations there's this code:
{code}
setTermBuffer(newTermBuffer, newTermOffset, newTermLength);
setTermLength(newTermLength);
{code}
Second call is redundant since first call already sets termLength.



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333.patch

Better comments based upon migrating all of core and contrib. This patch replaces the prior two.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621417#action_12621417 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

Caching flies in the face of reuse. I think that a comment needs to be some where to that effect.
Putting Tokens into a collection requires that the reusable token be copied. I.e. via new or clone. One cannot directly store the reusable Token, i.e. the argument from Token next(Token), nor the value to be returned from it.

If a caching TokenStream is also resettable, then that means that the tokens coming from it need to be protected from being changed. This means that they need to return a copy. (Perhaps comment reset() to that effect?)

The only value I see in caching is if the computation of the token stream is so expensive that re-using it has a significant savings.

(The current code does not take such care and is a bug. This patch fixes it. It would have become obvious if the cache were used in the context of reuse.)

Some TokenStreams cache Tokens internally (as a detail of their implementation) and then return them incrementally. Many of these can be rewritten to compute the next Token when next(Token) is called. This would improve both time and space usage.

(I felt that such changes were outside the scope of this patch.)

All this leads to my response to the NGram filter.

The NGram filters could be improved in this fashion. This would eliminate the clone() problem noted above.

But failing that, a variant of clone to solve the intermediate problem would work. So would using new Token(...). The problem with using new Token() is that it requires manual propagation of flags, payloads, offsets and types and is not resilient to future fields.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622201#action_12622201 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

{quote}
Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that it was cheaper to do a new. Now, maybe that is fixed in this issue, but I am not sure.
{quote}

I was going on hearsay when I uniformly used clone() rather than new when dealing with creating a deep copy of an existing token. I was under the impression that clone was faster than new to do equivalent work.

The test is rather simple and worthy of doing before accepting this issue. I don't think I have time to do it today.

The equivalent of clone is (done from memory, so this is close):
Token token = new Token(oldToken.startOffset(), oldToken.endOffset(), oldToken.getFlags(), oldToken.type());
token.setPositionIncrement(oldToken.positionIncrement());
if (oldToken.getPayload() != null) {
 Payload p = new Payload(....); // Create a new Payload with a deep copy of the payload
}

While this might be faster, there are two flaws with this that clone avoids, clone has direct access to the parts and avoids method calls and also is future proof. If a new field is added to Token, it will automatically be carried forward.

There are a couple of places in the code where:
public Token(Token prototype) // only if new is faster
and
public void copyFrom(Token prototype)
would beneficially solve these maintenance issues.

{quote}
But how do you cope with reset()?
{quote}
This problem is a bug in the existing code. Today, one can create a chain of TokenFilters, each of which calls input.next() or input.next(token), and any one of which modifies the return value. It does not matter which is invoked. If the token returned is held in a cache then the cache is corrupted. Every cache of Tokens needs to ensure that it's cache is immutable on creation. It also needs to ensure that it is immutable on usage if the tokens can be served more than once.

Two personal opinions:
* Caches that don't implement reset should return cache.remove(0) [or equivalent] so it is clear that the cache can only be used once.
* Caches should not be used except when it gives a clear performance advantage.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1333:
---------------------------------------

    Attachment: LUCENE-1333.patch


This patch looks good; thanks DM!

I made a few small changes & attached a new rev of the patch:

  * Fixed Token.setTermLength to throw IllegalArgumentException if you
    pass in a length > termBuffer.length

  * Changed Token.growTermBuffer to use oal.util.ArrayUtil (it has
    that same growth logic)

  * Changed if statements in Token.growTermBuffer to first handle the
    [I think most frequent] case where termBuffer is already
    allocated.

  * Javadoc/whitespace


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-analyzers.patch

This is for contrib analyzers.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621220#action_12621220 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

The new patch applies cleanly.
A few tests are failing though - TestChineseTokenizer for one.
I'm looking into it, might be that a problem is in the test.


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-highlighter.patch

for contrib highlighter

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-snowball.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621250#action_12621250 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

{quote}
Hmm, indeed I do see this too - but this is because Token has never overridden "equals" right?
{quote}

Yes you're right. I was under the impression for a moment Object's equals() works like clone() and goes in one layer only... that's stupid of course, it just compares the object references (or nulls).  I wonder how this test ever passed before... Oh I see it now, - trunk's of SingleTokenTokenStream never called Token.clone() while patched version calls it twice. So definitely, Token should implement equals().

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-xml-query-parser.patch

for contrib xml-query-parser

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621940#action_12621940 ] 

Grant Ingersoll commented on LUCENE-1333:
-----------------------------------------

The SinkTokenizer changes don't seem right to me.   The token is already cloned in the add() method, no need to clone again in the next() call.

At least, that's my understanding based on trying to piece together the changes being proposed here.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621267#action_12621267 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. For PrecedenceQueryParser, Is there a way to use ant to run javacc on contrib/misc? 

I couldn't find a way... there is a TODO comment in it's build.xml saying something about it though!

bq. ok now I'm curious cause I still can't see that bug...

OK I'll let you diff :)  And please look closely to see if there are any other bugs!  These equals() methods are tricky!

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621818#action_12621818 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

Still with previous patch - all core tests passed, but then contrib compilation failed. 
I tried ant clean contrib-test (to not repeat core test) and still compilation errors:
{noformat}
    [javac] Compiling 27 source files to build\contrib\analyzers\classes\test
    [javac] contrib\analyzers\src\test\org\apache\lucene\analysis\miscellaneous\TestSingleTokenTokenFilter.java:23: cannot find symbol
    [javac] symbol  : class LuceneTestCase
    [javac] location: package org.apache.lucene.util
    [javac] import org.apache.lucene.util.LuceneTestCase;
    [javac]                              ^
{noformat}
In yesterday's version all tests passed.
Let me check recent patch.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622202#action_12622202 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

I'm not using sink tokenizer (yet) so not sure if the following worths it, but...

One simple possibility to avoid those extra clones is to add a way to allow notifying the sink that reset() will not be called anymore. 

I.e. that for each token to be consumed from now on, this is the last time it s consumed. 

This can be added in the constructor, or a special setter can be added for that, such as: disableRest(). 
There would not be an enableReset().
When disabled, next() would not clone, and reset() will throw an exception.


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621246#action_12621246 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

{quote}
Actually I think one should never pass null to next(Token) API - ie a source token stream need not check for null.
{quote}
Funny I was about to comment on the same thing... 

In fact when the reuse API was introduced I believe null was suppose to mean - "nothing to be reused, please just create your own". 

In case of a producer that cannot reuse - say it creates its own implementation of Token - then there is no point in creating tokens by the consumer that will never be reused. 

But this also meant that in all the common cases, all tokenizers would need an additional if() to verify that the reusable token is not null. Not so nice. 

So yes, I agree with you, just need to clarify this in TokenStream.nextToken(Token)'s javadocs.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1333:
---------------------------------------

    Attachment: LUCENE-1333.patch


OK new patch attached:

  * Created Token.equals and Payloads.equals -- this fixed
    TestSingleTokenTokenFilter.

  * Switched to the "final reusableToken" pattern.

  * Added to TokenStream.next javadoc stating that parameter should
    never be null

  * Fixed test failures (all tests should pass now)


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333a.txt

The following has been addressed in this patch:
1. JavaDoc is improved (as always, there is still room for improvement. For example, it says the field type is interned, but it is not.)

2. Deprecated the Token constructors taking a String.

3. Changed the allocation policy to be less aggressive.

5. Optimized the growing of the internal termBuffer immediately followed by storing a new term. In doing this added, setTermBuffer(String) and setTermBuffer(String, int, int). Setting from a string is roughly the same cost as setting from a char[].

6. TokenStream has next() has been deprecated. The javadoc has been updated to recommend next(Token) over next().

7. Rather than modifying Term to take a Token, public String term() has been added. With termText() still deprecated, this gives upgraders a clean choice to use term() or termBuffer(), with the knowledge of the performance differences.

I also updated TestToken to test all the changes.

Left to do: (I'd like to get a nod of whether I need to make further changes to the supplied patch before doing #4)
4. Changing of the remainder of core and contrib to remove calls to deprecated Token and TokenStream methods, i.e. to use the reuse API.



> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621393#action_12621393 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

I think that LUCENE-1350's failing test can be included here?

TestSnowball.testFilterTokens() then changes to:

{code}
  public void testFilterTokens() throws Exception {
    final Token tok = new Token(2, 7, "wrd");
    tok.setTermBuffer("accents");
    tok.setPositionIncrement(3);
    Payload tokPayload = new Payload(new byte[]{0,1,2,3});
    tok.setPayload(tokPayload);
    int tokFlags = 77;
    tok.setFlags(tokFlags);

    SnowballFilter filter = new SnowballFilter(
        new TokenStream() {
          public Token next(Token token) {
            return tok;
          }
        },
        "English"
    );

    Token newtok = filter.next(new Token());

    assertEquals("accent", newtok.term());
    assertEquals(2, newtok.startOffset());
    assertEquals(7, newtok.endOffset());
    assertEquals("wrd", newtok.type());
    assertEquals(3, newtok.getPositionIncrement());
    assertEquals(tokFlags, newtok.getFlags());
    assertEquals(tokPayload, newtok.getPayload());
  }
{code}


> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-wikipedia.patch

for contrib wikipedia

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622195#action_12622195 ] 

Grant Ingersoll commented on LUCENE-1333:
-----------------------------------------

Sorry, you guys are right.  My bad.  Does look like we improved some of the cloning costs, though.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621233#action_12621233 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

I'm also seeing the test failures.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622238#action_12622238 ] 

dmsmith edited comment on LUCENE-1333 at 8/13/08 12:04 PM:
------------------------------------------------------------

Regarding the implementation of hashCode:
You are using the following:
{code}
  private static int hashCode(int i) {
    return new Integer(i).hashCode();
  }
{code}

This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal. And may cause unnecessary collisions.

It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be):
{code}
  public int hashCode() {
    initTermBuffer();
    int code = termLength;
   code = code * 31 + startOffset;
   code = code * 31 + endOffset;
   code = code * 31 + flags;
   code = code * 31 + positionIncrement;
   code = code * 31 + type.hashCode();
   code = (payload == null ? code : code * 31 + payload.hashCode());
   code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength);
   return code;
  }
{code}

Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.)
(never mind. I see that they are! super! But I'm working up a patch for this and a couple of minor optimizations that affect Token)

I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.


      was (Author: dmsmith):
    Regarding the implementation of hashCode:
You are using the following:
{code}
  private static int hashCode(int i) {
    return new Integer(i).hashCode();
  }
{code}

This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal. And may cause unnecessary collisions.

It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be):
  public int hashCode() {
    initTermBuffer();
    int code = termLength;
   code = code * 31 + startOffset;
   code = code * 31 + endOffset;
   code = code * 31 + flags;
   code = code * 31 + positionIncrement;
   code = code * 31 + type.hashCode();
   code = (payload == null ? code : code * 31 + payload.hashCode());
   code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength);
   return code;
  }

Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.)
I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.

  
> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621252#action_12621252 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

{quote}
This is technically a break in backwards compatibility, but I think it's OK?
{quote}
I think so. 
I feel good about this whole change, it will make reuse chain more clear, especially once all deprecations can be removed.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-analysis.patch

I've broken up the changes to core and contrib into small patches. My reasoning for this is that it is a lot easier for me to keep up with other people's changes and re-issue a small patch.

Apply LUCENE-1333.patch before any of the following. I don't know if there is an order dependency for any of the following.

This one is for core in o.a.l.analysis but not the files contained in LUCENE-1333.patch

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1333:
---------------------------------------

    Attachment: LUCENE-1333.patch

Attached new patch:

  * Fixed comments about caching & cloning

  * Added partial clone method, that clones all except replaces termBuffer & start/end offset; fixed NGramTokenFilter and CompoundWordTokenFilterBase to use this method

  * Fixed bug in PrecedenceQueryParser.jj that became obvious once I ran javacc

I think we are close!

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-instantiated.patch

For contrib instantiated

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-snowball.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621264#action_12621264 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

ok now I'm curious cause I still can't see that bug... :-)

For PrecedenceQueryParser, Is there a way to use ant to run javacc on contrib/misc? 

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621268#action_12621268 ] 

Doron Cohen commented on LUCENE-1333:
-------------------------------------

{quote}
For PrecedenceQueryParser... there is a TODO comment in it's build.xml saying something about it though!
{quote}
Yes I saw that, but since I don't have a javacc executable.
Maybe I'll look at that build.xml later.

{quote}
OK I'll let you diff
{quote}
OK I can see them now... you were right as usual :-)

Patch applies cleanly and I'm running the test now, so far all looks good but my PC is not that fast and it will take some more time to complete. I hope to review later tonight or tomorrow morning.




> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1333.
----------------------------------------

    Resolution: Fixed

I just committed this.  Thanks DM!  This was one humongous patch -- almost 10K lines.  I sure hope we didn't break anything :)

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333.patch

This patch includes all the previous ones.

Note: It includes the functionality solving  LUCENE-1350. If this patch is applied before LUCENE-1350, then that issue is resolved. If it is done after then the patch will need to be rebuilt.

I did not do the "reuse" API mentioned in LUCENE-1350.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621251#action_12621251 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

bq. So definitely, Token should implement equals().

I agree.  This is technically a break in backwards compatibility, but I think it's OK?

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621411#action_12621411 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------

{quote}
CompoundWordTokenFilterBase - createToken() calls clone() which deep
copies the char array, and then calls also setTermBuffer() which iterates the chars again.
{quote}
Maybe we need a variant of clone that takes a new term buffer & start/end offsets, and creates a new token but with the new term buffer & start/end offsets you've passed in?  Analagous to reinit, but first making a cloned token.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-memory.patch

for contrib memory

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619667#action_12619667 ] 

dmsmith edited comment on LUCENE-1333 at 8/4/08 1:24 PM:
----------------------------------------------------------

Better comments based upon migrating all of core and contrib. This patch replaces the prior two.

Marked fields as deprecated, with documentation that they should be made private and setters/getters should be used instead.

      was (Author: dmsmith):
    Better comments based upon migrating all of core and contrib. This patch replaces the prior two.
  
> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621319#action_12621319 ] 

DM Smith commented on LUCENE-1333:
----------------------------------

Back from a trip. Would have jumped in to help, but it looks like you've found and fixed a couple of my mistakes/oversights. Thanks.

Regarding the equals implementation. Perhaps it would also be good to finish the canonical implementation and provide hashcode? That would allow for storage in sets and maps. (Not terribly sure how that would be useful. But I can imagine sorting on startOffsets to organize the contents.)

Regarding the changes to the jj files. I also made the corresponding changes in their generate java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.

I agree that Token next(Token) should assume a non-null. Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1333) Token implementation needs improvements

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DM Smith updated LUCENE-1333:
-----------------------------

    Attachment: LUCENE-1333-core.patch

This patch covers the rest of core Lucene.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1333) Token implementation needs improvements

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622371#action_12622371 ] 

Michael McCandless commented on LUCENE-1333:
--------------------------------------------


bq. This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

Duh, I didn't realize that's the hashCode function for Integer.  I like you're new hashCode.

bq. I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.

I like this, but maybe name it reinit(Token)?

bq. Also, are the reinit methods used?

I think we could use it in further places (I didn't search exhaustively).

DM if you can pull together a patch w/ these fixes that'd be great!

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token), next() should be deprecated, so that reuse is maximized and descendant classes should be rewritten to over-ride next(Token)
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that took a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org