You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Earwin Burrfoot (JIRA)" <ji...@apache.org> on 2009/04/19 17:22:47 UTC

[jira] Created: (LUCENE-1607) String.intern() faster alternative

String.intern() faster alternative
----------------------------------

                 Key: LUCENE-1607
                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Earwin Burrfoot
             Fix For: 2.9


By using our own interned string pool on top of default, String.intern() can be greatly optimized.

On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved LUCENE-1607.
----------------------------------

    Resolution: Fixed

committed.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701626#action_12701626 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

I tried it out. Works a little bit better than simple cache (no stray interns must've paid off), doesn't degrade at all.
I'd like to change starter value to something 256-1024, it works way better for 10-20 fields.

Why h >> 7? I understand that you're sacking collision-guilty bits, but why not exact amount that was used (have to store it?), or a whole byte or two?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

latest patch - could use a multi-threaded testcase to ensure no exceptions are thrown and that intern() always returns the same instance.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718188#action_12718188 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

I think so... but I was waiting for some kind of feedback if people in general thought it was the right approach.  It introduces another static, and people tend to not like that.  I accidentally didn't upload the latest version with the javadoc + helper method.  I'll do that now.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

Here's another version that uses closed hashing and table resizes to cache all interned strings requested.  It remains lock free and memory barrier free on the read side.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

Here is another alternative that is limited in size to prevent unbounded growth, and should still handle collisions acceptably.  It's completely lockless (even for updates) and uses open hashing.  Each bucket is an insertion-order cache (inserts at the head) and is limited to a certain length.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700604#action_12700604 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

bq. What was the field count? Is it still a considerable speedup with hundreds of fields without slowing anything else down ?
The field count was 1 :) I rewrote the benchmark, with extra code Java6 speedup on single already interned field became 12.5x.
Results for three java varieties, and different sets of keys follow:

Java 6 (64, server):

  1 key
interned 12.47x
uninterned 3.76x

  10 keys
interned 8.03x
uninterned 3.08x

  100 keys
interned 6.58x
uninterned 2.55x

  1000 keys
interned 5.39x
uninterned 2.69x

Java 5 (64, server):

  1 key
interned 9.84x
uninterned 5.03x

  10 keys
interned 7.00x
uninterned 4.61x

  100 keys
interned 6.61x
uninterned 2.28x

  1000 keys
interned 4.73x
uninterned 2.73x

Java 4 (32, client):

  1 key
interned 4.90x
uninterned 2.88x

  10 keys
interned 4.08x
uninterned 2.67x

  100 keys
interned 3.88x
uninterned 2.52x

  1000 keys
interned 3.44x
uninterned 2.31x


> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700931#action_12700931 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. The fastest hash we can get, should have no collisions. This is achievable by resizing on each new collision.

Hmmm, in my quick'n'dirty tests of about 256 unique strings, a smaller hash table was actually quicker (initialized with 32 and let it resize vs starting at 1024).  I imagine that this would be due to a larger part of the table fitting in smaller and faster processor caches.  YMMV.  Collisions should also be very quick to skip by comparing the hash code (which is cached for Strings).



> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720291#action_12720291 ] 

Michael McCandless commented on LUCENE-1607:
--------------------------------------------

bq. I've held off because of a lack of consensus

Silence = consensus!

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704225#action_12704225 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

Mmm.. what's the status of this one?
Should I add a patch with Yonik's last hash impl and all calls to String.intern() replaced to get it moving?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723352#action_12723352 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

Okay, let's have an extra class and ability to switch impls. I liked that static method could get inlined (at least its short-path), but that's not necessary.

Except I'd like the javadoc demand each impl to be String.intern()-compatible. There's nothing bad in it, as in any decent impl an unique string will be String.intern()'ed one time at most. And the case when you get an infinite flow of unique strings is degenerate anyway, you have to fix something, not deal with it. On the other hand, we can remove "This should never be changed after other Lucene APIs have been used" clause.

rewrite 'for' as 'for (Entry e = first;e != null;e = e.next)' for clarity?
'Entry[] arr = cache;' - this can be skipped? 'cache' is already final and optimizer loves finals. Plus further down the method you use both cache[slot] and arr[slot]. Or am I missing some voodoo?
If check around 'nextToLast = e' can also be removed?
'public String intern(char[] arr, int offset, int len)' - is this needed?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700928#action_12700928 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

Hehe, ten minute difference. Take over this issue, since you're obviously better at it than I?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1607:
----------------------------------

    Attachment: LUCENE-1607-contrib.patch

I found one real occurrence of intern() in crontrib and a now obsolete discussion in one comment.

Attached is a simple patch.

What I currently do not like is that there are still a lot of intern()s in the Javadocs/comments elsewhere, maybe this should also be changed (code refactoring is not able to do that, but a simple grep).

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607-contrib.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704315#action_12704315 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

A top bound on cache size will do? If you're fed too much unique strings it'll end up with characteristics of simple cache.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700618#action_12700618 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

The thread safety problem has to do with safe object publication (making an object visible to a different thread).  Mutable objects generally can't be safely published w/o synchronization.... it has to do with CPU caches in multi-CPU systems.

IIRC, current x86 architectures will see fewer problems since read barriers are no-ops.. caches are guaranteed to be coherent.  But even on x86, you aren't guaranteed that instructions and writes won't be reordered...  so the assignment to pool could be visible *before* all the memory changes that newPool.put() could cause.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740638#action_12740638 ] 

Uwe Schindler commented on LUCENE-1607:
---------------------------------------

Committed rev 802095.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607-contrib.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704484#action_12704484 ] 

Shalin Shekhar Mangar commented on LUCENE-1607:
-----------------------------------------------

Yonik, the string is being interned twice in your latest patch
{code}
+      if (e==null) {
+        s = s.intern();
+        arr[slot] = new Entry(s.intern(), h, first);
{code}

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720266#action_12720266 ] 

Mark Miller commented on LUCENE-1607:
-------------------------------------

I assume we can assign this one to you Yonik?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700931#action_12700931 ] 

Yonik Seeley edited comment on LUCENE-1607 at 4/20/09 2:27 PM:
---------------------------------------------------------------

bq. The fastest hash we can get, should have no collisions. This is achievable by resizing on each new collision.

*edit*: agree, for the first version that was only a cache where collisions invalidate the entry and cause another String.intern() to be called... my comments below are with respect to the second version of my code where interned strings are never dropped from the table.

Hmmm, in my quick'n'dirty tests of about 256 unique strings, a smaller hash table was actually quicker (initialized with 32 and let it resize vs starting at 1024).  I imagine that this would be due to a larger part of the table fitting in smaller and faster processor caches.  YMMV.  Collisions should also be very quick to skip by comparing the hash code (which is cached for Strings).



      was (Author: yseeley@gmail.com):
    bq. The fastest hash we can get, should have no collisions. This is achievable by resizing on each new collision.

Hmmm, in my quick'n'dirty tests of about 256 unique strings, a smaller hash table was actually quicker (initialized with 32 and let it resize vs starting at 1024).  I imagine that this would be due to a larger part of the table fitting in smaller and faster processor caches.  YMMV.  Collisions should also be very quick to skip by comparing the hash code (which is cached for Strings).


  
> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by Mark Miller <ma...@gmail.com>.
Chris Miller wrote:
> but the thread visibility problem seems potentially serious.
Why do you think so? The assignments are atomic and if a set is not 
visible to a thread yet, it just gets an extra intern, right?

I agree that benchmarks will be key though.

-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
This implementation suffers from thread visibility problems too - changes 
to the array's values aren't guaranteed to be visible across threads. In 
addition to that, there's also a problem with hash collisions invalidating 
cache entries which could greatly degrade performance in several common use 
cases. For example, suppose we had a nested loop iterating docs and the doc's 
field names, interning the names as we went. If two fields (F1, F2) both 
hashed to the same array index the cache would never be hit since we'd be 
alternating between interning F1 and F2. Without benchmarking/testing it's 
hard to know how big a problem that would be in practice, but the thread 
visibility problem seems potentially serious.


> [
> https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j
> ira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Yonik Seeley updated LUCENE-1607:
> ---------------------------------
> Attachment: LUCENE-1607.patch
> 
> Here's a completely lockless and memory barrier free intern() cache.
> 
> This default would be more back compatible since programs may rely on
> String instances being interned via String.intern().
> 
> It does not yet include corresponding Lucene code changes to use the
> StringInterner.
> 
> Thoughts?
> 
>> String.intern() faster alternative
>> ----------------------------------
>> Key: LUCENE-1607
>> URL: https://issues.apache.org/jira/browse/LUCENE-1607
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Reporter: Earwin Burrfoot
>> Fix For: 2.9
>> Attachments: intern.patch, LUCENE-1607.patch
>> 
>> By using our own interned string pool on top of default,
>> String.intern() can be greatly optimized.
>> 
>> On my setup (java 6) this alternative runs ~15.8x faster for already
>> interned strings, and ~2.2x faster for 'new String(interned)'
>> 
>> For java 5 and 4 speedup is lower, but still considerable.
>> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

Here's a completely lockless and memory barrier free intern() cache.
This default would be more back compatible since programs may rely on String instances being interned via String.intern().

It does not yet include corresponding Lucene code changes to use the StringInterner.

Thoughts?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718650#action_12718650 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. I still have a feeling we should expose a single static method - intern() and hide implementation away

While the default should be beneficial for most users, I'd hate to lock away a users ability to either expand or remove the cache. 

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700784#action_12700784 ] 

Mark Miller commented on LUCENE-1607:
-------------------------------------

bq. The thread safety problem has to do with safe object publication

Oh, great, you mean basic concurrency competency :( Well thats an embarrassing erosion of knowledge. Back to the books. Thread a has no happens before relationship with thread b unless they share the lock. I trained myself to just synchronize or use volatile long ago (and then let the knowledge begin seeping I guess), but even still, every time I see one of these double lock check type tricks I go, eh, this one must work or something.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746845#action_12746845 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

{quote}
isn't it possible to make the call to
public String intern(char[] arr, int offset, int len) {}
not create a String?
{quote}

Yep - that's why I added that method (just didn't get around to implementing it that way).

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607-contrib.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

Latest patch attached with multi-threaded test.

Some quick performance tests:
10240 unique strings, 100 threads:  29% faster  (larger than the cache capacity - will lead to evictions)
100 unique strings, 100 threads: 147% faster
100 unique strings, 2 threads: 152% faster

Times were simply the time it took the junit test to run - includes other overhead like random number generation and error checking.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
The implementation of Interner doesn't look threadsafe to me. At the very 
least shouldn't 'pool' be marked volatile, otherwise the result of 'pool 
= newPool' is not guaranteed to be visible to other threads?

> [
> https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j
> ira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Earwin Burrfoot updated LUCENE-1607:
> ------------------------------------
> Attachment: intern.patch
> 
>> String.intern() faster alternative
>> ----------------------------------
>> Key: LUCENE-1607
>> URL: https://issues.apache.org/jira/browse/LUCENE-1607
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Reporter: Earwin Burrfoot
>> Fix For: 2.9
>> Attachments: intern.patch
>> 
>> By using our own interned string pool on top of default,
>> String.intern() can be greatly optimized.
>> 
>> On my setup (java 6) this alternative runs ~15.8x faster for already
>> interned strings, and ~2.2x faster for 'new String(interned)'
>> 
>> For java 5 and 4 speedup is lower, but still considerable.
>> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earwin Burrfoot updated LUCENE-1607:
------------------------------------

    Attachment: intern.patch

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earwin Burrfoot updated LUCENE-1607:
------------------------------------

    Attachment: LUCENE-1607.patch

This should do.
I replaced a pair of intern()s that appeared on trunk after last patch, initial cache size is 1024 and won't probably go up for most people.
Also renamed some variables while trying to understand the code, can revert them back if it's important for anyone.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704313#action_12704313 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

Is there 'any' benefit of dumping String.intern() compatibility? All those interns happen at startup anyway.
If future brings us something better (which I doubt in this case), we'll just swap the impl, which is all-private.

All in all I consider a practice of system-property-driven pluggable implementations a sickening one. Take MMapDirectory that was impossible to use until now without twiddling with startup keys, take SegmentReader/ReadonlySegmentReader - resulting code is ugly and nobody's going to override defaults anyway, all important dependencies are package-private.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Derek DeMarco (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700629#action_12700629 ] 

Derek DeMarco commented on LUCENE-1607:
---------------------------------------

Good point Yonik.  Making pool volatile should take care of it, but only on JVMs 1.5+, as before that volatile didn't prevent reordering of non-volatile reads/writes around it.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700593#action_12700593 ] 

Mark Miller commented on LUCENE-1607:
-------------------------------------

What was the field count? Is it still a considerable speedup with hundreds of fields without slowing anything else down ?(I would assume so, but would be nice to know considering a new hashmap is made per add - it is a one time hit though and the number of fields is not likely to exceed hundreds at the extreme)?

Also would be great to get the speedup numbers for Java 4 and 5.

I'll relate this to the 3 or 4 other field intern issues out there in a bit.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700827#action_12700827 ] 

Yonik Seeley edited comment on LUCENE-1607 at 4/20/09 8:55 AM:
---------------------------------------------------------------

bq. lack of collision resolve

My version was the most basic... a simple cache that is not guaranteed to always hit.  I also wanted to keep the overhead very low in case of misses (hence no re-probing).  In the best case, I don't think one can get much faster... and in the worst case it won't be much slower than simple String.intern().  There could be other implementations that do resizing of course.

bq. It'll break on non-power-of-two sizes. 

The size is guaranteed to be a power of two by the constructor.

      was (Author: yseeley@gmail.com):
    bq. lack of collision resolve

My version was the most basic... a simple cache that is not guaranteed to always hit.  I also wanted to keep the overhead very low in case of misses (hence no re-probing).  In the best case, I don't think one can get much faster... and in the worst case it won't be much slower.  There could be other implementations of course...

bq. It'll break on non-power-of-two sizes. 

The size is guaranteed to be a power of two by the constructor.
  
> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Patrick Eger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704246#action_12704246 ] 

Patrick Eger commented on LUCENE-1607:
--------------------------------------

As a quick comment on the implementation, i notice that it is possible (with reasonable probability) for hash collisions to result in re-interning a pair of strings multiple times. For example, a codepath that traverses across 32 unique string datapoints (say, in an inner loop somewhere), would have a minimum 3% probability of colliding and re-interning 2 strings every time through the loop. Because of the birthday paradox, it becomes likely to have such a situation (50% probability with ~150 unique values).

These are the probabilities of collision, assuming random distribution and perfect hashing. In real life the distribution will not be so random (string.hashCode() & MASK) so these will be "best case".
2 datapoints: collision prob = 0.006104%
4 datapoints: collision prob = 0.036617%
8 datapoints: collision prob = 0.170779%
16 datapoints: collision prob = 0.729976%
32 datapoints: collision prob = 2.983863%
64 datapoints: collision prob = 11.591861%
128 datapoints: collision prob = 39.188158%
256 datapoints: collision prob = 86.501947%


Practically this may or may not matter. My thought is that some sort of fast LRU structure would be better, but of course creating something like this without locking may be tricky. Another idea might be to support some form of limited hash-chaining or probing in the table, which would mitigate the damage of a collision significantly.


for reference, python code for calculating birthday/hash collisions, shamelessly stolen:
--------------------
def bp(n, d):
	v = 1.0
	for i in range(n):
		v = v * (1 - float(i)/d)
	return 1 - v

for n in [2, 4, 8, 16, 32, 64, 128, 256]:
	print "%i datapoints: collision prob = %f%%" % (n, bp(n, 16*1024)*100)

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731185#action_12731185 ] 

Mark Miller commented on LUCENE-1607:
-------------------------------------

looks like this is so close ...

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700696#action_12700696 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

Okay, you're probably right. It's not that hashmap can be corrupted on subsequent write, it's that reader threads can possibly see not-yet-built hashmap.
And volatile works, while simple sync doesn't, because volatile also orders reads. 
Hm. Hm. Hm.
Than, as you suggested, all we need is our own hash implementation that remains usable even being partially updated. I skipped through your patch and something looks suspicious to me. Like lack of collision resolve and that line:
int slot = h & (cache.length-1);
It'll break on non-power-of-two sizes.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700942#action_12700942 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. In no-collision resolution scheme, if you detect a collision early with hashcode, you still call String.intern(). 

In the old code, yes... that's why I left it commented out there.  In the latest code, we re-probe on a collision w/o calling compareTo (assuming hashCodes are unequal) and only call intern() if it's really a String we haven't seen before.  The new code is a real hash table that will store all the Strings requested through it, unlike the first variant that was a simple cache.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700599#action_12700599 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704252#action_12704252 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. As a quick comment on the implementation, i notice that it is possible (with reasonable probability) for hash collisions to result in re-interning a pair of strings multiple times.

For the first implementation, yes.  The latest implementation is guaranteed to only intern a string once ( the hashtable does probing and resizing.)


> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earwin Burrfoot updated LUCENE-1607:
------------------------------------

    Attachment: LUCENE-1607.patch

Okay, I thought more about that. Yonik is amazing.

The fastest hash we can get, should have no collisions. This is achievable by resizing on each new collision. Then we should introduce an upper bound for this process, for it not to blow up. Finally, we can use our upper bound for hash size from the start.

I benchmarked a bit, it works better than HashTable.
Somewhat better for already interned strings, much better for noninterned strings.
"s1==s2 || s1.compareTo(s2) == 0" combo amazingly works faster than s1.equals(s2).
Additional hashcode check makes sparse hash access a bit slower and doesn't really help with crowded hash.
Having a crowded hash degrades performance a lot. 

I updated my patch with Yonik's algorithm. Kept everything in statics (faster), allowed to change cache size through system property for adventurous types, default is 16k (works well for 100 values)

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700935#action_12700935 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

bq. Collisions should also be very quick to skip by comparing the hash code (which is cached for Strings).
In no-collision resolution scheme, if you detect a collision early with hashcode, you still call String.intern(). That kills any benefit gained from hashcode check vs compareTo. And if no collision is detected, introducing this check slows things down a little. Yes, that surprised me too.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700827#action_12700827 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. lack of collision resolve

My version was the most basic... a simple cache that is not guaranteed to always hit.  I also wanted to keep the overhead very low in case of misses (hence no re-probing).  In the best case, I don't think one can get much faster... and in the worst case it won't be much slower.  There could be other implementations of course...

bq. It'll break on non-power-of-two sizes. 

The size is guaranteed to be a power of two by the constructor.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704306#action_12704306 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

The last patch removed the ability to plug a different implementation, but I guess we don't need that until we have another implementation (and it's not clear if the benefits of dumping String.intern() compatability in the future outweigh the disadvantages of breaking back compatibility).

I plan on committing this shortly.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706129#action_12706129 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

Bug in previous algo (unbounded hash):

      // check cached hashCode first, and use compareTo to avoid
      // dynamic type checking in equals().
      if (h == other.hashCode() && s.compareTo(other)==0) {
        return s;          <--- here we should return 'other'
      }


> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718181#action_12718181 ] 

Michael McCandless commented on LUCENE-1607:
--------------------------------------------

Yonik is this ready to go in...?

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746773#action_12746773 ] 

Noble Paul commented on LUCENE-1607:
------------------------------------

isn't it possible to make the call to 
{code}
public String intern(char[] arr, int offset, int len) {
}
{code}
not create a String? 
currently , if I have a char[] it ends up creating a String (which makes a copy of the char[]) before it does the intern() operation.


> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607-contrib.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
> How about benchmarking with eg a ConcurrentHashMap instead?

Scratch that, I forgot about the 1.3/1.4 dependency...




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
> You was. I just wanted to point out that in real apps you're not going
> to see stale data longer than for milliseconds

Agreed, which is why this whole discussion is very theoretical anyway :)

> On cache miss, I re-retrieve pool reference after a lock (HashMap is
> no longer stale), re-read a string, and do the write only if the
> string is still not there.

Ah I see what you're saying now, I had been overlooking the re-retrieval 
of the pool reference. Thanks for clarifying that for me.

> The truth is born in argument. I reread jmm docs to be sure, but I
> can't guarantee I understood them well.

Seems like you have a pretty good understanding to me :)




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Earwin Burrfoot <ea...@gmail.com>.
> Sorry I wasn't as clear as I could have been - I realise JEE servers use a
> threadpool for handling requests, I was thinking of many other applications
> in the real world I'm aware of that don't (be that good design or
> otherwise...).
You was. I just wanted to point out that in real apps you're not going
to see stale data longer than for milliseconds (unless you're some
processor cache guru doing his black magic). Besides calling intern(),
threads synchronize in a gazillion other ways, and each sync updates
stale stuff.

>  I'm not sure I follow you though when you say "it won't even
> do a write" on a cache miss, it copies the (potentially stale) HashMap in
> that situation does it not?
On cache miss, I re-retrieve pool reference after a lock (HashMap is
no longer stale), re-read a string, and do the write only if the
string is still not there.

> I was only trying to raise the stale/visibility issue but it's clear you've
> already given that plenty of thought.
The truth is born in argument. I reread jmm docs to be sure, but I
can't guarantee I understood them well.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
Sorry I wasn't as clear as I could have been - I realise JEE servers use 
a threadpool for handling requests, I was thinking of many other applications 
in the real world I'm aware of that don't (be that good design or otherwise...). 
I'm not sure I follow you though when you say "it won't even do a write" 
on a cache miss, it copies the (potentially stale) HashMap in that situation 
does it not?

I still think there are some *theoretical* dangers (eg a thread replacing 
a well populated cache with a stale/empty copy). Just consider me pessimistic 
after having been burnt in the past by similar 'impossibilities' ;-) Overall 
though I'm in agreement that there aren't any practical issues with this 
code and in fact it'll give a good performance boost over String.intern() 
even in the worst case anyway! I was only trying to raise the stale/visibility 
issue but it's clear you've already given that plenty of thought.


> On Sun, Apr 19, 2009 at 23:42, Chris Miller <ch...@kbcfp.com>
> wrote:
> 
>>> As soon as all possible fields are in the pool, we're essentially
>>> readonly.
>>> 
>> The problem is, there's no guarantee we will ever reach this point.
>> For example suppose you have a server app that spawns a new thread
>> per request. Each new thread might have to make all the .intern()
>> calls again because they never see anything in the cache.
>> 
> No sane server app is actually spawning a new thread per request, it
> takes one from pool. A spawned thread inherits visibility of its
> parent thread (at the moment of spawning). Even if a new thread still
> sees stale pool, it'll be updated on first cache miss. It won't even
> do a write. So, no "all the intern calls again", such scenario doesn't
> exist even in theory.
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Sun, Apr 19, 2009 at 23:42, Chris Miller <ch...@kbcfp.com> wrote:
>> As soon as all possible fields are in the pool, we're essentially
>> readonly.
> The problem is, there's no guarantee we will ever reach this point. For
> example suppose you have a server app that spawns a new thread per request.
> Each new thread might have to make all the .intern() calls again because
> they never see anything in the cache.

No sane server app is actually spawning a new thread per request, it
takes one from pool. A spawned thread inherits visibility of its
parent thread (at the moment of spawning). Even if a new thread still
sees stale pool, it'll be updated on first cache miss. It won't even
do a write. So, no "all the intern calls again", such scenario doesn't
exist even in theory.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
> As soon as all possible fields are in the pool, we're essentially
> readonly.

The problem is, there's no guarantee we will ever reach this point. For example 
suppose you have a server app that spawns a new thread per request. Each 
new thread might have to make all the .intern() calls again because they 
never see anything in the cache. Having said that, I agree that the code 
will still work correctly regardless, and this is a very unlikely scenario 
anyway - for most practical situations this is never going to be an issue.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Sun, Apr 19, 2009 at 23:16, Chris Miller <ch...@kbcfp.com> wrote:
> As far as I can see, both these implementations only suffer from
> threadsafety problems in that they don't guarantee visibility across
> threads, ie it's possible for threads to see stale data.

> So the code should work fine if you can live with the consequences of stale
> data - in this case, the (remote) possibility of large performance
> differences between VMs.

I guarantee just enough visibility for this code to never ever produce
incorrect results.
As soon as all possible fields are in the pool, we're essentially readonly.
Take whatever JVM you have out there, readonly will beat anything
performance-wise :)

> How about benchmarking with eg a ConcurrentHashMap instead?
No Java 5 until 3.0. It will be slower anyway.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by Chris Miller <ch...@kbcfp.com>.
As far as I can see, both these implementations only suffer from threadsafety 
problems in that they don't guarantee visibility across threads, ie it's 
possible for threads to see stale data. I don't see any prospect of corruption 
or race conditions due to out-of-order execution. So the code should work 
fine if you can live with the consequences of stale data - in this case, 
the (remote) possibility of large performance differences between VMs. Personally 
I tend to avoid such fragile and hard to maintain code unless there's a very 
good reason for it.

How about benchmarking with eg a ConcurrentHashMap instead?


> [
> https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j
> ira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=127
> 00600#action_12700600 ]
> 
> Mark Miller commented on LUCENE-1607:
> -------------------------------------
> bq. Earwin, I took a quick look at your implementation just now, but
> it doesn't look thread-safe.
> 
> That was my first impression too, but I couldnt pin down the issue.
> The access will either be against the old pool, or it will be against
> the new pool, and the instance switch should be atomic? I figured it
> was a clever trick of some kind (though I did wonder about the cost of
> making the new hashmap every add). The HashMaps are read only right
> (once they can be accessed by multiple threads)? And they are popped
> in with an atomic variable assignment?
> 
>> String.intern() faster alternative
>> ----------------------------------
>> Key: LUCENE-1607
>> URL: https://issues.apache.org/jira/browse/LUCENE-1607
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Reporter: Earwin Burrfoot
>> Fix For: 2.9
>> Attachments: intern.patch, LUCENE-1607.patch
>> 
>> By using our own interned string pool on top of default,
>> String.intern() can be greatly optimized.
>> 
>> On my setup (java 6) this alternative runs ~15.8x faster for already
>> interned strings, and ~2.2x faster for 'new String(interned)'
>> 
>> For java 5 and 4 speedup is lower, but still considerable.
>> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700600#action_12700600 ] 

Mark Miller commented on LUCENE-1607:
-------------------------------------

bq. Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe. 

That was my first impression too, but I couldnt pin down the issue. The access will either be against the old pool, or it will be against the new pool, and the instance switch should be atomic? I figured it was a clever trick of some kind (though I did wonder about the cost of making the new hashmap every add). The HashMaps are read only right (once they can be accessed by multiple threads)? And they are popped in with an atomic variable assignment?


> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

corrected load factor check (the previous code actually calculated 80%, not 75%).

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704247#action_12704247 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. why h >> 7?

Was copied from Solr's hashing of doc ids... we didn't want to throw away too many lower bits since they were likely to be the most random.  In string hashes, the rightmost bits also have the most entropy.

bq. Should I add a patch with Yonik's last hash impl and all calls to String.intern() replaced to get it moving? 

That would be helpful, thanks!

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718198#action_12718198 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

bq. but I was waiting for some kind of feedback if people in general thought it was the right approach. It introduces another static, and people tend to not like that.
Just forgot somehow about this issue.
You're right about static, it's not clear how and when to initialize it, plus you introduce some public classes we'll be unable to change/remove later.
I still have a feeling we should expose a single static method - intern() and hide implementation away, possibly tuning it to be advantageous for <thousands of fields, and degrading to raw String.intern() level if there are more fields.

I'm going to be away from AC power for three days starting now, so I won't be able to reply until then.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704306#action_12704306 ] 

Yonik Seeley edited comment on LUCENE-1607 at 4/29/09 1:23 PM:
---------------------------------------------------------------

The last patch removed the ability to plug a different implementation, but I guess we don't need that until we have another implementation (and it's not clear if the benefits of dumping String.intern() compatability in the future outweigh the disadvantages of breaking back compatibility).

Rethinking what possible problems this could have... I'm not sure it should be committed in it's current form.  One problem is potentially unlimited growth where there was not before.  This could happen in a long running search system where users enter a variety of field names that don't even exist.

A WeakReference based approach would work (as String.intern() uses), but that's more heavyweight and could need synchronization since we are dealing with more complex objects.
Another option is to go back to a simple cache based approach which could still pin otherwise unreferenced Strings in memory, but would have a bounded size.  Perhaps this argues for reinstating the pluggability of the intern implementation.


      was (Author: yseeley@gmail.com):
    The last patch removed the ability to plug a different implementation, but I guess we don't need that until we have another implementation (and it's not clear if the benefits of dumping String.intern() compatability in the future outweigh the disadvantages of breaking back compatibility).

I plan on committing this shortly.
  
> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Patrick Eger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704261#action_12704261 ] 

Patrick Eger commented on LUCENE-1607:
--------------------------------------

Ah I see thanks. Was looking at the most recent patch (by  date).

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated LUCENE-1607:
-------------------------------

    Attachment: LUCENE-1607.patch

I haven't quite tested it. But it is an idea as a patch. 

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607-contrib.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

bq. Yonik, the string is being interned twice in your latest patch 

Thanks - I had actually fixed that... but it didn't make it into the patch apparently :-)

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700601#action_12700601 ] 

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

bq. This default would be more back compatible since programs may rely on String instances being interned via String.intern(). 
My version is also String.intern()-compatible

bq. Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe.
unlock for one thread happens-before lock on the same monitor for the other thread, inside each thread each action happens-before the next one
Since I get pool reference for the second time after the lock, and write pool reference before unlocking, everything's fine. As for the other threads, if they find what they need in the pool, it doesn't matter if they're seeing a stale pool, or not. If they don't find what they need, they hit the lock, re-retrieve pool reference, getting the latest one and either find what they need there, or write.
Correct me if I'm wrong?

bq. though I did wonder about the cost of making the new hashmap every add
It's COSTLY :) But you're going to pay all of it at startup.

I think we can introduce the class (without any interfaces, why should we need one here?), and then try to make it faster by switching storage. I tried GNU Trove THashMap, but on Java 6 it was slower than stock HashMap.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739628#action_12739628 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

bq. Except I'd like the javadoc demand each impl to be String.intern()-compatible.

If *everything* went through the same intern, it wouldn't matter. 

bq. rewrite 'for' as 'for (Entry e = first;e != null;e = e.next)' for clarity?

done.

bq. If check around 'nextToLast = e' can also be removed?

I don't see how...

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-1607) String.intern() faster alternative

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-1607:
-----------------------------------

    Assignee: Yonik Seeley

Or push to 3.1. I have no preference if it goes in as is, just looking for it to be managed one way or another for 2.9 release.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>            Assignee: Yonik Seeley
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720274#action_12720274 ] 

Yonik Seeley commented on LUCENE-1607:
--------------------------------------

I've held off because of a lack of consensus, but I suppose I can do the old "i'll commit in a few days" thing.

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1607) String.intern() faster alternative

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
---------------------------------

    Attachment: LUCENE-1607.patch

Here's a slightly updated patch (javadoc and StringHelper.intern() convenience method).

The method of plugging a different StringInterner implementation would be to simply assign to the static in StringHelper.

{code}
  /**
   * Expert:
   * The StringInterner implementation used by Lucene.
   * This should never be changed after other Lucene APIs have been used.
   */
  public static StringInterner interner = new SimpleStringInterner(1024,8);
{code}

Are people comfortable with this approach?  How about the defaults?  This should speed up anyone with under thousands of field, and only slightly slow down someone with over thousands of fields (until they plugged in a new interner with a bigger table size).



> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org