Posted to dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2010/09/22 07:26:32 UTC

[jira] Created: (LUCENE-2662) BytesHash

BytesHash
---------

                 Key: LUCENE-2662
                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: Realtime Branch
            Reporter: Jason Rutherglen
             Fix For: Realtime Branch


This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914478#action_12914478 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

> BytesRefHash is now final and does not create Entry objects anymore

That's good.

> move ByteBlockPool to o.a.l.utils

Sure why not.

> factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

We need to factor it out of THPF; otherwise this patch isn't really useful for committing. Also, it'll then get exercised by the entirety of the unit tests, i.e., it'll get put through the laundry.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913642#action_12913642 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

bq. Simon, when do you think you'll be posting?

Maybe within the next week; I have a busy schedule. But does this patch keep you from doing any work? You shouldn't just pull stuff out of month-old patches, especially when you don't even give me time to reply on the original discussion.

Any rush on this?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2662:
------------------------------------

    Attachment: LUCENE-2662.patch

We are almost there. I factored BytesRefHash out of TermsHashPerField, with just two "nocommit" parts left in the code that I need to find a solution for.

* there needs to be a way to communicate the byte usage up to DocumentsWriter, which I haven't explored yet
* textStarts in ParallelPostingsArray needs to be replaced since it is already maintained in BytesRefHash. I will need to look closer into that but suggestions are welcome. One way to do it would be to attach a reference to BRH instead of the textStart, but that is a naive suggestion since I haven't looked into it in more detail.

All tests are passing so far and TermsHashPerField looks somewhat cleaner. I will work on fixing those nocommits and run some indexing perf tests against the patch.



> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915713#action_12915713 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

{quote}
How about renaming key back to ord? And then maybe rename values to
bytesStart? And in their decls add comments saying they are indexed
by hash code? And maybe rename addByOffset -> addByBytesStart?
{quote}
I don't like addByBytesStart; I would like to keep offset since it really is an offset into the pool. addByPoolOffset?
The names ord and bytesStart are a good compromise :) Let's shoot for that.


{quote}
On the nocommit in ByteBlockPool - I think that's fine? It's an
internal class....
{quote}
You refer to this: // nocommit - public arrays are not nice! ?
Yeah, that's more of a style thing, but if somebody changes them it's their fault for being stupid, I guess.

{quote}
The nocommit in BytesRefHash seems wrong? (Ie, compact is used
internally)... though maybe we make it private if it's not used
externally?
{quote}

Ah yeah, that's bogus - it's from a previous iteration which was wrong as well. I will remove it.

{quote}
On the "nocommit factor this out!" in THPF.java... I agree, the
postingsArray.textStarts should go away right? Ie, it's a
[wasteful] copy of what the BytesRefHash is already storing?
{quote}
Yeah, that is the reason for that nocommit. I thought about this a little and I have two options for this.
* we could factor out a superclass from ParallelPostingsArray which only has the textStart int array plus the grow and copy methods, and let ParallelPostingsArray subclass it.
BytesRefHash would accept this class (I don't have a good name for it, but let's call it TextStartArray for now) and use it internally. It would call grow() when needed inside BytesRefHash, and all the other code would be unchanged since PPA is a subclass.
* the other way would be to bind the BytesRefHash to the postings array, which seems odd to me though.
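
For illustration only, the first option could be sketched roughly as below. This is not code from the patch: TextStartArray is just the placeholder name used above, PostingsArraySketch is a hypothetical stand-in for ParallelPostingsArray, and the growth factor is an arbitrary assumption.

```java
// Hypothetical base class: owns only the int[] of text starts plus grow/copy.
class TextStartArray {
  int[] textStarts;

  TextStartArray(int size) {
    textStarts = new int[size];
  }

  int size() {
    return textStarts.length;
  }

  // Subclasses override this so that grow() returns their own type.
  TextStartArray newInstance(int size) {
    return new TextStartArray(size);
  }

  // Copies the first numToCopy entries into the target instance.
  void copyTo(TextStartArray target, int numToCopy) {
    System.arraycopy(textStarts, 0, target.textStarts, 0, numToCopy);
  }

  // Returns a larger instance of the same runtime type (assumed 1.5x growth).
  final TextStartArray grow(int numToCopy) {
    int newSize = Math.max(size() + 1, size() + (size() >> 1));
    TextStartArray bigger = newInstance(newSize);
    copyTo(bigger, numToCopy);
    return bigger;
  }
}

// Stand-in for ParallelPostingsArray: adds its own parallel arrays.
class PostingsArraySketch extends TextStartArray {
  int[] intStarts;
  int[] byteStarts;

  PostingsArraySketch(int size) {
    super(size);
    intStarts = new int[size];
    byteStarts = new int[size];
  }

  @Override
  TextStartArray newInstance(int size) {
    return new PostingsArraySketch(size);
  }

  @Override
  void copyTo(TextStartArray target, int numToCopy) {
    super.copyTo(target, numToCopy);
    PostingsArraySketch other = (PostingsArraySketch) target;
    System.arraycopy(intStarts, 0, other.intStarts, 0, numToCopy);
    System.arraycopy(byteStarts, 0, other.byteStarts, 0, numToCopy);
  }
}
```

The point of the shape is that the hash only ever sees the base type and calls grow(), while the subclass keeps all of its parallel arrays the same length.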

More ideas?

{quote}
Can we impl BytesRefHash.bytesUsed as an AtomicLong (hmm maybe
AtomicInt - none of these classes can address > 2GB)? Then the
pool would add in blockSize every time it binds a new block. That
method (DW.bytesUsed) is called alot - at least once on every
addDoc.
{quote}

I did exactly that in the not-yet-uploaded patch. But I figured it would maybe make more sense to use that AtomicInt in the allocator as well as in THPF, or is that what you mean?

{quote}
I'm confused again - when do we use RecyclingByteBlockAllocator
from a single thread...? Ie, why did the sync need to be
conditional for this class, again....? It seems like we always
need it sync'd (both the main pool & per-doc pool need this)? If
so we can simplify and make these methods sync'd?
{quote}

Man, I am sorry - I thought I would use this in LUCENE-2186 in a single-threaded env, but if so I should change it there if needed. I was one step ahead though.
I will change it and maybe add a second one if needed. Agree?

simon








> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916913#action_12916913 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

OK my 2nd indexing test (10M wikipedia docs, flush @ 256 MB ram used) finished and trunk/patch are essentially the same throughput, and, all flushes happened at identical points.  So I think we are good to go...

Nice work Simon!

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913599#action_12913599 ] 

Robert Muir commented on LUCENE-2662:
-------------------------------------

Jason: I am confused... there is no hash impl in TermsHashPerField.

the hashing, and term encoding and other things, is the responsibility of the analysis chain (TermToBytesRefAttribute):
{code}
    // Get the text & hash of this term.
    int code = termAtt.toBytesRef(utf8);
{code}

this way, implementations can 'hash-as-they-go' like we do when encoding unicode char[] -> byte[],
or they can simply return BytesRef.hashCode() if they don't have an optimized implementation.
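
As a rough illustration of "hash-as-you-go" (a sketch, not Lucene's UnicodeUtil): the hypothetical encodeWithHash below encodes UTF-16 text to UTF-8 and folds every emitted byte into a hash in the same pass. The 31*h + b hash and all names are assumptions made for this example, and unpaired surrogates are not handled.

```java
class HashAsYouGo {
  // Encoded bytes (valid up to length) plus the hash computed while encoding.
  static final class Encoded {
    final byte[] bytes;
    final int length;
    final int hash;

    Encoded(byte[] bytes, int length, int hash) {
      this.bytes = bytes;
      this.length = length;
      this.hash = hash;
    }
  }

  static Encoded encodeWithHash(CharSequence text) {
    // Worst case is 3 bytes per UTF-16 unit (a surrogate pair is 2 units -> 4 bytes).
    byte[] out = new byte[text.length() * 3];
    int upto = 0;
    int hash = 0;
    for (int i = 0; i < text.length(); ) {
      int cp = Character.codePointAt(text, i);
      i += Character.charCount(cp);
      int start = upto;
      if (cp < 0x80) {
        out[upto++] = (byte) cp;
      } else if (cp < 0x800) {
        out[upto++] = (byte) (0xC0 | (cp >> 6));
        out[upto++] = (byte) (0x80 | (cp & 0x3F));
      } else if (cp < 0x10000) {
        out[upto++] = (byte) (0xE0 | (cp >> 12));
        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (cp & 0x3F));
      } else {
        out[upto++] = (byte) (0xF0 | (cp >> 18));
        out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (cp & 0x3F));
      }
      // Fold the bytes just written into the hash; no second pass over the output.
      for (int j = start; j < upto; j++) {
        hash = 31 * hash + out[j];
      }
    }
    return new Encoded(out, upto, hash);
  }

  // Same hash computed in a separate pass, for implementations with no fused path.
  static int hashOf(byte[] bytes, int length) {
    int hash = 0;
    for (int i = 0; i < length; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }
}
```

Both paths must produce the same value for the same bytes; otherwise a consumer that re-hashes stored bytes would disagree with the producer.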


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916767#action_12916767 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

Simon, looks good.

Are we using:
{code}
public int add(BytesRef bytes, int code)
{code}

yet?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2662:
------------------------------------

    Attachment: LUCENE-2662.patch

Next iteration - seems to be very close!

I have applied the following changes:

* introduced an AtomicLong to track bytesUsed in DocumentsWriter, TermsHashPerField, BytesRefHash and RecyclingByteBlockAllocator
* factored out a BytesStartArray class from BytesRefHash that manages the int[] holding the bytesStart offsets. TermsHashPerField subclasses it and manages the ParallelPostingsArray through it.
* removed the remaining nocommits
* made RecyclingByteBlockAllocator synced by default (we use synchronized methods for it now)
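
A minimal sketch of the shared-counter idea from the first and last items, under stated assumptions: TrackingAllocator and WriterSketch are hypothetical stand-ins (not the classes in the patch), and the block size and flush threshold are made up.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical allocator: every block it hands out or drops updates a shared counter.
class TrackingAllocator {
  static final int BLOCK_SIZE = 1 << 15; // assumed block size for this sketch

  private final AtomicLong bytesUsed;

  TrackingAllocator(AtomicLong bytesUsed) {
    this.bytesUsed = bytesUsed;
  }

  synchronized byte[] getByteBlock() {
    bytesUsed.addAndGet(BLOCK_SIZE);
    return new byte[BLOCK_SIZE];
  }

  // Drops blocks[start..end) and gives their bytes back to the counter.
  synchronized void release(byte[][] blocks, int start, int end) {
    for (int i = start; i < end; i++) {
      if (blocks[i] != null) {
        bytesUsed.addAndGet(-blocks[i].length);
        blocks[i] = null;
      }
    }
  }
}

// Hypothetical writer: reading the counter is cheap enough to do on every added doc.
class WriterSketch {
  final AtomicLong bytesUsed = new AtomicLong();
  final TrackingAllocator allocator = new TrackingAllocator(bytesUsed);
  private final long ramBufferBytes;

  WriterSketch(long ramBufferBytes) {
    this.ramBufferBytes = ramBufferBytes;
  }

  boolean shouldFlush() {
    return bytesUsed.get() >= ramBufferBytes;
  }
}
```

Because all components add to the same AtomicLong, the writer never has to walk its sub-structures to sum up memory; it just reads one value.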

I ran a quick Wikipedia 100k docs benchmark against trunk vs. LUCENE-2662 and the results are promising.
||version||rec/sec||elapsed sec||avgUsedMem||
|LUCENE-2662|717.30|139.41|536,682,592|
|trunk|682.66|146.49|546,065,344|

I will run the 10M benchmark once I get back to this.


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916875#action_12916875 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

In RecyclingByteBlockAllocator.recycleByteBlocks, if we cannot recycle all of the blocks (ie because it exceeds maxBufferedBlocks), we are failing to null out the entries in the incoming array?

Also maybe rename pos -> freeCount?  (pos is a little too generic?)
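
To make the concern concrete, here is a hedged sketch of a recycling allocator that nulls out every incoming entry, including the ones it cannot buffer. RecyclingAllocatorSketch is a hypothetical class written for this example, not the RecyclingByteBlockAllocator from the patch; it borrows the freeCount name suggested above.

```java
class RecyclingAllocatorSketch {
  private final int blockSize;
  private final int maxBufferedBlocks;
  private final byte[][] freeBlocks;
  private int freeCount; // number of blocks currently buffered for reuse

  RecyclingAllocatorSketch(int blockSize, int maxBufferedBlocks) {
    this.blockSize = blockSize;
    this.maxBufferedBlocks = maxBufferedBlocks;
    this.freeBlocks = new byte[maxBufferedBlocks][];
  }

  synchronized byte[] getByteBlock() {
    if (freeCount == 0) {
      return new byte[blockSize];
    }
    byte[] block = freeBlocks[--freeCount];
    freeBlocks[freeCount] = null;
    return block;
  }

  synchronized void recycleByteBlocks(byte[][] blocks, int start, int end) {
    int numToKeep = Math.min(maxBufferedBlocks - freeCount, end - start);
    int stop = start + numToKeep;
    for (int i = start; i < stop; i++) {
      freeBlocks[freeCount++] = blocks[i];
      blocks[i] = null;
    }
    // Blocks that do not fit are dropped, but the caller's array must still be
    // cleared so it does not keep them reachable.
    for (int i = stop; i < end; i++) {
      blocks[i] = null;
    }
  }

  synchronized int freeCount() {
    return freeCount;
  }
}
```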

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916872#action_12916872 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

I indexed 10M 1KB wikipedia docs, single threaded, and also see things a bit faster w/ the patch (10,353 docs/sec vs 10,182 docs/sec).  Nice to have a refactor improve performance for a change, heh.

The avgUsedMem was quite a bit higher (1.5GB vs 1.0GB), but, I'm not sure this stat is trustworthy.... I'll re-run w/ infoStream enabled to see if anything looks suspicious (eg, we are somehow not tracking bytes used correctly).

Still, the resulting indices had identical structure (ie we seem to flush at exactly the same points), so I think bytes used is properly tracked.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914888#action_12914888 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

A useful API change to BBP: instead of passing the "size in bytes" to newSlice, pass in the level and/or the size. In fact, throughout the codebase, only one level's size, specifically the first, is ever passed into the newSlice method. The reason I want this change is that I'm recording the level of the last slice for the skip list in LUCENE-2312.
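
A rough sketch of what a level-based entry point could look like, under loud assumptions: SlicePoolSketch is hypothetical, it uses one growable array instead of fixed-size blocks, and the level table values are illustrative rather than taken from the patch.

```java
import java.util.Arrays;

class SlicePoolSketch {
  // Illustrative level table: slice size in bytes per level, and the level that follows.
  static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
  static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

  private byte[] buffer = new byte[64];
  private int upto;

  // Size-based variant: the caller has to know the byte size of the level it wants.
  int newSlice(int size) {
    if (upto + size > buffer.length) {
      buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, upto + size));
    }
    int start = upto;
    upto += size;
    return start;
  }

  // Level-based variant: the caller passes (and can record) the level itself.
  int newSliceAtLevel(int level) {
    return newSlice(LEVEL_SIZE[level]);
  }

  static int nextLevel(int level) {
    return NEXT_LEVEL[level];
  }
}
```

With the level-based variant, a caller that remembers the level of its last slice can ask for the following one via nextLevel(level) without mapping sizes back to levels.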

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2662:
------------------------------------

    Attachment: LUCENE-2662.patch

This patch contains a slightly different version of BytesHash (I renamed it to BytesRefHash, but that is to be discussed - while writing this I actually think BytesHash is the better name).  BytesRefHash is now final and does not create Entry objects anymore. Internally it maintains two integer arrays, one acting as the hash buckets and the other containing the bytes-start offsets into the ByteBlockPool. Each added entry is assigned an increasing ordinal, since this is what Entry is used for in almost all use cases (in CSF though). For TermsHashPerField this is also "native" since it uses the same kind of referencing system.

These changes keep this class as efficient as possible, keep GC costs low, and allow the JIT to do better optimizations. IMO this class is super performance critical, and since we recently refactored indexing towards parallel arrays, adding another "object" array might not be the way to go anyway.
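
The two-array layout described above can be sketched as follows. This is an illustration under assumptions, not the attached patch: BytesHashSketch is hypothetical, a single growable byte[] stands in for ByteBlockPool, entries carry a 2-byte length prefix, and returning -(ord + 1) for an existing entry is a convention chosen for this example.

```java
import java.util.Arrays;

final class BytesHashSketch {
  private byte[] pool = new byte[64];    // stand-in for ByteBlockPool
  private int poolUpto;
  private int[] ords = new int[16];      // hash buckets: slot -> ord, -1 if empty
  private int[] bytesStart = new int[8]; // ord -> start offset of the entry in the pool
  private int count;

  BytesHashSketch() {
    Arrays.fill(ords, -1);
  }

  int add(byte[] bytes) {
    return add(bytes, hash(bytes, 0, bytes.length));
  }

  // code must equal hash(bytes, 0, bytes.length); returns the new ord,
  // or -(ord + 1) if the bytes were already present.
  int add(byte[] bytes, int code) {
    int mask = ords.length - 1;
    int slot = code & mask;
    while (ords[slot] != -1) {           // linear probing
      if (equalsAt(ords[slot], bytes)) {
        return -(ords[slot] + 1);
      }
      slot = (slot + 1) & mask;
    }
    int ord = count++;
    if (ord == bytesStart.length) {
      bytesStart = Arrays.copyOf(bytesStart, ord * 2);
    }
    bytesStart[ord] = append(bytes);
    ords[slot] = ord;
    if (count * 2 > ords.length) {
      rehash();
    }
    return ord;
  }

  byte[] get(int ord) {
    int start = bytesStart[ord];
    return Arrays.copyOfRange(pool, start + 2, start + 2 + lengthAt(start));
  }

  int size() {
    return count;
  }

  static int hash(byte[] b, int off, int len) {
    int h = 0;
    for (int i = off; i < off + len; i++) {
      h = 31 * h + b[i];
    }
    return h;
  }

  private int append(byte[] bytes) {
    if (bytes.length > 0xFFFF) {
      throw new IllegalArgumentException("entry too long for 2-byte length prefix");
    }
    int needed = poolUpto + 2 + bytes.length;
    if (needed > pool.length) {
      pool = Arrays.copyOf(pool, Math.max(pool.length * 2, needed));
    }
    int start = poolUpto;
    pool[poolUpto++] = (byte) (bytes.length >> 8);
    pool[poolUpto++] = (byte) bytes.length;
    System.arraycopy(bytes, 0, pool, poolUpto, bytes.length);
    poolUpto += bytes.length;
    return start;
  }

  private int lengthAt(int start) {
    return ((pool[start] & 0xFF) << 8) | (pool[start + 1] & 0xFF);
  }

  private boolean equalsAt(int ord, byte[] bytes) {
    int start = bytesStart[ord];
    if (lengthAt(start) != bytes.length) {
      return false;
    }
    for (int i = 0; i < bytes.length; i++) {
      if (pool[start + 2 + i] != bytes[i]) {
        return false;
      }
    }
    return true;
  }

  // Doubles the bucket array and re-inserts every ord by re-hashing its pooled bytes.
  private void rehash() {
    int[] newOrds = new int[ords.length * 2];
    Arrays.fill(newOrds, -1);
    int mask = newOrds.length - 1;
    for (int ord = 0; ord < count; ord++) {
      int start = bytesStart[ord];
      int slot = hash(pool, start + 2, lengthAt(start)) & mask;
      while (newOrds[slot] != -1) {
        slot = (slot + 1) & mask;
      }
      newOrds[slot] = ord;
    }
    ords = newOrds;
  }
}
```

No per-entry object is allocated: an entry is just an ordinal plus an offset, so callers can use the ordinal to index their own parallel arrays.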

I also incorporated Robert's comments - thanks for the review. I guess that is the first step towards factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

comments / review welcome!!

One more thing: I did not move ByteBlockPool to o.a.l.utils but I think it belongs there. Thoughts?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2662:
-------------------------------------

    Attachment: LUCENE-2662.patch

We need unit tests and a base implementation as BytesHash is abstract...

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917188#action_12917188 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

Committed to trunk in rev. 1003790

@Jason: do you need that merged into Realtime-Branch or is buschmi going to do that? Otherwise I can help too

I will keep it open until this is merged into Realtime Branch

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914623#action_12914623 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

{quote}
bq. make sure JIT doesn't play nasty tricks with us again.

What would we do if this happens?
{quote}

Cry?

Or... install Harmony and see if it has the same problem and if so submit a patch to them to fix it :)

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913628#action_12913628 ] 

Robert Muir commented on LUCENE-2662:
-------------------------------------

Jason: what I am saying is if i look at the method in your patch:

public T add(BytesRef bytes)

the first thing it does is compute the hash, but this is already computed in the analysis chain.

why not have
{code}
public T add(BytesRef bytes, int hashCode)
{code}

and also:
{code}
public T add(BytesRef bytes) {
  return add(bytes, bytes.hashCode());
}
{code}

then we can avoid computing this twice, and keep the optimization in UnicodeUtil


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2662:
------------------------------------

    Attachment: LUCENE-2662.patch

This patch fixes nulling out the recycled but not reused byte blocks in RecyclingByteBlockAllocator.

I think we are ready to go; I will commit to trunk soon. I don't think we need a CHANGES.TXT entry here - at least I cannot find any section this refactoring would fit into.

simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916799#action_12916799 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

bq. Are we using:...

Yeah, look at the add() method in TermsHashPerField:
{code}
       termID = bytesHash.add(termBytesRef, termAtt.toBytesRef(termBytesRef));
{code}

simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2662:
-------------------------------------

    Priority: Minor  (was: Major)

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2662:
-------------------------------------

    Affects Version/s:     (was: Realtime Branch)
        Fix Version/s:     (was: Realtime Branch)

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916873#action_12916873 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

bq. Still, the resulting indices had identical structure (ie we seem to flush at exactly the same points), so I think bytes used is properly tracked.

Sorry, scratch that -- I was inadvertently flushing by doc count, not by RAM usage.  I'm re-running w/ flush-by-RAM to verify we flush at exactly the same points as trunk.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914627#action_12914627 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

bq. In the class jdocs, I think state that this is basically a Map<BytesRef,int>?
yeah that simplifies it - will do.

bq. Maybe we also move ByteBlockPool --> oal.util?
yeah I did that already - that makes total sense

bq. Maybe move out the ByteBlockAllocator to its own class (in util)? RecyclingByteBlockAllocator?
+1 yeah I like that - I also think we should allow passing the block pool to the BytesHash instead of the allocator. From what I can tell now, this is necessary for the refactoring anyway, since we share pools with secondary TermsHash instances in the term vector case.

{quote}
Maybe rename ords -> keys? And hash -> values? (The key isn't
really an "ord" (I think?) because it increases by more than 1
each time... it's more like an address since it references an
address in the byte-pool space).
{quote}
yeah that depends how you see it - the array index really is the ord though. but I like those names. I will change.

{quote}
We should advertise the limits in the jdocs - limited to <= 2GB
total byte storage, each key must be <= BLOCK SIZE-2 in length.
{quote}
I think I have done the latter already but I will add the other too.

{quote}
Can we have sortedEntries() not allocate a new iterator object?
Ie, just return the sorted bytesStart int[]? (This is what's done
today, and, for term vectors on small docs, this method is pretty
hot). And the javadocs for this should be stronger - it's not
that the behaviour is undefined after, it's that you must .clear()
after you're done consuming the sorted entries.
{quote}
Ah I see - good point. I think what you refer to is public int[] sort(Comparator<BytesRef> comp) - the iterator one is just a more convenient variant. I will change it though.

thanks mike!

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916355#action_12916355 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

{quote}we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it. {quote}

This option makes the most sense.  ParallelByteStartsArray?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913638#action_12913638 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

Simon, when do you think you'll be posting?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915723#action_12915723 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

{quote}
I don't like addByBytesStart; I would like to keep offset since it really is an offset into the pool. addByPoolOffset?
The names ord and bytesStart are a good compromise; let's shoot for that.
{quote}

OK!

bq. we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it.

This seems good?  So, this would be the "store" that BRH manages... and by subclassing it you can have other parallel arrays storing anything, indexed by ord.
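
The superclass idea quoted above could look roughly like the sketch below. It is only an illustration under assumptions: the class names BytesStartArraySketch and PostingsArraySketch, the termFreq field and the growth factor are hypothetical and not taken from the patch. The base class owns the per-ord start offsets plus the grow/copy logic, and a subclass adds further parallel arrays indexed by the same ord.

```java
import java.util.Arrays;

/** Hypothetical base store: one int per ord, holding the start offset of the term's bytes. */
class BytesStartArraySketch {
  int[] bytesStart;

  BytesStartArraySketch(int size) {
    bytesStart = new int[size];
  }

  int size() {
    return bytesStart.length;
  }

  /** Grows every parallel array to at least minSize, preserving existing entries. */
  void grow(int minSize) {
    if (minSize <= size()) {
      return;
    }
    // Over-allocate by roughly 50% so repeated grows stay cheap (assumed policy).
    int newSize = Math.max(minSize, size() + (size() >> 1) + 1);
    resize(newSize);
  }

  /** Subclasses override this to resize their own arrays as well. */
  void resize(int newSize) {
    bytesStart = Arrays.copyOf(bytesStart, newSize);
  }
}

/** Hypothetical subclass adding more per-ord data, as a postings array might. */
class PostingsArraySketch extends BytesStartArraySketch {
  int[] termFreq;

  PostingsArraySketch(int size) {
    super(size);
    termFreq = new int[size];
  }

  @Override
  void resize(int newSize) {
    super.resize(newSize);
    termFreq = Arrays.copyOf(termFreq, newSize);
  }
}
```

With this shape the hash only needs to know about the base class, while callers are free to hang additional arrays off the same ords.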

bq. I did exactly that in the not yet uploaded patch. But I figured that it would maybe make more sense to use that AtomicInt in the allocator as well as in THPF or is that what you mean?

I think we should use it everywhere to track bytes used ;)

bq. man, I am sorry - I thought I will use this in LUCENE-2186 in a single threaded env but if so I should change it there if needed. I was one step ahead though.
I will change and maybe have a second one if needed. Agree?

Ahh that's right I forgot the whole driver for this refactoring heh ;)  Yeah I think leave it sync'd for now and we can test if this hurts perf in LUCENE-2186?  "Supposedly" uncontended locks are low-cost (but I'm not sure...).

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924488#action_12924488 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

bq. Why is this issue still open, if the patch was already committed to trunk?

see my comment above: 

bq. I will keep it open until this is merged into Realtime Branch


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916988#action_12916988 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

I instrumented trunk & the patch to see how many times we do new byte[bufferSize] while building a 5M index, and they both alloc the same number of byte[] from the BBA.  So I don't think we have a memory issue...

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2662:
------------------------------------

        Fix Version/s: 4.0
    Affects Version/s: 4.0

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916882#action_12916882 ] 

Robert Muir commented on LUCENE-2662:
-------------------------------------

Simon, thank you for renaming the 'utf8' variables here. 


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913622#action_12913622 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

The THPF is hashing tokens for use in the indexing RAM buffer and the creation of postings, ie, the lookup of term byte[]s to term ids.  The hash component is currently interwoven into THPF.  

Here are some of the variables being used in THPF.

{code}
private int postingsHashSize = 4;
private int postingsHashHalfSize = postingsHashSize/2;
private int postingsHashMask = postingsHashSize-1;
private int[] postingsHash;
{code}

There are also the methods rehashPostings, shrinkHash and postingEquals, and add(int textStart) contains the lookup.
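
To make the role of those variables concrete, here is a minimal sketch of a power-of-two, open-addressing hash that maps term bytes to dense term ids. It is not the THPF code: the class name TermIdHashSketch and its methods are hypothetical, terms are copied into a plain list instead of a ByteBlockPool, and only the size / half-size / mask bookkeeping mirrors the fields shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: open-addressing hash mapping term bytes to dense term ids. */
final class TermIdHashSketch {
  private int hashSize = 4;                 // always a power of two
  private int hashHalfSize = hashSize / 2;  // grow once this many terms are stored
  private int hashMask = hashSize - 1;      // cheap modulo for power-of-two sizes
  private int[] hash = newTable(hashSize);  // slot -> term id, -1 means empty
  // term id -> bytes; a simple stand-in for a shared byte pool
  private final List<byte[]> terms = new ArrayList<>();

  private static int[] newTable(int size) {
    int[] table = new int[size];
    Arrays.fill(table, -1);
    return table;
  }

  /** Returns the term id for the given bytes, assigning the next id if the term is new. */
  int add(byte[] term) {
    int pos = Arrays.hashCode(term) & hashMask;
    // Linear probing: walk forward until we hit an empty slot or an equal term.
    while (hash[pos] != -1 && !Arrays.equals(terms.get(hash[pos]), term)) {
      pos = (pos + 1) & hashMask;
    }
    if (hash[pos] != -1) {
      return hash[pos]; // already present
    }
    int id = terms.size();
    terms.add(term.clone());
    hash[pos] = id;
    if (terms.size() == hashHalfSize) {
      rehash(hashSize * 2); // keep the table at most half full
    }
    return id;
  }

  int size() {
    return terms.size();
  }

  /** Re-inserts every term id into a larger table. */
  private void rehash(int newSize) {
    int[] newHash = newTable(newSize);
    int newMask = newSize - 1;
    for (int id = 0; id < terms.size(); id++) {
      int pos = Arrays.hashCode(terms.get(id)) & newMask;
      while (newHash[pos] != -1) {
        pos = (pos + 1) & newMask;
      }
      newHash[pos] = id;
    }
    hash = newHash;
    hashSize = newSize;
    hashHalfSize = newSize / 2;
    hashMask = newMask;
  }
}
```

Factoring something like this out means the indexing chain only sees "bytes in, id out", which is the separation this issue is after.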

We'll probably also need to separate out the quick sort implementation in THPF, I'll add that to this issue.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917416#action_12917416 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

Let's commit this to trunk.  We need to merge all of trunk into the RT branch, or vice versa, at some point anyway.  This patch could be a part of that bulk merge-in, or we can simply do it now.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913589#action_12913589 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

The current hash implementation needs to be separated out of TermsHashPerField.  

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916965#action_12916965 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

I also ran a test w/ 5 threads -- they are close (22,402 docs/sec for patch, 22,868 docs/sec for trunk), and this time avgUsedMem is closer (811 MB for trunk, 965 MB for patch).

I don't think the avgUsedMem is that meaningful -- it takes the max of Runtime.totalMemory() - Runtime.freeMemory() (which includes garbage I think), after each completed task, and then averages across all tasks.  In my case I think it's averaging 1 measure per thread, so it's really sort of measuring how much garbage there happened to be at the time.
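
For reference, the kind of measurement being described can be sketched as below. This is an assumed illustration of the idea, not the benchmark's code: the class UsedHeapSketch and its methods are hypothetical. The point is that totalMemory() minus freeMemory() counts every object still on the heap, collected or not, so a handful of samples mostly reflects GC timing.

```java
/** Hypothetical helper mirroring the totalMemory() - freeMemory() measurement. */
final class UsedHeapSketch {
  /** Heap currently occupied, including garbage that has not been collected yet. */
  static long usedBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  /** Plain average over the per-task samples. */
  static double average(long[] samples) {
    if (samples.length == 0) {
      return 0.0;
    }
    long sum = 0;
    for (long s : samples) {
      sum += s;
    }
    return (double) sum / samples.length;
  }

  public static void main(String[] args) {
    long[] samples = new long[3];
    for (int i = 0; i < samples.length; i++) {
      byte[] garbage = new byte[1 << 20]; // becomes garbage right away, yet still counts as used
      samples[i] = usedBytes();
    }
    System.out.println("avg used bytes: " + average(samples));
  }
}
```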

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915700#action_12915700 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

{quote}
bq. Maybe rename ords -> keys? And hash -> values? (The key isn't really an "ord" (I think?) because it increases by more than 1 each time... it's more like an address since it references an address in the byte-pool space).

yeah that depends how you see it - the array index really is the ord though. but I like those names. I will change.
{quote}

Duh, I agree -- the new names are confusing!!  Sorry.  I was
confused... you are right that what's now called "keys" are in fact
really ords!  They are always incr'd by one, on adding a new one.

How about renaming key back to ord?  And then maybe rename values to
bytesStart?  And in their decls add comments saying they are indexed
by hash code?  And maybe rename addByOffset -> addByBytesStart?


  * On the nocommit in ByteBlockPool -- I think that's fine?  It's an
    internal class....

  * The nocommit in BytesRefHash seems wrong?  (Ie, compact is used
    internally)... though maybe we make it private if it's not used
    externally?

  * On the "nocommit factor this out!" in THPF.java... I agree, the
    postingsArray.textStarts should go away right?  Ie, it's a
    [wasteful] copy of what the BytesRefHash is already storing?

  * Can we impl BytesRefHash.bytesUsed as an AtomicLong (hmm maybe
    AtomicInt -- none of these classes can address > 2GB)?  Then the
    pool would add in blockSize every time it binds a new block.  That
    method (DW.bytesUsed) is called *a lot* -- at least once on every
    addDoc.

  * I'm confused again -- when do we use RecyclingByteBlockAllocator
    from a single thread...?  Ie, why did the sync need to be
    conditional for this class, again....?  It seems like we always
    need it sync'd (both the main pool & per-doc pool need this)?  If
    so we can simplify and make these methods sync'd?
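
The bytesUsed suggestion in the list above could be sketched like this. It is a hedged illustration only: the class TrackingBlockAllocatorSketch and its methods are hypothetical and do not reflect the real allocator API. The allocator adds blockSize to a shared AtomicLong whenever it hands out a block, so reading the total is a single lock-free get() even when it is called on every added document.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical allocator that charges a shared counter for every block it hands out. */
final class TrackingBlockAllocatorSketch {
  private final int blockSize;
  private final AtomicLong bytesUsed; // shared with whoever needs the running total

  TrackingBlockAllocatorSketch(int blockSize, AtomicLong bytesUsed) {
    this.blockSize = blockSize;
    this.bytesUsed = bytesUsed;
  }

  /** Allocates a fresh block and accounts for it. */
  byte[] getByteBlock() {
    bytesUsed.addAndGet(blockSize);
    return new byte[blockSize];
  }

  /** Gives a block back and removes it from the running total. */
  void release(byte[] block) {
    bytesUsed.addAndGet(-block.length);
  }
}
```

Several pools can share one counter this way, which is what makes a single "bytes used" figure cheap to read.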
    


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916885#action_12916885 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

bq. Simon, thank you for renaming the 'utf8' variables here.
YW :)

bq. In RecyclingByteBlockAllocator.recycleByteBlocks, if we cannot recycle all of the blocks (ie because it exceeds maxBufferedBlocks), we are failing to null out the entries in the incoming array?
Ahh you are right - I will fix. 
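
A minimal sketch of the intended behaviour, under assumptions: the class RecyclingAllocatorSketch, its fields and its method signatures are hypothetical and simplified, not the patch's actual code. Every entry of the incoming range is nulled out, whether or not the block could be kept for reuse, so the caller's array never pins blocks that were dropped.

```java
/** Hypothetical recycling allocator; names and semantics are illustrative only. */
final class RecyclingAllocatorSketch {
  private final int blockSize;
  private final byte[][] freeBlocks; // capacity plays the role of maxBufferedBlocks
  private int freeCount = 0;

  RecyclingAllocatorSketch(int blockSize, int maxBufferedBlocks) {
    this.blockSize = blockSize;
    this.freeBlocks = new byte[maxBufferedBlocks][];
  }

  /** Reuses a buffered block if one is available, otherwise allocates a new one. */
  synchronized byte[] getByteBlock() {
    if (freeCount == 0) {
      return new byte[blockSize];
    }
    byte[] block = freeBlocks[--freeCount];
    freeBlocks[freeCount] = null;
    return block;
  }

  /** Takes back blocks[start, end); every entry in that range is nulled, recycled or not. */
  synchronized void recycleByteBlocks(byte[][] blocks, int start, int end) {
    for (int i = start; i < end; i++) {
      if (freeCount < freeBlocks.length) {
        freeBlocks[freeCount++] = blocks[i];
      }
      // Also clears blocks we could not keep, so the caller does not hold on to them.
      blocks[i] = null;
    }
  }

  synchronized int freeCount() {
    return freeCount;
  }
}
```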

bq. Also maybe rename pos -> freeCount? (pos is a little too generic?)
I mean, it's internal, but I see your point.

thanks for reviewing it closely. 

{quote}
The avgUsedMem was quite a bit higher (1.5GB vs 1.0GB), but, I'm not sure this stat is trustworthy.... I'll re-run w/ infoStream enabled to see if anything looks suspicious (eg, we are somehow not tracking bytes used correctly).
{quote}

hmm I will dig once I get back to my workstation.

simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913651#action_12913651 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

It'd be nice to get deletes working, ie, LUCENE-2655 and move forward in a way that's useful long term.  What changes have you made?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Mathias Walter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924484#action_12924484 ] 

Mathias Walter commented on LUCENE-2662:
----------------------------------------

Why is this issue still open, if the patch was already committed to trunk?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917372#action_12917372 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

> Simon, I'm going to get deletes working, tests passing using maps in the RT branch, then we can integrate. This'll probably be best.

Jason, I suggest you create a separate issue, something like "Integrate BytesRefHash in Realtime Branch", and I will take care of it. I think this issue had a clear target, to factor the hash table out of TermsHashPerField, and we should close it. Let's use a new one to track the integration.

Thoughts?

Simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913636#action_12913636 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

Jason, can you please hold off on this, since I already have newer / different versions of this class, with tests etc. I understand that you need that class, but creating all these issues and rushing ahead is rather counterproductive.

@Robert: this class is standalone in this patch and doesn't know about the analysis chain. But thanks for the comments; I will incorporate them.

simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913632#action_12913632 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

Ah, ok, I didn't write this code, I extracted it from LUCENE-2186. It's nice that you reviewed it, and it can be improved.  I'll make changes to it shortly, hopefully.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Assigned: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-2662:
---------------------------------------

    Assignee: Simon Willnauer

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917354#action_12917354 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

Simon, I'm going to get deletes working, tests passing using maps in the RT branch, then we can integrate.  This'll probably be best.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914621#action_12914621 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

Patch looks good Simon -- some ideas:



  * In the class jdocs, I think we should state that this is basically
    a Map<BytesRef,int>?

  * Maybe we also move ByteBlockPool --> oal.util?

  * Maybe move out the ByteBlockAllocator to its own class (in util)?
    RecyclingByteBlockAllocator?

  * Can we have DocumentsWriter share the ByteBlockAllocator?  (Right
    now it's dup'd code since DW also implements this).

  * Maybe rename ords -> keys?  And hash -> values?  (The key isn't
    really an "ord" (I think?) because it increases by more than 1
    each time... it's more like an address since it references an
    address in the byte-pool space).

  * We should advertise the limits in the jdocs -- limited to <= 2GB
    total byte storage, each key must be <= BLOCK SIZE-2 in length.

  * Can we have sortedEntries() not allocate a new iterator object?
    Ie, just return the sorted bytesStart int[]?  (This is what's done
    today, and, for term vectors on small docs, this method is pretty
    hot).  And the javadocs for this should be stronger -- it's not
    that the behaviour is undefined after, it's that you must .clear()
    after you're done consuming the sorted entries.
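
To make the "basically a Map<BytesRef,int>" description in the list above concrete, here is a minimal sketch of a hash that stores byte keys back to back in one shared buffer and hands out int ids. It is not the BytesRefHash from the patch: the class name TinyBytesHash, its methods, and the fields pool, starts and table are all made up for illustration. A single growable byte[] stands in for ByteBlockPool, and the 255-byte key limit is an arbitrary limit of this sketch, not the limit described above. The ids are dense ordinals, while starts[id] holds the address of the key inside the pool.

```java
import java.util.Arrays;

/** Hypothetical sketch of a byte[] -> int id hash; not Lucene's BytesRefHash. */
final class TinyBytesHash {
  private byte[] pool = new byte[64]; // every key, each prefixed by a 1-byte length
  private int poolUsed = 0;
  private int[] starts = new int[8];  // id -> address (offset) of the key in the pool
  private int[] table = new int[16];  // hash slot -> id, or -1 when the slot is empty
  private int count = 0;

  TinyBytesHash() {
    Arrays.fill(table, -1);
  }

  /** Returns the id of a new key, or -(id + 1) if the key was already present. */
  int add(byte[] key) {
    if (key.length > 255) { // the advertised limit of this sketch
      throw new IllegalArgumentException("key longer than 255 bytes");
    }
    int slot = findSlot(key);
    if (table[slot] != -1) {
      return -(table[slot] + 1);
    }
    int needed = poolUsed + 1 + key.length;
    if (needed > pool.length) {
      pool = Arrays.copyOf(pool, Math.max(pool.length * 2, needed));
    }
    if (count == starts.length) {
      starts = Arrays.copyOf(starts, count * 2);
    }
    starts[count] = poolUsed;
    pool[poolUsed++] = (byte) key.length;
    System.arraycopy(key, 0, pool, poolUsed, key.length);
    poolUsed += key.length;
    table[slot] = count++;
    if (count * 2 > table.length) { // keep the table at most half full
      rehash();
    }
    return count - 1;
  }

  /** Copies the key stored under the given id out of the pool. */
  byte[] get(int id) {
    int start = starts[id];
    return Arrays.copyOfRange(pool, start + 1, start + 1 + (pool[start] & 0xFF));
  }

  int size() {
    return count;
  }

  // Linear probing: stop at the key's slot or at the first empty slot.
  // Copying via get() keeps the sketch short; real code would compare in place.
  private int findSlot(byte[] key) {
    int mask = table.length - 1;
    int slot = hash(key, 0, key.length) & mask;
    while (table[slot] != -1 && !Arrays.equals(get(table[slot]), key)) {
      slot = (slot + 1) & mask;
    }
    return slot;
  }

  // Doubles the table and reinserts every id, hashing the bytes held in the pool.
  private void rehash() {
    table = new int[table.length * 2];
    Arrays.fill(table, -1);
    int mask = table.length - 1;
    for (int id = 0; id < count; id++) {
      int start = starts[id];
      int slot = hash(pool, start + 1, pool[start] & 0xFF) & mask;
      while (table[slot] != -1) {
        slot = (slot + 1) & mask;
      }
      table[slot] = id;
    }
  }

  private static int hash(byte[] b, int off, int len) {
    int h = 0;
    for (int i = 0; i < len; i++) {
      h = 31 * h + b[off + i];
    }
    return h;
  }
}
```

Adding the same bytes twice returns a negative value that encodes the existing id, so a caller can tell "new" from "seen before" without a second lookup. No Entry objects are allocated per key; everything lives in the three arrays.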


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917537#action_12917537 ] 

Michael McCandless commented on LUCENE-2662:
--------------------------------------------

This was already committed to trunk...

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914452#action_12914452 ] 

Robert Muir commented on LUCENE-2662:
-------------------------------------

> I guess that is the first step towards factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

I think it would be better if this class were used in the patch... I wouldn't commit it by itself unused. It's difficult for people to review its behavior, since it's just a standalone unused thing (for instance, the hashCode thing I brought up).


> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915079#action_12915079 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

Simon, the patch looks like it's ready for the next stage, ie, TermsHashPerField detachment.

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Updated: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2662:
------------------------------------

    Attachment: LUCENE-2662.patch

Attaching my current state for feedback and iteration.

* factored out ByteBlockAllocator from DocumentsWriter
* moved ByteBlockPool to o.a.l.util
* added RecyclingByteBlockAllocator which can be used with or without synchronization. IMO the DummyConcurrentLock will be optimized away, so this might be super low cost. Feedback on that would be more than welcome.
* addressed all the comments from mike - thanks again
* added more tests
* cut over constants from DocumentsWriter to ByteBlockPool
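
As a rough illustration of the recycling-allocator idea in the list above, the sketch below keeps returned blocks in a bounded free list and takes either a real lock or a no-op lock, depending on whether synchronization is wanted. It is not the code in the attached patch: the class name RecyclingAllocatorSketch, its constructor parameters, and the NO_LOCK constant are assumptions made for this example, with NO_LOCK playing the role described for DummyConcurrentLock.

```java
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a block allocator that reuses returned blocks. */
final class RecyclingAllocatorSketch {
  /** No-op lock for single-threaded use; every method returns immediately. */
  static final Lock NO_LOCK = new Lock() {
    @Override public void lock() {}
    @Override public void lockInterruptibly() {}
    @Override public boolean tryLock() { return true; }
    @Override public boolean tryLock(long time, TimeUnit unit) { return true; }
    @Override public void unlock() {}
    @Override public Condition newCondition() {
      throw new UnsupportedOperationException();
    }
  };

  private final int blockSize;
  private final int maxBuffered; // upper bound on blocks kept for reuse
  private final Lock lock;
  private final ArrayDeque<byte[]> free = new ArrayDeque<>();

  RecyclingAllocatorSketch(int blockSize, int maxBuffered, boolean threadSafe) {
    this.blockSize = blockSize;
    this.maxBuffered = maxBuffered;
    this.lock = threadSafe ? new ReentrantLock() : NO_LOCK;
  }

  /** Hands out a recycled block if one is buffered, otherwise allocates a new one. */
  byte[] getBlock() {
    lock.lock();
    try {
      byte[] block = free.pollFirst();
      return block != null ? block : new byte[blockSize];
    } finally {
      lock.unlock();
    }
  }

  /** Keeps the block for reuse unless the buffer is full; contents are not cleared. */
  void recycle(byte[] block) {
    lock.lock();
    try {
      if (block.length == blockSize && free.size() < maxBuffered) {
        free.addFirst(block);
      }
    } finally {
      lock.unlock();
    }
  }

  int buffered() {
    lock.lock();
    try {
      return free.size();
    } finally {
      lock.unlock();
    }
  }
}
```

Whether the JIT really removes the no-op lock calls is exactly the kind of thing that needs benchmarking; the sketch only shows the shape of the design, where the locking strategy is chosen once at construction time.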

TermsHashPerField is next.... feedback welcome.

simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914486#action_12914486 ] 

Simon Willnauer commented on LUCENE-2662:
-----------------------------------------

> We need to factor it out of THPF otherwise this patch isn't really useful for committing. Also, it'll get tested through the entirety of the unit tests, ie, it'll get put through the laundry.

Yeah, let's see this as the first baby step towards it. I will move ByteBlockPool to o.a.l.util and start cutting THPF over to it. We need to do benchmarking in any case just to make sure JIT doesn't play nasty tricks with us again.

simon

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186



[jira] Commented: (LUCENE-2662) BytesHash

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914521#action_12914521 ] 

Jason Rutherglen commented on LUCENE-2662:
------------------------------------------

> make sure JIT doesn't play nasty tricks with us again.

What would we do if this happens?

> BytesHash
> ---------
>
>                 Key: LUCENE-2662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch, 4.0
>            Reporter: Jason Rutherglen
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch, 4.0
>
>         Attachments: LUCENE-2662.patch, LUCENE-2662.patch
>
>
> This issue will have the BytesHash separated out from LUCENE-2186
