Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/04/07 18:18:33 UTC

[jira] Created: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Add FieldCache.getTermBytes, to load term data as byte[]
--------------------------------------------------------

                 Key: LUCENE-2380
                 URL: https://issues.apache.org/jira/browse/LUCENE-2380
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Michael McCandless
             Fix For: 3.1


With flex, a term is now an opaque byte[] (typically a UTF-8 encoded Unicode string, but not necessarily), so we need to push this up the search stack.

FieldCache now has getStrings and getStringIndex; we need corresponding methods to load terms as native byte[], since in general they may not be representable as String.  This should also be quite a bit more RAM efficient for US-ASCII content, since each character would then use 1 byte instead of 2.
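
To make the 1-byte-versus-2 point concrete, here is a minimal sketch that compares only the character payload of a term held as UTF-16 chars (the char[]-backed String layout this issue has in mind) with the same term held as UTF-8 bytes. The class and method names are made up for illustration, and object headers and array overhead are ignored.

```java
import java.nio.charset.StandardCharsets;

final class TermSizeDemo {
    /** Payload bytes when the term is stored as UTF-16 code units, 2 bytes each. */
    static int utf16Bytes(String term) {
        return term.length() * 2;
    }

    /** Payload bytes when the term is stored as a UTF-8 encoded byte[]. */
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String term = "lucene";
        System.out.println(utf16Bytes(term)); // prints 12
        System.out.println(utf8Bytes(term));  // prints 6
    }
}
```

For non-ASCII content the saving shrinks or disappears, since UTF-8 then needs 2 or more bytes per character.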

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854594#action_12854594 ] 

Uwe Schindler commented on LUCENE-2380:
---------------------------------------

The structure should look like String and StringIndex, but I am not sure if we need real BytesRefs. In my opinion, it should be an array of byte[], where each byte[] is allocated with the term size from the enum's BytesRef and the bytes are copied over. This is no problem, as the terms need to be replicated either way, because the BytesRef from the enum is reused. The only problem is that byte[] is missing the cool BytesRef methods like utf8ToString() that may be needed by consumers.

getStrings and getStringIndex should be deprecated. We cannot emulate them using BytesRef.utf8ToString, as the String[] arrays are raw and allow no wrapping. If FieldCache used accessor methods instead of raw arrays, we would not have that problem...
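
A rough sketch of the per-document copy described above, assuming nothing about the real API: Ref is a hypothetical stand-in for BytesRef (a bytes/offset/length slice), and load() simulates an enum that overwrites one shared Ref on every step, which is why each term has to be copied into its own exactly sized byte[].

```java
import java.util.Arrays;

final class PerDocCopyDemo {
    /** Hypothetical stand-in for BytesRef: a slice of a byte[] that the producer reuses. */
    static final class Ref {
        byte[] bytes = new byte[16];
        int offset;
        int length;
    }

    /** Copies the slice described by ref into a freshly allocated, exactly sized byte[]. */
    static byte[] copyOf(Ref ref) {
        return Arrays.copyOfRange(ref.bytes, ref.offset, ref.offset + ref.length);
    }

    /** Builds one byte[] per document from terms delivered through a single reused Ref. */
    static byte[][] load(byte[][] termsInDocOrder) {
        Ref reused = new Ref();
        byte[][] perDoc = new byte[termsInDocOrder.length][];
        for (int doc = 0; doc < termsInDocOrder.length; doc++) {
            byte[] term = termsInDocOrder[doc];
            // Simulate the enum overwriting its shared buffer with the next term.
            if (reused.bytes.length < term.length) {
                reused.bytes = new byte[term.length];
            }
            System.arraycopy(term, 0, reused.bytes, 0, term.length);
            reused.offset = 0;
            reused.length = term.length;
            // Without this copy, every entry would alias the same reused buffer.
            perDoc[doc] = copyOf(reused);
        }
        return perDoc;
    }
}
```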



[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854853#action_12854853 ] 

Toke Eskildsen commented on LUCENE-2380:
----------------------------------------

Working on LUCENE-2369 I essentially had to re-implement the FieldCache because of the hardwiring of arrays. Switching to accessor methods seems like the right direction to go.



[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854621#action_12854621 ] 

Yonik Seeley commented on LUCENE-2380:
--------------------------------------

> We could also do shared byte[] blocks (private), with a public method to retrieve the BytesRef for a given doc?

Absolutely!  Now that we are in control, it would be a crime not to share the byte[].
Seems like one should pass in a BytesRef to be filled in... that would be most efficient for people doing simple stuff like compare docid1 to docid2.  Returning a reused BytesRef could also work (as TermsEnum does) but it's less efficient for anything needing a state of more than 1 BytesRef since it then requires copying.

We can further save space by putting the length as a vInt in the byte[] - most would be a single byte.
Then we just need an int[] as an index into the byte[]... or potentially packed ints.
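
One way the layout above might look, as a sketch only: all terms live in one private byte[], each prefixed by its length as a vInt (7 bits per byte, high bit meaning "more follows"), with an int[] of start positions per doc. Ref is a hypothetical stand-in for BytesRef, and the caller passes one in to be filled, so no copy and no allocation happens on lookup. The class and field names are invented for illustration.

```java
import java.io.ByteArrayOutputStream;

final class SharedBlockTerms {
    /** Hypothetical stand-in for BytesRef. */
    static final class Ref {
        byte[] bytes;
        int offset;
        int length;
    }

    private final byte[] block;     // all terms back to back, each prefixed by its vInt length
    private final int[] startOfDoc; // startOfDoc[doc] = position of that doc's length prefix

    SharedBlockTerms(byte[][] termsInDocOrder) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        startOfDoc = new int[termsInDocOrder.length];
        for (int doc = 0; doc < termsInDocOrder.length; doc++) {
            byte[] term = termsInDocOrder[doc];
            startOfDoc[doc] = out.size();
            // Write the length as a vInt: low-order 7 bits first, high bit set while more remain.
            int len = term.length;
            while ((len & ~0x7F) != 0) {
                out.write((len & 0x7F) | 0x80);
                len >>>= 7;
            }
            out.write(len);
            out.write(term, 0, term.length);
        }
        block = out.toByteArray();
    }

    /** Points result at the term bytes for doc inside the shared block and returns it. */
    Ref get(int doc, Ref result) {
        int pos = startOfDoc[doc];
        int len = 0;
        int shift = 0;
        byte b;
        do {
            b = block[pos++];
            len |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        result.bytes = block;
        result.offset = pos;
        result.length = len;
        return result;
    }
}
```

Comparing docid1 to docid2 then takes two caller-owned Ref instances and two get() calls; terms shorter than 128 bytes cost a single length byte.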

We'll also need an implementation that can span multiple byte[]s for larger than 2GB support.  The correct byte[] to look into is then simply a function of the docid (as is done in Solr's UnInvertedField).
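
A sketch of the multi-block idea, with invented names and no claim about how UnInvertedField actually does it: docs are grouped into runs of 2^shift consecutive docids, each run gets its own byte[], and the block for a doc is simply blocks[doc >>> shift]. The shift is assumed to be chosen so that no single block outgrows the maximum array size; the sketch does not check this.

```java
final class MultiBlockTerms {
    private final int shift;       // log2 of the number of docs per block
    private final byte[][] blocks; // blocks[doc >>> shift] holds that doc's term bytes
    private final int[] start;     // start of each doc's term inside its block
    private final int[] length;    // length of each doc's term

    MultiBlockTerms(byte[][] termsInDocOrder, int shift) {
        this.shift = shift;
        int numDocs = termsInDocOrder.length;
        int docsPerBlock = 1 << shift;
        int numBlocks = (numDocs + docsPerBlock - 1) >>> shift;
        blocks = new byte[numBlocks][];
        start = new int[numDocs];
        length = new int[numDocs];
        for (int b = 0; b < numBlocks; b++) {
            int firstDoc = b << shift;
            int endDoc = Math.min(numDocs, firstDoc + docsPerBlock);
            // First pass: assign in-block offsets and measure the block.
            int size = 0;
            for (int doc = firstDoc; doc < endDoc; doc++) {
                start[doc] = size;
                length[doc] = termsInDocOrder[doc].length;
                size += length[doc];
            }
            // Second pass: copy the term bytes into the block.
            blocks[b] = new byte[size];
            for (int doc = firstDoc; doc < endDoc; doc++) {
                System.arraycopy(termsInDocOrder[doc], 0, blocks[b], start[doc], length[doc]);
            }
        }
    }

    /** The block to look into is purely a function of the docid. */
    byte[] blockFor(int doc) {
        return blocks[doc >>> shift];
    }

    int start(int doc) {
        return start[doc];
    }

    int length(int doc) {
        return length[doc];
    }
}
```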

We could possibly play games with the offsets into the byte[] too - encode as a delta against the average instead of an absolute offset.  So offset = average_size * ord + get_delta(ord)
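
The formula above could be sketched like this, with hypothetical names throughout: the constructor computes an integer average term size and stores, per ord, only the difference between the true offset and average_size * ord. The deltas are kept in a plain int[] here; the point of the trick is that they stay small, so a real implementation would presumably pack them into fewer bits (they can be negative, so that packing would need a bias or sign encoding).

```java
final class DeltaOffsets {
    private final long averageSize; // integer average term length, rounded down
    private final int[] delta;      // delta[ord] = true offset - averageSize * ord

    DeltaOffsets(int[] lengths) {
        long total = 0;
        for (int len : lengths) {
            total += len;
        }
        averageSize = lengths.length == 0 ? 0 : total / lengths.length;
        delta = new int[lengths.length];
        long offset = 0;
        for (int ord = 0; ord < lengths.length; ord++) {
            delta[ord] = (int) (offset - averageSize * ord);
            offset += lengths[ord];
        }
    }

    /** offset = average_size * ord + get_delta(ord) */
    long offset(int ord) {
        return averageSize * ord + delta[ord];
    }

    int delta(int ord) {
        return delta[ord];
    }
}
```

With term lengths 3, 5, 4, 4 the average is 4, the true offsets are 0, 3, 8, 12, and the stored deltas are 0, -1, 0, 0.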



[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854639#action_12854639 ] 

Uwe Schindler commented on LUCENE-2380:
---------------------------------------

This again goes in the direction of no longer having arrays in FieldCache, and instead having accessor methods that take a docid and give back the data (possibly as a reference). So getBytes(docid) returns a reused BytesRef with the offset and length of the requested term. For native types we should also move away from arrays and only provide accessor methods. Java is fast and will possibly inline the method call. So for native types we could also use a FloatBuffer or ByteBuffer or whatever from java.nio.
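
To illustrate the accessor idea for native types, a small sketch under the assumption of a hypothetical FloatValues interface (not an existing FieldCache type): consumers only call get(docid), so the backing store can be a plain float[] or a java.nio FloatBuffer without the consumer noticing.

```java
import java.nio.FloatBuffer;

final class AccessorDemo {
    /** Hypothetical accessor-style view of per-document float values. */
    interface FloatValues {
        float get(int docid);
    }

    /** Backed by a plain array, as FieldCache does today. */
    static FloatValues fromArray(float[] values) {
        return docid -> values[docid];
    }

    /** Backed by a java.nio buffer; uses the absolute get, so the buffer position is untouched. */
    static FloatValues fromBuffer(FloatBuffer buffer) {
        return docid -> buffer.get(docid);
    }

    /** Example consumer that never sees the underlying storage. */
    static float max(FloatValues values, int maxDoc) {
        float best = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < maxDoc; doc++) {
            best = Math.max(best, values.get(doc));
        }
        return best;
    }
}
```

The same shape would let deprecated array-based callers be wrapped rather than re-implemented, which raw String[] arrays do not allow.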



[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854615#action_12854615 ] 

Michael McCandless commented on LUCENE-2380:
--------------------------------------------

We could also do shared byte[] blocks (private), with a public method to retrieve the BytesRef for a given doc?  Standard codec's terms index does this -- we could share it I think.

A new byte[] per doc adds a lot of RAM overhead and GC load.  (Of course, so does the String solution we use today, so it'd at least be no worse...).
