Posted to dev@lucene.apache.org by "Simon Willnauer (JIRA)" <ji...@apache.org> on 2011/05/18 10:20:49 UTC

[jira] [Issue Comment Edited] (LUCENE-3108) Land DocValues on trunk

    [ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035234#comment-13035234 ] 

Simon Willnauer edited comment on LUCENE-3108 at 5/18/11 8:20 AM:
------------------------------------------------------------------

Mike, thanks for the review!

bq. Phew been a long time since I looked at this branch!

It's been changing :)

{quote} We have some stale jdocs that reference .setIntValue methods (they
are now .setInt){quote}
True, thanks. I will fix.

{quote} Hmm do we have byte ordering problems? Ie, if I write index on
machine with little-endian but then try to load values on
big-endian...? I think we're OK (we seem to always use
IndexOutput.writeInt, and we convert float-to-raw-int-bits using
java's APIs)?{quote}

We are ok here since we write big-endian (enforced by DataOutput) and read it back in as plain bytes. The created ByteBuffer will always use BIG_ENDIAN as the default order. I added a comment for this.
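A minimal sketch of why the round trip is safe, using only the JDK. The class and the `writeIntBigEndian` helper are hypothetical stand-ins for the big-endian int write described above, not the branch's actual code:

```java
import java.nio.ByteBuffer;

class EndianSketch {
    // Hypothetical stand-in for a big-endian int write:
    // most significant byte first, independent of the machine's native order.
    static byte[] writeIntBigEndian(int v) {
        return new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
        };
    }

    // Float -> raw int bits -> four big-endian bytes.
    static byte[] encodeFloat(float f) {
        return writeIntBigEndian(Float.floatToRawIntBits(f));
    }

    // A freshly wrapped ByteBuffer always starts out in BIG_ENDIAN order,
    // so getFloat() reassembles the bytes in the same order they were written.
    static float decodeFloat(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return buffer.getFloat();
    }
}
```

Since neither side consults the native byte order, an index written on a little-endian machine should decode the same way on a big-endian one.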

{quote}How come codecID changed from String to int on the branch?{quote}
Because of DocValues I need to compare the ID against certain fields to see which fields I stored values for and need to open DocValues for. I always had to parse the given string, which is kind of odd. I think it's more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing IMO.
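To illustrate the "simpler / less risky" point, here is a small sketch. The class and method names are hypothetical and not taken from the branch:

```java
class CodecIdSketch {
    // int -> String cannot fail, so deriving a string form on demand is safe.
    static String toSuffix(int codecId) {
        return Integer.toString(codecId);
    }

    // String -> int can fail at runtime: Integer.parseInt throws
    // NumberFormatException for anything that is not a valid int.
    static int parse(String codecId) {
        return Integer.parseInt(codecId);
    }

    // With an int ID on both sides, matching a field to its codec is a plain comparison.
    static boolean sameCodec(int fieldCodecId, int openedCodecId) {
        return fieldCodecId == openedCodecId;
    }
}
```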

{quote} What are oal.util.Pair and ParallelArray for?{quote}
Legacy, I will remove them.

{quote} FloatsRef should state in the jdocs that it's really slicing a
double[]?{quote}

Yep, done!

{quote} Can SortField somehow detect whether the needed field was stored
in FC vs DV and pick the right comparator accordingly...? Kind of
like how NumericField can detect whether the ints are encoded as
"plain text" or as NF? We can open a new issue for this,
post-landing...{quote}

This is tricky though. You can have a DV field that is indexed too, so it's hard to tell if we can do it reliably. If we can't make it reliable I think we should not do it at all.


{quote}It looks like we can sort by int/long/float/double pulled from DV,
but not by terms? This is fine for landing... but I think we
should open a post-landing issue to also make FieldComparators for
the Terms cases?{quote}

Yeah, true. I didn't add a FieldComparator for bytes yet. I think this is post-landing!

{quote} Should we rename oal.index.values.Type -> .ValueType? Just
because... it looks so generic when its imported & used as "Type"
somewhere?{quote}

Agreed. I also think we should rename Source but I don't have a good name yet. Any ideas?

{quote} Since we dynamically reserve a value to mean "unset", does that
mean there are some datasets we cannot index? Or... do we tap
into the unused bit of a long, ie the sentinel value can be
negative? But if the data set spans Long.MIN_VALUE to
Long.MAX_VALUE, what do we do...?{quote}

Again, tricky! The quick answer is yes, but we can't do that anyway: I have to normalize the range to be 0-based because PackedInts doesn't allow negative values, so the range we can store is (2^63)-1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this, which I think is needed. To get around the limit we would need a different impl, I think, or am I missing something?
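A rough sketch of the arithmetic behind that limit. It assumes the "unset" sentinel is placed one above the largest normalized value; that placement, and every name below, is an assumption for illustration rather than the branch's code:

```java
class PackedRangeSketch {
    // True if every value in [min, max], shifted to be 0-based, plus one
    // reserved "unset" slot, still fits into the non-negative longs.
    static boolean fits(long min, long max) {
        long range;
        try {
            range = Math.subtractExact(max, min); // overflows when the span exceeds 2^63 - 1
        } catch (ArithmeticException e) {
            return false;
        }
        // One extra slot above the range is needed for the sentinel.
        return range >= 0 && range < Long.MAX_VALUE;
    }

    // Shift a value so the smallest one maps to 0 (never negative when min <= value).
    static long normalize(long value, long min) {
        return value - min;
    }

    // Assumed sentinel position: one above the largest normalized value.
    static long unsetSentinel(long min, long max) {
        return max - min + 1;
    }
}
```

Under these assumptions a data set starting at 0 can go up to Long.MAX_VALUE-1, and a data set spanning Long.MIN_VALUE to Long.MAX_VALUE is rejected. A check like `fits` is the kind of assert that is currently missing.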

I will make the changes once SVN is writeable again.





  
> Land DocValues on trunk
> -----------------------
>
>                 Key: LUCENE-3108
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3108
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: core/index, core/search, core/store
>    Affects Versions: CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>
>
> It's time to move another feature from branch to trunk. I want to start this process now while a couple of issues still remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions), but I think those are not worth separate issues, so we can resolve them as we go. 
> The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk. 
> Here is a quick feature overview of what has been implemented:
>  * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size each in sorted, straight and deref variations)
>  * Integration into Flex-API, Codec provides a PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read) 
>  * Enabled by default in all codecs except PreFlex
>  * Follows other flex-API patterns, e.g. non-segment readers throw UOE, forcing MultiPerDocValues if on DirReader etc.
>  * Integration into IndexWriter, FieldInfos etc.
>  * Random-testing enabled via RandomIW - injecting random DocValues into documents
>  * Basic checks in CheckIndex (which runs after each test)
>  * FieldComparator for int and float variants (Sorting, currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually)
>  * Extended TestSort for DocValues
>  * RAM-Resident random access API plus on-disk DocValuesEnum (currently only sequential access) -> Source.java / DocValuesEnum.java
>  * Extensible Cache implementation for RAM-Resident DocValues (by-default loaded into RAM only once and freed once IR is closed) -> SourceCache.java
>  
> PS: Currently the RAM-resident API is named Source (Source.java), which seems too generic. I think we should rename it to RamDocValues or something like that, suggestions welcome!   
> Any comments, questions (rants :)) are very much appreciated.
