You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Simon Willnauer (JIRA)" <ji...@apache.org> on 2010/10/11 17:19:36 UTC
[jira] Issue Comment Edited: (LUCENE-2186) First cut at column-stride fields (index values storage)

    [ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919864#action_12919864 ] 

Simon Willnauer edited comment on LUCENE-2186 at 10/11/10 11:18 AM:
--------------------------------------------------------------------

{quote}
There are still many nocommits but most look like they could become TODOs?
Do you have a high level sense of what's missing before we can commit to trunk?
{quote}

Yes and No :), here is my roadmap for this issue. We have 47 nocommit pending where about the half of it can be TODOs while the other half of it are rather easy task and should be fixed before we go to trunk.
These are the major steps I would like to finish until we land this on trunk

* Implement bulk copies for merging where possible. Currently there are still some value types not bulk copied and the ones which are only do if there are no deletes. Yet, the deletes thing I would make a TODO for now - we can still make that more efficient once we are on trunk. If I recall correctly figuring out the next deleted document is still a linear problem (I need to iterate through deletes), right? I guess that would be easier if I could figure out the next one so see if bulks are reasonable - maybe an invalid concern though.

* Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR.
* Rethink the Source API - I get the feeling that we don't really need the Source class but could rather use a Random Access Enum like Terms where we can see back and forth depending on how we loaded the fields values. We could actually unify the iterator API and random access which would catch two birds with one stone. internally we simply use the *Refs to set the actual values, default values would no be needed anymore (would save some code / branches internally) and the user would not have to access two different APIs. Additionally we could expose bulk reads just like BulkReadResult in DocsEnum to obtain all values in an array. Maybe if we wanna populate FieldsCache from it. I think we won't have perf. losts due to that since there is not really an overhead compared to the get() call on Source. - Reminds me I need to think about how we use that with sorted values.... If we keep Source we should at least make it implement ValuesEnum so we can use it as enumeration if they are in mem already.

* To do merging for byte values correctly we need to figure out how to specify the comparator for each field. I don't have a concrete idea for this but I think this should somehow go into IndexWriterConfig in a per field map. Thougths?


Remaining nocommits could be converted into TODOs - I think we can do so with the following

* Evaluate if we can decide if a Bytes Payload should be stored as straight or as fixed which would make it easier for the user to use the byte variants.
* Evaluate if we need String variants or if they can simple be solved with the byte ones
* We should have some king of compatibility notion so that slightly different segments can be merged like fixed vs. var bytes float32 vs. float64.
* For a cleaner transition we should create a sep. SortField that always uses index values.
* explore a better way to obtain all dat / idx fiels in SegmentInfo to do segment merges for index values.
* BytesValueProcessor should be thread private but I will leave that as a todo since this code might change anyway once realtime lands on trunk though. Not super urgent for now.
* Fix some exception handling issues especially in MultiSource & MultiValuesEnum
* Fix the singed / unsigned limitations in Ints implementation
* Explore ways to preven Ints impl do two method calls maybe we can expose PackedInts directly somehow

bq. How about docvalues? You don't need the _branch part since it'll be at http://svn.apache.org.../branches/docvalues.
OK

      was (Author: simonw):
    {quote}
There are still many nocommits but most look like they could become TODOs?
Do you have a high level sense of what's missing before we can commit to trunk?
{quote}

Yes and No :), here is my roadmap for this issue. We have 47 nocommit pending where about the half of it can be TODOs while the other half of it are rather easy task and should be fixed before we go to trunk.
These are the major steps I would like to finish until we land this on trunk

* Implement bulk copies for merging where possible. Currently there are still some value types not bulk copied and the ones which are only do if there are no deletes. Yet, the deletes thing I would make a TODO for now - we can still make that more efficient once we are on trunk. If I recall correctly figuring out the next deleted document is still a linear problem (I need to iterate through deletes), right? I guess that would be easier if I could figure out the next one so see if bulks are reasonable - maybe an invalid concern though.

* Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR.
* Rethink the Source API - I get the feeling that we don't really need the Source class but could rather use a Random Access Enum like Terms where we can see back and forth depending on how we loaded the fields values. We could actually unify the iterator API and random access which would catch two birds with one stone. internally we simply use the *Refs to set the actual values, default values would no be needed anymore (would save some code / branches internally) and the user would not have to access two different APIs. Additionally we could expose bulk reads just like BulkReadResult in DocsEnum to obtain all values in an array. Maybe if we wanna populate FieldsCache from it. I think we won't have perf. losts due to that since there is not really an overhead compared to the get() call on Source. - Reminds me I need to think about how we use that with sorted values.... If we keep Source we should at least make it implement ValuesEnum so we can use it as enumeration if they are in mem already.

* To do merging for byte values correctly we need to figure out how to specify the comparator for each field. I don't have a concrete idea for this but I think this should somehow go into IndexWriterConfig in a per field map. Thougths?


Remaining nocommits could be converted into TODOs - I think we can do so with the following

* Evaluate if we can decide if a Bytes Payload should be stored as straight or as fixed which would make it easier for the user to use the byte variants.
* Evaluate if we need String variants or if they can simple be solved with the byte ones
* We should have some king of compatibility notion so that slightly different segments can be merged like fixed vs. var bytes float32 vs. float64.
* For a cleaner transition we should create a sep. SortField that always uses index values.
* explore a better way to obtain all dat / idx fiels in SegmentInfo to do segment merges for index values.
*BytesValueProcessor should be thread private but I will leave that as a todo since this code might change anyway once realtime lands on trunk though. Not super urgent for now.
* Fix some exception handling issues especially in MultiSource & MultiValuesEnum
* Fix the singed / unsigned limitations in Ints implementation
* Explore ways to preven Ints impl do two method calls maybe we can expose PackedInts directly somehow

bq. How about docvalues? You don't need the _branch part since it'll be at http://svn.apache.org.../branches/docvalues.
OK
  
> First cut at column-stride fields (index values storage)
> --------------------------------------------------------
>
>                 Key: LUCENE-2186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2186
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>
>         Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py
>
>
> I created an initial basic impl for storing "index values" (ie
> column-stride value storage).  This is still a work in progress... but
> the approach looks compelling.  I'm posting my current status/patch
> here to get feedback/iterate, etc.
> The code is standalone now, and lives under new package
> oal.index.values (plus some util changes, refactorings) -- I have yet
> to integrate into Lucene so eg you can mark that a given Field's value
> should be stored into the index values, sorting will use these values
> instead of field cache, etc.
> It handles 3 types of values:
>   * Six variants of byte[] per doc, all combinations of fixed vs
>     variable length, and stored either "straight" (good for eg a
>     "title" field), "deref" (good when many docs share the same value,
>     but you won't do any sorting) or "sorted".
>   * Integers (variable bit precision used as necessary, ie this can
>     store byte/short/int/long, and all precisions in between)
>   * Floats (4 or 8 byte precision)
> String fields are stored as the UTF8 byte[].  This patch adds a
> BytesRef, which does the same thing as flex's TermRef (we should merge
> them).
> This patch also adds basic initial impl of PackedInts (LUCENE-1990);
> we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the
> field occurs in all/most docs.  It's just like field cache, except the
> reading API is a get() method invocation, per document.
> Next step is to do basic integration with Lucene, and then compare
> sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of
> this index values API should be much better than field caache, since
> it does not create object per document (instead shares big long[] and
> byte[] across all docs), and because the values are stored in RAM as
> their UTF8 bytes.
> There are abstract Writer/Reader classes.  The current reader impls
> are entirely RAM resident (like field cache), but the API is (I think)
> agnostic, ie, one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231.  Ie, it
> cannot yet update values, and the reading API is fully random-access
> by docID (like field cache), not like a posting list, though I
> do think we should add an iterator() api (to return flex's DocsEnum)
> -- eg I think this would be a good way to track avg doc/field length
> for BM25/lnu.ltc scoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org