Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2011/04/01 17:46:06 UTC

[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core

    [ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014703#comment-13014703 ] 

Yonik Seeley commented on LUCENE-3003:
--------------------------------------

> Attached: 32-bit results

Ah, bummer.  It's every 8 bytes, but with a 4 byte offset!
I guess we could make it depend on whether we detect a 32-bit vs. 64-bit JVM... but maybe first see if anyone has any ideas about how to use something like PagedBytes instead.
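
A rough sketch of why allocation sizes could step "every 8 bytes, but with a 4 byte offset". It assumes a HotSpot-style layout where a byte[] carries a 12-byte header on a 32-bit JVM and a 16-byte header on a 64-bit JVM, with objects padded to 8 bytes. Those header sizes are assumptions for illustration, not numbers taken from the attached results, and ArraySizeSketch and its methods are hypothetical names.

```java
// Illustrative only: header sizes are passed in as assumptions, not measured.
final class ArraySizeSketch {

    /** Rounds size up to the next multiple of alignment. */
    static long alignUp(long size, int alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    /** Estimated heap footprint of a byte[] of the given length, padded to 8 bytes. */
    static long byteArraySize(int length, int assumedHeaderBytes) {
        return alignUp((long) assumedHeaderBytes + length, 8);
    }

    public static void main(String[] args) {
        // Print the estimate for small arrays under both assumed header sizes.
        for (int len = 0; len <= 24; len++) {
            System.out.println(len
                + " -> 32-bit (12-byte header): " + byteArraySize(len, 12)
                + ", 64-bit (16-byte header): " + byteArraySize(len, 16));
        }
    }
}
```

Under these assumptions the 12-byte-header estimate grows after lengths 4, 12, 20, ... while the 16-byte-header estimate grows after lengths 0, 8, 16, ..., so a size table tuned for one layout wastes padding on the other.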

> Move UnInvertedField into Lucene core
> -------------------------------------
>
>                 Key: LUCENE-3003
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3003
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3003.patch, LUCENE-3003.patch, byte_size_32-bit-openjdk6.txt
>
>
> Solr's UnInvertedField lets you quickly look up all term ords for a
> given doc/field.
> Like FieldCache, it uninverts the index to produce this, and creates a
> RAM-resident data structure holding the bits; but, unlike FieldCache,
> it can handle multiple values per doc, and it does not hold the term
> bytes in RAM.  Rather, it holds only term ords, and then uses
> TermsEnum to resolve ord -> term.
> This is great, e.g., for faceting, where you want to use int ords for all
> of your counting, and then only at the end you need to resolve the
> "top N" ords to their text.
> I think this is useful core functionality, and we should move most
> of it into Lucene's core.  It's a good complement to FieldCache.  For
> this first baby step, I just move it into core and refactor Solr's
> usage of it.
> After this, as separate issues, I think there are some things we could
> explore/improve:
>   * The first pass, which allocates lots of tiny byte[]s, looks like it
>     could be inefficient.  Maybe we could use the byte slices from the
>     indexer for this...
>   * We can improve the RAM efficiency of the TermIndex: if the codec
>     supports ords, and we are operating on one segment, we should just
>     use it.  If not, we can use a more RAM-efficient data structure,
>     e.g. an FST mapping to the ord.
>   * We may be able to improve on the main byte[] representation by
>     using packed ints instead of delta-vInt?
>   * Eventually we should fold this ability into docvalues, i.e. we'd
>     write the byte[] image at indexing time, and then loading would be
>     fast, instead of uninverting.
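
The ord-based faceting described in the quoted issue (count with int ords, resolve only the "top N" to text at the end) might look roughly like the sketch below. It is not UnInvertedField's code and uses no Lucene API: docOrds stands in for the RAM-resident doc -> ords structure, termsByOrd stands in for the TermsEnum ord -> term lookup, and OrdFacetSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of faceting by term ord; all names here are hypothetical.
final class OrdFacetSketch {

    /**
     * Counts term ords over the matching docs, then resolves only the topN
     * ords to text. Results are "term=count", highest count first.
     */
    static List<String> topTerms(int[][] docOrds, int[] matchingDocs,
                                 String[] termsByOrd, int topN) {
        // All counting happens on ints; no term bytes are touched here.
        int[] counts = new int[termsByOrd.length];
        for (int doc : matchingDocs) {
            for (int ord : docOrds[doc]) {
                counts[ord]++;
            }
        }

        // Min-heap on count, so the weakest candidate is evicted first.
        // On equal counts the higher ord is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) ->
            counts[a] != counts[b]
                ? Integer.compare(counts[a], counts[b])
                : Integer.compare(b, a));
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] == 0) {
                continue;
            }
            heap.add(ord);
            if (heap.size() > topN) {
                heap.poll();
            }
        }

        // Only now resolve ord -> term, for at most topN ords.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int ord = heap.poll();
            result.add(0, termsByOrd[ord] + "=" + counts[ord]);
        }
        return result;
    }
}
```

The point of the design is that the ord -> term lookup, the expensive step when term bytes are not held in RAM, runs at most topN times regardless of how many docs or terms were counted.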
