You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2009/04/04 05:36:12 UTC

[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

    [ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695666#action_12695666 ] 

Michael Busch commented on LUCENE-1231:
---------------------------------------

For the search side we need an API similar to TermDocs and Payloads,
let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have
next(), skipTo(), doc(), etc. methods.
However, the way TermPositions#getPayloads() currently works
is that it always forces you to copy the bytes from the underlying
IndexInput into the payload byte[] array. Since we usually use a
BufferedIndexInput, this is then an arraycopy from
BufferedIndexInput's buffer array into the byte array.

I think to improve this we could allow users to call methods like
readVInt() directly on the CSFA. So I was thinking about adding
DataInput and DataOutput as superclasses of IndexInput and
IndexOutput. DataIn(Out)put would implement the different read and
write methods, whereas IndexIn(Out)put would only implement methods
like close(), seek(), getFilePointer(), length(), flush(), etc.

So then CSFA would extend DataInput or alternatively have a
getDataInput() method. The danger here compared to the current
payloads API would be that the user might read too few or too many
bytes of a CSF, which would result in an undefined and possibly hard
to debug behavior. But we could offer e.g.:

{code}
  static ColumnStrideFieldsAccessor getAccessor(ColumnStrideFieldsAccessor in, Mode mode) {
    if (mode == Mode.Fast) {
      return in;
    } else if (mode == Mode.Safe) {
      return new SafeAccessor(in);
    }
{code}

The SafeAccessor would count for you the number of read bytes and
throw exceptions if you don't consume the number of bytes you should
consume. This is of course overhead, but users could use the
SafeAccessor until they're confident that everything works fine in
their system, and then switch to the fast accessor for better
performance.

If there are no objections I will open a separate JIRA issue for the
DataInput/Output patch.

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is an sequential I/O operation.
> If you however want to load the data from one field for a large number of 
> documents, then stored fields perform quite badly, because lot's of I/O seeks 
> might have to be performed. 
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a column-
> stride field. The performance is significantly better compared to stored fields,
> however still not optimal. The reason is that for each document the freq value,
> which is in this particular case always 1, has to be decoded, also one position
> value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for column-
> stride data, once we decide for a final name for this feature we can change 
> this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure. 
> This could work like this: When the DocumentsWriter writes a segment it checks 
> whether all values of a field have the same length. If yes, it stores them as 
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger 
> merges two or more segments it checks if all segments have a FixedLengthList 
> with the same length for a column-stride field. If not, it writes a 
> VariableLengthList to the new segment. 
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow to store boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide for a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these 
> fields are supposed to be updateable.
> - Define datastructures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org