Posted to dev@lucene.apache.org by "Simon Willnauer (JIRA)" <ji...@apache.org> on 2010/12/13 13:42:00 UTC

[jira] Commented: (LUCENE-2810) Stored Fields Compression

    [ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970796#action_12970796 ] 

Simon Willnauer commented on LUCENE-2810:
-----------------------------------------

bq. For instance, perhaps it is possible to have a part of storage that contains the set of unique values for all the fields and the document field value simply contains a reference (could be as small as a few bits depending on the number of unique items) to that value instead of having a full copy.

Grant, how would that really differ from the DocValues DerefVarBytes variant? I don't think the semantics of the Stored Fields feature in Lucene should be extended much. If somebody wants to use BDB or something else to store the values, fine; otherwise they should really use DocValues and specialize for a certain use case. Stored Fields support in Codec is far away IMO, since we need to build quite a bit of API on top of the existing codec API to consume entire documents. That said, help is very welcome in the docValues branch.


simon

> Stored Fields Compression
> -------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for documents contain a lot of redundant information and end up wasting a lot of space across a large collection of documents.  For instance, simply compressing a typical log file often results in compression rates above 75%.  We should explore mechanisms for applying compression across all the documents for a field (or fields) while still maintaining relatively fast lookup (that said, in most logging applications, fast retrieval of a given event is not always critical).  For instance, perhaps it is possible to have a part of storage that contains the set of unique values for all the fields, so that the document field value simply contains a reference (could be as small as a few bits depending on the number of unique items) to that value instead of a full copy.  Extending this, perhaps we can leverage some existing compression capabilities in Java to provide this as well.
> It may make sense to implement this as a Directory, but it might also make sense as a Codec, if and when we have support for changing storage Codecs.
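
For illustration only, the "set of unique values plus a small per-document reference" idea from the description can be sketched in plain Java. This is not Lucene code and not the DocValues implementation: the class DedupFieldStore and all of its methods are hypothetical names invented for this sketch. It keeps references as boxed integers in memory and only reports how many bits a packed reference would need, rather than actually packing them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of dereferenced field storage; not a Lucene API. */
public class DedupFieldStore {
    // Maps each distinct value to its ordinal in valueByOrd.
    private final Map<String, Integer> ordByValue = new HashMap<>();
    // Each distinct value is stored exactly once.
    private final List<String> valueByOrd = new ArrayList<>();
    // One reference (ordinal) per document, indexed by document id.
    private final List<Integer> ordByDoc = new ArrayList<>();

    /** Stores the field value for the next document and returns its id. */
    public int add(String value) {
        Integer ord = ordByValue.get(value);
        if (ord == null) {
            // First time this value is seen: assign the next ordinal.
            ord = valueByOrd.size();
            ordByValue.put(value, ord);
            valueByOrd.add(value);
        }
        ordByDoc.add(ord);
        return ordByDoc.size() - 1;
    }

    /** Resolves a document's reference back to the shared value. */
    public String get(int docId) {
        return valueByOrd.get(ordByDoc.get(docId));
    }

    public int uniqueCount() {
        return valueByOrd.size();
    }

    /** Bits a bit-packed reference would need for the current unique count. */
    public int bitsPerReference() {
        int n = valueByOrd.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        DedupFieldStore store = new DedupFieldStore();
        String[] logLines = {
            "GET /index.html 200", "GET /index.html 200",
            "POST /login 302", "GET /index.html 200"
        };
        for (String line : logLines) {
            store.add(line);
        }
        System.out.println("documents:          " + logLines.length);
        System.out.println("unique values:      " + store.uniqueCount());
        System.out.println("bits per reference: " + store.bitsPerReference());
        System.out.println("doc 2 resolves to:  " + store.get(2));
    }
}
```

The saving depends entirely on how repetitive the field is: a field whose values are all distinct gains nothing and pays for the extra indirection on every lookup.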
