Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2010/12/13 17:03:01 UTC
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970864#action_12970864 ]
Grant Ingersoll commented on LUCENE-2810:
-----------------------------------------
bq. I think though we'd want to differentiate fields - not all of them should be compressed, because it means they'll need to be de-compressed, which might be expensive for some apps.
Yes.
bq. Yet, I think that seems much more like something for a codec and I think that support is needed desperately.
Agreed. And also agreed that this is something for contrib. I never, ever thought it was something to be forced on the primary implementation.
> Explore Alternate Stored Field approaches for highly redundant data
> -------------------------------------------------------------------
>
> Key: LUCENE-2810
> URL: https://issues.apache.org/jira/browse/LUCENE-2810
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for documents contain a lot of redundant information and end up wasting a lot of space across a large collection of documents. For instance, simply compressing a typical log file often results in > 75% compression rates. We should explore mechanisms for applying compression across all the documents for a field (or fields) while still maintaining relatively fast lookup (that said, in most logging applications, fast retrieval of a given event is not always critical). For instance, perhaps it is possible to have a part of storage that contains the set of unique values for all the fields, with the document field value simply containing a reference to that value (which could be as small as a few bits, depending on the number of unique items) instead of a full copy. Extending this, perhaps we can leverage some existing compression capabilities in Java to provide this as well.
> It may make sense to implement this as a Directory, but it might also make sense as a Codec, if and when we have support for changing storage Codecs.
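
For illustration only, the "set of unique values plus a small per-document reference" idea from the description could be sketched roughly as below. This is not Lucene code and not a proposed patch: the class DictionaryStoredField and all of its methods are hypothetical names made up for this sketch, it handles a single field entirely in memory, and it keeps the references in a plain list rather than actually bit-packing them. bitsPerReference() only reports how wide a packed reference would need to be.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch (not a Lucene API): stores each distinct field value
 * once and keeps only a small integer reference (ordinal) per document.
 */
final class DictionaryStoredField {
    // Unique values in order of first appearance; list index == ordinal.
    private final List<String> uniqueValues = new ArrayList<>();
    // Reverse lookup from value to ordinal, needed only while adding documents.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // One ordinal per document; the docId is the position in this list.
    // A real implementation would bit-pack these instead of boxing them.
    private final List<Integer> docToOrdinal = new ArrayList<>();

    /** Stores the field value for the next document and returns its docId. */
    int add(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            // First time this value is seen: append it to the dictionary.
            ord = uniqueValues.size();
            uniqueValues.add(value);
            ordinals.put(value, ord);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Resolves a document's reference back to the full stored value. */
    String get(int docId) {
        return uniqueValues.get(docToOrdinal.get(docId));
    }

    /** Number of distinct values currently held in the dictionary. */
    int uniqueCount() {
        return uniqueValues.size();
    }

    /** Bits each reference would need if the ordinals were bit-packed. */
    int bitsPerReference() {
        int n = uniqueValues.size();
        if (n <= 1) {
            return 1; // simplification: always reserve at least one bit
        }
        // ceil(log2(n)): enough bits to represent ordinals 0 .. n-1.
        return 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

With a field such as a log level, where thousands of documents share a handful of values, each document would then cost a couple of bits for the reference instead of a full copy of the string. The trade-off is the extra lookup through the dictionary on retrieval, which ties in with the point above that not every field should pay that cost.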