Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2015/09/03 23:03:46 UTC

[jira] [Comment Edited] (LUCENE-6779) Reduce memory allocated by CompressingStoredFieldsWriter to write large strings

    [ https://issues.apache.org/jira/browse/LUCENE-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729779#comment-14729779 ] 

Shalin Shekhar Mangar edited comment on LUCENE-6779 at 9/3/15 9:03 PM:
-----------------------------------------------------------------------

bq. Isn't this a mix of two things (buffering and coding)? I think it'd be nicer to have the DataOutput (or some decorator) take care of the buffering aspects and the routine could then focus on transcoding from UTF16 to UTF8.

Yes, but that actually performs better than writing bytes directly to the DataOutput. I tested this with JavaBinCodec and I don't think performance will be very different here (see the JMH benchmark results in SOLR-7971). Presumably, the huge number of writeByte invocations doesn't perform as well as setting bytes in a scratch array directly.
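To illustrate the point being argued, here is a minimal sketch (not Lucene's actual code; class and method names are hypothetical) of encoding a string's UTF-8 bytes into a reusable scratch array and flushing it with one bulk write, rather than calling writeByte once per encoded byte:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ScratchEncode {
    // Reusable scratch buffer; a real implementation would grow or bound it.
    static final byte[] scratch = new byte[1 << 16];

    // Encode s as UTF-8 into scratch, then hand it to the output in one call.
    static void writeString(String s, ByteArrayOutputStream out) {
        int upto = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                scratch[upto++] = (byte) cp;
            } else if (cp < 0x800) {
                scratch[upto++] = (byte) (0xC0 | (cp >> 6));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                scratch[upto++] = (byte) (0xE0 | (cp >> 12));
                scratch[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                scratch[upto++] = (byte) (0xF0 | (cp >> 18));
                scratch[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                scratch[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        // One bulk write instead of `upto` per-byte calls.
        out.write(scratch, 0, upto);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String s = "héllo \uD83D\uDE00"; // includes a 2-byte char and a surrogate pair
        writeString(s, out);
        assert Arrays.equals(out.toByteArray(), s.getBytes(StandardCharsets.UTF_8));
    }
}
```

The claim is that the single bulk write amortizes the per-call overhead that dominates when writeByte is invoked once per byte of a multi-megabyte string.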

bq. Also, most of the hardcoded constants/ checks for surrogate pairs, etc. do have counterparts in Character.* methods (and they should inline very well).

I didn't know about that. The constants here are the same as the ones in the existing UnicodeUtil.UTF16toUTF8 method.
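For concreteness, a small sketch of the suggestion above (the hand-rolled helpers are hypothetical stand-ins for the style of hardcoded constants in UnicodeUtil.UTF16toUTF8): the surrogate checks and pair combination have direct counterparts in java.lang.Character, which HotSpot inlines well.

```java
public class SurrogateCheck {
    // Hand-rolled versions using hardcoded constants, as in the existing code:
    static boolean isHighSurrogateManual(char ch) {
        return ch >= 0xD800 && ch <= 0xDBFF;
    }

    static int toCodePointManual(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD83D', low = '\uDE00'; // surrogate pair for U+1F600
        // The Character.* equivalents agree with the manual constants:
        assert isHighSurrogateManual(high) == Character.isHighSurrogate(high);
        assert Character.isLowSurrogate(low);
        assert toCodePointManual(high, low) == Character.toCodePoint(high, low);
        assert Character.toCodePoint(high, low) == 0x1F600;
    }
}
```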


> Reduce memory allocated by CompressingStoredFieldsWriter to write large strings
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-6779
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6779
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Shalin Shekhar Mangar
>         Attachments: LUCENE-6779.patch
>
>
> In SOLR-7927, I am trying to reduce the memory required to index very large documents (between 10 and 100 MB), and one of the places that allocates a lot of heap is the UTF-8 encoding in CompressingStoredFieldsWriter. The same problem existed in JavaBinCodec; in SOLR-7971 we reduced its memory allocation by falling back to a double-pass approach when the UTF-8 size of the string is greater than 64 KB.
> I propose to make the same changes to CompressingStoredFieldsWriter as we made to JavaBinCodec in SOLR-7971.
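The first pass of the double-pass idea referenced above can be sketched as follows (a hypothetical illustration of the technique, not the SOLR-7971 patch itself): compute the exact UTF-8 byte length without allocating, so the second pass can encode into a right-sized buffer instead of reserving the worst case of 3 bytes per UTF-16 char up front.

```java
import java.nio.charset.StandardCharsets;

public class DoublePass {
    // Pass 1: exact UTF-8 length of s, allocation-free.
    static int utf8Length(String s) {
        int len = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            // 1 byte below U+0080, 2 below U+0800, 3 below U+10000, else 4.
            len += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        return len;
    }

    public static void main(String[] args) {
        String s = "große Straße";
        int exact = utf8Length(s);
        // Matches the real encoded length, and is never above the 3x worst case.
        assert exact == s.getBytes(StandardCharsets.UTF_8).length;
        assert exact <= 3 * s.length();
    }
}
```

For mostly-ASCII documents the exact count is close to 1 byte per char, so the second pass needs roughly a third of the worst-case buffer, which is the memory saving the issue is after.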



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org